---
template:
slug: k8s-at-home
title: Setting up a multi-arch Kubernetes cluster at home
subtitle: My self-hosted infra, given the cloud native™ treatment
date: 2021-06-19
---
  8
**Update 2021-07-11**: It was fun while it lasted. I took down the
cluster today and probably won't go back to using it. It was way too
much maintenance, and Kubernetes really struggles with just 1GB of RAM
on a node. Constant outages, volumes getting corrupted (had to `fsck`),
etc. Not worth the headache.
 14
I still remember my
[Lobste.rs](https://lobste.rs/s/kqucr4/unironically_using_kubernetes_for_my#c_kfldyw)
comment, mocking some guy for running Kubernetes for his static blog --
it _is_ my highest voted comment after all. But to be fair, I'm not
running mine for a static blog. In fact, I'm not even hosting my blog on
the cluster; but I digress. Why did I do this anyway? Simply put: I was
bored. I had a 4 day weekend at work and, with nothing better to do
than play Valorant and risk losing my hard earned Bronze 2, I
decided to set up a K8s cluster. These are the nodes in use:
 24
- `fern`: Raspberry Pi 4B (armhf, 4GB, 4 cores)
- `jade`: Oracle VM (amd64, 1GB, 1 core)
- `leaf`: Oracle VM (amd64, 1GB, 1 core)
 28
The Oracle machines are the free tier ones. It's great -- two static
public IPs, 50 gigs of boot volume storage on each + up to 100 gigs of
block volume storage. All for free.[^1] Great for messing around.

[^1]: No, this is not an advertisement.
 34
Since my RPi is behind a CG-NAT, I'm running a Wireguard mesh that looks
something like this:

![wireguard mesh](https://cdn.icyphox.sh/1Xkvh.png)

Wireguard is fairly trivial to set up, and there are tons of guides
online, so I'll skip that bit.
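For reference, each peer's config is only a few lines. A minimal sketch,
assuming the `192.168.4.0/24` addressing from the mesh above -- keys and
the endpoint are placeholders:

```ini
# /etc/wireguard/wg0.conf -- illustrative; keys and endpoint are placeholders
[Interface]
Address = 192.168.4.2/24
PrivateKey = <this node's private key>
ListenPort = 51820

[Peer]
PublicKey = <peer's public key>
AllowedIPs = 192.168.4.3/32
Endpoint = <peer public IP>:51820
# keeps the NAT mapping alive for the CG-NAT'd Pi
PersistentKeepalive = 25
```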
 42
## setting up the cluster
 44
I went with plain containerd as the CRI. Built v1.5.7 from source on all
nodes.
 47
I considered running K3s, because it's supposedly "lightweight". Except
it's not really vanilla Kubernetes -- it's more of a distribution. It
ships with a bunch of things that I don't really want to use, like
Traefik as the default ingress controller, etc. I know components can be
disabled, but I couldn't be arsed. So, `kubeadm` it is.
 53
```
kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=192.168.4.2
```
 57
Since I'm going to be using Flannel as the CNI provider, I set the pod
network CIDR to Flannel's default. We also want the Kube API server to
listen on the Wireguard interface IP, so specify that as well.
 61
Now, the `kubelet` needs to be configured to use the Wireguard IP, along
with the correct `resolv.conf` on Ubuntu hosts (managed by
`systemd-resolved`)[^2]. This can be set via the `KUBELET_EXTRA_ARGS`
environment variable, in `/etc/default/kubelet`, for each node:
 66
```shell
# /etc/default/kubelet

KUBELET_EXTRA_ARGS=--node-ip=192.168.4.X --resolv-conf=/run/systemd/resolve/resolv.conf
```
 72
[^2]: I hate systemd with such passion.
 74
Nodes can now be `kubeadm join`ed to the control plane. Next, we set up
the CNI. I went with Flannel because it has multi-arch images, and is
pretty popular. However, we can't just apply Flannel's manifest -- it
must be configured to use the `wg0` interface. Edit `kube-flannel.yaml`:
 79
```patch
...
      containers:
      - args:
        - --ip-masq
        - --kube-subnet-mgr
+       - --iface=wg0
...
```
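With the interface flag in place, the patched manifest can be applied and
workers joined. The token and CA cert hash below are placeholders from the
`kubeadm init` output (or `kubeadm token create --print-join-command`):

```shell
# on the control plane: apply the patched Flannel manifest
kubectl apply -f kube-flannel.yaml

# on each worker node; <token> and <hash> are placeholders
kubeadm join 192.168.4.2:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>
```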
 89
If everything went well, your nodes should now show as `Ready`. If not,
well ... have fun figuring out why. Hint: it's almost always networking.

Make sure to un-taint your control plane so pods can be scheduled
on it:

```
kubectl taint nodes --all node-role.kubernetes.io/master-
```
 99
Finally, set the `--leader-elect` flag to `false` in your control
plane's
`/etc/kubernetes/manifests/kube-{controller-manager,scheduler}.yaml`.
Since these are not replicated, leader election is not required. Else,
they attempt a leader election, and for whatever reason -- fail.
Horribly.[^3]
106
[^3]: https://toot.icyphox.sh/notice/A8NOeVqMBsgu5DWLZ2
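The edit itself is a one-line change in each static pod manifest; roughly
this (the surrounding command list is abridged):

```patch
     - command:
       - kube-scheduler
-      - --leader-elect=true
+      - --leader-elect=false
```

The kubelet watches `/etc/kubernetes/manifests/`, so saving the file is
enough -- the static pod restarts on its own.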
108
## getting the infrastructure in place

The cluster is up, but we need to set up the core components -- ingress
controller, storage, load balancer, certificate provisioning, container
registry, etc.
114
### MetalLB

The `LoadBalancer` service type in Kubernetes will not work in a bare
metal environment -- it actually calls out to the respective cloud
provider's proprietary APIs to provision a load balancer.
[MetalLB](https://metallb.universe.tf/) solves this by, well, providing
an LB implementation that works on bare metal.
122
In essence, it makes one of your nodes attract all the traffic,
assigning each `LoadBalancer` service an IP from a configured address
pool (not your node IP). In my case:
126
![jade loadbalancer](https://cdn.icyphox.sh/zuy96.png)
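The address pool is configured via MetalLB's ConfigMap. A minimal layer 2
sketch, assuming a pool carved out of the Wireguard subnet -- the exact
range here is illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      # illustrative range inside the Wireguard subnet
      - 192.168.4.150-192.168.4.160
```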
128
However, this assumes that our load balancer node has a public IP. Well,
it does, but we're still within our Wireguard network. To actually
expose the load balancer, I'm running Nginx. This configuration allows
for non-terminating SSL passthrough back to our actual ingress (up
next), and forwarding any other arbitrary port.
134
```nginx
stream {
    upstream ingress443 {
        server 192.168.4.150:443;
    }

    upstream ingress80 {
        server 192.168.4.150:80;
    }

    server {
        listen 443;
        proxy_pass ingress443;
        proxy_next_upstream on;
    }
    server {
        listen 80;
        proxy_pass ingress80;
        proxy_next_upstream on;
    }
}
```
157
DNS can now be configured to point to this node's actual public IP, and
Nginx will forward traffic back to our load balancer.
160
### Nginx Ingress Controller

Once MetalLB is set up, `ingress-nginx` can be deployed. Nothing of note
here; follow their [docs](https://kubernetes.github.io/ingress-nginx/deploy/).
Each ingress you define will be exposed on the same `LoadBalancer` IP.
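For instance, an ingress for one of the workloads looks something like
this minimal sketch -- the host, service name, and port are hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: yarr
spec:
  ingressClassName: nginx
  rules:
  - host: yarr.example.com   # placeholder hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: yarr       # assumes a Service of this name exists
            port:
              number: 7070   # illustrative port
```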
166
### Longhorn

Storage on bare metal is always a pain in the wrong place. Longhorn is
pretty refreshing, as it literally just works. Point it to your block
volumes, set up a `StorageClass`, and just like that -- automagic PV/PVC
provisioning. Adding block volumes can be done via the UI, accessed by
port-forwarding the service:

```
kubectl port-forward service/longhorn-frontend -n longhorn-system 8080:80
```
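The `StorageClass` itself is short. A sketch along these lines --
`driver.longhorn.io` is Longhorn's CSI provisioner, and the replica count
is a choice, not a requirement:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  # two replicas is a sensible floor for a three-node cluster
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
```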
178
There's just one catch -- at least, in my case. They don't have armhf
images, so all their resources need a node selector:

```yaml
nodeSelector:
  kubernetes.io/arch: amd64
```
185
Consequently, all pods using a PVC can only run on non-armhf nodes. This
is a bummer, but I plan to switch the RPi over to a 64-bit OS
eventually. This cluster only just got stable-ish -- I'm not about to
yank the control plane now.
190
### cert-manager

Automatic certificate provisioning. Nothing fancy here. Follow their
[docs](https://cert-manager.io/docs/installation/kubernetes/).
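Once installed, all it needs is an issuer. A minimal ACME `ClusterIssuer`
sketch -- the issuer name, secret name, and email are placeholders:

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: <your email>
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
    # solve HTTP-01 challenges through the nginx ingress set up above
    - http01:
        ingress:
          class: nginx
```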
195
## application workloads

We did _all_ of that, for these special snowflakes. I'm currently
running:
200
- [radicale](https://radicale.org): CalDAV/CardDAV server
- [registry](https://github.com/distribution/distribution): Container
  registry
- [yarr](https://github.com/nkanaev/yarr): RSS reader
- [fsrv](https://github.com/icyphox/fsrv): File host service
- [znc](https://znc.in): IRC bouncer
207
I'm in the process of moving [Pleroma](https://pleroma.social) and
[lms](https://github.com/epoupon/lms/) to the cluster. I'm still
figuring out cgit.
211
## closing notes

That was a lot! While it's fun, it certainly feels like a house of
cards, especially given that I'm running this on very low resource
machines. There's about 500 MB of RAM free on the Oracle boxes, and about
2.5 GB on the Pi.
218
All things said, it's not terribly hard to run a multi-arch cluster,
especially if you're running arm64 + amd64. Most common tools have
multi-arch images now. It's just somewhat annoying in my case -- pods
using a PVC can't run on my Pi.
223
Note that I glossed over a bunch of issues that I faced: broken cluster
DNS, broken pod networking, figuring out how to expose the load
balancer, etc. Countless hours (after the 4 days off) had to be spent
solving these. If I had a penny for every time I ran `kubeadm reset`,
I'd be Elon Musk.
229
Whether this cluster is sustainable remains to be seen. However, it
is quite nice to have your entire infrastructure configured in a single
place: https://github.com/icyphox/infra
233
```
~/code/infra
▲ k get nodes
NAME   STATUS   ROLES                  AGE     VERSION
fern   Ready    control-plane,master   7d11h   v1.21.1
jade   Ready    <none>                 7d11h   v1.21.1
leaf   Ready    <none>                 7d11h   v1.21.1
```