---
template:
url: k8s-at-home
title: Setting up a multi-arch Kubernetes cluster at home
subtitle: My self-hosted infra, given the cloud native™ treatment
date: 2021-06-19
---

I still remember my
[Lobste.rs](https://lobste.rs/s/kqucr4/unironically_using_kubernetes_for_my#c_kfldyw)
comment, mocking some guy for running Kubernetes for his static blog --
it _is_ my highest voted comment after all. But to be fair, I'm not
running mine for a static blog. In fact, I'm not even hosting my blog on
the cluster; but I digress. Why did I do this anyway? Simply put: I was
bored. I had a 4-day weekend at work with nothing better to do other
than play Valorant and risk losing my hard-earned Bronze 2 -- so I
decided to set up a K8s cluster. These are the nodes in use:

- `fern`: Raspberry Pi 4B (armhf, 4GB, 4 cores)
- `jade`: Oracle VM (amd64, 1GB, 1 core)
- `leaf`: Oracle VM (amd64, 1GB, 1 core)

The Oracle machines are the free tier ones. It's great -- two static
public IPs, 50 gigs of boot volume storage on each + up to 100 gigs of
block volume storage. All for free.[^1] Great for messing around.

[^1]: No, this is not an advertisement.

Since my RPi is behind a CG-NAT, I'm running a Wireguard mesh that looks
something like this:

![wireguard mesh](https://x.icyphox.sh/zgELS.png)

Wireguard is fairly trivial to set up, and there are tons of guides
online, so I'll skip that bit.
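
Still, for a rough idea, here's a sketch of what one node's
`/etc/wireguard/wg0.conf` could look like -- the keys, endpoints, and
peer address are placeholders, and the 192.168.4.x addressing matches
the IPs used later in this post:

```ini
# /etc/wireguard/wg0.conf on the control plane node (illustrative)
[Interface]
Address = 192.168.4.2/24
PrivateKey = <this node's private key>
ListenPort = 51820

# one [Peer] block per other node in the mesh
[Peer]
PublicKey = <peer's public key>
Endpoint = <peer's public IP>:51820
AllowedIPs = 192.168.4.1/32
# keeps the tunnel alive from behind the CG-NAT
PersistentKeepalive = 25
```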

## setting up the cluster

I went with plain containerd as the CRI. Built v1.5.7 from source on all
nodes.
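
The build itself is nothing special; roughly something like this on
each node (a sketch -- it assumes Go is installed, and that runc and
the CNI plugins are handled separately):

```shell
# build and install containerd v1.5.7 from source
git clone https://github.com/containerd/containerd
cd containerd
git checkout v1.5.7
make && sudo make install

# drop in the upstream systemd unit and start it
sudo curl -o /etc/systemd/system/containerd.service \
    https://raw.githubusercontent.com/containerd/containerd/main/containerd.service
sudo systemctl daemon-reload && sudo systemctl enable --now containerd
```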

I considered running K3s, because it's supposedly "lightweight". Except
it's not really vanilla Kubernetes -- it's more of a distribution. It
ships with a bunch of things that I don't really want to use, like
Traefik as the default ingress controller, etc. I know components can be
disabled, but I couldn't be arsed. So, `kubeadm` it is.

```
kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=192.168.4.2
```

Since I'm going to be using Flannel as the CNI provider, I set the pod
network CIDR to Flannel's default. We also want the Kube API server to
listen on the Wireguard interface IP, so specify that as well.

Now, the `kubelet` needs to be configured to use the Wireguard IP, along
with the correct `resolv.conf` on Ubuntu hosts (managed by
`systemd-resolved`)[^2]. This can be set via the `KUBELET_EXTRA_ARGS`
environment variable, in `/etc/default/kubelet`, for each node:

```shell
# /etc/default/kubelet

KUBELET_EXTRA_ARGS=--node-ip=192.168.4.X --resolv-conf=/run/systemd/resolve/resolv.conf
```

[^2]: I hate systemd with such passion.

Nodes can now be `kubeadm join`ed to the control plane.
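
In practice that's two steps: print the join command on the control
plane, then run its output on each worker (the token and hash below are
stand-ins for whatever `kubeadm` actually prints):

```shell
# on the control plane: prints a ready-to-run join command
kubeadm token create --print-join-command

# on each worker, run the printed command; it looks something like:
sudo kubeadm join 192.168.4.2:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>
```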

Next, we set up the CNI. I went with Flannel because it has multi-arch
images, and is pretty popular. However, we can't just apply Flannel's
manifest -- it must be configured to use the `wg0` interface. Edit
`kube-flannel.yaml`:

```patch
...
      containers:
      - args:
        - --ip-masq
        - --kube-subnet-mgr
+       - --iface=wg0
...
```
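
With that edit in place, the manifest is applied as usual; the namespace
and label below match the manifest version from around that time, so
adjust if yours differs:

```shell
kubectl apply -f kube-flannel.yaml
kubectl get pods -n kube-system -l app=flannel -o wide
```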

If everything went well, your nodes should now show as `Ready`. If not,
well ... have fun figuring out why. Hint: it's almost always networking.

Make sure to un-taint your control plane so pods can be scheduled
on it:

```
kubectl taint nodes --all node-role.kubernetes.io/master-
```

Finally, set the `--leader-elect` flag to `false` in your control
plane's
`/etc/kubernetes/manifests/kube-{controller-manager,scheduler}.yaml`.
Since these are not replicated, leader election is not required. Else,
they attempt a leader election, and for whatever reason -- fail.
Horribly.[^3]
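
Concretely, it's a one-line change in each of those static pod
manifests; a sketch for the scheduler (the controller-manager edit is
identical in spirit). The kubelet watches that directory, so the pods
restart on their own once the file changes:

```patch
...
    - command:
      - kube-scheduler
-     - --leader-elect=true
+     - --leader-elect=false
...
```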

[^3]: https://toot.icyphox.sh/notice/A8NOeVqMBsgu5DWLZ2

## getting the infrastructure in place

The cluster is up, but we need to set up the core components -- ingress
controller, storage, load balancer, provisioning certificates, container
registry, etc.

### MetalLB

The `LoadBalancer` service type in Kubernetes will not work in a bare
metal environment -- it actually calls out to the respective cloud
provider's proprietary APIs to provision a load balancer.
[MetalLB](https://metallb.universe.tf/) solves this by, well, providing
an LB implementation that works on bare metal.

In essence, it makes one of your nodes attract all the traffic,
assigning each `LoadBalancer` service an IP from a configured address
pool (not your node IP). In my case:

![jade loadbalancer](https://x.icyphox.sh/HriXv.png)
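
The pool itself is plain configuration -- at the time of writing,
MetalLB reads it from a ConfigMap in its namespace. A sketch with an
illustrative layer 2 pool (the exact range is a stand-in; pick unused
addresses in your Wireguard subnet):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - 192.168.4.150-192.168.4.160
```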

However, this assumes that our load balancer node has a public IP. Well,
it does, but we're still within our Wireguard network. To actually
expose the load balancer, I'm running Nginx. This configuration allows
for non-terminating SSL passthrough back to our actual ingress (up
next), and forwarding any other arbitrary port.

```nginx
stream {
    # the ingress controller's LoadBalancer IP, as assigned by MetalLB
    upstream ingress443 {
        server 192.168.4.150:443;
    }

    upstream ingress80 {
        server 192.168.4.150:80;
    }

    # non-terminating passthrough: TLS is terminated by the ingress, not here
    server {
        listen 443;
        proxy_pass ingress443;
        proxy_next_upstream on;
    }
    server {
        listen 80;
        proxy_pass ingress80;
        proxy_next_upstream on;
    }
}
```

DNS can now be configured to point to this node's actual public IP, and
Nginx will forward traffic back to our load balancer.

### Nginx Ingress Controller

Once MetalLB is set up, `ingress-nginx` can be deployed. Nothing of note
here; follow their [docs](https://kubernetes.github.io/ingress-nginx/deploy/).
Each ingress you define will be exposed on the same `LoadBalancer` IP.
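
For illustration, an Ingress for one of the workloads further down might
look something like this -- the host, service name, port, and issuer
annotation are all made up, and the TLS bits only work once cert-manager
(below) is in place:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: yarr
  annotations:
    # hypothetical ClusterIssuer name; see the cert-manager section
    cert-manager.io/cluster-issuer: letsencrypt
spec:
  ingressClassName: nginx
  rules:
  - host: yarr.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: yarr
            port:
              number: 7070
  tls:
  - hosts:
    - yarr.example.com
    secretName: yarr-tls
```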

### Longhorn

Storage on bare metal is always a pain in the wrong place. Longhorn is
pretty refreshing, as it literally just works. Point it to your block
volumes, set up a `StorageClass`, and just like that -- automagic PV(C)
provisioning. Adding block volumes can be done via the UI, accessed by
port-forwarding the service:

```
kubectl port-forward service/longhorn-frontend -n longhorn-system 8080:80
```
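
As for the `StorageClass` mentioned above, a minimal Longhorn one plus a
PVC that consumes it looks roughly like this (replica count and names
are illustrative):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: znc-data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
```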

There's just one catch -- at least, in my case. They don't have armhf
images, so all their resources need a node selector keeping them off the
Pi:

```yaml
nodeSelector:
  kubernetes.io/arch: amd64
```

Consequently, all pods using a PVC can only run on non-armhf nodes. This
is a bummer, but I plan to switch the RPi over to a 64-bit OS
eventually. This cluster only just got stable-ish -- I'm not about to
yank the control plane now.

### cert-manager

Automatic certificate provisioning. Nothing fancy here. Follow their
[docs](https://cert-manager.io/docs/installation/kubernetes/).
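
The only piece worth sketching is the issuer. A Let's Encrypt
`ClusterIssuer` solving HTTP-01 challenges through the nginx ingress
class looks roughly like this (name and email are placeholders):

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt
spec:
  acme:
    email: you@example.com
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      # secret that stores the ACME account key
      name: letsencrypt-account-key
    solvers:
    - http01:
        ingress:
          class: nginx
```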

## application workloads

We did _all_ of that, for these special snowflakes. I'm currently
running:

- [radicale](https://radicale.org): CalDAV/CardDAV server
- [registry](https://github.com/distribution/distribution): Container
  registry
- [yarr](https://github.com/nkanaev/yarr): RSS reader
- [fsrv](https://github.com/icyphox/fsrv): File host service
- [znc](https://znc.in): IRC bouncer

I'm in the process of moving [Pleroma](https://pleroma.social) and
[lms](https://github.com/epoupon/lms/) to the cluster. I'm still
figuring out cgit.

## closing notes

That was a lot! While it's fun, it certainly feels like a house of
cards, especially given that I'm running this on very low-resource
machines. There's about 500 MB of RAM free on the Oracle boxes, and about
2.5 GB on the Pi.

All things said, it's not terribly hard to run a multi-arch cluster,
especially if you're running arm64 + amd64. Most common tools have
multi-arch images now. It's just somewhat annoying in my case -- pods
using a PVC can't run on my Pi.

Note that I glossed over a bunch of issues that I faced: broken cluster
DNS, broken pod networking, figuring out how to expose the load
balancer, etc. Countless hours (after the 4 days off) had to be spent
solving these. If I had a penny for every time I ran `kubeadm reset`,
I'd be Elon Musk.

Whether this cluster is sustainable or not remains to be seen. However,
it is quite nice to have your entire infrastructure configured in a
single place: https://github.com/icyphox/infra

```
~/code/infra
▲ k get nodes
NAME   STATUS   ROLES                  AGE     VERSION
fern   Ready    control-plane,master   7d11h   v1.21.1
jade   Ready    <none>                 7d11h   v1.21.1
leaf   Ready    <none>                 7d11h   v1.21.1
```