<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>E13 Capital &amp; Technology Blog</title><link>https://e13.dev/blog/</link><description>Blog on everything technology, Kubernetes, software engineering and software operations.</description><image><url>https://e13.dev/images/e13.png</url><link>https://e13.dev/images/e13.png</link><title>E13 Capital &amp; Technology Blog</title></image><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Thu, 14 Nov 2024 00:00:00 +0000</lastBuildDate><atom:link href="https://e13.dev/blog/index.xml" rel="self" type="application/rss+xml"/><item><title>Software Is Still Eating the World</title><link>https://e13.dev/blog/software-still-eats-the-world/</link><pubDate>Thu, 14 Nov 2024 00:00:00 +0000</pubDate><guid>https://e13.dev/blog/software-still-eats-the-world/</guid><description>Why do software projects fail? Computer scientists and software practitioners alike have wrestled with this question ever since software has entered the corporate space. There are supposedly many factors leading to the &amp;ldquo;failure&amp;rdquo; of a software project (however failure is defined) but the one constant I&amp;rsquo;ve witnessed in many projects I was on is that Conway&amp;rsquo;s law is always true. No single corporate entity has ever been able to detach its software engineering or software operations teams from the rest of the organization in a productive way.</description><content:encoded><![CDATA[<p>Why do software projects fail? Computer scientists and software practitioners alike have wrestled with this question ever since software has entered the corporate space. 
There are supposedly many factors leading to the &ldquo;failure&rdquo; of a software project (however failure is defined) but the one constant I&rsquo;ve witnessed in many projects I was on is that Conway&rsquo;s law is always true. No single corporate entity has ever been able to detach its software engineering or software operations teams from the rest of the organization in a productive way.</p>
<p>Companies tend to apply the same processes and principles they already have in place to their software creation and maintenance endeavors. A company that is very good at producing cars will take their successful car-making processes and force them on their software engineering teams. The same goes for any other manufacturing, construction, utility, healthcare, education, trading or finance company. And it Does. Not. Work.</p>
<p>You cannot successfully build software the same way you build cars, maintain power grids or produce industrial-grade machines. All the metaphors out there like &ldquo;building software is like building a house&rdquo; are wrong! There is no single metaphor for the software development process. You can&rsquo;t plan it 3 years ahead. You can&rsquo;t draw up a blueprint and start building. You&rsquo;re not building a single thing. You&rsquo;re not even building something that will stay the same for the next year or two. Software development needs to be rapid and fast-paced, with short feedback cycles, to be successful. Bill Gates and Paul Allen <a href="https://en.wikipedia.org/wiki/Altair_BASIC#Origin_and_development" target="_blank" rel="noopener">built the first BASIC interpreter for the Altair within weeks</a>! Let that be your benchmark and not how you build your great cars or machines or space rockets.</p>
<p>That&rsquo;s why heavyweight processes were largely replaced by lightweight software engineering methodologies in the 1990s, when XP, Scrum, FDD and later the Agile Manifesto were introduced. That was more than 20 years ago! So why do many software projects still fail? The methodologies are not to blame. It&rsquo;s the organizational structures that they are pressured into. For an agile software development process to be applied successfully it needs to be implemented in a way that allows it to breathe, to blossom. Software engineers need to be given creative space. Building software is never like building a physical machine from a blueprint. There are no blueprints for software. Each piece of software is different from anything that has been built before.</p>
<p>That&rsquo;s what agile is all about. You can&rsquo;t lock developers into a corporate chamber and expect anything meaningful to come out of there. And even in cases where a company tries to detach their software development from their other business as much as possible, processes and culture will not stop leaking into it. Volkswagen&rsquo;s Cariad is the most recent example of how this doesn&rsquo;t work.</p>
<p>So if Conway&rsquo;s law does prove true in almost every case, what can companies do about that? They obviously can&rsquo;t (and shouldn&rsquo;t) change the processes they&rsquo;ve successfully applied to their other businesses. Should they stop producing, maintaining and operating software at all and leave it to specialized software practitioners? Should they stop trying to produce any software and always opt for buying existing software products?</p>
<p>People who know me also know I strongly believe that every big company out there needs to admit that software is and will continue to be a significant part of its business. Software has long passed its status as a mere utility for the &ldquo;actual&rdquo; business. It IS the business. Look at cars again: Those companies that have made software a central and integral part of their manufacturing process are successful (Tesla, Xiaomi, BYD and others). Those that have not, aren&rsquo;t (any traditional car maker). In space technology you can even see how the recognition of Conway&rsquo;s law can lead to the reverse effect, i.e. agile methodologies being applied to the manufacturing process by SpaceX, in effect making them the leading space technology company in the world.</p>
<p>Marc Andreessen coined the phrase <a href="https://a16z.com/why-software-is-eating-the-world/" target="_blank" rel="noopener">software is eating the world</a> in 2011. Now, 13 years later, many big corporations out there have still not grasped the extent to which this observation affects them. Stop treating software as a utility. Start making it the defining part of your business, adding value to your baseline products, be it cars, rockets, supermarkets or utilities. This is the foundation upon which companies will build a successful future for their businesses.</p>
]]></content:encoded></item><item><title>"My Pod doesn't see the clients' IP addresses": Kubernetes External Traffic Policy Caveats</title><link>https://e13.dev/blog/k8s-external-traffic-policy/</link><pubDate>Wed, 21 Aug 2024 00:00:00 +0000</pubDate><guid>https://e13.dev/blog/k8s-external-traffic-policy/</guid><description>Yesterday I had an interesting conversation with a friend who asked me how he could configure his Kubernetes deployment so that the application running in his Pods is able to see the IP addresses of the clients that issue TCP requests. Since his application was running behind a LoadBalancer service, I pointed him to the official Kubernetes docs on the topic that basically advise setting the Service&amp;rsquo;s .spec.externalTrafficPolicy to Local.</description><content:encoded><![CDATA[<p>Yesterday I had an interesting conversation with a friend who asked me how he could configure his Kubernetes deployment so that the application running in his Pods is able to see the IP addresses of the clients that issue TCP requests. Since his application was running behind a LoadBalancer service, I pointed him to the <a href="https://kubernetes.io/docs/tutorials/services/source-ip/#source-ip-for-services-with-type-loadbalancer" target="_blank" rel="noopener">official Kubernetes docs</a> on the topic that basically advise setting the Service&rsquo;s <code>.spec.externalTrafficPolicy</code> to <code>Local</code>. This will lead to requests being served from the node they arrived at and consequently preserves the clients&rsquo; IP addresses. I didn&rsquo;t forget to mention that this may lead to an imbalance in how traffic is routed to his Pods (as the documentation also mentions <a href="https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#caveats-and-limitations-when-preserving-source-ips" target="_blank" rel="noopener">on another page</a>). 
When he asked why that was the case I had to think for a second and ended up illustrating it to him with an example:</p>
<p>Imagine your cluster has 3 nodes and your application is deployed with 6 replicas, i.e. 6 Pods. The Pods are spread across the nodes in the following way:</p>
<ul>
<li>1 Pod on Node 1</li>
<li>2 Pods on Node 2</li>
<li>3 Pods on Node 3</li>
</ul>
<p>In the usual case, you want incoming traffic to be distributed equally among the Pods, 1/6th of all requests to each Pod. Now with an external load balancer fronting your Kubernetes Service of type LoadBalancer, that external load balancer usually knows nothing about Pods but only about nodes (that&rsquo;s true e.g. for an AKS Load Balancer and presumably for GKE, EKS and on-premise LBs, too). Your external load balancer will consequently balance traffic equally across your Kubernetes nodes, 1/3rd of all requests to each node. From there, a component called kube-proxy takes over and distributes the traffic to the matching Pods of the Service.</p>
<p>With the default external traffic policy of <code>Cluster</code>, kube-proxy will take into account all Pods of the whole cluster and distribute traffic equally among them, no matter where they run. This is illustrated in the following diagram:</p>
<figure >
    <img loading="lazy" src="/images/external-traffic-policy-cluster.svg"
         alt="A schematic diagram illustrating how traffic flows with cluster external traffic policy"/> 
</figure>

<p>1/6th of all requests goes to each Pod. Good, that&rsquo;s what we want. But now that we want to reveal the clients&rsquo; IP addresses to the application running inside of the Pods, we change the traffic policy to <code>Local</code> as explained in the documentation. This will lead to a significant change in how traffic flows within your cluster. With all traffic from the external load balancer still being balanced equally among all nodes (because what does the LB know about Kubernetes traffic policies, anyway?), kube-proxy will no longer forward it to Pods outside of the node that it&rsquo;s running on, leading to the following traffic flow:</p>
<figure >
    <img loading="lazy" src="/images/external-traffic-policy-local.svg"
         alt="A schematic diagram illustrating how traffic flows with local external traffic policy"/> 
</figure>

<p>As you can see, now Pod 1 has to handle 1/3rd of all traffic, Pods 2 and 3 still handle 1/6th each and Pods 4, 5 and 6 only handle 1/9th. So Pod 1 has to handle 3 times as much traffic as Pods 4, 5 and 6. This is a huge imbalance and may lead to your application behaving very differently depending on which node a request lands on.</p>
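<p>For reference, the policy switch itself is a one-line change on the Service. A minimal sketch (name, labels and ports are illustrative):</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Service
metadata:
  name: my-app                    # illustrative name
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
  # Local preserves the client source IP but restricts kube-proxy to
  # Pods on the receiving node, causing the imbalance described above.
  externalTrafficPolicy: Local
</code></pre>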
<h2 id="what-you-can-do-about-it">What You Can Do About It</h2>
<p>There are multiple things you can do to preserve client IP addresses while still balancing traffic equally:</p>
<ul>
<li>Use a <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/" target="_blank" rel="noopener">Pod Topology Spread Constraint</a> to run more or less the same amount of Pods on each node. This, of course, only makes sense if all the nodes have more or less the same resources in terms of CPU cores, RAM and network connectivity (depending on which one&rsquo;s important to your application).</li>
<li>Use an ingress controller: Usually, ingress controllers allow for a much more fine-grained load-balancing behaviour, e.g. ingress-nginx can be configured to use a different load-balancing algorithm <a href="https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#custom-nginx-load-balancing" target="_blank" rel="noopener">per Ingress resource</a>.</li>
<li>Depending on your Kubernetes cluster provider you may be able to use the Gateway API which <a href="https://gateway-api.sigs.k8s.io/reference/spec/#gateway.networking.k8s.io%2fv1.BackendRef" target="_blank" rel="noopener">provides limited ways to influence the weighting of backends</a>.</li>
</ul>
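<p>The first option could look roughly like this in the Deployment&rsquo;s Pod template; a sketch with illustrative names and labels:</p>
<pre><code class="language-yaml">apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                    # illustrative
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1                          # allow at most 1 Pod difference between nodes
          topologyKey: kubernetes.io/hostname # spread across individual nodes
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: my-app
          image: my-app:latest    # illustrative image
</code></pre>
<p>With 6 replicas on 3 nodes this yields 2 Pods per node, so the <code>Local</code> policy distributes traffic evenly again.</p>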
]]></content:encoded></item><item><title>Kubernetes is your Operations Development Kit</title><link>https://e13.dev/blog/kubernetes-sdk/</link><pubDate>Fri, 10 May 2024 00:00:00 +0000</pubDate><guid>https://e13.dev/blog/kubernetes-sdk/</guid><description>Ever since I first dipped my toes into the Kubernetes waters, there were people arguing against Kubernetes along the lines of &amp;ldquo;you can run your stuff on a single EC2 instance much cheaper and simpler&amp;rdquo;. Here I want to lay out why I believe this is a short-sighted and incomprehensive perspective. Hear me out! 😉
Most people think of Kubernetes as a mere container orchestrator, a component so deep down in your application&amp;rsquo;s operational stack that developers don&amp;rsquo;t have to think about it as the target platform for the applications they build.</description><content:encoded><![CDATA[<p>Ever since I first dipped my toes into the Kubernetes waters, there were people arguing against Kubernetes along the lines of &ldquo;you can run your stuff on a single EC2 instance much cheaper and simpler&rdquo;. Here I want to lay out why I believe this is a short-sighted and incomprehensive perspective. Hear me out! 😉</p>
<p>Most people think of Kubernetes as a mere container orchestrator, a component so deep down in your application&rsquo;s operational stack that developers don&rsquo;t have to think about it as the target platform for the applications they build. Just build a container from your code and run it. And yes, that&rsquo;s what Kubernetes does: it runs your containers. If Deployment/Pod is where you stop using Kubernetes&rsquo; machinery, then you are indeed likely better off running your containers on an EC2 instance. But I highly doubt this is where anyone stops even in a development or integration environment. I will give you a real-world example of a customer engagement where we were asked to create an offer for deploying and running a certain business application. The ask by the customer was explicitly to have the application running on virtual machines.</p>
<p>The application the customer needed to operate was a critical part of a multi-tenant messaging bus so it had to be highly available and be properly monitored for outages. The application itself needs access to a database and an AMQP broker. This is what we came up with:</p>
<ul>
<li>2 VMs for the application itself (one instance each) for high availability/failover scenarios</li>
<li>2 VMs for the database (primary + replica)</li>
<li>2 VMs for the AMQP broker (deployed as cluster for high availability)</li>
<li>1 VM for monitoring</li>
<li>1 VM for logging</li>
</ul>
<h2 id="step-1---the-path-towards-containers">Step 1 - The Path Towards Containers</h2>
<p>That&rsquo;s 8 VMs for running and monitoring a single business application. What&rsquo;s the potential to drive down the cost for this setup? Sure, start with consolidating applications onto one VM. For our architecture proposal, we decided to put the monitoring and logging components onto the same VM. We then put the DB and the AMQP broker on the same VM, too. What&rsquo;s the consequence of this consolidation? A 37.5% cost reduction (minus 3 VMs). Good. But honestly, we separated the VMs by application domain on purpose to begin with. One of the reasons is better isolation for security purposes, e.g. for the case where one of the instances gets compromised. You simply reduce the potential to move laterally across the infrastructure.</p>
<p>How do I properly isolate the applications to keep fulfilling this security goal when they&rsquo;re sharing the same VM? I insert an isolation layer. How do I do this in Linux? <strong>Using containers</strong>! That&rsquo;s step 1.</p>
<h2 id="step-2---kubernetes-to-the-rescue">Step 2 - Kubernetes to the Rescue</h2>
<p>Now I have several VMs running that in turn run several containers each. Great! But I still need to properly manage network traffic flowing between each instance of my landscape. With containers I would use the runtime&rsquo;s networking features to do this, probably create several container networks and allow traffic to flow from certain parts of the landscape towards other parts (e.g. let each business application instance open a TCP connection to the database). I&rsquo;ll probably have to do some host-side iptables/nftables tweaking, too.</p>
<p>Next challenge: Deploying all these containers. Since I need to run multiple containers across several VMs, a solution such as Docker Compose isn&rsquo;t feasible, anymore. I will have to start scripting my own deployment machinery. But even now that I have all the containers running, I still need to deploy additional services such as a load balancer/reverse proxy to balance traffic between the two business application instances. I need to build a way to automatically fail over as soon as one of the VMs goes down. I need to manage access to the VMs for different roles, i.e. create user accounts, deploy SSH keys etc. etc.</p>
<p>But there&rsquo;s more: Maybe the application needs access to some sort of secret store, e.g. Hashicorp Vault or cloud-native solutions such as AWS/GCP KMS or Azure Key Vault. That&rsquo;s another thing I need to manually set up or better build some kind of custom automation for.</p>
<p>At this point I assume you see where this is leading: Running an application in production rarely means spinning up a single VM and putting a JAR file onto it. There&rsquo;s a lot of auxiliary components at play. And this is where we will now take a step back and see what the operational challenges are that we need to solve in this specific scenario:</p>
<ul>
<li>Orchestrate multiple containers across multiple machines (deploy, auto-restart, update, undeploy)</li>
<li>Manage network traffic flowing between containers and between machines</li>
<li>Balance traffic between instances and reverse-proxy services and manage failover scenarios</li>
<li>Manage machine access for different roles and users</li>
<li>Manage secret access from container instances</li>
</ul>
<p>Experienced, senior ops people will have built or bought the proper tooling to do all of this for them over the many years they&rsquo;ve been working in the space. They will have every tool and every process at hand to solve all of the problems stated above. These are not new problems, after all.</p>
<p>But what if there were software out there that solved all of these challenges in a declarative, standardized way, so that every ops person in the world could easily understand any environment operated by that software to a certain degree? Software that provides a common API with standardized syntax and semantics? An operational development kit, if you will, flexible enough to adapt to the myriad of different operational environments out there.</p>
<p>Well, <strong>this operational development kit is Kubernetes</strong>! See for yourself:</p>
<table>
<thead>
<tr>
<th>Challenge</th>
<th>Kubernetes API</th>
</tr>
</thead>
<tbody>
<tr>
<td>Orchestrate containers</td>
<td><a href="https://kubernetes.io/docs/concepts/workloads/controllers/" target="_blank" rel="noopener">Deployments/StatefulSets/DaemonSets</a></td>
</tr>
<tr>
<td>Manage network traffic</td>
<td><a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/" target="_blank" rel="noopener">NetworkPolicies</a></td>
</tr>
<tr>
<td>Balance traffic/reverse-proxy/failover</td>
<td><a href="https://kubernetes.io/docs/concepts/services-networking/" target="_blank" rel="noopener">Service/LoadBalancer/Ingress</a></td>
</tr>
<tr>
<td>Manage machine access</td>
<td><a href="https://kubernetes.io/docs/concepts/security/controlling-access/" target="_blank" rel="noopener">RBAC (Role, ClusterRole, RoleBinding and ClusterRoleBinding)</a></td>
</tr>
<tr>
<td>Manage secret access</td>
<td>RBAC + <a href="https://kubernetes.io/docs/concepts/configuration/secret/" target="_blank" rel="noopener">Secrets</a> + <a href="https://secrets-store-csi-driver.sigs.k8s.io/" target="_blank" rel="noopener">Secrets Store CSI Driver</a></td>
</tr>
</tbody>
</table>
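<p>To give a taste of how declarative these building blocks are: the network traffic rule from the VM scenario above, letting only the business application connect to the database, boils down to a single NetworkPolicy. A sketch with illustrative labels and port:</p>
<pre><code class="language-yaml">apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-allow-app              # illustrative
spec:
  podSelector:
    matchLabels:
      app: database               # applies to the database Pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: business-app   # only these Pods may connect
      ports:
        - protocol: TCP
          port: 5432              # e.g. PostgreSQL
</code></pre>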
<p>Kubernetes provides, out of the box, all the building blocks for the operational challenges you&rsquo;re facing anyway. The overhead it brings in terms of operational complexity (it&rsquo;s not an easy task to keep a Kubernetes cluster up and running in production) is easily offset by the simplicity of managing workloads running on it. So easy, in fact, that many development teams within companies can be handed a <code>kubeconfig</code> file and manage their applications themselves. I know because I&rsquo;ve been on such a team in the past.</p>
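<p>Handing a team that kind of scoped access is itself just two RBAC resources. A rough sketch, with the namespace and group names being illustrative:</p>
<pre><code class="language-yaml">apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-team-edit             # illustrative
  namespace: team-a               # the team's namespace
rules:
  - apiGroups: ["", "apps"]
    resources: ["pods", "services", "configmaps", "deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-team-edit
  namespace: team-a
subjects:
  - kind: Group
    name: team-a-developers       # illustrative group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-team-edit
  apiGroup: rbac.authorization.k8s.io
</code></pre>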
<p>Kubernetes shifts work I&rsquo;d be doing myself in a classical VM-based environment to software operators running in the cluster so that I can focus on more important work. It makes my application landscape transparent and reproducible if I add GitOps to the mix and store all the infrastructure in a repository. That&rsquo;s the real power of Kubernetes and why I like it so much.</p>
]]></content:encoded></item><item><title>The commoditization of Kubernetes</title><link>https://e13.dev/blog/commoditization-of-k8s/</link><pubDate>Fri, 23 Jun 2023 00:00:00 +0000</pubDate><guid>https://e13.dev/blog/commoditization-of-k8s/</guid><description>There&amp;rsquo;s so many rants out there about Kubernetes and container environments in general and the most recent statements by Kelsey Hightower just fueled these so I want to share why I believe Kubernetes and the cloud-native way to run apps these days is a good thing.
Back in the days when I wanted to run a server application and expose it to the Internet I rented a root server (or used one I already had), copied all the artifacts onto it using scp and started the app/web server in the background, using screen or maybe a systemd service.</description><content:encoded><![CDATA[<p>There are so many rants out there about Kubernetes and container environments in general, and the <a href="https://github.com/readme/podcast/kelsey-hightower" target="_blank" rel="noopener">most recent statements</a> by Kelsey Hightower just fueled them, so I want to share why I believe Kubernetes and the cloud-native way to run apps these days is a good thing.</p>
<p>Back in the days when I wanted to run a server application and expose it to the Internet I rented a root server (or used one I already had), copied all the artifacts onto it using scp and started the app/web server in the background, using screen or maybe a systemd service. To upgrade the app, I updated the artifacts and restarted the app server. Simple workflow.</p>
<p>Nowadays my usual workflow is to make a container image out of the app and run it in Kubernetes. For a completely new environment I spin up a Kubernetes cluster, install an ingress controller, use GitOps (i.e. install Flux), encrypt Secrets with SOPS or connect to a Vault (installing external-secrets operator), install Prometheus and Grafana, set up Slack notifications for Grafana alerts and a couple of other things.</p>
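<p>In that GitOps setup, installing a component like the ingress controller is itself just another manifest in Git. A rough sketch using Flux&rsquo;s Helm resources (the exact API versions depend on your Flux release, and the intervals are illustrative):</p>
<pre><code class="language-yaml">apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: ingress-nginx
  namespace: flux-system
spec:
  interval: 1h
  url: https://kubernetes.github.io/ingress-nginx
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: ingress-nginx
  namespace: flux-system
spec:
  interval: 10m
  chart:
    spec:
      chart: ingress-nginx
      sourceRef:
        kind: HelmRepository
        name: ingress-nginx
</code></pre>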
<p>What does this change in workflows and technology tell us, I wonder? Are we all adding unnecessary overhead to our production environments? Is Kubernetes complete overkill? I don&rsquo;t think so and here&rsquo;s why:</p>
<p>In my opinion the new workflow represents two things: First, the mindset of what it takes to run an application reliably and securely has changed. People are much more aware of what it means to run an application in production. Users don&rsquo;t accept considerable downtimes; adversaries have become much more efficient and effective. Occasionally spinning up your regular Tomcat and exposing it to the Internet doesn&rsquo;t work, anymore. Second, the technology needed to spin up a production environment that deserves the name has become a commodity. Kubernetes plays a huge part in this commoditization. It doesn&rsquo;t take days or weeks to get a decent environment up and running, it takes minutes to a couple of hours now.</p>
<p>Part of that commoditization are the very well-defined APIs that Kubernetes ships with and that allow it to be extended through e.g. CRDs. Containers of course have also shifted deployment processes left and generally simplified things a lot.</p>
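<p>As a small taste of that extensibility: defining a whole new API type is one declarative resource. A minimal CustomResourceDefinition sketch (group, kind and fields are made up for illustration):</p>
<pre><code class="language-yaml">apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: backups.example.e13.dev   # must be &lt;plural&gt;.&lt;group&gt;
spec:
  group: example.e13.dev          # illustrative API group
  scope: Namespaced
  names:
    plural: backups
    singular: backup
    kind: Backup
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                schedule:
                  type: string    # e.g. a cron expression
</code></pre>
<p>Once applied, <code>kubectl get backups</code> works like for any built-in resource; a custom controller then reconciles these objects.</p>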
<p>So does Kubernetes add overhead? Of course it does. Is it unnecessary? No way! The commoditization of production deployments is a good thing. Software engineers with little background in system operations now have all the tools at hand to run their apps reliably and securely. There is a learning curve but it&rsquo;s nowhere near as steep as it was back in the days.</p>
<p>Kelsey is right. Kubernetes will likely go away in the future but not in the sense that some people seem to understand it: It will become even more commoditized, to the degree that most people don&rsquo;t have to think about it. The new mindset will stick, though, and that&rsquo;s good, for users and operators alike.</p>
]]></content:encoded></item><item><title>The Story of a GitHub Actions Workflow</title><link>https://e13.dev/blog/a-gh-actions-workflow-story/</link><pubDate>Sat, 19 Nov 2022 00:00:00 +0000</pubDate><guid>https://e13.dev/blog/a-gh-actions-workflow-story/</guid><description>Discuss this post
This is the story of a seemingly simple task of creating a GitHub Actions workflow that &amp;hellip; escalated quickly. I hope you people can learn from my mistakes and do better (or quicker).
You&amp;rsquo;ll find the tl;dr version here.
Over at Weaveworks we try to automate as many engineering processes as possible. That&amp;rsquo;s especially true for the tedious work of releasing a new version of one of the components we build.</description><content:encoded><![CDATA[<p><a href="https://hachyderm.io/@makkes/109377473189626346" target="_blank" rel="noopener">Discuss this post</a></p>
<p>This is the story of a seemingly simple task of creating a GitHub Actions workflow that &hellip; escalated quickly. I hope you people can learn from my mistakes and do better (or quicker).</p>
<p>You&rsquo;ll find the tl;dr version <a href="#the-lessons">here</a>.</p>
<p>Over at Weaveworks we try to automate as many engineering processes as possible. That&rsquo;s especially true for the tedious work of releasing a new version of one of the components we build. One of these components is a Kubernetes controller running as part of Weave GitOps Enterprise, the enterprise version of our <a href="https://github.com/weaveworks/weave-gitops/" target="_blank" rel="noopener">OSS Weave GitOps</a>. The controller is basically shipped in a container image and a Helm chart wrapping all the necessary manifests, Deployments, Services etc.</p>
<p>What we had already set up was a GitHub Actions workflow that would build and push a new container image whenever a Git tag was pushed to the repository, a nice, easy and pretty standard workflow. However, after that image was pushed we still had to manually update the chart version and the image version used within it. Building and publishing the chart itself was already properly automated.</p>
<p>So in between two tasks I was working on I wanted to spend an hour or two building a workflow that would bump the chart version and the app version within the chart whenever a new container image was pushed. It should then create a PR with those changes so we could still verify them. Sounds like very low-hanging fruit, right? That&rsquo;s what I thought, too.</p>
<h2 id="version-1">Version 1</h2>
<p>This is the initial version I came up with. First, the trigger:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">name</span>: <span style="color:#ae81ff">Update app in chart</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">on</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">registry_package</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">types</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#ae81ff">published</span>
</span></span></code></pre></div><p>Simple, right? GitHub Actions provides a nice <a href="https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#registry_package" target="_blank" rel="noopener">registry_package event</a> that triggers a workflow whenever something is pushed to the package registry. Spoiler alert: This didn&rsquo;t work without changes to other workflows. More on that later. Let&rsquo;s look at the single job within that workflow:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">jobs</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">update-chart</span>:
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">if</span>: <span style="color:#ae81ff">${{ github.event.registry_package.name == &#39;pipeline-controller&#39; }}</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">runs-on</span>: <span style="color:#ae81ff">ubuntu-latest</span>
</span></span><span style="display:flex;"><span>    <span style="color:#f92672">steps</span>:
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Checkout</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">actions/checkout@v3</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">bump app version</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">mikefarah/yq@v4.30.4</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">with</span>:
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">cmd</span>: <span style="color:#ae81ff">yq -i &#39;.appVersion = &#34;${{ github.event.registry_package.package_version.container_metadata.tag.name }}&#34;&#39; charts/pipeline-controller/Chart.yaml</span>
</span></span></code></pre></div><p>Easy; set the new app version from the image that triggered the workflow. As written, though, this may set the <code>appVersion</code> to an empty string. More on that later.</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">get chart version</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">id</span>: <span style="color:#ae81ff">get_chart_version</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">mikefarah/yq@v4.30.4</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">with</span>:
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">cmd</span>: <span style="color:#ae81ff">yq &#39;.version&#39; charts/pipeline-controller/Chart.yaml</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">increment chart version</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">run</span>: <span style="color:#ae81ff">echo ${{ steps.get_chart_version.outputs.result }} awk -F. -v OFS=. &#39;{print $1,++$2,0}&#39;</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">update chart version</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">mikefarah/yq@v4.30.4</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">with</span>:
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">cmd</span>: <span style="color:#ae81ff">yq -i &#39;.version = &#34;${{ steps.get_chart_version.outputs.result }}&#34;&#39; charts/pipeline-controller/Chart.yaml</span>
</span></span></code></pre></div><p>The three steps above were supposed to extract the existing chart version, increment the minor version, reset the patch version to &lsquo;0&rsquo; and store the new version in <code>Chart.yaml</code>. However, there are two bugs in there. Can you spot them?</p>
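<p>For illustration, here is what the increment logic is meant to compute, runnable locally (the version number is a made-up example):</p>
<pre tabindex="0"><code># awk splits the version on dots (-F.) and joins the printed
# fields with dots again (OFS=.). ++$2 bumps the minor version,
# the literal 0 resets the patch version.
current=1.2.3
new=$(echo "$current" | awk -F. -v OFS=. '{print $1,++$2,0}')
echo "$new" # prints 1.3.0
</code></pre>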
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Create Pull Request</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">id</span>: <span style="color:#ae81ff">cpr</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">peter-evans/create-pull-request@v4</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">with</span>:
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">commit-message</span>: |<span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">            Update app version in chart</span>            
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">committer</span>: <span style="color:#ae81ff">GitHub &lt;noreply@github.com&gt;</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">author</span>: <span style="color:#75715e">####### REDACTED ######</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">branch</span>: <span style="color:#ae81ff">update-chart</span>
</span></span><span style="display:flex;"><span>          <span style="color:#f92672">title</span>: <span style="color:#ae81ff">Update app version in chart</span>
</span></span><span style="display:flex;"><span>      - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Check output</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">run</span>: |<span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">          echo &#34;Pull Request Number - ${{ steps.cpr.outputs.pull-request-number }}&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">          echo &#34;Pull Request URL - ${{ steps.cpr.outputs.pull-request-url }}&#34;</span>          
</span></span></code></pre></div><p>Straightforward: create a PR from the changes so we can review and merge them. As it turned out, a PR created like that couldn&rsquo;t be merged with the repo settings we had in place.</p>
<p>Almost every single step in that workflow had bugs. But were we able to spot them before actually merging the new workflow into <code>main</code>? No, because I have yet to find a way to test a workflow without actually merging and running it. Please let me know if you know of any! So we went ahead and merged that workflow file, created a new Git tag and waited until a new image version was pushed for the workflow to be triggered.</p>
<h2 id="not-running-at-all">Not Running At All</h2>
<p>The first thing we observed was that the workflow wasn&rsquo;t triggered at all. We already knew that you couldn&rsquo;t just <a href="https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow" target="_blank" rel="noopener">trigger a workflow from another workflow</a> but what we didn&rsquo;t know was that this behaviour is carried forward even to transitive actions such as an image push. We changed the other workflow pushing the new image to the registry to use a personal access token and that fixed it. The workflow was now running.</p>
<p><strong>Lesson #1:</strong> When you want a workflow to be triggered by a new image version being pushed to GitHub&rsquo;s registry, make sure not to use the default workflow token for pushing that image. Otherwise workflows listening to the push event won&rsquo;t run.</p>
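<p>In our case that meant changing the login step of the image-pushing workflow to authenticate with a PAT instead of the default token. As a sketch (the action version and secret name here are illustrative, not necessarily the exact ones we used):</p>
<pre tabindex="0"><code>- name: Login to GHCR
  uses: docker/login-action@v2
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    # A personal access token; pushing with secrets.GITHUB_TOKEN
    # would suppress workflows triggered by the registry_package event.
    password: ${{ secrets.GHCR_TOKEN }}
</code></pre>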
<h2 id="version-2">Version 2</h2>
<p>The next thing we noticed was that the workflow was triggered three times. We had no clue why but decided to fix the other issues first. One of these was that the step incrementing the chart version wasn&rsquo;t working. This was down to a simple syntax error: we had forgotten a pipe character:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>-        <span style="color:#f92672">run</span>: <span style="color:#ae81ff">echo ${{ steps.get_chart_version.outputs.result }} awk -F. -v OFS=. &#39;{print $1,++$2,0}&#39;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+        run</span>: <span style="color:#ae81ff">echo ${{ steps.get_chart_version.outputs.result }} | awk -F. -v OFS=. &#39;{print $1,++$2,0}&#39;</span>
</span></span></code></pre></div><p>Easy! Next!</p>
<h2 id="version-3">Version 3</h2>
<p>Next we discovered that the new chart version set by the workflow was wrong. It didn&rsquo;t bump the version at all. It turned out that the step setting the new version referenced the wrong step:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>         <span style="color:#f92672">with</span>:
</span></span><span style="display:flex;"><span>           <span style="color:#f92672">cmd</span>: <span style="color:#ae81ff">yq &#39;.version&#39; charts/pipeline-controller/Chart.yaml</span>
</span></span><span style="display:flex;"><span>       - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">increment chart version</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+        id</span>: <span style="color:#ae81ff">inc_chart_version</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">run</span>: <span style="color:#ae81ff">echo ${{ steps.get_chart_version.outputs.result }} | awk -F. -v OFS=. &#39;{print $1,++$2,0}&#39;</span>
</span></span><span style="display:flex;"><span>       - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">update chart version</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">mikefarah/yq@v4.30.4</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">with</span>:
</span></span><span style="display:flex;"><span>-          <span style="color:#f92672">cmd</span>: <span style="color:#ae81ff">yq -i &#39;.version = &#34;${{ steps.get_chart_version.outputs.result }}&#34;&#39; charts/pipeline-controller/Chart.yaml</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+          cmd</span>: <span style="color:#ae81ff">yq -i &#39;.version = &#34;${{ steps.inc_chart_version.outputs.result }}&#34;&#39; charts/pipeline-controller/Chart.yaml</span>
</span></span><span style="display:flex;"><span>       - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Create Pull Request</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">id</span>: <span style="color:#ae81ff">cpr</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">peter-evans/create-pull-request@v4</span>
</span></span></code></pre></div><h2 id="version-4">Version 4</h2>
<p>Finally we wanted to find out why the workflow was triggered three times, so I added a debug step that would simply dump the complete event:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>     <span style="color:#f92672">if</span>: <span style="color:#ae81ff">${{ github.event.registry_package.name == &#39;pipeline-controller&#39; }}</span>
</span></span><span style="display:flex;"><span>     <span style="color:#f92672">runs-on</span>: <span style="color:#ae81ff">ubuntu-latest</span>
</span></span><span style="display:flex;"><span>     <span style="color:#f92672">steps</span>:
</span></span><span style="display:flex;"><span><span style="color:#f92672">+      - name</span>: <span style="color:#ae81ff">dump event</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+        run</span>: <span style="color:#ae81ff">echo ${{ toJson(github.event) }}</span>
</span></span><span style="display:flex;"><span>       - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Checkout</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">actions/checkout@v3</span>
</span></span><span style="display:flex;"><span>       - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">bump app version</span>
</span></span></code></pre></div><p>This didn&rsquo;t work because the <code>run</code> syntax wasn&rsquo;t correct, but it did dump the event nevertheless. The reason for the multiple triggering was actually quite simple: we pushed a multi-arch container image comprised of an AMD64 and an ARM64 manifest. A third manifest, the manifest list, then ties these two together. For each of the manifests pushed, a <code>registry_package</code> event is emitted.</p>
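<p>For context, a multi-arch image like this is typically built and pushed with something along these lines (the tag and registry path here are illustrative), and it is exactly this that produces the two platform manifests plus the manifest list:</p>
<pre tabindex="0"><code># Builds one manifest per platform plus a manifest list tying them together
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --tag ghcr.io/example/pipeline-controller:v0.1.0 \
  --push .
</code></pre>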
<p>So we went ahead and added another condition to the job run:</p>
<h2 id="version-5">Version 5</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span> <span style="color:#f92672">jobs</span>:
</span></span><span style="display:flex;"><span>   <span style="color:#f92672">update-chart</span>:
</span></span><span style="display:flex;"><span>-    <span style="color:#f92672">if</span>: <span style="color:#ae81ff">${{ github.event.registry_package.name == &#39;pipeline-controller&#39; }}</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+    if</span>: <span style="color:#ae81ff">${{ github.event.registry_package.name == &#39;pipeline-controller&#39; &amp;&amp; github.event.registry_package.package_version.container_metadata.tag.name != &#39;&#39; }}</span>
</span></span><span style="display:flex;"><span>     <span style="color:#f92672">runs-on</span>: <span style="color:#ae81ff">ubuntu-latest</span>
</span></span><span style="display:flex;"><span>     <span style="color:#f92672">steps</span>:
</span></span><span style="display:flex;"><span>       - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">dump event</span>
</span></span></code></pre></div><p>Now the <code>update-chart</code> job is only run for the event carrying the new image tag.</p>
<p><strong>Lesson #2:</strong> When using the <code>registry_package</code> event as a trigger, make sure to use proper conditions when reacting to multi-arch image pushes.</p>
<h2 id="version-6">Version 6</h2>
<p>Now the workflow was running only once (it still shows up three times but the other two runs are skipped), but the new chart version still wasn&rsquo;t set. It turned out I hadn&rsquo;t understood how to carry command output from one step to another. After reading up on this <a href="https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#setting-an-output-parameter" target="_blank" rel="noopener">in the docs</a> we fixed that:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>           <span style="color:#f92672">cmd</span>: <span style="color:#ae81ff">yq &#39;.version&#39; charts/pipeline-controller/Chart.yaml</span>
</span></span><span style="display:flex;"><span>       - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">increment chart version</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">id</span>: <span style="color:#ae81ff">inc_chart_version</span>
</span></span><span style="display:flex;"><span>-        <span style="color:#f92672">run</span>: <span style="color:#ae81ff">echo ${{ steps.get_chart_version.outputs.result }} | awk -F. -v OFS=. &#39;{print $1,++$2,0}&#39;</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+        run</span>: <span style="color:#ae81ff">echo NEW_CHART_VERSION=$(echo ${{ steps.get_chart_version.outputs.result }} | awk -F. -v OFS=. &#39;{print $1,++$2,0}&#39;) &gt;&gt; $GITHUB_OUTPUT</span>
</span></span><span style="display:flex;"><span>       - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">update chart version</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">mikefarah/yq@v4.30.4</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">with</span>:
</span></span><span style="display:flex;"><span>-          <span style="color:#f92672">cmd</span>: <span style="color:#ae81ff">yq -i &#39;.version = &#34;${{ steps.inc_chart_version.outputs.result }}&#34;&#39; charts/pipeline-controller/Chart.yaml</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+          cmd</span>: <span style="color:#ae81ff">yq -i &#39;.version = &#34;${{ steps.inc_chart_version.outputs.NEW_CHART_VERSION }}&#34;&#39; charts/pipeline-controller/Chart.yaml</span>
</span></span><span style="display:flex;"><span>       - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Create Pull Request</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">id</span>: <span style="color:#ae81ff">cpr</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">peter-evans/create-pull-request@v4</span>
</span></span></code></pre></div><p><strong>Lesson #3:</strong> Use <code>GITHUB_OUTPUT</code> for carrying command output from one step to another.</p>
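<p>Stripped of the chart-specific details, the general pattern looks like this (the step ids and the output name are made up for illustration):</p>
<pre tabindex="0"><code>steps:
  - name: produce a value
    id: producer
    # Appending key=value lines to the file behind $GITHUB_OUTPUT
    # registers them as outputs of this step.
    run: echo "greeting=hello" &gt;&gt; "$GITHUB_OUTPUT"
  - name: consume the value
    # Reference the output via the producing step's id
    run: echo "${{ steps.producer.outputs.greeting }}"
</code></pre>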
<h2 id="version-7">Version 7</h2>
<p>Now the commit from the PR looked good but no CI checks were run. One more time the constraint of &ldquo;a workflow can&rsquo;t trigger another workflow with the default GitHub token&rdquo; kicked in. Fixing this was easy:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>         <span style="color:#f92672">id</span>: <span style="color:#ae81ff">cpr</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">peter-evans/create-pull-request@v4</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">with</span>:
</span></span><span style="display:flex;"><span><span style="color:#f92672">+          token</span>: <span style="color:#ae81ff">${{ secrets.GHCR_TOKEN }}</span>
</span></span><span style="display:flex;"><span>           <span style="color:#f92672">commit-message</span>: |<span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">             Update app version in chart</span>             
</span></span><span style="display:flex;"><span>           <span style="color:#f92672">committer</span>: <span style="color:#ae81ff">GitHub &lt;noreply@github.com&gt;</span>
</span></span></code></pre></div><p><strong>Lesson #4:</strong> When creating a PR using the default workflow token, no CI checks are run. You need to use a personal access token instead.</p>
<h2 id="version-8">Version 8</h2>
<p>Woohoo, we got it! After creating what felt like a million Git tags to trigger the workflow over and over again and cluttering Git history with another million commits fixing the workflow, it was kicked off as expected, the PR looked fine and all CI checks were running.</p>
<p>But, oh no, GitHub didn&rsquo;t allow us to merge the PR because the commit wasn&rsquo;t signed. Duh! One more time:</p>
<h2 id="version-9">Version 9</h2>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>     <span style="color:#f92672">steps</span>:
</span></span><span style="display:flex;"><span>       - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Checkout</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">uses</span>: <span style="color:#ae81ff">actions/checkout@v3</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+      - name</span>: <span style="color:#ae81ff">Import GPG key for signing commits</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+        uses</span>: <span style="color:#ae81ff">crazy-max/ghaction-import-gpg@v3</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+        with</span>:
</span></span><span style="display:flex;"><span><span style="color:#f92672">+          gpg-private-key</span>: <span style="color:#ae81ff">${{ secrets.GPG_PRIVATE_KEY }}</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+          passphrase</span>: <span style="color:#ae81ff">${{ secrets.GPG_PASSPHRASE }}</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+          git-user-signingkey</span>: <span style="color:#66d9ef">true</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+          git-commit-gpgsign</span>: <span style="color:#66d9ef">true</span>
</span></span></code></pre></div><p>This additional step caused the commits created by the <code>create-pull-request</code> action to be signed and the PR to finally be in a mergeable state. Hooray!</p>
<h2 id="the-final-version">The Final Version</h2>
<p>The icing on the cake was a little change to make the PR more comprehensible and basically document what it does in the description:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>           <span style="color:#f92672">committer</span>: <span style="color:#ae81ff">GitHub &lt;noreply@github.com&gt;</span>
</span></span><span style="display:flex;"><span>           <span style="color:#f92672">author</span>:  <span style="color:#75715e">###### REDACTED ######</span>
</span></span><span style="display:flex;"><span>           <span style="color:#f92672">branch</span>: <span style="color:#ae81ff">update-chart</span>
</span></span><span style="display:flex;"><span>-          <span style="color:#f92672">title</span>: <span style="color:#ae81ff">Update app version in chart</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+          title</span>: <span style="color:#ae81ff">Update app version to ${{ github.event.registry_package.package_version.container_metadata.tag.name }} in chart</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">+          body</span>: <span style="color:#ae81ff">|</span>
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">+            This PR bumps the minor chart version by default. If it is more appropriate to bump the major or the patch versions, please amend the commit accordingly.</span>
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">+</span>
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">+            The workflow that this PR was created from is &#34;${{ github.workflow }}&#34;.</span>
</span></span><span style="display:flex;"><span>       - <span style="color:#f92672">name</span>: <span style="color:#ae81ff">Check output</span>
</span></span><span style="display:flex;"><span>         <span style="color:#f92672">run</span>: |<span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">           echo &#34;Pull Request Number - ${{ steps.cpr.outputs.pull-request-number }}&#34;</span>           
</span></span></code></pre></div><p>This was the story of a seemingly very simple workflow we thought wouldn&rsquo;t take more than one or two hours but that turned out to take around a full day.</p>
<h2 id="the-lessons">The Lessons</h2>
<p><strong>Lesson #1:</strong> When you want a workflow to be triggered by a new image version being pushed to GitHub&rsquo;s registry, make sure to not use the default workflow token for pushing the image. Otherwise workflows listening to the push event won&rsquo;t run. <a href="https://docs.github.com/en/actions/using-workflows/triggering-a-workflow#triggering-a-workflow-from-a-workflow" target="_blank" rel="noopener">Related documentation</a></p>
<p><strong>Lesson #2:</strong> When using the <code>registry_package</code> event as a trigger, make sure to use proper conditions when reacting to multi-arch image pushes. I created <a href="https://github.com/github/docs/pull/22092" target="_blank" rel="noopener">a PR for adding this info to the documentation</a> that will hopefully get merged soon.</p>
<p><strong>Lesson #3:</strong> Use <code>GITHUB_OUTPUT</code> for carrying command output from one step to another. <a href="https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#setting-an-output-parameter" target="_blank" rel="noopener">Related documentation</a></p>
]]></content:encoded></item><item><title>Hosting Mastodon identities at your own domain</title><link>https://e13.dev/blog/hosting-your-mastodon-identity/</link><pubDate>Wed, 16 Nov 2022 00:00:00 +0000</pubDate><guid>https://e13.dev/blog/hosting-your-mastodon-identity/</guid><description>Discuss this post
EDIT January 11, 2023: In previous versions of this article I advertised the try_files directive which made the solution vulnerable to path traversal attacks. Using the return directive and sending a 301 redirect fixed this. Thanks to Penple for making me aware of this vulnerability.
With Mastodon being all the rage right now and people massively moving over, new opportunities arise. One of these is that Mastodon allows you to take ownership of your identity using the WebFinger protocol.</description><content:encoded><![CDATA[<p><a href="https://hachyderm.io/@makkes/109353563088733849" target="_blank" rel="noopener">Discuss this post</a></p>
<p><strong>EDIT January 11, 2023:</strong> In previous versions of this article I advertised the <code>try_files</code> directive which made the solution vulnerable to path traversal attacks. Using the <code>return</code> directive and sending a 301 redirect fixed this. Thanks to <a href="https://penple.dev/" target="_blank" rel="noopener">Penple</a> for making me aware of this vulnerability.</p>
<p>With Mastodon being all the rage right now and people massively moving over, new opportunities arise. One of these is that Mastodon allows you to take ownership of your identity using the <a href="https://docs.joinmastodon.org/spec/webfinger/" target="_blank" rel="noopener">WebFinger protocol</a>. This way you can have an identity like <code>me@example.org</code> without actually having to host your own Mastodon server (or instance in Mastodon lingo).</p>
<p>Maarten Balliauw has already posted on <a href="https://blog.maartenballiauw.be/post/2022/11/05/mastodon-own-donain-without-hosting-server.html" target="_blank" rel="noopener">how to achieve this</a> but with a little caveat:</p>
<p><em>&ldquo;this approach works much like a catch-all e-mail address. @anything@yourdomain.com will match, unless you add a bit more scripting to only show a result for resources you want to be discoverable.&rdquo;</em></p>
<p>I went ahead and solved this by tweaking the nginx configuration of one of my servers slightly (the caveat here being that you need access to the web server&rsquo;s configuration):</p>
<pre tabindex="0"><code>server {
    listen 80;

    location = /.well-known/webfinger {
        absolute_redirect off;
        return 301 $uri/$arg_resource;
    }
</code></pre><p>A WebFinger request URL looks similar to this: <code>https://home.e13.dev/.well-known/webfinger?resource=acct:makkes@home.e13.dev</code>. Now whenever a request comes in at that URL, nginx sends an HTTP 301 redirect pointing to <code>/.well-known/webfinger/acct:makkes@home.e13.dev</code> which in turn returns the contents of the requested file (if it exists). So the only thing left to do is to create that file with the WebFinger details in it and store it at that location in nginx&rsquo;s web root.</p>
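<p>The file stored at that location then contains the WebFinger JSON for the identity. A minimal document might look like the following (the instance URLs are placeholders; the easiest way to get a correct document is to copy the response your Mastodon instance serves for your own account):</p>
<pre tabindex="0"><code>{
  "subject": "acct:makkes@home.e13.dev",
  "aliases": [
    "https://hachyderm.io/@makkes"
  ],
  "links": [
    {
      "rel": "self",
      "type": "application/activity+json",
      "href": "https://hachyderm.io/users/makkes"
    }
  ]
}
</code></pre>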
<p>This mitigates the &ldquo;catch-all&rdquo; limitation and only serves the identity or identities you want it to.</p>
]]></content:encoded></item><item><title>Taking it home — Kubernetes on bare-metal</title><link>https://e13.dev/blog/k8s-at-home/</link><pubDate>Wed, 09 Nov 2022 00:00:00 +0000</pubDate><guid>https://e13.dev/blog/k8s-at-home/</guid><description>To learn how Kubernetes works you should run your own Kubernetes cluster on bare-metal hardware.
Discuss this post
In the world that I live in Kubernetes is all the rage. This is the world of professional software development and deployment where medium- and large-sized companies are trying to reduce cost and complexity of their IT platforms while at the same time becoming faster at making changes to the software that they run as services to either their internal or external customers.</description><content:encoded><![CDATA[<blockquote>
<p><em>To learn how Kubernetes works you should run your own Kubernetes cluster on bare-metal hardware.</em></p>
</blockquote>
<p><a href="https://hachyderm.io/@makkes/109315054984564587" target="_blank" rel="noopener">Discuss this post</a></p>
<p>In the world that I live in Kubernetes is all the rage. This is the world of professional software development and deployment where medium- and large-sized companies are trying to reduce cost and complexity of their IT platforms while at the same time becoming faster at making changes to the software that they run as services to either their internal or external customers. I&rsquo;ve been on the side of development teams consuming Kubernetes myself and I was impressed and delighted by its concept of &ldquo;desired state&rdquo; represented by simple manifest files that me and my team were maintaining for the applications that we built. Later I switched roles and became a Kubernetes engineer myself, now helping platform teams delivering Kubernetes to development teams. If you&rsquo;re eager to learn how Kubernetes works internally and what a complex system it is that makes it so simple to deliver applications then this blog post is for you. Because I deeply believe that <strong>in order to learn how Kubernetes works you should run your own Kubernetes cluster on bare-metal hardware</strong>.</p>
<p>Taking first steps with Kubernetes is easier today than it has ever been: My favorite project for quickly spinning up a cluster is <a href="https://kind.sigs.k8s.io/" target="_blank" rel="noopener">kind</a>, Kubernetes in Docker. Run <code>kind create cluster</code> and after a couple of seconds your cluster is ready to go. There&rsquo;s various alternatives out there, too, with <a href="https://microk8s.io/" target="_blank" rel="noopener">microk8s</a>, <a href="https://k3s.io/" target="_blank" rel="noopener">k3s</a> and <a href="https://minikube.sigs.k8s.io/docs/" target="_blank" rel="noopener">minikube</a> being the most prominent ones. This got me started easily and quickly with Kubernetes development back when I switched roles. However, later on, when I was involved in more complex product development around Kubernetes, building controllers and maintaining an enterprise-grade Kubernetes distribution at <a href="https://d2iq.com" target="_blank" rel="noopener">D2iQ</a>, I needed to get more intimate with the internals. I wanted to understand all the intricacies of it, what happened under the hood when I ran <code>kubectl apply -f my-awesome-app.yaml</code>, how traffic is ingested into a cluster and further routed to the right container, how DNS works in the cluster, what all the possible ways were to provide persistent storage to containers, how a cluster is properly secured from unauthorized access etc. etc.</p>
<p>At that point I figured I needed to run my own cluster at home on bare-metal hardware and dig really deep into the details of keeping a Kubernetes cluster up 24/7, serving applications to the Internet and the internal home network in a secure fashion. That was nearly 3 years ago when Raspberry Pis were still affordable enough that I could just grab a handful and get going. I ordered 4 Rpi 4s with 4 GByte of RAM in addition to the various older RPis I already owned, the awesome <a href="https://www.c4labs.com/product/cloudlet-cluster-case-raspberry-pi/" target="_blank" rel="noopener">8-slot transparent cluster case</a> from C4 Labs, a cheap 8-port Ethernet switch, a couple of Cat 6 Ethernet cables and a 6-port USB power adapter.</p>
<h1 id="setting-goals">Setting Goals</h1>
<p>I quickly figured I needed to set clear expectations of how the cluster would be used so I set myself some goals:</p>
<ul>
<li>It should run on a separate network, isolated from the rest of my home network for security purposes.</li>
<li>It should be possible to expose services from inside the cluster to my home network but not the Internet.</li>
<li>It should be possible to expose services from inside the cluster to the Internet.</li>
<li>The API server should be reachable from inside the cluster&rsquo;s LAN as well as from inside my home LAN but not from the Internet.</li>
<li>It doesn&rsquo;t need to be highly available so running a single control-plane node is good enough as a start.</li>
</ul>
<p>From these goals I derived a couple of designations for each of the nodes on the cluster network:</p>
<ul>
<li>1 router for bridging the cluster network and my home LAN.</li>
<li>1 control-plane node for both etcd and the Kubernetes control-plane components.</li>
<li>3 worker nodes.</li>
<li>1 machine for providing storage to the cluster using NFS.</li>
</ul>
<h1 id="the-final-architecture">The Final Architecture</h1>
<figure >
    <img loading="lazy" src="/images/k8s-home-arch.png"
         alt="An architecture diagram showing the network layout of my home Kubernetes cluster and surrounding components"/> <figcaption style="text-align: center;">
            <p>The final architecture of my Kubernetes bare-metal cluster</p>
        </figcaption>
</figure>

<p>In the image above you see all the components that currently make up my home Kubernetes cluster. Everything in the 10.0.0.0/24 LAN is pretty standard with one node serving as control plane and 3 others serving as workers. All of the Kubernetes nodes are running an LTS Ubuntu version and are manually provisioned. I built some scripting around setting up default firewall rules, SSH access and a couple of other configuration items. Automating the node provisioning is still on my list. An additional node (running Debian, I don&rsquo;t recall why) has an SSD attached and serves it over NFS. More on that later.</p>
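<p>To give an idea of what that scripting covers, here&rsquo;s a minimal sketch; the host name, firewall rules and commands are illustrative, not my exact setup:</p>
<pre><code class="language-sh">#!/bin/sh
# Hypothetical excerpt of a node bootstrap script (illustrative only).
set -eu

HOST="$1" # e.g. rpi1.cluster.home.e13.dev

# Disable SSH password logins, allowing key-based auth only.
ssh "root@${HOST}" "sed -i 's/^#\?PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config &amp;&amp; systemctl reload ssh"

# Default firewall rules: SSH plus the API server and kubelet ports.
ssh "root@${HOST}" "ufw allow 22/tcp &amp;&amp; ufw allow 6443/tcp &amp;&amp; ufw allow 10250/tcp &amp;&amp; ufw --force enable"
</code></pre>
<p>Running this once per node gets a machine to a consistent baseline before joining it to the cluster.</p>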
<h2 id="kubernetes">Kubernetes</h2>
<p>As one of my goals was to learn Kubernetes the hard way (not Kelsey Hightower style, though), I used <a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/" target="_blank" rel="noopener">kubeadm</a> to get the cluster going and that&rsquo;s still the tool I use to maintain it, e.g. when upgrading the K8s version. The configuration doesn&rsquo;t deviate too much from kubeadm&rsquo;s defaults which is good enough for my needs.</p>
<p>Even though I&rsquo;m the only user of that cluster at the moment, I did want to make it &ldquo;tenant-aware&rdquo; in the sense that there&rsquo;s a rather simple way to manage users. In the beginning I just created certificates for each user manually but I moved on and now user management is offloaded to a Keycloak instance I&rsquo;m running on a hosted server. Configuring Kubernetes&rsquo; API server for OpenID Connect isn&rsquo;t extremely complicated but you need to figure out the right knobs. Here&rsquo;s an excerpt from my kubeadm configuration:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span><span style="color:#f92672">apiVersion</span>: <span style="color:#ae81ff">v1</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">kind</span>: <span style="color:#ae81ff">ConfigMap</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">metadata</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">name</span>: <span style="color:#ae81ff">kubeadm-config</span>
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">namespace</span>: <span style="color:#ae81ff">kube-system</span>
</span></span><span style="display:flex;"><span><span style="color:#f92672">data</span>:
</span></span><span style="display:flex;"><span>  <span style="color:#f92672">ClusterConfiguration</span>: |<span style="color:#e6db74">
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    apiServer:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">      certSANs:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">      - apiserver.cluster.home.e13.dev
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">      extraArgs:</span>    
</span></span><span style="display:flex;"><span>[<span style="color:#ae81ff">...]</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">authorization-mode</span>: <span style="color:#ae81ff">Node,RBAC</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">oidc-client-id</span>: <span style="color:#ae81ff">k8s-apiserver</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">oidc-groups-claim</span>: <span style="color:#ae81ff">groups</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">oidc-issuer-url</span>: <span style="color:#ae81ff">https://##REDACTED##/realms/e13</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">oidc-username-claim</span>: <span style="color:#ae81ff">email</span>
</span></span><span style="display:flex;"><span>[<span style="color:#ae81ff">...]</span>
</span></span></code></pre></div><p>For client-side OIDC support I have installed the <a href="https://github.com/int128/kubelogin" target="_blank" rel="noopener">kubelogin kubectl plugin</a>. After setting this up I created some RoleBindings to grant the respective users/groups access to API resources (the RoleBinding manifests are all maintained in Git, more on that later).</p>
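<p>On the client side, the kubeconfig then delegates token retrieval to kubelogin through an exec plugin. A sketch matching the API server flags above (the user name is arbitrary and the issuer URL is redacted as before):</p>
<pre><code class="language-yaml">users:
- name: oidc-user
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      command: kubectl
      args:
      - oidc-login
      - get-token
      - --oidc-issuer-url=https://##REDACTED##/realms/e13
      - --oidc-client-id=k8s-apiserver
</code></pre>
<p>On first use kubelogin opens a browser window for the Keycloak login and caches the resulting token locally.</p>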
<p>Upgrading to the latest Kubernetes version is probably the most tedious task at the moment as I haven&rsquo;t automated any of that so it&rsquo;s mostly following the <a href="https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/" target="_blank" rel="noopener">upgrade guide</a>.</p>
<h2 id="network">Network</h2>
<p>The 10.0.0.0/24 network is a simple switched network using a cheap tp-link 8-port gigabit switch. The other network, 10.11.12.0/24, is my home LAN for all devices that need Internet connectivity, the Playstation 4, Echo devices, smartphones and laptops. We have Ethernet outlets in each room of our house and a 24-port gigabit switch in the basement. For wireless connectivity I have several wifi APs running in the house that operate on the same network. A MikroTik hEX router together with a VDSL modem provides Internet access. It serves IP addresses for Ethernet and wifi devices, acts as router and DNS server. It provides <a href="https://wiki.mikrotik.com/wiki/Manual:IP/Cloud#DDNS" target="_blank" rel="noopener">DDNS capabilities</a> out of the box and I&rsquo;m using a DNS CNAME entry to get traffic from outside into the network. You&rsquo;ll see it in action when accessing <a href="https://home.e13.dev" target="_blank" rel="noopener">home.e13.dev</a> (nothing fancy there, though).</p>
<h3 id="traffic-out">Traffic Out</h3>
<p>As you can see in the architecture diagram above, another Raspi (&ldquo;rpi0&rdquo;; I&rsquo;m too lazy to come up with a fancy naming scheme, so all Raspis are just numbered) serves as router between the home LAN and the cluster LAN. It has two physical Ethernet interfaces (one provided through a USB-to-Ethernet adapter) and a MACVLAN interface. A pretty good explanation of the different virtual networking options you have on Linux is provided over at the <a href="https://developers.redhat.com/blog/2018/10/22/introduction-to-linux-interfaces-for-virtual-networking#macvlan" target="_blank" rel="noopener">Red Hat Developer portal</a>. Creating a MACVLAN interface with NetworkManager is pretty simple:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>$ nmcli c add ifname veth0 autoconnect yes save yes type macvlan dev eth1 mode bridge
</span></span><span style="display:flex;"><span>$ nmcli con modify macvlan-veth0 ipv4.dhcp-hostname <span style="color:#e6db74">&#34;rpi0-1&#34;</span>
</span></span></code></pre></div><p>I&rsquo;m not sure if there&rsquo;s a way to incorporate the second command into the first one but this is good enough for my needs. Now two of the interfaces are part of the home LAN (that provides Internet access) and the third one is part of the cluster LAN. The home LAN interfaces just use DHCP to get their IP configuration from the MikroTik router.</p>
<p>To the cluster LAN, rpi0 serves as DHCP and DNS server using the awesome <a href="https://thekelleys.org.uk/dnsmasq/doc.html" target="_blank" rel="noopener">dnsmasq</a>. Dnsmasq automatically serves the host it&rsquo;s running on as default route. The domain of all cluster nodes is set by dnsmasq using the <code>domain=cluster.home.e13.dev</code> parameter. Now to make rpi0 actually work as a NAT gateway for the cluster LAN hosts, the Linux firewall (aka iptables) needs to be properly configured. This was the hardest part for me as I&rsquo;m not at all proficient in iptables. I would rather defer to your favorite search engine for finding out how to do that instead of giving potentially wrong advice. Suffice it to say that my setup works (though it might not be the most efficient or secure).</p>
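<p>For illustration, the dnsmasq side of this looks roughly like the following; the interface name and DHCP range are examples, not my exact values:</p>
<pre><code class="language-sh"># /etc/dnsmasq.conf (sketch)
interface=eth0                # the cluster-LAN interface on rpi0
domain=cluster.home.e13.dev   # domain handed out to all cluster nodes
dhcp-range=10.0.0.50,10.0.0.99,12h
# dnsmasq advertises the host it runs on as default route and DNS
# server unless configured otherwise.
</code></pre>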
<h3 id="traffic-in">Traffic In</h3>
<p>Now the cluster nodes have Internet access through rpi0 but we also want to connect to services running in the cluster, e.g. a Grafana instance or any other web application deployed in Kubernetes. The usual way to expose a service in Kubernetes is to create a <code>LoadBalancer</code> type Service resource. If you&rsquo;re running Kubernetes on one of the major cloud providers this is all you need to do to get a public IP address or hostname assigned to the service. On bare metal, though, this is not the case. This is where <a href="https://metallb.universe.tf/" target="_blank" rel="noopener">MetalLB</a> enters the stage. Running in a cluster it takes care of assigning IP addresses and setting up the network layer of the nodes to direct traffic to those IP addresses to the right pods. On my cluster I&rsquo;m using the (simpler) <a href="https://metallb.universe.tf/concepts/layer2/" target="_blank" rel="noopener">Layer 2 mode</a> for advertising services and I set aside part of the 10.0.0.0/24 address space for MetalLB (excluded from dnsmasq&rsquo;s DHCP range).</p>
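<p>With MetalLB&rsquo;s current CRD-based configuration, reserving such a range looks roughly like this (the address range is an example; older MetalLB releases used a ConfigMap instead):</p>
<pre><code class="language-yaml">apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default
  namespace: metallb-system
spec:
  addresses:
  - 10.0.0.200-10.0.0.220 # example range, kept out of dnsmasq's DHCP scope
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default
  namespace: metallb-system
spec:
  ipAddressPools:
  - default
</code></pre>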
<p>Next, traffic coming from outside of the cluster network needs to be proxied to each LoadBalancer IP address. For this to work I created my own little <a href="https://github.com/makkes/l4proxy" target="_blank" rel="noopener">transport layer proxy</a> configured simply through YAML files. It also ships with a <a href="https://github.com/makkes/l4proxy/tree/1a2ce6834f04bb2aa1f7f5c20e3609568fc2053c/service-announcer" target="_blank" rel="noopener">service-announcer tool</a> that generates l4proxy configuration files based on Kubernetes LoadBalancer-type Service resources it finds on the cluster. L4proxy then just binds to a configured interface and proxies the connections to one of the LoadBalancer services&rsquo; IP addresses.</p>
<p>L4proxy runs on both home LAN interfaces so that I can selectively forward traffic from either of the two home LAN interfaces on rpi0. Each of these interfaces has a specific purpose: one is only reachable from the home LAN (the one that has 10.11.12.32 assigned to it in the diagram above) so that I can constrain e.g. my smart home Grafana instance to LAN machines. The other one receives traffic from the MikroTik Internet router, which forwards all traffic directed at the DDNS domain to rpi0&rsquo;s interface (10.11.12.51 in the diagram).</p>
<p>Now that we have all the network shenanigans behind us we need to let Kubernetes know about the incoming traffic and where to direct it. As I said above MetalLB picks up LoadBalancer Services but there&rsquo;s no need to create those yourself when you&rsquo;re using an ingress controller. I opted for <a href="https://github.com/kubernetes/ingress-nginx" target="_blank" rel="noopener">ingress-nginx</a>, mainly for its simplicity. It creates a LoadBalancer service and directs traffic based on Ingress resources. You can read all about Ingresses in the wonderful <a href="https://kubernetes.io/docs/concepts/services-networking/ingress/" target="_blank" rel="noopener">Kubernetes documentation</a>.</p>
<h4 id="ingressclass-configuration-with-ingress-nginx">IngressClass configuration with ingress-nginx</h4>
<p>I have two instances of ingress-nginx running on the cluster, one for external traffic and one for internal traffic. Two different <a href="https://kubernetes.io/docs/concepts/services-networking/ingress/#ingress-class" target="_blank" rel="noopener">IngressClass resources</a>, &ldquo;ingress-nginx&rdquo; and &ldquo;ingress-nginx-internal&rdquo; let each Ingress choose whether it should be exposed internally or externally. This is what the Helm values look like for the internal ingress-nginx controller:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-yaml" data-lang="yaml"><span style="display:flex;"><span>    <span style="color:#f92672">controller</span>:
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">electionID</span>: <span style="color:#ae81ff">ingress-controller-internal-leader</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">ingressClass</span>: <span style="color:#ae81ff">nginx-internal</span>
</span></span><span style="display:flex;"><span>      <span style="color:#f92672">ingressClassResource</span>:
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">name</span>: <span style="color:#ae81ff">internal-nginx</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">enabled</span>: <span style="color:#66d9ef">true</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">default</span>: <span style="color:#66d9ef">false</span>
</span></span><span style="display:flex;"><span>        <span style="color:#f92672">controllerValue</span>: <span style="color:#e6db74">&#34;k8s.io/internal-nginx&#34;</span>
</span></span></code></pre></div><p>One important thing I only figured out later on is that I needed to set the <code>electionID</code> parameters of each Helm release to a different value so that both instances don&rsquo;t conflict with each other for leader election.</p>
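<p>An application then picks the internal controller via the class name. A hypothetical Ingress for an internal Grafana instance could look like this (resource name and namespace are assumptions):</p>
<pre><code class="language-yaml">apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring # assumed namespace
spec:
  ingressClassName: internal-nginx # matches ingressClassResource.name
  rules:
  - host: grafana.cluster.home.e13.dev
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana # assumed Service name
            port:
              number: 80
</code></pre>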
<h3 id="dns">DNS</h3>
<p>There is actually one last thing left to do: resolve host names defined in the Ingress resources to either the IP address of the internally facing rpi0 interface or the publicly facing ISP-assigned IP address of the MikroTik router. For internal services I merely maintain a list of static DNS entries on the MikroTik router. Each internal service, e.g. <code>grafana.cluster.home.e13.dev</code> is backed by a CNAME entry in turn resolving to the internal rpi0 interface. By using a CNAME I don&rsquo;t have to change all DNS entries whenever that interface&rsquo;s IP address changes. For externally facing services I maintain DNS entries at my DNS provider. Those also are just CNAME entries resolving to the DDNS name of my MikroTik.</p>
<h2 id="storage">Storage</h2>
<p>I&rsquo;m running a couple of stateful applications on my cluster, e.g. Grafana and some internal applications backed by SQL databases. This state needs to be persisted somewhere. In my search for a simple yet production-ready solution I chose to bet on NFS because it is very simple to set up and PersistentVolume provisioning in Kubernetes is easy to get using the <a href="https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner" target="_blank" rel="noopener">Kubernetes NFS Subdir External Provisioner</a>. The latter provides all resources to get going quickly. All my stateful data is backed by a PV provisioned from NFS at the moment. Before you do this on your own cluster, though, be aware of the following caveats:</p>
<ul>
<li>NFS is inherently insecure: It doesn&rsquo;t provide transit encryption of traffic or access control mechanisms. <a href="https://tldp.org/HOWTO/NFS-HOWTO/security.html" target="_blank" rel="noopener">This guide</a> by the Linux Documentation Project provides details on the security aspects of NFS.</li>
<li>I found that NFS-backed PVs respond pretty badly to unscheduled node restarts. When a node goes down unexpectedly the pods can&rsquo;t be automatically moved to another one because they are stuck in Terminating state until I restore the node. I haven&rsquo;t found a solution to this, yet.</li>
<li>When the NFS server goes down, NFS mounts on nodes might get stuck without any ability to restore them other than rebooting the node. I managed to mitigate this a little by instructing the provisioner to use soft mounts. Those have a couple of drawbacks, though, so you might want to understand the implications before doing that yourself.</li>
</ul>
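<p>With the provisioner deployed, stateful workloads just request storage through a regular PersistentVolumeClaim against its StorageClass; a sketch, assuming the class name <code>nfs-client</code> that the provisioner ships by default (claim name and size are hypothetical):</p>
<pre><code class="language-yaml">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-data # hypothetical claim name
spec:
  storageClassName: nfs-client # default class name shipped by the provisioner
  accessModes:
  - ReadWriteMany # NFS allows shared read-write access
  resources:
    requests:
      storage: 5Gi
</code></pre>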
<p>I would never serve any serious production data from NFS shares but it&rsquo;s good enough for my home setup, especially since all the other solutions out there seem to require a lot more work to get set up and they consume more resources on the cluster nodes.</p>
<p>At the moment the NFS storage has no backup. I&rsquo;m manually creating DB backups of all the PostgreSQL databases from time to time but all other data might get lost once the NFS disk dies. This is something I still need to improve.</p>
<h2 id="day-2-operations-gitopsflux">Day 2 Operations: GitOps/Flux</h2>
<p>Given that the cluster setup is a little flaky, especially with only one control plane node, I wanted to operate it with the assumption that it might go down any day. (The disk <strong>will</strong> die some day!) This led me to store all the Kubernetes resources in Git and have <a href="https://fluxcd.io" target="_blank" rel="noopener">Flux</a> manage them for me. This way, I can easily restore all the applications from that Git repo in case I need to set up a new cluster.</p>
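<p>Restoring a fresh cluster then boils down to bootstrapping Flux against that repository, roughly like this (owner, repository and path are placeholders, not my actual setup):</p>
<pre><code class="language-sh"># Install the Flux controllers into the cluster and point them at the
# Git repository holding all Kubernetes manifests; placeholders only.
flux bootstrap github \
  --owner=&lt;github-user&gt; \
  --repository=&lt;fleet-repo&gt; \
  --path=clusters/home \
  --personal
</code></pre>
<p>From there Flux reconciles everything in the repository back onto the cluster.</p>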
<h1 id="takeaways">Takeaways</h1>
<p>I did learn an awful lot in the last couple of years operating this cluster. I had downtimes for the strangest reasons, I replaced the CNI provider once while the cluster was running, I lost data by <a href="https://hachyderm.io/@makkes/109301463748074424" target="_blank" rel="noopener">accidentally deleting a PV with a <code>Delete</code> ReclaimPolicy</a> and I probably forgot a couple of other issues I ran into (and very likely caused myself). As you can see from the list above, running your own Kubernetes cluster at home and using it for anything serious is a lot of upfront work. It also is a lot of regular maintenance work. You need to keep the OS on each node up-to-date, you need to update Kubernetes from time to time, replace dying nodes, restore data after disk failures. You&rsquo;ll occasionally be opening your browser only to see that your app is down for some strange reason.</p>
<p>For me that was the whole purpose of the exercise and it helps me improve in my day-to-day job as a Kubernetes engineer and Flux maintainer.</p>
]]></content:encoded></item><item><title>Running a Docker registry on Kubernetes (in kind)</title><link>https://e13.dev/blog/docker-registry-on-k8s/</link><pubDate>Fri, 06 Nov 2020 00:00:00 +0000</pubDate><guid>https://e13.dev/blog/docker-registry-on-k8s/</guid><description>In the last weeks I have been working a lot on supporting Kubernetes in air-gapped environments, i.e. environments that don&amp;rsquo;t have any access to the internet. Many companies prefer to run their IT infrastructure in such a way to minimize the attack vector against it and be able to tightly control what&amp;rsquo;s running on their clusters. Part of these setups naturally is a Docker registry that runs on that air-gapped infrastructure and in order to properly reproduce such a scenario, I had to run a Docker registry on my kind cluster as well and I thought sharing the manifests may help anyone out there get set up faster next time.</description><content:encoded><![CDATA[<p>In the last weeks I have been working a lot on supporting Kubernetes in air-gapped environments, i.e. environments that don&rsquo;t have any access to the internet. Many companies prefer to run their IT infrastructure in such a way to minimize the attack vector against it and be able to tightly control what&rsquo;s running on their clusters. Part of these setups naturally is a Docker registry that runs on that air-gapped infrastructure and in order to properly reproduce such a scenario, I had to run a Docker registry on my <a href="https://kind.sigs.k8s.io/" target="_blank" rel="noopener">kind</a> cluster as well and I thought sharing the manifests may help anyone out there get set up faster next time. Running a Docker registry may be even more important given the <a href="https://www.docker.com/blog/what-you-need-to-know-about-upcoming-docker-hub-rate-limiting/" target="_blank" rel="noopener">new position</a> that Docker Inc. has put us into.</p>
<h2 id="tldr-">TL;DR ⏳</h2>
<p>When trying to run a custom Docker registry on kind, you will face some obstacles: The registry has to be reachable from outside of the cluster (to push images) and from each cluster node (by kubelet). Plus, the CA certificate of the registry has to be advertised to each cluster node as well. <a href="#the-complete-rundown-">Jump down for the TL;DR steps</a>.</p>
<h2 id="getting-there-">Getting there 🚶</h2>
<p>My first idea was to just create a <code>Secret</code>, a <code>Deployment</code> and a <code>ClusterIP</code> <code>Service</code> exposing the deployment. To be able to push images to the running registry I just had to add <code>registry.registry.svc</code> to my <code>/etc/hosts</code> file with the address 127.0.0.1 and do a <code>kubectl -n registry port-forward svc/registry 1443</code>. From then on I was able to tag an image with the <code>registry.registry.svc:1443/</code> prefix and push it to the newly created registry. 🥳</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>$ docker tag nginx:1.19.4 registry.registry.svc:1443/nginx:1.19.4
</span></span><span style="display:flex;"><span>$ docker push registry.registry.svc:1443/nginx:1.19.4
</span></span><span style="display:flex;"><span>The push refers to repository <span style="color:#f92672">[</span>registry.registry.svc:1443/nginx<span style="color:#f92672">]</span>
</span></span><span style="display:flex;"><span>7b5417cae114: Layer already exists
</span></span><span style="display:flex;"><span>aee208b6ccfb: Layer already exists
</span></span><span style="display:flex;"><span>2f57e21e4365: Layer already exists
</span></span><span style="display:flex;"><span>2baf69a23d7a: Pushed
</span></span><span style="display:flex;"><span>d0fe97fa8b8c: Pushed
</span></span><span style="display:flex;"><span>1.19.4: digest: sha256:34f3f875e745861ff8a37552ed7eb4b673544d2c56c7cc58f9a9bec5b4b3530e size: <span style="color:#ae81ff">1362</span>
</span></span><span style="display:flex;"><span>$ k run nginx --image<span style="color:#f92672">=</span>registry.registry.svc:1443/nginx:1.19.4
</span></span><span style="display:flex;"><span>pod/nginx created
</span></span><span style="display:flex;"><span>$ k get pod nginx
</span></span><span style="display:flex;"><span>NAME    READY   STATUS         RESTARTS   AGE
</span></span><span style="display:flex;"><span>nginx   0/1     ErrImagePull   <span style="color:#ae81ff">0</span>          13s
</span></span></code></pre></div><p>Whoops, that didn&rsquo;t work so well: a pod referencing the image I just pushed to the internal registry has issues pulling it. Let&rsquo;s look at the details:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>$ k describe pod nginx
</span></span><span style="display:flex;"><span><span style="color:#f92672">[</span>...<span style="color:#f92672">]</span>
</span></span><span style="display:flex;"><span>Events:
</span></span><span style="display:flex;"><span>  Type     Reason     Age               From               Message
</span></span><span style="display:flex;"><span>  ----     ------     ----              ----               -------
</span></span><span style="display:flex;"><span>  Normal   Scheduled  16s               default-scheduler  Successfully assigned default/nginx to kind-control-plane
</span></span><span style="display:flex;"><span>  Normal   BackOff    15s               kubelet            Back-off pulling image <span style="color:#e6db74">&#34;registry.registry.svc:1443/nginx:1.19.4&#34;</span>
</span></span><span style="display:flex;"><span>  Warning  Failed     15s               kubelet            Error: ImagePullBackOff
</span></span><span style="display:flex;"><span>  Normal   Pulling    3s <span style="color:#f92672">(</span>x2 over 16s<span style="color:#f92672">)</span>  kubelet            Pulling image <span style="color:#e6db74">&#34;registry.registry.svc:1443/nginx:1.19.4&#34;</span>
</span></span><span style="display:flex;"><span>  Warning  Failed     3s <span style="color:#f92672">(</span>x2 over 16s<span style="color:#f92672">)</span>  kubelet            Failed to pull image <span style="color:#e6db74">&#34;registry.registry.svc:1443/nginx:1.19.4&#34;</span>: rpc error: code <span style="color:#f92672">=</span> Unknown desc <span style="color:#f92672">=</span> failed to pull and unpack image <span style="color:#e6db74">&#34;registry.registry.svc:1443/nginx:1.19.4&#34;</span>: failed to resolve reference <span style="color:#e6db74">&#34;registry.registry.svc:1443/nginx:1.19.4&#34;</span>: failed to <span style="color:#66d9ef">do</span> request: Head https://registry.registry.svc:1443/v2/nginx/manifests/1.19.4: dial tcp 127.0.0.1:1443: connect: connection refused
</span></span><span style="display:flex;"><span>  Warning  Failed     3s <span style="color:#f92672">(</span>x2 over 16s<span style="color:#f92672">)</span>  kubelet            Error: ErrImagePull
</span></span></code></pre></div><p>Look closely at the <code>From</code> column of the events. It&rsquo;s the kubelet service that&rsquo;s unable to pull the image, and when you think about it, that makes total sense: kubelet doesn&rsquo;t run inside the cluster but rather directly on each node. So somehow I needed to make the registry available to each node.</p>
<h2 id="trying-harder-">Trying Harder 💪</h2>
<p>Enter the <code>NodePort</code> service type, which makes a service available externally via the IP addresses of cluster nodes. This service type also helps us kill two birds with one stone: we can push images into the registry from outside the cluster as well as pull images from inside the cluster (i.e. by the kubelet). So I created a kind cluster exposing the service&rsquo;s port to the host using the <code>extraPortMappings</code> configuration option, changed <code>/etc/hosts</code> to let <code>kind-control-plane</code> point to 127.0.0.1 and changed the <code>ClusterIP</code> service to be a <code>NodePort</code> service:</p>
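<p>The Service change itself is small; a sketch of the <code>NodePort</code> variant (the selector label is an assumption about the registry Deployment in the manifest, and 5000 is the registry image&rsquo;s default listen port):</p>
<pre><code class="language-yaml">apiVersion: v1
kind: Service
metadata:
  name: registry
  namespace: registry
spec:
  type: NodePort
  selector:
    app: registry # assumed label on the registry Deployment
  ports:
  - port: 1443
    targetPort: 5000 # default listen port of the registry container
    nodePort: 30443  # matches the kind extraPortMappings
</code></pre>
<p>Creating the cluster with the port mapping and redeploying then looks like this:</p>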
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>$ kind create cluster --config<span style="color:#f92672">=</span>- <span style="color:#e6db74">&lt;&lt;EOF
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">kind: Cluster
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">apiVersion: kind.x-k8s.io/v1alpha4
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">nodes:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">- role: control-plane
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  extraPortMappings:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  - containerPort: 30443
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    hostPort: 30443
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    listenAddress: &#34;127.0.0.1&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    protocol: tcp
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">EOF</span>
</span></span><span style="display:flex;"><span>Creating cluster <span style="color:#e6db74">&#34;kind&#34;</span> ...
</span></span><span style="display:flex;"><span><span style="color:#f92672">[</span>...<span style="color:#f92672">]</span>
</span></span><span style="display:flex;"><span>$ k create -f docker-registry.yaml
</span></span><span style="display:flex;"><span>namespace/registry created
</span></span><span style="display:flex;"><span>secret/registry created
</span></span><span style="display:flex;"><span>deployment.apps/registry created
</span></span><span style="display:flex;"><span>service/registry created
</span></span><span style="display:flex;"><span>$ docker push kind-control-plane:30443/nginx:1.19.4
</span></span><span style="display:flex;"><span><span style="color:#f92672">[</span>...<span style="color:#f92672">]</span>
</span></span><span style="display:flex;"><span>$ k run nginx --image<span style="color:#f92672">=</span>kind-control-plane:30443/nginx:1.19.4
</span></span><span style="display:flex;"><span>pod/nginx created
</span></span><span style="display:flex;"><span>$ k describe pod nginx
</span></span><span style="display:flex;"><span><span style="color:#f92672">[</span>...<span style="color:#f92672">]</span>
</span></span><span style="display:flex;"><span>  Normal   Pulling    7s <span style="color:#f92672">(</span>x3 over 66s<span style="color:#f92672">)</span>   kubelet            Pulling image <span style="color:#e6db74">&#34;kind-control-plane:30443/nginx:1.19.4&#34;</span>
</span></span><span style="display:flex;"><span>  Warning  Failed     7s <span style="color:#f92672">(</span>x3 over 50s<span style="color:#f92672">)</span>   kubelet            Error: ErrImagePull
</span></span><span style="display:flex;"><span>  Warning  Failed     7s <span style="color:#f92672">(</span>x2 over 38s<span style="color:#f92672">)</span>   kubelet            Failed to pull image <span style="color:#e6db74">&#34;kind-control-plane:30443/nginx:1.19.4&#34;</span>: rpc error: code <span style="color:#f92672">=</span> Unknown desc <span style="color:#f92672">=</span> failed to pull and unpack image <span style="color:#e6db74">&#34;kind-control-plane:30443/nginx:1.19.4&#34;</span>: failed to resolve reference <span style="color:#e6db74">&#34;kind-control-plane:30443/nginx:1.19.4&#34;</span>: failed to <span style="color:#66d9ef">do</span> request: Head https://kind-control-plane:30443/v2/nginx/manifests/1.19.4: x509: certificate signed by unknown authority
</span></span></code></pre></div><p>Oh well, that was to be expected. I created a self-signed certificate to back the registry&rsquo;s HTTPS transport, so now I had to make kubelet aware of the CA certificate.</p>
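<p>For reference, such a self-signed certificate can be created in one go with openssl; the file names are examples, <code>-addext</code> requires OpenSSL 1.1.1 or newer, and the subjectAltName must cover every host name used to reach the registry:</p>
<pre><code class="language-sh"># Self-signed certificate with SANs matching the registry host names.
openssl req -x509 -newkey rsa:4096 -nodes -days 365 \
  -keyout /tmp/tls.key -out /tmp/tls.crt \
  -subj "/CN=kind-control-plane" \
  -addext "subjectAltName=DNS:kind-control-plane,DNS:registry.registry.svc"
</code></pre>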
<h2 id="the-last-step-">The last step 🏁</h2>
<p>To make kubelet (or rather containerd) aware of the new CA certificate, I had to copy it into the Docker container that&rsquo;s running the cluster node (this is a single-node cluster, after all):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>$ docker cp /tmp/tls.crt kind-control-plane:/usr/local/share/ca-certificates/
</span></span><span style="display:flex;"><span>$ docker exec -t kind-control-plane update-ca-certificates
</span></span><span style="display:flex;"><span>Updating certificates in /etc/ssl/certs...
</span></span><span style="display:flex;"><span><span style="color:#ae81ff">1</span> added, <span style="color:#ae81ff">0</span> removed; <span style="color:#66d9ef">done</span>.
</span></span><span style="display:flex;"><span>Running hooks in /etc/ca-certificates/update.d...
</span></span><span style="display:flex;"><span><span style="color:#66d9ef">done</span>.
</span></span><span style="display:flex;"><span>$ k run nginx --image<span style="color:#f92672">=</span>kind-control-plane:30443/nginx:1.19.4
</span></span><span style="display:flex;"><span>pod/nginx created
</span></span><span style="display:flex;"><span>$ k get pod nginx -w
</span></span><span style="display:flex;"><span>NAME    READY   STATUS              RESTARTS   AGE
</span></span><span style="display:flex;"><span>nginx   0/1     ContainerCreating   <span style="color:#ae81ff">0</span>          0s
</span></span><span style="display:flex;"><span>nginx   1/1     Running             <span style="color:#ae81ff">0</span>          2s
</span></span></code></pre></div><p>Et voilà! The table is set. As an improvement over keeping the CA certificate file lying around in my filesystem, I simply extracted it from the <code>Secret</code> in the cluster.</p>
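<p>For the record, getting the certificate out of that <code>Secret</code> is a one-liner. This is just the extraction step on its own, assuming the <code>Secret</code> is named <code>registry</code> and lives in the <code>registry</code> namespace (as in the manifest used here):</p>
<pre tabindex="0"><code>$ k -n registry get secret registry -o jsonpath=&#39;{.data.tls\.crt}&#39; | base64 -d &gt; /tmp/tls.crt
</code></pre>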
<h2 id="the-complete-rundown-">The Complete Rundown 🏎</h2>
<ol>
<li>
<p>Download the <a href="/downloads/docker-registry.yaml">Docker registry manifest</a></p>
</li>
<li>
<p>Install the registry and configure the cluster node:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>$ kind create cluster --config<span style="color:#f92672">=</span>- <span style="color:#e6db74">&lt;&lt;EOF
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">kind: Cluster
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">apiVersion: kind.x-k8s.io/v1alpha4
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">nodes:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">- role: control-plane
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  extraPortMappings:
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">  - containerPort: 30443
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    hostPort: 30443
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    listenAddress: &#34;127.0.0.1&#34;
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">    protocol: tcp
</span></span></span><span style="display:flex;"><span><span style="color:#e6db74">EOF</span>
</span></span><span style="display:flex;"><span>$ k create -f docker-registry.yaml
</span></span><span style="display:flex;"><span>$ k -n registry get secret registry -o jsonpath<span style="color:#f92672">=</span><span style="color:#e6db74">&#39;{.data.tls\.crt}&#39;</span>|base64 -d|docker exec -i kind-control-plane sh -c <span style="color:#e6db74">&#34;cat - &gt; /usr/local/share/ca-certificates/registry-ca.crt &amp;&amp; update-ca-certificates &amp;&amp; systemctl restart containerd.service&#34;</span>
</span></span></code></pre></div></li>
<li>
<p>Make the service reachable under the node&rsquo;s name (the <code>grep</code> makes sure we&rsquo;re not adding a second entry):</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>grep -E <span style="color:#e6db74">&#39; kind-control-plane( |$)&#39;</span> /etc/hosts <span style="color:#f92672">||</span> echo <span style="color:#e6db74">&#39;127.0.0.1 kind-control-plane&#39;</span> | sudo tee -a /etc/hosts
</span></span></code></pre></div></li>
<li>
<p>Push an image and create a test pod:</p>
<div class="highlight"><pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;"><code class="language-sh" data-lang="sh"><span style="display:flex;"><span>$ docker pull nginx:1.19.4
</span></span><span style="display:flex;"><span>$ docker tag nginx:1.19.4 kind-control-plane:30443/nginx:1.19.4
</span></span><span style="display:flex;"><span>$ docker push kind-control-plane:30443/nginx:1.19.4
</span></span><span style="display:flex;"><span>$ k run nginx --image<span style="color:#f92672">=</span>kind-control-plane:30443/nginx:1.19.4
</span></span><span style="display:flex;"><span>$ k get pod nginx -w
</span></span></code></pre></div></li>
</ol>
]]></content:encoded></item><item><title>Ansible delegation madness: delegate_to and variable substitution</title><link>https://e13.dev/blog/ansible-delegation-madness/</link><pubDate>Fri, 19 Jul 2019 00:00:00 +0000</pubDate><guid>https://e13.dev/blog/ansible-delegation-madness/</guid><description>Today I spent several hours tracking down a bug in one of our playbooks where a variable would be substituted with the wrong value if the task was delegated.</description><content:encoded><![CDATA[<p>This is going to be a short piece, but I really want to share it because 1)
I simply have to talk about it! It cost me several hours today to get a grip on this, and 2) I
couldn&rsquo;t find any explanation of this Ansible behaviour on Stack Overflow or
anywhere else (I have since <a href="https://stackoverflow.com/questions/57116025/ansible-variable-substitution-in-combination-with-task-delegation/" target="_blank" rel="noopener">posted it on
SO</a>
to make sure it&rsquo;s there now). By the way, I was reminded today that it can save
you several hours of bug hunting, experimenting and general hair-tearing if you
just know to <a href="https://stackoverflow.com/questions/31912748/how-to-run-a-particular-task-on-specific-host-in-ansible/31912973" target="_blank" rel="noopener">ask. the right. question.</a></p>
<p>Here at <a href="https://mesosphere.io" target="_blank" rel="noopener">Mesosphere</a> (and especially in the Cluster Ops
team I&rsquo;m in) we use Ansible a lot for spinning up, tearing down
and maintaining clusters. We build tools that make all operations on
DC/OS (and other) clusters insanely easy. Since I joined the company only recently,
coming from an application developer background and having mostly developed tools
in Go here, I&rsquo;m not the most proficient Ansible user on this planet. What I
had to achieve today was to run some tasks on all of the cluster&rsquo;s nodes and
some tasks on one special node only. What I came up with looked a bit like this:</p>
<pre tabindex="0"><code>01 - hosts: all
02  name: Test Play
03  gather_facts: false
04
05  tasks:
06      - name: Create output directory
07        tempfile:
08            state: directory
09            suffix: diag
10        register: output_dir
11
12      - name: Create API resources directory
13        file:
14            path: &#34;{{ output_dir.path }}/api-resources&#34;
15            state: directory
16        delegate_to: &#34;{{groups[&#39;control-plane&#39;][0]}}&#34;
17        run_once: yes
18        register: api_resources_dir
</code></pre><p>The intent of this playbook was to create a temporary directory on every node
(for storing some command output), and on one and only one host this temporary
directory was supposed to contain a directory named <code>api-resources</code>. When I ran the
playbook, though, that host ended up with two temporary directories, one of
which had the same name as the temporary directory on another host (and that
other host was, surprisingly (or not), the one that delegated the task).</p>
<h1 id="what-happened-here">What happened here?</h1>
<p>Turns out, the expression <code>{{ output_dir.path }}</code> in the second task is
evaluated <em>before</em> the task is delegated to the other node, i.e. with the
value that was registered on the delegating host. The delegated node therefore
creates the <code>api-resources</code> directory inside a different directory than
the one it created itself in the first task.</p>
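<p>For completeness: if a task really has to be delegated, the way to reference the value registered on the <em>delegate</em> is to go through <code>hostvars</code> explicitly. This is just a sketch of that variant (untested):</p>
<pre tabindex="0"><code>- name: Create API resources directory
  file:
      path: &#34;{{ hostvars[groups[&#39;control-plane&#39;][0]].output_dir.path }}/api-resources&#34;
      state: directory
  delegate_to: &#34;{{ groups[&#39;control-plane&#39;][0] }}&#34;
  run_once: yes
  register: api_resources_dir
</code></pre>
<p>This way the path comes from the facts of the control-plane host itself, regardless of which host the task happens to be templated for.</p>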
<h1 id="whats-the-correct-way-to-do-this">What&rsquo;s the correct way to do this?</h1>
<p>The correct way is to first figure out what you&rsquo;re doing wrong and why. That
took me 90% of the time today. It&rsquo;s probably just a matter of Ansible experience,
and of not blindly applying a pattern (<code>delegate_to</code>) you&rsquo;ve seen
elsewhere. Interestingly enough, I figured out the correct question only after I
had found the answer to my problem: &ldquo;How do I run a task on one specific node?&rdquo; But
as long as you think <code>delegate_to</code> is the right solution, you never even arrive at
asking that question (again).</p>
<p>There&rsquo;s this nice thing called <code>when</code> in Ansible that comes in handy here.
Here&rsquo;s the corrected playbook:</p>
<pre tabindex="0"><code> 1	- hosts: all
 2	  name: Test Play
 3	  gather_facts: false
 4
 5	  tasks:
 6	      - name: Create output directory
 7	        tempfile:
 8	            state: directory
 9	            suffix: diag
10	        register: output_dir
11
12	      - name: Create API resources directory
13	        file:
14	            path: &#34;{{ output_dir.path }}/api-resources&#34;
15	            state: directory
16	        when: inventory_hostname == groups[&#39;control-plane&#39;][0]
17	        register: api_resources_dir
</code></pre><p>Nice and slick. I hope this post will save someone a bad day.</p>
<p>Have a great one!</p>
]]></content:encoded></item><item><title>O&#39;Reilly Software Architecture Conference: My ping from London</title><link>https://e13.dev/blog/oreilly-sac18/</link><pubDate>Sun, 04 Nov 2018 00:00:00 +0000</pubDate><guid>https://e13.dev/blog/oreilly-sac18/</guid><description>I attended the conference this October in London and I&#39;m sharing my two cents here.</description><content:encoded><![CDATA[<p>I attended O&rsquo;Reilly&rsquo;s <a href="https://conferences.oreilly.com/software-architecture/sa-eu" target="_blank" rel="noopener">Software Architecture Conference in
London</a> this
October and I thought I&rsquo;d share my personal wrap-up of the most striking talks
I&rsquo;ve heard there. So buckle up for a tiny race through three days of talks and
workshops:</p>
<p><a href="https://twitter.com/sarahjwells" target="_blank" rel="noopener">sarahjwells</a> from the Financial Times gave
great advice on how to fight code rot in your microservice architecture:
<strong>Consider building overnight to fight code rot and keep services live and
healthy</strong>. This is great advice since there may be services in your environment
that you probably won&rsquo;t touch for a few months, and if you don&rsquo;t constantly
keep them building, a developer who has to fix a bug in one of those services will
first have a hard time untangling outdated dependencies.</p>
<p>I especially enjoyed <a href="https://twitter.com/lizrice" target="_blank" rel="noopener">lizrice&rsquo;s</a> keynote on
container security: <strong>Scan your container images for security vulnerabilities</strong>
and consider using <code>seccomp</code> in your containers.</p>
<p><a href="https://twitter.com/crichardson" target="_blank" rel="noopener">crichardson</a> simply stated: <strong>Microservices
shall not be the goal, that&rsquo;s an anti-pattern</strong>. Yeah, probably for those of you who
didn&rsquo;t grasp that already.</p>
<p>I also attended <a href="https://twitter.com/allenholub" target="_blank" rel="noopener">allenholub&rsquo;s</a> talk on
choreographing microservices (in contrast to orchestrating them). Especially
enjoyable was his opinion on delivery: <strong>I deploy the most simple implementation
and if nobody complains I&rsquo;m done</strong>. So true on so many levels, especially in an
enterprise environment like the one I work in.</p>
<p><a href="https://twitter.com/nikhilbarthwal" target="_blank" rel="noopener">nikhilbarthwal</a> shed some light on
real-world FaaS. My insight from his talk: <strong>FaaS instances are auto-scaled but
your DB probably isn&rsquo;t</strong>. As I followed the Twitter stream, though, his
opinions were very passionately discussed and disputed. I liked his balanced plea
for a hybrid world of FaaS and &ldquo;old-school&rdquo; microservices.</p>
<p><a href="https://twitter.com/stilkov" target="_blank" rel="noopener">stilkov</a> presented the most common types of
software architects; the one that stuck with me most is the <strong>Disillusioned
Architect</strong> who abstracts everything away. Stefan pointed to the term
&lsquo;Architecture Astronauts&rsquo; coined by Joel Spolsky.</p>
<p><a href="https://twitter.com/mikebroberts" target="_blank" rel="noopener">mikebroberts&rsquo;</a> keynote was especially
enlightening when he talked about the <strong>four levels of adopting serverless</strong>:</p>
<ol>
<li>
<p>Serverless operations (env. reporting, Lambda as shell scripts, Slack bots,
deployment automation)</p>
</li>
<li>
<p>Cron jobs, Serverless offline tasks</p>
</li>
<li>
<p>Serverless activities (message processing, isolated microservices)</p>
</li>
<li>
<p>Serverless ecosystems (websites, web applications, serverless data pipelines)</p>
</li>
</ol>
<p>Really great!</p>
<p>Less technical career advice for architects was given by JetBrains&rsquo;
<a href="https://twitter.com/trisha_gee" target="_blank" rel="noopener">trisha_gee</a>: <strong>Everyone is an architect these
days</strong>, <strong>ask questions and then LISTEN to the answers</strong>, <strong>be open to changing
your mind</strong>, <strong>do pair programming not only with developers but also with, say, a
business analyst</strong>, <strong>answer Stack Overflow questions</strong>. The last one is&hellip;
so&hellip; great. Set aside some time for your team to be constantly active on Stack
Overflow; it will change your attitude towards people and technologies and
you will learn A LOT!</p>
<p>Pivotal&rsquo;s <a href="https://twitter.com/cdavisafc" target="_blank" rel="noopener">cdavisafc</a> talked about getting rid
of the request-response paradigm in your software architecture. The punch line:
<strong>There&rsquo;s a major difference between old-style messaging (aka ESB) and event
logs like Kafka (e.g. no queues, event log as single source of truth, loosely
coupled data): The former is anti-agile while the latter is agile.</strong></p>
<p>Thanks, O&rsquo;Reilly, for getting all those people (and me) to London. Perhaps we&rsquo;ll
see each other again next year.</p>
]]></content:encoded></item></channel></rss>