Kubernetes Oops to Kubernetes Ops: Mistakes, Learnings, and Improved Practices
In my first professional experience with Kubernetes (K8s), I was tasked with improving our ecommerce store site performance and SEO. We decided to migrate our React single-page app from client-side rendering to server-side rendering, gradually migrating pages and routing traffic dynamically using a reverse proxy. To deploy the new applications, we explored containerization and chose AWS’s Elastic Kubernetes Service (EKS) for hosting via Kubernetes.
Looking back, I made several mistakes due to my limited understanding and inexperience with both the technology and mature cloud deployment processes. Coupling multiple new technologies into a single project was my first mistake, as it added complexity to the implementation, maintenance, and team onboarding. Additionally, some mistakes were a result of misinformation or outdated information since EKS and Kubernetes were relatively new in 2018.
During this time, I documented my journey through blog posts, hoping to help others facing similar challenges. However, I now question whether I unintentionally spread misinformation instead of providing accurate guidance.
My initial mistakes revolved around the architecture of our Kubernetes clusters. While I understood the importance of having separate staging and production environments, I wrongly set up a separate cluster for each deployment: a staging and a production cluster for both the React app and the reverse proxy, totaling four clusters. In retrospect, a better approach would have been one cluster per environment housing all the deployments (one for staging and one for production). This would have provided a simpler starting point, had I better understood the concepts of containerized workloads and orchestration platforms.
As a newcomer to Docker and containerization, my first step was learning how to build a Next.js Docker container and publish it to AWS ECR, as explained in Setting Up a Next.js Docker Container and Publishing it to AWS ECR. Fortunately, this process was straightforward and hard to mess up.
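These days, a Next.js image is typically built with a multi-stage Dockerfile along the lines below. This is a modern sketch, not what we used in 2018; the base image version and npm scripts are assumptions:

```dockerfile
# Build stage: install dependencies and compile the Next.js app
FROM node:18-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: only the built output and production dependencies
FROM node:18-alpine
WORKDIR /app
COPY --from=builder /app ./
EXPOSE 3000
CMD ["npm", "start"]
```

From there, the image is tagged with the ECR repository URI and pushed with docker push after authenticating against ECR via the AWS CLI.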
In Deploying a New Docker Image to an EKS Cluster on Codeship, I discussed our automated CI/CD pipeline for deploying changes to our staging cluster. However, our method of rolling out application changes to Kubernetes was flawed. I had a limited understanding of Kubernetes resource files and only used them for the initial setup, never updating them or treating them as templates. Instead, we relied on environment variables baked into the Docker container at build time, and our deployment approach used the kubectl set image command to patch a deployment's container image in place. That had real drawbacks: the change was not version-controlled, and it constrained our deployment workflow. In subsequent sections, I will share iterative learnings and improvements for deploying changes.

Another significant area of naivety in our Kubernetes clusters was the lack of observability: we hadn't set up logging or APM infrastructure, relying solely on pod logs and CloudWatch.
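To illustrate the problem, here is a minimal Deployment manifest of the kind we wrote once and never touched again (all names and the image URI are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nextjs-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nextjs-app
  template:
    metadata:
      labels:
        app: nextjs-app
    spec:
      containers:
        - name: nextjs-app
          # This field is what kubectl set image patches on the live object
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/nextjs-app:abc123
```

Running kubectl set image deployment/nextjs-app nextjs-app=&lt;new-image&gt; updates only the object in the cluster, so the manifest checked into the repository silently drifts from what is actually running.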
One positive aspect of my initial experience with Kubernetes was the use of eksctl, a tool by Weaveworks, which saved us time and ensured the correct initial setup and configuration of our Kubernetes clusters. I discussed this tool in Setting Up AWS Elastic Kubernetes Service (EKS) and Deploying.
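eksctl can provision a cluster from a single command or, better, a declarative config file. A hypothetical ClusterConfig along these lines (the name, region, and node sizes are assumptions):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: staging
  region: us-east-1
nodeGroups:
  - name: workers
    instanceType: m5.large
    desiredCapacity: 3
```

Running eksctl create cluster -f cluster.yaml then provisions the EKS control plane, the node group, and the supporting VPC wiring in one step.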
After my initial Kubernetes journey, I joined a very early-stage startup where I was the sole engineer. In hindsight, I made the naive decision to introduce Kubernetes as our initial method for running applications in the cloud. Considering the scale and requirements of the startup, starting with something simpler like Elastic Container Service would have sufficed, as we only had a few microservices.
Fortunately, we enlisted the help of a software consultancy company to augment our team and expedite product development. The consultancy’s DevOps department had more expertise in Kubernetes and assisted in setting up our initial CI/CD pipelines, imparting valuable knowledge along the way.
Firstly, they guided us in establishing a more appropriate Kubernetes architecture using well-scoped namespaces to separate environments, such as staging and production. While further fault tolerance and isolation could have been achieved by separating environments into clusters, it wasn’t necessary given our small scale.
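Separating environments by namespace is straightforward; a minimal sketch:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging
---
apiVersion: v1
kind: Namespace
metadata:
  name: production
```

The same application manifests can then be applied per environment with kubectl apply -n staging (or -n production), and resource quotas and RBAC can be scoped per namespace as needed.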
The next significant learning point revolved around leveraging Kubernetes resource templates. Instead of using the kubectl set image command to update deployments, the consultancy introduced Kubernetes templates that our CI/CD pipeline would update and apply to deploy changes. They used a simple tool called sed to replace values in the templates. However, we could have further improved this process by utilizing open-source tools specifically designed for managing Kubernetes templates. Helm, a popular project for managing Kubernetes packages and templating Kubernetes resources, could have been an excellent choice.
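The sed-based flow can be sketched as a small shell script; the placeholder convention and image name here are assumptions, not the consultancy's exact setup:

```shell
set -eu

# A minimal deployment template with a placeholder for the image
cat > deployment.tmpl.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  template:
    spec:
      containers:
        - name: web
          image: {{IMAGE}}
EOF

# In CI, the image reference would come from the build step
IMAGE="example.com/web:abc123"

# Substitute the placeholder and emit the final manifest
sed "s|{{IMAGE}}|$IMAGE|g" deployment.tmpl.yaml > deployment.yaml
cat deployment.yaml
```

Helm replaces the sed step with helm template or helm upgrade driven by a values file, adding defaults, chart versioning, and release tracking on top of plain substitution.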
Lastly, the consultancy provided guidance on setting up open-source observability tools on Kubernetes. We implemented Prometheus and Grafana to monitor application health metrics through a dashboard. For logging, we evaluated both Loki and Graylog and ultimately chose Graylog due to Loki's limited functionality at the time. It's worth noting that Loki has since made significant advancements and may now serve as a suitable logging tool, especially given its tight integration with Grafana.
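As one example of how this fits together, a common (convention-based, not built-in) pattern is to have Prometheus discover scrape targets via pod annotations, assuming its Kubernetes service-discovery config is set up to honor them:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
```

With relabeling rules keyed on these annotations, any pod exposing a metrics endpoint gets picked up automatically, and Grafana dashboards can be built on top of the resulting metrics.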
In general, by seeking guidance from experts, I was able to avoid many issues during this phase. However, there was still much to learn, additional tools to explore, and room for further improvements.
Lastly, I want to share my recent learnings from working at my current company, Brex. When I joined in January 2020, Brex already had a Foundation organization consisting of multiple teams responsible for maintaining our build and release process, as well as our cloud infrastructure, including multiple Kubernetes clusters. In January 2021, I transitioned into this organization and gained invaluable knowledge from my experienced colleagues, particularly regarding software releases and Kubernetes cluster management.
Initially, we used Helm to template and create new releases for each microservice’s Kubernetes resources. We would then utilize our custom deployment manager to apply updated Helm charts to the cluster and roll out changes. However, we eventually transitioned to a GitOps approach, leveraging the Flux open-source project. GitOps allows us to easily track and reproduce the cluster’s current state by maintaining all Kubernetes resources in a Git repository. This repository serves as the single source of truth for our Kubernetes cluster, and Flux operators continuously monitor the repository for updates, automatically reconciling and applying changes to the cluster. This approach provides better visibility into the expected state of the cluster and facilitates troubleshooting any discrepancies between the GitOps state and the cluster state. It’s worth noting that while we no longer rely on Helm for the GitOps approach, we still utilize its templating engine. Our CI/CD process generates finalized Kubernetes resources and Flux-compliant Kustomizations, which are then managed by another custom release management service.
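For context, a Flux (v2) Kustomization of the general shape involved looks like the sketch below; the repository name, path, and interval are hypothetical:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m
  sourceRef:
    kind: GitRepository
    name: cluster-config
  path: ./clusters/production
  prune: true
```

The Flux controllers poll the referenced Git repository on the configured interval, apply whatever is under the path, and prune resources that have been removed from Git, keeping the cluster converged on the repository's state.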
Once again, learning from knowledgeable teammates and collaborating on these systems has been immensely valuable. I’ve gained significant insights from our build platform, release platform, and cloud infrastructure teams, who continuously enhance the maturity of our clusters. The rapid growth of the company and its underlying infrastructure has contributed to my own professional growth and expanded expertise.
Reflecting on these learnings, if I were starting a new company, I would likely reconsider using Kubernetes initially, despite my current comfort with it. There are simpler services, like Amazon Elastic Container Service (ECS), for deploying containerized workloads on AWS. An important realization has been the cognitive overhead Kubernetes imposes on developers: most developers may not possess the knowledge or time to fully understand its intricacies. For smaller companies without the resources to abstract and maintain these systems, it can slow developers down or introduce unnecessary issues. However, as a company scales and has many software projects to deploy, Kubernetes becomes a suitable choice because investments in the platform (deployment tooling, observability, policy) can be leveraged across all applications.