Use NodeSelector/Affinity and Taints/Tolerations to Support Dedicated Nodes in a Multi-tenant Kubernetes Cluster

Depending on the organization's requirements, some applications may be considered critical and need to maintain a minimum uptime level. This post focuses on how to ensure that these critical applications get a dedicated group of nodes to run on, free of resource contention from noisy neighbor applications, in a multi-tenant Kubernetes cluster.
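As a minimal sketch of the pattern, the node is tainted and labeled, and the critical pod carries a matching toleration and nodeSelector (the key/value pair "dedicated=critical" and the node/pod names here are hypothetical):

```yaml
# Assumed one-time node setup:
#   kubectl taint nodes node-1 dedicated=critical:NoSchedule
#   kubectl label nodes node-1 dedicated=critical
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  tolerations:                 # allows this pod onto the tainted node
  - key: "dedicated"
    operator: "Equal"
    value: "critical"
    effect: "NoSchedule"
  nodeSelector:                # pins this pod to the labeled node
    dedicated: critical
  containers:
  - name: app
    image: nginx
```

The taint keeps other tenants off the dedicated nodes; the nodeSelector (or a node affinity rule) keeps the critical pods on them - both halves are needed.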

Dynamically Creating a Collection of Maps with Terraform 0.12 “for” Expression

The challenge was that there were a large number of possible resources to create, and not every environment needed the same number of resources. Initially, I considered defining all possible variables upfront and using some internal logic to enable/disable them in the various accounts - but that would be unwieldy. Then I discovered the for expression in Terraform 0.12.
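As a hedged illustration of the technique (the `environments` variable and bucket naming are made up for this sketch), a for expression can derive a per-environment collection of maps from a single input variable:

```hcl
variable "environments" {
  type = map(object({ bucket_count = number }))
  default = {
    dev  = { bucket_count = 1 }
    prod = { bucket_count = 3 }
  }
}

locals {
  # For each environment, build a list of bucket names sized per its config.
  buckets = { for env, cfg in var.environments :
    env => [for i in range(cfg.bucket_count) : "${env}-bucket-${i}"]
  }
}
```

Each environment then declares only its own inputs, and the resource counts follow from the data rather than from hand-maintained variables.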

How I Root Caused a CPU Bottleneck in an RDS Database

An application was behaving very sluggishly, and I decided to take a look to identify and fix the cause. The problem was narrowed down to the RDS database taking a long time to respond to requests, and the mitigation the team decided on was to "get a bigger instance." However, my analytical mind wanted to really understand the root of the problem, to ensure that throwing more compute at it would actually solve it.
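A typical starting point for this kind of investigation - assuming a PostgreSQL engine with the pg_stat_statements extension enabled, which the post does not confirm - is to rank queries by cumulative execution time:

```sql
-- Hypothetical sketch: top CPU consumers by total execution time.
-- (On PostgreSQL 13+, the column is total_exec_time instead of total_time.)
SELECT query,
       calls,
       total_time,                       -- cumulative time in ms
       total_time / calls AS mean_time   -- average time per call
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```

If one or two statements dominate, query tuning may fix the bottleneck outright; if load is spread evenly, a bigger instance is more likely to be the right call.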

Backup Before Modifying a Production AWS RDS Database Managed by Terraform

Periodic changes to production cloud resources should be expected, as the cloud offers elasticity to scale in/out with demand. Although some changes are riskier than others, the AWS RDS processes for applying (and rolling back) these changes have been battle-tested. Despite this, it is always good for organizations to have their own backup and restore strategies before riskier changes are applied - after all, the data does belong to them. In this post, I'll propose several methods to back up production AWS RDS databases that are managed via Terraform, as well as their considerations.
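One such method, sketched here with hypothetical resource and snapshot names, is to declare a manual snapshot in Terraform so it is taken before the risky change is applied:

```hcl
# Hypothetical sketch: manual pre-change snapshot of an existing instance.
resource "aws_db_snapshot" "pre_change" {
  db_instance_identifier = aws_db_instance.prod.id
  db_snapshot_identifier = "prod-pre-change-snapshot"
}
```

Setting `final_snapshot_identifier` on the `aws_db_instance` resource itself is a complementary safeguard, covering the case where the instance is destroyed rather than modified.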

Prometheus Operator – Interactions Between the kube-prometheus-stack Kubernetes Resources

The aim of Prometheus Operator is to provide Kubernetes-native deployment and management of Prometheus and related monitoring components. The kube-prometheus-stack helm chart (formerly named prometheus-operator) needs just one helm command to set everything up. However, it leaves out specific details about the underlying implementation. In this post, I'll take a deeper look at what happens under the hood when the kube-prometheus-stack helm chart is installed in a Kubernetes cluster.
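For reference, the install looks like this (the release name "monitoring" is an arbitrary choice; the repo URL is the prometheus-community chart repository):

```shell
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack

# Inspect the custom resources the operator manages after install:
kubectl get prometheus,alertmanager,servicemonitor --all-namespaces
```

The interesting part is the last command: the chart installs CRDs such as Prometheus, Alertmanager, and ServiceMonitor, and the operator reconciles those into the actual StatefulSets, Secrets, and scrape configs.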

Kubernetes Default RBAC ClusterRole Resource Permissions

Kubernetes has several methods to authorize requests to the API server, namely Node, Attribute-based access control (ABAC), Role-based access control (RBAC), and Webhook. While reading the RBAC documentation on Default ClusterRoles, I found the descriptions vague - probably generalized by the author(s) so as to remain relevant across the various Kubernetes versions. However, I wanted a quick reference guide on the exact resources and permissions each of these roles has (e.g. for the "pod" resource, the "edit" ClusterRole has X, Y and Z permissions). Hopefully the following list helps others who are looking for something similar.
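The list below was assembled by inspecting the live rules directly, which anyone can reproduce against their own cluster version:

```shell
# Human-readable table of resources and verbs for a default ClusterRole:
kubectl describe clusterrole edit

# Or dump the full rule set as YAML for the other defaults too:
kubectl get clusterrole view admin edit cluster-admin -o yaml
```

Querying the cluster is the authoritative answer for a given version, since the default roles can gain or lose rules between Kubernetes releases.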

How to ssh into Containers in AWS EKS

I was experimenting with how I could expose applications in AWS Elastic Kubernetes Service (EKS) via Kubernetes Service resources and AWS load balancers. Out of curiosity, I also wanted to know if I could ssh into containers in EKS without using "kubectl exec" or any container runtime commands (e.g. "docker attach"). One scenario would be when I need to access the container's filesystem to extract a log/config file, but 1) I do not have the EKS cluster admin role for more permissive actions, and 2) the kubectl environment is exposed via a structured CI/CD pipeline and is non-interactive. As I could not find any concrete examples/tutorials, here are my implementation setup and steps.
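The core of the setup can be sketched as a Service of type LoadBalancer fronting a pod that runs an ssh daemon (the names, label, and port here are illustrative assumptions, and the target container image must actually run sshd):

```yaml
# Hypothetical sketch: expose a container's sshd via an AWS load balancer.
apiVersion: v1
kind: Service
metadata:
  name: ssh-service
spec:
  type: LoadBalancer     # EKS provisions an AWS load balancer for this
  selector:
    app: ssh-enabled-app # must match the pod's labels
  ports:
  - port: 22
    targetPort: 22
```

With this in place, `ssh user@<load-balancer-hostname>` reaches the container directly, bypassing kubectl entirely.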