Written by Lao Yu.
Background
Kubernetes is a platform for managing complex workloads across a large number of nodes, bin packing pods onto nodes for cost savings and ease of management. Over the past years, Kubernetes has proven itself a stable and reliable platform for this job.
But where do the nodes come from? Before Kubernetes schedules a pod onto a node, the node has to be available in the cluster. Given that node provisioning is cloud-dependent, there is no in-tree node provisioning available for EKS. Thus, how do we make nodes available to EKS in the first place?
In this post, we’ll first examine the history of node provisioning on EKS, before explaining why we’re so excited about Karpenter, which has been adopted as a CNCF project.
A brief history of node provisioning on EKS
Let’s look back into the past and see how nodes were provisioned on EKS previously.
Managed node group with auto-scaling group
The EKS managed node group is the first generation of construct AWS provided to manage nodes. It provisions nodes and handles their lifecycle. However, it does not scale nodes automatically.
Cluster Autoscaler
The next tool is the Cluster Autoscaler. It runs inside the cluster, monitors pending pods, and scales the EC2 auto-scaling group to spin up new nodes. Then, the Kubernetes scheduler decides which node each pod should run on.
However, it has a few issues:
Speed. The Cluster Autoscaler is slow, as it responds only after pods become pending.
It gets even slower when there are thousands of nodes; it can take minutes before a node is spun up.
To counter that, a common pattern is to run the cluster-overprovisioner to proactively spin up spare nodes, which adds significant waste.
Also, if spot instances are involved, a separate component, the node termination handler, has to be introduced to handle spot termination events.
We have watched Kubernetes clusters grow steadily, and clusters with hundreds or even thousands of nodes are now common. The cost of those underutilized nodes has increased significantly over the years, and there is a constant struggle to balance the availability of nodes against cost-saving measures.
While all the tools mentioned above have worked previously, there is a disconnect between how pods and nodes are managed: pods are managed by the scheduler, while nodes are managed by various other tools. This disconnect between the scheduler and the node provisioner causes endless headaches for Kubernetes users, e.g. slow node provisioning and underutilized nodes that are never proactively removed.
What is Karpenter?
How can we go about managing pods and nodes in a more efficient way? How about combining the pod scheduler and node provisioner together? Enter Karpenter.
The third generation of node scaling is Karpenter. Karpenter has the following functionality:[1]
Watching for pods that the Kubernetes scheduler has marked as unschedulable
Evaluating scheduling constraints (resource requests, node selectors, affinities, tolerations, and topology spread constraints) requested by the pods (see the example after this list)
Provisioning nodes that meet the requirements of the pods
Disrupting the nodes when the nodes are no longer needed
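For illustration, here is a hypothetical Deployment (all names and images are made up for this sketch) that expresses the kinds of constraints Karpenter evaluates: resource requests, a node selector, a toleration, and a topology spread constraint.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app                  # hypothetical workload, for illustration only
spec:
  replicas: 3
  selector:
    matchLabels:
      app: sample-app
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      containers:
        - name: app
          image: public.ecr.aws/docker/library/nginx:stable
          resources:
            requests:
              cpu: "1"              # resource requests drive instance sizing
              memory: 2Gi
      nodeSelector:
        kubernetes.io/arch: amd64
      tolerations:
        - key: example.com/dedicated   # assumed taint, matching a dedicated NodePool
          operator: Exists
          effect: NoSchedule
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: sample-app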
The essential utility of Karpenter is combining the pod scheduler and the node provisioner, streamlining the process of scheduling pods and provisioning nodes. Provisioning speed is significantly improved: if a new pod is created that cannot be scheduled, Karpenter will pick it up immediately and spin up a new node. With the Bottlerocket AMI (a minimal OS image purpose-built for running containers), nodes spin up even faster, reducing the overall latency. From our experience, starting a pod on a brand-new node takes less than one minute, a significant improvement over the previous tooling.
By considering pod and node together, Karpenter also saves costs by consolidating pods into fewer nodes when a particular node is underutilized. Experience tells us this feature saves about half of the nodes after migrating to Karpenter, significantly reducing EC2 cost.
Karpenter Concepts
To create nodes in Karpenter, we must first create EC2NodeClass and NodePool Kubernetes resources.
An EC2NodeClass is similar to the launch template of an auto-scaling group. It defines the AMI, subnets, security groups, instance profile, block device settings, etc.
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default-nodeclass
spec:
  amiFamily: AL2
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        volumeSize: 50Gi
        volumeType: gp3
  role: eks-node-role
  securityGroupSelectorTerms:
    - tags:
        Name: eks-NodeGroup
  subnetSelectorTerms:
    - tags:
        Name: eks-vpc-private-*
A NodePool defines the instance families, instance sizes, billing model (on-demand or spot), etc. It is similar to an auto-scaling group, but it also carries Kubernetes-specific information relevant to pod scheduling, including taints, node expiry times, and consolidation policies.
Each of the requirements in NodePool will be added as a label onto the provisioned nodes, allowing the information to be consumed when managing individual nodes.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default-1
spec:
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 360h0m0s
  limits:
    cpu: "2400"
  template:
    metadata:
      labels:
        creator: karpenter
        eks.amazonaws.com/capacityType: ON_DEMAND
    spec:
      nodeClassRef:
        name: default-nodeclass
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
            - c
            - m
            - r
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values:
            - "6"
        - key: karpenter.k8s.aws/instance-cpu
          operator: In
          values:
            - "8"
            - "16"
            - "32"
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: topology.kubernetes.io/zone
          operator: In
          values:
            - ap-southeast-2a
            - ap-southeast-2b
            - ap-southeast-2c
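Those node labels can then be targeted from workloads. As a minimal sketch (the pod name and image are only for illustration), a pod can pin itself to on-demand Karpenter capacity via a nodeSelector:

apiVersion: v1
kind: Pod
metadata:
  name: on-demand-only                       # hypothetical pod, for illustration
spec:
  containers:
    - name: app
      image: public.ecr.aws/docker/library/nginx:stable
  nodeSelector:
    creator: karpenter                       # label set in the NodePool template above
    karpenter.sh/capacity-type: on-demand    # requirement exposed as a node label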
Karpenter Workflow
With the above concepts defined, let’s look into how Karpenter works.
The first function of Karpenter is to schedule a new pod and provision a new node if required.
First, Karpenter looks for pods waiting to be scheduled and checks if there is a node available for them, considering all the constraints, e.g. node selector, taint, toleration, pod topology spread, etc.
If no node is immediately available, Karpenter will use the batching window to try bin-packing pods created within the window (by default, up to 10 seconds if consecutive pods are created with a less than 1-second gap). Once the batching window passes, Karpenter will decide how many nodes to launch.
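The batching window is tunable through the Karpenter Helm chart. A minimal sketch of the relevant values, assuming a recent v1beta1 chart (the values shown are the documented defaults):

# values.yaml (Karpenter Helm chart) - batching window settings
settings:
  batchMaxDuration: 10s   # maximum time to collect pods into one provisioning decision
  batchIdleDuration: 1s   # the window keeps extending while new pods arrive within this gap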
Karpenter then decides which instance to launch. It first discovers all the available instance types and sorts them by cost, combining information about pod demand with the resources each instance type offers. With that information, Karpenter launches the instance via the EC2 Fleet API, which also helps minimize possible disruptions.
Karpenter continuously monitors the utilization of nodes:
If a node is underutilized and its pods aren’t blocked from rescheduling, Karpenter will proactively drain and terminate the node and let the displaced pods schedule onto other underutilized nodes, improving the overall utilization and reducing cost. From our experience, savings on large clusters can reach up to 50% as a result of this consolidation.
Each NodePool has an expiration time, indicating when a node should be recycled. This is particularly helpful if there are compliance requirements, e.g. patching or AMI availability requirements.
Karpenter will also monitor drift and disruption of nodes:
Drift happens in scenarios such as when a new AMI is available for an AMI family or when the NodePool/EC2NodeClass configuration changes. Karpenter will keep monitoring the drift between nodes and configuration, and replace the node if needed.
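If you prefer drift-driven replacement to happen on your own schedule rather than whenever a new AMI is published, one option is to pin the AMI explicitly in the EC2NodeClass via amiSelectorTerms. A sketch, mirroring the earlier example; the resource name and AMI ID are placeholders:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: pinned-ami-nodeclass          # hypothetical name
spec:
  amiFamily: AL2
  amiSelectorTerms:
    - id: ami-0123456789abcdef0       # placeholder AMI ID; bump it deliberately to roll nodes
  role: eks-node-role
  securityGroupSelectorTerms:
    - tags:
        Name: eks-NodeGroup
  subnetSelectorTerms:
    - tags:
        Name: eks-vpc-private-*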
Similarly, if an instance is disrupted, e.g. through spot or on-demand instance termination, Karpenter will monitor the SQS queue which receives the EC2 events, and act accordingly by provisioning new nodes and rescheduling the displaced pods.
To put it in a single diagram, it would look like this:
Karpenter Setup
In production, Karpenter is usually set up with a Helm chart, along with an IAM role, disruption event rule, and SQS queue.
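As a rough sketch of how those pieces meet the Helm chart (the account ID, role name, cluster name, and queue name below are placeholders):

# values.yaml (Karpenter Helm chart) - wiring the controller to its IAM role and SQS queue
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/karpenter-controller  # IRSA role (placeholder)
settings:
  clusterName: my-eks-cluster            # placeholder cluster name
  interruptionQueue: karpenter-events    # SQS queue that receives the EC2 disruption events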
Karpenter’s setup requires special care: it is usually the first pod to run on the EKS cluster, and it has to spin up while no other nodes and no CoreDNS are available.
One way to achieve this is to create a managed node group specifically for the Karpenter controller pods. However, this introduces the extra maintenance task of managing node groups, which defeats the purpose of introducing Karpenter.
Running Karpenter pods on EKS Fargate makes more sense. With Fargate, Karpenter controller pods can be set up without a single node in the cluster, making it ideal for bootstrapping workloads like Karpenter.
One caveat concerns CoreDNS, which will not be running before Karpenter. The solution is to set Karpenter’s DNS policy to “Default,” so the pod uses the underlying host’s DNS resolution when reaching out to the AWS EC2 API endpoints. This works on EKS Fargate as well.
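With the official Helm chart this is a single value (a sketch, assuming a recent chart version):

# values.yaml (Karpenter Helm chart) - resolve DNS via the host instead of CoreDNS
dnsPolicy: Default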
Another caveat concerns deploying Karpenter itself: because it is the first workload up and running on the cluster, ArgoCD/Flux usually isn’t available yet. Thus, Karpenter has to be deployed outside of the GitOps tooling, since ArgoCD/Flux itself depends on nodes already existing.
Monitoring of Karpenter
Karpenter exposes many metrics for scraping, which are helpful for gaining insight into its state, and dashboards can be built on top of them.
A few metrics we’re particularly interested in, based on our experience:
Pod startup latency - We adopted Karpenter for speed, so this is the first thing we want to monitor.
Nodes per NodePool - This keeps track of the number of nodes provisioned by Karpenter and flags any drift where more and more nodes are created unexpectedly.
Node utilization - Ideally we want each provisioned node to be fully utilized, and this can be tracked as well.
The abundance of metrics provided by Karpenter enables us to look deep into its behaviour, thus allowing us to fine tune it for different workloads.
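If the Prometheus Operator runs in the cluster, scraping these metrics can be enabled straight from the Karpenter Helm chart. A sketch, assuming your Prometheus instance selects ServiceMonitors by a release label:

# values.yaml (Karpenter Helm chart) - expose the metrics endpoint to the Prometheus Operator
serviceMonitor:
  enabled: true
  additionalLabels:
    release: prometheus   # assumed label that your Prometheus instance selects on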
Karpenter Best Practices
From our day-to-day operations, there are a few best practices we follow to maximize the benefit of Karpenter’s features:
Create spot node pools alongside on-demand node pools, then use capacity spread along with pod topology spread to make efficient use of spot instances together with on-demand nodes (see the sketch after this list).
Critical workloads such as Istio ingress controllers should have their own node pool. This is to make sure there won’t be any noisy neighbors on the same node.
If x86_64 and ARM nodes co-exist in the same cluster, then set them up on separate node pools with taints, allowing applications to choose their desired architecture.
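A sketch of the capacity-spread pattern mentioned above (the label key, pool name, and ratios are our own convention, not Karpenter built-ins): each NodePool carries a slice of a custom label, and workloads spread across that key so a spot interruption cannot take out every replica at once.

# Spot NodePool carrying one slice of the custom capacity-spread label;
# the on-demand pool(s) would carry the remaining values ("2", "3", ...).
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default-spot                # hypothetical name
spec:
  template:
    metadata:
      labels:
        capacity-spread: "1"        # custom label, not a Karpenter built-in
    spec:
      nodeClassRef:
        name: default-nodeclass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - spot
---
# Pod spec fragment (under spec.template.spec of a Deployment): spread replicas
# across the capacity-spread values so some always land on on-demand capacity.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: capacity-spread
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: sample-app             # hypothetical app label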
From the operational perspective:
Karpenter is still early in its development, and the development team releases updates frequently. Try to keep up with them rather than waiting for big bang updates.
Karpenter does not publish reference IAM policies with every version update. Thus, it is important to test out the functionality of Karpenter before promoting it to production.
If necessary, set up a FlowSchema to avoid API server throttling during pod scheduling, as sketched below.
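A sketch of such a FlowSchema (the Karpenter namespace and the chosen priority level are assumptions; adjust them to your cluster):

apiVersion: flowcontrol.apiserver.k8s.io/v1beta3   # use flowcontrol.apiserver.k8s.io/v1 on Kubernetes 1.29+
kind: FlowSchema
metadata:
  name: karpenter
spec:
  priorityLevelConfiguration:
    name: workload-high            # one of the built-in priority levels
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: karpenter
            namespace: karpenter   # assumed namespace of the Karpenter controller
      resourceRules:
        - verbs: ["*"]
          apiGroups: ["*"]
          resources: ["*"]
          clusterScope: true
          namespaces: ["*"]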
Final thoughts
Although it takes significant engineering effort to implement initially, Karpenter drastically simplifies the operation of Kubernetes by unifying pod scheduling and node provisioning. This can bring big cost savings, in terms of both raw compute and administrative overhead.
If you’re looking for any help with Kubernetes, reach out to us! We’ve helped multiple customers adopt Karpenter successfully and set up various Kubernetes platforms. We look forward to helping you build a scalable and cost-efficient platform with our expertise.