Well-Architected SaaS Infrastructure on the EKS Platform

Written By Praveen Kumar Patidar

Introduction

Companies building SaaS platforms often encounter unpredictable usage patterns and fluctuating demand, making it challenging to allocate resources efficiently. Without a clear understanding of peak loads and user activity trends, they risk overprovisioning, which leads to unnecessary costs, or underprovisioning, which can degrade performance and user experience. To remain competitive, it is crucial to implement cost control measures while ensuring a seamless and high-quality service. Achieving this requires a well-architected infrastructure that dynamically scales, optimises resources, and provides granular monitoring to maintain service reliability, security, and performance at all times.

Challenges in SaaS Platform Infrastructure

SaaS platforms must efficiently scale their infrastructure while managing costs. The primary challenges include, but are not limited to:

  • Scalability – Handling unpredictable demand without overprovisioning resources.

  • High Availability – Ensuring workloads remain available even during failures.

  • Security & Compliance – Protecting multi-tenant environments from unauthorised access.

  • Cost Optimization – Using the right mix of instance types and autoscaling to reduce costs.

  • Infrastructure Automation – Reducing operational overhead with dynamic provisioning.

How This Blog Helps

In this blog, we will explore how AWS EKS, together with Karpenter and Kubernetes Event-driven Autoscaling (KEDA), can address these challenges through:

  • Namespace Isolation using Network Policies

  • Dynamic Node Provisioning with Karpenter (NodeClass & NodePool)

  • Multi-AZ NodePools for High Availability

  • Autoscaling with KEDA and Custom Metrics (CloudWatch, SQS)

  • Security Best Practices using Hardened Images and Security Groups

These strategies enable SaaS platforms to create a scalable, secure, highly available, and cost-effective cloud-native infrastructure.

Isolation of Workloads in EKS Using Network Policies

Logical isolation of workloads can be achieved using Network Policies. This isolation ensures that workloads running for different clients (e.g., client1, client2) cannot communicate unless explicitly allowed.

Restricting Communication Between Namespaces Example

Kubernetes supports Network Policies, enabling detailed control over traffic flow within a cluster. For SaaS platforms, enforcing these policies is essential to ensure logical isolation between different tenants, workloads, or environments, preventing unauthorised communication and improving security.

Kubernetes does not implement Network Policies natively; it relies on Container Network Interface (CNI) plugins to enforce them. Popular CNI solutions such as Calico, Weave Net, and Cilium provide the required networking capabilities. In AWS EKS, networking is managed through EKS Add-Ons, including VPC-CNI for pod networking and CoreDNS for service discovery; recent versions of the VPC-CNI add-on can also enforce Network Policies natively when the feature is enabled. This setup allows for seamless integration and effective policy enforcement.
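
If the VPC-CNI add-on is used for policy enforcement, the feature must be enabled explicitly in the add-on configuration. Below is a minimal sketch using an eksctl ClusterConfig fragment, assuming the cluster is managed with eksctl and that enableNetworkPolicy is the relevant configuration key (cluster name and region are placeholders; verify against the current EKS documentation):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # illustrative cluster name
  region: us-east-1       # illustrative region
addons:
  - name: vpc-cni
    # turn on the CNI's network policy enforcement
    configurationValues: |-
      enableNetworkPolicy: "true"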

Manage networking add-ons for Amazon EKS clusters - Amazon EKS

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-except-allowed
  namespace: client1-ns
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              clientName: client1


This policy ensures that only pods within the client1-ns namespace can communicate with client1 workloads. The network policies can be further detailed to allow communication from a common or admin namespace, ensuring that management workloads can connect for compliance and monitoring purposes, as sketched below.
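
Such an admin exception can be expressed as an additional policy. Here is a minimal sketch, assuming the shared admin/monitoring namespace carries an illustrative role: admin label:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-admin-namespace
  namespace: client1-ns
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              role: admin   # illustrative label on the shared admin/monitoring namespace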

Read More on Network Policies

Dedicated Infrastructure Using Karpenter

After implementing logical isolation through networking policies, the platform should also offer node-level physical isolation for different clients where necessary. The architecture must be designed carefully to prevent any single client, or the platform itself, from over-consuming shared capacity.

Karpenter is an open-source, high-performance Kubernetes cluster autoscaler that dynamically provisions nodes based on your workload requirements, enhancing efficiency and optimising costs.

The Karpenter controllers actively monitor incoming workload requests (pending pods) and, based on pre-defined NodePools, launch nodes from zero. For clients, this also provides the option to start at zero and scale on demand.

Karpenter NodeClass Example

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: on-demand-nodes
spec:
  amiFamily: AL2
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  role: "KarpenterNodeRole"

A NodeClass in Karpenter (the EC2NodeClass resource on AWS) defines the AWS-specific static properties of a node, such as Security Groups, IAM Roles, and AMI type. For example, the NodeClass in the previous example specifies the Amazon Linux 2 (AL2) AMI family along with security groups and subnets tagged for my-cluster. This serves as the foundational configuration for node provisioning.

On the other hand, a NodePool defines the logical properties of nodes, which are determined by the workload requirements. These include attributes like Instance Types, Tags, and Taints. A NodePool references a NodeClass to inherit its core node definition, allowing multiple NodePools to share a common infrastructure blueprint while customising their scaling behavior.

By leveraging NodeClass and NodePool together, SaaS platforms can design multi-tenant architectures that enforce workload isolation at different levels. This enables separation based on AMI types, security groups, IAM roles, and instance configurations, ensuring a secure, scalable, and cost-optimised infrastructure.

Karpenter NodePool Example with Taints and Tolerations

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: scalable-nodepool
spec:
  template:
    spec:
      nodeClassRef:
        name: on-demand-nodes
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["m5.large", "m5.xlarge"]
      taints:
        - key: "workload-type"
          value: "batch"
          effect: "NoSchedule"
  disruption:
    consolidationPolicy: WhenEmpty   # consolidateAfter applies once a node is empty
    consolidateAfter: 30s
  limits:
    cpu: 1000

  • This NodePool supports m5.large and m5.xlarge instances.

  • It uses taints, so only workloads with a matching toleration can be scheduled on these nodes (see the sketch after this list).

  • Empty nodes are consolidated after 30 seconds, keeping the fleet lean.
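
A workload targets this NodePool by tolerating the taint and, optionally, selecting the NodePool through its karpenter.sh/nodepool label. Below is a minimal sketch, assuming a hypothetical batch-worker Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker              # illustrative workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        karpenter.sh/nodepool: scalable-nodepool   # pin pods to this NodePool
      tolerations:
        - key: "workload-type"
          operator: "Equal"
          value: "batch"
          effect: "NoSchedule"                     # matches the NodePool taint
      containers:
        - name: worker
          image: public.ecr.aws/docker/library/busybox:stable   # illustrative image
          command: ["sh", "-c", "sleep 3600"]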

Multi-AZ NodePools for High Availability

To ensure high availability and balanced workload distribution for clients, Karpenter can dynamically manage NodePool configurations. For workloads requiring low-latency, high-speed interconnectivity, Karpenter can provision nodes within the same Availability Zone (AZ), minimising network overhead and optimising performance.

This approach enhances the scalability and resilience of a SaaS platform, allowing it to offer tiered subscription models. Clients can choose between high-availability (HA) deployments, where workloads are distributed across multiple AZs for fault tolerance, or non-HA deployments, optimised for cost efficiency while remaining within a single AZ. This flexibility enables SaaS providers to align infrastructure costs with customer needs while maintaining reliability and performance.
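
For the non-HA, latency-sensitive tier, a NodePool can pin all of its nodes to a single Availability Zone. Here is a minimal sketch, reusing the on-demand-nodes NodeClass from earlier (the zone value is illustrative):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: single-az-nodepool
spec:
  template:
    spec:
      nodeClassRef:
        name: on-demand-nodes
      requirements:
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-east-1a"]   # keep all nodes in one AZ for low-latency traffic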

The example below uses a NodePool spanning multiple Availability Zones (AZs), ensuring that workloads remain available even if an AZ fails.

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: multi-az-nodepool
spec:
  template:
    spec:
      nodeClassRef:
        name: on-demand-nodes
      requirements:
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]

This allows Karpenter to provision nodes in multiple AZs.
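
Provisioning nodes in multiple AZs is only half of the picture; the workload itself should ask to be spread across zones. Below is a brief sketch using a topology spread constraint, assuming a hypothetical my-app Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule   # force an even spread across AZs
          labelSelector:
            matchLabels:
              app: my-app
      containers:
        - name: app
          image: public.ecr.aws/nginx/nginx:stable   # illustrative container image

With this in place, Karpenter launches nodes in whichever zones are needed to satisfy the spread constraint.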

Optimising Cost with Spot Instances and Flexible Instance Selection

One of the most effective ways to reduce infrastructure costs in a SaaS platform is by leveraging Spot Instances. These instances can be up to 90% cheaper than On-Demand pricing, making them ideal for non-critical or fault-tolerant workloads like batch processing, CI/CD jobs, and AI/ML training. Karpenter dynamically provisions Spot Instances, ensuring workloads run efficiently while significantly lowering compute expenses.

To further optimise costs and performance, Karpenter allows defining multiple instance families based on workload needs. For example, compute-heavy applications can use c6a or c7g instances, while memory-intensive workloads run on r6g or r7g. This flexible approach ensures workloads are scheduled on the most cost-effective instances available at any given time, improving resource utilisation and cost efficiency.

Handling Spot Instance interruptions gracefully is crucial for maintaining availability. Karpenter integrates with Pod Disruption Budgets (PDBs) and uses Taints and Tolerations to schedule non-critical workloads on Spot Instances while keeping critical workloads on On-Demand nodes. Additionally, when Spot capacity is unavailable, Karpenter can automatically fall back to On-Demand or Reserved Instances, ensuring continued availability without manual intervention.
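
A PodDisruptionBudget caps how many replicas can be evicted at once when Spot nodes are reclaimed or consolidated. Here is a minimal sketch for the hypothetical my-app workload:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2          # keep at least 2 replicas running during node disruptions
  selector:
    matchLabels:
      app: my-app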

By combining Spot Instances, instance flexibility, and intelligent workload scheduling, SaaS providers can achieve a highly scalable, resilient, and cost-effective infrastructure while maintaining performance and reliability. 

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: cost-optimised-nodepool
spec:
  template:
    spec:
      nodeClassRef:
        name: on-demand-nodes   # reuse the NodeClass defined earlier
      requirements:
        - key: "karpenter.k8s.aws/instance-category"
          operator: In
          values: ["c", "m", "r"] # Allow compute, memory, and general-purpose instances
        - key: "karpenter.k8s.aws/capacity-type"
          operator: In
          values: ["spot", "on-demand"] # Prefer Spot, but allow On-Demand as fallback
  limits:
    cpu: "1000"
    memory: "2000Gi"
  disruption:
    consolidationPolicy: WhenEmpty   # consolidateAfter applies once a node is empty
    consolidateAfter: "30s"

Event-driven Autoscaling with KEDA

Kubernetes Event-driven Autoscaling (KEDA) is a lightweight component designed specifically for simplifying application autoscaling. It implements event-driven autoscaling to adjust the scale of your application based on demand, sustainably and cost-effectively, including the ability to scale down to zero.

KEDA allows scaling to be configured on external metrics, such as CloudWatch metrics and SQS message counts. Clients with uncertain usage patterns can take advantage of this custom pod scaling.

KEDA ScaledObject Example for SQS Queue Depth (via CloudWatch)

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: cloudwatch-scaledobject
  namespace: production
spec:
  scaleTargetRef:
    kind: Deployment
    name: my-app
  triggers:
    - type: aws-cloudwatch
      metadata:
        namespace: AWS/SQS
        metricName: ApproximateNumberOfMessagesVisible   # CloudWatch metric for SQS queue depth
        dimensionName: QueueName
        dimensionValue: my-queue
        awsRegion: us-east-1
        targetMetricValue: "10"
        minMetricValue: "0"

This automatically scales the deployment up and down as the visible-message count for the SQS queue (reported through CloudWatch) rises above or falls below the target of 10.
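
The aws-cloudwatch scaler also needs AWS credentials to query the metric. One option, sketched below under the assumption that the KEDA operator or workload already has an IAM role via IRSA, is a TriggerAuthentication using pod identity (resource names are illustrative; newer KEDA versions may name the provider differently):

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-aws-credentials
  namespace: production
spec:
  podIdentity:
    provider: aws-eks      # use the IAM role attached via IRSA

The trigger in the ScaledObject would then reference it through an authenticationRef with that name.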

Security & Compliance for Nodes

The earlier sections addressed security from the angle of isolating one client's workloads from another's. For additional requirements, such as compliance baselines and control over AWS endpoint exposure, Karpenter's node configuration can be used: hardened custom images, approved Security Groups, and specific subnet placement ensure that the nodes themselves meet the required security posture.

A SaaS platform can surface these security levels as part of its offerings.

Hardened Image (AMI) Selection for Compliance in NodeClass

spec:
  amiFamily: "Bottlerocket"

Using the Bottlerocket AMI family provides a minimal attack surface and helps meet security compliance requirements.

Applying Security Groups to NodeClass

spec:
  securityGroupSelectorTerms:
    - tags:
        Name: karpenter-secure-group

This ensures only approved security groups are used for nodes.
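
Putting these pieces together, a hardened node definition can combine the Bottlerocket AMI family, approved security groups, and encrypted root volumes. Below is a sketch under those assumptions (tag values, role name, and volume size are illustrative):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: hardened-nodes
spec:
  amiFamily: Bottlerocket
  role: "KarpenterNodeRole"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        Name: karpenter-secure-group
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        encrypted: true     # encrypt node root volumes for compliance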

Conclusion

Building a scalable, cost-effective, and resilient SaaS platform presents unique challenges, including unpredictable demand, cost optimisation, workload isolation, and high availability. This blog explored how Karpenter in Amazon EKS helps tackle these challenges by dynamically provisioning nodes based on workload needs, optimising infrastructure costs using Spot Instances, and ensuring seamless scaling with NodePools.

By implementing Network Policies and leveraging multi-AZ deployments, SaaS providers can enhance security, workload isolation, and fault tolerance. Additionally, integrating KEDA allows platforms to scale workloads efficiently based on custom metrics, ensuring resource utilisation aligns with demand. With these strategies, SaaS platforms can offer flexible, high-performance, and cost-optimised services while maintaining reliability and security.

04/14/2025