
Kubernetes Autoscaling: HPA VPA Cluster Autoscaler Guide

Master Kubernetes autoscaling with HPA, VPA, and Cluster Autoscaler. Complete guide with real-world examples, configuration best practices, troubleshooting tips, and cost optimization strategies for production workloads.

Introduction to Kubernetes Autoscaling: Matching Resources to Demand Automatically

Kubernetes autoscaling is the automated process of dynamically adjusting the compute resources allocated to your applications based on real-time demand metrics. Done well, it enables your infrastructure to:

  • Automatically scale up during traffic spikes, handling millions of additional requests without manual intervention
  • Scale down during low-traffic periods, reducing cloud costs by 40-70% without impacting performance
  • Maintain consistent application response times regardless of load variability
  • Eliminate capacity-planning guesswork and the manual scaling operations that waste engineering time
  • Ensure optimal resource utilization, preventing both under-provisioning that causes outages and over-provisioning that wastes thousands of dollars monthly on idle capacity

In modern cloud-native architectures running on Kubernetes, autoscaling is not a luxury optimization to implement “eventually when we have time”; it is a fundamental capability that directly impacts application reliability, operational costs, developer productivity, and competitive advantage in markets where user experience and infrastructure efficiency determine success or failure. Companies that implement effective autoscaling report 50-70% reductions in infrastructure costs, 99.9%+ uptime during unpredictable traffic surges, 80% less time spent on capacity planning and manual scaling, and the ability to absorb viral traffic spikes that would have caused complete outages with static capacity.

However, Kubernetes autoscaling is significantly more complex than simply "turning on autoscaling" with default settings and hoping for the best. Kubernetes provides three distinct autoscaling mechanisms that operate at different levels of infrastructure abstraction and serve different purposes: Horizontal Pod Autoscaler (HPA) scales the number of pod replicas running your application based on CPU, memory, or custom metrics; Vertical Pod Autoscaler (VPA) adjusts the CPU and memory resource requests and limits of individual pods; and Cluster Autoscaler adds or removes entire worker nodes from your cluster. Using these mechanisms effectively requires understanding what each autoscaler does, when to use each (or combinations of them), how to configure metrics and thresholds correctly, how to avoid configuration conflicts and scaling thrashing, and how to test autoscaling behavior before production deployment.

This comprehensive technical guide teaches you everything you need to know about implementing production-grade Kubernetes autoscaling successfully, covering:

  • Fundamental autoscaling concepts and when each autoscaler should be used
  • A complete HPA implementation guide with CPU, memory, and custom metrics
  • VPA configuration for automatic resource optimization
  • Cluster Autoscaler setup and node pool management
  • Best practices for combining multiple autoscalers safely
  • Common pitfalls and anti-patterns that break autoscaling
  • Advanced patterns like predictive autoscaling and KEDA event-driven scaling
  • Real-world architecture examples from production deployments
  • Monitoring and troubleshooting autoscaling decisions
  • How platforms like Atmosly simplify autoscaling: AI-powered recommendations that analyze your actual workload patterns to suggest optimal configurations, automatic detection of misconfigurations causing scaling failures or cost waste, integrated cost intelligence showing exactly how autoscaling changes affect your cloud bill in real time, and intelligent alerting when autoscaling isn't working as expected

By mastering the autoscaling strategies explained in this guide, you'll transform your Kubernetes infrastructure from static capacity that demands constant manual adjustment and chronic over-provisioning into dynamic elasticity that matches compute to actual demand: cutting cloud costs by 40-70% while improving reliability and performance, eliminating the hours of weekly capacity-planning work, absorbing unpredictable traffic spikes without midnight emergency responses, and giving you the operational efficiency needed to scale your business faster.

Understanding Kubernetes Autoscaling: Three Mechanisms, Different Purposes

Kubernetes provides three distinct autoscaling mechanisms that operate at different levels of your infrastructure stack. Understanding the differences, use cases, and interactions between these autoscalers is critical to implementing effective autoscaling:

Horizontal Pod Autoscaler (HPA): Scaling Pod Replica Count

What it does: HPA automatically increases or decreases the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed metrics like CPU utilization, memory usage, or custom application metrics.

When to use HPA:

  • Stateless applications where adding more pod replicas increases capacity linearly (web servers, API services, microservices)
  • Applications with variable traffic patterns experiencing daily, weekly, or event-driven load spikes
  • Services that benefit from horizontal scaling rather than vertical scaling (most modern cloud-native apps)
  • Workloads with well-defined scaling metrics like HTTP request rate, queue depth, or custom business metrics

How it works: HPA queries the Metrics Server (or custom metrics API) every 15 seconds by default, calculates the desired replica count based on target metric values, and adjusts the replica count of the target deployment. The basic formula is: desiredReplicas = ceil[currentReplicas * (currentMetricValue / targetMetricValue)]
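
For example, if 4 replicas are averaging 90% CPU against a 70% target, HPA computes ceil(4 × 90 / 70) = ceil(5.14) = 6 replicas; if utilization later falls to 35%, it computes ceil(4 × 35 / 70) = 2 replicas and scales down (subject to minReplicas and the scale-down stabilization window).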

Key configuration parameters:

  • minReplicas: Minimum number of replicas (prevents scaling to zero accidentally)
  • maxReplicas: Maximum number of replicas (cost safety limit)
  • metrics: List of metrics to scale on (CPU, memory, custom metrics)
  • behavior: Scaling velocity controls (how fast to scale up/down)

Example HPA manifest for CPU-based scaling:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # Scale when average CPU exceeds 70%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
      policies:
      - type: Percent
        value: 50  # Scale down maximum 50% of pods at once
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100  # Can double pod count at once
        periodSeconds: 15
      - type: Pods
        value: 5  # Or add 5 pods per period (selectPolicy: Max picks whichever allows more)
        periodSeconds: 15
      selectPolicy: Max  # Use the policy that scales fastest

Critical success factors for HPA:

  • Resource requests must be defined: HPA calculates CPU/memory utilization as a percentage of requests, so missing requests breaks HPA completely (see the snippet after this list)
  • Metrics Server must be installed: HPA requires Metrics Server for resource metrics (CPU/memory)
  • Applications must handle horizontal scaling: Stateful apps, apps with local caches, or apps expecting fixed replica counts may not work with HPA
  • Load balancing must distribute traffic evenly: Uneven traffic distribution causes some pods to hit limits while others idle
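
Because utilization is computed as a percentage of requests, the HPA target workload must declare them. A minimal container spec sketch (values are illustrative, not recommendations):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  template:
    spec:
      containers:
      - name: frontend
        image: frontend:latest  # Illustrative image reference
        resources:
          requests:
            cpu: 250m      # With a 70% HPA target, scaling triggers near 175m actual usage
            memory: 256Mi
          limits:
            cpu: 500m
            memory: 512Mi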

Vertical Pod Autoscaler (VPA): Right-Sizing Pod Resources

What it does: VPA automatically adjusts CPU and memory requests and limits for pods based on historical and current resource usage patterns, ensuring pods have sufficient resources without massive over-provisioning.

When to use VPA:

  • Applications with unpredictable resource requirements where setting fixed requests is difficult
  • Stateful applications that cannot scale horizontally (databases, caches, monoliths)
  • Continuous resource optimization automatically adjusting requests as application behavior changes over time
  • Initial sizing of new applications where you don't yet know optimal resource requests

How it works: VPA analyzes actual resource consumption over time (typically 8 days of history), calculates recommended resource requests using statistical models, and either provides recommendations or automatically updates pod resources by evicting and recreating pods with new values.

VPA operating modes:

  • "Off" mode: Generate recommendations only, no automatic changes (safest for testing)
  • "Initial" mode: Set resource requests only when pods are created, never update running pods
  • "Recreate" mode: Actively evict pods to update resources (causes brief downtime per pod)
  • "Auto" mode: VPA chooses between Initial and Recreate based on situation

Example VPA manifest for a database:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres
  updatePolicy:
    updateMode: "Recreate"  # Automatically update pods
  resourcePolicy:
    containerPolicies:
    - containerName: postgres
      minAllowed:
        cpu: 500m
        memory: 1Gi
      maxAllowed:
        cpu: 8000m
        memory: 32Gi
      controlledResources: ["cpu", "memory"]
      mode: Auto

Critical VPA limitations and considerations:

  • VPA and HPA conflict on CPU/memory metrics: Cannot use both on same metrics for same deployment (causes scaling battles)
  • VPA requires pod restarts: Updating resources requires pod recreation, causing brief unavailability; run multiple replicas behind a PodDisruptionBudget so evictions roll through gradually
  • VPA recommendations need time to stabilize: Requires 8+ days of data for accurate recommendations
  • VPA doesn't handle burst traffic well: Based on historical averages, may not provision for sudden spikes

Cluster Autoscaler: Adding and Removing Nodes

What it does: Cluster Autoscaler automatically adds worker nodes to your cluster when pods cannot be scheduled due to insufficient resources, and removes underutilized nodes to reduce costs.

When to use Cluster Autoscaler:

  • Cloud environments (AWS, GCP, Azure) where nodes can be provisioned dynamically
  • Variable cluster load where node count needs to change over time
  • Cost optimization removing idle nodes during low-traffic periods
  • Batch job workloads requiring burst capacity temporarily

How it works:

  1. Scale-up trigger: Cluster Autoscaler detects pods in Pending state due to insufficient node resources
  2. Node group selection: Evaluates configured node pools/groups to find best fit for pending pods
  3. Node provisioning: Requests new nodes from cloud provider (typically takes 1-3 minutes)
  4. Scale-down detection: Identifies nodes running below utilization threshold (default 50%) for 10+ minutes
  5. Safe eviction check: Ensures pods can be safely rescheduled elsewhere before removing node
  6. Node removal: Cordons node, drains pods gracefully, deletes node from cloud provider
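
You can observe these steps as they happen. Cluster Autoscaler records its view of each node group in a status ConfigMap (assuming the default ConfigMap name) and emits events on the pods it acts on:

# Current scale-up/scale-down state per node group
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml

# Scale-up decisions appear as TriggeredScaleUp events on pending pods
kubectl describe pod <pending-pod-name> | grep -A2 TriggeredScaleUp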

Example Cluster Autoscaler configuration for AWS EKS:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
        name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
        - --balance-similar-node-groups
        - --skip-nodes-with-system-pods=false
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --scale-down-utilization-threshold=0.5

Cluster Autoscaler best practices:

  • Use node pools with different instance types: General-purpose, compute-optimized, memory-optimized pools for different workloads
  • Set Pod Disruption Budgets (PDBs): Prevents Cluster Autoscaler from removing nodes hosting critical pods
  • Configure appropriate scale-down delay: Balance cost savings against scaling thrashing
  • Use expanders strategically: "least-waste" minimizes cost, "priority" gives control over node selection
  • Set cluster-autoscaler.kubernetes.io/safe-to-evict annotations: Control which pods block node scale-down

HPA Deep Dive: Advanced Horizontal Pod Autoscaling Patterns

Scaling on Multiple Metrics Simultaneously

Production applications rarely scale optimally on a single metric. HPA v2 supports multiple metrics with intelligent decision-making:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service
  minReplicas: 5
  maxReplicas: 100
  metrics:
  # Scale on CPU utilization
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Scale on memory utilization
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  # Scale on custom metric: HTTP requests per second
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"  # 1000 requests/second per pod

How HPA handles multiple metrics: HPA calculates the desired replica count for each metric independently, then applies the largest result, which is the most conservative choice for availability. This guarantees a scale-up whenever ANY metric crosses its threshold.
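
For example, with 10 current replicas, if the CPU metric alone would call for 8 replicas, memory for 6, and the request-rate metric for 12, HPA sets the deployment to 12.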

Custom Metrics Scaling for Business Logic

CPU and memory are infrastructure metrics, but scaling should often be based on actual business metrics: requests per second, queue depth, job processing rate, active connections, etc.

Implementing custom metrics scaling requires:

  1. Expose custom metrics from your application (typically via /metrics endpoint in Prometheus format)
  2. Deploy Prometheus Adapter or a similar custom metrics API server to make metrics available to HPA (a rule sketch follows this list)
  3. Create HPA referencing custom metrics
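
For step 2, a minimal Prometheus Adapter rule that turns a request counter into the http_requests_per_second pod metric might look like the sketch below; the http_requests_total series name and its labels are assumptions about your instrumentation:

rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^http_requests_total$"
    as: "http_requests_per_second"
  # Convert the cumulative counter into a per-second rate per pod
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'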

Example: Scaling based on SQS queue depth:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: External
    external:
      metric:
        name: sqs_queue_depth
        selector:
          matchLabels:
            queue_name: processing-queue
      target:
        type: AverageValue
        averageValue: "30"  # 30 messages per pod

This configuration maintains approximately 30 messages per pod. If queue depth is 300 and there are 5 pods, HPA scales to 10 pods (300 / 30 = 10).

Configuring Scaling Velocity and Stabilization

Default HPA behavior scales up and down aggressively, potentially causing scaling thrashing where pod count oscillates rapidly. The behavior section provides fine-grained control:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300  # Wait 5 minutes before scaling down
    policies:
    - type: Percent
      value: 25  # Scale down maximum 25% at once
      periodSeconds: 60
    - type: Pods
      value: 5  # Or remove 5 pods, whichever is smaller
      periodSeconds: 60
    selectPolicy: Min  # Use the slower (more conservative) policy
  scaleUp:
    stabilizationWindowSeconds: 0  # Scale up immediately
    policies:
    - type: Percent
      value: 100  # Can double pod count
      periodSeconds: 15
    - type: Pods
      value: 10  # Or add 10 pods
      periodSeconds: 15
    selectPolicy: Max  # Use the faster (more aggressive) policy

Stabilization window: HPA looks back over this time period and uses the highest recommended replica count (for scale-up) or lowest (for scale-down). This prevents rapid oscillations.

Policies: Define maximum scaling velocity as either percentage or absolute pod count. Multiple policies allow different behaviors at different scales.

selectPolicy:

  • Max: Use the policy that scales most aggressively (typically for scale-up)
  • Min: Use the policy that scales most conservatively (typically for scale-down)
  • Disabled: Disable scaling in this direction entirely

VPA Best Practices and Common Pitfalls

VPA Recommendation Analysis Before Enabling Auto Mode

Never enable VPA in "Recreate" or "Auto" mode immediately in production. Start with "Off" mode to analyze recommendations:
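
A recommendation-only VPA differs from an active one only in its updateMode; a minimal sketch to save as vpa-off-mode.yaml (the target name is illustrative):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Off"  # Generate recommendations only; never evict pods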

# Create VPA in recommendation-only mode
kubectl apply -f vpa-off-mode.yaml

# Wait 24-48 hours for initial recommendations

# Check VPA recommendations
kubectl describe vpa <vpa-name>

# Look at the output:
# Status:
#   Recommendation:
#     Container Recommendations:
#       Container Name:  app
#       Lower Bound:     # Minimum resources needed (very conservative)
#         Cpu:     100m
#         Memory:  256Mi
#       Target:          # Recommended values (most accurate)
#         Cpu:     500m
#         Memory:  1Gi
#       Uncapped Target: # Recommended without maxAllowed limits
#         Cpu:     750m
#         Memory:  1.5Gi
#       Upper Bound:     # Maximum resources that might be needed (conservative)
#         Cpu:     1000m
#         Memory:  2Gi

Interpreting VPA recommendations:

  • Lower Bound: Minimum resources needed during the lowest usage periods (usually too low to rely on in production)
  • Target: Sweet spot recommendation (use this for requests)
  • Upper Bound: Maximum resources needed during peak usage (consider for limits)
  • Uncapped Target: What VPA would recommend without your maxAllowed constraints
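
To extract just the Target values for scripting or review, a jsonpath query against the VPA status works (the container index assumes a single-container pod):

kubectl get vpa app-vpa -n production \
  -o jsonpath='{.status.recommendation.containerRecommendations[0].target}'
# Example output: map[cpu:500m memory:1Gi]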

Combining VPA and HPA Safely

VPA and HPA conflict if both try to scale based on CPU or memory metrics. Safe combination patterns:

Pattern 1: VPA for requests, HPA for replica count on different metrics

# VPA manages resource sizing
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: app
      controlledResources: ["cpu", "memory"]  # VPA manages both
---
# HPA scales replica count on custom metrics only
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second  # Custom metric, not CPU/memory
      target:
        type: AverageValue
        averageValue: "1000"

Pattern 2: VPA for CPU, HPA for memory (or vice versa)

# VPA manages only CPU
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: app
      controlledResources: ["cpu"]  # VPA only manages CPU
---
# HPA scales on memory only
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: memory  # HPA only scales on memory
      target:
        type: Utilization
        averageUtilization: 80

VPA UpdateMode Selection Guide

Use "Off" mode when:

  • Initially deploying VPA to understand recommendations before making changes
  • You want manual control over resource updates
  • Application cannot tolerate any pod restarts

Use "Initial" mode when:

  • Application scales horizontally frequently (HPA) and you want VPA to size new pods correctly
  • Cannot tolerate restarts of existing pods but want new pods sized correctly
  • Using VPA primarily for initial sizing of new applications

Use "Recreate" or "Auto" mode when:

  • Application handles pod restarts gracefully (has multiple replicas, reconnection logic)
  • Want continuous optimization of resource allocation over time
  • Cost optimization is critical and you're willing to accept brief interruptions

Cluster Autoscaler Configuration and Optimization

Node Group Strategy: Multiple Pools for Different Workloads

Production clusters should have multiple node groups optimized for different workload types:

Example multi-pool strategy:

# General-purpose node pool (most workloads)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: general-purpose
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]  # Prefer spot, fallback to on-demand
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["t3.medium", "t3.large", "t3.xlarge"]
  labels:
    workload-type: general
  taints:
  - key: workload-type
    value: general
    effect: NoSchedule
  limits:
    resources:
      cpu: 1000
      memory: 1000Gi
---
# Memory-optimized pool (databases, caches)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: memory-optimized
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["on-demand"]  # Always on-demand for stateful workloads
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["r5.xlarge", "r5.2xlarge", "r5.4xlarge"]
  labels:
    workload-type: memory-intensive
  taints:
  - key: workload-type
    value: memory-intensive
    effect: NoSchedule
  limits:
    resources:
      cpu: 500
      memory: 2000Gi
---
# Compute-optimized pool (CPU-intensive jobs)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: compute-optimized
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]  # Spot instances for batch jobs
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["c5.2xlarge", "c5.4xlarge", "c5.9xlarge"]
  labels:
    workload-type: compute-intensive
  taints:
  - key: workload-type
    value: compute-intensive
    effect: NoSchedule
  limits:
    resources:
      cpu: 1000
      memory: 500Gi

Workload-to-pool assignment using node selectors and tolerations:

# General web service → general-purpose pool
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  template:
    spec:
      nodeSelector:
        workload-type: general
      tolerations:
      - key: workload-type
        operator: Equal
        value: general
        effect: NoSchedule
---
# Redis cache → memory-optimized pool
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  template:
    spec:
      nodeSelector:
        workload-type: memory-intensive
      tolerations:
      - key: workload-type
        operator: Equal
        value: memory-intensive
        effect: NoSchedule
---
# Data processing job → compute-optimized pool
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing
spec:
  template:
    spec:
      nodeSelector:
        workload-type: compute-intensive
      tolerations:
      - key: workload-type
        operator: Equal
        value: compute-intensive
        effect: NoSchedule

Preventing Cluster Autoscaler from Removing Critical Nodes

Cluster Autoscaler respects several mechanisms to prevent inappropriate node removal:

1. Pod Disruption Budgets (PDBs):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: frontend-pdb
spec:
  minAvailable: 3  # Always keep at least 3 replicas running
  selector:
    matchLabels:
      app: frontend

2. Cluster Autoscaler annotations on pods:

apiVersion: v1
kind: Pod
metadata:
  name: critical-pod
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"  # Never evict this pod
spec:
  # ... pod spec

3. Node annotations to prevent scale-down:

# Prevent specific node from being scaled down
kubectl annotate node <node-name> cluster-autoscaler.kubernetes.io/scale-down-disabled="true"

4. System pods with restrictive PDBs: By default, Cluster Autoscaler won't remove nodes running non-DaemonSet kube-system pods unless --skip-nodes-with-system-pods=false is set

Optimizing Scale-Down Behavior

Scale-down is more complex than scale-up because removing nodes risks disrupting running workloads. Key configuration parameters:

--scale-down-enabled=true  # Enable scale-down (default true)
--scale-down-delay-after-add=10m  # Wait 10 min after scale-up before considering scale-down
--scale-down-delay-after-delete=10s  # Wait 10s after deleting node before next scale-down
--scale-down-delay-after-failure=3m  # Wait 3min after scale-down failure before retry
--scale-down-unneeded-time=10m  # Node must be unneeded for 10min before removal
--scale-down-utilization-threshold=0.5  # Node must be <50% utilized to be considered for removal
--max-node-provision-time=15m  # If node doesn't become ready in 15min, consider failed

Recommended production values:

  • scale-down-unneeded-time: 10-15 minutes (prevents thrashing during temporary load dips)
  • scale-down-utilization-threshold: 0.5-0.6 (50-60% - balance cost savings against stability)
  • scale-down-delay-after-add: 10-15 minutes (don't immediately remove recently added nodes)

Advanced Autoscaling Patterns and Strategies

Predictive Autoscaling Using KEDA and Cron-Based Scaling

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with event-driven scaling based on external event sources:

Example: Scale based on Azure Service Bus queue:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-processor-scaler
spec:
  scaleTargetRef:
    name: queue-processor
  minReplicaCount: 2
  maxReplicaCount: 100
  triggers:
  - type: azure-servicebus
    metadata:
      queueName: processing-queue
      namespace: production  # Service Bus namespace, not the Kubernetes namespace
      messageCount: "30"  # 30 messages per pod

Cron-based predictive scaling for known traffic patterns:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: business-hours-scaler
spec:
  scaleTargetRef:
    name: api-service
  minReplicaCount: 5
  maxReplicaCount: 50
  triggers:
  # Scale up Monday-Friday 8 AM (start of business day)
  - type: cron
    metadata:
      timezone: America/New_York
      start: 0 8 * * 1-5  # 8 AM Mon-Fri
      end: 0 18 * * 1-5   # 6 PM Mon-Fri
      desiredReplicas: "20"
  # Outside the cron window, CPU utilization drives scaling down toward minReplicaCount
  - type: cpu
    metricType: Utilization
    metadata:
      value: "70"

Zone-Aware Autoscaling for High Availability

Ensure autoscaling maintains high availability by distributing pods across multiple availability zones:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 6  # Initial count only; HPA takes over (consider omitting when applying declaratively)
  template:
    spec:
      topologySpreadConstraints:
      - maxSkew: 1  # Maximum difference of 1 pod between zones
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: frontend
      # Pod spec...
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  minReplicas: 6  # Minimum 2 per zone across 3 zones
  maxReplicas: 60  # Will spread across zones automatically

Cost-Optimized Autoscaling with Spot Instances

Combine Cluster Autoscaler with spot instances for compute savings of 70-90%:

# Node pool using spot instances with on-demand fallback
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: cost-optimized
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot", "on-demand"]  # Prefer spot, fallback on-demand
    weight: 100  # Prefer spot heavily
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["t3.medium", "t3a.medium", "t2.medium"]  # Multiple similar types
  labels:
    cost-optimized: "true"
  # Spread across multiple instance types and zones for spot availability
  ttlSecondsAfterEmpty: 30  # Remove empty nodes quickly
  ttlSecondsUntilExpired: 604800  # Recycle nodes weekly to pick up patched AMIs
---
# Deployment configured for spot instance interruptions
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "true"  # Allow eviction
    spec:
      nodeSelector:
        cost-optimized: "true"
      tolerations:
      - key: karpenter.sh/capacity-type
        operator: Equal
        value: spot
        effect: NoSchedule
      # Application must handle graceful shutdown on SIGTERM
      terminationGracePeriodSeconds: 30
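
Pairing the grace period with a preStop hook gives load balancers time to deregister the pod before SIGTERM reaches the process. A common container-spec fragment (the sleep length is an assumption; tune it to your deregistration delay):

      containers:
      - name: batch-processor
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 10"]  # Drain window before SIGTERM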

Monitoring and Troubleshooting Autoscaling

Key Metrics to Monitor for Autoscaling Health

HPA Metrics:

# Current replica count vs desired
kube_horizontalpodautoscaler_status_current_replicas{name="frontend-hpa"}
kube_horizontalpodautoscaler_status_desired_replicas{name="frontend-hpa"}

# Metric value that triggered scaling
kube_horizontalpodautoscaler_status_current_metrics_value

# HPA scaling events
rate(kube_horizontalpodautoscaler_scaling_events_total[5m])

# HPA unable to scale (at max replicas)
kube_horizontalpodautoscaler_status_condition{status="true",condition="ScalingLimited"}

VPA Metrics:

# VPA recommendation vs current usage
vpa_containerrecommendation_target{container="app"}
container_memory_working_set_bytes{container="app"}

# VPA updates performed
rate(vpa_recommender_recommendation_updates_total[5m])

Cluster Autoscaler Metrics:

# Unschedulable pods triggering scale-up
cluster_autoscaler_unschedulable_pods_count

# Nodes added/removed
rate(cluster_autoscaler_scaled_up_nodes_total[10m])
rate(cluster_autoscaler_scaled_down_nodes_total[10m])

# Failed scale operations
cluster_autoscaler_failed_scale_ups_total
cluster_autoscaler_errors_total

# Scale-up time duration
histogram_quantile(0.95, rate(cluster_autoscaler_scale_up_duration_seconds_bucket[10m]))

Common Autoscaling Problems and Solutions

Problem 1: HPA stuck at minimum replicas despite high load

Symptoms:

kubectl get hpa
# NAME          REFERENCE       TARGETS          MINPODS   MAXPODS   REPLICAS
# frontend-hpa  Deployment/app  /70%    3         50        3

Root causes and fixes:

  • Metrics Server not installed: kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
  • Missing resource requests: Add CPU/memory requests to pod spec
  • Metrics Server certificate errors: Check logs: kubectl logs -n kube-system deployment/metrics-server
  • Custom metrics API not configured: Deploy Prometheus Adapter or similar
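
The HPA's own status usually narrows this down quickly; the AbleToScale and ScalingActive conditions report why metrics can't be read (output abbreviated):

kubectl describe hpa frontend-hpa -n production
# Conditions:
#   Type           Status  Reason                   Message
#   AbleToScale    True    ReadyForNewScale         recommended size matches current size
#   ScalingActive  False   FailedGetResourceMetric  unable to get metrics for resource cpu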

Problem 2: VPA causing frequent pod restarts

Symptoms: Pods restarting every few hours, brief service interruptions

Root causes and fixes:

  • VPA in Recreate mode with unstable recommendations: Add recommendation headroom (the recommender's --recommendation-margin-fraction flag) so small usage fluctuations don't push pods outside the recommended bounds and trigger evictions
  • Insufficient stabilization time: VPA needs 8+ days of data for stable recommendations
  • Conflicting with HPA on same metrics: Use VPA for CPU, HPA for custom metrics only
  • Fix: Switch to "Initial" mode or increase update thresholds

Problem 3: Cluster Autoscaler not adding nodes despite pending pods

Symptoms:

kubectl get pods
# NAME                        READY   STATUS    RESTARTS   AGE
# frontend-abc123             0/1     Pending   0          10m

kubectl describe pod frontend-abc123
# Events:
#   Warning  FailedScheduling  pod didn't trigger scale-up (no node group can accommodate this pod)

Root causes and fixes:

  • Pod resource requests exceed largest node size: Reduce requests or add larger node pool
  • Node affinity/selector doesn't match any node pool: Fix nodeSelector or add matching node pool
  • Node pool at maximum size: Increase node group max size
  • Cluster Autoscaler not running: Check: kubectl get pods -n kube-system -l app=cluster-autoscaler
  • Cloud provider quota limits: Check cloud provider console for quota
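
When none of these is obvious from events, the autoscaler's logs state the exact reason each pending pod was skipped (the label selector matches the Deployment shown earlier in this guide):

kubectl logs -n kube-system -l app=cluster-autoscaler --tail=100 \
  | grep -iE "scale-up|skipped|no node group"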

Problem 4: Autoscaling thrashing (rapid scale up/down cycles)

Symptoms: Pod count oscillating rapidly, nodes added then removed quickly

Root causes and fixes:

  • Target metric threshold too sensitive: Increase metric target (e.g., 70% → 80% CPU)
  • Insufficient stabilization window: Increase HPA behavior.scaleDown.stabilizationWindowSeconds
  • Too aggressive scale-down: Increase Cluster Autoscaler --scale-down-unneeded-time
  • Uneven load distribution: Fix load balancer configuration or add affinity rules

How Atmosly Simplifies Kubernetes Autoscaling

While Kubernetes provides powerful autoscaling capabilities, configuring and maintaining them correctly requires deep expertise, continuous monitoring, and constant optimization. Atmosly simplifies Kubernetes autoscaling through AI-powered automation and intelligent recommendations:

AI-Powered Autoscaling Recommendations

Atmosly analyzes your actual workload patterns, resource utilization, and traffic profiles to automatically recommend optimal autoscaling configurations:

  • Optimal HPA thresholds: Based on your application's actual behavior, not generic defaults
  • Right-sized VPA limits: Analyzing historical usage to prevent over-provisioning
  • Node pool optimization: Recommending instance types and sizes that match your workload characteristics
  • Cost-performance trade-offs: Showing exactly how different autoscaling configurations impact both costs and reliability

Automatic Detection of Autoscaling Issues

Atmosly continuously monitors your autoscaling configuration and behavior, automatically detecting:

  • HPA misconfiguration: Missing metrics, incorrect thresholds, conflicting autoscalers
  • VPA causing instability: Excessive pod restarts, recommendation instability
  • Cluster Autoscaler failures: Pending pods not triggering scale-up, premature scale-down
  • Scaling thrashing: Rapid oscillations wasting money and causing instability
  • Resource waste: Over-provisioned resources not being used

Integrated Cost Intelligence

Atmosly provides real-time visibility into how autoscaling decisions impact your cloud costs:

  • Cost attribution per autoscaler: See exactly how much each HPA/VPA/Cluster Autoscaler configuration costs
  • What-if analysis: Preview cost impact of configuration changes before applying them
  • Savings opportunities: Identify workloads that could benefit from autoscaling but don't have it configured
  • Spot instance recommendations: Identify workloads safe to run on spot instances for 70-90% cost savings

Simplified Troubleshooting with AI Copilot

When autoscaling isn't working as expected, Atmosly's AI Copilot provides instant diagnosis:

  • Natural language queries: Ask "Why isn't my frontend scaling up?" instead of debugging metrics
  • Correlated analysis: Automatically correlates HPA metrics, resource usage, events, and logs
  • Root cause detection: Identifies exact reason for scaling failures (missing metrics, quota limits, misconfigurations)
  • Actionable remediation: Provides specific kubectl commands or configuration changes to fix issues

Learn more about how Atmosly can optimize your Kubernetes autoscaling at atmosly.com

Autoscaling Production Checklist

Before deploying autoscaling to production, validate these critical requirements:

Pre-Deployment Validation

  • ✓ Metrics Server deployed and healthy: kubectl top nodes returns data
  • ✓ Resource requests defined for all pods: No pods missing CPU/memory requests
  • ✓ Pod Disruption Budgets configured: Prevent autoscaling from breaking availability
  • ✓ Testing in staging environment: Validate autoscaling behavior under load before production
  • ✓ Monitoring dashboards created: Track HPA/VPA/Cluster Autoscaler metrics
  • ✓ Alerting configured: Get notified when autoscaling fails or behaves unexpectedly
  • ✓ Documentation written: Team knows how autoscaling is configured and how to troubleshoot
  • ✓ Runbooks prepared: Clear procedures for common autoscaling issues

HPA-Specific Checklist

  • ✓ Appropriate metrics selected: Use custom metrics for business logic, not just CPU
  • ✓ Thresholds validated under load: Load test to ensure scaling triggers at right time
  • ✓ Scaling velocity configured: Prevent thrashing with appropriate stabilization windows
  • ✓ Min/max replicas set appropriately: Minimum high enough for availability, maximum prevents runaway costs
  • ✓ Multiple metrics configured: Scale on CPU AND memory AND custom metrics for robustness

VPA-Specific Checklist

  • ✓ Started in "Off" mode initially: Analyzed recommendations before enabling auto-updates
  • ✓ Min/max allowed resources configured: Prevent VPA from setting extreme values
  • ✓ No VPA/HPA conflicts: Both don't scale on same metrics
  • ✓ Application handles restarts gracefully: No data loss or extended downtime from pod recreation
  • ✓ Sufficient replicas for rolling updates: VPA won't break availability by restarting all pods simultaneously

Cluster Autoscaler Checklist

  • ✓ Multiple node pools configured: Different instance types for different workload types
  • ✓ Node pool quotas sufficient: Cloud provider limits allow required node count
  • ✓ Appropriate scale-down delay: Won't remove nodes prematurely
  • ✓ Critical pods protected: safe-to-evict annotations prevent premature node removal
  • ✓ Expander configured: Strategy matches priorities (cost vs capacity)
  • ✓ Scale-up time acceptable: New nodes provision within acceptable timeframe (typically 2-5 min)

Conclusion: Achieving Elastic Kubernetes Infrastructure

Effective Kubernetes autoscaling transforms your infrastructure from static capacity requiring constant manual adjustment and frequent over-provisioning to dynamic elasticity automatically matching resources to actual demand. By mastering HPA for horizontal scaling, VPA for resource optimization, and Cluster Autoscaler for node management, and by understanding how to combine them safely, you can achieve 40-70% cost reduction while simultaneously improving reliability and performance.

The key principles to remember:

  • Start simple, iterate toward complexity: Begin with basic HPA on CPU, add custom metrics and VPA as you gain confidence
  • Monitor continuously: Autoscaling isn't “set and forget”; monitor metrics and adjust configurations as applications evolve
  • Test thoroughly before production: Load test autoscaling behavior in staging to catch issues before customer impact
  • Prevent conflicts: Ensure HPA, VPA, and Cluster Autoscaler work together rather than fighting each other
  • Balance cost and stability: Aggressive autoscaling saves money but risks thrashing; conservative autoscaling wastes money but ensures stability

Modern platforms like Atmosly eliminate much of the complexity and ongoing maintenance burden of Kubernetes autoscaling through AI-powered recommendations, automatic issue detection, integrated cost intelligence, and intelligent troubleshooting, enabling teams to reach optimal autoscaling configurations in hours rather than months of trial and error.

Ready to optimize your Kubernetes autoscaling with Atmosly? Start with the basics (HPA with CPU metrics), monitor the results, iterate based on actual behavior, and leverage modern platforms that simplify configuration and troubleshooting so you can focus on building products rather than tuning infrastructure.

Frequently Asked Questions

What is the difference between HPA and VPA in Kubernetes?
Horizontal Pod Autoscaler (HPA) scales the number of pod replicas in a deployment based on metrics like CPU, memory, or custom metrics, increasing or decreasing the pod count to match demand. Vertical Pod Autoscaler (VPA) adjusts the CPU and memory resource requests and limits for individual pods, right-sizing resources rather than changing replica count. HPA is best for stateless applications that benefit from horizontal scaling, while VPA is ideal for stateful applications that cannot scale horizontally or for optimizing resource allocation automatically.
Can I use HPA and VPA together on the same deployment?
Yes, but with important restrictions. You cannot use both HPA and VPA on the same metrics (CPU or memory) for the same deployment as they will conflict and cause scaling battles. Safe combination patterns include: using VPA to manage resource requests while HPA scales based on custom metrics like request rate, or using VPA for CPU sizing while HPA scales on memory utilization. Always ensure they operate on different metrics to avoid conflicts.
How long does it take for Kubernetes Cluster Autoscaler to add new nodes?
Cluster Autoscaler typically provisions new nodes within 2-5 minutes from detecting pending pods, though timing varies by cloud provider. AWS EKS typically takes 2-3 minutes, GCP GKE takes 1-2 minutes, and Azure AKS takes 2-4 minutes. The process includes: detecting pending pods (15-30 seconds), selecting appropriate node group (5-10 seconds), requesting node from cloud provider (1-3 minutes), and node initialization and joining cluster (30-60 seconds). You can optimize this with warm pools or Karpenter for faster provisioning.
What metrics should I use for HPA scaling instead of just CPU?
While CPU is common, production applications should scale on metrics that better represent actual load: request rate (requests per second per pod), queue depth (messages per pod for worker services), latency percentiles (p95 or p99 response time), active connections, custom business metrics (active users, transactions per second), or memory utilization for memory-intensive applications. Use the Kubernetes metrics API with Prometheus Adapter to expose custom metrics to HPA. Scaling on multiple metrics simultaneously ensures HPA responds to various load patterns.
Why is my HPA showing 'unknown' for target metrics?
HPA showing 'unknown' targets indicates it cannot retrieve metrics, typically caused by: Metrics Server not installed or unhealthy (check kubectl get deployment metrics-server -n kube-system), missing resource requests in pod spec (HPA calculates utilization as percentage of requests), custom metrics API not configured for custom metrics, or insufficient RBAC permissions for HPA to read metrics. Fix by installing Metrics Server, ensuring all pods have resource requests defined, and checking HPA controller logs for specific errors.