Kubernetes OOMKilled Errors

How to Fix Kubernetes OOMKilled Errors (2025)

Complete guide to Kubernetes OOMKilled errors: understand what OOMKilled means, learn the 8 common causes (insufficient limits, memory leaks, load spikes), debug systematically with Prometheus, profile memory, calculate optimal limits, fix leaks, and see how Atmosly detects leaks automatically with cost-aware recommendations.

Introduction to OOMKilled: When Memory Limits Are Exceeded

OOMKilled (Out Of Memory Killed) is one of the most disruptive and difficult-to-diagnose errors in Kubernetes. It occurs when a container exceeds its configured memory limit and the Linux kernel's Out Of Memory (OOM) killer forcefully terminates the process to protect the node from complete memory exhaustion that could crash the entire system. Unlike graceful application shutdowns, where your code can clean up resources, save state, and exit cleanly, OOMKilled is violent and immediate: the kernel sends SIGKILL (signal 9), which cannot be caught, handled, or ignored by the application. The process is terminated mid-execution, potentially corrupting data, losing in-flight transactions, breaking database connections without cleanup, and leaving the application in an inconsistent state that requires recovery procedures when it restarts.

The OOMKilled problem manifests in several frustrating ways that impact production services:

  • Containers repeatedly crash and enter CrashLoopBackOff: Kubernetes restarts them, they hit the memory limit again, and they get OOMKilled again in an endless loop of failures.
  • Deployments fail to roll out because new pods are immediately OOMKilled while trying to initialize.
  • Applications crash periodically under load when memory usage spikes above limits during traffic bursts or batch processing jobs.
  • Databases lose connections and corrupt indexes when killed mid-transaction without a proper shutdown.
  • Stateful applications lose cached data and warm-up state, taking minutes to rebuild after a restart and degrading performance.
  • Customer-facing services become unreliable, with random crashes that users experience as 500 errors, timeouts, or complete unavailability.

Understanding OOMKilled requires understanding how Linux memory management and Kubernetes resource limits interact. Containers run in Linux cgroups (control groups) that enforce resource limits, including memory. When a container's memory usage (specifically the "working set", which excludes reclaimable page cache) reaches the limit configured in resources.limits.memory in the pod specification, the cgroup memory controller triggers the kernel's OOM killer. The OOM killer selects a process to terminate based on its OOM score (derived from memory usage and OOM adjustment values); the selected process (usually your application's main process, the highest memory consumer) receives SIGKILL and terminates immediately. The container exits with exit code 137 (128 + 9, where 9 is the SIGKILL signal number), Kubernetes records the terminated container with reason "OOMKilled", and the restart policy brings it back, typically leading to CrashLoopBackOff if the problem persists.

This comprehensive guide teaches you everything about OOMKilled errors, from root causes to permanent solutions, covering:

  • What OOMKilled means technically and how to confirm it happened versus other container termination reasons
  • Kubernetes memory metrics (working set bytes, RSS, cache, swap) and which one triggers an OOMKill
  • The difference between memory requests and limits, and how each affects scheduling versus runtime behavior
  • The 8 most common root causes of OOMKilled, including insufficient limits, memory leaks, configuration errors, and unexpected load
  • A systematic debugging methodology to identify whether you need bigger limits or need to fix a memory leak
  • Analyzing memory usage patterns over time using Prometheus queries
  • Calculating optimal memory requests and limits based on actual usage plus appropriate headroom for spikes
  • Fixing memory leaks in applications through profiling and code analysis
  • Implementing memory-efficient practices in application code

It also covers how Atmosly's AI-powered memory analysis automates the entire investigation: it detects OOMKilled events within 30 seconds, retrieves historical memory usage from Prometheus showing the trend leading up to the OOMKill, determines whether memory was growing linearly over time (indicating a leak) or spiked suddenly (indicating load or a configuration issue), calculates statistically optimal memory limits based on p95 or p99 usage over a 7-30 day period plus a configurable safety margin, identifies leak patterns by detecting continuous growth without a plateau, shows the exact cost impact of increasing memory limits ("adding 512Mi to this pod costs $15/month per replica × 10 replicas = $150/month"), and provides a one-command kubectl fix to adjust limits appropriately. The result: troubleshooting drops from hours of manual metric analysis, heap dump examination, and trial-and-error limit adjustments to immediate diagnosis with data-driven configuration recommendations.

By mastering OOMKilled troubleshooting through this guide, you'll be able to diagnose memory issues in minutes instead of hours, distinguish between legitimate need for more memory versus fixable memory leaks, optimize memory allocation balancing cost against reliability, prevent OOMKills through proper limit configuration and application tuning, and leverage AI-powered analysis to automate the entire investigation process.

What is OOMKilled? Technical Deep Dive into Linux OOM Killer

How the Linux OOM Killer Works

The Out Of Memory killer is a Linux kernel mechanism protecting system stability when memory is completely exhausted. When a container in Kubernetes exceeds its memory limit, here's exactly what happens:

  1. Memory Usage Grows: Your application allocates more memory (malloc in C, new objects in Java/Python/Node.js, slice/map growth in Go)
  2. Approaching Limit: Container memory working set approaches the limit configured in resources.limits.memory (e.g., 512Mi)
  3. Limit Exceeded: Working set reaches or exceeds limit (512Mi)
  4. Cgroup Controller Triggers: Linux cgroup memory controller detects limit violation
  5. OOM Killer Invoked: Kernel's OOM killer mechanism activates
  6. Process Selection: OOM killer selects victim process based on OOM score (highest memory user typically selected)
  7. SIGKILL Sent: Kernel sends SIGKILL (signal 9 - cannot be caught or blocked) to victim process
  8. Process Terminated: Application terminates immediately without cleanup, exit code 137 (128 + 9)
  9. Container Exits: Container stops with reason "OOMKilled"
  10. Kubernetes Restart: Restart policy (usually Always) triggers automatic restart

Why Exit Code 137? Unix convention: 128 + signal number. SIGKILL is signal 9, so 128 + 9 = 137. This exit code specifically indicates forceful termination by a signal; in a Kubernetes context, paired with the "OOMKilled" reason, it almost always means the memory limit was exceeded.
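
You can also confirm the kill from the node's kernel log. A quick grep sketch, assuming you have node access (the exact log wording varies by kernel version):

# On the node hosting the pod (SSH or a node debug container)
dmesg -T | grep -iE "out of memory|oom-kill|killed process"

# Or via the systemd journal
journalctl -k | grep -i oom

# Typical kernel line:
# Memory cgroup out of memory: Killed process 12345 (my-app) ...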

Confirming OOMKilled Happened

# Method 1: Check termination reason
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Output: OOMKilled (confirms it)

# Method 2: Check exit code
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Output: 137 (very likely OOMKilled, especially with reason above)

# Method 3: Describe pod and look for OOMKilled in events
kubectl describe pod my-pod

# Events section shows:
# State:          Terminated
#   Reason:       OOMKilled
#   Exit Code:    137
#   Started:      Mon, 27 Oct 2025 14:30:15 +0000
#   Finished:     Mon, 27 Oct 2025 14:32:47 +0000

Understanding Kubernetes Memory Metrics

Critical memory metrics explained:

  • container_memory_working_set_bytes: THIS is what counts toward OOMKill limit. Working set = RSS (anonymous memory) + page cache that cannot be reclaimed. When this equals or exceeds limits.memory, OOMKill triggers.
  • container_memory_rss: Resident Set Size - anonymous memory (heap, stack, not backed by files). Always less than or equal to working set.
  • container_memory_cache: Page cache memory used for filesystem caching. Mostly reclaimable but some counted in working set if actively used.
  • container_memory_usage_bytes: Total memory including all caches. Higher than working set, NOT what triggers OOMKill (misleading metric!).

The confusion: usage vs working set

Many engineers check container_memory_usage_bytes, see it at 950Mi against a 1Gi limit, and assume an OOMKill is imminent, when the working set is only 600Mi and the difference is reclaimable page cache. Others alert on usage and miss the signal that matters: container_memory_working_set_bytes is the value the kernel compares against the limit. Always monitor working_set, not usage.
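
To see the gap yourself, assuming Prometheus is scraping cAdvisor metrics, query both series for the same pod and compare each against the limit:

# What the OOM killer compares against the limit
container_memory_working_set_bytes{pod="my-pod"}

# Total usage including reclaimable page cache (always >= working set)
container_memory_usage_bytes{pod="my-pod"}

# Fraction of the limit actually in use
container_memory_working_set_bytes{pod="my-pod"}
  / container_spec_memory_limit_bytes{pod="my-pod"}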

The 8 Most Common Root Causes of OOMKilled

Cause 1: Memory Limit Set Too Low for Normal Operation

Symptoms: Application crashes during normal operation, memory usage steady at limit before crash, happens predictably under certain operations

How to diagnose:

# Check current memory limit
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].resources.limits.memory}'
# Output: 512Mi

# Check memory usage before crash (if Prometheus available)
# Query for last 24 hours before OOMKill:
container_memory_working_set_bytes{pod="my-pod"}

# Typical pattern: Usage steadily at 500-512Mi (98-100% of limit)
# Indicates limit too low, not a leak

Solution:

# Increase memory limit appropriately
kubectl set resources deployment/my-app \
  -c=my-container \
  --limits=memory=1Gi \
  --requests=memory=768Mi

# Rule of thumb: Set limit 20-30% above typical p95 usage
# If p95 usage is 750Mi, set limit to 1Gi (33% headroom)

Atmosly's Recommendation Example:

OOMKilled Analysis: database-primary-0

Current Config:

  • Memory Limit: 512Mi
  • Memory Request: 256Mi

Usage Analysis (30-day history):

  • Average usage: 450Mi (88% of limit)
  • P95 usage: 505Mi (99% of limit)
  • P99 usage: 510Mi (99.6% of limit)
  • OOMKills: 12 times in last 30 days
  • Pattern: No leak detected (usage plateaus at ~500Mi)

Root Cause: Memory limit too low for normal database operation. During active query processing, memory naturally reaches 500-510Mi hitting limit.

Recommended Fix:

kubectl set resources statefulset/database \
  -c postgres \
  --limits=memory=768Mi \
  --requests=memory=512Mi

Rationale: P99 usage (510Mi) + 30% safety margin = 663Mi, rounded to 768Mi
Cost Impact: +$12/month per pod
Expected OOMKills after fix: 0 (adequate headroom)

Cause 2: Memory Leak in Application Code

Symptoms: Memory usage grows continuously over time, never plateaus, eventually hits limit, crashes, restarts, grows again, crashes again

How to diagnose:

# Query Prometheus for memory growth over time
container_memory_working_set_bytes{pod=~"my-app-.*"}

# Pattern indicating leak:
# Hour 0: 200Mi
# Hour 6: 350Mi  
# Hour 12: 500Mi
# Hour 15: 512Mi → OOMKill → restart
# Hour 16: 200Mi (after restart)
# Hour 22: 350Mi (growing again)
# Continuous linear growth = memory leak

# Calculate growth rate (deriv() fits a per-second slope; use it for gauges, not rate())
deriv(container_memory_working_set_bytes{pod=~"my-app-.*"}[1h])
# If consistently positive and never negative, probable leak

Solutions:

Short-term (band-aid):

# Increase limit to buy time
kubectl set resources deployment/my-app --limits=memory=2Gi

# This delays OOMKill but doesn't fix leak

Long-term (actual fix):

  • Profile application memory: Use language-specific profilers (pprof for Go, heapdump for Java, memory_profiler for Python, heapdump for Node.js)
  • Identify leak source: Find objects accumulating without being freed (caches never cleaned, event listeners never removed, connections not closed)
  • Fix code: Add proper cleanup, implement cache eviction, close resources, remove listeners
  • Test: Run load tests monitoring memory over time and verify that memory plateaus (see the watch loop below)
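
A simple way to watch for that plateau during a load test, assuming metrics-server is installed (the app=my-app label is a placeholder):

# Sample pod memory every 30 seconds during the load test
watch -n 30 kubectl top pod -l app=my-app

# Or log timestamped readings for later comparison
while true; do
  echo "$(date -u +%H:%M:%S) $(kubectl top pod -l app=my-app --no-headers)"
  sleep 30
done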

Common leak patterns by language:

  • Go: Goroutine leaks, slice/map growth without bounds, defer in loops
  • Java: Static collections, ThreadLocal not cleaned, listeners not removed
  • Python: Global caches, circular references preventing GC, unclosed file handles
  • Node.js: Event listeners, unclosed connections, global array accumulation

Atmosly's Memory Leak Detection:

Memory Leak Detected: api-server-deployment

Usage Pattern Analysis (7-day history):

  • Memory growth rate: +18Mi per hour (continuous linear increase)
  • Starting memory (after restart): 280Mi
  • Projected time to OOMKill: 11 hours from restart
  • OOMKill frequency: Every 10-12 hours consistently
  • Leak confidence: 95% (high certainty based on pattern)

Leak Signature: Memory never plateaus, growth never stops, restart temporarily fixes (drops to 280Mi) but growth resumes immediately

Immediate Action: Increase limit to prevent crashes while leak is fixed

kubectl set resources deployment/api-server --limits=memory=2Gi

Permanent Fix Required: Profile application, identify leak source, fix code
Estimated leak location: Based on logs, likely unclosed database connections accumulating

Cost of band-aid: +$45/month (doubled memory)
Cost of not fixing: Continued crashes every 11 hours, customer impact, engineering time

Causes 3-8: Sudden memory spikes under load (too many concurrent requests), large file processing, inefficient caching, JVM heap misconfiguration, container shared memory, and gradual accumulation over days

[Each cause with detailed diagnostics, memory profiling techniques, and specific solutions...]

Systematic OOMKilled Debugging Process

Step 1: Confirm OOMKilled (Not Other Crash Reason)

# Check all three indicators
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Should show: OOMKilled

kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'  
# Should show: 137

kubectl describe pod my-pod | grep -A 5 "Last State"
# Shows: Reason: OOMKilled, Exit Code: 137

Step 2: Check Current Memory Limit

# What limit is configured?
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].resources.limits.memory}'
# Example: 512Mi

# Also check request
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].resources.requests.memory}'
# Example: 256Mi

Step 3: Analyze Historical Memory Usage

# If Prometheus available, query memory trends
# Last 24 hours before OOMKill
container_memory_working_set_bytes{pod="my-pod"}[24h]

# Calculate p95 and p99 usage
quantile_over_time(0.95, container_memory_working_set_bytes{pod=~"my-app-.*"}[7d])
quantile_over_time(0.99, container_memory_working_set_bytes{pod=~"my-app-.*"}[7d])

# If metrics-server installed (real-time only, no history)
kubectl top pod my-pod

Step 4: Determine If Memory Leak or Insufficient Limit

Memory Leak Indicators:

  • Memory usage grows continuously over time
  • Never plateaus or reaches steady state
  • Growth rate is consistent (e.g., +15Mi per hour)
  • After restart, starts low then grows again
  • Pattern repeats predictably

Insufficient Limit Indicators:

  • Memory usage plateaus at or near limit
  • Usage stable at 480-500Mi when limit is 512Mi
  • OOMKills happen during known high-memory operations (batch jobs, cache warming)
  • No continuous growth; usage is stable except around specific events (a PromQL heuristic for telling the two cases apart follows below)
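
One PromQL heuristic for separating the two cases, assuming cAdvisor metrics in Prometheus (deriv() fits a linear slope to a gauge):

# Average slope of the working set over 6h, in bytes/sec
deriv(container_memory_working_set_bytes{pod=~"my-app-.*"}[6h])

# Persistently positive across restart cycles -> probable leak
# Near zero while usage sits close to the limit -> limit too low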

Step 5: Calculate Optimal Limits

Statistical approach:

# Get p95 usage over last 30 days
p95_usage = quantile_over_time(0.95, container_memory_working_set_bytes[30d])

# Calculate recommended limit
recommended_limit = p95_usage * 1.3  # 30% headroom for spikes

# Example:
# p95 usage = 650Mi
# recommended limit = 650Mi * 1.3 = 845Mi
# Round up to 1Gi for cleaner configuration

Rule of thumb:

  • Stateless apps: p95 + 20% (smaller spikes expected)
  • Stateful apps (databases, caches): p99 + 30% (larger variance)
  • Batch jobs: max observed + 50% (highly variable)
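
The same calculation can be scripted against the Prometheus HTTP API. A minimal sketch, assuming Prometheus is reachable at $PROM and jq is installed (the pod regex is a placeholder):

PROM=http://prometheus:9090

# p95 working set over 30 days, in bytes
P95=$(curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=quantile_over_time(0.95, container_memory_working_set_bytes{pod=~"my-app-.*"}[30d])' \
  | jq -r '.data.result[0].value[1]')

# Convert bytes to Mi and add 30% headroom
echo "$P95" | awk '{printf "recommended limit: %.0fMi\n", ($1 / 1048576) * 1.3}'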

Fixing OOMKilled: Immediate and Long-Term Solutions

Immediate Fix: Increase Memory Limits

# Quick fix to stop crashes
kubectl set resources deployment/my-app \
  --limits=memory=1Gi \
  --requests=memory=768Mi

# Verify change applied
kubectl rollout status deployment/my-app

# Monitor to confirm OOMKills stop
kubectl get pods --watch

When to increase limits:

  • Memory usage consistently at 90-100% of limit
  • No evidence of memory leak (usage plateaus)
  • Application legitimately needs more memory for its workload
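
To confirm the fix held, check that restart counts stop climbing and the last termination reason is no longer OOMKilled (a quick sketch; the app=my-app label is a placeholder):

# Restart count per pod
kubectl get pods -l app=my-app \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount

# Last termination reason (empty or non-OOMKilled after the rollout is a good sign)
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'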

Long-Term Fix: Find and Fix Memory Leaks

Memory Profiling by Language:

Go Applications:

// In the application: importing net/http/pprof registers /debug/pprof handlers
import _ "net/http/pprof"

// Expose them on port 6060 (run in a goroutine so it doesn't block main)
go http.ListenAndServe(":6060", nil)

# Collect a heap profile from the running pod
kubectl port-forward my-pod 6060:6060
go tool pprof http://localhost:6060/debug/pprof/heap

# Analyze top memory consumers
(pprof) top10
(pprof) list <function>

Java Applications:

# Take heap dump when memory high
kubectl exec my-pod -- jmap -dump:live,format=b,file=/tmp/heap.bin 1

# Copy heap dump locally
kubectl cp my-pod:/tmp/heap.bin ./heap.bin

# Analyze with Eclipse MAT or VisualVM
# Look for: Objects with high retained size, memory leak suspects, duplicate strings

Python Applications:

# Use memory_profiler
pip install memory-profiler

# Decorate functions
@profile
def my_function():
    ...  # function body to profile, reported line by line

# Run with profiler
python -m memory_profiler app.py

# Or use tracemalloc (built-in Python 3.4+)
import tracemalloc
tracemalloc.start()
# ... run code ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)

Node.js Applications:

# Run the app with the inspector enabled, then forward the port to attach
kubectl exec my-pod -- node --expose-gc --inspect=0.0.0.0:9229 app.js
kubectl port-forward my-pod 9229:9229

# Take heap snapshots in Chrome DevTools (chrome://inspect), or use clinic.js
npm install -g clinic
clinic doctor -- node app.js

Memory Optimization Best Practices

1. Set Memory Requests and Limits Appropriately

resources:
  requests:
    memory: 512Mi  # Guaranteed allocation (for scheduling)
  limits:
    memory: 768Mi  # Maximum before OOMKill (50% headroom)

Best practices:

  • Request = typical steady-state usage
  • Limit = p95 or p99 usage + 20-30% safety margin
  • Limit should be 1.5-2x request (not 10x)
  • Always set an explicit request: if only a limit is set, Kubernetes defaults the request to the limit, which over-reserves capacity during scheduling

2. Monitor Memory Usage Over Time

# Prometheus alert for high memory usage (warning before OOMKill)
- alert: HighMemoryUsage
  expr: |
    (container_memory_working_set_bytes
    /
    container_spec_memory_limit_bytes)
    > 0.9
  for: 10m
  annotations:
    summary: "Pod {{ $labels.pod }} using {{ $value | humanizePercentage }} of memory limit"
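
A companion rule for slow leaks pairs a sustained positive slope with meaningful utilization. A sketch only; the thresholds are assumptions to tune per workload:

# Prometheus alert for suspected memory leak (sustained growth + >70% of limit)
- alert: MemoryLeakSuspected
  expr: |
    deriv(container_memory_working_set_bytes[6h]) > 0
    and
    (container_memory_working_set_bytes
    /
    container_spec_memory_limit_bytes)
    > 0.7
  for: 6h
  annotations:
    summary: "Pod {{ $labels.pod }} memory climbing steadily; possible leak"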

3. Implement Graceful Degradation

Instead of crashing when memory full, applications should:

  • Clear caches when memory pressure high
  • Reject new requests returning 503 Service Unavailable
  • Implement circuit breakers
  • Use bounded queues and buffers (not unlimited)

4. Use Horizontal Pod Autoscaling

# Scale pods based on memory usage
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70  # Scale when avg memory > 70%

Distributes load across more pods, keeping individual pod memory usage lower.

5. Implement Memory Limits in Application Code

Java JVM:

# Set max heap based on container limit
# Container limit: 1Gi
# JVM max heap: 768Mi (75% of limit, leave room for non-heap memory)

JAVA_OPTS="-Xmx768m -Xms512m"
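
On JDK 10+ (and 8u191+), you can instead size the heap as a fraction of the container's detected memory limit, which tracks limit changes automatically:

# Give 75% of the container memory limit to the heap
JAVA_OPTS="-XX:MaxRAMPercentage=75.0"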

Node.js:

# Set V8 max old space
# Container limit: 1Gi = 1024Mi
# V8 limit: ~768Mi

node --max-old-space-size=768 app.js
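
Go:

Go 1.19+ offers a comparable knob: the GOMEMLIMIT environment variable sets a soft limit that the garbage collector respects. Leave headroom below the container limit, since it only covers memory the Go runtime itself manages:

# Container limit: 1Gi -> soft runtime limit ~768MiB
GOMEMLIMIT=768MiB ./app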

How Atmosly Prevents and Diagnoses OOMKilled

Proactive Memory Monitoring

Atmosly monitors memory usage continuously and alerts BEFORE OOMKill:

  • Warning at 85%: "Pod approaching memory limit (85%), OOMKill risk in 2-4 hours at current growth rate"
  • Critical at 95%: "Pod at 95% memory limit, OOMKill imminent within minutes"
  • Leak Detection: "Memory growing +12Mi/hour, probable leak, will OOMKill in 8 hours"

Post-OOMKill Analysis

When OOMKill happens, Atmosly automatically:

  1. Retrieves 24h memory usage history from Prometheus
  2. Calculates statistical patterns (mean, p50, p95, p99, growth rate)
  3. Determines if leak (continuous growth) or spike (sudden increase)
  4. Calculates optimal limit based on usage + appropriate headroom
  5. Shows cost impact of recommended limit increase
  6. Provides kubectl command to apply fix
  7. If leak detected, suggests profiling and provides language-specific profiling commands

Cost-Aware Recommendations

Atmosly always shows cost impact:

  • "Increasing from 512Mi to 1Gi adds $18/month per pod"
  • "Current pod has 5 replicas = $90/month total cost increase"
  • "Alternative: Reduce replicas from 5 to 4 and increase memory = net $0 cost change"

Enables informed decision-making balancing reliability and cost.

Conclusion: Master OOMKilled Prevention and Resolution

OOMKilled means container exceeded memory limit and was forcefully terminated. Fix by: confirming OOMKilled (exit code 137, reason OOMKilled), analyzing memory usage patterns (leak vs spike vs insufficient limit), increasing limits appropriately (p95/p99 + 20-30%), fixing memory leaks if detected through profiling, and monitoring proactively to prevent future OOMKills.

Key Takeaways:

  • OOMKilled = memory limit exceeded, kernel kills process
  • Exit code 137 = SIGKILL, usually indicates OOMKilled
  • Monitor container_memory_working_set_bytes (what triggers OOMKill)
  • Set limits = p95 usage + 30% headroom
  • Memory leak = continuous growth never plateauing
  • Profile and fix leaks, don't just increase limits indefinitely
  • Atmosly detects leaks automatically with growth rate analysis
  • Balance cost vs reliability when sizing memory limits

Ready to prevent OOMKilled with AI-powered memory analysis? Start your free Atmosly trial for proactive memory monitoring, leak detection, and optimal limit recommendations with cost impact.

Frequently Asked Questions

What is OOMKilled in Kubernetes and why does it happen?
  1. Definition:

    OOMKilled (Out Of Memory Killed) means the container exceeded its configured memory limit, and the Linux kernel forcefully terminated the process to protect the node from running out of memory.

  2. How OOMKilled Happens:
    • (1) Container allocates memory (objects, caches, request processing).
    • (2) Memory usage approaches the configured resources.limits.memory (e.g., 512Mi).
    • (3) The cgroup memory controller detects a memory limit violation.
    • (4) The Linux kernel OOM killer selects the process using the most memory.
    • (5) Kernel sends SIGKILL (signal 9) — cannot be caught or ignored — terminating the container immediately.
    • (6) Container exits with exit code 137 (128 + 9).
    • (7) Kubernetes reports reason OOMKilled and restarts the container based on restart policy.
    • (8) If memory limits remain insufficient, the container repeatedly crashes → CrashLoopBackOff.
  3. Common Causes:
    • Memory limit set too low for normal workload behavior.
    • Memory leak in application code causing unbounded growth.
    • Sudden memory spike under heavy load or large batch processing.
    • Inefficient caching or storing large data objects in memory.
  4. How to Confirm OOMKilled:
    kubectl get pod <name> \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

    Output will be OOMKilled. Exit code will be 137.

  5. How It Differs From Other Terminations:
    • Exit code 1: Application crash or runtime error.
    • Exit code 143: Graceful shutdown from SIGTERM.
    • Exit code 137: Forced kill via SIGKILL due to memory exhaustion (OOMKilled).

How do I fix OOMKilled errors in Kubernetes?
  1. Confirm OOMKilled:

    Verify the pod termination reason and exit code:

    kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
    kubectl get pod <name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'

    Expect OOMKilled and exit code 137 (128 + 9).

  2. Check configured memory limit:

    See the memory limit currently set on the container:

    kubectl get pod <name> -o jsonpath='{.spec.containers[0].resources.limits.memory}'

    Example output: 512Mi.

  3. Analyze actual memory usage:

    If Prometheus (or metrics) is available, examine historical working set to determine behavior:

    container_memory_working_set_bytes{pod="<name>"}  # query over 7–30 days (use p95/p99)

    Look for:

    • Continuous growth → likely memory leak.
    • Stable plateau but near limit → limit too low for normal operation.
    • Short spikes → consider headroom or transient scaling.
  4. Increase memory limits (short-term fix if no leak):

    Raise limits to stop immediate crashes while you investigate. Use p95 usage + ~30% headroom as a guideline:

    kubectl set resources deployment/<name> --limits=memory=1Gi --requests=memory=768Mi

    Adjust values based on observed p95/p99; avoid unbounded over-provisioning.

  5. Profile and fix memory leaks (long-term):

    If usage shows continuous growth, profile the application to find the leak:

    • Go: pprof
    • Java: jmap / VisualVM / async-profiler
    • Python: memory_profiler / tracemalloc
    • Node.js: heapdump / --inspect

    Identify objects that grow over time, fix code to release resources, and add tests to prevent regressions. Short-term: keep increased limit to avoid outages.

  6. Set limits using statistical guidance:

    Compute recommended limit = p95 usage + ~30% headroom (or use p99 for bursty apps). Update deployment when confident.

  7. Monitor after changes:

    Continue monitoring container_memory_working_set_bytes and alert on sustained growth or usage near new limits. Validate no CrashLoopBackOff occurs after the change.

  8. Atmosly automation (optional):

    Automated tooling can speed this workflow by detecting OOMKilled quickly, retrieving memory history, distinguishing leak vs spike, computing an optimal limit, showing cost impact, and providing the exact kubectl command to apply fixes.

How do I detect memory leaks in Kubernetes containers?
  1. Query memory over time (example):
    container_memory_working_set_bytes{pod=~"my-app-.*"}[7d]

    Plot this range vector (7-day) to visualize memory behavior across pod replicas.

  2. Leak pattern (what to look for):
    • Continuous linear growth that never plateaus.
    • Starts low after restart (e.g., 300Mi) and grows consistently (e.g., +15Mi/hour).
    • Eventually reaches the memory limit → OOMKill → pod restarts → pattern repeats.
    • Growth-rate check (PromQL): deriv(container_memory_working_set_bytes{pod=~"my-app-.*"}[1h]) stays consistently positive (deriv() measures a gauge's slope; rate() is intended for counters).
  3. Non-leak pattern (stable behavior):
    • Memory usage plateaus at a steady state (e.g., ~500Mi) and stays stable for hours/days.
    • Occasional spikes during known operations return to baseline.
    • p95 usage remains consistent over long windows (no continuous upward trend).
  4. Calculate time-to-OOM (simple projection):

    Formula:

    time_to_OOM_hours = (memory_limit - current_usage) / growth_rate_per_hour

    Example: growth_rate = 15 Mi/hour, current = 400 Mi, limit = 512 Mi →

    time_to_OOM = (512 - 400) / 15 = 7.466... hours ≈ 7.5 hours
  5. Confirm leak with restart pattern:
    • If memory reaches limit, OOMKills, pod restarts to a low baseline, then memory repeats the same growth curve — this is a strong indicator of a memory leak.
    • Validate by comparing several restart cycles and ensuring the same linear slope appears each time.
  6. PromQL examples for detection:
    # Growth rate (bytes/sec) over 1h per pod (deriv() fits a slope to the gauge)
    deriv(container_memory_working_set_bytes{pod=~"my-app-.*"}[1h])
    
    # Convert to Mi/hour (bytes/sec * 3600 / 1Mi)
    (deriv(container_memory_working_set_bytes{pod=~"my-app-.*"}[1h]) * 3600) / (1024*1024)
    
  7. Alerting heuristics:
    • Alert when growth_rate > X Mi/hour for Y hours (configurable) and current_usage > threshold% of limit (e.g., >70%).
    • Alert message should include estimated time-to-OOM and recent trend (e.g., "Memory leak detected: +15Mi/hour — projected OOMKill in ~8 hours").
  8. Actions after detection:
    • Short-term: increase memory limit or scale replicas to avoid immediate OOMKills.
    • Long-term: profile app (pprof for Go, jmap/heapdump for Java, tracemalloc/memory_profiler for Python, heapdump/inspector for Node) to find and fix leak.
    • Instrument tests to reproduce and validate the fix before rolling to production.
  9. Automation note (Atmosly example):

    Automated systems can continuously monitor memory, detect linear growth patterns, calculate growth rate and projected OOM time, alert before crash, and surface language-specific profiling guidance (Go/Java/Python/Node) to accelerate root-cause analysis.

What is the difference between container memory usage and working set in Kubernetes?
  1. container_memory_working_set_bytes
    Memory that counts toward OOMKill. Includes RSS (anonymous memory like heap/stack) plus the portion of page cache that is actively used and not reclaimable. Compare this metric to your container memory limit (resources.limits.memory) — when working set ≥ limit the kernel may OOMKill the process. Monitor this metric to prevent OOMKills.
  2. container_memory_usage_bytes
    Total memory usage including all cache (reclaimable + non-reclaimable). This value is usually higher than the working set and does not directly indicate OOMKill risk. Relying only on this metric is misleading: reclaimable cache inflates it, so it can look alarmingly close to the limit while the working set (the number the kernel actually checks) still has plenty of headroom.
  3. container_memory_rss
    Resident Set Size — anonymous memory only (heap/stack), excludes file-backed cache. RSS is a subset of the working set.
  4. container_memory_cache
    Page cache used for file I/O (file-backed cache). Mostly reclaimable by the kernel and typically not the primary driver for OOMKill unless the cache becomes non-reclaimable.
  5. Working set formula
    working_set ≈ RSS + active_non-reclaimable_page_cache. The working set is the correct comparator against container_spec_memory_limit_bytes.
  6. Common mistake
    Monitoring container_memory_usage_bytes against a 1Gi limit cuts both ways: reclaimable cache can push it to 950Mi and trigger false alarms, while the value the kernel actually enforces is the working set. Always compare working_set to the configured limit.
  7. kubectl/top note
    kubectl top pod reports the working set (what matters for OOMKill), so it’s a quick sanity check on memory pressure in a cluster.
  8. Prometheus alert example
    Alert when working set approaches the configured limit (warning at 90%):
    expr: (container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.9
    Tune thresholds (p95/p99 windows or sustained duration) to avoid noisy alerts.

How does Atmosly help prevent and fix OOMKilled errors?
  1. Proactive Monitoring:
    • Continuously tracks container_memory_working_set_bytes, the metric that triggers OOMKill.
    • Warn alert at 85% of limit:
      "Warning: Approaching memory limit, OOMKill risk in 2–4 hours at current growth rate."
    • Critical alert at 95% of limit:
      "Critical: OOMKill imminent within minutes."
  2. Automatic Detection:
    • Identifies OOMKilled events within 30 seconds of occurrence.
    • Fetches exit code 137 and termination reason "OOMKilled" from pod status.
    • Removes need for manual log/pod inspection (usually 5–15 minutes delay).
  3. Historical Analysis:
    • Queries Prometheus for 7–30 day memory usage history.
    • Plots memory trends over time (graphs included).
    • Calculates mean, p50, p95, p99, and growth rate.
  4. Leak vs. Limit Determination:
    • Detects memory leaks by identifying continuous linear growth (+X Mi/hour) that never plateaus.
    • Identifies insufficient limit if memory plateaus near limit but does not grow endlessly.
    • Computes growth rate and projected time until next OOMKill.
    • Example:
      "Memory leak detected: +15Mi/hour, OOMKill expected in ~8 hours"
  5. Optimal Limit Calculation:
    • Uses p95 or p99 memory usage over 30 days.
    • Adds 20–30% headroom for safe operation.
    • Example recommendation:
      "Current: 512Mi (causing OOMKills), Recommended: 768Mi (p95 = 590Mi + 30%)."
  6. Cost Impact:
    • Shows cost of updated memory limits in real currency terms.
    • Example:
      "Increasing to 768Mi adds $12/month per pod × 3 replicas = $36/month total."
    • Enables informed cost–performance decision-making.
  7. One-Command Fix:

    Automatically generates the exact kubectl command required to fix the issue:

    kubectl set resources deployment/my-app --limits=memory=768Mi --requests=memory=512Mi
  8. Leak Remediation Assistance:
    • If a leak is detected, Atmosly provides language-specific profiling commands:
      • Go → pprof
      • Java → jmap, heap dumps
      • Python → memory_profiler, tracemalloc
      • Node.js → heapdump, --inspect
    • AI suggests likely leak sources using log and metric correlation.
  9. Outcome:

    Prevents OOMKills through early warnings, reduces debugging time from hours of manual metrics analysis to seconds, and offers data-driven recommendations rather than trial-and-error guesses.