Introduction to OOMKilled: When Memory Limits Are Exceeded
OOMKilled (Out Of Memory Killed) is one of the most disruptive and difficult-to-diagnose errors in Kubernetes. It occurs when a container exceeds its configured memory limit and the Linux kernel's Out Of Memory (OOM) killer forcefully terminates the process to protect the node from complete memory exhaustion that could crash the entire system. Unlike a graceful shutdown, where your code can clean up resources, save state, and exit cleanly, OOMKilled is violent and immediate: the kernel sends SIGKILL (signal 9), which cannot be caught, handled, or ignored by the application. The process is terminated mid-execution, potentially corrupting data, losing in-flight transactions, breaking database connections without cleanup, and leaving the application in an inconsistent state that requires recovery procedures when it restarts.
The OOMKilled problem manifests in several frustrating ways that impact production services: containers repeatedly crash and enter CrashLoopBackOff as Kubernetes restarts them, they hit the memory limit again, and they get OOMKilled again in an endless loop; deployments fail to roll out because new pods are OOMKilled while trying to initialize; applications crash periodically under load when memory usage spikes above limits during traffic bursts or batch processing jobs; databases lose connections and corrupt indexes when killed mid-transaction without a proper shutdown; stateful applications lose cached data and warm-up state, taking minutes to rebuild after restart and degrading performance; and customer-facing services become unreliable, with random crashes that users experience as 500 errors, timeouts, or complete unavailability.
Understanding OOMKilled requires understanding how Linux memory management and Kubernetes resource limits interact. Containers run in Linux cgroups (control groups) that enforce resource limits, including memory. When a container's memory usage (specifically the "working set", which excludes reclaimable page cache) reaches the limit configured in resources.limits.memory in the pod specification, the cgroup memory controller triggers and the kernel's OOM killer selects a process to terminate based on its OOM score (memory usage plus OOM adjustment values). The selected process (usually your application's main process, as the highest memory consumer) receives SIGKILL and terminates immediately; the container exits with exit code 137 (128 + 9, where 9 is the SIGKILL signal number); and Kubernetes sees the terminated container with reason "OOMKilled" and restarts it according to the restart policy, typically leading to CrashLoopBackOff if the problem persists.
This comprehensive guide covers OOMKilled errors from root causes to permanent solutions: what OOMKilled means technically and how to confirm it versus other container termination reasons; the Kubernetes memory metrics (working set bytes, RSS, cache, swap) and which one triggers an OOMKill; the difference between memory requests and limits and how each affects scheduling versus runtime behavior; the eight most common root causes, including insufficient limits, memory leaks, configuration errors, and unexpected load; a systematic debugging methodology for deciding whether you need bigger limits or a leak fix; analyzing memory usage patterns over time with Prometheus queries; calculating optimal memory requests and limits from actual usage plus appropriate headroom for spikes; fixing memory leaks through profiling and code analysis; and implementing memory-efficient practices in application code. It also shows how Atmosly's AI-powered memory analysis automatically detects OOMKilled events within 30 seconds, retrieves historical memory usage from Prometheus to show the trend leading up to the OOMKill, determines whether memory was growing linearly over time (indicating a leak) or spiked suddenly (indicating load or a configuration issue), calculates statistically optimal memory limits from p95 or p99 usage over a 7-30 day period plus a configurable safety margin, identifies leak patterns by detecting continuous growth without a plateau, shows the exact cost impact of increasing limits ("adding 512Mi to this pod costs $15/month per replica × 10 replicas = $150/month"), and provides a one-command kubectl fix to adjust limits appropriately, reducing troubleshooting time from hours of manual metric analysis, heap dump examination, and trial-and-error limit adjustments to an immediate diagnosis with data-driven configuration recommendations.
By mastering OOMKilled troubleshooting through this guide, you'll be able to diagnose memory issues in minutes instead of hours, distinguish a legitimate need for more memory from a fixable memory leak, optimize memory allocation to balance cost against reliability, prevent OOMKills through proper limit configuration and application tuning, and leverage AI-powered analysis to automate the entire investigation process.
What is OOMKilled? Technical Deep Dive into Linux OOM Killer
How the Linux OOM Killer Works
The Out Of Memory (OOM) killer is a Linux kernel mechanism that protects system stability when memory is exhausted. When a container in Kubernetes exceeds its memory limit, here is exactly what happens:
- Memory Usage Grows: Your application allocates more memory (malloc in C, new objects in Java/Python/Node.js, slice/map growth in Go)
- Approaching Limit: Container memory working set approaches the limit configured in resources.limits.memory (e.g., 512Mi)
- Limit Exceeded: Working set reaches or exceeds limit (512Mi)
- Cgroup Controller Triggers: Linux cgroup memory controller detects limit violation
- OOM Killer Invoked: Kernel's OOM killer mechanism activates
- Process Selection: OOM killer selects victim process based on OOM score (highest memory user typically selected)
- SIGKILL Sent: Kernel sends SIGKILL (signal 9 - cannot be caught or blocked) to victim process
- Process Terminated: Application terminates immediately without cleanup, exit code 137 (128 + 9)
- Container Exits: Container stops with reason "OOMKilled"
- Kubernetes Restart: Restart policy (usually Always) triggers automatic restart
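To see the limit the kernel is actually enforcing, and how many times the OOM killer has fired for this cgroup, you can read the container's cgroup files directly. A quick sketch, assuming the pod is named my-pod and runs on a cgroup v2 node (the paths differ under cgroup v1, e.g. memory.limit_in_bytes and memory.usage_in_bytes):
# Effective limit, current usage, and OOM kill count as the kernel sees them (cgroup v2 paths)
kubectl exec my-pod -- cat /sys/fs/cgroup/memory.max
kubectl exec my-pod -- cat /sys/fs/cgroup/memory.current
kubectl exec my-pod -- grep oom_kill /sys/fs/cgroup/memory.events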
Why Exit Code 137? By Unix convention, the exit code is 128 + the signal number; SIGKILL is signal 9, so 128 + 9 = 137. This exit code specifically indicates forceful termination by a signal, and in a Kubernetes context, combined with the "OOMKilled" reason, it almost always means the memory limit was exceeded.
Confirming OOMKilled Happened
# Method 1: Check termination reason
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Output: OOMKilled (confirms it)
# Method 2: Check exit code
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Output: 137 (very likely OOMKilled, especially with reason above)
# Method 3: Describe pod and look for OOMKilled in events
kubectl describe pod my-pod
# Container status section shows (under Last State):
# State: Terminated
# Reason: OOMKilled
# Exit Code: 137
# Started: Mon, 27 Oct 2025 14:30:15 +0000
# Finished: Mon, 27 Oct 2025 14:32:47 +0000
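To sweep the whole cluster for recently OOMKilled containers instead of checking pods one at a time, a jsonpath query like the following can help (a sketch; narrow the namespace as needed):
# Method 4 (cluster-wide): list containers whose last termination reason was OOMKilled
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled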
Understanding Kubernetes Memory Metrics
Critical memory metrics explained:
- container_memory_working_set_bytes: THIS is what counts toward the OOMKill limit. Working set = RSS (anonymous memory) + page cache that cannot be reclaimed. When this reaches or exceeds limits.memory, an OOMKill triggers.
- container_memory_rss: Resident Set Size - anonymous memory (heap, stack, not backed by files). Always less than or equal to working set.
- container_memory_cache: Page cache memory used for filesystem caching. Mostly reclaimable but some counted in working set if actively used.
- container_memory_usage_bytes: Total memory including all caches. Higher than working set, NOT what triggers OOMKill (misleading metric!).
The confusion: usage vs working set
Many engineers check container_memory_usage_bytes, see 800Mi, and think they are fine with a 1Gi limit. But container_memory_working_set_bytes is already at the 1Gi limit, and the container gets OOMKilled. Always monitor working_set, not usage.
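To see this in practice, graph both metrics for the same pod and track the working set as a fraction of the limit; a sketch using the cAdvisor metric names (label sets can differ slightly between monitoring setups):
# Compare the two metrics for one pod (the container!="" filter drops the pod-level aggregate series)
container_memory_usage_bytes{pod="my-pod", container!=""}
container_memory_working_set_bytes{pod="my-pod", container!=""}
# Working set as a fraction of the configured limit: this is the number that matters for OOMKill
container_memory_working_set_bytes{pod="my-pod", container!=""}
  / container_spec_memory_limit_bytes{pod="my-pod", container!=""}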
The 8 Most Common Root Causes of OOMKilled
Cause 1: Memory Limit Set Too Low for Normal Operation
Symptoms: Application crashes during normal operation, memory usage steady at limit before crash, happens predictably under certain operations
How to diagnose:
# Check current memory limit
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].resources.limits.memory}'
# Output: 512Mi
# Check memory usage before crash (if Prometheus available)
# Query for last 24 hours before OOMKill:
container_memory_working_set_bytes{pod="my-pod"}
# Typical pattern: Usage steadily at 500-512Mi (98-100% of limit)
# Indicates limit too low, not a leak
Solution:
# Increase memory limit appropriately
kubectl set resources deployment/my-app \
  -c=my-container \
  --limits=memory=1Gi \
  --requests=memory=768Mi
# Rule of thumb: Set limit 20-30% above typical p95 usage
# If p95 usage is 750Mi, set limit to 1Gi (33% headroom)
Atmosly's Recommendation Example:
OOMKilled Analysis: database-primary-0
Current Config:
- Memory Limit: 512Mi
- Memory Request: 256Mi
Usage Analysis (30-day history):
- Average usage: 450Mi (88% of limit)
- P95 usage: 505Mi (99% of limit)
- P99 usage: 510Mi (99.6% of limit)
- OOMKills: 12 times in last 30 days
- Pattern: No leak detected (usage plateaus at ~500Mi)
Root Cause: Memory limit too low for normal database operation. During active query processing, memory naturally reaches 500-510Mi, hitting the limit.
Recommended Fix:
kubectl set resources statefulset/database \
  -c postgres \
  --limits=memory=768Mi \
  --requests=memory=512Mi
Rationale: P99 usage (510Mi) + 30% safety margin = 663Mi, rounded to 768Mi
Cost Impact: +$12/month per pod
Expected OOMKills after fix: 0 (adequate headroom)
Cause 2: Memory Leak in Application Code
Symptoms: Memory usage grows continuously over time, never plateaus, eventually hits limit, crashes, restarts, grows again, crashes again
How to diagnose:
# Query Prometheus for memory growth over time
container_memory_working_set_bytes{pod=~"my-app-.*"}
# Pattern indicating leak:
# Hour 0: 200Mi
# Hour 6: 350Mi
# Hour 12: 500Mi
# Hour 15: 512Mi → OOMKill → restart
# Hour 16: 200Mi (after restart)
# Hour 22: 350Mi (growing again)
# Continuous linear growth = memory leak
# Calculate growth rate (working set is a gauge, so use deriv rather than rate)
deriv(container_memory_working_set_bytes{pod="my-pod"}[1h])
# If consistently positive and never negative, probable leak
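Prometheus can also extrapolate the trend to estimate whether the pod will hit its limit soon; a sketch using predict_linear, where the 6-hour lookback and 4-hour horizon are arbitrary choices to tune:
# Projected working set 4 hours out, based on the last 6 hours of samples
predict_linear(container_memory_working_set_bytes{pod="my-pod", container!=""}[6h], 4 * 3600)
# Will the projection cross the configured limit?
predict_linear(container_memory_working_set_bytes{pod="my-pod", container!=""}[6h], 4 * 3600)
  > container_spec_memory_limit_bytes{pod="my-pod", container!=""}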
Solutions:
Short-term (band-aid):
# Increase limit to buy time
kubectl set resources deployment/my-app --limits=memory=2Gi
# This delays OOMKill but doesn't fix leak
Long-term (actual fix):
- Profile application memory: Use language-specific profilers (pprof for Go, jmap heap dumps analyzed in Eclipse MAT for Java, memory_profiler or tracemalloc for Python, heap snapshots or the heapdump package for Node.js)
- Identify leak source: Find objects accumulating without being freed (caches never cleaned, event listeners never removed, connections not closed)
- Fix code: Add proper cleanup, implement cache eviction, close resources, remove listeners
- Test: Run load tests monitoring memory over time, verify memory plateaus
Common leak patterns by language:
- Go: Goroutine leaks, slice/map growth without bounds, defer in loops
- Java: Static collections, ThreadLocal not cleaned, listeners not removed
- Python: Global caches, circular references preventing GC, unclosed file handles
- Node.js: Event listeners, unclosed connections, global array accumulation
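To make these patterns concrete, here is a minimal, hypothetical Python sketch of the most common variant, an unbounded in-process cache, next to a bounded version whose memory plateaus instead of leaking:
from collections import OrderedDict

# Leaky pattern: a module-level cache that only ever grows
_cache = {}

def get_user_leaky(user_id, fetch):
    if user_id not in _cache:
        _cache[user_id] = fetch(user_id)  # never evicted -> memory grows forever
    return _cache[user_id]

# Fixed pattern: bound the cache and evict the least recently used entry
class LRUCache:
    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get_or_fetch(self, key, fetch):
        if key in self._data:
            self._data.move_to_end(key)       # mark as recently used
            return self._data[key]
        value = fetch(key)
        self._data[key] = value
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)    # evict oldest entry, memory plateaus
        return value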
Atmosly's Memory Leak Detection:
Memory Leak Detected: api-server-deployment
Usage Pattern Analysis (7-day history):
- Memory growth rate: +18Mi per hour (continuous linear increase)
- Starting memory (after restart): 280Mi
- Projected time to OOMKill: 11 hours from restart
- OOMKill frequency: Every 10-12 hours consistently
- Leak confidence: 95% (high certainty based on pattern)
Leak Signature: Memory never plateaus, growth never stops, restart temporarily fixes (drops to 280Mi) but growth resumes immediately
Immediate Action: Increase limit to prevent crashes while leak is fixed
kubectl set resources deployment/api-server --limits=memory=2Gi
Permanent Fix Required: Profile application, identify leak source, fix code
Estimated leak location: Based on logs, likely unclosed database connections accumulating
Cost of band-aid: +$45/month (doubled memory)
Cost of not fixing: Continued crashes every 11 hours, customer impact, engineering time
Causes 3-8: Sudden memory spikes under load, too many concurrent requests, large file processing, inefficient caching, JVM heap misconfiguration, container shared memory, and gradual accumulation over days
[Each cause with detailed diagnostics, memory profiling techniques, and specific solutions...]
Systematic OOMKilled Debugging Process
Step 1: Confirm OOMKilled (Not Other Crash Reason)
# Check all three indicators
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Should show: OOMKilled
kubectl get pod my-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# Should show: 137
kubectl describe pod my-pod | grep -A 5 "Last State"
# Shows: Reason: OOMKilled, Exit Code: 137
Step 2: Check Current Memory Limit
# What limit is configured?
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].resources.limits.memory}'
# Example: 512Mi
# Also check request
kubectl get pod my-pod -o jsonpath='{.spec.containers[0].resources.requests.memory}'
# Example: 256Mi
Step 3: Analyze Historical Memory Usage
# If Prometheus available, query memory trends
# Last 24 hours before OOMKill
container_memory_working_set_bytes{pod="my-pod"}[24h]
# Calculate p95 and p99 usage
quantile_over_time(0.95, container_memory_working_set_bytes{pod=~"my-app-.*"}[7d])
quantile_over_time(0.99, container_memory_working_set_bytes{pod=~"my-app-.*"}[7d])
# If metrics-server installed (real-time only, no history)
kubectl top pod my-pod
Step 4: Determine If Memory Leak or Insufficient Limit
Memory Leak Indicators:
- Memory usage grows continuously over time
- Never plateaus or reaches steady state
- Growth rate is consistent (e.g., +15Mi per hour)
- After restart, starts low then grows again
- Pattern repeats predictably
Insufficient Limit Indicators:
- Memory usage plateaus at or near limit
- Usage stable at 480-500Mi when limit is 512Mi
- OOMKills happen during known high-memory operations (batch jobs, cache warming)
- No continuous growth; usage is stable until specific events
Step 5: Calculate Optimal Limits
Statistical approach:
# Get p95 usage over last 30 days
p95_usage = quantile_over_time(0.95, container_memory_working_set_bytes[30d])
# Calculate recommended limit
recommended_limit = p95_usage * 1.3 # 30% headroom for spikes
# Example:
# p95 usage = 650Mi
# recommended limit = 650Mi * 1.3 = 845Mi
# Round up to 1Gi for cleaner configuration
Rule of thumb:
- Stateless apps: p95 + 20% (smaller spikes expected)
- Stateful apps (databases, caches): p99 + 30% (larger variance)
- Batch jobs: max observed + 50% (highly variable)
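For teams that want to automate this calculation, here is a small Python sketch against the Prometheus HTTP API; the URL, pod selector, headroom factor, and 128Mi rounding are assumptions to adapt to your environment, and it requires the requests package:
import math
import requests

PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"  # assumption: adjust to your setup
QUERY = 'quantile_over_time(0.95, container_memory_working_set_bytes{pod=~"my-app-.*", container!=""}[30d])'
HEADROOM = 1.3  # 30% safety margin on top of p95

def recommended_limit_mib():
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=30)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    if not results:
        raise RuntimeError("no samples returned; check the pod selector")
    p95_bytes = max(float(r["value"][1]) for r in results)  # take the worst container in the set
    limit_mib = p95_bytes * HEADROOM / (1024 * 1024)
    return math.ceil(limit_mib / 128) * 128                 # round up to a 128Mi boundary

if __name__ == "__main__":
    print(f"recommended limits.memory: {recommended_limit_mib()}Mi")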
Fixing OOMKilled: Immediate and Long-Term Solutions
Immediate Fix: Increase Memory Limits
# Quick fix to stop crashes
kubectl set resources deployment/my-app \
  --limits=memory=1Gi \
  --requests=memory=768Mi
# Verify change applied
kubectl rollout status deployment/my-app
# Monitor to confirm OOMKills stop
kubectl get pods --watch
When to increase limits:
- Memory usage consistently at 90-100% of limit
- No evidence of memory leak (usage plateaus)
- Application legitimately needs more memory for its workload
Long-Term Fix: Find and Fix Memory Leaks
Memory Profiling by Language:
Go Applications:
# Enable pprof in the application (Go code):
#   import _ "net/http/pprof"
#   go func() { http.ListenAndServe(":6060", nil) }()
# Collect a heap profile
kubectl port-forward my-pod 6060:6060
go tool pprof http://localhost:6060/debug/pprof/heap
# Analyze top memory consumers inside the pprof console
(pprof) top10
(pprof) list <FunctionName>
Java Applications:
# Take heap dump when memory high
kubectl exec my-pod -- jmap -dump:live,format=b,file=/tmp/heap.bin 1
# Copy heap dump locally
kubectl cp my-pod:/tmp/heap.bin ./heap.bin
# Analyze with Eclipse MAT or VisualVM
# Look for: Objects with high retained size, memory leak suspects, duplicate strings
Python Applications:
# Use memory_profiler
pip install memory-profiler
# Decorate functions
@profile
def my_function():
    ...  # function code
# Run with profiler
python -m memory_profiler app.py
# Or use tracemalloc (built-in Python 3.4+)
import tracemalloc
tracemalloc.start()
# ... run code ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)
Node.js Applications:
# Run the app with the inspector enabled (requires a restart), then connect a debugger for heap snapshots
kubectl exec my-pod -- node --expose-gc --inspect=0.0.0.0:9229 app.js
kubectl port-forward my-pod 9229:9229
# Connect Chrome DevTools (chrome://inspect) to take heap snapshots, or analyze locally with clinic.js
npm install -g clinic
clinic doctor -- node app.js
Memory Optimization Best Practices
1. Set Memory Requests and Limits Appropriately
resources:
requests:
memory: 512Mi # Guaranteed allocation (for scheduling)
limits:
memory: 768Mi # Maximum before OOMKill (50% headroom)
Best practices:
- Request = typical steady-state usage
- Limit = p95 or p99 usage + 20-30% safety margin
- Limit should be 1.5-2x request (not 10x)
- Always set both request and limit explicitly (a limit without a request makes the request default to the limit; a request without a limit leaves memory usage uncapped)
2. Monitor Memory Usage Over Time
# Prometheus alert for high memory usage (warning before OOMKill)
- alert: HighMemoryUsage
expr: |
(container_memory_working_set_bytes
/
container_spec_memory_limit_bytes)
> 0.9
for: 10m
annotations:
summary: "Pod {{ $labels.pod }} using {{ $value | humanizePercentage }} of memory limit"
3. Implement Graceful Degradation
Instead of crashing when memory is exhausted, applications should (see the sketch after this list):
- Clear caches when memory pressure is high
- Reject new requests, returning 503 Service Unavailable
- Implement circuit breakers
- Use bounded queues and buffers (not unlimited ones)
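A minimal, hypothetical Python sketch of the bounded-queue idea: accept work only while the queue has room, and shed load with a 503 once it is full rather than letting in-flight work grow without bound:
import queue

# Bounded buffer: at most 1000 in-flight jobs are ever held in memory
work_queue = queue.Queue(maxsize=1000)

def enqueue_request(job):
    """Returns an HTTP-style status: 202 if accepted, 503 if shedding load."""
    try:
        work_queue.put_nowait(job)
        return 202
    except queue.Full:
        # Queue (and therefore memory) is at its bound: reject instead of growing
        return 503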
4. Use Horizontal Pod Autoscaling
# Scale pods based on memory usage
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: my-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: my-app
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 70 # Scale when avg memory > 70%
Distributes load across more pods, keeping individual pod memory usage lower.
5. Implement Memory Limits in Application Code
Java JVM:
# Set max heap based on container limit
# Container limit: 1Gi
# JVM max heap: 768Mi (75% of limit, leave room for non-heap memory)
JAVA_OPTS="-Xmx768m -Xms512m"
Node.js:
# Set V8 max old space
# Container limit: 1Gi = 1024Mi
# V8 limit: ~768Mi
node --max-old-space-size=768 app.js
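Go (1.19 and later) supports a soft memory limit via the GOMEMLIMIT environment variable, which makes the garbage collector work harder as the process approaches the ceiling; a sketch, leaving headroom below the container limit:
# Container limit: 1Gi
# Go soft memory limit: ~900MiB (leave room for non-Go memory such as cgo allocations and stacks)
GOMEMLIMIT=900MiB ./app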
How Atmosly Prevents and Diagnoses OOMKilled
Proactive Memory Monitoring
Atmosly monitors memory usage continuously and alerts BEFORE OOMKill:
- Warning at 85%: "Pod approaching memory limit (85%), OOMKill risk in 2-4 hours at current growth rate"
- Critical at 95%: "Pod at 95% memory limit, OOMKill imminent within minutes"
- Leak Detection: "Memory growing +12Mi/hour, probable leak, will OOMKill in 8 hours"
Post-OOMKill Analysis
When OOMKill happens, Atmosly automatically:
- Retrieves 24h memory usage history from Prometheus
- Calculates statistical patterns (mean, p50, p95, p99, growth rate)
- Determines if leak (continuous growth) or spike (sudden increase)
- Calculates optimal limit based on usage + appropriate headroom
- Shows cost impact of recommended limit increase
- Provides kubectl command to apply fix
- If leak detected, suggests profiling and provides language-specific profiling commands
Cost-Aware Recommendations
Atmosly always shows cost impact:
- "Increasing from 512Mi to 1Gi adds $18/month per pod"
- "Current pod has 5 replicas = $90/month total cost increase"
- "Alternative: Reduce replicas from 5 to 4 and increase memory = net $0 cost change"
Enables informed decision-making balancing reliability and cost.
Conclusion: Master OOMKilled Prevention and Resolution
OOMKilled means the container exceeded its memory limit and was forcefully terminated. Fix it by: confirming it was OOMKilled (exit code 137, reason OOMKilled), analyzing memory usage patterns (leak vs. spike vs. insufficient limit), increasing limits appropriately (p95/p99 + 20-30%), fixing memory leaks, if detected, through profiling, and monitoring proactively to prevent future OOMKills.
Key Takeaways:
- OOMKilled = memory limit exceeded, kernel kills process
- Exit code 137 = SIGKILL, usually indicates OOMKilled
- Monitor container_memory_working_set_bytes (what triggers OOMKill)
- Set limits = p95 usage + 30% headroom
- Memory leak = continuous growth never plateauing
- Profile and fix leaks, don't just increase limits indefinitely
- Atmosly detects leaks automatically with growth rate analysis
- Balance cost vs reliability when sizing memory limits
Ready to prevent OOMKilled with AI-powered memory analysis? Start your free Atmosly trial for proactive memory monitoring, leak detection, and optimal limit recommendations with cost impact.