Atmosly AI Copilot

Kubernetes Troubleshooting with AI: Atmosly Copilot Guide (2025)

Complete guide to Kubernetes troubleshooting with Atmosly's AI Copilot: learn how AI automates debugging (CrashLoopBackOff, OOMKilled, performance issues), ask questions in natural language, get instant root cause analysis correlating logs + metrics + events, and reduce MTTR by 97% with AI-powered investigation.

Introduction: The Evolution from Manual Debugging to AI-Powered Troubleshooting

Kubernetes troubleshooting has traditionally been a time-consuming, expertise-intensive process. It requires deep knowledge of container orchestration, distributed systems, Linux internals, and networking fundamentals, plus the ability to correlate information across multiple data sources: pod status and events from the Kubernetes API, application logs from container stdout/stderr, resource metrics from Prometheus or metrics-server, node conditions and capacity from the kubelet, network connectivity from CNI plugins, and recent cluster changes from audit logs. Experienced Site Reliability Engineers spend 30-90 minutes investigating a single production incident: running dozens of kubectl commands, querying Prometheus manually with complex PromQL, scrolling through thousands of log lines searching for relevant errors, cross-referencing timestamps across different systems to correlate cause and effect, and piecing together the complete story like detectives examining crime-scene evidence from multiple forensic sources, all while under pressure, with customers experiencing outages and business impact accumulating with every minute of downtime.

For teams without deep Kubernetes expertise, which includes most development teams, junior DevOps engineers, and organizations new to container orchestration, troubleshooting is even more challenging. It often takes hours or requires escalation to senior engineers who become bottlenecks, creating a dependency that slows incident resolution and prevents team scaling. The fundamental problem is that effective Kubernetes troubleshooting requires not just knowing what commands to run, but understanding what the output means, knowing where to look next based on what you find, recognizing patterns from experience with similar incidents, and having the mental model to correlate distributed information into a coherent root cause analysis: skills that take months or years to develop through hands-on experience with production incidents.

Atmosly AI Copilot represents a fundamental paradigm shift in how Kubernetes troubleshooting works, transforming it from manual investigation requiring deep expertise into conversational problem-solving accessible to anyone who can describe an issue in plain English. Instead of requiring engineers to know dozens of kubectl commands, understand PromQL syntax, manually correlate logs with metrics, and build mental models of system behavior, Atmosly's AI does all of this automatically. You simply ask questions in natural language like "Why is my payment-service pod crashing?" or "What's causing high latency in the API?" and the AI performs a comprehensive investigation: retrieving current pod status and container states from the Kubernetes API, fetching logs from current and previous container instances (handling the --previous flag automatically), querying Prometheus for relevant metrics over appropriate time windows, examining Kubernetes events for recent changes or failures, checking related resources such as databases, ConfigMaps, Secrets, and dependent services, correlating all data sources with precise timeline reconstruction showing causal relationships, analyzing error patterns using natural language processing and learned knowledge from thousands of Kubernetes incidents, and generating human-readable root cause analysis with specific remediation recommendations, including exact kubectl commands to execute, predicted recovery time, cost impact if changes require additional resources, and confidence scoring so you know how certain the AI is about its diagnosis.

This comprehensive guide teaches you everything about AI-powered Kubernetes troubleshooting with Atmosly Copilot, covering: how Atmosly's AI Copilot works architecturally, from natural language query processing through distributed data collection to root cause analysis generation; real-world troubleshooting scenarios with side-by-side comparisons of the traditional manual approach versus AI-automated investigation, showing time savings and improved accuracy; the types of questions you can ask and how to phrase queries for best results; understanding AI-generated root cause analysis reports, including timeline reconstruction and contributing factors; automatic correlation of metrics, logs, and events that manual investigation struggles with; proactive issue detection, where Atmosly identifies problems before you even ask, from CrashLoopBackOff to OOMKilled to resource exhaustion to deployment failures to networking issues; natural language querying without kubectl expertise or PromQL knowledge; cost intelligence integration showing resource optimization opportunities alongside troubleshooting; and advanced capabilities like asking follow-up questions, requesting alternative solutions, or getting explanations of Kubernetes concepts on demand during an investigation.

By understanding how to leverage Atmosly's AI Copilot effectively, you'll transform your team's incident response capabilities, reduce Mean Time To Resolution (MTTR) by 90-97%, enable junior engineers to troubleshoot issues that previously required senior SRE escalation, eliminate the need for extensive Kubernetes expertise to operate production clusters effectively, and free experienced engineers from repetitive troubleshooting to focus on strategic improvements and automation.

How Atmosly AI Copilot Works: Architecture and Intelligence

The AI-Powered Troubleshooting Workflow

When you ask Atmosly's AI Copilot a question, a sophisticated multi-stage process executes in seconds:

Stage 1: Natural Language Understanding

Atmosly uses AWS Bedrock (Claude, Nova) to parse your natural language query:

  • "Why is my payment-service crashing?" → Intent: Investigate pod crash, Target: payment-service deployment
  • "Show pods using too much memory" → Intent: Resource analysis, Filter: High memory usage
  • "What changed in the last hour?" → Intent: Change analysis, Time window: 1 hour

AI extracts: intent/action required, target resources (specific pod, deployment, namespace), time windows for investigation, and filters or constraints.

Stage 2: Smart Resource Resolution

Atmosly's smart resolver handles ambiguous queries:

  • "my payment pod" → Searches for pods matching "payment" in name or labels
  • If multiple matches, lists options: "Found 3 pods matching 'payment': payment-service-abc, payment-worker-xyz, payment-api-def. Which one?"
  • Remembers context from conversation for follow-ups
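
For comparison, a rough manual equivalent of this resolution step is a name and label search across namespaces; the "payment" keyword below comes from the example above, and the output format depends on your cluster:

# Find pods whose name or labels mention "payment" (manual equivalent of the smart resolver)
kubectl get pods -A --show-labels | grep -i payment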

Stage 3: Distributed Data Collection

AI automatically retrieves relevant data from multiple sources in parallel:

  • Kubernetes API: Pod status, container states, events, related resources (Services, ConfigMaps, Secrets, PVCs)
  • Container Logs: Current container logs, previous container logs (if crashed), last 200 lines or full logs if needed
  • Prometheus Metrics: Resource usage (CPU, memory, network, disk), application metrics (request rate, error rate, latency), time-series data over relevant windows (last hour, last 24h, last 7 days)
  • Cluster Events: Recent deployments, scaling events, node changes, OOMKills, image pulls

This happens in 2-5 seconds, significantly faster than manual investigation.
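
For reference, the manual equivalents of these parallel lookups look roughly like the following sketch; the pod name and namespace are placeholders from the examples in this guide:

# Pod status, container states, and related events
kubectl describe pod payment-service-xyz -n production
# Logs from the current container and the previously crashed one
kubectl logs payment-service-xyz -n production --tail=200
kubectl logs payment-service-xyz -n production --previous --tail=200
# Recent cluster events, newest last
kubectl get events -n production --sort-by=.lastTimestamp
# Resource usage snapshot (requires metrics-server)
kubectl top pod payment-service-xyz -n production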

Stage 4: AI Analysis and Correlation

AI correlates all data sources:

  • Timeline Reconstruction: Orders events chronologically showing cause-effect relationships
  • Pattern Recognition: Identifies known failure patterns ("connection refused" + database restart = timing issue)
  • Anomaly Detection: Compares current state to learned baselines
  • Root Cause Identification: Determines actual cause versus symptoms

Stage 5: Response Generation

AI generates human-readable response with:

  • Root cause explanation in plain English
  • Timeline of events leading to issue
  • Contributing factors
  • Impact assessment (services affected, requests failed)
  • Recommended fixes with specific kubectl commands
  • Alternative solutions if multiple approaches viable
  • Cost impact if fixes require resource changes
  • Recovery time estimate

Real-World Troubleshooting Scenarios: Manual vs AI

Scenario 1: Pod CrashLoopBackOff Investigation

Manual Approach (55 minutes):

14:30 - Alert: payment-service down
14:32 - kubectl get pods (identify crashing pod)
14:35 - kubectl describe pod payment-service-xyz (read events)
14:40 - kubectl logs payment-service-xyz --previous (examine logs)
14:48 - Search logs for errors (500 lines, find "connection refused")
14:52 - kubectl get pods | grep postgres (check database)
14:55 - kubectl logs postgres-0 (database logs)
15:00 - kubectl get events --sort-by=.lastTimestamp (recent events)
15:05 - Notice database restarted at 14:28 (2 min before payment crash)
15:08 - Query Prometheus for database metrics
15:12 - See database OOMKilled at 14:27
15:15 - Correlate manually: DB OOMKill → restart → payment connection failed → crash
15:18 - kubectl set resources statefulset/postgres --limits=memory=2Gi
15:25 - Verify recovery
TOTAL: 55 minutes

AI Copilot Approach (45 seconds):

14:30 - Alert: payment-service down
14:31 - Ask Atmosly: "Why is payment-service crashing?"
14:31:45 - AI responds with complete RCA:

ROOT CAUSE: PostgreSQL database OOMKilled at 14:27, restarted, 
entered recovery mode. Payment service attempted connection during 
recovery window at 14:30, received 'connection refused', crashed.

TIMELINE:
- 14:27:12 Database memory: 510Mi / 512Mi limit (99.6%)
- 14:27:14 Database OOMKilled (exit code 137)
- 14:27:20 Database restarted
- 14:28:00 PostgreSQL recovery mode (2 min duration typical)
- 14:30:05 Payment service connection attempt
- 14:30:07 Connection refused (DB still recovering)
- 14:30:09 Payment service crashed (exit 1)

FIX: kubectl set resources statefulset/postgres --limits=memory=1Gi
Cost: +$25/month
Recovery: ~60 seconds

14:33 - Apply fix
14:34 - Services recovered
TOTAL: 4 minutes (with only 2 minutes active engineering work)

Time savings: 93% reduction (55 min → 4 min)
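
After applying the recommended memory increase, recovery can be verified with standard commands such as the ones below; this is a sketch, and the app label selector is an assumption that should match your own deployment's labels:

# Confirm the database rollout completed and the pod is stable
kubectl rollout status statefulset/postgres
# Watch the payment service return to Running with restarts no longer climbing
kubectl get pods -l app=payment-service -w
# Check that the last termination reason is no longer OOMKilled
kubectl get pod postgres-0 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'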

Scenario 2: Performance Degradation Investigation

Question: "Why is the API slow since 3 PM?"

Atmosly AI Analysis (60 seconds):

Latency Increase Root Cause Identified

Issue: API p95 latency increased from 200ms to 2.5s starting at 15:03

Root Cause: New deployment at 15:00 introduced N+1 query problem, increasing database load 10x. Database CPU saturated at 95%, query queue depth spiked to 500.

Evidence:

  • Deployment: api-backend v2.1.0 rolled out at 15:00:15
  • Application logs: 10x increase in database queries per request
  • Database CPU: 40% → 95% at 15:03
  • Database query latency: 15ms → 180ms
  • API latency: 200ms → 2.5s

Recommended Fix:

# Immediate: Rollback problematic deployment
kubectl rollout undo deployment/api-backend

# Long-term: Fix N+1 query in code
# File: api/handlers/orders.py
# Line 142: Query in loop (N+1 pattern detected)
# Solution: Use join or eager loading

Recovery: 2-3 minutes after rollback
User impact: ~1,200 slow requests during 8-minute incident

Manual investigation would take 60-90 minutes checking application logs, database metrics, deployment history, code changes in git, and performance profiling. AI does it in 60 seconds with code-level insights.
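
The manual version of the immediate remediation and its verification would look roughly like this; the deployment name is taken from the scenario above, and revision numbers depend on your own rollout history:

# Inspect recent revisions to confirm v2.1.0 is the suspect rollout
kubectl rollout history deployment/api-backend
# Roll back to the previous revision and wait for it to become available
kubectl rollout undo deployment/api-backend
kubectl rollout status deployment/api-backend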

Types of Questions You Can Ask Atmosly AI Copilot

Pod and Container Issues

  • "Why is my frontend-web pod in CrashLoopBackOff?"
  • "Show me pods that are restarting frequently"
  • "What's wrong with payment-service-abc123?"
  • "Why can't my pod start?"
  • "Which pods are unhealthy in production namespace?"

Resource and Performance Questions

  • "Show pods using more than 90% memory"
  • "Which deployments are wasting the most CPU?"
  • "Why is my application slow?"
  • "What's causing high latency in the API?"
  • "Show me pods with OOMKilled errors"

Deployment and Change Analysis

  • "What changed in the last hour?"
  • "Why did my deployment fail?"
  • "Show recent deployments in production"
  • "What's different between staging and production?"

Network and Connectivity

  • "Why can't my pod connect to Redis?"
  • "Test connectivity from frontend to backend"
  • "Show me network policies blocking traffic"
  • "Why is DNS resolution failing?"

Cost Optimization

  • "Show me the most expensive pods"
  • "Which namespaces are wasting resources?"
  • "How can I reduce costs in staging?"
  • "Find over-provisioned deployments"

Atmosly AI Copilot Features and Capabilities

Feature 1: Automatic Health Monitoring (30-Second Detection)

Atmosly continuously monitors your cluster, detecting issues proactively:

Automatically Detected Issues (20+ patterns):

  • CrashLoopBackOff: Container crash loops detected within 30 seconds
  • ImagePullBackOff: Image pull failures identified immediately with specific registry error
  • OOMKilled: Memory limit violations with usage trend analysis
  • Pending Pods: Scheduling failures with node-by-node analysis
  • Failed Deployments: Rollout issues with replica availability problems
  • High Restart Count: Pods restarting frequently indicating instability
  • Resource Exhaustion: Nodes or pods approaching capacity limits
  • Failed Health Checks: Liveness or readiness probe failures
  • Config Errors: Missing ConfigMaps or Secrets
  • Networking Issues: DNS failures, connection timeouts, blocked traffic

Each detection triggers automatic investigation and RCA generation.
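
As a point of comparison, spotting even a few of these patterns by hand requires queries like the ones below; this is a sketch of manual checks, not Atmosly's internal detection logic:

# Pods that are not Running or Succeeded across all namespaces
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# Containers whose last termination reason was OOMKilled
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled
# Recent warning events (probe failures, image pull errors, scheduling issues)
kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp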

Feature 2: Root Cause Analysis with Timeline Reconstruction

When issues are detected, the AI generates a comprehensive RCA:

RCA Components:

  • Root Cause Statement: Clear explanation of actual problem ("PostgreSQL OOMKilled causing payment service crashes")
  • Timeline: Chronological sequence of events showing causal relationships (database crash 2 min before payment crash)
  • Contributing Factors: Additional conditions that enabled or worsened the problem
  • Impact Assessment: Quantified impact (services affected, requests failed, duration, customer exposure)
  • Recommended Actions: Prioritized fix steps with specific commands
  • Prevention: How to prevent recurrence (better limits, retry logic, alerting)

Feature 3: Natural Language to kubectl Translation

AI translates your questions into appropriate kubectl commands:

"Show me pods in production namespace"kubectl get pods -n production

"Get logs from the last crashed container"kubectl logs payment-service-xyz --previous

"List deployments with zero available replicas"kubectl get deployments -o json | jq '.items[] | select(.status.availableReplicas==0)'

You don't need to remember kubectl syntax—just ask naturally.

Feature 4: Prometheus Query Generation

AI writes PromQL queries for you:

"Show memory usage for payment-service over last 24 hours"

AI generates and executes:

container_memory_working_set_bytes{
  pod=~"payment-service-.*",
  namespace="production"
}[24h]

Returns human-readable result: "Payment service memory ranged from 450Mi to 890Mi over last 24h, averaging 680Mi"
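
If you wanted to reproduce that summary yourself against the Prometheus HTTP API, the aggregations would look roughly like this; the Prometheus URL is a placeholder and these are illustrative queries, not Atmosly's internal ones:

# Max and average working-set memory for payment-service over the last 24h
curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=max_over_time(container_memory_working_set_bytes{pod=~"payment-service-.*",namespace="production"}[24h])'
curl -s 'http://prometheus.monitoring:9090/api/v1/query' \
  --data-urlencode 'query=avg_over_time(container_memory_working_set_bytes{pod=~"payment-service-.*",namespace="production"}[24h])'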

Feature 5: Cost Intelligence Integration

Every resource recommendation includes cost impact:

"Should I increase memory for database?"

AI response:

Yes, recommend increasing database memory from 1Gi to 2Gi.

Reasoning: Current usage p95 is 950Mi (about 93% of the 1Gi limit). OOMKilled 3 times in the last 7 days when usage spiked to 1.02Gi.

Cost Impact: +$35/month for primary replica + $35/month for standby = $70/month total

Performance Benefit: Eliminates OOMKills (zero downtime from crashes), provides headroom for query caching (improves performance 15-20%)

Recommendation: Yes, cost is justified by reliability improvement
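
Applying that recommendation manually is a one-liner; keeping the request equal to the limit below is an assumption, so adjust it to your scheduling needs, and the app label is illustrative:

# Raise the database memory limit from 1Gi to 2Gi (request kept aligned with the limit)
kubectl set resources statefulset/postgres --limits=memory=2Gi --requests=memory=2Gi
# Confirm current usage against the new limit once the pod restarts
kubectl top pod -l app=postgres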

Feature 6: Proactive Alerting with Context

Instead of just "pod is down" alerts, Atmosly provides full context:

Slack Alert Example:

🔴 CRITICAL: Payment Service Crashed (CrashLoopBackOff)

What: payment-service-7d9f8b-xyz in CrashLoopBackOff (4 restarts)

Why: Database connection timeout. PostgreSQL OOMKilled at 14:27, restarted, recovery took 2 minutes. Payment service connected during recovery, got "connection refused", crashed.

Impact: Payment processing down 8 minutes, ~15 transactions failed

Fix: kubectl set resources statefulset/postgres --limits=memory=1Gi

Status: Auto-recovering (DB stable at 14:35, payment service will recover in ~60s)

The engineer reads the message, confirms auto-recovery is underway, and takes no further action, or applies the fix if auto-recovery doesn't happen.

Advanced AI Copilot Capabilities

Conversational Context and Follow-ups

AI remembers conversation context:

You: "Why is payment-service crashing?"
AI: [Provides RCA about database OOMKill]

You: "Show me database memory usage over last week"
AI: [Understands 'database' refers to postgres from previous context, shows metrics]

You: "What's the optimal memory limit?"
AI: [Knows you're still talking about postgres, calculates optimal limit: 1.2Gi based on p99 usage]

You: "Apply that fix"
AI: [Generates kubectl command, asks confirmation]

You: "Yes, apply it"
AI: [Executes: kubectl set resources statefulset/postgres --limits=memory=1.2Gi]

Natural conversation flow, no repeating context.

Confirmation for Destructive Actions

AI asks confirmation before executing changes:

"Delete all pods in dev namespace"

AI: "⚠️ This will delete 23 pods in dev namespace. This action cannot be undone. Type 'confirm' to proceed or 'cancel' to abort."

Safety mechanism prevents accidental destructive operations.
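
The same caution applies when running the equivalent command by hand; a client-side dry run is a cheap way to preview the blast radius before committing:

# Preview what would be deleted without actually deleting anything
kubectl delete pods --all -n dev --dry-run=client
# Only then run the real deletion
kubectl delete pods --all -n dev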

Explaining Kubernetes Concepts

AI can teach while troubleshooting:

"What is OOMKilled?"

AI provides clear explanation with examples, relates to current issue if investigating one, and offers to show examples in your cluster.

Multi-Cluster Support

If managing multiple clusters:

"Show me all CrashLoopBackOff pods across all clusters"

AI aggregates data from all connected clusters, groups by cluster, and prioritizes by severity.
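
Without an aggregating layer, the manual equivalent is to loop over your kubeconfig contexts; context names come from your own kubeconfig:

# List CrashLoopBackOff pods in every configured cluster context
for ctx in $(kubectl config get-contexts -o name); do
  echo "== $ctx =="
  kubectl --context "$ctx" get pods -A | grep CrashLoopBackOff
done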

Comparison: Traditional Troubleshooting vs Atmosly AI

Traditional manual troubleshooting vs Atmosly AI Copilot:

  • Detection Time: 5-45 minutes manually (dashboard review) vs 30 seconds with Atmosly (automatic)
  • Investigation Time: 30-90 minutes manually (multiple kubectl commands, log analysis, metric queries) vs 10-45 seconds with Atmosly (AI retrieves and analyzes automatically)
  • Required Knowledge: deep kubectl expertise, PromQL, and Kubernetes architecture manually vs none with Atmosly (ask in natural language)
  • Data Correlation: manual cross-referencing (error-prone, time-consuming) vs automatic correlation of logs + metrics + events
  • Root Cause: sometimes identified, sometimes guessed vs AI-identified with a confidence score
  • Fix Recommendations: engineer must determine them from the diagnosis vs specific kubectl commands provided
  • Cost Awareness: no visibility into cost impact vs every recommendation includes cost ($X/month)
  • Typical MTTR: 60-120 minutes vs 5-15 minutes

Best Practices for Using Atmosly AI Copilot

1. Be Specific When Possible

Vague: "Something is broken"

Better: "Why is my payment-service in production crashing?"

Best: "Why is payment-service-7d9f8b-xyz pod in CrashLoopBackOff?"

2. Ask Follow-Up Questions

Use conversational flow:

"Why is my pod crashing?"
[AI provides RCA]
"Show me the logs"
[AI shows relevant log sections]
"What are the memory metrics?"
[AI shows Prometheus memory data]
"Increase the memory limit"
[AI provides kubectl command]

3. Request Explanations

"Explain what OOMKilled means"

"Why does increasing memory cost more?"

"What is a liveness probe?"

AI provides educational context during troubleshooting.

4. Use for Cost Optimization

"Find pods wasting CPU in production"

"How can I reduce costs by 20%?"

"Show me idle resources"

AI analyzes usage patterns and recommends optimizations.
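
A quick manual starting point for the same questions, before asking the AI to do the deeper request-versus-usage analysis, is the following sketch (requires metrics-server):

# Top CPU and memory consumers across the cluster
kubectl top pods -A --sort-by=cpu
kubectl top pods -A --sort-by=memory
# Compare against what each pod actually requested
kubectl get pods -A -o custom-columns='NS:.metadata.namespace,POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'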

Conclusion: The Future of Kubernetes Troubleshooting is AI-Powered

Kubernetes troubleshooting has evolved from manual investigation requiring deep expertise to AI-powered analysis accessible to anyone. Atmosly AI Copilot reduces MTTR by 90-97%, enables teams without Kubernetes expertise to troubleshoot effectively, eliminates repetitive debugging work, and provides cost-aware recommendations.

Key Benefits:

  • 30-second automatic issue detection vs 5-45 minutes manual
  • Comprehensive RCA in seconds vs hours of investigation
  • Natural language queries vs kubectl/PromQL expertise
  • Automatic correlation vs manual cross-referencing
  • Cost-aware recommendations vs blind resource increases
  • Proactive alerting vs reactive firefighting

Ready to transform your Kubernetes troubleshooting from hours to seconds? Start your free Atmosly trial and ask your AI Copilot your first troubleshooting question, no kubectl expertise required.

Frequently Asked Questions

What is Atmosly AI Copilot for Kubernetes?
  1. Natural-language troubleshooting: Atmosly AI Copilot replaces manual debugging (kubectl, PromQL, dashboards). Engineers can ask questions like “Why is my pod crashing?” and the AI automatically handles the investigation.
  2. Automated data retrieval:
    • Pod status and container state
    • Logs from current and previous container instances
    • Kubernetes events with timestamps
    • Prometheus metrics (CPU, memory, network, restarts)
    • Dependent resources (ConfigMaps, Secrets, Services, DB pods)
    • Recent cluster activities (deployments, node issues, OOMKills)
  3. Correlation & timeline reconstruction: AI correlates all data sources to build a precise sequence of events, identifying cause-effect relationships across workloads and dependencies.
  4. Root cause identification: Uses pattern recognition, anomaly detection, and model reasoning to determine the real root cause—e.g., OOMKilled, misconfig, failed dependency, image errors—along with a confidence score.
  5. Full RCA report generation:
    • Plain-language explanation of the issue
    • Detailed event timeline
    • Contributing factors
    • Impact analysis (services, transactions, downtime)
  6. Actionable remediation:
    • Specific kubectl commands to fix the issue
    • Cost impact of recommended changes (e.g., increasing memory)
    • Risk assessment before applying the fix
  7. Recovery time estimation: AI predicts approximate recovery duration based on issue type, fix complexity, and cluster state.
  8. Automatic health monitoring: Detects 20+ Kubernetes issue patterns within 30 seconds, including:
    • CrashLoopBackOff
    • OOMKilled
    • ImagePullBackOff
    • Pending / Unschedulable
    • Probe failures
    • High latency / error spikes
  9. Conversational troubleshooting: Provides interactive follow-ups, remembers context, explains metrics and logs, and walks engineers through multi-step debugging.
  10. Proactive alerting: Sends Slack/Teams alerts with full RCA before any manual investigation begins, drastically reducing response time.
  11. Massive MTTR reduction:

    Traditional manual debugging: ~60 minutes. Atmosly AI analysis: ~45 seconds, a ~99% reduction in Mean Time To Resolution (MTTR).

  12. Powered by AWS Bedrock: Uses Claude and Amazon Nova models with Kubernetes-specific training for accurate, domain-aware diagnostics.
How does Atmosly AI Copilot reduce troubleshooting time?
  1. Manual Traditional Troubleshooting (60–120 minutes):
    • (1) Issue Detection (5–45 min): Engineers manually notice problems through dashboards or alerts.
    • (2) Data Gathering (10–20 min): Run multiple kubectl commands to collect pod status, logs, events.
    • (3) Metrics Investigation (15–30 min): Query Prometheus for CPU, memory, network, and latency metrics.
    • (4) Manual Correlation (10–20 min): Align timestamps and interpret logs, events, and metrics across multiple systems.
    • (5) Research & Diagnosis (10–30 min): Search error messages, documentation, and past incidents to determine root cause.

    Total manual troubleshooting time: 60–120 minutes.

  2. Atmosly AI Troubleshooting (≈45 seconds):
    • (1) Automatic Detection (≈30 sec): Health monitoring identifies issues instantly (CrashLoopBackOff, OOMKilled, ImagePullBackOff, etc.).
    • (2) Automated Data Collection (5–10 sec): Gathers pod status, logs (current + previous), events, Prometheus metrics, and dependency status in parallel.
    • (3) AI Analysis & Correlation (5–15 sec): AI reconstructs the timeline and finds cause-effect relationships across logs, events, and metrics.
    • (4) Instant RCA Output: Produces full root cause report with recommended kubectl fix commands and impact assessment.

    Total Atmosly time: ≈45 seconds → 99% faster.

  3. Time Savings Comparison:

    Traditional: ~60 minutes
    Atmosly: ~45 seconds
    99% reduction in troubleshooting time

  4. Real Example – Payment Service CrashLoopBackOff:
    • Traditional (55 minutes): Engineer ran kubectl describe, logs --previous, checked DB health, queried Prometheus, manually connected timeline, discovered DB OOMKill → restart → connection refused → payment crash.
    • Atmosly (45 seconds): AI detected crash, retrieved all logs/metrics, correlated DB OOMKill to payment service failure, and suggested the exact kubectl fix.
    • Engineer active time: 55 minutes → 5 minutes (simply reviewing and applying AI recommendations).
  5. Outcome:

    Atmosly enables junior engineers to resolve issues that previously required senior SRE expertise, dramatically accelerating incident resolution and improving reliability.

What types of questions can I ask Atmosly AI Copilot?
  1. Pod Issues (CrashLoop, restarts, failures):
    • “Why is my frontend pod crashing?”
    • “Show pods in CrashLoopBackOff.”
    • “What’s wrong with payment-service-abc123?”
    • “Which pods are restarting frequently?”
  2. Performance & Latency Analysis:
    • “Why is the API slow?”
    • “Show pods using >90% memory.”
    • “What’s causing high latency?”
    • “Which deployments waste the most CPU?”
  3. Resource Usage & Efficiency:
    • “Show me pods with OOMKilled errors.”
    • “Find over-provisioned deployments.”
    • “Which namespace uses the most resources?”
  4. Deployment & Change Tracking:
    • “What changed in the last hour?”
    • “Why did the deployment fail?”
    • “Show recent rollouts.”
  5. Networking & Connectivity:
    • “Why can’t the pod connect to Redis?”
    • “Test connectivity from frontend to backend.”
    • “Show network policies.”
  6. Cost Optimization:
    • “How can I reduce Kubernetes costs?”
    • “Show expensive pods.”
    • “Find idle resources.”
  7. Explanations & Education:
    • “What is OOMKilled?”
    • “Explain CrashLoopBackOff.”
    • “How do liveness probes work?”
  8. How Atmosly AI Copilot works:

    AI translates natural language into Kubernetes commands and Prometheus queries, retrieves data, correlates logs/events/metrics, and responds with clear explanations and actionable output.

    It supports conversational troubleshooting with context-aware follow-ups:

    • “Why is it crashing?” → RCA
    • “Show logs.” → Log output
    • “Fix it.” → kubectl patch/apply command
How does Atmosly AI Copilot correlate metrics, logs, and events automatically?
  1. Data Collection (parallel, 2–5 seconds):
    • Queries the Kubernetes API for pod status, conditions, events, owner references, and related resources.
    • Retrieves container logs from both the current and previous container instances.
    • Pulls Prometheus metrics over the appropriate time window (1h, 24h, 7d) depending on issue type.
    • Collects recent cluster changes including deployments, scaling events, node pressure, OOMKills, or restarts.
    • All sources are fetched in parallel for maximum speed.
  2. Timeline Reconstruction:
    • AI merges logs, metrics, and events into a unified timeline with precise timestamps.
    • Identifies causal chains (e.g., “DB OOMKilled at 14:27 → Payment service crash at 14:30”).
    • Builds clearly structured before / during / after sequences showing incident progression.
  3. Pattern Recognition:
    • Detects known log signatures (e.g., “connection refused” + recent restart = dependency startup timing issue).
    • Analyzes resource usage trends (e.g., linear memory increase = memory leak).
    • Identifies anomalies such as sudden traffic spikes, latency surges, or CPU bursts.
    • Distinguishes expected behavior from outliers using statistical and ML-based detection.
  4. Root Cause Analysis (RCA):
    • Determines the true root cause instead of surface-level symptoms.
    • Example: Symptom: “payment container crash.” Root cause: “database OOMKilled 2 minutes earlier.”
    • Filters out red herrings (e.g., high CPU unrelated to crash).
    • Assigns a confidence score (e.g., 95% certain root cause vs 60% probable).
  5. Synthesis & Remediation:
    • Generates a clean, human-readable RCA summarizing all findings.
    • Includes supporting evidence: log excerpts, event timestamps, metric graphs, dependency failures.
    • Provides specific remediation steps with kubectl commands and configuration changes.
  6. Example: Payment Service Crash
    • 14:27 — Database memory hits 510Mi/512Mi → OOMKilled event logged.
    • 14:27–14:29 — Database restarting; not accepting connections.
    • 14:30 — Payment service logs: “connection refused to postgres.”
    • 14:30 — Payment pod enters CrashLoopBackOff.
    • AI Conclusion: Database OOMKill caused downstream payment crash due to failed connections.
    • Time taken: 10 seconds vs 30–60 minutes manual correlation.
Can Atmosly AI Copilot help with cost optimization?
  1. Ask cost optimization questions naturally:
    • “Show expensive pods.”
    • “Find wasted resources.”
    • “How do we reduce costs by 30%?”
    • “Which deployments are over-provisioned?”
    • “What’s the cost per namespace?”

    Atmosly AI Copilot understands these queries and performs all underlying Kubernetes + cost analysis automatically.

  2. AI Resource Analysis (Using 7–30 days of Prometheus metrics):
    • (1) Compares actual CPU/memory usage vs requests vs limits across all pods.
    • (2) Detects over-provisioning and waste patterns:
      • Pods requesting 2 CPU but using 0.3 CPU (85% waste).
      • Staging environments running 24/7 with zero activity at night/weekends.
      • Idle pods with no traffic or work for 7+ days.
    • (3) Identifies workloads trending toward inefficiency (e.g., rising memory → possible leak).
  3. Precise Cost Impact Calculation:

    Atmosly translates resource waste into exact cloud cost numbers.

    Example:
    Pod requests 2 CPU but uses 0.4 CPU  
    = 1.6 wasted cores × $30/core/month  
    = $48 per pod/month × 10 replicas  
    = $480/month waste

    Makes cost overruns tangible and measurable.

  4. Optimization Recommendations (with kubectl fixes):
    • Provides exact resource right-sizing commands, e.g.:
    kubectl set resources deployment/frontend --requests=cpu=500m
    # saves ~$400/month
    • Ensures recommendations do not compromise performance or SLOs.
  5. Cost–Performance Tradeoffs:
    • Explains consequences of tuning decisions:
      • Increasing DB memory ($35/month) avoids OOMKills — recommended.