Core 01 · Cluster Operations

An AI SRE agent that fixes Kubernetes incidents — and shows its work.

Atmosly's AI SRE Agent watches pod health, metrics, events, and rollout history across every connected cluster. When something breaks, it finds the root cause in under a minute, explains its reasoning, and proposes ranked fixes you can apply or revert in one click — never a black box.

  • Read-only by default
  • Root cause in <1 min
  • Every fix is reversible

Active issues · live · 5 open
  • OOMKilled · payments-api · default · 47 reports · 3h · CRIT
  • CrashLoopBackOff · checkout-worker · orders · 21 reports · 2h · CRIT
  • ImagePullBackOff · inventory-sync · inventory · 8 reports · 37m · HIGH
  • Pending · analytics-job · analytics · 18 reports · 18m · HIGH
  • ConfigError · notifications-worker · platform · 12 reports · 23m · HIGH

payments-api · OOMKilled · AI-generated
Root cause
Memory limit 256Mi leaves almost no headroom above the working-set peak (~240Mi, 94% of the limit). Any small spike triggers a kernel OOM kill.
AI confidence: 0.82
FIX 1 Raise memory limit 256Mi → 768Mi
# Deployment: default/payments-api
resources:
  limits:
-   memory: 256Mi
+   memory: 768Mi
blast radius · single deployment · reversible
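
The 94% figure in the card above is plain arithmetic on the two numbers it quotes. A minimal Python sketch using only those figures; the function name is illustrative and not part of the product:

```python
def utilization(peak_mib: float, limit_mib: float) -> float:
    """Fraction of the memory limit consumed at the observed peak."""
    return peak_mib / limit_mib

peak, limit, proposed = 240, 256, 768   # MiB, taken from the card above

print(f"{utilization(peak, limit):.0%} of limit used, {limit - peak}Mi headroom")
# 94% of limit used, 16Mi headroom

print(f"{utilization(peak, proposed):.0%} of limit used after the fix")
# 31% of limit used after the fix
```

With only 16Mi of slack, a modest spike reaches the limit, which is why the proposed fix raises the limit rather than tuning anything else.
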
How it works

Detect, diagnose, fix — the full incident loop, automated.

Backstage and Port show you what's in the cluster. The AI SRE Agent does something about it — running the same loop a senior SRE would, on every issue, around the clock.

live · event stream
07:42:03 Pod payments-api exited 137 · OOMKilled
07:42:05 Correlating metrics + lastState.terminated
07:42:11 Reading 4h of container_status transitions
07:42:21 Root cause identified · confidence 0.82
01 — Detect

Always-on watch across every cluster

The agent ingests pod health, resource usage, traffic patterns, and the live event stream — continuously, on every connected cluster. No dashboards to babysit, no alerts to wire up.

  • OOMKills, CrashLoops, ImagePull, scheduling & config errors
  • Deduplicates repeat events into a single tracked issue
  • Severity-ranked so the on-call sees what matters first
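
Deduplication and severity ranking can be pictured as a group-by followed by a sort. The Python sketch below is one assumed way to do it, not the agent's actual implementation: the `Event` shape, the grouping key, and the `SEVERITY_RANK` table are hypothetical, and the counts echo the sample issues shown earlier on this page.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical ordering; the page only shows CRIT and HIGH.
SEVERITY_RANK = {"CRIT": 0, "HIGH": 1}

@dataclass(frozen=True)
class Event:
    namespace: str
    workload: str
    reason: str      # e.g. "OOMKilled"
    severity: str

def track_issues(events):
    """Collapse repeat events into one issue per (namespace, workload, reason)."""
    counts = Counter()
    severity = {}
    for e in events:
        key = (e.namespace, e.workload, e.reason)
        counts[key] += 1
        severity[key] = e.severity
    issues = [
        {"key": k, "reports": n, "severity": severity[k]}
        for k, n in counts.items()
    ]
    # Most severe first, then most-reported first.
    issues.sort(key=lambda i: (SEVERITY_RANK[i["severity"]], -i["reports"]))
    return issues

events = (
    [Event("inventory", "inventory-sync", "ImagePullBackOff", "HIGH")] * 8
    + [Event("default", "payments-api", "OOMKilled", "CRIT")] * 47
    + [Event("orders", "checkout-worker", "CrashLoopBackOff", "CRIT")] * 21
)

for issue in track_issues(events):
    print(issue["severity"], issue["key"][1], issue["reports"])
# CRIT payments-api 47
# CRIT checkout-worker 21
# HIGH inventory-sync 8
```
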
root-cause analysis
AI reasoning
Every restart correlates with the working set reaching 240Mi within 30s of ready, then a 137 exit. Deployment hasn't changed in 11 days — this is capacity-vs-load, not a regression.
Contributing factors
  • Request (128Mi) is half the limit, so the scheduler reserves far less memory than the pod actually uses
  • No HPA — one replica absorbs all traffic
02 — Diagnose

Root cause in under a minute, with its reasoning

The agent reads logs, metrics, and rollout history together — the way a senior engineer would — and writes up the actual cause, an AI confidence score, and the contributing factors. You see why, not just what.

  • Plain-English summary + technical root cause
  • Confidence score on every analysis — no false certainty
  • Cross-references ReplicaSet history to separate load from regressions
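
Separating load from regressions comes down to asking how recently the workload changed before it started failing. A minimal Python sketch of that check, assuming a simple time-window rule; the 24-hour `window`, the labels, and the example date are illustrative and are not documented product behavior:

```python
from datetime import datetime, timedelta

def classify(last_rollout: datetime, first_failure: datetime,
             window: timedelta = timedelta(hours=24)) -> str:
    """Label an incident by how recently the workload changed before it failed.

    `window` is an assumed threshold, not a documented one.
    """
    if timedelta(0) <= first_failure - last_rollout <= window:
        return "possible regression"
    return "capacity-vs-load"

first_failure = datetime(2024, 1, 12, 7, 42, 3)  # illustrative date

# Deployment unchanged for 11 days, as in the payments-api example.
print(classify(first_failure - timedelta(days=11), first_failure))
# capacity-vs-load

# Rollout two hours before the first crash.
print(classify(first_failure - timedelta(hours=2), first_failure))
# possible regression
```
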
remediation · ranked options
FIX 1 Rollback to v2.14.2
Previous revision ran 14 days with zero restarts. Safer than waiting on a fix while prod crashes.
single deployment · conf 0.91
FIX 2 Pin image digest to last-known-good
Defensive — prevents re-pulls of the broken tag if it's re-pushed.
03 — Fix

Ranked, reversible fixes — kubectl and YAML, ready to apply

Each issue comes with 1–3 ranked fix options, each with a rationale, confidence, blast radius, and preconditions the agent has already checked. Apply the recommended one, or open the PR and review the diff yourself.

  • Concrete artifacts: kubectl, YAML diff, action spec
  • Blast radius + preconditions shown before anything runs
  • Every action has a recorded rollback — revert in one click
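
To make "ranked" and "recorded rollback" concrete, here is a small Python sketch under stated assumptions. The `FixOption` fields mirror the attributes listed above, but the class, both functions, and the 0.60 confidence for the second option are hypothetical; only the 0.91 figure and the 256Mi → 768Mi change come from this page.

```python
from dataclasses import dataclass

@dataclass
class FixOption:
    title: str
    confidence: float        # 0..1
    blast_radius: str
    preconditions_ok: bool   # outcome of checks run before the option is offered

def rank(options):
    """Keep options whose preconditions passed, highest confidence first."""
    return sorted((o for o in options if o.preconditions_ok),
                  key=lambda o: o.confidence, reverse=True)

def apply_with_rollback(spec: dict, patch: dict):
    """Return (patched_spec, rollback_patch) for a flat dict of settings.

    The rollback patch holds the values that were overwritten, so applying
    it restores them. Keys that did not exist before are not covered here.
    """
    rollback = {k: spec[k] for k in patch if k in spec}
    return {**spec, **patch}, rollback

options = [
    FixOption("Pin image digest to last-known-good", 0.60, "single deployment", True),  # 0.60 is made up
    FixOption("Rollback to v2.14.2", 0.91, "single deployment", True),
]
print(rank(options)[0].title)
# Rollback to v2.14.2

limits = {"memory": "256Mi"}
patched, rollback = apply_with_rollback(limits, {"memory": "768Mi"})
reverted, _ = apply_with_rollback(patched, rollback)
print(patched, rollback, reverted)
# {'memory': '768Mi'} {'memory': '256Mi'} {'memory': '256Mi'}
```

Capturing the rollback patch before the change is applied is what makes a one-click revert possible without recomputing the previous state.
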
Coverage

The failure modes it handles on day one

Purpose-built diagnosis packs for the incidents that actually page your team — not a generic chatbot guessing at YAML.

OOMKilled

Detects working-set-vs-limit pressure, distinguishes a leak from organic growth, and sizes the new limit from observed p95.
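
One plausible way to size a limit from observed p95 is to multiply the percentile by a headroom factor and round up. The Python sketch below assumes a nearest-rank percentile, a 1.5x factor, and a 64Mi rounding step; all three are illustrative choices, the samples are made up, and the result is not meant to reproduce the 768Mi figure shown earlier.

```python
import math

def p95(samples):
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def suggest_limit_mib(samples, headroom=1.5, step=64):
    """p95 times an assumed headroom factor, rounded up to a multiple of `step` MiB."""
    target = p95(samples) * headroom
    return math.ceil(target / step) * step

# Made-up working-set readings in MiB.
samples = [180, 195, 210, 220, 225, 230, 232, 235, 238, 240]

print(p95(samples), suggest_limit_mib(samples))
# 240 384
```
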

CrashLoopBackOff

Pulls --previous logs, ties the panic to a recent rollout, and recommends a rollback to the proven revision.
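
For readers who want to see the manual equivalent, the two steps map onto standard kubectl commands: `kubectl logs --previous` and `kubectl rollout undo`. The Python sketch below only builds the command lines. The helper names and the pod name are hypothetical, and how the agent itself issues these calls is not described on this page.

```python
import shlex

def previous_logs_cmd(namespace, pod, container=None):
    """Logs from the previous (crashed) container instance of a pod."""
    cmd = ["kubectl", "logs", pod, "-n", namespace, "--previous"]
    if container:
        cmd += ["-c", container]
    return cmd

def rollback_cmd(namespace, deployment, revision=None):
    """Roll a Deployment back to the prior revision, or to a specific one."""
    cmd = ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace]
    if revision is not None:
        cmd.append(f"--to-revision={revision}")
    return cmd

# "checkout-worker-0" is a placeholder pod name.
print(shlex.join(previous_logs_cmd("orders", "checkout-worker-0")))
# kubectl logs checkout-worker-0 -n orders --previous

print(shlex.join(rollback_cmd("orders", "checkout-worker")))
# kubectl rollout undo deployment/checkout-worker -n orders
```
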

ImagePullBackOff

Spots stale pull secrets after a token rotation and walks you through refreshing them and rolling the pod.

Pending / unschedulable

Explains autoscaler cooldowns and capacity gaps, and recommends scaling the node group or right-sizing the request.

Config errors

Traces a missing ConfigMap back to the audit log, and offers a GitOps sync or a manual recreate when source-of-truth is known.

Learns from history

Remembers fixes that worked — "same fix resolved auth-api in 4m" — so confidence climbs with every resolved incident.

The 2am difference

What changes when the agent owns triage

<1 min
to root cause, vs. a 4-person Slack war room
94%
fewer 2am pages once triage is automated*
1-click
revert on every applied fix — nothing is one-way
24/7
coverage across every connected cluster

*Representative of customer-reported outcomes. Your results depend on workload mix and cluster size.

Questions

What teams ask before connecting a cluster

Does the SRE Agent need write access to my cluster?
No. The agent runs read-only by default. It detects and diagnoses on its own, then proposes fixes as kubectl commands or pull requests for you to apply. Any write action is opt-in, scoped, and gated by your guardrails — and every applied fix records a rollback.
Does it replace my monitoring stack?
No — it builds on it. The agent reads the signals you already collect (metrics, logs, events, rollout history) and acts on them. It coexists with Prometheus, Datadog, and your existing alerting rather than replacing them.
How do I know its fixes aren't just noise?
Every analysis carries an AI confidence score, and every fix shows its rationale, blast radius, and the preconditions the agent already checked. You see the reasoning before anything runs — and the agent gets more confident as it learns which fixes resolved similar incidents in your clusters.
Is it safe to run in production?
Yes. Read-only by default, every action reversible with a recorded rollback, a full audit trail, and RBAC + SSO. Nothing changes your cluster without you — or a guardrail you defined — approving it first.

See what the agent finds in your cluster.

Connect one cluster, read-only. Live issues, root causes, and ranked fixes show up on your dashboard in about five minutes. Free, no sales call.

Connect a cluster →
Book a 15-min walkthrough