How It Works

Autoheal is built around three core concepts: the Production Context Graph, Multiplayer AI Agents, and a Continuous Learning Loop. Together they investigate production issues using the same signals an experienced engineering team would consult.

Production Context Graph (PCG)

The Production Context Graph is the core of Autoheal. It's a continuously updated graph that connects four layers of your production environment:

Infrastructure

Services, dependencies, topology, deployment targets, and resource configurations. The PCG knows how your systems relate to each other.

Code

Repositories, recent deployments, pull requests, CI/CD pipelines, and change history. The PCG knows what changed and when.

Tools

Observability platforms, incident management systems, documentation wikis, and custom internal tools, all connected via MCP integrations.

Tribal Knowledge

Skills, past incident RCAs, service ownership, debugging procedures, and team expertise captured in the Production Context Graph.
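To make the four layers concrete, here is a minimal Python sketch of a graph that links nodes across them. It is an illustration only, not Autoheal's actual data model: the `Layer` enum, the `ContextGraph` class, the relation labels, and node names such as `checkout-service` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    INFRASTRUCTURE = "infrastructure"
    CODE = "code"
    TOOLS = "tools"
    TRIBAL_KNOWLEDGE = "tribal_knowledge"


@dataclass
class ContextGraph:
    """Toy graph (hypothetical): nodes tagged with a layer, labeled edges."""
    layers: dict = field(default_factory=dict)  # node name -> Layer
    edges: dict = field(default_factory=lambda: defaultdict(set))  # node -> {(relation, neighbor)}

    def add_node(self, name: str, layer: Layer) -> None:
        self.layers[name] = layer

    def link(self, a: str, relation: str, b: str) -> None:
        # Store the edge in both directions so context can be walked from either end.
        self.edges[a].add((relation, b))
        self.edges[b].add((relation, a))

    def context_for(self, name: str) -> dict:
        """Group a node's direct neighbors by the layer they belong to."""
        grouped = defaultdict(list)
        for _relation, neighbor in sorted(self.edges[name]):
            grouped[self.layers[neighbor]].append(neighbor)
        return dict(grouped)


graph = ContextGraph()
graph.add_node("checkout-service", Layer.INFRASTRUCTURE)
graph.add_node("deploy-example", Layer.CODE)
graph.add_node("latency-dashboard", Layer.TOOLS)
graph.add_node("checkout-runbook", Layer.TRIBAL_KNOWLEDGE)
graph.link("checkout-service", "changed_by", "deploy-example")
graph.link("checkout-service", "observed_in", "latency-dashboard")
graph.link("checkout-service", "documented_in", "checkout-runbook")

context = graph.context_for("checkout-service")
```

In this sketch, asking for the context of one service returns its related deployment, dashboard, and runbook in a single lookup, which is the kind of cross-layer question the text describes.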

How the PCG Self-Learns

The Production Context Graph evolves with every interaction:

Investigations add new connections between symptoms, root causes, and resolutions.
Humans enrich the graph through skill updates, direct additions to the graph, and agent feedback.
Agent actions reinforce successful diagnostic paths and deprioritize dead ends.
RCAs create durable links between failure modes and their causes.

The graph grows with use, so investigations get faster and more accurate over time.
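One way to picture "reinforce successful diagnostic paths and deprioritize dead ends" is a weight per symptom-cause pair that moves toward 1 when a cause is confirmed and toward 0 when it is ruled out. The sketch below assumes that simple update rule; the `DiagnosticPaths` class, the `step` parameter, and the 0.5 starting weight are hypothetical and not drawn from Autoheal's implementation.

```python
from collections import defaultdict


class DiagnosticPaths:
    """Toy store of (symptom, cause) weights updated from investigation outcomes."""

    def __init__(self, step: float = 0.2) -> None:
        self.step = step
        self.weights = defaultdict(lambda: 0.5)  # neutral starting weight for unseen pairs

    def record(self, symptom: str, cause: str, confirmed: bool) -> None:
        key = (symptom, cause)
        target = 1.0 if confirmed else 0.0
        # Move the weight a fraction of the way toward the observed outcome.
        self.weights[key] += self.step * (target - self.weights[key])

    def ranked_causes(self, symptom: str) -> list:
        candidates = [(c, w) for (s, c), w in self.weights.items() if s == symptom]
        return [c for c, _ in sorted(candidates, key=lambda cw: cw[1], reverse=True)]


paths = DiagnosticPaths()
paths.record("p99 latency spike", "connection pool exhaustion", confirmed=True)
paths.record("p99 latency spike", "bad deploy", confirmed=False)
ranking = paths.ranked_causes("p99 latency spike")
```

After these two outcomes, the confirmed cause ranks ahead of the ruled-out one for that symptom, so a later investigation of the same symptom would check it first.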

AI Agent Architecture

The Investigation Agent

The Investigation Agent is the core orchestrator. When an investigation starts, it coordinates a team of subagents and works through the following steps:

Assesses the situation. Reads the alert, ticket, or user query to understand the problem.
Plans the investigation. Determines which tools to query and in what order.
Gathers evidence. Queries integrations (Datadog, Grafana, GitHub, and others) through MCP.
Consults the PCG. Checks past incidents, runbooks, and service ownership for context.
Forms hypotheses. Generates ranked probable causes with supporting evidence.
Proposes fixes. Suggests remediations based on runbooks, past resolutions, or code-level changes.
Captures learnings. After resolution, records memories and surfaces proactive actions that feed back into the PCG.

The agent maintains full conversation context, so you can interact with it, redirect its investigation, or ask follow-up questions at any point.
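The steps listed above can be sketched as an ordered pipeline over shared investigation state. This is only a schematic under stated assumptions: the step names, the `Investigation` class, and `run_investigation` are hypothetical, and each step is a stub that records that it ran rather than doing real work.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Investigation:
    """Shared state passed from step to step (hypothetical structure)."""
    trigger: str
    log: List[str] = field(default_factory=list)


def make_step(name: str) -> Callable[[Investigation], None]:
    # Each stub only records that it ran; a real subagent would do the work.
    def step(inv: Investigation) -> None:
        inv.log.append(name)
    return step


# Step order mirrors the list in the text.
STEP_NAMES = (
    "assess", "plan", "gather_evidence", "consult_pcg",
    "form_hypotheses", "propose_fixes", "capture_learnings",
)
STEPS = [make_step(name) for name in STEP_NAMES]


def run_investigation(trigger: str, steps=STEPS) -> Investigation:
    inv = Investigation(trigger=trigger)
    for step in steps:
        step(inv)
    return inv


result = run_investigation("alert: checkout error rate above threshold")
```

Keeping the state in one object is what would let a user inspect or redirect the investigation between steps in a design like this.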

Fast vs Deep Investigation

You choose how much reasoning the Investigation Agent applies to a given investigation:

Fast. The agent pursues a single hypothesis for a faster time-to-root-cause. Best for well-understood alerts and routine checks where speed matters most.
Deep. The agent generates multiple hypotheses and cross-validates them against the evidence before presenting a conclusion. Best for ambiguous or high-stakes incidents where thoroughness matters more than speed.

Both modes produce the same kind of decision trace and run through the same adversarial review; the difference is how many hypotheses the agent explores.

The two AI SRE flows map onto these modes by default: Alert Intelligence uses a Fast investigation to surface causes in seconds across high alert volume, while Incident Response uses a Deep investigation to drive a declared incident to a validated root cause.
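The difference between the two modes can be illustrated with a small selection function. This is a sketch under simplifying assumptions, not the agent's real logic: the `Hypothesis` fields, the `prior` score, and the "supporting minus contradicting" scoring rule are hypothetical stand-ins for whatever ranking and cross-validation the agent performs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    cause: str
    prior: float         # initial ranking score before evidence is weighed
    supporting: int      # evidence items that support the hypothesis
    contradicting: int   # evidence items that contradict it


def investigate(hypotheses: List[Hypothesis], mode: str) -> Hypothesis:
    ranked = sorted(hypotheses, key=lambda h: h.prior, reverse=True)
    if mode == "fast":
        # Fast: commit to the single top-ranked hypothesis.
        return ranked[0]
    if mode == "deep":
        # Deep: weigh every hypothesis against the evidence, keep the best supported.
        return max(ranked, key=lambda h: h.supporting - h.contradicting)
    raise ValueError(f"unknown mode: {mode!r}")


candidates = [
    Hypothesis("recent deploy", prior=0.6, supporting=1, contradicting=2),
    Hypothesis("database saturation", prior=0.4, supporting=3, contradicting=0),
]
fast_pick = investigate(candidates, "fast")
deep_pick = investigate(candidates, "deep")
```

With this sample input the fast path settles on the highest-prior candidate, while the deep path ends on the candidate with stronger evidence, which shows why the deeper mode suits ambiguous incidents.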

Decision Traces

Every decision the agent makes is documented in a decision trace, a transparent record of:

What data the agent looked at
What hypotheses it considered
Why it pursued certain paths over others
What evidence supported or contradicted each hypothesis

Decision traces give your team a record of the agent's reasoning that you can audit after the fact.
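A decision trace with the four elements listed above could be represented as a small serializable record. The field names and the `DecisionTrace` and `HypothesisRecord` classes below are hypothetical; the actual trace format Autoheal produces is not specified here.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class HypothesisRecord:
    statement: str
    supporting_evidence: List[str] = field(default_factory=list)
    contradicting_evidence: List[str] = field(default_factory=list)
    pursued: bool = False
    rationale: str = ""  # why this path was or was not pursued


@dataclass
class DecisionTrace:
    data_examined: List[str] = field(default_factory=list)
    hypotheses: List[HypothesisRecord] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict recurses into nested dataclasses, so the whole trace serializes.
        return json.dumps(asdict(self), indent=2)


trace = DecisionTrace(
    data_examined=["error-rate metric", "deploy history"],
    hypotheses=[
        HypothesisRecord(
            statement="recent deploy introduced the regression",
            supporting_evidence=["errors began after the deploy"],
            pursued=True,
            rationale="timing matches the change",
        )
    ],
)
audit_copy = json.loads(trace.to_json())
```

Serializing to JSON is one simple way to keep a trace that can be stored and audited after the fact.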

Adversarial Agent Review

Before presenting findings, Autoheal runs an adversarial review process. A separate validation step challenges the primary agent's conclusions by:

Checking that every finding is backed by concrete evidence
Identifying gaps in the investigation
Flagging hypotheses that lack supporting data
Ensuring the root cause analysis is logically sound

Conclusions that survive this step reach you with their evidence intact; those that don't are sent back for more investigation.
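As a rough illustration of the evidence check, the sketch below splits findings into those that carry at least one piece of evidence and those that should go back for more investigation. It models only that one check; gap detection and logical-soundness review are not represented, and the `Finding` class and `adversarial_review` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Finding:
    claim: str
    evidence: List[str] = field(default_factory=list)


def adversarial_review(findings: List[Finding]) -> Tuple[List[Finding], List[Finding]]:
    """Return (accepted, sent_back), split by whether any evidence backs the claim."""
    accepted: List[Finding] = []
    sent_back: List[Finding] = []
    for finding in findings:
        (accepted if finding.evidence else sent_back).append(finding)
    return accepted, sent_back


findings = [
    Finding("deploy caused the error spike", ["errors began after the deploy"]),
    Finding("upstream provider outage"),  # no evidence attached
]
accepted, sent_back = adversarial_review(findings)
```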

Integration Layer (MCP)

Autoheal uses the Model Context Protocol (MCP), an open standard developed by Anthropic, to connect to your tools. MCP provides:

Standardized tool access. Every integration exposes specific capabilities such as querying metrics, searching logs, and reading code.
Security boundaries. The agent can only perform actions your integration explicitly permits.
Real-time data. No data syncing or ETL pipelines are involved; the agent queries your tools live during investigations.
Extensibility. Connect any tool by building a custom MCP server.

See Integrations for the full list of supported tools and setup guides.
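MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method with a tool name and arguments. The sketch below builds such a request and adds a client-side allowlist to illustrate the idea of a security boundary. The tool names (`query_metrics`, `search_logs`), the argument keys, and the allowlist itself are hypothetical; in practice the permitted capabilities are defined by each MCP server, and how Autoheal enforces them is not described here.

```python
import json

# Hypothetical allowlist: the only tools this integration exposes to the agent.
ALLOWED_TOOLS = {"query_metrics", "search_logs"}


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Build an MCP `tools/call` request (JSON-RPC 2.0) for a permitted tool."""
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool not permitted: {tool}")
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


request = build_tool_call(1, "query_metrics", {"service": "checkout", "window": "15m"})
```

A request for a tool outside the allowlist, such as a hypothetical `restart_service`, raises `PermissionError` before anything is sent.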

Investigation Flow

Here's how a typical investigation flows through the system:

Trigger

An investigation starts from a PagerDuty alert, Datadog monitor, Sentry exception, Slack message, support ticket, or manual input in the Autoheal UI.

Context Loading

The agent loads relevant context from the PCG: service ownership, related runbooks, past incidents for this service, recent deployments, and known dependencies.

Evidence Gathering

The agent queries your connected integrations, pulling metrics, logs, traces, deployment history, and error data. It builds a timeline of events.
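Building a timeline amounts to merging events from several sources and ordering them by timestamp. The sketch below assumes each source yields `(timestamp, source, description)` tuples; that shape, the `build_timeline` function, and the sample events (including their dates) are made up for illustration.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Tuple

Event = Tuple[datetime, str, str]  # (timestamp, source, description)


def build_timeline(*sources: Iterable[Event]) -> List[Event]:
    """Merge events from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


def at(hour: int, minute: int) -> datetime:
    # Arbitrary sample date; only the relative ordering matters here.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


deploys = [(at(10, 2), "deploys", "checkout deployed")]
metrics = [
    (at(10, 5), "metrics", "error rate rises"),
    (at(9, 58), "metrics", "baseline normal"),
]
logs = [(at(10, 4), "logs", "first timeout logged")]

timeline = build_timeline(deploys, metrics, logs)
```

Ordering the merged events makes it easy to see that, in this sample, the deployment precedes the first timeout and the error-rate rise.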

Hypothesis Formation

Based on evidence and PCG context, the agent generates ranked hypotheses. Each hypothesis carries its reasoning and supporting evidence, recorded in a decision trace.

Adversarial Review

Findings are validated through adversarial review to ensure they're evidence-backed and logically sound.

Fix Proposals

The agent proposes mitigating fixes such as rollbacks, restarts, config changes, and scaling actions. For code-level issues, it surfaces preventive fixes for your team to review.

Memories & Proactive Actions

After resolution, Autoheal captures a memory of what happened and how it was resolved, and surfaces proactive actions, such as tuning a noisy alert or fixing a recurring root cause, to prevent repeat incidents.

Learning

Memories, proactive actions, and investigation findings feed back into the Production Context Graph, making future investigations faster and more accurate.

Continuous Learning Loop

Autoheal is designed around a continuous improvement cycle:

  Alert / Ticket / Query
        ↓
  Agent Investigates (using PCG + Integrations)
        ↓
  Team Reviews & Resolves
        ↓
  Memories & Proactive Actions Captured → PCG Updated
        ↓
  Next Investigation Is Faster & More Accurate

The loop compounds over time:

Runbooks that worked get referenced more.
Hypotheses that were wrong get deprioritized.
Service relationships discovered during investigations enrich the graph.
Resolution patterns become reusable across similar incidents.

Your institutional knowledge stops living in people's heads and starts compounding in the system.
