
How It Works

Autoheal is built around three core concepts: the Production Context Graph, multiplayer AI agents, and a continuous learning loop. Together they form a platform that investigates production issues the way a senior engineering team would — but without the context-switching, fatigue, or knowledge gaps.

Production Context Graph (PCG)

The Production Context Graph is Autoheal's foundational technology. It is a continuously updated graph that connects four layers of your production environment:

Infrastructure

Services, dependencies, topology, deployment targets, and resource configurations. The PCG knows how your systems relate to each other.

Code

Repositories, recent deployments, pull requests, CI/CD pipelines, and change history. The PCG knows what changed and when.

Tools

Observability platforms, incident management systems, documentation wikis, and custom internal tools — all connected via MCP integrations.

Tribal Knowledge

Skills, past incident RCAs, service ownership, debugging procedures, and team expertise captured in the Production Context Graph.
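As a rough illustration of the four layers, the sketch below models a tiny graph whose nodes are tagged by layer. It is not Autoheal's data model: the class names, node ids, and the undirected-edge choice are all assumptions made for the example.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    INFRASTRUCTURE = "infrastructure"
    CODE = "code"
    TOOLS = "tools"
    TRIBAL_KNOWLEDGE = "tribal_knowledge"


@dataclass(frozen=True)
class Node:
    id: str
    layer: Layer


class ContextGraph:
    """Toy undirected graph whose nodes each belong to one of four layers."""

    def __init__(self) -> None:
        self.nodes: dict[str, Node] = {}
        self.edges: dict[str, set[str]] = defaultdict(set)

    def add_node(self, node_id: str, layer: Layer) -> None:
        self.nodes[node_id] = Node(node_id, layer)

    def link(self, a: str, b: str) -> None:
        # Require both endpoints so a typo cannot create a phantom node.
        if a not in self.nodes or b not in self.nodes:
            raise KeyError("both nodes must be added before linking")
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbors(self, node_id: str, layer: Layer | None = None) -> list[str]:
        """Return sorted neighbor ids, optionally restricted to one layer."""
        return sorted(
            n for n in self.edges.get(node_id, set())
            if layer is None or self.nodes[n].layer is layer
        )


# Hypothetical nodes, one per layer.
graph = ContextGraph()
graph.add_node("checkout-service", Layer.INFRASTRUCTURE)
graph.add_node("deploy-123", Layer.CODE)
graph.add_node("latency-dashboard", Layer.TOOLS)
graph.add_node("checkout-runbook", Layer.TRIBAL_KNOWLEDGE)
for other in ("deploy-123", "latency-dashboard", "checkout-runbook"):
    graph.link("checkout-service", other)

code_neighbors = graph.neighbors("checkout-service", Layer.CODE)
```

Filtering neighbors by layer is one way to answer a question like "what changed recently for this service" without walking the whole graph.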

How the PCG Self-Learns

The Production Context Graph isn't static — it evolves with every interaction:

  • From investigations — each investigation adds new connections between symptoms, root causes, and resolutions
  • From humans — skill updates, Production Context Graph additions, and agent feedback enrich the graph
  • From agent actions — successful diagnostic paths are reinforced; dead ends are deprioritized
  • From RCAs — root cause analyses create durable links between failure modes and their causes

This means the more you use Autoheal, the faster and more accurate investigations become.
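One simple way to picture "reinforced" and "deprioritized" paths is a per-link score that moves up or down after each investigation. The sketch below is an assumption for illustration only: the source does not describe the actual update rule, and the neutral prior of 0.5 and step size of 0.1 are arbitrary.

```python
class PathScores:
    """Toy score table for (symptom, cause) links, clamped to [0, 1]."""

    def __init__(self, prior: float = 0.5, step: float = 0.1) -> None:
        self.prior = prior  # score for links never seen before (assumed)
        self.step = step    # size of each adjustment (assumed)
        self.weights: dict[tuple[str, str], float] = {}

    def record(self, symptom: str, cause: str, confirmed: bool) -> None:
        # Confirmed paths are reinforced; dead ends are pushed down.
        key = (symptom, cause)
        delta = self.step if confirmed else -self.step
        current = self.weights.get(key, self.prior)
        self.weights[key] = min(1.0, max(0.0, current + delta))

    def ranked_causes(self, symptom: str, candidates: list[str]) -> list[str]:
        """Order candidate causes from highest to lowest score."""
        return sorted(
            candidates,
            key=lambda cause: self.weights.get((symptom, cause), self.prior),
            reverse=True,
        )


scores = PathScores()
scores.record("high latency", "db pool exhausted", confirmed=True)
scores.record("high latency", "dns failure", confirmed=False)

ranking = scores.ranked_causes(
    "high latency", ["dns failure", "cache eviction", "db pool exhausted"]
)
```

Here the confirmed cause moves to the front, the unseen candidate stays at the neutral prior, and the dead end drops to the back.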

AI Agent Architecture

The OnCall Agent

The OnCall Agent is the core orchestrator. When an investigation starts, the agent:

  1. Assesses the situation — reads the alert, ticket, or user query to understand the problem
  2. Plans the investigation — determines which tools to query and in what order
  3. Gathers evidence — queries integrations (Datadog, Grafana, GitHub, etc.) through MCP
  4. Consults the PCG — checks past incidents, runbooks, and service ownership for context
  5. Forms hypotheses — generates ranked probable causes with supporting evidence
  6. Proposes fixes — suggests remediations based on runbooks, past resolutions, or code-level changes
  7. Generates RCA — produces a structured root cause analysis after resolution

The agent maintains full conversation context, so you can interact with it, redirect its investigation, or ask follow-up questions at any point.
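The seven steps and the ability to interject can be pictured as an ordered loop with a hook between steps. This is only a sketch under stated assumptions: the step names paraphrase the list above, the step bodies are placeholders, and `user_input` is a hypothetical callback rather than an Autoheal API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Step names paraphrase the numbered list above; real steps would do work.
STEP_NAMES = (
    "assess", "plan", "gather_evidence", "consult_pcg",
    "form_hypotheses", "propose_fixes", "generate_rca",
)


@dataclass
class Investigation:
    trigger: str
    completed: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)


def run_investigation(
    trigger: str,
    user_input: Optional[Callable[[str], Optional[str]]] = None,
) -> Investigation:
    """Run the steps in order, giving the user a chance to steer after each."""
    inv = Investigation(trigger)
    for name in STEP_NAMES:
        inv.completed.append(name)  # placeholder for the step's actual work
        if user_input is not None:
            note = user_input(name)
            if note:
                # Notes stay on the shared context so later steps can see them.
                inv.notes.append(f"after {name}: {note}")
    return inv


def redirect(step: str) -> Optional[str]:
    # Hypothetical user who interjects once, right after planning.
    return "focus on the database" if step == "plan" else None


result = run_investigation("alert: checkout latency", redirect)
```

Keeping one shared `Investigation` object is what lets a mid-run redirect influence the steps that follow it.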

Decision Traces

Every decision the agent makes is documented in a decision trace — a transparent record of:

  • What data the agent looked at
  • What hypotheses it considered
  • Why it pursued certain paths over others
  • What evidence supported or contradicted each hypothesis

Decision traces give your team full visibility into the agent's reasoning, making it auditable and trustworthy.
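The four items a decision trace records map naturally onto a small structured record. The shape below is a guess for illustration; the field names and sample values are hypothetical and do not reflect Autoheal's actual trace format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HypothesisRecord:
    statement: str
    supporting: list[str] = field(default_factory=list)
    contradicting: list[str] = field(default_factory=list)


@dataclass
class DecisionTrace:
    data_examined: list[str]            # what the agent looked at
    hypotheses: list[HypothesisRecord]  # what it considered, with evidence
    path_rationale: str                 # why it pursued one path over others


# Hypothetical sample values.
trace = DecisionTrace(
    data_examined=["p99 latency metric", "deploy history"],
    hypotheses=[
        HypothesisRecord(
            statement="recent deploy introduced a slow query",
            supporting=["latency rose right after the deploy"],
        ),
        HypothesisRecord(
            statement="upstream dependency outage",
            contradicting=["dependency error rate stayed flat"],
        ),
    ],
    path_rationale="deploy timing matched the start of the latency increase",
)

# asdict() recurses into nested dataclasses, so the trace serializes directly.
trace_json = json.dumps(asdict(trace), indent=2)
```

Storing contradicting evidence alongside supporting evidence is what makes a rejected hypothesis auditable later.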

Adversarial Agent Review

Before presenting findings, Autoheal runs an adversarial review process. A separate validation step challenges the primary agent's conclusions by:

  • Checking that every finding is backed by concrete evidence
  • Identifying gaps in the investigation
  • Flagging hypotheses that lack supporting data
  • Ensuring the root cause analysis is logically sound

This reduces false positives and keeps every conclusion tied to concrete evidence.
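The first and third checks (every finding backed by evidence, unsupported hypotheses flagged) can be approximated with a very small gate. This sketch covers only that evidence-count check and is an assumption about shape, not Autoheal's review logic; judging gaps or logical soundness would need far more than this.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    claim: str
    evidence: list[str] = field(default_factory=list)


def review(
    findings: list[Finding], min_evidence: int = 1
) -> tuple[list[Finding], list[Finding]]:
    """Split findings into (accepted, flagged) by how much evidence they cite."""
    accepted = [f for f in findings if len(f.evidence) >= min_evidence]
    flagged = [f for f in findings if len(f.evidence) < min_evidence]
    return accepted, flagged


# Hypothetical findings from a primary agent.
findings = [
    Finding("connection pool exhausted", ["pool saturation metric", "timeout logs"]),
    Finding("memory leak in worker"),  # no supporting data
]

accepted, flagged = review(findings)
```

Running the gate as a separate pass, rather than inside the code that produced the findings, mirrors the idea of an independent validation step.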

Integration Layer (MCP)

Autoheal uses the Model Context Protocol (MCP) — an open standard developed by Anthropic — to connect to your tools. MCP provides:

  • Standardized tool access — every integration exposes specific capabilities (query metrics, search logs, read code)
  • Security boundaries — the agent can only perform actions your integration explicitly permits
  • Real-time data — no data syncing or ETL pipelines; the agent queries your tools live during investigations
  • Extensibility — connect any tool by building a custom MCP server

See Integrations for the full list of supported tools and setup guides.
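For a sense of what a tool invocation looks like on the wire, MCP messages are JSON-RPC 2.0, and tools are invoked with the `tools/call` method. The sketch below builds such a request and adds a client-side allowlist purely for illustration. The tool names and arguments are hypothetical, and in practice the permitted actions are determined by what each MCP server exposes.

```python
import json
from itertools import count

_request_ids = count(1)

# Hypothetical tool names an integration might expose.
ALLOWED_TOOLS = frozenset({"query_metrics", "search_logs"})


def build_tool_call(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {name!r} is not permitted")
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = build_tool_call(
    "query_metrics", {"service": "checkout", "window_minutes": 30}
)
payload = json.dumps(request)  # what would be sent over the transport

try:
    build_tool_call("restart_service", {"service": "checkout"})
    blocked = False
except PermissionError:
    blocked = True
```

Because the request is built at query time, the agent reads current data from the tool rather than from a synced copy.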

Investigation Flow

Here's how a typical investigation flows through the system:

1. Trigger

An investigation starts from a PagerDuty alert, Datadog monitor, Sentry exception, Slack message, support ticket, or manual input in the Autoheal UI.

2. Context Loading

The agent loads relevant context from the PCG: service ownership, related runbooks, past incidents for this service, recent deployments, and known dependencies.

3. Evidence Gathering

The agent queries your connected integrations — pulling metrics, logs, traces, deployment history, and error data. It builds a timeline of events.
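Building a timeline amounts to merging events from several sources into one list ordered by time. A minimal sketch, assuming each source already yields timestamped events; the `Event` type and the sample data are made up for illustration.

```python
from datetime import datetime, timezone
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    at: datetime
    source: str
    summary: str


def build_timeline(*sources: Iterable[Event]) -> list[Event]:
    """Merge events from all sources into one list ordered by timestamp."""
    return sorted(
        (event for source in sources for event in source),
        key=lambda event: event.at,
    )


def utc(hour: int, minute: int) -> datetime:
    # Helper for the illustrative data below; the date is arbitrary.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


deploys = [Event(utc(12, 0), "deployments", "checkout deployed")]
metrics = [Event(utc(12, 5), "metrics", "p99 latency above threshold")]
logs = [
    Event(utc(12, 3), "logs", "first timeout error"),
    Event(utc(12, 7), "logs", "error rate peaks"),
]

timeline = build_timeline(metrics, logs, deploys)
```

Using timezone-aware timestamps avoids misordering events when sources report in different local times.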

4. Hypothesis Formation

Based on the evidence and PCG context, the agent generates ranked hypotheses. Each hypothesis is recorded in the decision trace with its reasoning and supporting evidence.

5. Adversarial Review

Findings are validated through adversarial review to ensure they're evidence-backed and logically sound.

6. Fix Proposals

The agent proposes mitigating fixes — rollbacks, restarts, config changes, scaling actions. For code-level issues, it surfaces preventive fixes for your team to review.

7. RCA Generation

After resolution, a structured 5-Why RCA is generated with timeline, impact analysis, root cause, and preventive measures.
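In a 5-Why analysis, each answer becomes the subject of the next "why", and the last answer in the chain is treated as the root cause. The sketch below shows one possible shape for such a record with the sections named above. The field names and sample content are hypothetical, not Autoheal's RCA schema.

```python
from dataclasses import dataclass, field


@dataclass
class FiveWhyRCA:
    problem: str
    whys: list[str]  # ordered answers; each one explains the previous
    impact: str = ""
    timeline: list[str] = field(default_factory=list)
    preventive_measures: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.whys:
            raise ValueError("at least one 'why' answer is required")

    @property
    def root_cause(self) -> str:
        # The final answer in the chain is taken as the root cause.
        return self.whys[-1]

    def render(self) -> str:
        lines = [f"Problem: {self.problem}"]
        lines += [f"Why {i}: {answer}" for i, answer in enumerate(self.whys, start=1)]
        lines.append(f"Root cause: {self.root_cause}")
        return "\n".join(lines)


# Hypothetical incident.
rca = FiveWhyRCA(
    problem="checkout requests timed out",
    whys=[
        "database queries were slow",
        "the connection pool was exhausted",
        "a new endpoint held connections open",
        "the endpoint lacked a query timeout",
        "timeouts are not enforced by the shared client",
    ],
    preventive_measures=["enforce a default timeout in the shared client"],
)

report = rca.render()
```

The class does not insist on exactly five answers, since the number is a guideline and some chains reach a root cause sooner or later.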

8. Learning

The RCA and investigation findings feed back into the Production Context Graph, making future investigations faster and more accurate.

Continuous Learning Loop

Autoheal is designed around a continuous improvement cycle:

  1. Alert / Ticket / Query
  2. Agent Investigates (using PCG + Integrations)
  3. Team Reviews & Resolves
  4. RCA Captured → PCG Updated
  5. Next Investigation Is Faster & More Accurate

The loop compounds over time:

  • Runbooks that worked get referenced more
  • Hypotheses that were wrong get deprioritized
  • Service relationships discovered during investigations enrich the graph
  • Resolution patterns become reusable across similar incidents

Your institutional knowledge stops living in people's heads and starts compounding in the system.