Capabilities

Autoheal's AI agents are built for the SRE needs of regulated enterprises. This page covers the capabilities that set Autoheal apart from generic AI tools.

Alert Normalization & Deduplication

When an incident occurs, it typically generates multiple alerts across different tools: a PagerDuty page, a Datadog monitor trigger, a Grafana alert, and a spike in Sentry errors. Instead of creating separate investigations for each, Autoheal:

Normalizes alerts from different sources into a consistent format with standardized severity, timing, and service attribution.
Deduplicates alerts that refer to the same underlying issue.
Categorizes each alert by service, failure domain, and probable impact.
Correlates simultaneous alerts across services to tell cascading failures apart from independent issues.

Your on-call engineer sees one consolidated investigation instead of a wall of noise.
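
The normalize-then-deduplicate flow above can be sketched roughly as follows. This is a minimal illustration, not Autoheal's implementation: the payload keys (`service`, `severity`, `ts`), the severity mapping, and the five-minute grouping window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical mapping from each tool's severity labels to one shared scale.
SEVERITY_MAP = {
    "pagerduty": {"high": "critical", "low": "warning"},
    "datadog": {"alert": "critical", "warn": "warning"},
    "grafana": {"alerting": "critical", "pending": "warning"},
}


@dataclass(frozen=True)
class NormalizedAlert:
    source: str
    service: str
    severity: str
    fired_at: datetime


def normalize(source: str, raw: dict) -> NormalizedAlert:
    """Map a raw alert payload (assumed keys) onto the shared schema."""
    return NormalizedAlert(
        source=source,
        service=raw["service"].strip().lower(),
        severity=SEVERITY_MAP[source].get(raw["severity"].lower(), "info"),
        fired_at=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
    )


def deduplicate(alerts, window=timedelta(minutes=5)):
    """Group alerts for the same service that fire within `window`
    of the first alert in the group."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        for group in groups:
            first = group[0]
            same_service = alert.service == first.service
            if same_service and alert.fired_at - first.fired_at <= window:
                group.append(alert)
                break
        else:
            groups.append([alert])
    return groups


# Illustrative payloads only; real tool payloads look different.
raw_alerts = [
    ("pagerduty", {"service": "Checkout", "severity": "high", "ts": 1700000000}),
    ("datadog", {"service": "checkout ", "severity": "alert", "ts": 1700000060}),
    ("grafana", {"service": "search", "severity": "pending", "ts": 1700000090}),
]
groups = deduplicate([normalize(src, raw) for src, raw in raw_alerts])
```

Here the two checkout alerts collapse into one group and the search alert stays separate. Grouping by service and time window is only one possible heuristic; a real system would likely use richer signals.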

Decision Traces

Every investigation decision is documented in a decision trace, a transparent record of the agent's reasoning.

A decision trace captures:

Element	Description
Data sources queried	Which tools were consulted and what data was retrieved
Hypotheses considered	All probable causes the agent evaluated, not just the final answer
Evidence for/against	What data supported or contradicted each hypothesis
Path decisions	Why the agent pursued certain diagnostic paths over others
Confidence levels	How confident the agent is in each finding and why
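
As a rough illustration, the elements in the table could be modeled as a record like the one below. The class and field names are hypothetical and chosen to mirror the table; they are not Autoheal's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    statement: str
    evidence_for: list[str] = field(default_factory=list)
    evidence_against: list[str] = field(default_factory=list)
    confidence: float = 0.0          # assumed scale: 0.0 to 1.0
    confidence_rationale: str = ""   # why the agent holds this confidence


@dataclass
class DecisionTrace:
    # Tool name -> short description of what was retrieved from it.
    data_sources_queried: dict[str, str] = field(default_factory=dict)
    # Every hypothesis evaluated, including the ones that were rejected.
    hypotheses: list[Hypothesis] = field(default_factory=list)
    # Why one diagnostic path was pursued over another.
    path_decisions: list[str] = field(default_factory=list)

    def leading_hypothesis(self) -> Hypothesis | None:
        """Return the hypothesis with the highest confidence, if any."""
        return max(self.hypotheses, key=lambda h: h.confidence, default=None)


trace = DecisionTrace(
    data_sources_queried={"metrics": "p99 latency for checkout, last 2h"},
    hypotheses=[
        Hypothesis("Recent deploy introduced a regression",
                   evidence_for=["latency rose right after the deploy"],
                   confidence=0.8,
                   confidence_rationale="timing matches and rollback helped"),
        Hypothesis("Upstream dependency is degraded",
                   evidence_against=["dependency latency is flat"],
                   confidence=0.1),
    ],
    path_decisions=["Checked deploy history first because the alert "
                    "started shortly after a release"],
)
```

Keeping rejected hypotheses in the record, rather than only the winner, is what lets a reviewer see what was ruled out and why.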

Why Decision Traces Matter

Auditability. Your team can review exactly how the agent reached its conclusions.
Trust. Engineers can verify the reasoning, not just the result.
Learning. When the agent is wrong, the trace shows where the reasoning went off-track.
Compliance. The trace provides the documentation trail that regulated environments require.

Adversarial Agent Review

Before presenting findings to your team, Autoheal runs an adversarial review process. A separate validation step challenges the primary investigation by:

Evidence checking. It verifies that every finding is backed by concrete, retrievable evidence.
Gap identification. It flags areas the investigation didn't cover that it should have.
Logic validation. It checks that the causal chain from symptoms to root cause holds up.
False positive filtering. It catches findings that rest on correlation alone, without a demonstrated causal link.

This review reduces false positives, so when Autoheal presents a root cause, your team can act on it.
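
Two of these checks, evidence checking and logic validation, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `review` function, the finding layout, and the evidence identifiers are all hypothetical, and the real review step is presumably far more involved.

```python
def review(findings, evidence_store, causal_chain):
    """Return a list of issues found in an investigation.

    findings:       list of {"claim": str, "evidence": [evidence_id, ...]}
    evidence_store: dict of evidence_id -> retrieved data (assumed layout)
    causal_chain:   ordered list of (cause, effect) pairs
    """
    issues = []

    # Evidence checking: every finding must cite retrievable evidence.
    for finding in findings:
        refs = finding.get("evidence", [])
        if not refs:
            issues.append(f"no evidence cited: {finding['claim']}")
        for ref in refs:
            if ref not in evidence_store:
                issues.append(f"evidence not retrievable: {ref}")

    # Logic validation: each step's effect must be the next step's cause.
    for (_, effect), (next_cause, _) in zip(causal_chain, causal_chain[1:]):
        if effect != next_cause:
            issues.append(
                f"broken causal link: {effect!r} is not the cause "
                f"of the next step ({next_cause!r})"
            )
    return issues


# Made-up identifiers for illustration.
evidence_store = {"deploy:checkout@4f2a": "diff and rollout timestamps"}
findings = [
    {"claim": "Deploy introduced the regression",
     "evidence": ["deploy:checkout@4f2a"]},
    {"claim": "Cache eviction spiked", "evidence": ["grafana:panel/9"]},
    {"claim": "DNS was slow", "evidence": []},
]
causal_chain = [
    ("bad deploy", "connection leak"),
    ("connection leak", "pool exhausted"),
    ("cache eviction", "500 errors"),
]
issues = review(findings, evidence_store, causal_chain)
```

In this example the review flags one unretrievable reference, one finding with no evidence, and one gap in the causal chain.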

Multi-Turn Conversations

Autoheal investigations are interactive, not fire-and-forget. The Investigation Agent maintains full conversation context, enabling:

Follow-up questions, such as "Can you also check the Redis connection pool metrics?"
Investigation redirection, such as "I think this is related to the database migration we ran at 3 PM."
Clarification requests, such as "What exactly changed in that deployment?"
Scope expansion, such as "Are any other services seeing similar issues?"

The agent remembers everything discussed in the investigation and incorporates new directions without losing prior context.

Preventive Fixes

When the root cause is a code defect, Autoheal goes beyond identification to show your team exactly what to fix:

Root Cause Identification

The agent pinpoints the specific file, function, and logic causing the issue, linking it to the evidence from the investigation.

Fix Proposal

Based on the root cause analysis, the agent generates a proposed code change that addresses the underlying defect.

Team Review

Your engineering team reviews the suggested fix with full context on why the change was proposed and how it connects to the incident.

The fix comes directly out of the investigation, so your team knows what to change to prevent recurrence.

5-Why Root Cause Analysis

Autoheal generates structured RCAs using the 5-Why methodology. It asks "why?" iteratively until it reaches the underlying systemic cause, not just the proximate trigger.

A typical 5-Why RCA includes:

Incident summary. What happened, when, and the business impact.
Timeline. Chronological events built from investigation evidence rather than memory.
5-Why chain. Root cause analysis working from symptom down to systemic cause.
Impact assessment. Services affected, duration, user impact, and SLA implications.
Mitigations applied. What was done to resolve the immediate issue.
Preventive measures. Action items to prevent recurrence, with owners and timelines.
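
The iterative "why?" step can be pictured as walking a chain of causes until nothing deeper is known or five levels have been reached. The sketch below assumes a hypothetical `cause_of` lookup built from investigation evidence; how Autoheal actually derives each answer is not described here.

```python
def five_why(symptom, cause_of, max_depth=5):
    """Follow cause_of from the symptom toward the systemic cause.

    Stops when no deeper cause is known or after `max_depth` whys.
    Returns the chain starting with the symptom itself.
    """
    chain = [symptom]
    while len(chain) <= max_depth and chain[-1] in cause_of:
        chain.append(cause_of[chain[-1]])
    return chain


# Illustrative chain; the incident details are made up.
cause_of = {
    "checkout API returns 500s": "database connection pool exhausted",
    "database connection pool exhausted": "connections are not released after timeouts",
    "connections are not released after timeouts": "retry handler skips cleanup on error",
    "retry handler skips cleanup on error": "error path has no test coverage",
}
chain = five_why("checkout API returns 500s", cause_of)
for depth, cause in enumerate(chain[1:], start=1):
    print(f"Why #{depth}: {cause}")
```

The depth cap also keeps the walk from looping forever if the lookup accidentally contains a cycle.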

Guaranteed Post-Mortems

Because the RCA is generated directly from investigation evidence rather than written by hand, every incident gets a post-mortem, not just the ones someone remembers to write up. This closes the common gap where only major incidents get RCAs while smaller issues quietly recur.

Skill Execution

The agent also follows your team's established procedures to fix the problems it finds. When the Production Context Graph contains relevant skills, the agent:

Matches the current issue to applicable skills based on symptoms and service.
Follows documented remediation steps, adapting them to the specific situation.
Reports which steps it took and their outcomes.
Suggests improvements when steps are missing, outdated, or unclear.
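
The matching step can be illustrated with a simple overlap score between observed symptoms and the symptoms each skill declares. The skill layout (`name`, `service`, `symptoms`) and the `"*"` wildcard for service-agnostic skills are assumptions for this sketch, not the Production Context Graph's real format.

```python
def match_skills(skills, service, symptoms):
    """Return names of applicable skills, best match first.

    A skill applies if it targets this service (or any service, "*")
    and shares at least one symptom with what was observed.
    """
    observed = set(symptoms)
    scored = []
    for skill in skills:
        if skill["service"] not in (service, "*"):
            continue
        overlap = len(observed & set(skill["symptoms"]))
        if overlap:
            scored.append((overlap, skill["name"]))
    # Highest overlap first; ties broken alphabetically for stable output.
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [name for _, name in ranked]


# Hypothetical skills for illustration.
skills = [
    {"name": "restart-connection-pool", "service": "checkout",
     "symptoms": ["pool exhausted", "timeouts"]},
    {"name": "scale-out-workers", "service": "*",
     "symptoms": ["timeouts", "high latency"]},
    {"name": "flush-search-index", "service": "search",
     "symptoms": ["stale results"]},
]
matched = match_skills(skills, "checkout", ["timeouts", "pool exhausted"])
```

Here the checkout-specific skill ranks first because it matches both symptoms, and the search skill is excluded entirely.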

Knowledge Evolution

Every investigation enriches the Production Context Graph:

New connections between services, failure modes, and resolutions are recorded.
Successful diagnostic paths are reinforced for similar future issues.
Dead-end paths are deprioritized to save time later.
Resolution patterns become reusable across similar incidents.
Skill gaps found during an investigation prompt Production Context Graph improvements.

As a result, institutional knowledge compounds over time instead of being lost to turnover and context-switching.
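
One way to picture reinforcing successful paths and deprioritizing dead ends is a per-path weight that moves up or down after each investigation. The `PathWeights` class, its starting weight of 1.0, and the step size are invented for this sketch; the actual update mechanism in the Production Context Graph is not specified here.

```python
from collections import defaultdict


class PathWeights:
    """Track how useful each diagnostic path has been per failure mode."""

    def __init__(self, step=0.2):
        self.step = step
        # Unseen (failure_mode, path) pairs start at a neutral weight.
        self.weights = defaultdict(lambda: 1.0)

    def record(self, failure_mode, path, succeeded):
        """Reinforce a path that led to the cause, or demote a dead end."""
        key = (failure_mode, path)
        if succeeded:
            self.weights[key] += self.step
        else:
            self.weights[key] = max(0.0, self.weights[key] - self.step)

    def ranked(self, failure_mode, paths):
        """Order candidate paths from most to least promising."""
        return sorted(paths, key=lambda p: -self.weights[(failure_mode, p)])


weights = PathWeights()
weights.record("latency spike", "check recent deploys", succeeded=True)
weights.record("latency spike", "check DNS resolution", succeeded=False)
order = weights.ranked(
    "latency spike",
    ["check DNS resolution", "check cache hit rate", "check recent deploys"],
)
```

After one success and one dead end, the next latency investigation would try deploy history first and DNS last, with the untried cache check in between.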
