
AI SRE

Autoheal's AI SRE capability acts as a tireless member of your on-call rotation: it triages alerts, gathers diagnostic signals, and conducts deep-dive investigations, so your engineers can focus on decisions and fixes rather than data gathering.

Alert Triage

When alerts fire, a poor signal-to-noise ratio is often the biggest problem. Autoheal's AI SRE handles this automatically:

Alert Normalization & Deduplication

  • Normalizes incoming alerts from different sources (PagerDuty, Datadog, Grafana) into a consistent format
  • Deduplicates related alerts that stem from the same underlying issue
  • Categorizes alerts by service, severity, and probable impact
  • Correlates simultaneous alerts to identify cascading failures

Instead of your on-call engineer waking up to 15 separate alerts, they see one consolidated investigation with context.
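
To make the normalize-then-deduplicate idea concrete, here is a minimal sketch in Python. It is not Autoheal's implementation: the `Alert` schema, the payload field names, and the five-minute grouping window are all assumptions made for illustration, and real PagerDuty or Datadog payloads are shaped differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Alert:
    """Common alert shape (hypothetical schema, not Autoheal's)."""
    source: str
    service: str
    severity: str
    title: str
    fired_at: datetime


def normalize(source: str, payload: dict) -> Alert:
    """Map a source-specific payload onto the common Alert shape.

    The payload keys are invented for illustration; real PagerDuty and
    Datadog webhook payloads are structured differently.
    """
    if source == "pagerduty":
        return Alert(
            source=source,
            service=payload["service_name"],
            severity=payload["urgency"],
            title=payload["summary"],
            fired_at=datetime.fromisoformat(payload["created_at"]),
        )
    if source == "datadog":
        return Alert(
            source=source,
            service=payload["svc"],
            severity=payload["priority"],
            title=payload["title"],
            fired_at=datetime.fromtimestamp(payload["ts"], tz=timezone.utc),
        )
    raise ValueError(f"unsupported source: {source}")


def deduplicate(alerts: list[Alert],
                window: timedelta = timedelta(minutes=5)) -> list[list[Alert]]:
    """Group alerts for one service that fire within `window` of the group's first alert."""
    groups: list[list[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        for group in groups:
            first = group[0]
            same_service = first.service == alert.service
            if same_service and alert.fired_at - first.fired_at <= window:
                group.append(alert)
                break
        else:
            # No existing group matched, so this alert starts a new one.
            groups.append([alert])
    return groups
```

Grouping by service and time window is only one possible heuristic. Correlating cascading failures across services would also need the dependency topology, which this sketch leaves out.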

Investigation

Once alerts are triaged, the OnCall Agent conducts a structured investigation:

1. Gather Real-Time Diagnostic Signals

The agent queries your observability stack for relevant data: metrics, logs, traces, error rates, latency percentiles, and resource utilization.

2. Check Recent Changes

It reviews recent deployments, config changes, feature flag toggles, and infrastructure modifications that correlate with the alert timing.

3. Consult the Production Context Graph

The agent looks up past incidents for the affected service, relevant skills, known failure modes, and the service dependency topology.

4. Deep-Dive Analysis with Decision Traces

The agent follows diagnostic paths, documenting every step in a decision trace — what it checked, what it found, and why it pursued each direction. This makes every investigation transparent and auditable.

5. Propose Hypotheses

Based on evidence, the agent generates ranked hypotheses with supporting data. Each hypothesis includes confidence level and the evidence that supports or contradicts it.
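
As a rough illustration of what a decision trace and a ranked hypothesis list could look like as data, here is a small Python sketch. The class names, fields, and the 0.0 to 1.0 confidence scale are assumptions chosen for the example; they do not describe Autoheal's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class TraceStep:
    """One entry in a decision trace (hypothetical structure)."""
    checked: str  # what the agent looked at
    found: str    # what it observed
    reason: str   # why it pursued this direction


@dataclass
class Hypothesis:
    summary: str
    confidence: float  # assumed scale: 0.0 (unlikely) to 1.0 (near certain)
    supporting: list[str] = field(default_factory=list)
    contradicting: list[str] = field(default_factory=list)


@dataclass
class Investigation:
    trace: list[TraceStep] = field(default_factory=list)
    hypotheses: list[Hypothesis] = field(default_factory=list)

    def record(self, checked: str, found: str, reason: str) -> None:
        """Append a step so the investigation stays auditable."""
        self.trace.append(TraceStep(checked, found, reason))

    def ranked(self) -> list[Hypothesis]:
        """Return hypotheses ordered from most to least confident."""
        return sorted(self.hypotheses, key=lambda h: h.confidence, reverse=True)


inv = Investigation()
inv.record(
    checked="deploy history for the payment service",
    found="a deployment landed shortly before the first alert",
    reason="recent changes often correlate with new alerts",
)
inv.hypotheses.append(Hypothesis(
    summary="Upstream API degradation",
    confidence=0.3,
    contradicting=["upstream latency looks normal"],
))
inv.hypotheses.append(Hypothesis(
    summary="Regression introduced by the latest deployment",
    confidence=0.7,
    supporting=["error rate rose right after the deployment"],
))

for h in inv.ranked():
    print(f"{h.confidence:.0%}  {h.summary}")
```

Keeping the supporting and contradicting evidence on each hypothesis, rather than only a score, is what lets a reviewer see why one explanation was ranked above another.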

Interactive Resolution

The AI SRE isn't a black box. Your team works with it interactively:

  • Ask follow-up questions — "Can you check the database connection pool?" or "What about the upstream API?"
  • Redirect the investigation — "I think this is related to the migration we ran yesterday"
  • Request specific actions — "Pull the last 30 minutes of error logs from the payment service"

The agent maintains full conversation context throughout the investigation.

Suggested Remediations

Based on the investigation findings and the skills in your Production Context Graph, the agent proposes concrete fixes:

  • Rollbacks — identifies the exact deployment to roll back
  • Restarts — suggests which services need recycling
  • Config changes — points to specific configuration that needs updating
  • Scaling actions — recommends scaling parameters based on current load
  • Preventive fixes — for bugs, identifies the root cause in code and surfaces a proposed fix for your team
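
For the scaling case, one common way to turn current load into a replica count is the proportional rule used by the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current replicas × current metric / target metric). The sketch below applies that rule. It is an illustration only, not how Autoheal computes its recommendations; the function name and the `max_replicas` cap are assumptions.

```python
import math


def recommend_replicas(current_replicas: int,
                       current_load: float,
                       target_load: float,
                       max_replicas: int = 20) -> int:
    """Suggest a replica count that brings per-replica load back to target.

    `current_load` and `target_load` are per-replica values in the same
    unit (for example, CPU percent or requests per second).
    """
    if current_replicas < 1 or target_load <= 0:
        raise ValueError("need at least one replica and a positive target load")
    desired = math.ceil(current_replicas * current_load / target_load)
    # Clamp to a sane range so a metric spike cannot request unbounded capacity.
    return max(1, min(desired, max_replicas))


# 4 replicas running at 90% CPU against a 60% target
print(recommend_replicas(4, 90.0, 60.0))
```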

How It Integrates

The AI SRE works with your existing on-call workflow:

| Integration | Role |
| --- | --- |
| PagerDuty | Receives alerts, triggers investigations automatically |
| Datadog / Grafana | Queries metrics, dashboards, and monitors for evidence |
| Sentry | Pulls error details, stack traces, and exception patterns |
| GitHub / GitLab | Reviews recent deployments, commits, and PR history |
| Slack | Agents can be invoked via @Autoheal and post findings to channels |
| Elasticsearch | Searches logs for error patterns and anomalies |

Getting Started

  1. Connect your monitoring tools (Datadog, Grafana, or similar)
  2. Set up PagerDuty for automatic alert-triggered investigations
  3. Add skills to your Production Context Graph for your most common alert types
  4. Connect Slack so your team can interact with investigations in real time