# Knowledge Base
Think of the Knowledge Base as the onboarding guide you'd give a new engineer joining your on-call rotation. It contains everything they'd need to investigate incidents effectively: which repositories to check, what Datadog dashboards to pull up, how to interpret specific alerts, and the tribal knowledge your team has accumulated over years.
The Autoheal AI agent uses this knowledge during investigations—and unlike a new hire, it never forgets.
## How the Agent Uses Your Knowledge
When you start an investigation, the AI agent doesn't operate in isolation. It combines two powerful capabilities:
- **Real-time access to your observability stack**: Datadog, Grafana, GitHub, Sentry, and more. The agent can query metrics, search logs, pull traces, and check deployment history.
- **Your team's operational knowledge**: runbooks, service ownership, debugging procedures, and learnings from past incidents. This tells the agent how to use the tools effectively.
Integrations give the agent capabilities. The Knowledge Base gives it context.
Without knowledge base content, the agent can query Datadog but doesn't know which dashboard matters for your payment service. With your knowledge, it knows to check payments-api-latency first, that spikes after 2pm PST usually correlate with batch jobs, and that the payments-oncall Slack channel is where your team coordinates.
## What to Put in Your Knowledge Base
Think about what you'd tell a new engineer on their first on-call shift. That's exactly what belongs here.
### Team Context (AGENTS.md)
Start with a document that orients the agent to your team's environment:
```markdown
---
category: generic_instructions
title: Team Context
---

# Platform Team Context

## Our Services

- **payments-api**: Handles all payment processing. Primary repo: github.com/acme/payments-api
- **user-service**: Authentication and user management. Primary repo: github.com/acme/user-service
- **gateway**: API gateway, routes all external traffic

## Key Datadog Resources

- Dashboard: "Platform Team Overview" (for high-level health)
- Dashboard: "Payments Deep Dive" (when investigating payment issues)
- APM Service: `payments-api-prod` (use tag `env:production`)
- Log query: `service:payments-api status:error` for payment errors

## On-Call Essentials

- PagerDuty escalation: Platform → SRE → Engineering Manager
- Slack channels: #platform-oncall (primary), #incidents (company-wide)
- Runbook location: All runbooks are in this knowledge base

## Common Investigation Starting Points

1. Check the "Platform Team Overview" dashboard for anomalies
2. Look at recent deployments in GitHub Actions
3. Search Sentry for new error patterns
4. Check #platform-oncall for recent discussions
```
### Alert-Specific Runbooks
For recurring alerts, document exactly how to investigate and resolve them:
```markdown
---
category: alert_runbooks
alert_name: HighPaymentLatency
---

# High Payment Latency Runbook

## What This Alert Means

Payment API p99 latency exceeded 500ms for 5+ minutes. Customers may experience slow checkouts.

## Immediate Investigation

1. Open Datadog dashboard "Payments Deep Dive"
2. Check if latency is isolated to specific endpoints (usually `/v1/charge`)
3. Look for correlation with database latency in the "PostgreSQL" dashboard

## Common Causes

### Database Connection Pool Exhaustion

- **How to check**: Query `postgresql.connections.active` in Datadog
- **Resolution**: Scale up API replicas or investigate long-running queries
- **Who to contact**: @database-team in Slack

### Downstream Payment Provider Issues

- **How to check**: Look at `external_api.latency` metric with tag `provider:stripe`
- **Resolution**: If Stripe is slow, check status.stripe.com. Consider enabling fallback provider.
- **Who to contact**: #payments-oncall

### Recent Deployment Regression

- **How to check**: Correlate latency increase with deployment times in GitHub Actions
- **Resolution**: Roll back via `kubectl rollout undo deployment/payments-api`
- **Who to contact**: Last deployer (check GitHub Actions history)

## Escalation

If unresolved after 15 minutes, page the Payments team lead.
```
### Integration-Specific Knowledge
Help the agent understand how your team uses specific tools:
```markdown
---
category: integrations
integration_type: datadog
---

# How We Use Datadog

## Important Dashboards

| Dashboard | When to Use |
|-----------|-------------|
| Platform Team Overview | First stop for any incident |
| Payments Deep Dive | Payment-related alerts |
| Database Performance | Slow query investigations |
| Kubernetes Cluster | Resource/scaling issues |

## Our Tagging Convention

- `env:production`, `env:staging`, `env:development`
- `service:<service-name>` (e.g., `service:payments-api`)
- `team:platform`, `team:payments`, `team:infrastructure`

## Useful Log Queries

- All errors: `status:error env:production`
- Payment failures: `service:payments-api @error.type:PaymentFailed`
- Auth issues: `service:user-service "authentication failed"`

## APM Services

Our services are instrumented with these names:

- `payments-api-prod` (production payments)
- `user-service-prod` (production auth)
- `gateway-prod` (API gateway)
```
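Documented tagging conventions like these are easy to apply mechanically. As an illustration (a sketch, not part of any Datadog client library — the `log_query` helper and its parameters are hypothetical), a small function can compose log search strings that follow the convention above:

```python
def log_query(service, env="production", status=None, text=None):
    """Compose a Datadog log search query following the team's
    tagging convention (service:<name>, env:<environment>)."""
    parts = [f"service:{service}", f"env:{env}"]
    if status:
        # Datadog log status facet, e.g. status:error
        parts.append(f"status:{status}")
    if text:
        # Free-text search terms are quoted
        parts.append(f'"{text}"')
    return " ".join(parts)

print(log_query("payments-api", status="error"))
# service:payments-api env:production status:error
print(log_query("user-service", text="authentication failed"))
```

Keeping query construction in one place means a tagging-convention change only has to be updated once.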
## How Knowledge Evolves
Your Knowledge Base isn't static—it grows smarter with every investigation.
### During Investigations
As the agent investigates, it may discover information gaps:
- Agent asks a question → Your answer becomes a candidate for the Knowledge Base
- Agent finds a useful query → Document it for future incidents
- Agent identifies root cause → Capture the investigation path
When the agent asks "Which dashboard should I check for this service?" or "Who owns this component?", that's a signal to add that information to your Knowledge Base.
### After Incidents
The most valuable knowledge comes from real incidents:
```markdown
---
category: memories
incident_date: 2024-01-15
services_affected: payments-api
---

# Memory: Payment Timeouts During Batch Processing

## What Happened

On Jan 15, 2024, payment latency spiked to 2s+ every weekday at 2pm PST.

## Root Cause

The nightly batch job (which actually runs at 2pm PST due to timezone confusion)
was executing large database queries without connection limits, exhausting the
connection pool for the payments-api.

## How We Found It

1. Noticed the pattern occurred only on weekdays at the same time
2. Correlated with the batch job schedule in Kubernetes CronJobs
3. Found connection pool metrics showed saturation during batch runs

## Resolution

- Added connection pooling limits to the batch job
- Moved the batch job to a read replica
- Added an alert for connection pool saturation

## Prevention

- New alert: `postgresql.connections.active > 80%` triggers warning
- Runbook updated to check batch job schedule for time-correlated issues
```
### The Feedback Loop

1. Agent pulls relevant Knowledge Base content and queries integrations.
2. Agent asks questions or makes assumptions that could be documented.
3. Team captures learnings—what worked, what was missing, what to check next time.
4. New runbooks, updated procedures, or memories are added to the Knowledge Base.
5. Agent uses the improved knowledge, and investigations get faster.
## Content Categories

- **Alert Runbooks** (`alert_runbooks`): Step-by-step procedures tied to specific alerts. Include what the alert means, how to investigate, common causes, and resolution steps.
- **Integrations** (`integrations`): How your team uses specific tools—important dashboards, tagging conventions, useful queries, and service mappings.
- **Generic Instructions** (`generic_instructions`): Team context, service ownership, escalation procedures, and general debugging approaches that apply across incidents.
- **Memories** (`memories`): Learnings from past incidents—root causes, investigation paths, and solutions. The institutional knowledge that prevents repeat incidents.
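Because every document declares its category in the frontmatter, a lightweight check can catch typos before a document is added. A minimal sketch (the parsing here follows the frontmatter shape shown in the examples above, not a formal schema):

```python
# Valid categories, as used in the frontmatter examples in this guide.
VALID_CATEGORIES = {"generic_instructions", "alert_runbooks", "integrations", "memories"}

def parse_frontmatter(doc: str) -> dict:
    """Extract key: value pairs from a leading '---'-delimited block."""
    lines = doc.strip().splitlines()
    if not lines or lines[0] != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line == "---":  # closing delimiter ends the block
            break
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta

def check_category(doc: str) -> bool:
    """True if the document declares a recognized category."""
    return parse_frontmatter(doc).get("category") in VALID_CATEGORIES

doc = """---
category: alert_runbooks
alert_name: HighPaymentLatency
---
# High Payment Latency Runbook
"""
print(check_category(doc))  # True
```

A real pipeline would likely use a YAML parser, but the idea is the same: reject documents whose category the agent won't know how to route.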
## Best Practices
### Write for 3 AM
Your on-call engineer (and the AI agent) will use this documentation when they're tired and stressed. Be explicit:
- Use exact dashboard names, not "the main dashboard"
- Include actual commands with copy-paste-ready syntax
- Specify who to contact with Slack handles or team names
- List the specific metrics to check, not "look at the metrics"
### Start with Your Top 5 Alerts
Don't try to document everything at once. Start with:
- Alerts that page most frequently
- Incidents that take longest to resolve
- Issues that require specific tribal knowledge
- Alerts that new team members struggle with
- Anything you've explained more than twice
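If you have paging data, the prioritization above can even be computed. A rough sketch (the alert names and numbers are made up for illustration): rank alerts by total on-call cost, i.e. page frequency times average time-to-resolve, and document the top of the list first.

```python
# Hypothetical paging stats; in practice these would come from
# PagerDuty or your incident tracker.
alerts = [
    {"name": "HighPaymentLatency", "pages_per_month": 12, "avg_minutes_to_resolve": 45},
    {"name": "GatewayErrorRate", "pages_per_month": 20, "avg_minutes_to_resolve": 10},
    {"name": "DBConnectionSaturation", "pages_per_month": 3, "avg_minutes_to_resolve": 90},
]

def on_call_cost(alert):
    """Total minutes per month this alert costs the on-call rotation."""
    return alert["pages_per_month"] * alert["avg_minutes_to_resolve"]

# Document the costliest alerts first.
top = sorted(alerts, key=on_call_cost, reverse=True)[:5]
for a in top:
    print(a["name"], on_call_cost(a))
```

Note that a rarely-firing alert with a long resolution time (like the saturation alert here) can outrank a noisy but quickly-resolved one.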
### Include the 'Why'
Don't just document what to do—explain why:
- Why do we check this dashboard first?
- Why is this metric significant?
- Why do we escalate to this team?
This helps the agent make better decisions in novel situations.
### Keep It Current
Outdated documentation is worse than no documentation. After each incident:
- Update runbooks that didn't quite fit
- Remove steps that no longer apply
- Add discoveries from the investigation
- Mark deprecated services or dashboards
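One way to keep currency honest is to track review dates and flag anything that hasn't been looked at recently. A sketch, assuming a hypothetical `last_reviewed: YYYY-MM-DD` frontmatter field (not one of the fields shown in the examples above):

```python
from datetime import date, timedelta

def is_stale(last_reviewed: str, today: date, max_age_days: int = 90) -> bool:
    """Flag a document whose hypothetical last_reviewed date is older
    than max_age_days. Dates are ISO format, e.g. '2024-01-15'."""
    reviewed = date.fromisoformat(last_reviewed)
    return today - reviewed > timedelta(days=max_age_days)

print(is_stale("2024-01-15", today=date(2024, 6, 1)))  # True: well over 90 days old
print(is_stale("2024-05-20", today=date(2024, 6, 1)))  # False: reviewed 12 days ago
```

Running a check like this after each incident review turns "keep it current" from a good intention into a routine.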
### Link to Integrations
Reference your actual tools:
- Specific Datadog dashboard URLs
- GitHub repository links
- Grafana panel names
- Sentry project identifiers
The agent can use these to navigate directly to the right places.