# Knowledge Base
Think of the Knowledge Base as the onboarding guide you'd give a new engineer joining your on-call rotation. It contains everything they'd need to investigate incidents effectively: which repositories to check, what Datadog dashboards to pull up, how to interpret specific alerts, and the tribal knowledge your team has accumulated over years.
The Autoheal AI agent uses this knowledge during investigations—and unlike a new hire, it never forgets.
## How the Agent Uses Your Knowledge
When you start an investigation, the AI agent doesn't operate in isolation. It combines two powerful capabilities:
- **Real-time access to your observability stack**: Datadog, Grafana, GitHub, Sentry, and more. The agent can query metrics, search logs, pull traces, and check deployment history.
- **Your team's operational knowledge**: runbooks, service ownership, debugging procedures, and learnings from past incidents. This tells the agent how to use the tools effectively.
Integrations give the agent capabilities. The Knowledge Base gives it context.
Without knowledge base content, the agent can query Datadog but doesn't know which dashboard matters for your payment service. With your knowledge, it knows to check payments-api-latency first, that spikes after 2pm PST usually correlate with batch jobs, and that the payments-oncall Slack channel is where your team coordinates.
## What to Put in Your Knowledge Base
Think about what you'd tell a new engineer on their first on-call shift. That's exactly what belongs here.
### Team Context (AGENTS.md)
Start with a document that orients the agent to your team's environment:
```markdown
---
category: generic_instructions
title: Team Context
---

# Platform Team Context

## Our Services

- **payments-api**: Handles all payment processing. Primary repo: github.com/acme/payments-api
- **user-service**: Authentication and user management. Primary repo: github.com/acme/user-service
- **gateway**: API gateway, routes all external traffic

## Key Datadog Resources

- Dashboard: "Platform Team Overview" (for high-level health)
- Dashboard: "Payments Deep Dive" (when investigating payment issues)
- APM Service: `payments-api-prod` (use tag `env:production`)
- Log query: `service:payments-api status:error` for payment errors

## On-Call Essentials

- PagerDuty escalation: Platform → SRE → Engineering Manager
- Slack channels: #platform-oncall (primary), #incidents (company-wide)
- Runbook location: All runbooks are in this knowledge base

## Common Investigation Starting Points

1. Check the "Platform Team Overview" dashboard for anomalies
2. Look at recent deployments in GitHub Actions
3. Search Sentry for new error patterns
4. Check #platform-oncall for recent discussions
```
### Alert-Specific Runbooks
For recurring alerts, document exactly how to investigate and resolve them:
```markdown
---
category: alert_runbooks
alert_name: HighPaymentLatency
---

# High Payment Latency Runbook

## What This Alert Means

Payment API p99 latency exceeded 500ms for 5+ minutes. Customers may experience slow checkouts.

## Immediate Investigation

1. Open Datadog dashboard "Payments Deep Dive"
2. Check if latency is isolated to specific endpoints (usually `/v1/charge`)
3. Look for correlation with database latency in the "PostgreSQL" dashboard

## Common Causes

### Database Connection Pool Exhaustion

- **How to check**: Query `postgresql.connections.active` in Datadog
- **Resolution**: Scale up API replicas or investigate long-running queries
- **Who to contact**: @database-team in Slack

### Downstream Payment Provider Issues

- **How to check**: Look at `external_api.latency` metric with tag `provider:stripe`
- **Resolution**: If Stripe is slow, check status.stripe.com. Consider enabling fallback provider.
- **Who to contact**: #payments-oncall

### Recent Deployment Regression

- **How to check**: Correlate latency increase with deployment times in GitHub Actions
- **Resolution**: Roll back via `kubectl rollout undo deployment/payments-api`
- **Who to contact**: Last deployer (check GitHub Actions history)

## Escalation

If unresolved after 15 minutes, page the Payments team lead.
```
### Integration-Specific Knowledge
Help the agent understand how your team uses specific tools:
```markdown
---
category: integrations
integration_type: datadog
---

# How We Use Datadog

## Important Dashboards

| Dashboard | When to Use |
|-----------|-------------|
| Platform Team Overview | First stop for any incident |
| Payments Deep Dive | Payment-related alerts |
| Database Performance | Slow query investigations |
| Kubernetes Cluster | Resource/scaling issues |

## Our Tagging Convention

- `env:production`, `env:staging`, `env:development`
- `service:<service-name>` (e.g., `service:payments-api`)
- `team:platform`, `team:payments`, `team:infrastructure`

## Useful Log Queries

- All errors: `status:error env:production`
- Payment failures: `service:payments-api @error.type:PaymentFailed`
- Auth issues: `service:user-service "authentication failed"`

## APM Services

Our services are instrumented with these names:

- `payments-api-prod` (production payments)
- `user-service-prod` (production auth)
- `gateway-prod` (API gateway)
```
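Documented tagging conventions like these are easy to apply mechanically. As an illustration (a sketch, not part of any Datadog client library — the `log_query` helper and its parameters are hypothetical), a small function can compose log search strings that follow the convention above:

```python
def log_query(service, env="production", status=None, text=None):
    """Compose a Datadog log search query following the team's
    tagging convention (service:<name>, env:<environment>)."""
    parts = [f"service:{service}", f"env:{env}"]
    if status:
        # Datadog log status facet, e.g. status:error
        parts.append(f"status:{status}")
    if text:
        # Free-text search terms are quoted
        parts.append(f'"{text}"')
    return " ".join(parts)

print(log_query("payments-api", status="error"))
# service:payments-api env:production status:error
print(log_query("user-service", text="authentication failed"))
```

Keeping query construction in one place means a tagging-convention change only has to be updated once.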
## How Knowledge Evolves
Your Knowledge Base isn't static—it grows smarter with every investigation.
### During Investigations
As the agent investigates, it may discover information gaps:
- Agent asks a question → Your answer becomes a candidate for the Knowledge Base
- Agent finds a useful query → Document it for future incidents
- Agent identifies root cause → Capture the investigation path
When the agent asks "Which dashboard should I check for this service?" or "Who owns this component?", that's a signal to add that information to your Knowledge Base.
### After Incidents
The most valuable knowledge comes from real incidents:
```markdown
---
category: memories
incident_date: 2024-01-15
services_affected: payments-api
---

# Memory: Payment Timeouts During Batch Processing

## What Happened

On Jan 15, 2024, payment latency spiked to 2s+ every weekday at 2pm PST.

## Root Cause

The nightly batch job (which actually runs at 2pm PST due to timezone confusion)
was executing large database queries without connection limits, exhausting the
connection pool for the payments-api.

## How We Found It

1. Noticed the pattern occurred only on weekdays at the same time
2. Correlated with the batch job schedule in Kubernetes CronJobs
3. Found connection pool metrics showed saturation during batch runs

## Resolution

- Added connection pooling limits to the batch job
- Moved the batch job to a read replica
- Added an alert for connection pool saturation

## Prevention

- New alert: `postgresql.connections.active > 80%` triggers warning
- Runbook updated to check batch job schedule for time-correlated issues
```
### The Feedback Loop

1. Agent pulls relevant Knowledge Base content and queries integrations.
2. Agent asks questions or makes assumptions that could be documented.
3. Team captures learnings—what worked, what was missing, what to check next time.
4. New runbooks, updated procedures, or memories are added to the Knowledge Base.
5. Agent uses the improved knowledge, and investigations get faster.
## Content Categories

- **Alert Runbooks** (`alert_runbooks`): Step-by-step procedures tied to specific alerts. Include what the alert means, how to investigate, common causes, and resolution steps.
- **Integrations** (`integrations`): How your team uses specific tools—important dashboards, tagging conventions, useful queries, and service mappings.
- **Generic Instructions** (`generic_instructions`): Team context, service ownership, escalation procedures, and general debugging approaches that apply across incidents.
- **Memories** (`memories`): Learnings from past incidents—root causes, investigation paths, and solutions. The institutional knowledge that prevents repeat incidents.
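Because every document declares its category in the frontmatter, a lightweight check can catch typos before a document is added. A minimal sketch (the parsing here follows the frontmatter shape shown in the examples above, not a formal schema):

```python
# Valid categories, as used in the frontmatter examples in this guide.
VALID_CATEGORIES = {"generic_instructions", "alert_runbooks", "integrations", "memories"}

def parse_frontmatter(doc: str) -> dict:
    """Extract key: value pairs from a leading '---'-delimited block."""
    lines = doc.strip().splitlines()
    if not lines or lines[0] != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line == "---":  # closing delimiter ends the block
            break
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta

def check_category(doc: str) -> bool:
    """True if the document declares a recognized category."""
    return parse_frontmatter(doc).get("category") in VALID_CATEGORIES

doc = """---
category: alert_runbooks
alert_name: HighPaymentLatency
---
# High Payment Latency Runbook
"""
print(check_category(doc))  # True
```

A real pipeline would likely use a YAML parser, but the idea is the same: reject documents whose category the agent won't know how to route.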
## Best Practices
### Write for 3 AM
Your on-call engineer (and the AI agent) will use this documentation when they're tired and stressed. Be explicit:
- Use exact dashboard names, not "the main dashboard"
- Include actual commands with copy-paste-ready syntax
- Specify who to contact with Slack handles or team names
- List the specific metrics to check, not "look at the metrics"
### Start with Your Top 5 Alerts
Don't try to document everything at once. Start with:
- Alerts that page most frequently
- Incidents that take longest to resolve
- Issues that require specific tribal knowledge
- Alerts that new team members struggle with
- Anything you've explained more than twice
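If you have paging data, the prioritization above can even be computed. A rough sketch (the alert names and numbers are made up for illustration): rank alerts by total on-call cost, i.e. page frequency times average time-to-resolve, and document the top of the list first.

```python
# Hypothetical paging stats; in practice these would come from
# PagerDuty or your incident tracker.
alerts = [
    {"name": "HighPaymentLatency", "pages_per_month": 12, "avg_minutes_to_resolve": 45},
    {"name": "GatewayErrorRate", "pages_per_month": 20, "avg_minutes_to_resolve": 10},
    {"name": "DBConnectionSaturation", "pages_per_month": 3, "avg_minutes_to_resolve": 90},
]

def on_call_cost(alert):
    """Total minutes per month this alert costs the on-call rotation."""
    return alert["pages_per_month"] * alert["avg_minutes_to_resolve"]

# Document the costliest alerts first.
top = sorted(alerts, key=on_call_cost, reverse=True)[:5]
for a in top:
    print(a["name"], on_call_cost(a))
```

Note that a rarely-firing alert with a long resolution time (like the saturation alert here) can outrank a noisy but quickly-resolved one.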
### Include the 'Why'
Don't just document what to do—explain why:
- Why do we check this dashboard first?
- Why is this metric significant?
- Why do we escalate to this team?
This helps the agent make better decisions in novel situations.
### Keep It Current
Outdated documentation is worse than no documentation. After each incident:
- Update runbooks that didn't quite fit
- Remove steps that no longer apply
- Add discoveries from the investigation
- Mark deprecated services or dashboards
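One way to keep currency honest is to track review dates and flag anything that hasn't been looked at recently. A sketch, assuming a hypothetical `last_reviewed: YYYY-MM-DD` frontmatter field (not one of the fields shown in the examples above):

```python
from datetime import date, timedelta

def is_stale(last_reviewed: str, today: date, max_age_days: int = 90) -> bool:
    """Flag a document whose hypothetical last_reviewed date is older
    than max_age_days. Dates are ISO format, e.g. '2024-01-15'."""
    reviewed = date.fromisoformat(last_reviewed)
    return today - reviewed > timedelta(days=max_age_days)

print(is_stale("2024-01-15", today=date(2024, 6, 1)))  # True: well over 90 days old
print(is_stale("2024-05-20", today=date(2024, 6, 1)))  # False: reviewed 12 days ago
```

Running a check like this after each incident review turns "keep it current" from a good intention into a routine.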
### Link to Integrations
Reference your actual tools:
- Specific Datadog dashboard URLs
- GitHub repository links
- Grafana panel names
- Sentry project identifiers
The agent can use these to navigate directly to the right places.