Setting Up Your Production Context Graph
This guide walks you through creating and organizing your Production Context Graph documents.
Prerequisites
- Access to your Autoheal organization
- At least one integration connected (recommended, so you have something to document)
Accessing the Production Context Graph
From the Autoheal sidebar, open one of the pages listed below, then select a tab within it.
| Page | Tab | Purpose |
|---|---|---|
| Skills | Agent's instructions | Agents.md containing general context about your team, services, and procedures for alert and incident response |
| Skills | Integration Skills | How your team uses specific tools (Datadog, GitHub, etc.) |
| Skills | Alert Skills | Step-by-step procedures for specific alerts or common issues |
| Catalog | Organized by entity type | Entities such as services, teams, and people that Autoheal extracts, imports, or syncs from your integrations (Table view), and the relationships between them (Graph view) |
| Decision traces | Accepted root causes | |
| Decision traces | Published postmortems | |
| Decision traces | Learned memories | Learnings from past investigations: what happened, how it was found, and how to prevent it |
Creating Your First Document
Click the tab matching the type of document you want to create. For your first document, start with Alert Skills, which give the agent step-by-step procedures for specific alerts.
Click the + button to create a new document.
Select a template that fits your content:
- Agent's instructions - Overview of your services and environment
- Alert Skill - Procedure for a specific alert
- Integration Skill - How to use a specific tool
- Blank - Start from scratch
Use the markdown editor to write your document. The editor supports:
- Standard markdown formatting
- Code blocks with syntax highlighting
- Tables for structured information
- Live preview
Click Save. Your document is immediately available to the AI agent during investigations.
Document Categories
Alert Skills
Procedures for specific alerts. Include what the alert means, how to investigate, and how to resolve:
# High Payment Latency
## What This Means
Payment API p99 latency exceeded 500ms for 5+ minutes.
## Investigation Steps
1. Open "Payments Deep Dive" dashboard in Datadog
2. Check if latency is isolated to specific endpoints
3. Look for correlation with database latency
## Common Causes
- Database connection pool exhaustion
- Downstream payment provider issues
- Recent deployment regression
## Resolution
- If database: Scale API replicas or kill long-running queries
- If provider: Check status.stripe.com, enable fallback
- If deployment: Roll back with `kubectl rollout undo`
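Before pasting a draft into the editor, it can help to confirm that it has the same sections as the sample above. The following is a minimal sketch of such a check, not an Autoheal feature: the `missing_sections` helper and the `REQUIRED_SECTIONS` list are hypothetical and only mirror the headings used in the sample.

```python
import re

# Hypothetical checklist: the four section headings used in the sample Alert Skill.
REQUIRED_SECTIONS = [
    "What This Means",
    "Investigation Steps",
    "Common Causes",
    "Resolution",
]


def missing_sections(markdown_text, required=REQUIRED_SECTIONS):
    """Return the required section titles that have no matching '## ' heading."""
    # Collect level-2 headings, lowercased so the comparison ignores case.
    found = {
        match.group(1).strip().lower()
        for match in re.finditer(
            r"^##[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE
        )
    }
    return [title for title in required if title.lower() not in found]


draft = """# High Payment Latency
## What This Means
Payment API p99 latency exceeded 500ms for 5+ minutes.
## Investigation Steps
1. Open "Payments Deep Dive" dashboard in Datadog
"""

print(missing_sections(draft))  # ['Common Causes', 'Resolution']
```

If your team uses different section names, replace the list with your own; the check only compares heading text.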
Integration Skills
How your team uses specific tools—dashboards, queries, tagging conventions:
# How We Use Datadog
## Tagging Convention
- `env:production`, `env:staging`
- `service:<service-name>`
- `team:platform`, `team:payments`
## Useful Log Queries
- All errors: `status:error env:production`
- Payment failures: `service:payments-api @error.type:PaymentFailed`
## APM Services
- `payments-api-prod` (production payments)
- `user-service-prod` (production auth)
Memories
Learnings from past investigations—what happened, how you found it, how to prevent it:
# Memory: Payment Timeouts During Batch Processing
## What Happened
Payment latency spiked to 2s+ every weekday at 2pm PST.
## Root Cause
Batch job exhausting database connection pool.
## How We Found It
1. Noticed the pattern occurred only on weekdays, at the same time each day
2. Correlated with batch job schedule
3. Connection pool metrics showed saturation
## Prevention
- Added connection limits to batch job
- New alert: connection pool > 80%
Writing Effective Documents
Be Specific
The agent can't use vague guidance. Instead of "check the dashboard," write "open the 'Platform Team Overview' dashboard in Datadog."
Vague: Check the logs for errors.
Specific: Run this Datadog query: `service:payments-api status:error env:production`
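One way to catch vague guidance before saving is to scan a draft for generic phrases that carry no inline code. This is an illustrative sketch only: the `VAGUE_PATTERNS` list and the `flag_vague_lines` helper are assumptions made for this example, and the patterns should be tuned to your own writing habits.

```python
import re

# Hypothetical patterns for guidance that names no concrete dashboard, query, or command.
VAGUE_PATTERNS = [
    r"\bcheck the (dashboard|logs|metrics)\b",
    r"\blook into\b",
    r"\binvestigate as needed\b",
]


def flag_vague_lines(markdown_text):
    """Return (line_number, line) pairs that match a vague pattern and contain no inline code."""
    flagged = []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        has_code = "`" in line  # a backtick suggests a concrete query or command
        is_vague = any(
            re.search(pattern, line, flags=re.IGNORECASE) for pattern in VAGUE_PATTERNS
        )
        if is_vague and not has_code:
            flagged.append((number, line.strip()))
    return flagged


draft = """Check the logs for errors.
Run this Datadog query: `service:payments-api status:error env:production`
"""

print(flag_vague_lines(draft))  # [(1, 'Check the logs for errors.')]
```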
Include Commands
Copy-paste-ready commands help both the agent and tired on-call engineers:
# Roll back the payments-api deployment
kubectl rollout undo deployment/payments-api -n production
# Check current replica count
kubectl get deployment payments-api -n production
Explain Why
Don't just document what to do—explain why. This helps the agent make better decisions in novel situations:
"Check the batch job schedule first because time-correlated latency issues are often caused by scheduled jobs competing for database connections."
Name Names
Include specific contacts, channels, and escalation paths:
- Slack: #payments-oncall
- Escalation: @jane-smith (payments lead)
- PagerDuty: Platform → SRE → Engineering Manager
Recommended First Documents
Start with these two documents; they'll immediately improve investigation quality:
Create an Alert Skill for the alert that pages most often:
- What the alert means
- Step-by-step investigation procedure
- Common causes and resolutions
- When to escalate
Create an Integration Skill for your primary monitoring tool:
- Your tagging conventions
- Key dashboards and when to use them
- Useful log queries
- Service names in APM
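The two checklists above can be turned into starting skeletons that you then fill in by hand. The sketch below assumes nothing about Autoheal's built-in templates: the `OUTLINES` mapping and the `skeleton` function are hypothetical, and the section titles are simply paraphrased from the bullets above.

```python
# Hypothetical outlines, paraphrased from the two checklists above.
OUTLINES = {
    "alert": [
        "What the Alert Means",
        "Investigation Procedure",
        "Common Causes and Resolutions",
        "When to Escalate",
    ],
    "integration": [
        "Tagging Conventions",
        "Key Dashboards",
        "Useful Log Queries",
        "APM Service Names",
    ],
}


def skeleton(kind, title):
    """Build a markdown skeleton with a title and one placeholder per section."""
    lines = [f"# {title}", ""]
    for section in OUTLINES[kind]:
        lines += [f"## {section}", "TODO", ""]
    return "\n".join(lines)


print(skeleton("alert", "High Payment Latency"))
```

Paste the output into the markdown editor and replace each `TODO` with specifics: dashboard names, queries, commands, and contacts.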
LLM Review
When you save a document, Autoheal generates a review with suggestions for improvement. You'll see feedback like:
- Questions about missing information
- Suggestions for more specific commands
- Recommendations for better organization
Review the feedback and update your document as needed. This helps ensure your production context is as useful as possible.
Importing Existing Documentation
If you already have this kind of documentation in Confluence, Notion, or a Git repository:
- Export or copy the content as markdown
- Create a new document in the appropriate category
- Paste and adjust the formatting
- Add any missing context (specific dashboard names, commands, contacts)
Don't try to migrate everything at once. Start with your most critical documents, the ones that get used during real incidents, and expand from there based on the gaps you notice during investigations.
Verifying Your Setup
Test that your Production Context Graph is working:
- Go to Investigations
- Start a new investigation
- Ask about something you've documented, for example: "How should I investigate high payment latency?"
- The agent should reference your skill in its response
If the agent doesn't find your document, check that:
- The document was saved successfully
- The content includes the keywords you're asking about
- Your question uses the same terms the document uses to describe the issue
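For the keyword check in the list above, a rough word-overlap comparison can show which terms in your question never appear in the document. This is only a sketch for self-checking a draft. How the agent actually matches questions to documents is not described here, and the `STOP_WORDS` set and both helper functions are hypothetical.

```python
import re

# Small illustrative stop-word list (an assumption for this example); extend as needed.
STOP_WORDS = {
    "how", "should", "i", "the", "a", "an", "to", "is", "what", "do", "for", "of", "in",
}


def keywords(text):
    """Lowercase the text and return its alphanumeric words, minus stop words."""
    return {
        word
        for word in re.findall(r"[a-z0-9]+", text.lower())
        if word not in STOP_WORDS
    }


def missing_keywords(question, document):
    """Return question keywords that never appear in the document, sorted."""
    return sorted(keywords(question) - keywords(document))


doc = "# High Payment Latency\nPayment API p99 latency exceeded 500ms for 5+ minutes."
question = "How should I investigate high payment latency?"

print(missing_keywords(question, doc))  # ['investigate']
```

An empty result suggests the wording overlaps. A long list suggests rephrasing the question or adding the missing terms to the document.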
Next Steps
Learn how the agent uses your Production Context Graph and how it evolves over time.