Setting Up Your Knowledge Base

This guide walks you through creating and organizing your Knowledge Base documents.

Prerequisites

  • Access to your Autoheal organization
  • At least one integration connected (recommended, so you have something to document)

Accessing the Knowledge Base

From the Autoheal sidebar, click Knowledge. You'll see four tabs organizing your content:

| Tab | Purpose |
| --- | --- |
| Agent's Instructions | General context about your team, services, and procedures |
| Integrations | How your team uses specific tools (Datadog, GitHub, etc.) |
| Alert Runbooks | Step-by-step procedures for specific alerts |
| Memories | Learnings from past investigations |

Creating Your First Document

1. Select a Category Tab

Click the tab matching the type of document you want to create. For your first document, start with Agent's Instructions, which gives the agent general context about your environment.

2. Click New Document

Click the + button to create a new document.

3. Choose a Template

Select a template that fits your content:

  • Team Context - Overview of your services and environment
  • Alert Runbook - Procedure for a specific alert
  • Integration Guide - How to use a specific tool
  • Blank - Start from scratch

4. Write Your Content

Use the markdown editor to write your document. The editor supports:

  • Standard markdown formatting
  • Code blocks with syntax highlighting
  • Tables for structured information
  • Live preview

5. Save

Click Save. Your document is immediately available to the AI agent during investigations.

Document Categories

Agent's Instructions

General context that applies across investigations. Start here with a team overview document:

# Platform Team Context

## Our Services
- **payments-api**: Payment processing (github.com/acme/payments-api)
- **user-service**: Authentication (github.com/acme/user-service)
- **gateway**: API gateway, routes all external traffic

## Key Datadog Dashboards
- "Platform Team Overview" - First stop for any incident
- "Payments Deep Dive" - Payment-related investigations
- "Database Performance" - Slow query issues

## On-Call Contacts
- Primary: #platform-oncall in Slack
- Escalation: Page @platform-lead after 15 minutes

Alert Runbooks

Procedures for specific alerts. Include what the alert means, how to investigate, and how to resolve:

# High Payment Latency

## What This Means
Payment API p99 latency exceeded 500ms for 5+ minutes.

## Investigation Steps
1. Open "Payments Deep Dive" dashboard in Datadog
2. Check if latency is isolated to specific endpoints
3. Look for correlation with database latency

## Common Causes
- Database connection pool exhaustion
- Downstream payment provider issues
- Recent deployment regression

## Resolution
- If database: Scale API replicas or kill long-running queries
- If provider: Check status.stripe.com, enable fallback
- If deployment: Roll back with `kubectl rollout undo`

Integrations

How your team uses specific tools—dashboards, queries, tagging conventions:

# How We Use Datadog

## Tagging Convention
- `env:production`, `env:staging`
- `service:<service-name>`
- `team:platform`, `team:payments`

## Useful Log Queries
- All errors: `status:error env:production`
- Payment failures: `service:payments-api @error.type:PaymentFailed`

## APM Services
- `payments-api-prod` (production payments)
- `user-service-prod` (production auth)

Memories

Learnings from past investigations—what happened, how you found it, how to prevent it:

# Memory: Payment Timeouts During Batch Processing

## What Happened
Payment latency spiked to 2s+ every weekday at 2pm PST.

## Root Cause
Batch job exhausting database connection pool.

## How We Found It
1. Noticed the pattern occurred only on weekdays, at the same time each day
2. Correlated with batch job schedule
3. Connection pool metrics showed saturation

## Prevention
- Added connection limits to batch job
- New alert: connection pool > 80%

Writing Effective Documents

Be Specific

The agent can't use vague guidance. Instead of "check the dashboard," write "open the 'Platform Team Overview' dashboard in Datadog."

Vague: Check the logs for errors.

Specific: Run this Datadog query: service:payments-api status:error env:production
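One way to catch vague guidance before saving is a small local check on the draft. The sketch below is not an Autoheal feature: the phrase list and the `find_vague_lines` helper are hypothetical and only illustrate the idea of flagging lines that need a concrete dashboard name or query.

```python
import re

# Hypothetical phrase list; tune it to the wording your team tends to use.
VAGUE_PATTERNS = [
    r"\bcheck the (logs|dashboard|metrics)\b",
    r"\blook into\b",
    r"\bas needed\b",
]


def find_vague_lines(markdown_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that contain a vague phrase."""
    hits = []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        if any(re.search(p, line, flags=re.IGNORECASE) for p in VAGUE_PATTERNS):
            hits.append((number, line.strip()))
    return hits


# The two example lines from this section: only the vague one is flagged.
draft = """Check the logs for errors.
Run this Datadog query: service:payments-api status:error env:production
"""
print(find_vague_lines(draft))
```

A flagged line is only a prompt to rewrite it with a named dashboard, query, or command; the pattern list will not catch every vague sentence.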

Include Commands

Copy-paste-ready commands help both the agent and tired on-call engineers:

# Roll back the payments-api deployment
kubectl rollout undo deployment/payments-api -n production

# Check current replica count
kubectl get deployment payments-api -n production

Explain Why

Don't just document what to do—explain why. This helps the agent make better decisions in novel situations:

"Check the batch job schedule first because time-correlated latency issues are often caused by scheduled jobs competing for database connections."

Name Names

Include specific contacts, channels, and escalation paths:

  • Slack: #payments-oncall
  • Escalation: @jane-smith (payments lead)
  • PagerDuty: Platform → SRE → Engineering Manager

Start with these three documents—they'll immediately improve investigation quality:

1. Team Context Document

Create an Agent's Instructions document covering:

  • Your key services and what they do
  • Important Datadog dashboards and their purposes
  • On-call channels and escalation paths
  • Common investigation starting points

2. Your Most Frequent Alert

Create an Alert Runbook for the alert that pages most often:

  • What the alert means
  • Step-by-step investigation procedure
  • Common causes and resolutions
  • When to escalate

3. Integration Guide

Create an Integrations document for your primary monitoring tool:

  • Your tagging conventions
  • Key dashboards and when to use them
  • Useful log queries
  • Service names in APM
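The runbook checklist above can be turned into a quick local check before you paste a draft into the editor. This is a minimal sketch, not part of Autoheal: the `missing_sections` helper and its heading keywords are hypothetical, chosen to match the headings used in the sample runbook earlier in this guide.

```python
import re

# Hypothetical keywords, one group per item in the runbook checklist.
RECOMMENDED_SECTIONS = {
    "meaning": ["what this means"],
    "investigation": ["investigation"],
    "causes": ["common causes"],
    "resolution": ["resolution"],
    "escalation": ["escalat"],
}


def missing_sections(markdown_text: str) -> list[str]:
    """Return checklist items that have no matching markdown heading."""
    headings = [
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}[ \t]+(.+)$", markdown_text, flags=re.MULTILINE)
    ]
    missing = []
    for name, keywords in RECOMMENDED_SECTIONS.items():
        if not any(keyword in heading for heading in headings for keyword in keywords):
            missing.append(name)
    return missing


# Abbreviated copy of the sample runbook from this guide.
runbook = """# High Payment Latency

## What This Means
Payment API p99 latency exceeded 500ms for 5+ minutes.

## Investigation Steps
1. Open "Payments Deep Dive" dashboard in Datadog

## Common Causes
- Database connection pool exhaustion

## Resolution
- If deployment: Roll back with `kubectl rollout undo`
"""
print(missing_sections(runbook))
```

Run against the sample runbook, the check reports `escalation` as missing, which is a reminder to add a "when to escalate" section.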

LLM Review

When you save a document, Autoheal generates a review with suggestions for improvement. You'll see feedback like:

  • Questions about missing information
  • Suggestions for more specific commands
  • Recommendations for better organization

Review the feedback and update your document as needed. This helps ensure your knowledge base is as useful as possible.

Importing Existing Documentation

If you have existing runbooks in Confluence, Notion, or a Git repository:

  1. Export or copy the content as markdown
  2. Create a new document in the appropriate category
  3. Paste and adjust the formatting
  4. Add any missing context (specific dashboard names, commands, contacts)
Tip: Don't try to migrate everything at once. Start with your most critical runbooks, the ones that get used during real incidents. Expand from there based on what gaps you notice during investigations.
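For step 4 of the import list, a rough heuristic can point out which kind of context an imported runbook still lacks. The sketch below is an illustration under stated assumptions: the `missing_context` helper and its three regular expressions are hypothetical, and they only look for a quoted dashboard name, an inline-code command, and a `#channel` or `@handle` contact.

```python
import re

# Hypothetical heuristics for the three kinds of context named in step 4.
CONTEXT_CHECKS = {
    # A quoted name followed by the word "dashboard".
    "dashboard name": re.compile(r"[\"'][^\"']+[\"'] dashboard", re.IGNORECASE),
    # Any inline code span, which usually holds a command or query.
    "command": re.compile(r"`[^`]+`"),
    # A Slack-style channel or handle such as #payments-oncall or @jane-smith.
    "contact": re.compile(r"(?<!\w)[#@][\w-]+"),
}


def missing_context(markdown_text: str) -> list[str]:
    """Return the kinds of context that no heuristic found in the text."""
    return [
        name
        for name, pattern in CONTEXT_CHECKS.items()
        if not pattern.search(markdown_text)
    ]


# A pasted runbook that still needs specifics.
imported = """# High Payment Latency

1. Open the dashboard
2. Roll back the deployment
3. Escalate if it does not recover
"""
print(missing_context(imported))
```

The heuristics are deliberately loose, so treat an empty result as a hint rather than proof that the document is complete.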

Verifying Your Setup

Test that your knowledge base is working:

  1. Go to Investigations
  2. Start a new investigation
  3. Ask about something you've documented, for example: "How should I investigate high payment latency?"
  4. Confirm that the agent references your runbook in its response

If the agent doesn't find your document, check that:

  • The document was saved successfully
  • The content includes the keywords you're asking about
  • Your question uses wording similar to the document's (for example, the same alert or service name)
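To see which words in a test question do not appear in your document, you can compare the two locally. This sketch assumes nothing about how Autoheal actually retrieves documents; the `missing_keywords` helper and the stop-word list are hypothetical, and exact word matching is used only because it is simple to reason about.

```python
import re

# Small, hypothetical stop-word list; extend it as needed.
STOP_WORDS = {
    "how", "should", "i", "the", "a", "an", "to", "is",
    "in", "of", "for", "what", "do", "we",
}


def keywords(text: str) -> set[str]:
    """Lowercase the text and return its words, minus stop words."""
    words = re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())
    return {word for word in words if word not in STOP_WORDS}


def missing_keywords(question: str, document: str) -> set[str]:
    """Return question keywords that never appear in the document."""
    return keywords(question) - keywords(document)


question = "How should I investigate high payment latency?"
document = """# High Payment Latency

## Investigation Steps
1. Open "Payments Deep Dive" dashboard in Datadog
"""
print(missing_keywords(question, document))
```

Here the only unmatched word is `investigate`, because the document says "Investigation". A mismatch like this does not prove the agent will miss the document, but it shows where the question and the document use different wording.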

Next Steps

Knowledge Base Overview

Learn how the agent uses your knowledge base and how it evolves over time.