Getting Started
This guide will help you set up Autoheal and run your first AI-powered incident investigation.
Prerequisites
Before you begin, ensure you have:
- An Autoheal account (contact support@autoheal.ai for access)
- Admin role in your organization
- Credentials for at least one observability tool (Datadog, Grafana, etc.)
Step 1: Log In to Autoheal
Navigate to your Autoheal instance URL (e.g., https://<tenant>.autoheal.ai) and click Log In.
Sign in using your Email / Password or your organization's SSO provider.
Step 2: Connect Your First Integration
Integrations allow the AI agent to query your observability tools. Let's connect your first one.
From the sidebar, click Integrations.
Select the integration you want to connect. We recommend starting with your primary monitoring tool (e.g., Datadog, Grafana).
Provide the required API credentials. See the Integrations section for detailed setup instructions for each tool.
Tip: make an observability tool your first integration; it provides the richest context for incident investigation.
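If you are connecting Datadog, it can save a round trip to confirm your API key is valid before pasting it into the integration form. The snippet below is an optional pre-check using Datadog's public key-validation endpoint (it assumes the US1 site, `api.datadoghq.com`; other Datadog sites use a different host, and the `DD_API_KEY` variable name is just a local convention here):

```shell
# Optional sanity check: verify a Datadog API key before entering it
# in Autoheal. Skips gracefully if no key is exported in this shell.
if [ -n "${DD_API_KEY:-}" ]; then
  # Datadog's validate endpoint returns {"valid":true} for a good key.
  curl -s -H "DD-API-KEY: ${DD_API_KEY}" \
    "https://api.datadoghq.com/api/v1/validate"
  result="checked"
else
  result="skipped: DD_API_KEY not set"
  echo "$result"
fi
```

A `403` response or `{"valid":false}` means the key is wrong or revoked; fix it in Datadog before continuing with the Autoheal setup.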
Step 3: Set Up Your Knowledge Base
The Knowledge Base stores runbooks, procedures, and learnings that the AI agent can reference during investigations.
From the sidebar, click Knowledge.
Click New Document and select Runbook as the template.
Document a common incident type your team handles. Include:
- Symptoms and how to identify the issue
- Step-by-step remediation procedures
- Escalation paths if needed
Save your runbook. It's now available to the AI agent.
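As a sketch, a runbook following the structure above might look like this (the service name, thresholds, and steps are purely illustrative):

```markdown
# Runbook: Checkout Service High Latency

## Symptoms
- p95 latency above 2s on the checkout service
- Elevated 5xx rate on downstream payment calls

## Identification
1. Check the checkout service latency dashboard.
2. Compare the spike's start time against recent deployment timestamps.

## Remediation
1. If the spike correlates with a deployment, roll it back.
2. If latency persists, scale up the checkout service replicas.

## Escalation
- Page the payments on-call if the error rate stays above 5% for 10 minutes.
```

Concrete symptoms and step-by-step commands give the agent the most to work with; vague prose ("check the usual dashboards") is much harder for it to act on.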
Step 4: Start an Investigation
Now you're ready to use the AI OnCall Agent!
From the sidebar, click Investigations.
Type a description of the incident you want to investigate. For example:
High latency alerts firing on the checkout service.
Started around 2:30 PM UTC.
The AI agent will:
- Query your connected integrations for relevant data
- Search your knowledge base for related runbooks
- Analyze patterns and correlations
- Present findings and recommendations
Ask follow-up questions to dive deeper:
Can you check the database metrics during that time?
Are there any similar incidents in the past week?
Example Investigation
Here's what a typical investigation looks like:
User Query
"We're seeing 5xx errors spike on the payment service. Can you investigate?"
Agent Response
The agent will:
- Query Datadog/Grafana for error rates and latency metrics
- Pull recent logs from the payment service
- Check for recent deployments or changes
- Search the knowledge base for payment service runbooks
- Present a summary with potential root causes
Follow-up Questions
You can then ask:
- "What changed in the last deployment?"
- "Show me the database connection pool metrics"
- "Are other services affected?"
Next Steps
- Add more observability tools for richer investigations.
- Document your runbooks and procedures.
- Add your team and configure permissions.
- See all available integrations.
Getting Help
Need assistance? Contact our support team at support@autoheal.ai or check the detailed guides in this documentation.