# Tutorial: Incident Response Auto-Remediation
Build an intelligent incident response system that detects issues, diagnoses problems, attempts auto-remediation, and escalates to humans when needed.
**What you'll learn:**
- PagerDuty webhook integration
- Multi-stage diagnostic workflows
- Conditional remediation logic
- Human-in-the-loop approvals
- Post-incident analysis
**Time:** 30 minutes
## Prerequisites

- `aofctl` installed
- PagerDuty account (free tier works)
- Slack workspace (for notifications)
- Kubernetes cluster
- OpenAI or Anthropic API key
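Before starting, it's worth a quick sanity check that the pieces are in place. This sketch assumes `aofctl` has a `version` subcommand; if yours doesn't, any `aofctl` command that runs will confirm the install:

```bash
# Confirm the CLI and the cluster are reachable
aofctl version          # assumed subcommand; substitute any aofctl command
kubectl cluster-info

# Export a model provider key (either provider works for this tutorial)
export ANTHROPIC_API_KEY=your-key
export OPENAI_API_KEY=your-key
```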
## Architecture Overview

```text
PagerDuty Alert
      ↓
[Webhook Trigger]
      ↓
[Diagnostic Agent] ─→ Analyze logs, metrics, events
      ↓
[Severity Check]
      ├─→ Critical: Request Human Approval
      └─→ Non-Critical: Auto-Remediate
      ↓
[Remediation Agent] ─→ Fix the issue
      ↓
[Verify Fix]
      ↓
[Notify Slack] ─→ Report outcome
      ↓
[Close PagerDuty]
```
## Step 1: Create Diagnostic Agent

Create `diagnostic-agent.yaml`:
```yaml
apiVersion: aof.dev/v1
kind: Agent
metadata:
  name: incident-diagnostic
  labels:
    purpose: diagnostics
    team: sre
spec:
  model: anthropic:claude-3-5-sonnet-20241022
  model_config:
    temperature: 0.2  # Low temperature for deterministic analysis
    max_tokens: 3000
  instructions: |
    You are an expert SRE performing incident diagnostics.

    Your role:
    - Analyze the incident alert details
    - Check pod status, logs, and events
    - Identify root cause
    - Classify severity (critical, high, medium, low)
    - Recommend remediation steps

    Diagnostic process:
    1. Understand the alert (service, error, metrics)
    2. Check current state (kubectl get/describe)
    3. Review recent logs (kubectl logs)
    4. Check events (kubectl get events)
    5. Analyze patterns and correlations
    6. Determine root cause
    7. Assess impact and severity
    8. Recommend fix

    Output format:
    ```json
    {
      "severity": "critical|high|medium|low",
      "root_cause": "Brief description",
      "affected_components": ["pod-name", "service-name"],
      "impact": "User-facing impact description",
      "recommended_action": "Specific remediation step",
      "requires_approval": true|false,
      "confidence": 0.0-1.0
    }
    ```
  tools:
    - type: Shell
      config:
        allowed_commands:
          - kubectl
          - helm
        timeout_seconds: 60
    - type: MCP
      config:
        name: kubectl-mcp
        command: ["npx", "-y", "@modelcontextprotocol/server-kubectl"]
        env:
          KUBECONFIG: "${KUBECONFIG}"
    - type: HTTP
      config:
        # For checking service health
        timeout_seconds: 10
  memory:
    type: SQLite
    config:
      path: ./incident-diagnostics.db
```
## Step 2: Create Remediation Agent
Create `remediation-agent.yaml`:
```yaml
apiVersion: aof.dev/v1
kind: Agent
metadata:
  name: incident-remediation
  labels:
    purpose: remediation
    team: sre
spec:
  model: openai:gpt-4
  model_config:
    temperature: 0.1  # Very low - we want predictable fixes
    max_tokens: 2000
  instructions: |
    You are an expert SRE performing incident remediation.

    Your role:
    - Execute the recommended remediation action
    - Verify the fix worked
    - Document what was done
    - Roll back if the fix fails

    Available remediation actions:
    - Restart pods (kubectl rollout restart)
    - Scale deployments (kubectl scale)
    - Update resources (kubectl patch)
    - Clear stuck resources (kubectl delete pod)
    - Roll back deployments (kubectl rollout undo)

    Safety rules:
    - Always use --dry-run first for destructive ops
    - Verify current state before changes
    - Take snapshots of resources before modification
    - Monitor for 60 seconds after remediation
    - Roll back if health checks fail

    Output format:
    ```json
    {
      "action_taken": "Specific command executed",
      "result": "success|failed|partial",
      "verification": "Health check results",
      "rollback_needed": true|false,
      "logs": "Relevant output"
    }
    ```
  tools:
    - type: Shell
      config:
        allowed_commands:
          - kubectl
          - helm
        timeout_seconds: 120
    - type: MCP
      config:
        name: kubectl-mcp
        command: ["npx", "-y", "@modelcontextprotocol/server-kubectl"]
  memory:
    type: SQLite
    config:
      path: ./incident-remediation.db
```
## Step 3: Create Incident Response Flow
Create `incident-response-flow.yaml`:
```yaml
apiVersion: aof.dev/v1
kind: AgentFlow
metadata:
  name: incident-auto-response
spec:
  # Triggered by PagerDuty webhook
  trigger:
    type: Webhook
    config:
      path: /pagerduty/webhook
      methods: [POST]
      auth:
        type: Bearer
        token: ${PAGERDUTY_WEBHOOK_TOKEN}

  nodes:
    # 1. Parse PagerDuty alert
    - id: parse-alert
      type: Transform
      config:
        script: |
          # Extract incident details
          export INCIDENT_ID="${event.incident.id}"
          export INCIDENT_TITLE="${event.incident.title}"
          export INCIDENT_SERVICE="${event.incident.service.name}"
          export INCIDENT_URGENCY="${event.incident.urgency}"
          export INCIDENT_URL="${event.incident.html_url}"

          # Extract K8s context if available
          export K8S_NAMESPACE="${event.incident.custom_details.namespace:-default}"
          export K8S_RESOURCE="${event.incident.custom_details.resource}"

    # 2. Run diagnostics
    - id: diagnose
      type: Agent
      config:
        agent: incident-diagnostic
        input: |
          Incident: ${INCIDENT_TITLE}
          Service: ${INCIDENT_SERVICE}
          Urgency: ${INCIDENT_URGENCY}
          Namespace: ${K8S_NAMESPACE}
          Resource: ${K8S_RESOURCE}

          Diagnose this incident and recommend remediation.
        timeout_seconds: 180

    # 3. Notify Slack immediately
    - id: notify-diagnosis
      type: Slack
      config:
        channel: "#incidents"
        message: |
          🚨 **Incident Detected**

          **Incident**: ${INCIDENT_TITLE}
          **Service**: ${INCIDENT_SERVICE}
          **Severity**: ${diagnose.output.severity}
          **Root Cause**: ${diagnose.output.root_cause}
          **Impact**: ${diagnose.output.impact}

          **Recommended Action**: ${diagnose.output.recommended_action}

          **Status**: Analyzing...
          **Link**: ${INCIDENT_URL}

    # 4. Check whether the severity is critical
    - id: check-severity
      type: Conditional
      config:
        conditions:
          - name: is_critical
            expression: ${diagnose.output.severity} == "critical"
          - name: needs_approval
            expression: ${diagnose.output.requires_approval} == true

    # 5a. Request human approval for critical incidents
    - id: request-approval
      type: Slack
      config:
        channel: "#incidents"
        message: |
          ⚠️ **CRITICAL: Human Approval Required**

          **Incident**: ${INCIDENT_TITLE}
          **Root Cause**: ${diagnose.output.root_cause}
          **Proposed Fix**: ${diagnose.output.recommended_action}
          **Confidence**: ${diagnose.output.confidence}

          React with ✅ to approve auto-remediation
          React with ⏸️ to pause and investigate manually
          React with ❌ to skip auto-remediation

          cc: @oncall @sre-lead
        wait_for_reaction: true
        timeout_seconds: 600  # 10 minutes
      conditions:
        - from: check-severity
          when: is_critical == true OR needs_approval == true

    # 5b. Auto-proceed for non-critical incidents
    - id: auto-approve
      type: Transform
      config:
        script: export APPROVED=true
      conditions:
        - from: check-severity
          when: is_critical == false AND needs_approval == false

    # 6. Execute remediation
    - id: remediate
      type: Agent
      config:
        agent: incident-remediation
        input: |
          Execute the following remediation:

          Action: ${diagnose.output.recommended_action}
          Namespace: ${K8S_NAMESPACE}
          Resource: ${K8S_RESOURCE}

          Verify the fix works and monitor for 60 seconds.
        timeout_seconds: 300
      conditions:
        # Run if auto-approved OR human-approved
        - from: auto-approve
          when: APPROVED == true
        - from: request-approval
          when: reaction == "white_check_mark"

    # 7. Verify the fix worked
    - id: verify-fix
      type: Agent
      config:
        agent: incident-diagnostic
        input: |
          Verify the incident is resolved:

          Original Issue: ${INCIDENT_TITLE}
          Remediation: ${remediate.output.action_taken}

          Check if the problem is fixed.
        timeout_seconds: 120

    # 8. Handle failed remediation
    - id: remediation-failed
      type: Conditional
      config:
        condition: ${remediate.output.result} != "success"

    - id: rollback
      type: Agent
      config:
        agent: incident-remediation
        input: "Rollback the failed remediation: ${remediate.output.action_taken}"
      conditions:
        - from: remediation-failed
          when: true

    - id: escalate
      type: Slack
      config:
        channel: "#incidents"
        message: |
          🔴 **Auto-Remediation FAILED - Manual Intervention Required**

          **Incident**: ${INCIDENT_TITLE}
          **Attempted Fix**: ${remediate.output.action_taken}
          **Result**: ${remediate.output.result}
          **Rollback**: ${rollback.output.result}

          @oncall please investigate immediately
          **Link**: ${INCIDENT_URL}
      conditions:
        - from: remediation-failed
          when: true

    # 9. Success path - notify and close
    - id: notify-success
      type: Slack
      config:
        channel: "#incidents"
        message: |
          ✅ **Incident Auto-Resolved**

          **Incident**: ${INCIDENT_TITLE}
          **Root Cause**: ${diagnose.output.root_cause}
          **Fix Applied**: ${remediate.output.action_taken}
          **Verification**: ${verify-fix.output}
          **Resolution Time**: ${flow.duration_seconds}s

          **Status**: Resolved automatically
      conditions:
        - from: remediation-failed
          when: false

    # 10. Close the PagerDuty incident
    - id: close-pagerduty
      type: HTTP
      config:
        method: PUT
        url: https://api.pagerduty.com/incidents/${INCIDENT_ID}
        headers:
          Authorization: "Token token=${PAGERDUTY_API_KEY}"
          Content-Type: application/json
          # PagerDuty requires a valid user email on incident updates
          From: "${PAGERDUTY_FROM_EMAIL}"
        body: |
          {
            "incident": {
              "type": "incident_reference",
              "status": "resolved",
              "resolution": "Auto-resolved by AOF: ${remediate.output.action_taken}"
            }
          }
      conditions:
        - from: remediation-failed
          when: false

    # 11. Log the incident for later analysis
    - id: log-incident
      type: Transform
      config:
        script: |
          # Store incident data for post-mortem
          cat > /tmp/incident-${INCIDENT_ID}.json <<EOF
          {
            "incident_id": "${INCIDENT_ID}",
            "title": "${INCIDENT_TITLE}",
            "severity": "${diagnose.output.severity}",
            "root_cause": "${diagnose.output.root_cause}",
            "remediation": "${remediate.output.action_taken}",
            "result": "${remediate.output.result}",
            "duration_seconds": ${flow.duration_seconds},
            "auto_resolved": true,
            "timestamp": "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
          }
          EOF

  # Define execution flow
  connections:
    - from: parse-alert
      to: diagnose
    - from: diagnose
      to: notify-diagnosis
    - from: notify-diagnosis
      to: check-severity
    - from: check-severity
      to: request-approval
    - from: check-severity
      to: auto-approve
    - from: request-approval
      to: remediate
    - from: auto-approve
      to: remediate
    - from: remediate
      to: verify-fix
    - from: verify-fix
      to: remediation-failed
    - from: remediation-failed
      to: rollback
    - from: remediation-failed
      to: notify-success
    - from: rollback
      to: escalate
    - from: notify-success
      to: close-pagerduty
    - from: close-pagerduty
      to: log-incident
```
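Before deploying, it can be worth confirming the file parses as YAML at all; the nested `instructions` and `script` blocks make indentation easy to get wrong. A quick generic check, assuming Python with PyYAML is installed:

```bash
# Fails loudly on indentation or syntax errors
python3 -c "import yaml; yaml.safe_load(open('incident-response-flow.yaml')); print('valid YAML')"
```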
## Step 4: Configure PagerDuty

### Create the Webhook in PagerDuty

1. Go to **Integrations → Generic Webhooks (v3)**
2. Add the webhook URL: `https://your-domain.com/pagerduty/webhook`
3. Select events:
   - Incident Triggered
   - Incident Acknowledged
   - Incident Resolved
4. Copy the webhook token and set it alongside your API key:

```bash
export PAGERDUTY_WEBHOOK_TOKEN=your-token
export PAGERDUTY_API_KEY=your-api-key
```
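If you'd rather script this than click through the UI, PagerDuty's REST API exposes the same v3 webhook subscriptions. A sketch, where the service ID `PXXXXXX` is a placeholder for your own service:

```bash
# Create a v3 webhook subscription via the PagerDuty REST API
curl -X POST https://api.pagerduty.com/webhook_subscriptions \
  -H "Authorization: Token token=${PAGERDUTY_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "webhook_subscription": {
      "type": "webhook_subscription",
      "delivery_method": {
        "type": "http_delivery_method",
        "url": "https://your-domain.com/pagerduty/webhook"
      },
      "events": ["incident.triggered", "incident.acknowledged", "incident.resolved"],
      "filter": {"type": "service_reference", "id": "PXXXXXX"}
    }
  }'
```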
### Add Custom Fields

Add these custom fields to your PagerDuty service:

- `namespace` - K8s namespace
- `resource` - K8s resource (deployment/pod/service)
## Step 5: Deploy the System

```bash
# Deploy agents
aofctl apply -f diagnostic-agent.yaml
aofctl apply -f remediation-agent.yaml

# Deploy flow
aofctl apply -f incident-response-flow.yaml

# Start the flow
aofctl run agentflow incident-auto-response --daemon

# Verify it's running
aofctl describe agentflow incident-auto-response
```
## Step 6: Test the System

### Test 1: Simulate Non-Critical Incident

```bash
# Trigger a test PagerDuty incident
curl -X POST https://your-domain.com/pagerduty/webhook \
  -H "Authorization: Bearer ${PAGERDUTY_WEBHOOK_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "event": {
      "incident": {
        "id": "TEST001",
        "title": "High memory usage on api-deployment",
        "service": {
          "name": "API Service"
        },
        "urgency": "high",
        "html_url": "https://yourcompany.pagerduty.com/incidents/TEST001",
        "custom_details": {
          "namespace": "production",
          "resource": "deployment/api"
        }
      }
    }
  }'
```

**Expected flow:**
- Alert parsed
- Diagnostic agent analyzes (finds high memory pods)
- Slack notification sent
- Auto-approves (not critical)
- Remediation agent restarts pods
- Verification succeeds
- PagerDuty incident closed
- Success notification
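To watch the run progress through these stages, tail the flow logs; this assumes the `-f` follow flag shown for agent logs in Step 8 also applies to flows:

```bash
# Follow the flow execution live
aofctl logs agentflow incident-auto-response -f
```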
### Test 2: Critical Incident (Requires Approval)

```bash
curl -X POST https://your-domain.com/pagerduty/webhook \
  -H "Authorization: Bearer ${PAGERDUTY_WEBHOOK_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "event": {
      "incident": {
        "id": "TEST002",
        "title": "Database cluster down - all replicas failing",
        "service": {
          "name": "PostgreSQL"
        },
        "urgency": "high",
        "html_url": "https://yourcompany.pagerduty.com/incidents/TEST002",
        "custom_details": {
          "namespace": "production",
          "resource": "statefulset/postgres"
        }
      }
    }
  }'
```

**Expected flow:**
- Alert parsed
- Diagnostic agent classifies as CRITICAL
- Slack notification sent
- Approval requested (waits for reaction)
- Human approves ✅
- Remediation attempts fix
- Verification check
- Result notification
## Step 7: Add Post-Incident Analysis

Create `post-incident-agent.yaml`:

```yaml
apiVersion: aof.dev/v1
kind: Agent
metadata:
  name: post-incident-analyzer
spec:
  model: anthropic:claude-3-5-sonnet-20241022
  instructions: |
    You are an SRE performing post-incident analysis.

    Analyze incident logs and generate:
    - Timeline of events
    - Root cause analysis
    - Contributing factors
    - Remediation effectiveness
    - Recommendations to prevent recurrence

    Format as a markdown report suitable for a post-mortem doc.
  tools:
    - type: Shell
      config:
        allowed_commands: [cat, jq, grep]
```
Add these nodes to the flow (see the scheduling note below):

```yaml
# Daily post-incident report
- id: daily-analysis
  type: Agent
  config:
    agent: post-incident-analyzer
    input: "Analyze all incidents from the past 24 hours: /tmp/incident-*.json"

- id: send-report
  type: Slack
  config:
    channel: "#sre-postmortems"
    message: ${daily-analysis.output}
```
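The webhook trigger only fires when an incident arrives, so these two nodes need their own daily trigger. One minimal option, assuming you put them in a separate flow (`daily-postmortem` here is a hypothetical name) and that `aofctl run agentflow` supports one-shot runs, is a cron entry on the host running AOF:

```bash
# crontab entry: run the post-incident report every day at 08:00 UTC
0 8 * * * aofctl run agentflow daily-postmortem
```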
## Step 8: Monitor the System

```bash
# View all incident responses
aofctl logs agentflow incident-auto-response

# Check overall flow status
aofctl describe agentflow incident-auto-response

# Check the success rate specifically
aofctl describe agentflow incident-auto-response | grep "success_rate"

# View diagnostic logs
aofctl logs agent incident-diagnostic -f

# View remediation logs
aofctl logs agent incident-remediation -f
```
## Production Best Practices

### 1. Add Rate Limiting

```yaml
spec:
  rate_limit:
    max_concurrent: 3      # Max 3 incidents at once
    queue_size: 10         # Queue up to 10
    timeout_seconds: 1800  # 30 min max per incident
```
### 2. Add Monitoring

```yaml
- id: track-metrics
  type: HTTP
  config:
    method: POST
    url: https://metrics.yourcompany.com/incidents
    body: |
      {
        "incident_id": "${INCIDENT_ID}",
        "duration": ${flow.duration_seconds},
        "severity": "${diagnose.output.severity}",
        "auto_resolved": ${remediate.output.result == "success"}
      }
```
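Before wiring this node in, it can help to confirm the endpoint accepts the payload shape; `metrics.yourcompany.com` is the placeholder from the snippet above:

```bash
# Send a test payload to the metrics endpoint
curl -X POST https://metrics.yourcompany.com/incidents \
  -H "Content-Type: application/json" \
  -d '{"incident_id": "TEST001", "duration": 42, "severity": "low", "auto_resolved": true}'
```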
### 3. Add Runbook Integration

```yaml
- id: fetch-runbook
  type: HTTP
  config:
    method: GET
    url: https://runbooks.yourcompany.com/api/${INCIDENT_SERVICE}

- id: diagnose-with-runbook
  type: Agent
  config:
    agent: incident-diagnostic
    input: |
      Incident: ${INCIDENT_TITLE}
      Runbook Steps: ${fetch-runbook.output}

      Follow runbook for diagnosis.
```
## Troubleshooting

### Diagnostics timeout

```yaml
# Increase the timeout
timeout_seconds: 300  # 5 minutes

# Or split the diagnosis into multiple steps
```
### Remediation fails

```bash
# Check agent logs
aofctl logs agent incident-remediation --tail 100

# Verify kubectl access
kubectl cluster-info

# Test remediation manually
aofctl agent exec incident-remediation "Check pod status in production"
```
### Approval reactions not working

```bash
# Check the Slack integration
aofctl logs agentflow incident-auto-response | grep slack

# Verify bot scopes include reactions:read
```
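A quick way to check which scopes the bot token actually has is Slack's `auth.test` endpoint; Slack echoes the granted scopes in the `x-oauth-scopes` response header:

```bash
# Inspect the scopes granted to the bot token
curl -si -H "Authorization: Bearer ${SLACK_BOT_TOKEN}" \
  https://slack.com/api/auth.test | grep -i "x-oauth-scopes"
```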
## Next Steps
- AgentFlow Reference - Advanced patterns
- Example Flows - More automation examples
- Production Deployment - Scale to handle all incidents
🎉 You've built an intelligent incident response system! Your on-call just got a lot easier.