Incident Management
SolidPing automatically creates, tracks, and resolves incidents based on check results. This page explains how the incident system works and how to configure it.
How Incidents Work
Check Fails → Threshold Reached → Incident Created → Notifications Sent
     ↓                                                       ↓
Check Recovers ←──────────────────────────────────── Incident Resolved
Incident Lifecycle
- Detection - A check starts failing
- Threshold - Consecutive failures reach `incident_threshold`
- Creation - An incident is created and notifications are sent
- Escalation - If failures reach `escalation_threshold`, escalation notifications are sent
- Recovery - Check succeeds `recovery_threshold` consecutive times
- Resolution - Incident is resolved and resolution notifications are sent
Incident States
| State | Description |
|---|---|
| `active` | Incident is ongoing, check is failing |
| `resolved` | Check has recovered, incident closed |
Thresholds
Configure thresholds per check to control when incidents are created:
Incident Threshold
Number of consecutive failures before creating an incident.
incident_threshold: 2 # Create incident after 2 consecutive failures
Default: 1 (incident created on first failure)
Use cases:
- Set to `1` for critical services that need immediate alerting
- Set to `2-3` for services with occasional transient failures
- Set higher for non-critical checks to reduce noise
Escalation Threshold
Number of consecutive failures before escalating the incident.
escalation_threshold: 5 # Escalate after 5 consecutive failures
Default: 3
Escalation sends additional notifications when an issue persists. Use it to page on-call engineers during prolonged outages.
Recovery Threshold
Number of consecutive successes before resolving an incident.
recovery_threshold: 2 # Resolve after 2 consecutive successes
Default: 1
Use cases:
- Set to `1` for quick resolution notifications
- Set to `2-3` to avoid false recoveries during flapping
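To see why a higher recovery threshold damps flapping, the following sketch counts how many times an incident would be resolved for a given sequence of check results. It is illustrative only: the function name and the boolean encoding of results are assumptions, not SolidPing's implementation, and the incident threshold is fixed at 1 to keep the example small.

```python
def count_resolutions(results, recovery_threshold=1):
    """Count incident resolutions for a sequence of check results.

    results: iterable of booleans, True meaning the check succeeded.
    Assumes an incident opens on the first failure (incident_threshold = 1).
    """
    active = False     # is an incident currently open?
    successes = 0      # consecutive successes while the incident is open
    resolutions = 0
    for ok in results:
        if not ok:
            active = True
            successes = 0
        elif active:
            successes += 1
            if successes >= recovery_threshold:
                active = False
                successes = 0
                resolutions += 1
    return resolutions


# A flapping check: fail, pass, fail, pass, fail, pass, pass
flapping = [False, True, False, True, False, True, True]
print(count_resolutions(flapping, recovery_threshold=1))  # 3
print(count_resolutions(flapping, recovery_threshold=2))  # 1
```

With a threshold of 1, every isolated success closes the incident and the next failure reopens it. With a threshold of 2, the incident stays open until the check is stable.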
Notification Events
| Event | When | Description |
|---|---|---|
| `incident.created` | Threshold reached | Initial alert |
| `incident.escalated` | Escalation threshold | Prolonged outage |
| `incident.resolved` | Recovery threshold | Service recovered |
Notification Flow Example
With the default thresholds (incident_threshold: 1, escalation_threshold: 3, recovery_threshold: 1):
Failure 1 → incident.created sent
Failure 2 → (no notification)
Failure 3 → incident.escalated sent
Failure 4 → (no notification)
Success 1 → incident.resolved sent
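The flow above can be reproduced with a small state machine. This is a minimal sketch of the threshold logic as this page describes it, not SolidPing's actual code: the `simulate` function, its parameter names, and the boolean result encoding are assumptions made for illustration.

```python
def simulate(results, incident_threshold=1, escalation_threshold=3,
             recovery_threshold=1):
    """Return the notification event (or None) emitted for each check result.

    results: iterable of booleans, True meaning the check succeeded.
    """
    failures = 0       # consecutive failures
    successes = 0      # consecutive successes while an incident is active
    active = False
    escalated = False
    events = []
    for ok in results:
        event = None
        if ok:
            failures = 0
            if active:
                successes += 1
                if successes >= recovery_threshold:
                    active = escalated = False
                    successes = 0
                    event = "incident.resolved"
        else:
            successes = 0
            failures += 1
            if not active and failures >= incident_threshold:
                active = True
                event = "incident.created"
            elif active and not escalated and failures >= escalation_threshold:
                escalated = True
                event = "incident.escalated"
        events.append(event)
    return events


# Four failures followed by one success, with the default thresholds.
for n, event in enumerate(simulate([False, False, False, False, True]), 1):
    print(n, event or "(no notification)")
```

Passing other threshold values to `simulate` is a quick way to preview how a configuration change would alter the notification sequence before applying it to a check.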
Incident Details
Each incident includes:
- UID - Unique identifier
- Check - Associated check details
- Status - Current state (active/resolved)
- Started At - When the incident was created
- Resolved At - When the incident was resolved (if applicable)
- Failure Count - Number of consecutive failures
- Events - Timeline of state changes
Events
SolidPing logs events for audit and debugging:
| Event Type | Description |
|---|---|
| `check.created` | New check added |
| `check.updated` | Check configuration changed |
| `check.deleted` | Check removed |
| `incident.created` | New incident |
| `incident.escalated` | Incident escalated |
| `incident.resolved` | Incident resolved |
| `notification.queued` | Notification scheduled |
| `notification.sent` | Notification delivered |
| `notification.failed` | Notification failed |
API Endpoints
List Incidents
GET /api/v1/orgs/{org}/incidents
Query parameters:
- `status` - Filter by status: `active`, `resolved`
- `check_uid` - Filter by check
- `limit` - Number of results
- `offset` - Pagination offset
Get Incident Details
GET /api/v1/orgs/{org}/incidents/{uid}
Get Incident Events
GET /api/v1/orgs/{org}/incidents/{uid}/events
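As a rough illustration of the List Incidents endpoint, the helper below builds a request URL from the documented query parameters. The host name and organization slug are hypothetical, and this page does not describe authentication or the response format, so the sketch stops at URL construction.

```python
from urllib.parse import quote, urlencode


def incidents_url(base_url, org, status=None, check_uid=None,
                  limit=None, offset=None):
    """Build the List Incidents URL, omitting unset query parameters."""
    params = {
        "status": status,
        "check_uid": check_uid,
        "limit": limit,
        "offset": offset,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url.rstrip('/')}/api/v1/orgs/{quote(org, safe='')}/incidents"
    return f"{url}?{query}" if query else url


# Hypothetical host and organization, for illustration only.
print(incidents_url("https://solidping.example.com", "acme",
                    status="active", limit=20))
# https://solidping.example.com/api/v1/orgs/acme/incidents?status=active&limit=20
```

The resulting URL can be passed to any HTTP client, along with whatever credentials your deployment requires.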
Best Practices
Reduce Alert Fatigue
- Tune thresholds - Set `incident_threshold: 2` for checks with occasional transient failures
- Use recovery threshold - Set `recovery_threshold: 2` to avoid alerts during flapping
- Group related checks - Use tags or naming conventions to organize checks
Effective Escalation
- Set meaningful escalation thresholds - 3-5 consecutive failures typically indicate a real issue
- Configure escalation notifications - Route escalations to different channels (e.g., PagerDuty)
- Review escalation frequency - If escalations happen often, investigate root causes
Incident Response
- Acknowledge incidents - Mark incidents as acknowledged to prevent duplicate alerts
- Document resolutions - Add notes about what caused the incident and how it was resolved
- Review incident history - Use incident data to identify recurring issues
Example Configuration
checks:
  - name: Production API
    url: https://api.example.com/health
    period: 30s
    timeout: 10s
    incident_threshold: 2     # Alert after 2 failures
    escalation_threshold: 6   # Escalate after 3 minutes of downtime
    recovery_threshold: 2     # Require 2 successes to resolve
  - name: Background Worker
    url: tcp://worker.internal:8080
    period: 60s
    timeout: 30s
    incident_threshold: 3     # More tolerance for worker
    escalation_threshold: 10  # Escalate after 10 minutes
    recovery_threshold: 1     # Quick resolution is fine
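The time comments in this configuration follow from multiplying a threshold by the check period. The sketch below makes that arithmetic explicit. It assumes checks run exactly once per period and ignores the timeout, so treat the results as approximations; `seconds_until` is a hypothetical helper, not part of SolidPing.

```python
def seconds_until(threshold, period_seconds):
    """Approximate time for `threshold` consecutive failed checks to accumulate."""
    return threshold * period_seconds


# Production API: period 30s
print(seconds_until(2, 30))    # 60  -> incident after about 1 minute
print(seconds_until(6, 30))    # 180 -> escalation after about 3 minutes

# Background Worker: period 60s
print(seconds_until(3, 60))    # 180 -> incident after about 3 minutes
print(seconds_until(10, 60))   # 600 -> escalation after about 10 minutes
```

The actual delay also depends on when the outage begins relative to the check schedule, so the first failure may be observed up to one period after the service goes down.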
Metrics
SolidPing tracks incident metrics:
- MTTR (Mean Time To Recovery) - Average time to resolve incidents
- MTTA (Mean Time To Acknowledge) - Average time to acknowledge
- Incident Count - Number of incidents over time
- Availability - Uptime percentage based on incident duration
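This page does not spell out how these metrics are computed. As a rough sketch, MTTR can be taken as the mean of `resolved_at - started_at` over resolved incidents, and availability as the share of a time window not covered by incidents. Both formulas, and the tuple representation of an incident used here, are assumptions for illustration and may differ from SolidPing's own calculations.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean duration of resolved incidents as a timedelta (None if there are none).

    incidents: list of (started_at, resolved_at) tuples; resolved_at may be None.
    """
    durations = [end - start for start, end in incidents if end is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def availability(incidents, window_start, window_end):
    """Percentage of the window not covered by incidents (assumes no overlaps)."""
    down = timedelta()
    for start, end in incidents:
        end = end or window_end                  # incident still active
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > timedelta():
            down += overlap
    return 100.0 * (1 - down / (window_end - window_start))


day = datetime(2024, 1, 1)
incidents = [
    (day + timedelta(hours=1), day + timedelta(hours=1, minutes=10)),
    (day + timedelta(hours=5), day + timedelta(hours=5, minutes=20)),
]
print(mttr(incidents))                                                  # 0:15:00
print(round(availability(incidents, day, day + timedelta(days=1)), 2))  # 97.92
```

Two incidents of 10 and 20 minutes give an MTTR of 15 minutes, and 30 minutes of downtime in a 24-hour window corresponds to roughly 97.92% availability.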