Incident Management

SolidPing automatically creates, tracks, and resolves incidents based on check results. This page explains how the incident system works and how to configure it.

How Incidents Work

Check Fails → Threshold Reached → Incident Created → Notifications Sent
     ↓                                    ↓
Check Recovers ──────────────────→ Incident Resolved → Notifications Sent

Incident Lifecycle

  1. Detection - A check starts failing
  2. Threshold - Consecutive failures reach incident_threshold
  3. Creation - An incident is created and notifications are sent
  4. Escalation - If failures reach escalation_threshold, escalation notifications are sent
  5. Recovery - Check succeeds recovery_threshold consecutive times
  6. Resolution - Incident is resolved and resolution notifications are sent
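
The three thresholds that drive this lifecycle are configured per check. A minimal sketch using the keys covered in the sections below:

incident_threshold: 2    # steps 2-3: create the incident after 2 consecutive failures
escalation_threshold: 5  # step 4: escalate after 5 consecutive failures
recovery_threshold: 2    # steps 5-6: resolve after 2 consecutive successes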

Incident States

State      Description
active     Incident is ongoing, check is failing
resolved   Check has recovered, incident closed

Thresholds

Configure thresholds per check to control when incidents are created, escalated, and resolved:

Incident Threshold

Number of consecutive failures before creating an incident.

incident_threshold: 2  # Create incident after 2 consecutive failures

Default: 1 (incident created on first failure)

Use cases:

  • Set to 1 for critical services that need immediate alerting
  • Set to 2-3 for services with occasional transient failures
  • Set higher for non-critical checks to reduce noise
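
For example, two checks with different tolerances (the names and URLs below are illustrative):

checks:
  - name: Payments API            # critical: alert immediately
    url: https://payments.example.com/health
    incident_threshold: 1
  - name: Internal Dashboard      # occasionally flaky: tolerate transient failures
    url: https://dashboard.internal.example.com/health
    incident_threshold: 3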

Escalation Threshold

Number of consecutive failures before escalating the incident.

escalation_threshold: 5  # Escalate after 5 consecutive failures

Default: 3

Escalation sends additional notifications when an issue persists. Use it to page on-call engineers during prolonged outages.
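
Because failures accumulate one per check run, the time until escalation is roughly escalation_threshold × period. For example:

period: 30s
escalation_threshold: 6  # escalates after roughly 6 × 30s = 3 minutes of failures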

Recovery Threshold

Number of consecutive successes before resolving an incident.

recovery_threshold: 2  # Resolve after 2 consecutive successes

Default: 1

Use cases:

  • Set to 1 for quick resolution notifications
  • Set to 2-3 to avoid false recoveries during flapping
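
For example, with recovery_threshold: 2, a single success during flapping does not resolve the incident, because the successes must be consecutive:

Failure → incident remains active
Success → (1 consecutive success, not resolved)
Failure → success streak reset, incident still active
Success → (1 consecutive success, not resolved)
Success → incident.resolved sent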

Notification Events

Event                When                   Description
incident.created     Threshold reached      Initial alert
incident.escalated   Escalation threshold   Prolonged outage
incident.resolved    Recovery threshold     Service recovered

Notification Flow Example

With the default thresholds (incident_threshold: 1, escalation_threshold: 3, recovery_threshold: 1):

Failure 1 → incident.created sent
Failure 2 → (no notification)
Failure 3 → incident.escalated sent
Failure 4 → (no notification)
Success 1 → incident.resolved sent
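
For comparison, with incident_threshold: 2, escalation_threshold: 5, and recovery_threshold: 2:

Failure 1 → (no notification)
Failure 2 → incident.created sent
Failure 3 → (no notification)
Failure 4 → (no notification)
Failure 5 → incident.escalated sent
Success 1 → (no notification)
Success 2 → incident.resolved sent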

Incident Details

Each incident includes:

  • UID - Unique identifier
  • Check - Associated check details
  • Status - Current state (active/resolved)
  • Started At - When the incident was created
  • Resolved At - When the incident was resolved (if applicable)
  • Failure Count - Number of consecutive failures
  • Events - Timeline of state changes

Events

SolidPing logs events for audit and debugging:

Event Type            Description
check.created         New check added
check.updated         Check configuration changed
check.deleted         Check removed
incident.created      New incident
incident.escalated    Incident escalated
incident.resolved     Incident resolved
notification.queued   Notification scheduled
notification.sent     Notification delivered
notification.failed   Notification failed

API Endpoints

List Incidents

GET /api/v1/orgs/{org}/incidents

Query parameters:

  • status - Filter by status: active, resolved
  • check_uid - Filter by check
  • limit - Maximum number of results to return
  • offset - Pagination offset
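
For example, to fetch the first page of active incidents:

GET /api/v1/orgs/{org}/incidents?status=active&limit=20&offset=0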

Get Incident Details

GET /api/v1/orgs/{org}/incidents/{uid}

Get Incident Events

GET /api/v1/orgs/{org}/incidents/{uid}/events

Best Practices

Reduce Alert Fatigue

  1. Tune thresholds - Set incident_threshold: 2 for checks with occasional transient failures
  2. Use recovery threshold - Set recovery_threshold: 2 to avoid alerts during flapping
  3. Group related checks - Use tags or naming conventions to organize checks

Effective Escalation

  1. Set meaningful escalation thresholds - 3-5 failures typically indicates a real issue
  2. Configure escalation notifications - Route escalations to different channels (e.g., PagerDuty); see the sketch after this list
  3. Review escalation frequency - If escalations fire often, investigate the root causes
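
A hypothetical sketch of routing by event type (the notifications keys below are illustrative, not SolidPing's documented configuration); it only shows the idea of sending incident.escalated to a paging channel while routine events go to chat:

notifications:                      # hypothetical structure, for illustration only
  - channel: slack-alerts
    events: [incident.created, incident.resolved]
  - channel: pagerduty-oncall
    events: [incident.escalated]    # page on-call only for prolonged outages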

Incident Response

  1. Acknowledge incidents - Mark incidents as acknowledged to prevent duplicate alerts
  2. Document resolutions - Add notes about what caused the incident and how it was resolved
  3. Review incident history - Use incident data to identify recurring issues

Example Configuration

checks:
  - name: Production API
    url: https://api.example.com/health
    period: 30s
    timeout: 10s
    incident_threshold: 2     # Alert after 2 failures
    escalation_threshold: 6   # Escalate after 3 minutes of downtime
    recovery_threshold: 2     # Require 2 successes to resolve

  - name: Background Worker
    url: tcp://worker.internal:8080
    period: 60s
    timeout: 30s
    incident_threshold: 3     # More tolerance for worker
    escalation_threshold: 10  # Escalate after 10 minutes
    recovery_threshold: 1     # Quick resolution is fine

Metrics

SolidPing tracks incident metrics:

  • MTTR (Mean Time To Recovery) - Average time to resolve incidents
  • MTTA (Mean Time To Acknowledge) - Average time to acknowledge
  • Incident Count - Number of incidents over time
  • Availability - Uptime percentage based on incident duration
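
For example, availability is the monitoring window minus total incident duration, divided by the window:

Availability = (window − total incident duration) / window
A single 45-minute incident in a 30-day month (43,200 minutes):
(43,200 − 45) / 43,200 ≈ 99.90%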