Incident Management

SolidPing automatically creates, tracks, and resolves incidents based on check results. This page explains how the incident system works and how to configure it.

How Incidents Work

Check Fails → Threshold Reached → Incident Created → Notifications Sent
     ↓                                    ↓
Check Recovers ──────────────────→ Incident Resolved → Notifications Sent

Incident Lifecycle

  1. Detection - A check starts failing
  2. Threshold - Consecutive failures reach incident_threshold
  3. Creation - An incident is created and notifications are sent
  4. Escalation - If failures reach escalation_threshold, escalation notifications are sent
  5. Recovery - Check succeeds recovery_threshold consecutive times
  6. Resolution - Incident is resolved and resolution notifications are sent
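
The three thresholds that drive this lifecycle are configured per check. A minimal sketch using the keys covered in the sections below:

incident_threshold: 2    # steps 2-3: create the incident after 2 consecutive failures
escalation_threshold: 5  # step 4: escalate after 5 consecutive failures
recovery_threshold: 2    # steps 5-6: resolve after 2 consecutive successes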

Incident States

State      Description
active     Incident is ongoing, check is failing
resolved   Check has recovered, incident closed

Thresholds

Configure thresholds per check to control when incidents are created, escalated, and resolved:

Incident Threshold

Number of consecutive failures before creating an incident.

incident_threshold: 2  # Create incident after 2 consecutive failures

Default: 1 (incident created on first failure)

Use cases:

  • Set to 1 for critical services that need immediate alerting
  • Set to 2-3 for services with occasional transient failures
  • Set higher for non-critical checks to reduce noise
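
For example, two checks with different tolerances (the names and URLs below are illustrative):

checks:
  - name: Payments API            # critical: alert immediately
    url: https://payments.example.com/health
    incident_threshold: 1
  - name: Internal Dashboard      # occasionally flaky: tolerate transient failures
    url: https://dashboard.internal.example.com/health
    incident_threshold: 3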

Escalation Threshold

Number of consecutive failures before escalating the incident.

escalation_threshold: 5  # Escalate after 5 consecutive failures

Default: 3

Escalation sends additional notifications when an issue persists. Use it to page on-call engineers during prolonged outages.
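
Because failures accumulate one per check run, the time until escalation is roughly escalation_threshold × period. For example:

period: 30s
escalation_threshold: 6  # escalates after roughly 6 × 30s = 3 minutes of failures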

Recovery Threshold

Number of consecutive successes before resolving an incident.

recovery_threshold: 2  # Resolve after 2 consecutive successes

Default: 1

Use cases:

  • Set to 1 for quick resolution notifications
  • Set to 2-3 to avoid false recoveries during flapping
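
For example, with recovery_threshold: 2, a single success during flapping does not resolve the incident, because the successes must be consecutive:

Failure → incident remains active
Success → (1 consecutive success, not resolved)
Failure → success streak reset, incident still active
Success → (1 consecutive success, not resolved)
Success → incident.resolved sent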

Notification Events

Event                When                   Description
incident.created     Threshold reached      Initial alert
incident.escalated   Escalation threshold   Prolonged outage
incident.resolved    Recovery threshold     Service recovered

Notification Flow Example

With the default thresholds (incident_threshold: 1, escalation_threshold: 3, recovery_threshold: 1):

Failure 1 → incident.created sent
Failure 2 → (no notification)
Failure 3 → incident.escalated sent
Failure 4 → (no notification)
Success 1 → incident.resolved sent
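
For comparison, with incident_threshold: 2, escalation_threshold: 5, and recovery_threshold: 2:

Failure 1 → (no notification)
Failure 2 → incident.created sent
Failure 3 → (no notification)
Failure 4 → (no notification)
Failure 5 → incident.escalated sent
Success 1 → (no notification)
Success 2 → incident.resolved sent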

Incident Details

Each incident includes:

  • UID - Unique identifier
  • Check - Associated check details
  • Status - Current state (active/resolved)
  • Started At - When the incident was created
  • Resolved At - When the incident was resolved (if applicable)
  • Failure Count - Number of consecutive failures
  • Events - Timeline of state changes

Events

SolidPing logs events for audit and debugging:

Event Type            Description
check.created         New check added
check.updated         Check configuration changed
check.deleted         Check removed
incident.created      New incident
incident.escalated    Incident escalated
incident.resolved     Incident resolved
notification.queued   Notification scheduled
notification.sent     Notification delivered
notification.failed   Notification failed

API Endpoints

List Incidents

GET /api/v1/orgs/{org}/incidents

Query parameters:

  • status - Filter by status: active, resolved
  • check_uid - Filter by check
  • limit - Maximum number of results to return
  • offset - Pagination offset
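
For example, to fetch the first page of active incidents:

GET /api/v1/orgs/{org}/incidents?status=active&limit=20&offset=0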

Get Incident Details

GET /api/v1/orgs/{org}/incidents/{uid}

Get Incident Events

GET /api/v1/orgs/{org}/incidents/{uid}/events

Best Practices

Reduce Alert Fatigue

  1. Tune thresholds - Set incident_threshold: 2 for checks with occasional transient failures
  2. Use recovery threshold - Set recovery_threshold: 2 to avoid alerts during flapping
  3. Group related checks - Use tags or naming conventions to organize checks

Effective Escalation

  1. Set meaningful escalation thresholds - 3-5 failures typically indicates a real issue
  2. Configure escalation notifications - Route escalations to different channels (e.g., PagerDuty); see the sketch after this list
  3. Review escalation frequency - If escalations fire often, investigate the root causes
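
A hypothetical sketch of routing by event type (the notifications keys below are illustrative, not SolidPing's documented configuration); it only shows the idea of sending incident.escalated to a paging channel while routine events go to chat:

notifications:                      # hypothetical structure, for illustration only
  - channel: slack-alerts
    events: [incident.created, incident.resolved]
  - channel: pagerduty-oncall
    events: [incident.escalated]    # page on-call only for prolonged outages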

Incident Response

  1. Acknowledge incidents - Mark incidents as acknowledged to prevent duplicate alerts
  2. Document resolutions - Add notes about what caused the incident and how it was resolved
  3. Review incident history - Use incident data to identify recurring issues

Example Configuration

checks:
  - name: Production API
    url: https://api.example.com/health
    period: 30s
    timeout: 10s
    incident_threshold: 2     # Alert after 2 failures
    escalation_threshold: 6   # Escalate after 3 minutes of downtime
    recovery_threshold: 2     # Require 2 successes to resolve

  - name: Background Worker
    url: tcp://worker.internal:8080
    period: 60s
    timeout: 30s
    incident_threshold: 3     # More tolerance for worker
    escalation_threshold: 10  # Escalate after 10 minutes
    recovery_threshold: 1     # Quick resolution is fine

Metrics

SolidPing tracks incident metrics:

  • MTTR (Mean Time To Recovery) - Average time to resolve incidents
  • MTTA (Mean Time To Acknowledge) - Average time to acknowledge
  • Incident Count - Number of incidents over time
  • Availability - Uptime percentage based on incident duration
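
For example, availability is the monitoring window minus total incident duration, divided by the window:

Availability = (window − total incident duration) / window
A single 45-minute incident in a 30-day month (43,200 minutes):
(43,200 − 45) / 43,200 ≈ 99.90%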