Escalation policies

Define who gets paged first, who gets escalated to, and how long the system waits before climbing the chain, so that no alert sits unattended.

An escalation policy answers a single question: if nobody acknowledges this alert, what happens next?

Every alert that fires in AlertKick is matched to a policy. The policy decides whether to ping a person, a team, or a webhook, and how long to wait before climbing one rung higher. Configure it once, and it runs without you thinking about it.

The shape of a policy

A policy has two parts:

  1. Default notifications - sent immediately when the alert triggers, is acknowledged, or resolves. This is the “first knock” - usually a Slack message or an email to the on-call.
  2. Escalation levels - an ordered list of further actions to take if the alert isn’t acknowledged within a timeout. Each level has:
    • escalation_timeout - minutes to wait before this level fires
    • action_type - what to do (notify a user, notify a roster, hit a webhook, call a phone number, send an SMS)
    • entity - the target (user UUID, roster UUID, webhook URL, phone number…)

If repeat is enabled, the chain restarts after the last level instead of giving up. repeat_max caps how many times the loop runs.
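
As a rough sketch of that shape, a policy could be modelled like this in Python. The field names escalation_timeout, action_type, entity, repeat and repeat_max come from the description above. The class names, the default_notifications field, and the placeholder targets are assumptions made for illustration, not AlertKick's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class EscalationLevel:
    escalation_timeout: int  # minutes to wait before this level fires
    action_type: str         # what to do, e.g. "notify_user" or "notify_roster"
    entity: str              # target: user UUID, roster UUID, webhook URL, phone number


@dataclass
class EscalationPolicy:
    name: str
    # Hypothetical field: where the immediate "first knock" is sent.
    default_notifications: list[str] = field(default_factory=list)
    # Ordered: the first entry is the first level to fire.
    levels: list[EscalationLevel] = field(default_factory=list)
    repeat: bool = False  # restart the chain after the last level
    repeat_max: int = 0   # cap on how many times the loop runs


# Placeholder targets; real entities would be UUIDs, URLs or phone numbers.
policy = EscalationPolicy(
    name="production-db",
    default_notifications=["slack:#alerts-prod"],
    levels=[
        EscalationLevel(5, "notify_roster", "<dba-roster-uuid>"),
        EscalationLevel(15, "notify_user", "<db-lead-uuid>"),
    ],
    repeat=True,
    repeat_max=2,
)
print(len(policy.levels), policy.repeat, policy.repeat_max)
```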

Worked example

A typical “production database” policy:

Level  Wait    Action
0      0 min   Notify default - #alerts-prod Slack
1      5 min   Notify on-call DBA (current roster)
2      15 min  SMS the on-call DBA
3      30 min  Phone call to the on-call DBA
4      45 min  Escalate to the engineering manager

If nobody has acknowledged by the time level 4 fires at minute 45 and repeat is on, the chain restarts. In practice an alert almost never gets past level 2 - but the later levels exist so an unacknowledged alert can’t slip through.
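
To make the timing concrete, here is a small Python sketch that lists when each level would fire if nobody ever acknowledges. It assumes the waits in the table are measured from the moment the alert triggers, and that a repeat restarts the chain at the minute the last level fired. Both are assumptions for illustration rather than confirmed AlertKick behaviour, and fire_times is a hypothetical helper name.

```python
def fire_times(waits, repeat=False, repeat_max=0):
    """Return (minute, level) pairs for an alert that is never acknowledged.

    waits: minutes after the chain starts at which each level fires.
    Assumption: each repeat restarts the chain at the minute the last level fired.
    """
    if not waits:
        return []
    passes = 1 + (repeat_max if repeat else 0)
    schedule = []
    start = 0
    for _ in range(passes):
        for level, wait in enumerate(waits):
            schedule.append((start + wait, level))
        start += waits[-1]  # the next pass begins when the last level fired
    return schedule


waits = [0, 5, 15, 30, 45]  # the "production database" example
for minute, level in fire_times(waits, repeat=True, repeat_max=1):
    print(f"minute {minute:>3}: level {level}")
```

With one repeat allowed, the second pass under these assumptions runs from minute 45 to minute 90.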

Picking timeouts

Some rough starting points:

  • Critical, customer-facing - start at 5 minutes between levels, with a half-hour gap before the manager is paged.
  • Internal infra - 10-15 minutes between levels.
  • Best-effort / observability - 30+ minutes, or skip phone/SMS levels entirely.

Tighter is not always better. If the on-call is being paged for things that genuinely need 15 minutes to investigate, escalating after 5 just wakes a second person to look at the same screen.
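
The starting points above are gaps between levels, but it is often easier to judge a policy by how long an alert has been open when each level fires. Here is a quick Python sketch of that arithmetic. The number of levels per tier and the helper name minutes_from_trigger are assumptions for illustration.

```python
from itertools import accumulate


def minutes_from_trigger(gaps):
    """Convert gaps between successive levels into minutes since the alert fired."""
    return list(accumulate(gaps))


# Critical, customer-facing: 5 minutes between levels, then a half-hour gap to the manager.
critical = minutes_from_trigger([5, 5, 5, 30])
# Internal infra: 10 minutes between levels.
internal = minutes_from_trigger([10, 10, 10])

print(critical)  # [5, 10, 15, 45]
print(internal)  # [10, 20, 30]
```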

Sending to a roster vs an individual

action_type: notify_user targets one person - useful for tier-2 specialists (“escalate to the database lead, regardless of who’s on-call”). action_type: notify_roster targets the current on-call from a roster, so the policy keeps working as people rotate in and out. Most levels should use rosters; see [[roster-management]] for how rotations work.
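
The difference is easiest to see as a lookup done at the moment a level fires. In this Python sketch, resolve_target and the on_call mapping are hypothetical stand-ins for whatever AlertKick does internally. Only the two action_type values come from the text above.

```python
def resolve_target(action_type, entity, on_call_by_roster):
    """Return who should be notified for one escalation level.

    on_call_by_roster is a hypothetical lookup: roster id -> current on-call user.
    """
    if action_type == "notify_user":
        return entity  # always the same person
    if action_type == "notify_roster":
        return on_call_by_roster[entity]  # whoever is on call right now
    raise ValueError(f"unsupported action_type: {action_type}")


on_call = {"dba-roster": "alice"}
print(resolve_target("notify_roster", "dba-roster", on_call))  # alice

on_call["dba-roster"] = "bob"  # the rotation changes; the policy is untouched
print(resolve_target("notify_roster", "dba-roster", on_call))  # bob
print(resolve_target("notify_user", "db-lead", on_call))       # db-lead
```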

Default notifications: trigger, acknowledge, resolve

Default notifications send on three events:

  • Trigger - alert fires
  • Acknowledge - someone hits the Ack button
  • Resolve - the underlying check goes back to green

A common setup is to send all three to a shared #alerts channel so the team sees the full lifecycle without the on-call having to status-update manually.
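
One way to picture that setup is as a mapping from each channel to the lifecycle events it subscribes to. The Python sketch below is illustrative only. The channel names, the messages_for helper, and the message format are all assumptions.

```python
LIFECYCLE_EVENTS = ("trigger", "acknowledge", "resolve")


def messages_for(event, alert_name, subscriptions):
    """Build one message per channel subscribed to this lifecycle event.

    subscriptions is a hypothetical mapping: channel -> set of events it wants.
    """
    if event not in LIFECYCLE_EVENTS:
        raise ValueError(f"unknown event: {event}")
    return [
        f"[{event}] {alert_name} -> {channel}"
        for channel, events in subscriptions.items()
        if event in events
    ]


subscriptions = {
    "slack:#alerts": {"trigger", "acknowledge", "resolve"},  # full lifecycle
    "email:on-call": {"trigger"},                            # first knock only
}
for event in LIFECYCLE_EVENTS:
    print(messages_for(event, "db01 replication lag", subscriptions))
```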

Linking a policy to an alert

Policies don’t bind to alerts directly - they bind to alert services (groupings like “production-db”, “staging-web”, “compliance-checks”). Each check on a host is assigned to one alert service, and the alert service points at one policy.

This indirection means you can change the on-call schedule for “production” in one place rather than editing every alert.
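
The two-step lookup can be sketched in a few lines of Python. The check names, policy names, and the policy_for_check helper are made up for illustration. Only the check → alert service → policy chain comes from the description above.

```python
# Hypothetical data: each check belongs to one alert service,
# and each alert service points at one policy.
check_to_service = {
    "db01:replication-lag": "production-db",
    "db02:disk-usage": "production-db",
    "web01:http-status": "staging-web",
}
service_to_policy = {
    "production-db": "dba-escalation",
    "staging-web": "best-effort",
}


def policy_for_check(check):
    """Follow check -> alert service -> policy."""
    return service_to_policy[check_to_service[check]]


print(policy_for_check("db01:replication-lag"))  # dba-escalation

# Re-pointing the service updates every check assigned to it in one edit.
service_to_policy["production-db"] = "dba-escalation-v2"
print(policy_for_check("db01:replication-lag"))  # dba-escalation-v2
print(policy_for_check("db02:disk-usage"))       # dba-escalation-v2
```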

Common mistakes

  • Too many levels too fast - if level 1 fires after 1 minute, you’ll wake the second person before the first has even seen the page.
  • Pointing every level at the same person - the chain exists so that if the first person can’t respond, the next one does. A chain that only ever pages one person defeats the purpose.
  • No repeat on critical policies - if the on-call misses the chain because they were genuinely asleep, an alert that’s been firing for 30 minutes shouldn’t go quiet. Turn on repeat for anything customer-facing.
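
These three mistakes are mechanical enough to check automatically. The Python sketch below shows one possible lint, not a built-in AlertKick feature. The lint_policy name, the tuple layout, and the 5-minute threshold are assumptions you would tune for your own team.

```python
def lint_policy(levels, repeat, customer_facing=False, min_first_wait=5):
    """Return warnings for the common mistakes listed above.

    levels: ordered list of (escalation_timeout_minutes, entity) tuples.
    min_first_wait: hypothetical threshold for "too fast", in minutes.
    """
    warnings = []
    if levels and levels[0][0] < min_first_wait:
        warnings.append("first escalation level fires too soon")
    entities = {entity for _, entity in levels}
    if len(levels) > 1 and len(entities) == 1:
        warnings.append("every level targets the same entity")
    if customer_facing and not repeat:
        warnings.append("repeat is off on a customer-facing policy")
    return warnings


risky = [(1, "alice"), (2, "alice")]
print(lint_policy(risky, repeat=False, customer_facing=True))

healthy = [(5, "dba-roster"), (15, "eng-manager")]
print(lint_policy(healthy, repeat=True, customer_facing=True))  # []
```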