On-Call Excellence: Reducing Alert Fatigue
Alert fatigue kills team morale and masks real incidents. Here's how we cut alerts by 70% while improving response times.
The Problem
Our on-call engineers were receiving more than 200 alerts per week, roughly 85% of which were false positives or low-priority noise. Engineers began ignoring alerts, which led to missed critical incidents. Burnout was inevitable.
The Solution: Alert Hygiene
**Classify alerts** into three tiers: Critical (pages immediately), Warning (review during business hours), Info (logs only). Only Critical alerts page on-call engineers.
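The tiering rule above can be sketched as a small routing function. This is a minimal illustration, not our production code; the tier names come from the text, while the channel names (`page`, `ticket-queue`, `log`) are hypothetical:

```python
from enum import Enum

class Tier(Enum):
    CRITICAL = "critical"  # pages on-call immediately
    WARNING = "warning"    # reviewed during business hours
    INFO = "info"          # logs only

def route(tier: Tier) -> str:
    """Return the delivery channel for an alert of the given tier.
    Only CRITICAL reaches a human pager."""
    if tier is Tier.CRITICAL:
        return "page"
    if tier is Tier.WARNING:
        return "ticket-queue"
    return "log"
```

Encoding the policy in one place makes it easy to audit which alerts can actually wake someone up.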
**Define SLOs** and alert only on error-budget burn. If we commit to 99.9% uptime, alert when we're tracking toward a violation, not on every small blip.
**Reduce false positives** by tuning thresholds based on historical data. Use anomaly detection instead of static thresholds. Require 2-3 consecutive failures before alerting.
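The consecutive-failure rule is easy to express as a small debouncer. A sketch, assuming the probe result arrives as a boolean and that any single success resets the counter (the reset behavior is our assumption, not stated in the text):

```python
class Debouncer:
    """Fire only after `n` consecutive failed checks; any success resets."""

    def __init__(self, n: int = 3):
        self.n = n
        self.failures = 0

    def observe(self, check_ok: bool) -> bool:
        """Record one probe result; return True when the alert should fire."""
        if check_ok:
            self.failures = 0  # one healthy probe clears the streak
            return False
        self.failures += 1
        return self.failures >= self.n
```

With `n=3`, a single flaky probe never pages; only a sustained failure does.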
Actionable Alerts
Every alert must include four things: what's wrong, why it matters, suggested remediation steps, and a link to the runbook. No alert ships without context.
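One way to enforce those four fields is to make them mandatory in the alert definition itself. A minimal sketch; the field names and rendering format are our own, not a schema from the text:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    """All four fields are required: an alert without context
    fails construction instead of paging someone blind."""
    summary: str      # what's wrong
    impact: str       # why it matters
    remediation: str  # suggested first steps
    runbook_url: str  # link to the full procedure

    def render(self) -> str:
        return (f"ALERT: {self.summary}\n"
                f"Impact: {self.impact}\n"
                f"Try: {self.remediation}\n"
                f"Runbook: {self.runbook_url}")
```

Because `dataclass` fields have no defaults here, omitting any field raises a `TypeError` at creation time, which turns "no alerts without context" from a convention into a check.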
Results
Alerts dropped from 200/week to 60/week. Page volume decreased 70%. Median response time improved from 15 minutes to 5 minutes. Engineer satisfaction scores increased significantly.
Best Practices
- Alert on symptoms users experience, not on internal metrics
- Make alerts actionable with clear next steps
- Review alert effectiveness monthly; delete noisy alerts
- Rotate on-call duty fairly with adequate compensation
- Hold post-mortems for missed or delayed pages
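The monthly review can start from a simple actionability report. A sketch under our own assumptions: each page is recorded as `(alert_name, was_actionable)`, and the 50% cutoff for flagging an alert as noisy is illustrative, not from the text:

```python
from collections import Counter

def noisy_alerts(pages, min_actionable_ratio=0.5):
    """pages: list of (alert_name, was_actionable) from the last month.
    Returns alert names whose actionable ratio falls below the cutoff --
    candidates for tuning or deletion."""
    fired = Counter(name for name, _ in pages)
    acted = Counter(name for name, ok in pages if ok)
    return sorted(name for name in fired
                  if acted[name] / fired[name] < min_actionable_ratio)
```

Running this against a month of paging history gives the review meeting a concrete deletion list instead of anecdotes.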
On-call shouldn't be painful. With proper alert hygiene, it becomes manageable and even helps teams learn their systems better.