On-Call Excellence: Reducing Alert Fatigue
Alert fatigue kills team morale and masks real incidents. Here's how we cut alerts by 70% while improving response times.
The Problem
Our on-call engineers were receiving more than 200 alerts per week, roughly 85% of which were false positives or low-priority noise. Engineers began ignoring alerts, which led to missed critical incidents. Burnout was inevitable.
The Solution: Alert Hygiene
**Classify alerts** into three tiers: Critical (pages immediately), Warning (review during business hours), Info (logs only). Only Critical alerts page on-call engineers.
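The tiering rule above can be sketched as a small routing function. This is a minimal illustration, not our production code; the tier names come from the text, while the channel names (`page`, `ticket-queue`, `log`) are hypothetical:

```python
from enum import Enum

class Tier(Enum):
    CRITICAL = "critical"  # pages on-call immediately
    WARNING = "warning"    # reviewed during business hours
    INFO = "info"          # logs only

def route(tier: Tier) -> str:
    """Return the delivery channel for an alert of the given tier.
    Only CRITICAL reaches a human pager."""
    if tier is Tier.CRITICAL:
        return "page"
    if tier is Tier.WARNING:
        return "ticket-queue"
    return "log"
```

Encoding the policy in one place makes it easy to audit which alerts can actually wake someone up.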
**Define SLOs** and alert only on error-budget burn. If we commit to 99.9% uptime, alert when we're tracking toward a violation, not on every small blip.
**Reduce false positives** by tuning thresholds based on historical data. Use anomaly detection instead of static thresholds. Require 2-3 consecutive failures before alerting.
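The consecutive-failure rule is easy to express as a small debouncer. A sketch, assuming the probe result arrives as a boolean and that any single success resets the counter (the reset behavior is our assumption, not stated in the text):

```python
class Debouncer:
    """Fire only after `n` consecutive failed checks; any success resets."""

    def __init__(self, n: int = 3):
        self.n = n
        self.failures = 0

    def observe(self, check_ok: bool) -> bool:
        """Record one probe result; return True when the alert should fire."""
        if check_ok:
            self.failures = 0  # one healthy probe clears the streak
            return False
        self.failures += 1
        return self.failures >= self.n
```

With `n=3`, a single flaky probe never pages; only a sustained failure does.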
Actionable Alerts
Every alert must include four things: what's wrong, why it matters, suggested remediation steps, and a link to the runbook. No alert ships without context.
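One way to enforce those four fields is to make them mandatory in the alert definition itself. A minimal sketch; the field names and rendering format are our own, not a schema from the text:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    """All four fields are required: an alert without context
    fails construction instead of paging someone blind."""
    summary: str      # what's wrong
    impact: str       # why it matters
    remediation: str  # suggested first steps
    runbook_url: str  # link to the full procedure

    def render(self) -> str:
        return (f"ALERT: {self.summary}\n"
                f"Impact: {self.impact}\n"
                f"Try: {self.remediation}\n"
                f"Runbook: {self.runbook_url}")
```

Because `dataclass` fields have no defaults here, omitting any field raises a `TypeError` at creation time, which turns "no alerts without context" from a convention into a check.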
Results
Alerts dropped from 200/week to 60/week. Page volume decreased 70%. Median response time improved from 15 minutes to 5 minutes. Engineer satisfaction scores increased significantly.
Best Practices
- Alert on symptoms users experience, not on internal metrics
- Make alerts actionable with clear next steps
- Review alert effectiveness monthly; delete noisy alerts
- Rotate on-call duty fairly with adequate compensation
- Hold post-mortems for missed or delayed pages
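The monthly review can start from a simple actionability report. A sketch under our own assumptions: each page is recorded as `(alert_name, was_actionable)`, and the 50% cutoff for flagging an alert as noisy is illustrative, not from the text:

```python
from collections import Counter

def noisy_alerts(pages, min_actionable_ratio=0.5):
    """pages: list of (alert_name, was_actionable) from the last month.
    Returns alert names whose actionable ratio falls below the cutoff --
    candidates for tuning or deletion."""
    fired = Counter(name for name, _ in pages)
    acted = Counter(name for name, ok in pages if ok)
    return sorted(name for name in fired
                  if acted[name] / fired[name] < min_actionable_ratio)
```

Running this against a month of paging history gives the review meeting a concrete deletion list instead of anecdotes.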
On-call shouldn't be painful. With proper alert hygiene, it becomes manageable and even helps teams learn their systems better.