SRE Principles: Service Level Objectives
Service Level Objectives (SLOs) define reliability targets and balance velocity with stability. Here's how we implemented SLOs across our platform.
The SRE Framework
SLI (Service Level Indicator): Quantitative measure of service level. Examples: request success rate, latency, availability.
SLO (Service Level Objective): Target value for an SLI. Examples: 99.9% of requests succeed; 95% of requests complete in <500ms.
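SLA (Service Level Agreement): Contractual commitment with consequences (credits, penalties) if it is missed. SLA targets are usually looser than the internal SLOs, so the team gets a warning before a contractual breach.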
Error Budget: The amount of unreliability the SLO permits. If the SLO is 99.9%, the error budget is 0.1%, which is about 43 minutes of downtime in a 30-day month.
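The downtime figure follows directly from the SLO percentage and the window length. A minimal sketch of that arithmetic, assuming a time-based availability SLO and a 30-day window (the function name `error_budget_minutes` is ours, for illustration only):

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the allowed downtime, in minutes, for an availability SLO.

    slo_percent is the target as a percentage, e.g. 99.9.
    window_days is the measurement window (30 days assumed by default).
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    # The budget is the fraction of the window the SLO does NOT cover.
    return window_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    for target in (99.5, 99.9, 99.99):
        print(f"{target}% -> {error_budget_minutes(target):.1f} min / 30 days")
```

For 99.9% this gives roughly 43.2 minutes; tightening the target to 99.99% shrinks the budget to about 4.3 minutes.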
Choosing SLOs
Start with user-facing services. Pick SLIs that users care about, not internal metrics. Common SLIs: availability (share of requests served successfully), latency (response time), throughput (requests/second), correctness (share of responses that contain the right result).
Set realistic targets based on current performance. Don't start at 99.99%: it leaves only about 4 minutes of downtime per month and is expensive to sustain. Most services do fine at 99.5%-99.9%.
Example: API Service SLOs
- **Availability**: 99.9% of requests return HTTP 2xx/3xx (error budget: 0.1% of requests, equivalent to about 43 min/month of full downtime)
- **Latency**: 95% of requests complete in <500ms, 99% in <2s
- **Measured over**: 30-day rolling window
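To make the example SLOs above concrete, here is a rough sketch of how they could be evaluated against a batch of request records. The `Request` type and function names are hypothetical, and a real setup would more likely compute these from metrics in the monitoring system than from raw records:

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to complete, in milliseconds


def availability(requests: list[Request]) -> float:
    """Fraction of requests that returned HTTP 2xx/3xx."""
    if not requests:
        raise ValueError("no requests in window")
    good = sum(1 for r in requests if 200 <= r.status < 400)
    return good / len(requests)


def fraction_faster_than(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests that completed in under threshold_ms."""
    if not requests:
        raise ValueError("no requests in window")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


def check_slos(requests: list[Request]) -> dict[str, bool]:
    """Compare one window of requests against the example targets."""
    return {
        "availability": availability(requests) >= 0.999,
        "latency_p95": fraction_faster_than(requests, 500) >= 0.95,
        "latency_p99": fraction_faster_than(requests, 2000) >= 0.99,
    }
```

The caller is responsible for passing only the requests that fall inside the 30-day rolling window.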
Error Budget Policy
When error budget is healthy (>50% remaining): Ship fast, take risks, deploy frequently.
When error budget is low (<25% remaining): Freeze non-critical features, focus on reliability, increase testing.
When error budget is exhausted: Stop deployments except critical fixes, conduct incident review, fix underlying issues.
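The policy tiers above can be expressed as a small lookup. This is only a sketch: the enum and function names are ours, and because the policy as written does not say what happens between 25% and 50% remaining, the `CAUTION` tier below is a placeholder assumption rather than part of the policy.

```python
from enum import Enum


class BudgetState(Enum):
    HEALTHY = "ship fast, take risks, deploy frequently"
    CAUTION = "between thresholds; not defined by the policy (assumed tier)"
    LOW = "freeze non-critical features, focus on reliability"
    EXHAUSTED = "stop deployments except critical fixes, run incident review"


def budget_state(remaining_fraction: float) -> BudgetState:
    """Map the remaining error budget (1.0 = untouched) to a policy tier."""
    if remaining_fraction <= 0:
        return BudgetState.EXHAUSTED
    if remaining_fraction < 0.25:
        return BudgetState.LOW
    if remaining_fraction > 0.50:
        return BudgetState.HEALTHY
    # 25%-50% remaining: the written policy leaves this range open.
    return BudgetState.CAUTION


if __name__ == "__main__":
    for remaining in (0.8, 0.4, 0.1, 0.0):
        state = budget_state(remaining)
        print(f"{remaining:.0%} remaining -> {state.name}: {state.value}")
```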
Implementation
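We track SLOs in Grafana with burn-rate alerts, which fire when we are spending error budget too fast. For example, consuming 5% of the 30-day budget in 1 hour is a burn rate of 36: sustained, it would exhaust the entire budget in about 20 hours, long before the window ends.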
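Burn rate is the speed of budget consumption relative to the speed that would use up exactly 100% of the budget over the full window. A minimal sketch of the calculation, assuming a 30-day (720-hour) window; the function names are illustrative, and the default paging threshold of 14.4 (2% of a 30-day budget per hour) is an assumption of this sketch, not a value from our dashboards:

```python
WINDOW_HOURS = 30 * 24  # 30-day rolling window, as in the example SLOs


def burn_rate(budget_consumed: float, elapsed_hours: float,
              window_hours: float = WINDOW_HOURS) -> float:
    """Burn rate for a period in which budget_consumed (0..1) was spent.

    A burn rate of 1.0 spends the budget exactly over the full window.
    """
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    return budget_consumed * window_hours / elapsed_hours


def hours_to_exhaustion(rate: float, budget_remaining: float = 1.0,
                        window_hours: float = WINDOW_HOURS) -> float:
    """Hours until the remaining budget is gone if the rate is sustained."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return budget_remaining * window_hours / rate


def should_alert(rate: float, threshold: float = 14.4) -> bool:
    """True when the burn rate meets the (assumed) paging threshold."""
    return rate >= threshold


if __name__ == "__main__":
    rate = burn_rate(budget_consumed=0.05, elapsed_hours=1)
    print(f"burn rate: {rate:.1f}")
    print(f"hours to exhaustion: {hours_to_exhaustion(rate):.1f}")
    print(f"alert: {should_alert(rate)}")
```

Production alerting usually pairs a short and a long lookback window so that a brief spike does not page anyone; this sketch covers only the arithmetic for a single window.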
We review SLOs quarterly with stakeholders and adjust targets based on business needs and cost tradeoffs.
Results
SLOs gave us a shared language between engineering and product. Teams now make data-driven decisions about reliability vs velocity. Outages decreased as teams focused on services near SLO violation.
The key insight: Near-perfect reliability (99.99%+) is expensive and often unnecessary. SLOs define "good enough" and create space for innovation.