Automating Toil: The 80/20 of DevOps
Toil is manual, repetitive work that doesn't provide lasting value. Google's SRE book recommends keeping toil below 50% of each engineer's time. Here's how we eliminated our top toil sources.
What is Toil?
Toil is manual (needs human intervention), repetitive (done over and over), automatable (can be scripted), tactical (interrupt-driven), and lacking in enduring value (doesn't improve the system).
Examples: Manually restarting services, running scripts for deployments, creating user accounts, investigating the same alert repeatedly.
Identifying Toil
We surveyed engineers: "What tasks do you do regularly that feel like a waste of time?" Top answers: deployment approvals, certificate renewals, log analysis, on-call alerts for known issues.
Track toil with metrics: time spent, frequency, number of engineers involved. Then calculate the cost: if 5 engineers each spend 2 hours/week on deployments, that's 10 hours/week, or 520 hours/year, which comes to $52K at $100/hour.
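The cost arithmetic above is easy to script so it can be rerun for each task in the survey. This is a minimal sketch; the function name `annual_toil_cost` and the 52-week year are assumptions made for illustration.

```python
def annual_toil_cost(engineers, hours_per_week, hourly_rate, weeks_per_year=52):
    """Return (hours per year, cost per year) for one recurring task.

    Assumes the task recurs every week of the year at the same rate.
    """
    hours = engineers * hours_per_week * weeks_per_year
    return hours, hours * hourly_rate


# The deployment example from the text: 5 engineers, 2 hours/week, $100/hour.
hours, cost = annual_toil_cost(engineers=5, hours_per_week=2, hourly_rate=100)
print(f"{hours:.0f} hours/year, ${cost:,.0f}/year")  # 520 hours/year, $52,000/year
```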
The 80/20 Approach
Don't automate everything at once. Find the toil that consumes the most time or causes the most frustration. Automate those first.
Our top toil sources:
1. **Manual deployments** (10 hours/week) → GitOps with ArgoCD
2. **Certificate renewals** (5 hours/month) → cert-manager on Kubernetes
3. **Log analysis** (8 hours/week) → Loki with LogQL queries and dashboards
4. **Infrastructure provisioning** (6 hours/week) → Terraform with automated PR workflows
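One way to apply the 80/20 idea to these numbers is to normalize every task to hours per year and look at the cumulative share. The sketch below uses the four figures listed above; the helper names and the 52-week, 12-month year are assumptions, and it ranks by time only, so frustration has to be weighed separately.

```python
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12

# (task, hours, period) as reported in the list of top toil sources.
toil_sources = [
    ("Manual deployments", 10, "week"),
    ("Certificate renewals", 5, "month"),
    ("Log analysis", 8, "week"),
    ("Infrastructure provisioning", 6, "week"),
]


def annual_hours(hours, period):
    """Convert a per-week or per-month figure to hours per year."""
    per_year = {"week": WEEKS_PER_YEAR, "month": MONTHS_PER_YEAR}
    return hours * per_year[period]


def rank_by_annual_hours(sources):
    """Return (task, annual hours, cumulative share), largest task first."""
    ranked = sorted(
        ((name, annual_hours(hours, period)) for name, hours, period in sources),
        key=lambda item: item[1],
        reverse=True,
    )
    total = sum(hours for _, hours in ranked)
    running = 0
    result = []
    for name, hours in ranked:
        running += hours
        result.append((name, hours, running / total))
    return result


for name, hours, share in rank_by_annual_hours(toil_sources):
    print(f"{name:30s} {hours:4d} h/yr  cumulative {share:.0%}")
```

On these figures the three weekly tasks account for roughly 95% of the tracked hours, while certificate renewals rank last by time, which suggests they earned their place on the list through frustration or risk rather than raw hours.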
Automation Examples
Certificate management: Replaced manual cert renewals with cert-manager + Let's Encrypt. Zero engineer time now.
Deployment approvals: Moved from Slack approval messages + kubectl commands to GitOps. Deploy = merge PR. Automated approvals for non-prod.
Alert fatigue: Automated remediation for common issues. Example: disk space alert → auto-cleanup script → only page if script fails.
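The disk space flow (alert, cleanup, page only on failure) can be sketched as a small handler. This is an illustrative outline, not our production code: `remediate_disk_alert`, the `cleanup` and `page` callbacks, and the 90% threshold are all hypothetical names and values, and a real handler would be wired to your alerting and paging tools.

```python
import shutil


def disk_usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def remediate_disk_alert(path, threshold, cleanup, page, usage_fn=disk_usage_fraction):
    """Try an automated cleanup first; page a human only if that does not help.

    cleanup: hypothetical callable that frees space (e.g. rotates old logs).
    page:    hypothetical callable that notifies on-call with a message.
    """
    if usage_fn(path) < threshold:
        return "ok"  # Alert already resolved, nothing to do.
    try:
        cleanup()
    except Exception as exc:
        page(f"cleanup failed for {path}: {exc}")
        return "paged"
    if usage_fn(path) >= threshold:
        page(f"{path} still above {threshold:.0%} after cleanup")
        return "paged"
    return "remediated"


if __name__ == "__main__":
    # Harmless demo: a no-op cleanup and print() standing in for the pager.
    print(remediate_disk_alert("/", threshold=0.90, cleanup=lambda: None, page=print))
```

Re-checking usage after the cleanup matters: a script that exits cleanly but frees nothing should still page someone.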
Service onboarding: Created Terraform modules for new services. Setting up monitoring, logging, and CI/CD used to take 4 hours; now it takes 10 minutes.
Implementation Tips
Start small: Pick one task, automate it, measure time saved. Build momentum with quick wins.
Make automation reliable: Add error handling, logging, and alerts. Unreliable automation creates new toil.
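As a rough illustration of "error handling, logging, and alerts", an automated task can be wrapped so that it retries, logs every attempt, and alerts only when it gives up. The wrapper name `run_reliably`, the `alert` callback, and the retry defaults are assumptions for this sketch.

```python
import logging
import time

log = logging.getLogger("automation")


def run_reliably(task, name, alert, attempts=3, delay_seconds=1.0):
    """Run task(), retrying with exponential backoff and alerting on final failure.

    alert: hypothetical callable that sends a message to a human or a channel.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = task()
            log.info("%s succeeded on attempt %d", name, attempt)
            return result
        except Exception:
            # log.exception records the traceback for later debugging.
            log.exception("%s failed on attempt %d/%d", name, attempt, attempts)
            if attempt < attempts:
                time.sleep(delay_seconds * 2 ** (attempt - 1))
    message = f"{name} failed after {attempts} attempts"
    alert(message)
    raise RuntimeError(message)
```

A call might look like `run_reliably(renew_certificates, "cert-renewal", alert=notify_oncall)`, where both function names are placeholders for your own code.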
Document everything: Even automated systems need runbooks for when they fail.
Results
After 6 months, toil dropped from 45% of engineering time to 20%. Engineers are happier (less grunt work, more engineering), and velocity increased (faster deployments, shorter lead times).
ROI calculation: Automated deployment cost 80 hours to build and saves 520 hours/year. At 10 hours saved per week, it paid for itself in about 8 weeks, or under 2 months.
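The payback period is the build cost divided by the weekly savings. A minimal sketch, assuming savings accrue evenly over a 52-week year (the function name is made up for illustration):

```python
def payback_weeks(build_hours, hours_saved_per_year, weeks_per_year=52):
    """Weeks until the hours saved equal the hours spent building."""
    hours_saved_per_week = hours_saved_per_year / weeks_per_year
    return build_hours / hours_saved_per_week


# Deployment automation from the text: 80 hours to build, 520 hours/year saved.
weeks = payback_weeks(build_hours=80, hours_saved_per_year=520)
print(f"Payback in {weeks:.0f} weeks")  # Payback in 8 weeks
```

The same function gives a quick sanity check before starting any automation project: if the payback period runs to years, the task is probably not in the valuable 20%.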
The best toil to eliminate is the work that wakes you up at 3am. Automate that first.