
Automating Toil: The 80/20 of DevOps

Toil is manual, repetitive work that provides no lasting value. Google's SRE book sets a goal of keeping toil below 50% of each engineer's time. Here's how we eliminated our top sources of toil.

What is Toil?

Toil is work that is manual (needs human intervention), repetitive (done over and over), automatable (can be scripted), tactical (interrupt-driven), and without enduring value (doesn't improve the system).

Examples: manually restarting services, running deployment scripts by hand, creating user accounts, investigating the same alert repeatedly.

Identifying Toil

We surveyed engineers: "What tasks do you do regularly that feel like a waste of time?" Top answers: deployment approvals, certificate renewals, log analysis, on-call alerts for known issues.

Track toil with metrics: time spent, frequency, and number of engineers involved. Then calculate the cost: if 5 engineers each spend 2 hours/week on deployments, that's 10 hours/week, or 520 hours/year, which comes to $52K at $100/hour.
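
The arithmetic above can be sketched as a small helper. The function name and the 52-week year are assumptions for illustration; the inputs are the figures from the example.

```python
WEEKS_PER_YEAR = 52  # assumption: no adjustment for holidays or leave


def annual_toil_cost(engineers: int, hours_per_week: float, hourly_rate: float) -> tuple[float, float]:
    """Return (hours per year, cost per year) for one recurring task."""
    hours_per_year = engineers * hours_per_week * WEEKS_PER_YEAR
    return hours_per_year, hours_per_year * hourly_rate


hours, cost = annual_toil_cost(engineers=5, hours_per_week=2, hourly_rate=100)
print(f"{hours:.0f} hours/year, ${cost:,.0f}/year")  # 520 hours/year, $52,000/year
```

Running the same function over every task from the survey gives a comparable number for each, which makes the prioritization step easier.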

The 80/20 Approach

Don't automate everything at once. Find the toil that consumes the most time or causes the most frustration. Automate those first.

Our top toil sources:

1. **Manual deployments** (10 hours/week) → GitOps with ArgoCD
2. **Certificate renewals** (5 hours/month) → cert-manager on Kubernetes
3. **Log analysis** (8 hours/week) → Loki with LogQL queries and dashboards
4. **Infrastructure provisioning** (6 hours/week) → Terraform with automated PR workflows
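
One way to apply the 80/20 idea to these numbers is to convert everything to hours per year and look at the cumulative share. This is a minimal sketch: the helper names are hypothetical, and it ranks by time only, so it does not capture the "most frustration" criterion.

```python
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12

# (task, hours, period), taken from the list above
toil_sources = [
    ("Manual deployments", 10, "week"),
    ("Certificate renewals", 5, "month"),
    ("Log analysis", 8, "week"),
    ("Infrastructure provisioning", 6, "week"),
]


def annual_hours(hours, period):
    """Normalize a per-week or per-month figure to hours per year."""
    periods_per_year = {"week": WEEKS_PER_YEAR, "month": MONTHS_PER_YEAR}
    return hours * periods_per_year[period]


def rank_by_annual_hours(sources):
    """Sort tasks by yearly hours (descending) and attach the cumulative share."""
    ranked = sorted(
        ((name, annual_hours(hours, period)) for name, hours, period in sources),
        key=lambda row: row[1],
        reverse=True,
    )
    total = sum(hours for _, hours in ranked)
    running = 0
    result = []
    for name, hours in ranked:
        running += hours
        result.append((name, hours, running / total))
    return result


for name, hours, cumulative in rank_by_annual_hours(toil_sources):
    print(f"{name:<28} {hours:>4} h/yr  {cumulative:6.1%} cumulative")
```

With these figures, deployments and log analysis together account for roughly 72% of the tracked hours, while certificate renewals add up to only 60 hours/year.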

Automation Examples

Certificate management: Replaced manual cert renewals with cert-manager + Let's Encrypt. Renewals now take zero engineer time.

Deployment approvals: Moved from Slack approval messages + kubectl commands to GitOps. Deploy = merge PR. Automated approvals for non-prod.

Alert fatigue: Automated remediation for common issues. Example: disk space alert → auto-cleanup script → only page if script fails.
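
The "remediate first, page only on failure" flow can be sketched as follows. This is not our production handler: the function name, the 85% threshold, and the three injected callables (`get_usage_pct`, `cleanup`, `page_oncall`) are all assumptions made so the example runs on its own.

```python
import logging
from typing import Callable

log = logging.getLogger("disk_remediation")


def handle_disk_alert(
    get_usage_pct: Callable[[], float],
    cleanup: Callable[[], None],
    page_oncall: Callable[[str], None],
    threshold_pct: float = 85.0,  # hypothetical alert threshold
) -> str:
    """Try automated cleanup first; page a human only if it does not help.

    Returns "ok", "remediated", or "paged".
    """
    if get_usage_pct() < threshold_pct:
        return "ok"  # alert already cleared, nothing to do
    try:
        cleanup()
    except Exception as exc:
        log.exception("cleanup failed")
        page_oncall(f"disk cleanup failed: {exc}")
        return "paged"
    usage = get_usage_pct()
    if usage >= threshold_pct:
        page_oncall(f"disk still at {usage:.0f}% after cleanup")
        return "paged"
    log.info("disk cleanup succeeded, usage now %.0f%%", usage)
    return "remediated"


# Demo with stand-in callables: usage reads 92% before cleanup and 60% after.
readings = iter([92.0, 60.0])
pages = []
print(handle_disk_alert(lambda: next(readings), lambda: None, pages.append))  # remediated
```

Re-checking usage after the cleanup matters: a script that exits cleanly but frees no space should still page someone.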

Onboarding: Created Terraform modules for new services. Setting up monitoring, logging, and CI/CD used to take 4 hours; it now takes 10 minutes.

Implementation Tips

Start small: Pick one task, automate it, measure time saved. Build momentum with quick wins.

Make automation reliable: Add error handling, logging, and alerts. Unreliable automation creates new toil.
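
As a rough illustration of error handling, logging, and alerting combined, here is a small retry wrapper. The names (`run_with_retries`, `alert`) and the linear backoff are assumptions for the sketch, not a description of any particular tool.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("automation")


def run_with_retries(
    task: Callable[[], T],
    alert: Callable[[str], None],
    attempts: int = 3,
    delay_s: float = 1.0,
) -> T:
    """Run task, logging each failure; alert and re-raise after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            result = task()
            log.info("task succeeded on attempt %d", attempt)
            return result
        except Exception as exc:
            log.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                # Out of retries: tell a human instead of failing silently.
                alert(f"automation failed after {attempts} attempts: {exc}")
                raise
            time.sleep(delay_s * attempt)  # simple linear backoff
    raise ValueError("attempts must be >= 1")
```

The important property is that the final failure is loud: it is logged, it triggers an alert, and the exception still propagates to the caller.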

Document everything: Even automated systems need runbooks for when they fail.

Results

After 6 months, toil dropped from 45% of engineering time to 20%. Engineers are happier (less grunt work, more engineering), and velocity increased (faster deployments, shorter lead times).

ROI calculation: The deployment automation took 80 hours to build and saves 520 hours/year (10 hours/week), so it paid for itself in 8 weeks, which is under 2 months.
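
The payback figure follows directly from the two numbers above. A minimal sketch, with a hypothetical function name and a 52-week year assumed:

```python
def payback_weeks(build_hours: float, hours_saved_per_year: float, weeks_per_year: int = 52) -> float:
    """Weeks until the hours saved equal the hours spent building the automation."""
    hours_saved_per_week = hours_saved_per_year / weeks_per_year
    return build_hours / hours_saved_per_week


weeks = payback_weeks(build_hours=80, hours_saved_per_year=520)
first_year_net = 520 - 80  # hours saved in year one, after the build cost
print(f"payback: {weeks:.0f} weeks, first-year net saving: {first_year_net} hours")
# payback: 8 weeks, first-year net saving: 440 hours
```

This ignores ongoing maintenance time for the automation itself, so treat the result as an upper bound on the saving.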

The best toil to eliminate is the work that wakes you up at 3am. Automate that first.