In the last 14 days, approximately how did you allocate your working time? Assign a total of 100 points across the activities.
In the last 14 days, about how many hours per week did you spend on repetitive manual tasks?
In the last 30 days, which were your main sources of toil? Select up to 5.
- Noisy or flaky alerts
- Manual deployments
- Brittle CI/CD pipelines
- Environment drift or config mismatch
- Access or permissions requests
- Manual change approvals
- Capacity management chores
- Ticket handoffs or coordination
- Limited observability or gaps in telemetry
- Flaky tests
- Rollback or roll-forward complexity
- Data migrations or backfills
- Tooling integrations or gaps
Rank the following by how disruptive they are to your focused engineering time (1 = most disruptive).
Which tooling do you actively use to manage reliability and reduce toil? Select all that apply.
- Alerting/Monitoring (e.g., Prometheus, Datadog)
- Incident management (e.g., PagerDuty, Opsgenie)
- Infrastructure as Code (e.g., Terraform, Pulumi)
- Configuration management (e.g., Ansible, Chef)
- CI/CD orchestration (e.g., Jenkins, GitHub Actions)
- Feature flags/progressive delivery
- SLO/Error budget tooling
- Runbooks/ChatOps automation
- Change management (e.g., ServiceNow)
- Internal developer portal (e.g., Backstage)
- Chaos/Resilience testing
Overall, how automated are your common operations tasks today?
How effective are your current tools for each area?
Approximately how many manual steps did you automate or remove from runbooks in the last 30 days?
Attention check: To confirm you are paying attention, please select “I am paying attention.”
- I am paying attention
- I did not read the instructions
- I prefer to skip this question
Roughly how many incidents with user impact occurred in the last 30 days?
Compared to 3 months ago, how has your median time to resolve incidents changed?
- Improved (decreased)
- About the same
- Worsened (increased)
- Not sure/Don’t track
During your most significant incident in the last 30 days, what added the most toil?
- Paging noise or alert confusion
- Manual runbook steps
- Access or permissions delays
- Coordination or hand-off overhead
- Rollback or roll-forward complexity
- Limited data or gaps in observability
- Change approvals or governance delays
- No significant incidents in the last 30 days
What single tooling change would most reduce toil for your team?
Maximum 100 characters
What are the biggest blockers to automating more of your operations work next quarter?
Maximum 600 characters
What is your primary role?
- SRE/Production Engineer
- Platform/Infrastructure Engineer
- Software Engineer
- DevOps Engineer
- Engineering Manager
- Other
How many years have you worked in this type of role?
Approximately how large is your organization?
- 1–49 employees
- 50–249 employees
- 250–999 employees
- 1,000–4,999 employees
- 5,000–19,999 employees
- 20,000+ employees
Approximately how large is your SRE/Platform team?
How often are you on call?
- Never
- Ad hoc/occasionally
- Weekly
- Every 2 weeks
- Monthly
- Less often than monthly
Which region best describes your primary working time zone?
- Americas
- EMEA
- APAC
- Other/Multiple
What is your work location model?
Any other comments about toil, reliability, or tooling that we didn’t cover?
Maximum 600 characters
AI Interview: 2 Follow-up Questions on Your Responses
Thanks for your time—your input helps us track toil and prioritize the right reliability tooling.