AgileMechanics.com | Operations Review

Origins

Operations Review formalizes a cadence that high-reliability operations teams have long run informally. David J. Anderson's Kanban Method¹ named and structured the practice for software teams; Google's SRE tradition² developed its own variation, called the Production Meeting, with similar goals; and DevOps culture at large adopted the practice as team-owned services became standard.

The animating insight is straightforward. A team that ships software also runs software. The running side carries an operational load — incidents, on-call rotations, monitoring alerts, capacity planning, deployment health — that deserves the same deliberate attention as feature delivery. Operations Review is the cadence where that side gets it.

What an Operations Review Examines

Unlike the customer-facing Service Delivery Review (SDR), the Operations Review is internal. It asks how the team's running systems are actually doing and what work that creates for the team:

Incident frequency and severity: how many incidents per period, of what severity, with what root causes.
On-call burden: pages per shift, escalations, time-of-night patterns, who's carrying the load.
Deployment frequency and change failure rate: two of the four DORA metrics, measured for the team's own pipeline.
System health indicators: error rates, latency, saturation against SLOs.
Toil: repetitive operational work that should be automated but isn't.
Tech debt affecting operations: workarounds, brittle scripts, manual processes that bite during incidents.

The output is operational work that needs to make it onto the team's backlog — reliability investments, automation, monitoring gaps, runbook updates.
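As an illustration of how two of the measures above could be computed from deployment records, here is a minimal sketch. The `Deployment` record and its `caused_failure` flag are assumptions made for this example; how a team decides that a deployment "failed" (rollback, hotfix, incident) is a local convention, not something this sketch settles.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Deployment:
    """Hypothetical record of one production deployment."""
    day: date
    caused_failure: bool  # assumption: True if it needed a rollback, hotfix, or caused an incident


def deployment_frequency(deploys: list[Deployment], period_days: int) -> float:
    """Average number of deployments per day over the review period."""
    if period_days <= 0:
        raise ValueError("period_days must be positive")
    return len(deploys) / period_days


def change_failure_rate(deploys: list[Deployment]) -> float:
    """Fraction of deployments flagged as having caused a failure."""
    if not deploys:
        return 0.0  # no deployments, so no failed changes to report
    return sum(d.caused_failure for d in deploys) / len(deploys)
```

Reporting these per review period, rather than as a single lifetime figure, is what lets the review see a trend.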

Who Attends

Operations Review is typically a team-internal meeting:

The team itself — engineers, ops, anyone who participates in on-call.
A facilitator (often the Tech Lead, an SRE, or the Engineering Manager).
Sometimes a manager or platform engineer if cross-team patterns are emerging.

It is not a customer-facing review; that's the SDR's job. It is not a retrospective; retros focus on process and people. Operations Review focuses on production systems and the operational work they generate.

The Cadence

Weekly is common for teams with significant on-call duty. Biweekly works for steadier services. Monthly is the lowest workable frequency for teams with any real operational responsibility; any less often and patterns stay hidden for too long.

The meeting should be short, 30 to 45 minutes, and follow a predictable rhythm. A long meeting once a quarter is a poor substitute for a short meeting every week.

A Typical Agenda

1. Incident review (10–15 min)

Walk through incidents since the last review. Not deep post-mortems — those happen separately. The review-level question is "what patterns are emerging?" Same component failing repeatedly. Same time of day. Same root cause class.
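The pattern question can be made concrete by counting incidents along the three dimensions just named. The sketch below assumes a hypothetical `Incident` record with a start time, a component name, and a root cause class; the field names and the `min_count` threshold are illustrative choices, not part of any standard.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are assumptions for this sketch."""
    started_at: datetime
    component: str
    root_cause_class: str


def recurring(values, min_count=2):
    """Return (value, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter(values)
    return [(value, n) for value, n in counts.most_common() if n >= min_count]


def incident_patterns(incidents, min_count=2):
    """Group incidents by component, hour of day, and root cause class."""
    return {
        "component": recurring((i.component for i in incidents), min_count),
        "hour_of_day": recurring((i.started_at.hour for i in incidents), min_count),
        "root_cause_class": recurring((i.root_cause_class for i in incidents), min_count),
    }
```

Anything that appears in the result is a candidate for discussion in the review; a single occurrence is left to its post-mortem.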

2. On-call burden (5 min)

How heavy is on-call right now? Pages per shift, sleep impact, who feels burdened. Watch for trends. A team whose on-call has gotten quietly heavier owes itself an explicit conversation about why.
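One way to watch the trend rather than the latest number is to compare the current pages-per-shift figure with a baseline of recent reviews. This is a rough sketch; the function names and the three-review `window` are assumptions chosen for illustration.

```python
from statistics import mean


def pages_per_shift(pages: int, shifts: int) -> float:
    """Average number of pages per on-call shift in one review period."""
    if shifts <= 0:
        raise ValueError("shifts must be positive")
    return pages / shifts


def burden_trend(history: list[float], window: int = 3) -> float:
    """Ratio of the latest value to the mean of up to `window` preceding values.

    A result near 1.0 means on-call load is steady; 2.0 means it has doubled
    relative to the recent baseline.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if len(history) < 2:
        raise ValueError("need at least two review periods to compute a trend")
    baseline = mean(history[-(window + 1):-1])
    if baseline == 0:
        # No pages in the baseline: any pages now count as an unbounded increase.
        return float("inf") if history[-1] > 0 else 1.0
    return history[-1] / baseline
```

For example, a history of 2.0, 2.0, 2.0, and then 6.0 pages per shift gives a ratio of 3.0, the kind of quiet tripling the review exists to catch.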

3. System health (5–10 min)

Key SLOs (latency, error rate, availability) against targets. Are they improving, degrading, or stable? Where is the error budget being spent?
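For a request-based availability SLO, the error budget is the share of requests allowed to fail within the measurement window. The sketch below shows that arithmetic under the assumption that the SLO is expressed as a success ratio; `ErrorBudget` and `error_budget` are names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Result of an error budget calculation for one SLO window."""
    allowed_failures: float   # failures the SLO tolerates in the window
    consumed_fraction: float  # share of the budget already spent (can exceed 1.0)

    @property
    def remaining_fraction(self) -> float:
        return 1.0 - self.consumed_fraction


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compute budget use for a success-ratio SLO such as 0.999."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed = (1.0 - slo_target) * total_requests
    return ErrorBudget(allowed_failures=allowed, consumed_fraction=failed_requests / allowed)
```

With a 99.9% target and 1,000,000 requests, roughly 1,000 failures are allowed; 250 failures would mean about a quarter of the budget is spent. Breaking the failed requests down by cause is what answers where the budget is going.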

4. Toil and follow-ups (10 min)

What manual, repetitive work has the team done since the last review? What automation didn't get prioritized? What runbooks need updating? Action items from last review — closed, in flight, dropped?
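If the team keeps even a rough log of manual work, ranking tasks by total time spent gives the toil discussion a starting point. This is a minimal sketch; the `ToilEntry` record and the `automation_candidates` helper are hypothetical, and minutes spent is only one possible ranking signal.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilEntry:
    """Hypothetical log entry for one piece of manual operational work."""
    task: str
    minutes: int


def automation_candidates(entries, top=3):
    """Return (task, occurrences, total_minutes) for the most time-consuming tasks."""
    totals = defaultdict(lambda: [0, 0])  # task -> [occurrences, total minutes]
    for entry in entries:
        totals[entry.task][0] += 1
        totals[entry.task][1] += entry.minutes
    # Rank by total minutes, largest first.
    ranked = sorted(totals.items(), key=lambda item: item[1][1], reverse=True)
    return [(task, count, minutes) for task, (count, minutes) in ranked[:top]]
```

A task that shows up near the top review after review is a natural candidate for the backlog decisions in the next agenda item.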

5. Decisions for the backlog (5 min)

What operational work needs to make it into the team's regular planning? Each item should be specific, owned, and sized.

Operations Review vs. Post-Mortem

Operations Review is a pattern-finding cadence. Post-mortems (or incident retrospectives) are deep dives on specific incidents. Both are necessary; neither substitutes for the other.

Post-mortems answer "why did this incident happen, and what would prevent it?" Operations Review answers "are we doing operations well as a system?" A team that runs detailed post-mortems but no operations review will fix individual incidents while failing to notice that on-call burden has tripled. A team that runs operations reviews but no post-mortems will track trends without understanding what's producing them.

Common Pitfalls

Cancelled when nothing's wrong: a quiet operations period gets the meeting cancelled. The cadence is precisely what catches drift before it becomes a crisis. Don't skip when things look fine.
Anecdotes instead of data: "I think we had a few incidents this week" is not an operations review. Actual numbers, trends, and SLO measurements are the foundation.
No backlog follow-through: operational findings that don't reach the team's sprint plan become wishes. Track what was decided and what shipped.
Conflated with retro: trying to make Operations Review also serve as a retrospective produces a session that does neither job well. Run them separately.
Wrong audience: a review attended only by management without the engineers who run the systems is a status update. The team needs to be present.
Toil normalization: heroic manual work gets celebrated rather than questioned. Treat sustained manual operational load as a signal to automate, not a culture to reward.

Coaching Tips

Hold the Cadence

Don't skip operations review when nothing's wrong. The cadence is what catches drift; skipping it removes the safety system precisely when it's needed.

Bring Real Data

Incident counts, on-call pages, SLO measurements. Without data, the review devolves into impressions, and the most damaging patterns are the easiest to miss.

Watch for Heroism

An engineer doing brilliant manual work to keep the system running is a problem, not a model. Treat sustained heroism as a signal to automate.

Connect to Planning

Operational findings need to reach the backlog and get prioritized. Without that bridge, the review surfaces problems the team never solves.

Look at On-Call

On-call burden quietly accumulates. Make it a permanent agenda item, and watch the trend across reviews, not just the latest number.

Track Decisions to Closure

Open every review with "what we decided last time." Without that loop, decisions accumulate without action and the review loses credibility.

Summary

Most teams that ship and run their own software have an operational layer that consumes time, energy, and attention — and that layer is usually invisible until something breaks. Operations Review is the cadence that makes the layer visible regularly enough that the team can manage it rather than be managed by it.

The investment is small: 30 to 45 minutes a week, the team, and some data. The return is large: operational problems caught while still small, on-call burden noticed before it burns the team out, automation prioritized before its absence causes an incident. Teams that don't run operations reviews learn what they're missing the hard way.

Footnotes

¹ Anderson, D. J. (2010). Kanban: Successful Evolutionary Change for Your Technology Business. Blue Hole Press.
² Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O'Reilly.
