- Principle 1: Low-Risk Releases Are Incremental
- Principle 2: Decouple Deployment and Release
- Principle 3: Focus on Reducing Batch Size
- Principle 4: Optimize for Resilience
- Conclusions
Principle 3: Focus on Reducing Batch Size
Another essential component of decreasing the risk of releases is to reduce batch size. In general, reducing batch size is one of the most powerful techniques available for improving the flow of features from brains to users. Donald G. Reinertsen spends a whole chapter of his excellent book The Principles of Product Development Flow: Second Generation Lean Product Development (Celeritas, 2009) discussing the constellation of benefits generated by reducing batch size, from reducing cycle time (without changing capacity or demand) and preventing scope creep to increasing team motivation and reducing risk.
We particularly care about that last benefit: reducing risk. When we reduce batch size, we can deploy more frequently, because reducing batch size drives down cycle time. Why does this reduce risk? When a release engineering team spends a weekend in a data center deploying the last three months' work, the last thing anybody wants to do is deploy again any time soon. But, as Dave Farley and I explain in our book Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation, when something hurts, the solution is to do it more often and bring the pain forward. Figure 4 shows a slide from John Allspaw's excellent presentation "Ops Meta-Metrics: The Currency You Use to Pay for Change," which should help to illustrate the following discussion of how reducing batch size helps decrease deployment risk.
Figure 4 Reducing batch size reduces risk.
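The claim that reducing batch size drives down cycle time without changing capacity can be made concrete with Little's Law (average cycle time = work in process / throughput). Here is a minimal sketch; the function name and the item counts are illustrative, not from the original:

```python
# Little's Law: average cycle time = work in process (WIP) / throughput.
# Smaller batches mean less WIP, so cycle time falls even though the
# team's throughput (its capacity) is unchanged.

def cycle_time_days(batch_size_items: int, throughput_items_per_day: float) -> float:
    """Average time for a batch of work to flow through the system."""
    return batch_size_items / throughput_items_per_day

# Same team capacity (5 items/day), different batch sizes:
quarterly_release = cycle_time_days(300, 5.0)  # ~3 months of work per batch
weekly_release = cycle_time_days(25, 5.0)      # ~1 week of work per batch

print(quarterly_release)  # 60.0 days
print(weekly_release)     # 5.0 days
```

The point of the sketch is that nothing about the team changed between the two calls; only the batch size did.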
Deploying to production more often helps to reduce the risk of any individual release for three reasons:
- When you deploy to production more often, you're practicing the deployment process more often. Therefore, you'll find and fix problems earlier (and hopefully in deployments to preproduction environments), and the deployment process itself will change less between deployments.
The other reasons have to do with optimizing the process of fixing incidents. It's often the case that a deployment gone wrong causes an incident. Incidents occur in three phases:
1. Finding out that an incident has in fact occurred (which is why monitoring is so important).
2. Finding out enough about the root causes to be able to work out how to get the system back up again.
3. Getting the system back up, followed by root-cause analysis and prioritizing work to prevent the incident from happening again.
Figure 5 Lifecycle of an incident.
Deploying more frequently helps with the second and third steps of the incident-resolution process:
- When you're deploying more frequently, working out what went wrong is much easier because the amount of change is much smaller. If you have several months' worth of changes to search, finding the problem will take a very long time; if the issue is critical, you'll probably end up rolling back the release. But if you're deploying multiple times a week, the changes between releases are small, and they're likely to be a good place to start when looking for the root causes of the incident.
- Finally, rolling back a small change is much easier than rolling back several months' worth of work. On the technical front, the number of components affected is much smaller; on the business front, it's usually a much easier conversation to persuade the team to roll back one small feature than twenty big features the marketing team is relying on as part of a launch.
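The diagnosis effort grows with the size of the release. Even with an efficient binary search over a release's commits (which is what `git bisect` automates), the number of builds you have to test scales with how much change went out. A sketch with hypothetical commit counts:

```python
import math

def bisect_steps(commits_in_release: int) -> int:
    """Builds to test when binary-searching for the commit that broke things."""
    if commits_in_release <= 1:
        return 0
    return math.ceil(math.log2(commits_in_release))

# Hypothetical commit counts for two release cadences:
print(bisect_steps(1200))  # a quarter's worth of changes: 11 test builds
print(bisect_steps(15))    # a couple of days' worth: 4 test builds
```

And that logarithmic cost is the best case; in practice a small diff is often diagnosed by inspection, with no bisecting at all.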
If your deployment pipeline is really efficient, it can actually be quicker to check in a patch (whether that's a change to the code or a configuration setting) and roll forward to the new version. This is also safer than rolling back to a previous version, because you're using the same deployment process you always use, rather than a rollback process that's not as well tested.
As my colleague Ilias Bartolini points out, this capability depends on two conditions:
- You need a short lead time between check-in and release, since fixing a problem often requires multiple commits. (You might first want to add some logging to help with root-cause analysis.)
- Your organization must be set up to support a highly optimized deployment process: developers must be able to get changes through to production without having to wait for out-of-band approvals or tickets to be raised.
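One way to keep the first condition honest is to measure commit-to-production lead time directly from pipeline events. A sketch with illustrative timestamps (the event names and the one-hour threshold are assumptions, not from the original):

```python
from datetime import datetime, timedelta

# Illustrative pipeline events for a single change (timestamps hypothetical).
commit_time = datetime(2024, 5, 1, 10, 0)
production_time = datetime(2024, 5, 1, 10, 45)

lead_time = production_time - commit_time

# Rolling forward only beats rolling back when a patch can reach production
# quickly -- minutes, not days -- since fixing an incident often takes
# several commits in a row.
fix_forward_viable = lead_time <= timedelta(hours=1)
print(lead_time)          # 0:45:00
print(fix_forward_viable) # True
```

Tracking this number over time also tells you whether your deployment process is getting faster or slower as the pipeline evolves.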