One way to create a "stable" service is to limit the number of changes made to the system. Limiting change may improve stability, but it may also limit the ability of your business to grow. Creating new features, implementing new ideas, and fixing known bugs require change.
Updating a product description or the product price is a change. Updating the database schema supporting the product is also a change.
Products that experience rapid growth need more capacity to fulfill higher transaction demand. Teams looking to increase user engagement need to add new services. Each of these actions is a change. Each carries some level of risk but also represents future potential. As a general rule, whenever an employee needs to touch any piece of hardware, software, or firmware, it is a change.
Change is not inherently bad. It is necessary for any application or system we wish to evolve and grow. We need not only to allow change, but to enable and embrace it.
What is change? A change is any action you take to modify a product or service (including system data).
The very first thing you should do to limit the impact of changes is to log every production change. The log needs to include:
- Exact time and date of the change
- System undergoing change
- Actual change
- Expected results of the change
- Contact information of the person making the change
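As a minimal sketch, the five fields above could be captured in a small record type. The names used here (`ChangeRecord`, `log_change`, and the field names) are illustrative assumptions, not part of any particular tool, and a real log would live in a shared, durable store rather than an in-memory list.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json


@dataclass(frozen=True)
class ChangeRecord:
    """One entry in a change log (field names are hypothetical)."""
    changed_at: datetime      # exact time and date of the change
    system: str               # system undergoing change
    change: str               # actual change
    expected_result: str      # expected results of the change
    contact: str              # contact information of the person making the change

    def to_json(self) -> str:
        """Serialize the record as one JSON line for an append-only log."""
        row = asdict(self)
        row["changed_at"] = self.changed_at.isoformat()
        return json.dumps(row)


def log_change(log, system, change, expected_result, contact, changed_at=None):
    """Append a record to the log, stamping it with the current UTC time
    unless an explicit time is supplied."""
    record = ChangeRecord(
        changed_at=changed_at or datetime.now(timezone.utc),
        system=system,
        change=change,
        expected_result=expected_result,
        contact=contact,
    )
    log.append(record)
    return record
```

Recording the timestamp in UTC at the moment the entry is created keeps entries from different teams comparable when they are later lined up against an incident timeline.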
Implementing a process to help manage the effect of changes is critical. Experience has taught us that the majority of production incidents are related to software or hardware changes. These processes are even more important in a continuous delivery world.
Continuous delivery (CD) is a popular approach within many product teams. Developers check in small changes: incremental functionality, bug fixes, or just a few lines of code. Each check-in to the version control system initiates automated build and test phases. Once all steps complete, the build is released in an automated fashion, so software can be released to production at any time.
The theory behind CD is that smaller, more frequent releases represent a lower risk to a production environment, because the probability of failure increases with the size, complexity, and effort of a release.
A continuous delivery pipeline supports a change management process in many ways. Once a code commit has passed all appropriate tests, the CD framework can auto-approve and schedule the change. The system can log the required data at the exact time of the change along with the other data required for the log. The log can provide reports detailing exactly what changed between each deployment.
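The auto-approve-and-log step might look roughly like the sketch below. It is an illustration under stated assumptions rather than the behavior of any specific CD product: `on_pipeline_finished` is a hypothetical hook name, `test_results` is assumed to map each test phase to a pass/fail flag, and the change log is assumed to be a simple list of dictionaries.

```python
from datetime import datetime, timezone


def on_pipeline_finished(change_log, commit_id, system, test_results, author):
    """Hypothetical hook a CD pipeline could call after its test phases.

    Returns True if the change was auto-approved and logged, False otherwise.
    """
    # Refuse to approve when no tests ran or any phase failed.
    if not test_results or not all(test_results.values()):
        return False

    # Log the required data at the time of the change.
    change_log.append({
        "changed_at": datetime.now(timezone.utc).isoformat(),
        "system": system,
        "change": f"deploy commit {commit_id}",
        "expected_result": "passed phases: " + ", ".join(sorted(test_results)),
        "contact": author,
    })
    return True
```

Because the pipeline writes the entry itself, the log does not depend on someone remembering to record the change after the fact.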
All companies, large and small, need a place to log changes. It is rare for everyone who has made a change to be present when an incident occurs. The change identification log allows us to determine whether any change correlates in time with an event. Spending less time on value-destroying incidents increases the ability to scale.
The change identification log should provide the definitive answer to the "What changed most recently?" question.
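To make the time correlation concrete, here is a small sketch that answers that question for a given incident. The `Change` type, the `recent_changes` function, and the 24-hour default lookback are assumptions chosen for illustration; a suitable window depends on how quickly problems tend to surface in a given system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    """Minimal log entry for this example (hypothetical field names)."""
    changed_at: datetime
    system: str
    change: str


def recent_changes(log, incident_start, lookback=timedelta(hours=24)):
    """Return changes made within `lookback` before the incident began,
    most recent first."""
    window_start = incident_start - lookback
    candidates = [
        c for c in log
        if window_start <= c.changed_at <= incident_start
    ]
    return sorted(candidates, key=lambda c: c.changed_at, reverse=True)
```

For example, calling `recent_changes(log, incident_start)` at the start of an incident call gives the team an ordered list of suspects, beginning with the change made closest to the time of failure. Correlation in time is only a starting point for investigation, not proof of cause.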
Change identification is a point-in-time action: someone records a change and moves on. It is one component of a much larger and more complex change management process. Change management is a life-cycle process where changes are:
- Implemented and logged
- Validated as successful
- Reviewed and reported over time
The intent of change management is to limit the probability of changes causing production incidents. Great companies implement change management to increase the rate of change while minimizing the associated risks.
Change identification is a lightweight process for small companies. It can help limit the negative impact on users when changes go badly. As companies grow and their rate of change grows, they often need a more robust process. Change management attempts to take control of changes, not slow the change process down.
During a crisis event, a cross-functional team assembles until services are restored. Ideally, the technical incident manager will ask "what most recently changed?". We often joke that you only need to wait for someone to say "Yeah, but that change couldn't possibly have caused this issue". This statement usually leads to one or more of the incident causes.