This Incident Report also shows that Google has a lot of internal systems and procedural machinery running behind the scenes. I think of these as best practices for any company. For example, they have automated service monitoring and alerting. We know this because the report lists when the outage began and when the team was alerted via pager. They also have change management, in that they were able to see who did what and when, and ultimately try to roll back the changes. In my mind this is key: if you do not have this visibility into changes, it will take time just to figure out what triggered the issue in the first place, never mind rolling it back. They also did not sugarcoat the fact that the configuration push was not the safest and skipped testing.
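
To make the change-management point a bit more concrete, here is a minimal Python sketch of what "who did what when, plus roll back" can look like. None of this comes from Google's report or reflects their tooling; `ConfigStore`, `Change`, `changes_since`, and `roll_back` are hypothetical names I made up for illustration, and a real system would add persistence, review, and access control.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Change:
    """One audited change: who changed which key, from what, to what, and when."""
    author: str
    key: str
    old_value: Optional[str]  # None means the key did not exist before
    new_value: Optional[str]  # None means the key was removed
    timestamp: datetime


class ConfigStore:
    """Hypothetical in-memory config store that records every change."""

    def __init__(self) -> None:
        self.values: Dict[str, str] = {}
        self.history: List[Change] = []

    def apply(self, author: str, key: str, new_value: Optional[str],
              when: Optional[datetime] = None) -> Change:
        """Apply a change and append it to the audit history."""
        when = when or datetime.now(timezone.utc)
        change = Change(author, key, self.values.get(key), new_value, when)
        if new_value is None:
            self.values.pop(key, None)
        else:
            self.values[key] = new_value
        self.history.append(change)
        return change

    def changes_since(self, start: datetime) -> List[Change]:
        """Return changes made at or after `start`, e.g. when the outage began."""
        return [c for c in self.history if c.timestamp >= start]

    def roll_back(self, author: str, change: Change) -> Change:
        """Restore the value from before `change`; the rollback is audited too."""
        return self.apply(author, change.key, change.old_value)


if __name__ == "__main__":
    store = ConfigStore()
    store.apply("alice", "quota", "100",
                when=datetime(2020, 1, 1, tzinfo=timezone.utc))
    outage_start = datetime(2020, 1, 2, tzinfo=timezone.utc)
    bad = store.apply("bob", "quota", "0", when=outage_start)

    # During the incident: list suspects, then undo the offending change.
    for c in store.changes_since(outage_start):
        print(c.author, c.key, c.old_value, "->", c.new_value)
    store.roll_back("oncall", bad)
    print(store.values)
```

The design choice worth noting is that the rollback is recorded as just another change, so the audit trail stays complete even while the incident is being fixed.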