Integration Test Email - 123dev #25
Posted on September 30, 2021 • 3 minutes • 606 words
Who’s to blame?
Many of us probably heard about the HBO Max email that was sent out by mistake last week. It wasn’t a major outage, but a lot of people were talking about it. I especially loved the outpouring of support I saw online for whoever triggered the integration job that sent the email. 💖
Just like in the gif, no one person is to blame. The person who poured the cement isn’t any more to blame than the person who architected the building. In any incident, if a single person can cause an outage, it is almost certainly not their fault.
My biggest outage
There was a CVE for Google Chrome and I was asked to update it on all the computers. We had tooling to run arbitrary commands on sets of machines, so obviously I ran yum update -y google-chrome-stable. In that moment I had forgotten that Chrome had special repos which did not point to our internal mirrors. I also forgot that Chrome was installed on servers. Instead of updating fewer than 800 computers from internal repos, I triggered 4,000+ systems to simultaneously download Chrome directly from Google.
Our ISP connection couldn’t handle that throughput (~225 GB), so systems started to back off and retry as the bandwidth was consumed and systems reconnected. With that many connections coming in at once, Google automatically blocked our public IP, which caused even more systems to back off and retry. It also blocked us from all of our GSuite tools, so no one could check email or use Docs.
I had caused a 3-hour delay for 1,000+ employees on an already tight schedule. I learned a very important lesson, and we took action to make sure an outage like that was not as easy to cause in the future.
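Not part of the original incident writeup, but a rough sketch of the kind of guardrail I mean, assuming yum-based hosts and internal mirrors named something like internal-*; the repo names here are made up for illustration:

    # Preview on a single host which repo the new package would resolve from.
    yum check-update google-chrome-stable

    # Roll out fleet-wide, but only allow packages served by internal mirrors,
    # so a package that only exists in an external repo fails fast instead of
    # fanning thousands of downloads out to the internet.
    yum update -y --disablerepo='*' --enablerepo='internal-*' google-chrome-stable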
Links
Blameless isn’t enough to make postmortems useful. I like this article’s wording of “actionable retrospectives” because it aligns with the actual goal, and it has tips that go a lot further than just avoiding blame.
Blameless postmortems don’t work. Be blame-aware but don’t go negative — techbeacon.com Blameless postmortems feel wrong since humans are wired for blame. Successful teams are blame-aware and focus on actionable outcomes without going…
Reading postmortems is a great way to learn how systems work and how they break without needing to live through it. Of course, living it helps cement the paranoia that many sysadmins have built up over years of on-the-job experience. A well-written postmortem can teach a lot, and this one stuck out in my memory as one I learned from recently.
A Terrible, Horrible, No-Good, Very Bad Day at Slack - Slack Engineering — slack.engineering This story describes the technical details of the problems that caused the Slack downtime on May 12th, 2020. To learn more about the process behind incident response for same outage, read Ryan Katkov’s post, “All Hands on Deck”. On May 12, 2020, Slack had our first significant outage in a long time. We published a summary …
If you need more guidance on conducting your own postmortems, this post has more details on what you should and shouldn’t do, plus some tools that can help. I hope everyone can learn from their mistakes openly and not have to fear retribution for their actions.
A Project Postmortem Toolkit: Apps and Approaches that Help You Learn More from Retrospectives — zapier.com In war, you need to have a reason for every action. One misstep, and you put your entire squad in danger. That’s why the U.S. Army treats every training exercise like the real thing—to the extent that they conduct performance critiques after every activity. Team leaders are tasked with planning…