From approximately 12:15 a.m. on Friday, July 19 through 12:30 p.m. on Saturday, July 20, many Brown IT services were unavailable or experiencing intermittent problems. In addition, from 12:15 a.m. on Friday, July 19 through 4 p.m. on Tuesday, July 23, a large number of Windows computers used by Brown community members were failing to start correctly. In the case of both servers and PCs, the Windows operating system was continuously crashing and restarting.
This outage affected all members of the Brown community who tried to use any of the impacted services during that period, as well as anyone using a Windows PC that was failing to start.
Soon after midnight on Friday, July 19, a misconfigured software update from the security provider CrowdStrike started causing technology outages around the world. Affected computers crashed and restarted continuously.
This impacted most Brown IT services, running on more than 1,100 servers, as well as more than 1,000 Windows desktops and laptops used by students, faculty and staff across the Brown community.
The Office of Information Technology (OIT) first observed services failing shortly after 1 a.m. on Friday and immediately assembled to diagnose the problem and start recovering critical services. CrowdStrike issued a fix by 1:45 a.m. on Friday to prevent any further impact, but the problem had already been created on any Windows computers with CrowdStrike installed that were actively running between 12:15 a.m. and 1:30 a.m. Affected servers and PCs could not be fixed remotely or programmatically; instead, each one needed a manual fix applied by someone directly at the computer, with knowledge of the individual storage encryption keys used on all Brown-owned computers. Knowing this would take many hours to resolve, OIT moved quickly to alert the Brown community. Because Brown's usual bulk email services were affected, OIT sent email to the community before 7 a.m. using the Brown Alert system, and published an outage message on the OIT Statuspage service dashboard and the phone greeting at the OIT Help Desk.
Throughout the day on Friday, more than 150 people from many OIT teams worked urgently to restore services and to repair individual PCs by hand across our College Hill and Jewelry District campus areas. OIT also worked closely with groups of departmental IT partners to repair as many PCs as possible. By the end of the day on Friday, over 500 people had working PCs again and most mission-critical services were running properly. OIT finished restoring all services by approximately noon on Saturday, July 20, and updated the Statuspage alert at that time. OIT and IT partners continued working for several more days the following week to resolve as many affected PCs as possible.
CrowdStrike has been very transparent, responsive, and supportive throughout this outage, and has released an extended technical analysis of the entire event on its outage-specific information hub. In addition, it immediately began work on a full audit and improvement of its code release processes. CrowdStrike continues to keep us informed in detail as a customer of its services.
As of the publication date of this After-Action Report, almost 200 additional PCs across the Brown community are still expected to need attention. If you have a PC that is constantly restarting, please refer to the IT Help Article available from OIT, or contact your usual IT support professional or the OIT Help Desk so they can help you.
In addition, we have held analytical reviews of our service architecture and incident mitigation steps, our support work to repair PCs in the field, and our communications during this significant outage. Each of these reviews has led to multiple plans to improve our service resiliency and our readiness for future major incidents.
OIT would like to express our collective thanks for the compassion, patience, and many words of encouragement and appreciation we received from students, faculty and staff during this major outage. We are proud to support you, and we are truly grateful for your kindness and your trust.