What happened?
On Friday, July 19, 2024, Windows systems failed on a massive scale worldwide. The cause was an update from the company CrowdStrike. This update installed a file on the affected Windows computers that caused a CrowdStrike device driver to crash. As a consequence, those machines ended up in an endless loop of reboots (Blue Screen of Death, BSoD). Among other things, entire airports were paralyzed. It was therefore a serious event, one could even say a catastrophe.
Of course, there has been, and still is, plenty of finger-pointing and gloating. Above all, the question was raised how a device driver that could cause such a serious problem ever made it onto the systems. However, it was not the device driver itself that was updated, but a data file. The driver crashed because this data file contained a scenario that the driver did not handle correctly. So this was not a hastily shipped driver update. And I assume that both CrowdStrike and Microsoft have established processes aimed at preventing exactly such situations. And yet it happened.
The question for me is what we can learn from this. I cannot make specific recommendations about the CrowdStrike situation; I am sure capable developers work there and test their software properly. Nevertheless, my interim conclusion is that we as developers need to focus massively on testing. If your projects have too little test coverage, take care of it!
The BSI...
Der Spiegel writes: The President of the Federal Office for Information Security (BSI), Claudia Plattner, has now announced measures to prevent such mishaps from occurring in the future. "There are a few places and levers where we can and must do something," she told the broadcaster Phoenix.
According to Plattner, the main aim is to pay closer attention to the quality of manufacturers' products. "We will take a much closer look," she said. Much has already been done in the recent past to increase security. "However, today has shown us that there are still some issues in the supply chain where we all need to do more."
Interesting statement. I wonder whether Ms. Plattner understands what she is talking about. Large companies in particular have established processes aimed at detecting software errors at an early stage. Perhaps she is thinking of even more regulations and requirements. In my opinion, one thing above all else will help: we developers need to get our act together and do more to ensure that our software is correct. The CrowdStrike problem is only the tip of the iceberg. There are countless software projects that continue to be developed with no tests at all, or far too few. The problem is usually downplayed. Since 2009, when we launched the Clean Code Developer Initiative, we have never tired of pointing out that automated software testing is an absolute necessity. You can compare it to putting on a life jacket when sailing. You just do it! Even in good weather, even as an experienced sailor. Period.
What now?
Now I don't want to claim that everything is tip-top in my own software projects. The CrowdStrike case gave me a lot to think about, and I took a look at my test coverage. There is still room for improvement, even with the various tools available. When it comes to testing, we are no longer where we were in 2009: coverage analysis, mutation testing, property-based testing, approval testing... countless possibilities.
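To make one of these techniques a little more concrete: a property-based test checks an invariant against many generated inputs instead of a handful of hand-picked examples. The following is only a minimal, hand-rolled sketch without any library; the Reverse helper and the "reversing twice yields the original" property are purely illustrative.

```csharp
using System;
using System.Linq;

// Minimal hand-rolled property check (no library): the property
// "reversing a string twice yields the original string" must hold
// for arbitrary generated inputs, not just a few chosen examples.
public static class ReverseProperty
{
    private static string Reverse(string s) => new string(s.Reverse().ToArray());

    public static void Main()
    {
        var rng = new Random(42); // fixed seed so failures are reproducible

        for (int i = 0; i < 1000; i++)
        {
            var length = rng.Next(0, 50);
            var input = new string(Enumerable.Range(0, length)
                .Select(_ => (char)rng.Next('a', 'z' + 1))
                .ToArray());

            if (Reverse(Reverse(input)) != input)
                throw new Exception($"Property violated for input: '{input}'");
        }

        Console.WriteLine("Property held for 1000 random inputs.");
    }
}
```

Libraries for this purpose additionally shrink failing inputs to a minimal counterexample, which makes diagnosing the defect much easier.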
Programming languages have also evolved. In C#, it is imperative that developers familiarize themselves with the topic of nullable reference types. Every project file must contain the entry <Nullable>enable</Nullable>. In a legacy project, the compiler and the IDE then typically bombard us with warnings. The task now is to tackle these one by one and consciously decide in each case what the best solution is to address null at its root. The aim must be to avoid nullable reference types wherever possible.
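As a rough sketch of what this looks like in practice (the Customer class below is purely illustrative): with <Nullable>enable</Nullable> set in the project file, the compiler distinguishes between string and string? and warns about possible null dereferences.

```csharp
#nullable enable // per-file equivalent of <Nullable>enable</Nullable> in the .csproj

using System;

// Illustrative example of a class under nullable reference types.
public class Customer
{
    // Non-nullable: the compiler warns if this could remain null after construction.
    public string Name { get; }

    // Explicitly nullable: callers are forced to handle the null case.
    public string? Nickname { get; set; }

    public Customer(string name)
    {
        Name = name ?? throw new ArgumentNullException(nameof(name));
    }

    public string DisplayName()
    {
        // Returning Nickname alone would trigger a "possible null reference return"
        // warning; the ?? operator resolves null explicitly.
        return Nickname ?? Name;
    }
}
```

The warnings do not change runtime behavior, but they force a conscious decision about null at every boundary, which is exactly the point.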
As developers, we can do a lot. It is important to reflect on our way of working in order to make a conscious decision not to tackle the next feature until the previous ones have achieved reliable test coverage.
Conclusion
The CrowdStrike disaster must be investigated in detail. It would be helpful for the entire IT industry to receive an exact description of the processes involved. That way, we can learn from them and their shortcomings and do better in the future. In the maritime sector, every accident is investigated by the Federal Bureau of Maritime Casualty Investigation, and the reports are openly accessible (see https://www.bsu-bund.de/DE/Publikationen/Unfallberichte/Unfallberichte_node.html). As a sailor, I read them from time to time to be reminded of the measures needed to avoid accidents. In the field of software development, we absolutely need a mentality of continuous improvement and learning. Put on your life jackets!
2 thoughts on "CrowdStrike – What can we learn from it?"
Very good post, but I am currently dealing with an IoT project in which we have to communicate with CAN bus controllers and interact with Azure components, e.g. Blob Storage, IoT Hub, Event Grid, ... For proper test coverage, the question is where to start. Automated tests that include the controllers require corresponding environments in different versions, or simulated controller interfaces. Not to mention the Azure infrastructure, including routes etc.
The example is meant to show that it is not so easy to increase test coverage without considerable effort, even though I am a fan of 100% coverage.
Nobody said it was easy 😉 In the end, it's about achieving the value of correctness on the one hand and the value of production efficiency on the other. That speaks in favor of test automation in the first place. And in the end, there may remain a small residue that has to be tested by hand. I simply strive to keep reducing the share of non-automated tests. Docker + Testcontainers can do a lot here.
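To illustrate the Testcontainers idea, here is a rough sketch assuming the Testcontainers for .NET package (API names may differ between versions); Redis is only a stand-in for whatever external dependency an integration test needs.

```csharp
using System;
using System.Threading.Tasks;
using DotNet.Testcontainers.Builders;

// Sketch: spin up a throwaway container for an integration test instead of
// relying on a manually provisioned environment. Redis is just a placeholder.
public static class ContainerizedTest
{
    public static async Task Main()
    {
        var container = new ContainerBuilder()
            .WithImage("redis:7-alpine")
            .WithPortBinding(6379, true) // true = map to a random free host port
            .WithWaitStrategy(Wait.ForUnixContainer().UntilPortIsAvailable(6379))
            .Build();

        await container.StartAsync();
        try
        {
            // The actual test would connect to localhost:mappedPort and
            // exercise the code under test against a real dependency.
            var mappedPort = container.GetMappedPublicPort(6379);
            Console.WriteLine($"Dependency available on port {mappedPort}");
        }
        finally
        {
            await container.DisposeAsync();
        }
    }
}
```

Hardware such as CAN bus controllers cannot be containerized, of course; there, simulated controller interfaces remain the pragmatic way to keep most tests automated.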