On July 19th, 2024, an update by CrowdStrike triggered what some have called “The largest IT outage in history.” The CrowdStrike IT outage will certainly always be remembered for the impact it had across global enterprises, airports, emergency services, etc.
However, updates that cause issues happen every day. And while this CrowdStrike IT outage was perhaps the largest in history, I think “to date” should be appended to that moniker. Why? Because there’s always a chance there could be an even worse outage just around the corner!
During the outage, I spoke with several customers who used the 1E Client to safeguard operations and ensure employees could work. After all, there’s no larger negative impact on DEX and productivity than a blue screen of death. Other customers I spoke to needed assistance getting machines out of blue screen.
Remember, you can hope for the best, but should always be prepared for the worst. There are some simple preparatory steps to make this much easier to do.
Here are my top 3 lessons from this outage: things that all 1E customers can do to make sure they're ready for the next outage, no matter how big or small it is.
Let’s dive into each of these in turn.
This is a piece of IT hygiene that any company using BitLocker should already be on top of. However, as the outage showed, many weren't prepared.
There's a simple 1E Automation, which:
Safe mode is a fantastic tool. It enables only the most important, essential services and software when Windows boots. This is so that, for example, when a service installed on your machine has a bad update, you can still boot the machine up to repair the issue.
There's a 1E Automation that allows you to set which version of Safe Mode you want the 1E Client to work in. We default this to Safe Mode with Networking and suggest that’s the one you use.
It’s critical that you do not enable the 1E Client in both versions of Safe Mode (Networking and Minimal). To guarantee this, the automation disables one when you enable the other. If you set Windows to run the 1E Client in Safe Mode with Networking, it makes sure the client isn't enabled for Safe Mode Minimal, and vice versa. Allowing startup in only one type of Safe Mode means that if there's an issue with the 1E Client itself, you can still boot into a Safe Mode that doesn't start it.
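To illustrate the "one mode only" rule, here is a minimal sketch of the underlying Windows mechanism, not of the 1E Automation itself. Windows starts a service in Safe Mode only if the service is listed under the matching `SafeBoot` registry key. The function below only plans the changes and does not touch the registry. The service name `ExampleAgentService` and the `RegistryAction` and `plan_safe_mode` names are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass

# Windows consults these keys to decide which services may start in Safe Mode.
SAFEBOOT_ROOT = r"HKLM\SYSTEM\CurrentControlSet\Control\SafeBoot"
MODES = {"network": "Network", "minimal": "Minimal"}


@dataclass(frozen=True)
class RegistryAction:
    """One planned registry change (hypothetical helper type)."""
    action: str       # "set" or "delete"
    key: str          # full registry key path
    value: str = ""   # default value to write when action == "set"


def plan_safe_mode(service_name: str, enable_in: str) -> list[RegistryAction]:
    """Plan the changes that enable a service in exactly one Safe Mode variant."""
    if enable_in not in MODES:
        raise ValueError(f"unknown Safe Mode variant: {enable_in!r}")
    plan = []
    for mode, subkey in MODES.items():
        key = "\\".join((SAFEBOOT_ROOT, subkey, service_name))
        if mode == enable_in:
            # A default value of "Service" marks the entry as a service.
            plan.append(RegistryAction("set", key, "Service"))
        else:
            # Removing the entry keeps the service out of the other variant.
            plan.append(RegistryAction("delete", key))
    return plan


# "ExampleAgentService" is a placeholder, not the real 1E Client service name.
for step in plan_safe_mode("ExampleAgentService", "network"):
    print(step.action, step.key)
```

Planning both the "set" and the "delete" in one call mirrors the behaviour described above: enabling one variant always disables the other, so a Safe Mode without the agent is always available.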
During the CrowdStrike IT outage, you could boot to Safe Mode with Networking. If you had set the 1E Client to work in this mode, you could then issue an instruction to rename the offending update files. This resolves the issue across all affected machines in seconds.
If you did not have this setting already in place, your users would have to go into the right folder to delete the files by themselves. Regardless of their comfort level in doing so, they may not have had administrative rights that would enable them to take the required action.
This was part of the reason why fully resolving the issue took so long: many days in some organizations.
Having the 1E Client set to be available in Safe Mode with Networking allows you real-time control of remote devices while they are in that state.
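To make the rename step from the CrowdStrike example concrete, here is a minimal sketch of what such an instruction might do on each device. It is not the 1E instruction itself. The folder and file pattern in the usage comment follow CrowdStrike's published guidance for this incident; the function name `rename_matching_files` and the `.bak` suffix are assumptions made for illustration.

```python
from pathlib import Path


def rename_matching_files(folder: Path, pattern: str, suffix: str = ".bak") -> list[Path]:
    """Rename every file in `folder` that matches `pattern` by appending `suffix`.

    Returns the new paths so the caller can report what was changed.
    """
    renamed = []
    # Materialise the matches first so renaming does not disturb the iteration.
    for path in sorted(folder.glob(pattern)):
        if path.is_file():
            target = path.with_name(path.name + suffix)
            path.rename(target)
            renamed.append(target)
    return renamed


# Example call for the July 2024 incident (requires administrative rights):
#   rename_matching_files(
#       Path(r"C:\Windows\System32\drivers\CrowdStrike"), "C-00000291*.sys"
#   )
```

Renaming rather than deleting keeps the original file on disk, which makes the change easy to reverse and leaves evidence for later analysis.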
It’s always the unknown unknowns that get you, and it’s important to prepare for what we can predict. The 1E Client is great at making sure devices are kept in a compliant state, which will deliver the best end-user experience, starting with being available and stable.
If you can't communicate with devices directly, the 1E Client, as long as it is running, lets you equip users with instructions to "heal" their devices themselves, even when those devices aren't connected to the network. The 1E Client can run automations in reaction to “Triggers”, even when the device is offline.
Our goal is to ensure that you and your users have the tools available when needed to react to unforeseen circumstances.
For example:
1. Enable users to self-elevate their account to Local Administrator in Windows.
Depending on who the user is (persona-based), you can make this easily available. For technical users (e.g. those in your IT team), you can allow them to self-elevate on demand with a desktop icon. For non-technical users, you can leave the resources available but not advertised, requiring a password to access. This allows the user to gain access, under guidance, to an elevated command line or to add their account to the local admin group. This is much safer than, for example, having a common local admin user with a common password on all machines. This elevation can be revoked automatically, for example, after a period of time or on log-out (whichever is sooner).
2. Have an instruction that allows boot to safe mode with networking enabled.
Often the challenge is in getting the user started. Rebooting a machine into Safe Mode or Safe Mode with Networking takes several steps. You can simplify this by having an instruction that lets you start the process from anywhere. It can bring one, some, or all devices into Safe Mode or Safe Mode with Networking. This should be controlled so that approval is needed by someone else before it can be run. All of this is possible in the 1E platform (by default, such actions need approval by someone else!).
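As a rough sketch of what a "boot to Safe Mode with Networking" instruction could run on a device, the example below builds the standard Windows commands (`bcdedit` to set the `safeboot` option, `shutdown` to restart). It is not the 1E instruction, and it does not model the 1E platform's approval workflow: the `approved` flag is a stand-in for that second-person approval, and the function names are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess


def safe_boot_commands(mode: str = "network") -> list[list[str]]:
    """Build the commands that make the next boot a Safe Mode boot, then restart."""
    if mode not in ("network", "minimal"):
        raise ValueError(f"unknown Safe Mode variant: {mode!r}")
    return [
        ["bcdedit", "/set", "{current}", "safeboot", mode],
        ["shutdown", "/r", "/t", "0"],
    ]


def undo_safe_boot_command() -> list[str]:
    """Build the command that returns the device to normal boot."""
    return ["bcdedit", "/deletevalue", "{current}", "safeboot"]


def run(commands: list[list[str]], approved: bool, dry_run: bool = True) -> None:
    """Run the commands only if approved; in a dry run, print them instead."""
    if not approved:
        raise PermissionError("a second person must approve this action")
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # Requires an elevated prompt on a Windows device.
            subprocess.run(cmd, check=True)


run(safe_boot_commands("network"), approved=True)  # dry run: prints only
```

The `safeboot` option persists across reboots, so the device keeps starting in Safe Mode until the command from `undo_safe_boot_command()` is run. Any real instruction needs a matching way back to normal boot.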
If you follow these steps, the next time an outage, large or small, comes along, you and your devices will be best prepared to handle whatever may occur.