The cyber incident on Friday, July 19, 2024, was caused by a code error in an update that CrowdStrike pushed to Windows machines. This was not a malicious attack. CrowdStrike Falcon, the specific product impacted, is a cloud-based breach-detection product with a small local footprint.
CrowdStrike describes Falcon this way: "Falcon is the CrowdStrike platform purpose-built to stop breaches via a unified set of cloud-delivered technologies that prevent all types of attacks — including malware and much more."
A small file, termed a sensor, is installed on the computer. This sensor monitors for viruses, malware, zero-day (emerging) threats, and other types of attacks. The sensor communicates with CrowdStrike through the cloud, and if a breach is detected, CrowdStrike can then respond. By keeping the bulk of the service in the cloud, the protected computer isn't bogged down with a resource-heavy software package. The sensor file is only about 5 MB.
CrowdStrike recommends that its Falcon customers use an N-2 update cadence, or at least N-1. This means the sensor software runs either one release (N-1) or two releases (N-2) behind the current version. Ideally, this allows any issue with an update to be found and resolved before it ever reaches a client computer.
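As a rough illustration of that cadence, here's a minimal Python sketch. The version numbers and the function are hypothetical, used only to show how N-1 and N-2 map onto a release list; this is not how CrowdStrike actually distributes sensor versions.

```python
# Minimal sketch of an N-1 / N-2 update cadence (hypothetical data,
# not CrowdStrike's actual release mechanism).

# Sensor releases, oldest first; the last entry is the current (N) version.
releases = ["7.13", "7.14", "7.15", "7.16"]

def version_for_cadence(releases, n_minus):
    """Return the release a host should run for an N-minus-x cadence."""
    index = len(releases) - 1 - n_minus
    return releases[max(index, 0)]

print(version_for_cadence(releases, 1))  # N-1 -> "7.15"
print(version_for_cadence(releases, 2))  # N-2 -> "7.14"
```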
The update that caused the Windows Blue Screen of Death (BSOD) and boot-looping issue last Friday wasn't controlled by the N-1 or N-2 policy set up on most systems. That update was to the signature files, which help the Falcon sensor determine what is a threat. Because these files need to reach systems as quickly as possible, they aren't covered by the usual update cadence.
The U.S., Canada, the UK, Europe, and Asia experienced disruptions to various services during the outage. Although Mac and Linux computers were unaffected, the Windows outage was still enough to cancel over 4,000 flights worldwide. The financial and healthcare sectors were severely impacted, with many elective medical procedures postponed. Numerous payment systems were also unavailable in the early hours of the incident.
There's much more information about this available online if you want to find it. CrowdStrike has been very transparent in dealing with this issue. But what we're discussing here is why this issue caused the communication troubles some of you saw last week. For that, we'll talk about how alarm communicators are supervised.
Generally speaking, when an alarm system has a signal to report, it does so using whatever channel or channels are available to it. This could be a POTS phone line, a Wi-Fi or Ethernet connection, an LTE or LTE-M cellular connection, or some combination of these paths. The important thing to know is that when an alarm panel sends a signal, it looks for an acknowledgment that the signal was received successfully. If it doesn't receive that acknowledgment, it sends the signal again (and again) until it either reaches the destination or hits the retransmission limit or time limit for the path being used.
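The send-and-acknowledge pattern looks roughly like the sketch below. The function names, retry count, and time limit are illustrative assumptions, not any panel manufacturer's actual firmware.

```python
import time

def send_over_path(path: str, signal: dict) -> bool:
    """Placeholder transmitter. A real panel would send the signal over POTS,
    IP, or cellular here and return True only if an acknowledgment came back."""
    return False

def report_signal(signal: dict, path: str,
                  max_retries: int = 5, time_limit_s: float = 90.0) -> bool:
    """Resend a signal until it's acknowledged, or until the retry or time
    limit for this path is reached."""
    deadline = time.monotonic() + time_limit_s
    for _attempt in range(max_retries):
        if send_over_path(path, signal):
            return True          # acknowledgment received; stop retransmitting
        if time.monotonic() >= deadline:
            break                # time limit for this path reached
    return False                 # never acknowledged on this path
```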
At the other end of this communication is the Alarm.com or AlarmNet server. This system receives those signals and processes them. This may include forwarding the information to a central station, to an online platform for logging and distribution to the end-user, or both.
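On the receiving side, that processing step might look something like the following sketch. The helper names and routing rules are assumptions made for illustration, not Alarm.com's or AlarmNet's actual software.

```python
def log_to_platform(signal: dict) -> None:
    """Record the signal on the online platform for history and end-user notifications."""
    print(f"platform log: {signal}")

def forward_to_central_station(signal: dict) -> None:
    """Pass the signal to central station operators for dispatch or follow-up."""
    print(f"central station: {signal}")

def process_signal(signal: dict) -> None:
    """Handle an incoming signal: always log it, and forward actionable events."""
    log_to_platform(signal)
    if signal.get("type") in ("alarm", "trouble"):
        forward_to_central_station(signal)

process_signal({"type": "alarm", "zone": 3, "account": "demo"})
```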
Since the system may never have an alarm, there are measures in place to send periodic test messages from the alarm system communicator to the server. This ensures that each communication pathway is open and working. This usually involves setting a communication test interval. For cellular communication in particular, it's desirable to minimize unnecessary signals, so this is customarily a "smart" test.
For example, you may set a system for a daily or 24-hour test. This is a setting at both the alarm panel and the signal processing server. That means every 24 hours, the system will send a test message to the server to verify communication. If no test message is received, the server generates a central station message that the system failed to properly test.
With a "smart" test, any signal sent by the system resets the test timer. So, the only time the server will receive an actual test message is if the system hasn't sent any other type of signal for 24 hours. In either case, based on this example, if the signal processing server goes 24 hours without receiving either a regular signal or a test signal, a trouble condition is generated.
From here, we can only assume that the signal processing server or servers were impacted by the CrowdStrike Falcon update. We can safely assume this because the Verizon and AT&T LTE cellular networks themselves were not affected by the issue, which points to the receiving end rather than the communication paths. The way I see it, this incident was a blessing in disguise.
Though this probably seems like a catastrophic event, it's actually an opportunity. Because this was not a malicious attack, the least possible harm has come from it. Those with robust disaster recovery plans got a real-world chance to put them to use. Those without robust disaster recovery plans now know what's at stake and can plan accordingly. Catalysts for change and improvement are rarely painless, and this is no exception.