What Caused The Resideo Total Connect 2.0 Outage Sunday?
Posted By Julia Ross
I was able to speak with an industry insider familiar with the events at Resideo's data center on Sunday night into Monday evening. This person related to me that there was an HVAC failure at the primary data center. It was initially thought to be an easy fix, but that turned out to be false.
Things started to go wrong in Resideo's primary data center on Sunday night at around 7:00 PM Eastern Time. An HVAC failure allowed the temperature in the data center to climb to a level that was dangerous for the servers located there. The normal temperature is around 70℉ (21℃), but on Sunday it rose into the neighborhood of 130℉ (54.4℃). The servers are configured to fail safe, so rather than continue running and risk catastrophic damage, they began to shut down.
An automated system is in place which notifies engineering and other stakeholders when a serious event like this occurs. An HVAC technician responded. Initially, the technician believed this would be a quick and easy fix, so the decision was made not to switch to the secondary data center, which is located in the Chicago area. The switch takes a bit of time, somewhere around 20 minutes, and the thought was that it wouldn't be worthwhile at that point to make the switch.
However, the HVAC tech discovered that the fix required a part that was not on hand and could not be obtained at that hour. So, at around 1:00 AM Eastern Time, the decision was made to switch things over to the secondary data center. By about 1:30 AM Eastern Time, the backup data center was in control.
By around daylight Monday morning, the HVAC system in the primary data center had been fixed. After the repair, it took some time for the temperature to come back down to an acceptable level. By approximately 11:00 AM Eastern Time, Resideo was ready to switch back to the primary data center. At this point, alarm signaling was back up and had been for some time. By around 2:00 PM, AlarmNet360 was back up, and by about 6:00 PM, Total Connect 2.0 was back online, though customer reports and our own testing showed that it was somewhat sluggish at first.
This outage affected three services. The most serious was alarm signaling. During the early hours of the outage, customers' systems were unable to send signals to the monitoring station or to send notifications to the customers themselves. Total Connect 2.0, the customer-facing app and website for end-user remote control, was also down. Lastly, AlarmNet360, the dealer-facing service used to create or cancel accounts and remotely troubleshoot issues, was also affected. When things went wrong, the initial focus was on getting alarm signaling back up as quickly as possible. This was the focus when they initially switched to the Chicago area data center.
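To put the timeline in perspective, here is a rough sketch in Python that adds up how long each service was down, using the approximate Eastern Time figures reported above. Two things are assumptions on my part and not confirmed by my source: the calendar date is an arbitrary placeholder (only the weekday and clock times are known), and alarm signaling is treated as restored at the moment the backup data center took control, around 1:30 AM.

```python
from datetime import datetime, timedelta

# Placeholder dates: only weekdays and approximate clock times were reported.
SUNDAY = datetime(2000, 1, 2)        # arbitrary Sunday, used only for arithmetic
MONDAY = SUNDAY + timedelta(days=1)

# Approximate start of the HVAC failure (7:00 PM Eastern, Sunday).
outage_start = SUNDAY.replace(hour=19)

# Approximate restoration times (Eastern, Monday).
# Assumption: alarm signaling came back when the backup site took control.
restored = {
    "alarm signaling": MONDAY.replace(hour=1, minute=30),
    "AlarmNet360": MONDAY.replace(hour=14),
    "Total Connect 2.0": MONDAY.replace(hour=18),
}

def hours_down(restored_at):
    """Return the elapsed hours between the outage start and restoration."""
    return (restored_at - outage_start).total_seconds() / 3600

durations = {name: hours_down(when) for name, when in restored.items()}

for name, hours in durations.items():
    print(f"{name}: about {hours:.1f} hours")
```

Since every input is an approximation, the results are ballpark figures only.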
This is a fully redundant system, and it is tested regularly. According to my source, hourly notifications were being sent to alarm dealers, but the database of email addresses for these notifications seems to be outdated. This is something they will address going forward. Those dealers who did receive notifications noted that the emails weren't flagged as containing particularly important information. This is also something that will be addressed in the future. A root cause analysis will be completed in the coming days, and any processes or procedures that need to be updated will be dealt with at that time. Finally, the site at status.resideo.com doesn't have a section showing either AlarmNet360 or Total Connect 2.0 status. Hopefully, this is something that will change in the very near future as well.