Service Disruption - US public cloud
Incident Report for Interact
Postmortem

At approximately 6:55pm EDT / 3:55pm PDT, the Interact application became unavailable for a number of customers in the US region. The cause was a hardware failure on the shared resource that handles file storage. During this time, automatic failover steps did not respond within an acceptable timeframe, and a number of webservers could no longer serve requests.

As part of standard procedures, engineers were notified. After investigation, they determined that the underlying incident occurred at our hosting provider, Amazon Web Services (AWS), and specifically affected the hardware we use to serve files in the Interact application. To recover the service as quickly as possible, engineers bypassed the existing automated systems and replaced the hardware manually. The Interact service became fully available at 7:10pm EDT / 4:10pm PDT.

We are reviewing our automatic failover processes to improve resilience in the recovery of essential hardware components going forward. This includes the tolerance and time-to-recovery thresholds. We will also work with AWS to improve processes so that the Interact service is not affected by similar events in the future.

Posted Jul 20, 2018 - 09:31 UTC

Resolved
Service disruption encountered for customers on the US public cloud at 6:55pm EDT / 3:55pm PDT (10:55pm UTC)
Posted Jul 19, 2018 - 22:55 UTC