At approximately 6:00am EDT / 3:00am PDT, engineers began rotating all web application servers in the US Public Cloud as part of standard update processes. At 6:22am EDT / 3:22am PDT, the Interact application in this region encountered a major service outage.
During this time a service-level incident occurred at our hosting provider, Amazon Web Services (AWS), specifically affecting the load balancing service we utilise to serve the Interact application. New servers being added into service did not register with the load balancer within an acceptable time frame, and older servers were rotated out of service before registration of the newer servers had completed. Together these factors caused a loss of service, as no servers were available to handle requests, and the load balancing service offered engineers no immediate remediation options.
At 6:45am EDT / 3:45am PDT, engineers initiated a disaster recovery plan to bypass the affected service. Before this was fully implemented, AWS resolved the service issue, and the Interact application was once again accessible at 7:05am EDT / 4:05am PDT.
We are reviewing our deployment processes to improve resilience, specifically the time period between the initial deployment of new servers and the retirement of older servers, to help mitigate similar issues in future. We will also work with AWS to improve processes so that the Interact service is not affected by similar events.
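To illustrate the kind of safeguard under review, the following is a minimal sketch of a rotation that retires older servers only after every new server reports healthy. It is an illustration under stated assumptions and does not represent the actual Interact deployment tooling or any AWS API: the `register`, `deregister` and `is_healthy` callables, the `RegistrationTimeout` exception and the timeout values are all hypothetical.

```python
import time


class RegistrationTimeout(Exception):
    """Raised when new servers do not all become healthy in time."""


def wait_until_healthy(servers, is_healthy, timeout_s=600.0, poll_s=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Block until every server passes is_healthy, or raise on timeout.

    is_healthy is a hypothetical callable standing in for whatever
    health signal the load balancer exposes.
    """
    deadline = clock() + timeout_s
    pending = list(servers)
    while pending:
        # Keep only the servers that are still not healthy.
        pending = [s for s in pending if not is_healthy(s)]
        if not pending:
            return
        if clock() >= deadline:
            raise RegistrationTimeout(f"still unhealthy: {pending}")
        sleep(poll_s)


def rotate(old_servers, new_servers, register, deregister, is_healthy,
           **wait_kwargs):
    """Register new servers, then retire old ones only once new are healthy.

    Returns True if the rotation completed, False if it was abandoned
    with the old servers left in service.
    """
    for server in new_servers:
        register(server)
    try:
        wait_until_healthy(new_servers, is_healthy, **wait_kwargs)
    except RegistrationTimeout:
        # Registration is slow or failing: keep the old servers serving.
        return False
    for server in old_servers:
        deregister(server)
    return True
```

With a guard of this shape, slow registration leaves the rotation incomplete with the older servers still serving requests, rather than removing them before their replacements are ready.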
Further details of the underlying Amazon service incident are available on the AWS status page: http://status.aws.amazon.com/ - Amazon Elastic Load Balancing (N. Virginia)