US Public Cloud

Incident Report for Interact

Postmortem

At approximately 6:00am EDT / 3:00am PDT, engineers began rotating all web application servers in the US Public Cloud as part of the standard update process. At 6:22am EDT / 3:22am PDT, the Interact application in this region suffered a major service outage.

At this time, our hosting provider Amazon (AWS) experienced a service level incident specifically affecting the load balancing service we utilise to serve the Interact application. New servers added into service were not registering with the load balancers within an acceptable time frame, and older servers were rotated out of service before registration of the newer servers had completed. As a result, no servers were available to handle requests, and the load balancing service offered engineers no immediate remediation options.

At 6:45am EDT / 3:45am PDT, engineers initiated a disaster recovery plan to bypass the affected service. Before this was fully implemented, AWS resolved the service issue, and the Interact application was once again accessible at 7:05am EDT / 4:05am PDT.

We are reviewing our deployment processes to improve resilience, specifically the time period between the initial deployment of new servers and the retirement of older servers, to help mitigate similar issues in future. We will also work with AWS to improve processes so that the Interact service is not affected by similar events.
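
As an illustration of the kind of safeguard under review, the following is a minimal sketch of a rotation step that retires older servers only after every new server reports as healthy, and leaves the older servers in service if that does not happen within a time limit. It is not our actual deployment tooling: the names rotate_servers, register, deregister and healthy are hypothetical stand-ins for whatever calls the load balancing service provides, and the timeout values are placeholders.

```python
import time
from typing import Callable, Iterable, Set


class RegistrationTimeout(Exception):
    """Raised when new servers do not become healthy before the deadline."""


def rotate_servers(
    old_servers: Iterable[str],
    new_servers: Iterable[str],
    register: Callable[[str], None],      # hypothetical: add a server to the load balancer
    deregister: Callable[[str], None],    # hypothetical: remove a server from the load balancer
    healthy: Callable[[], Set[str]],      # hypothetical: servers currently passing health checks
    timeout_s: float = 600.0,             # placeholder value
    poll_s: float = 5.0,                  # placeholder value
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Register new servers, wait until all are healthy, then retire old ones."""
    new_list = list(new_servers)
    for server in new_list:
        register(server)

    # Wait for every new server to be in service before touching the old ones.
    deadline = clock() + timeout_s
    while not set(new_list) <= healthy():
        if clock() >= deadline:
            # Abort without deregistering anything, so the old servers keep serving.
            raise RegistrationTimeout(
                "new servers not healthy after %.0f seconds" % timeout_s
            )
        sleep(poll_s)

    # Only now is it safe to rotate the older servers out of service.
    for server in old_servers:
        deregister(server)
```

With a check of this shape, slow registration in the load balancing service would delay or abort a rotation rather than leave no servers available to handle requests.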

Further details of the underlying Amazon service incident are available on the AWS status page: http://status.aws.amazon.com/ - Amazon Elastic Load Balancing (N. Virginia)

Posted Apr 18, 2018 - 12:19 UTC

Resolved

Issues affecting connectivity to US Public Cloud are now resolved. This status will be updated with a postmortem once all investigations have been completed.

Posted Apr 18, 2018 - 11:23 UTC

Monitoring

Service levels on US Public Cloud have been restored. We are continuing to monitor the service while traffic returns to normal levels.

Posted Apr 18, 2018 - 11:10 UTC

Identified

Due to a service level incident with our hosting provider (AWS), engineers are unable to provision new infrastructure to the load balancers used by US Public Cloud. Engineers are currently putting workarounds in place to restore service levels; updates to follow.

Posted Apr 18, 2018 - 10:56 UTC

Investigating

Engineers are investigating issues affecting access to US Public Cloud. Updates to follow.

Posted Apr 18, 2018 - 10:49 UTC