Service disruption - EU
Incident Report for Interact
Postmortem

At approximately 13:50 BST engineers began to rotate all web application servers in the EU Public Cloud as part of standard update processes. At 14:14 BST the Interact application in this region encountered a significant service outage.

As part of an update cycle, new web servers are rotated into use before the existing web servers are rotated out and decommissioned. This ensures we can update the application without any downtime or loss of service. During this period roughly double the usual number of servers are serving requests and opening database connections to the SQL Server cluster.

By 14:20 BST Interact engineers had identified the problem: a significantly higher than expected number of connections to the SQL Server cluster, which caused connections to stall and web servers to become unresponsive. This was caused by the rotation process described above. At this point, engineers manually closed all SQL connections from the old web servers.
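To illustrate how overlapping old and new servers can exhaust database connections, here is a minimal sketch of the arithmetic. All figures (server counts, pool size per server, connection limit) and function names are hypothetical and chosen for illustration only; they are not taken from the Interact environment.

```python
def peak_connections(old_servers: int, new_servers: int, pool_size_per_server: int) -> int:
    """Worst-case open connections while old and new servers run side by side."""
    return (old_servers + new_servers) * pool_size_per_server


def rotation_fits(old_servers: int, new_servers: int,
                  pool_size_per_server: int, db_connection_limit: int) -> bool:
    """True if the overlap period stays within the database connection limit."""
    return peak_connections(old_servers, new_servers, pool_size_per_server) <= db_connection_limit


if __name__ == "__main__":
    # Hypothetical figures: 10 servers, 100 pooled connections each, limit of 1500.
    print(peak_connections(10, 0, 100))      # steady state, before rotation
    print(peak_connections(10, 10, 100))     # overlap, fleet doubled
    print(rotation_fits(10, 10, 100, 1500))  # does the overlap fit under the limit?
```

With these assumed numbers the steady state sits comfortably under the limit, while the doubled fleet does not, which matches the shape of the failure described above.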

At 14:28 BST this process had been completed, and the old web servers were taken fully out of rotation. This resulted in service being resumed as normal. By 14:31 BST all servers were responding well within acceptable parameters and the service was fully restored.

We are reviewing our deployment processes to improve resilience, specifically the time period between the initial deployment of new servers and the retirement of older servers, to help mitigate similar issues in future.
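One possible way to shorten the overlap between deployment and retirement is to rotate servers in batches sized to the available connection headroom. The sketch below is an assumption about how such a plan could be computed, not a description of Interact's actual process; the function names and figures are hypothetical.

```python
def max_batch_size(fleet_size: int, pool_size_per_server: int, db_connection_limit: int) -> int:
    """Largest number of new servers that can run alongside the full existing fleet."""
    headroom = db_connection_limit - fleet_size * pool_size_per_server
    return max(headroom // pool_size_per_server, 0)


def plan_rotation(fleet_size: int, pool_size_per_server: int, db_connection_limit: int) -> list[int]:
    """Return batch sizes for a rotation that never exceeds the connection limit.

    Each batch means: add that many new servers, wait until they are healthy,
    then retire the same number of old servers (closing their connections)
    before starting the next batch.
    """
    batch = max_batch_size(fleet_size, pool_size_per_server, db_connection_limit)
    if batch == 0:
        raise ValueError("no connection headroom: retire old servers first or reduce pool size")
    steps = []
    remaining = fleet_size
    while remaining > 0:
        step = min(batch, remaining)
        steps.append(step)
        remaining -= step
    return steps


if __name__ == "__main__":
    # Hypothetical figures: 10 servers, 100 connections each, limit of 1500.
    print(plan_rotation(10, 100, 1500))
```

The trade-off in this approach is a longer rotation in exchange for a bounded peak: at no point are more than the existing fleet plus one batch connected to the database.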

The US Public Cloud environment was unaffected during this time.

Posted 4 months ago. Jun 07, 2018 - 10:03 UTC

Resolved
Issues affecting the EU Pods are now resolved.
Posted 4 months ago. Jun 06, 2018 - 14:07 UTC
Monitoring
Issues affecting connectivity to EU public cloud are now resolved. Engineers will continue to monitor.
Posted 4 months ago. Jun 06, 2018 - 13:34 UTC
Identified
Engineers have identified and corrected a fault with database connectivity. Affected customers should see connectivity issues reduce shortly.
Posted 4 months ago. Jun 06, 2018 - 13:28 UTC
Investigating
Engineers are investigating reports of 500 errors on EU public cloud. Updates to follow.
Posted 4 months ago. Jun 06, 2018 - 13:16 UTC
This incident affected: Europe Public Cloud 1 and Europe Public Cloud 2.