At approximately 13:50 BST engineers began to rotate all web application servers in the EU Public Cloud as part of standard update processes. At 14:14 BST the Interact application in this region experienced a significant service outage.
As part of an update cycle, new web servers are rotated into service before the existing web servers are rotated out of service and decommissioned. This ensures we can update the application without any downtime or loss of service. During this period there are usually double the number of servers serving requests, and each of them opens database connections to the SQL Server cluster.
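To illustrate why the overlap period matters, the sketch below estimates the worst-case number of open database connections while both sets of servers are in service. The function names, server counts, pool size and connection limit are all hypothetical values chosen for illustration; they are not figures from this incident or from the Interact configuration.

```python
def peak_connections(old_servers: int, new_servers: int, pool_per_server: int) -> int:
    """Worst-case open database connections while both fleets are in service.

    Assumes every server may open up to `pool_per_server` connections.
    """
    return (old_servers + new_servers) * pool_per_server


def exceeds_limit(old_servers: int, new_servers: int,
                  pool_per_server: int, db_connection_limit: int) -> bool:
    """True if the worst case would exceed the database's connection limit."""
    return peak_connections(old_servers, new_servers, pool_per_server) > db_connection_limit


# Illustrative numbers only (assumptions, not incident data).
OLD, NEW, POOL, LIMIT = 10, 10, 100, 1500

steady_state = peak_connections(OLD, 0, POOL)      # only the old fleet is serving
during_overlap = peak_connections(OLD, NEW, POOL)  # both fleets are serving

print(f"steady state:   {steady_state} connections")
print(f"during overlap: {during_overlap} connections")
print(f"over limit during overlap: {exceeds_limit(OLD, NEW, POOL, LIMIT)}")
```

With these assumed values the steady-state load fits within the limit, but the overlap period does not, which is the general shape of the problem described in this report.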
By 14:20 BST Interact engineers had identified the problem: a significantly higher than expected number of connections to the SQL Server cluster, which caused connections to stall and web servers to become unresponsive. This was caused by the rotation process described above. Engineers then manually closed all SQL connections from the old web servers.
At 14:28 BST this process had been completed, and the old web servers were taken fully out of rotation. This resulted in service being resumed as normal. By 14:31 BST all servers were responding well within acceptable parameters and the service was fully restored.
We are reviewing our deployment processes to improve resilience, specifically the period between the initial deployment of new servers and the retirement of older servers, to help mitigate similar issues in future.
The US Public Cloud environment was unaffected during this time.