The site became unavailable for a number of customers following an unexpected shutdown of a SQL Service on a single SQL Server, which required manual intervention to bring back online. No data was lost in the process. The service became unreachable and stopped processing new incoming requests, initially affecting only customers hosted on the affected SQL Server. The connection failures, combined with a high volume of incoming traffic, caused the web servers to get stuck attempting to establish connections to the stopped SQL Service until those attempts were interrupted by the connection timeout (three minutes). As a result, incoming requests accumulated on the web servers and consumed resources until the timeout released them. This caused a large spike in latency across the web servers and degraded usability for other customers on the affected pod.
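The scale of that accumulation follows from Little's law: the number of requests held open at once is roughly the arrival rate multiplied by how long each request is held. The rates below are hypothetical, used purely to illustrate why a three-minute connection timeout lets so much work pile up compared with a fail-fast setting:

```python
def peak_stuck_requests(request_rate_per_s: float, timeout_s: float) -> float:
    """Little's law: concurrent held requests = arrival rate x hold time.

    While the SQL Service is down, every request is held for the full
    connection timeout before it fails, so the web server accumulates
    this many in-flight requests at steady state.
    """
    return request_rate_per_s * timeout_s

# Hypothetical traffic of 100 requests/second:
long_timeout = peak_stuck_requests(100, 180)  # 3-minute timeout -> 18000 held
short_timeout = peak_stuck_requests(100, 5)   # fail-fast 5 s timeout -> 500 held
```

Under these assumed numbers, a fail-fast timeout would hold 36 times fewer requests on the web servers, which is why long connection timeouts turn a single database failure into a fleet-wide latency spike.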
Resolution and Mitigation:
The stopped SQL Service instance was identified quickly and brought back online. By that point, however, latency was already climbing due to the failed connections and the requests accumulating on the web servers. To remedy this, our Cloud team replaced the affected servers with new instances: the server group was doubled in size, and the affected instances were shut down only once the new servers were running and responding to requests, avoiding any total loss of service. Removing the affected servers from rotation took around 15 to 20 minutes, after which latency and response times returned to normal. We apologise for the inconvenience caused.
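The replacement approach above, provision first and drain second, can be sketched as follows. The `surge_replace` helper and instance names are illustrative only, not our actual tooling; the point is the ordering, which keeps serving capacity at or above 100% throughout:

```python
def surge_replace(affected_pool: list[str]) -> list[str]:
    """Replace unhealthy instances without dropping capacity.

    1. Provision one new instance per affected one (group temporarily doubles).
    2. Only after the new instances are serving traffic, remove the old
       ones from rotation.
    """
    replacements = [f"new-{i}" for i in range(len(affected_pool))]
    # Surge phase: old and new instances serve side by side.
    combined = affected_pool + replacements
    # Drain phase: affected instances leave rotation; capacity never dipped.
    return [inst for inst in combined if inst not in affected_pool]

remaining = surge_replace(["web-1", "web-2"])  # -> ["new-0", "new-1"]
```

Shutting down the affected instances first and then provisioning replacements would have left the group under capacity during the swap; surging first trades a short period of double cost for zero additional downtime.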