Interact's monitoring services alerted engineers to increased latency within the EU pod, which meant that a subset of users were unable to access their Intranet or experienced slow response times. The latency affected most of the web servers in the web cluster.
Investigation and Root Cause
Following the standard investigation playbook, Interact's engineers identified that the majority of the web instances in rotation had reported unhealthy and were therefore no longer receiving traffic from the AWS Load Balancers. This persisted for 1-2 minutes, after which all boxes immediately reported healthy again. While the boxes were out of load balancing, requests queued at the load balancer, waiting for healthy instances to process them. Once the web boxes came back into rotation, they were hit with three times the normal request volume, which caused increased CPU utilisation on a number of servers and very long web server request queues. This resulted in noticeable latency problems for a number of EU customers. Interact's engineers were able to access the web instances directly, and their health check endpoints reported healthy when pinged manually. The load balancer health probes nevertheless reported the instances as unhealthy. Because the boxes were operational and accepting direct traffic even while the load balancer briefly saw them as unhealthy, this points to an issue with the infrastructure provider, which is currently under investigation.
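The queuing effect described above can be illustrated with a rough back-of-the-envelope sketch. The arrival rate and the 45-second drain window below are assumptions chosen purely for illustration, not measured values from the incident; only the 1-2 minute outage and the roughly threefold surge come from the investigation.

```python
def backlog_after_outage(arrival_rate, outage_seconds):
    """Requests held at the load balancer while no instance is in rotation."""
    return arrival_rate * outage_seconds


def surge_multiplier(arrival_rate, outage_seconds, drain_seconds):
    """Average load, relative to normal, while the backlog is being drained.

    During the drain window the instances receive the normal traffic plus
    the queued backlog spread over drain_seconds (an assumed value).
    """
    backlog = backlog_after_outage(arrival_rate, outage_seconds)
    surge_rate = arrival_rate + backlog / drain_seconds
    return surge_rate / arrival_rate


# Hypothetical figures: 1000 requests/s, a 90-second outage (within the
# observed 1-2 minutes) and an assumed 45-second drain window.
print(surge_multiplier(1000, 90, 45))  # 3.0
```

Under these assumptions, a 90-second gap in rotation is enough to triple the load on the instances once they return, which is consistent with the surge observed.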
Resolution and Mitigation Steps
Interact identified an issue with the load balancer health checks, which resolved itself quickly but led to latency spikes. Our engineers added more instances to the cluster to spread the load evenly and restore acceptable performance. The in-depth investigation that followed led us to believe that the infrastructure provider was at fault; this remains under investigation. As an extra precaution, we have moved high-volume request endpoints onto separate physical clusters to split the traffic and better tolerate very large spikes in request volume in cases like this. Because high-volume traffic is now handled by a separate set of web instances, web request queues will also fill up less quickly during a request flood. The change is being monitored closely and our investigation playbook has been updated accordingly.
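The traffic split described above can be sketched as a simple path-based routing rule. This is a minimal illustration only: the endpoint prefixes and cluster names are hypothetical and do not reflect Interact's actual configuration.

```python
# Hypothetical path prefixes for high-volume endpoints (illustrative only).
HIGH_VOLUME_PREFIXES = ("/api/notifications", "/api/search")


def choose_cluster(path):
    """Return the name of the cluster that should serve the given request path."""
    # str.startswith accepts a tuple and matches if any prefix matches.
    if path.startswith(HIGH_VOLUME_PREFIXES):
        return "high-volume"
    return "general"


print(choose_cluster("/api/search?q=policy"))  # high-volume
print(choose_cluster("/home"))                 # general
```

With a rule of this kind, a flood of requests to the high-volume endpoints fills only the queues of the dedicated cluster, leaving the general web instances free to serve other traffic.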