Latency issues impacting a subset of EU Customers
Incident Report for Interact
Postmortem

At 9:05am on 3 November 2020, Interact’s Engineers were alerted to increased latency affecting EU customers. They followed standard operating procedures to troubleshoot and restore the service. No obvious candidate for the issue was found, as all resources appeared to be equally impacted; this pointed to either SQL or NFS, both of which are central to the architecture.

Interact’s Engineers first focused their efforts on SQL, given that its CPU utilisation was significantly lower than usual. The investigation yielded no actionable results beyond the observation that all metrics were lower than usual; there was no indication of a bottleneck at the SQL level, and waits, locks, queues and other metrics looked extremely healthy.

Efforts then moved to NFS, which also showed all signs of health across CPU usage, network in, network out, the indicative latency measured by the agents, and so on. Interact’s Engineers proceeded to restart the server by stopping the instance and starting it back up. The instance stopped successfully, but when it was immediately started back up, AWS returned an “Insufficient Capacity” error (explanation and root cause below). The start-up was attempted multiple times in succession without success.
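
For illustration only, the stop/start cycle described above is roughly equivalent to the following boto3 sketch; the region, instance ID and error handling are hypothetical and are shown purely to indicate where the InsufficientInstanceCapacity error surfaces.

    import boto3
    from botocore.exceptions import ClientError

    # Hypothetical identifiers for illustration only
    REGION = "eu-west-1"
    NFS_INSTANCE_ID = "i-0123456789abcdef0"

    ec2 = boto3.client("ec2", region_name=REGION)

    # Stop the NFS server instance and wait for it to reach the "stopped" state
    ec2.stop_instances(InstanceIds=[NFS_INSTANCE_ID])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[NFS_INSTANCE_ID])

    # Starting it again is the step at which AWS reported the capacity error
    try:
        ec2.start_instances(InstanceIds=[NFS_INSTANCE_ID])
    except ClientError as err:
        if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
            # AWS has no spare On-Demand capacity for this instance type in
            # the Availability Zone; retrying immediately may fail again.
            print("No On-Demand capacity available; retry later or change the instance type.")
        raise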

The instance was therefore changed to a different instance type, after which the machine provisioned and started successfully. This restored the service to operational health and normal latency. Interact’s Engineers continued to troubleshoot the problem and to monitor the metrics, health and behaviour of individual resources, and the issue was fully resolved as of 11:40am. SQL metrics promptly returned to normal once the NFS server was running on the new instance type.
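
A minimal sketch of the recovery step follows, using the same hypothetical identifiers as above; the replacement instance type shown is illustrative, as the exact type used is not stated in this report.

    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")
    NFS_INSTANCE_ID = "i-0123456789abcdef0"  # same hypothetical instance as above

    # The instance type can only be changed while the instance is stopped.
    ec2.modify_instance_attribute(
        InstanceId=NFS_INSTANCE_ID,
        InstanceType={"Value": "m5.4xlarge"},  # illustrative replacement type
    )

    # Start the instance on the new type and wait until it is running again.
    ec2.start_instances(InstanceIds=[NFS_INSTANCE_ID])
    ec2.get_waiter("instance_running").wait(InstanceIds=[NFS_INSTANCE_ID])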

Further analysis showed that the bottleneck created by the NFS server instance was consistently slowing down web requests that relied on on-disk data. This caused the IIS request queues to grow, so requests had to wait before being processed. Because fewer requests were reaching the database, the strain on SQL dropped, which explains the unusually low SQL usage metrics. Once the NFS issue was resolved, the queues quickly drained, operation returned to normal, and all service metrics returned to their healthy ranges.
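
As a purely illustrative example of this effect (the figures are hypothetical, not measurements from the incident): if a pod normally serves around 100 disk-backed requests per second at roughly 50 ms each, only about 5 requests are in flight at any moment; if the NFS bottleneck pushes each request to roughly 500 ms, around 50 requests are in flight, the IIS worker queue fills, and correspondingly fewer queries per second reach SQL, which is consistent with the unusually low SQL utilisation observed.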

The root cause of the problem was an AWS capacity shortage for c5n.4xlarge instances, which interfered with the provisioned NFS server in the EU hosting pod. Given the shortage, it is likely that resource starvation and noisy neighbours were affecting the hardware running the NFS server. Interact’s Engineering team is addressing this by removing the dependency on NFS in favour of a native S3 storage backend, which will decouple the system from shared EC2-based storage. This project is near completion and we are aiming to release it to production in the near future.
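
As an indicative sketch only of the direction of that change (the bucket name and object key are hypothetical, and this is not the production implementation), content that is currently read from and written to the shared NFS mount would instead be stored as S3 objects:

    import boto3

    s3 = boto3.client("s3", region_name="eu-west-1")
    BUCKET = "interact-eu-content"  # hypothetical bucket name

    # Write and read a document as an S3 object instead of a file on the NFS mount
    s3.put_object(Bucket=BUCKET, Key="documents/example.docx", Body=b"...")
    document = s3.get_object(Bucket=BUCKET, Key="documents/example.docx")["Body"].read()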

Excerpt from AWS Documentation on the “Insufficient Capacity” Issue

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/troubleshooting-launch.html#troubleshooting-launch-capacity

Description

You get the InsufficientInstanceCapacity error when you try to launch a new instance or restart a stopped instance.

Cause

If you get this error when you try to launch an instance or restart a stopped instance, AWS does not currently have enough available On-Demand capacity to fulfil your request.

Posted Nov 03, 2020 - 15:31 UTC

Resolved
We sincerely apologise for the disruption to your intranet this morning. Our team have identified and implemented a fix. After a number of checks we believe this incident is now resolved. An RCA for this outage will be added as soon as our investigations have been completed.
Posted Nov 03, 2020 - 11:44 UTC
Identified
The issue has been identified and a fix is being implemented. Engineers will continue to monitor the service and provide updates.
Posted Nov 03, 2020 - 11:08 UTC
Update
Our team are continuing to investigate the issue.
Posted Nov 03, 2020 - 09:56 UTC
Investigating
Interact Engineers are currently investigating higher than normal latency in the EU, which is causing slowness and intermittent service disruption across all EU customers.
Posted Nov 03, 2020 - 09:30 UTC
This incident affected: EMEA Public Cloud.