Summary
On 24 September 2019 at 8:15am (UTC), Interact engineers were alerted to increased CPU usage in the EU Pod, within the services used for storing temporary session state. This caused increased latency and high error rates from 8:30am (UTC) until approximately 9:20am (UTC). During this window, customers hosted in the EU Pod experienced a disruption to the service, with users seeing slow response times and elevated error rates.
Investigation and Root Cause
The increase in latency was caused by an issue with the session store, which left the application unable to generate new sessions for users attempting to access and log in to the service.
Interact engineers followed the established playbook for reducing latency related to the session store. The first step is to rotate the application servers that provide the login process: the number of servers in rotation and available to the load balancers is first doubled, and the older servers are then removed from use. Ordinarily this releases the high demand on the session store, and CPU usage falls back within acceptable parameters without any loss or degradation of service. In this instance CPU usage continued to rise, so engineers moved on to the next stage in the playbook: deleting the shard that is in difficulty and automatically rebalancing its load across the remaining shards.
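The two playbook stages described above can be sketched as follows. This is a minimal, hypothetical illustration: the `ServerPool` and `SessionStore` classes and their methods are assumptions for the sake of the example, not Interact's actual tooling.

```python
class ServerPool:
    """Application servers registered with the load balancers (illustrative)."""

    def __init__(self, servers):
        self.servers = list(servers)

    def rotate(self):
        # Stage 1 of the playbook: double the number of servers available
        # to the load balancers, then remove the older servers from use.
        replacements = [f"{s}-replacement" for s in self.servers]
        self.servers = self.servers + replacements  # doubled pool
        self.servers = replacements                 # old servers removed
        return self.servers


class SessionStore:
    """Sharded session store; load is rebalanced when a shard is deleted."""

    def __init__(self, shards):
        self.shards = dict(shards)  # shard name -> share of total load

    def delete_shard(self, name):
        # Stage 2 of the playbook: delete the shard that is in difficulty
        # and spread its load evenly across the remaining shards.
        freed = self.shards.pop(name)
        per_shard = freed / len(self.shards)
        for shard in self.shards:
            self.shards[shard] += per_shard
        return self.shards
```

For example, rotating a pool of two servers replaces both, and deleting a shard carrying 40% of the load redistributes that 40% evenly across the survivors.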
Once this was completed, the relevant application servers were rotated again and the shared session store was able to process all requests as normal. Following this, error rates dropped and latency returned to normal levels.
Resolution and Mitigation Steps
Following the incident, Interact has updated its deployment runbook, adding a check that identifies earlier in the process whether a shard needs to be deleted, minimising the impact of any recurrence.
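One way such a check could work is sketched below. The function name, threshold, and sampling approach are all assumptions for illustration; the report does not specify how the runbook check is implemented.

```python
# Assumed threshold, not taken from the report.
CPU_THRESHOLD = 80.0  # percent

def shard_needs_deletion(cpu_samples, threshold=CPU_THRESHOLD):
    """Return True when a shard's CPU usage is above the threshold and
    still rising after the initial server rotation, i.e. when rotation
    alone has not relieved pressure and the shard should be deleted.

    cpu_samples: CPU readings (percent) taken in chronological order.
    """
    if len(cpu_samples) < 2:
        return False  # not enough data to judge a trend
    rising = all(b > a for a, b in zip(cpu_samples, cpu_samples[1:]))
    return rising and cpu_samples[-1] > threshold
```

With this sketch, a shard whose CPU climbs from 82% to 93% after rotation would be flagged for deletion immediately, while one falling from 90% to 55% would not, avoiding the delay seen in this incident.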