Summary
On 24 September 2019 at 8:15am (UTC), Interact engineers were alerted to increased CPU usage in the EU Pod, within the services used for storing temporary session state. This caused increased latency and high error rates from 8:30am (UTC) until approximately 9:20am (UTC). During this window, customers hosted in the EU Pod experienced a disruption to the service, with users seeing slow response times and elevated error rates.
Investigation and Root Cause
The increase in latency was caused by an issue with the session store, which left the application unable to generate new sessions for users attempting to access and log in to the service.
Interact engineers followed the established playbook for reducing latency related to the session store. The first step is to rotate the application servers that provide the login process: the number of servers in rotation and available to the load balancers is first doubled, and the older servers are then removed from use. Ordinarily this releases the high demand on the session store, and CPU usage falls back within acceptable parameters without any loss or degradation of service. In this instance CPU usage continued to rise, so engineers moved on to the next stage in the playbook: deleting the shard that is in difficulty and automatically rebalancing its load across the remaining shards.
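The two playbook stages described above can be sketched as follows. This is a minimal, hypothetical illustration: the `ServerPool` and `SessionStore` classes and their methods are assumptions for the sake of the example, not Interact's actual tooling.

```python
class ServerPool:
    """Application servers registered with the load balancers (illustrative)."""

    def __init__(self, servers):
        self.servers = list(servers)

    def rotate(self):
        # Stage 1 of the playbook: double the number of servers available
        # to the load balancers, then remove the older servers from use.
        replacements = [f"{s}-replacement" for s in self.servers]
        self.servers = self.servers + replacements  # doubled pool
        self.servers = replacements                 # old servers removed
        return self.servers


class SessionStore:
    """Sharded session store; load is rebalanced when a shard is deleted."""

    def __init__(self, shards):
        self.shards = dict(shards)  # shard name -> share of total load

    def delete_shard(self, name):
        # Stage 2 of the playbook: delete the shard that is in difficulty
        # and spread its load evenly across the remaining shards.
        freed = self.shards.pop(name)
        per_shard = freed / len(self.shards)
        for shard in self.shards:
            self.shards[shard] += per_shard
        return self.shards
```

For example, rotating a pool of two servers replaces both, and deleting a shard carrying 40% of the load redistributes that 40% evenly across the survivors.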
Once this was completed, the relevant application servers were rotated again and the shared session store was able to process all requests as normal. Following this, error rates dropped and latency returned to normal levels.
Resolution and Mitigation Steps
Following the incident, Interact has updated its deployment runbook, adding a check that identifies earlier in the process whether a shard needs to be deleted, minimising the impact of any recurrence.
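One way such a check could work is sketched below. The function name, threshold, and sampling approach are all assumptions for illustration; the report does not specify how the runbook check is implemented.

```python
# Assumed threshold, not taken from the report.
CPU_THRESHOLD = 80.0  # percent

def shard_needs_deletion(cpu_samples, threshold=CPU_THRESHOLD):
    """Return True when a shard's CPU usage is above the threshold and
    still rising after the initial server rotation, i.e. when rotation
    alone has not relieved pressure and the shard should be deleted.

    cpu_samples: CPU readings (percent) taken in chronological order.
    """
    if len(cpu_samples) < 2:
        return False  # not enough data to judge a trend
    rising = all(b > a for a, b in zip(cpu_samples, cpu_samples[1:]))
    return rising and cpu_samples[-1] > threshold
```

With this sketch, a shard whose CPU climbs from 82% to 93% after rotation would be flagged for deletion immediately, while one falling from 90% to 55% would not, avoiding the delay seen in this incident.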