On 16th July 2019 at 7:45am (UTC), Interact engineers identified increased latency within the EU pod. Between 7:45am (UTC) and 8:30am (UTC), customers hosted in the EU pod experienced a disruption to the service, with users being presented with slow response times and high error rates.
Investigation and Root Cause
This increase in latency was caused by a management service failing to release transaction locks and their related data connections on a number of queuing processes used for offline processing and service management.
Interact engineers followed the established process for reducing latency, which involves rotating the application servers used to provide the service. The rotation first doubles the number of servers in rotation and available to the load balancers, and then removes the older servers from use. Unfortunately, the rotation took place during a busy part of the day, so adding servers significantly increased the number of connections open to the primary SQL cluster. Combined with the locks held by the management service, this caused an excessive number of data connections to be opened and not returned to the pool of available connections. As a result, the application servers were unable to connect to the data services and respond to requests in a timely manner.
Once this was identified, the connections were manually cleared and returned to the pool, and the service returned to normal latency levels.
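The connection pressure described above can be illustrated with some simple arithmetic. The sketch below is not Interact's tooling; the function name, server count, pool size, and leaked-connection figure are all hypothetical values chosen only to show why doubling the fleet during rotation, on top of connections that are never released, can exhaust a database's connection capacity.

```python
def peak_connections(servers: int, pool_size: int, leaked: int = 0) -> int:
    """Worst-case open connections while old and new servers overlap.

    Assumes (hypothetically) that every application server can fill its
    own pool, and that leaked connections are held in addition to those.
    """
    servers_during_rotation = servers * 2  # old fleet plus its replacements
    return servers_during_rotation * pool_size + leaked


# Hypothetical figures, for illustration only.
steady_state = 10 * 50                              # 10 servers, 50 connections each
during_rotation = peak_connections(10, 50)          # fleet doubled
with_leak = peak_connections(10, 50, leaked=120)    # plus unreleased connections

print(steady_state, during_rotation, with_leak)
```

With these assumed numbers, a cluster sized for the steady-state load would need to absorb more than twice as many connections at the peak of the rotation.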
Resolution and Mitigation Steps
Following the incident, Interact has updated its deployment runbook, adding a check to ensure that data connection limits are not exceeded before application servers are rotated. Additionally, an automated maintenance task has been implemented to ensure connections are returned to the pool on a regular basis.
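A pre-rotation check of the kind described above might look like the following minimal sketch. It is an assumption about how such a gate could be written, not the actual runbook step: the `RotationPlan` fields, the `safe_to_rotate` function, and the 10% safety margin are all hypothetical, and in practice the current connection count and the limit would be read from the database cluster itself.

```python
from dataclasses import dataclass


@dataclass
class RotationPlan:
    open_connections: int         # connections currently open on the cluster
    new_servers: int              # servers about to be added to rotation
    pool_size_per_server: int     # maximum pool size of each new server
    connection_limit: int         # hard connection limit of the cluster
    safety_margin_pct: int = 10   # hypothetical headroom kept in reserve


def safe_to_rotate(plan: RotationPlan) -> bool:
    """Return True if adding the new servers stays within the budget."""
    projected = plan.open_connections + plan.new_servers * plan.pool_size_per_server
    budget = plan.connection_limit * (100 - plan.safety_margin_pct) // 100
    return projected <= budget


# Hypothetical figures: the same rotation is allowed when the cluster is
# quiet, and refused when unreleased connections have already built up.
quiet = RotationPlan(open_connections=400, new_servers=10,
                     pool_size_per_server=50, connection_limit=1000)
busy = RotationPlan(open_connections=520, new_servers=10,
                    pool_size_per_server=50, connection_limit=1000)

print(safe_to_rotate(quiet), safe_to_rotate(busy))
```

Projecting the worst case, in which every new server fills its pool, is a deliberately conservative choice: it may block a rotation that would have succeeded, but it avoids repeating the failure mode described in this report.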