Service Disruption impacting our EU customers
Incident Report for Interact
Postmortem

Summary

On 16th July 2019 at 7:45am (UTC), Interact engineers identified increased latency within the EU pod. Between 7:45am (UTC) and 8:30am (UTC), customers hosted in the EU pod experienced a service disruption, with users seeing slow response times and high error rates.

Investigation and Root Cause

This increase in latency was caused by a management service failing to release transaction locks and related data connections on a number of queuing processes used for offline processing and service management.

Interact engineers followed the established process for reducing latency, which involves rotating the application servers used to provide the service. The rotation first doubles the number of servers in rotation and available to the load balancers, then removes the older servers from use.

Unfortunately, the rotation took place during a busy part of the day, so adding servers significantly increased the number of connections open to the primary SQL cluster. Combined with the locks held by the management service, this caused an excessive number of data connections to be opened and not returned to the pool of available connections. As a result, the application servers were unable to connect to the data services and respond to requests in a timely manner.
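To illustrate why doubling the fleet matters, the arithmetic can be sketched as follows. This is a minimal illustration only: the function name, the per-server pool size, and every number used are assumptions chosen for the example, not figures from the incident.

```python
def peak_connections(servers_before: int, pool_size_per_server: int, leaked: int = 0) -> int:
    """Worst-case open connections while old and new servers overlap.

    Assumes every server can open up to `pool_size_per_server` connections
    and that `leaked` connections are held open and never returned.
    """
    # The fleet is doubled before the older servers are removed.
    servers_during_rotation = servers_before * 2
    return servers_during_rotation * pool_size_per_server + leaked


# Hypothetical numbers: 10 servers, up to 50 pooled connections each,
# and 120 connections held open by unreleased locks.
steady_state = 10 * 50
during_rotation = peak_connections(10, 50, leaked=120)
print(steady_state, during_rotation)
```

Under these assumed numbers, the worst-case demand on the database during the overlap is more than twice the steady-state figure, which is why a rotation at a busy time can push a cluster past its connection limit.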

Once this was identified, the connections were manually cleared and returned to the pool, and the service returned to normal latency levels.

Resolution and Mitigation Steps

Following the incident, Interact has updated its deployment runbook, adding a check to ensure that data connection limits are not exceeded before rotating application servers. Additionally, an automated maintenance task has been implemented to ensure connections are returned to the pool on a regular basis.
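The two mitigations can be sketched roughly as below. This is not Interact's actual tooling: the function names, the `Lease` record, the headroom margin, and the age threshold are all hypothetical, and a real check would read the current connection count and limit from the database itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Lease:
    """A connection checked out of the pool (hypothetical record)."""
    connection_id: str
    checked_out_at: float  # seconds since epoch


def can_rotate(current_open: int, servers_to_add: int, pool_size_per_server: int,
               connection_limit: int, headroom: float = 0.1) -> bool:
    """Pre-rotation check: would adding servers exceed the connection limit?

    `headroom` reserves a fraction of the limit as a safety margin (assumed value).
    """
    projected = current_open + servers_to_add * pool_size_per_server
    return projected <= connection_limit * (1 - headroom)


def stale_leases(leases: List[Lease], now: float, max_age_seconds: float) -> List[Lease]:
    """Maintenance pass: find connections held longer than `max_age_seconds`."""
    return [lease for lease in leases if now - lease.checked_out_at > max_age_seconds]


# Illustrative usage with made-up numbers.
print(can_rotate(current_open=500, servers_to_add=10,
                 pool_size_per_server=50, connection_limit=1200))
print(stale_leases([Lease("a", 0.0), Lease("b", 90.0)], now=100.0, max_age_seconds=60.0))
```

In this sketch the check projects the worst case before any server is added, so a rotation that would breach the limit is blocked up front rather than discovered under load.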

Posted Jul 18, 2019 - 14:37 UTC

Resolved
This incident has now been resolved and a full postmortem will be issued within 5 working days.

Thank you for your patience.
Posted Jul 16, 2019 - 08:48 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 16, 2019 - 08:36 UTC
Investigating
Interact engineers are currently investigating higher than normal latency within the EU, which is causing slowness and service disruption for all EU customers.

Engineers are working on a resolution at the highest priority and will continue to post updates on this page.

We apologise for the inconvenience.

Interact Support
Tel: 0161 9273223
Posted Jul 16, 2019 - 08:30 UTC
This incident affected: Europe Public Cloud 1 and Europe Public Cloud 2.