Latency issues impacting a subset of EU Customers

Incident Report for Interact

Postmortem

Summary

On 17 October 2019 at 09:20 UTC, Interact's monitoring services alerted engineers to increased latency within the EU pod. As a result, a subset of users were unable to access their intranet or experienced high latency.

Investigation and Root Cause

Following the standard investigation playbook, engineers established that a number of web servers were unable to access the shared file store within Interact. As a result, a number of users received errors when accessing their intranet, and users on the affected servers experienced extremely high latency. Interact engineers immediately detached the affected servers and started new instances so that the existing estate could handle the current demand. Starting these instances caused a spike of traffic to the shared file store, which became unresponsive, at which point all application services returned errors.
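The start-up surge described above is a common pattern when many instances come online at once and all read from the same shared store. The sketch below is illustrative only and is not taken from Interact's tooling: it shows one general way to replace instances in small batches with a pause between them. The function name, the `replace` callback, the batch size and the pause length are all hypothetical.

```python
import time


def rotate_in_batches(instances, replace, batch_size=2, pause_s=30.0, sleep=time.sleep):
    """Replace instances a few at a time (hypothetical helper, not Interact's process).

    `replace` is a caller-supplied function that swaps out one instance.
    Pausing between batches gives a shared dependency time to absorb
    the warm-up traffic from each group before the next group starts.
    """
    batches = [instances[i:i + batch_size] for i in range(0, len(instances), batch_size)]
    for index, batch in enumerate(batches):
        for instance in batch:
            replace(instance)
        # No pause is needed after the final batch.
        if index < len(batches) - 1:
            sleep(pause_s)
    return batches
```

Smaller batches and longer pauses trade a slower rotation for a lower peak load on the shared dependency.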

Resolution and Mitigation Steps

To resolve this issue, the shared file server was restarted and all application server instances were rotated to reset the internal flow of traffic. Following this, all services reported healthy, latency and error rates returned to normal levels, and engineers moved to a monitoring status.

Interact continues to monitor this closely and has adjusted alerting thresholds to help identify and mitigate this issue should it recur.
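As a rough illustration of what threshold-based alerting on latency and error rates can look like, the sketch below checks a sliding window of recent requests against two limits. It is an assumption-laden example, not Interact's monitoring configuration: the class name, the window size and both threshold values are hypothetical.

```python
from collections import deque
from statistics import quantiles


class LatencyAlert:
    """Sliding-window check on p95 latency and error rate (hypothetical thresholds)."""

    def __init__(self, window=100, p95_ms=800.0, max_error_rate=0.05, min_samples=20):
        self.samples = deque(maxlen=window)  # each entry is (latency_ms, is_error)
        self.p95_ms = p95_ms
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, latency_ms, is_error=False):
        self.samples.append((latency_ms, is_error))

    def breaches(self):
        """Return the names of the thresholds currently exceeded."""
        if len(self.samples) < self.min_samples:
            return []  # too little data to judge
        latencies = [latency for latency, _ in self.samples]
        # quantiles(n=20) yields 19 cut points; the last one is the 95th percentile.
        p95 = quantiles(latencies, n=20)[-1]
        error_rate = sum(1 for _, is_error in self.samples if is_error) / len(self.samples)
        exceeded = []
        if p95 > self.p95_ms:
            exceeded.append("latency")
        if error_rate > self.max_error_rate:
            exceeded.append("errors")
        return exceeded
```

Lowering `p95_ms` or `max_error_rate` makes the check fire earlier, at the cost of more false alarms during normal traffic variation.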

Posted Oct 29, 2019 - 10:10 UTC

Resolved

The incident has been resolved and engineers will provide a full Root Cause Analysis shortly.

Posted Oct 17, 2019 - 10:48 UTC

Update

We are continuing to monitor for any further issues.

Posted Oct 17, 2019 - 10:47 UTC

Monitoring

Engineers have implemented a fix and are monitoring the results.

Posted Oct 17, 2019 - 10:09 UTC

Update

The issue has been identified and engineers are implementing a fix.

Posted Oct 17, 2019 - 09:50 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Oct 17, 2019 - 09:39 UTC

Investigating

Interact engineers are currently investigating higher than normal latency within the EU, which is causing slowness and intermittent service disruption across all EU customers.

Posted Oct 17, 2019 - 09:28 UTC

This incident affected: EMEA Public Cloud.