Summary
On 17th October 2019 at 09:20 (UTC), Interact's monitoring services alerted engineers to increased latency within the EU pod. A subset of users were unable to access their intranet or experienced high latency.
Investigation and Root Cause
Following the standard investigation playbook, engineers determined that a number of web servers were unable to access the shared file store within Interact. As a result, a number of users received errors when accessing their intranet, and users on the affected servers experienced extremely high latency. Interact engineers immediately detached the affected servers and started new instances to ensure the estate was able to handle the current demand. This process caused a spike of traffic to the shared file store, which became unresponsive, at which point all application services returned errors.
Resolution and Mitigation Steps
To resolve this issue, the shared file server was restarted and a full rotation of all application server instances was executed to reset the internal flow of traffic. Following this, all services reported healthy, latency and error rates returned to normal levels, and engineers moved to a monitoring status.
Interact continues to monitor this closely and has adjusted alerting thresholds to help identify and mitigate this issue should it recur.
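As an illustration only, a latency alerting threshold of the kind mentioned above might be evaluated along the following lines. This is a minimal sketch, not Interact's actual monitoring configuration: the class name `LatencyAlert`, the window size, the 500 ms threshold and the minimum sample count are all hypothetical values chosen for the example.

```python
from collections import deque
from statistics import quantiles


class LatencyAlert:
    """Rolling-window latency check (hypothetical sketch, illustrative values)."""

    def __init__(self, window_size=100, p95_threshold_ms=500.0, min_samples=20):
        # Keep only the most recent `window_size` latency samples.
        self.samples = deque(maxlen=window_size)
        self.p95_threshold_ms = p95_threshold_ms
        self.min_samples = min_samples

    def record(self, latency_ms):
        """Add one request latency measurement, in milliseconds."""
        self.samples.append(latency_ms)

    def p95(self):
        """Return the 95th percentile of the window, or None if too few samples."""
        if len(self.samples) < self.min_samples:
            return None
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        return quantiles(self.samples, n=20)[-1]

    def should_alert(self):
        """True when the 95th percentile latency exceeds the threshold."""
        value = self.p95()
        return value is not None and value > self.p95_threshold_ms
```

Lowering `p95_threshold_ms` or shrinking the window makes such a check fire earlier, at the cost of more false alarms. The right balance depends on the normal latency profile of the service.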