On 29th August 2019 at 10:00am (UTC), Interact engineers identified increased latency within the EU pod, which left a subset of users unable to access their intranet.
Investigation and Root Cause
Interact’s monitoring and alerting notified our infrastructure team of a spike in latency. We immediately investigated our web server infrastructure, and the logs showed that a sudden increase in load had caused service disruption for a subset of users.
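As an illustration of the kind of check that can raise such an alert, the following is a minimal sketch and not Interact's actual monitoring. It assumes latency samples in milliseconds and a known baseline. The function name, the threshold factor and the minimum sample count are all hypothetical.

```python
from statistics import median


def is_latency_spike(samples_ms, baseline_ms, factor=3.0, min_samples=5):
    """Return True when recent latency looks like a spike.

    Hypothetical rule: the median of the recent samples must exceed
    `factor` times the baseline. The median is used so that a single
    slow request does not trigger an alert on its own.
    """
    if len(samples_ms) < min_samples:
        # Too little data to judge; do not alert.
        return False
    return median(samples_ms) > factor * baseline_ms


# Example with made-up numbers: baseline of 10 ms, recent samples around 60 ms.
recent = [50, 60, 55, 70, 65]
print(is_latency_spike(recent, baseline_ms=10))
```

A real deployment would typically take its samples from an existing metrics system rather than an in-memory list.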
Resolution and Mitigation Steps
The issue was traced to increased load on the NFS drive, which acts as a document store for Interact, as a result of an unforeseen load spike. This caused an increase in latency in the EU region, and all requests using data on the NFS drive were affected by the degraded performance. Service was fully restored by restarting the NFS service and rotating the web server instances to drain the requests and reset the internal flow of traffic. All services reported healthy, and no errors were thrown during the incident. We have not experienced this issue since restarting the NFS service and rotating the web server instances, even under increased load. This suggests an intermittent hardware-level issue that was resolved by rotating the instances and restarting the NFS service.
Interact continues to monitor this closely, and our engineers rotate the instances on a regular basis to mitigate the risk of hardware-level issues or fluctuations on long-running instances.
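A regular rotation policy of this kind can be sketched as follows. This is an illustrative example under stated assumptions, not a description of Interact's tooling. The instance identifiers, the maximum age and the batch size are hypothetical. The sketch only selects which instances are due and leaves the actual drain and replacement to whatever platform is in use.

```python
from datetime import datetime, timedelta, timezone


def instances_due_for_rotation(launch_times, now, max_age, batch_size):
    """Pick the oldest instances that have exceeded max_age.

    launch_times: dict mapping an instance id to its timezone-aware launch time.
    batch_size limits how many instances are rotated at once, so that the
    remaining instances can keep serving traffic while the batch drains.
    """
    due = [
        (launched, instance_id)
        for instance_id, launched in launch_times.items()
        if now - launched >= max_age
    ]
    due.sort()  # oldest launch time first
    return [instance_id for _, instance_id in due[:batch_size]]


# Example with made-up instances and a hypothetical 7-day maximum age.
now = datetime(2019, 8, 29, 10, 0, tzinfo=timezone.utc)
fleet = {
    "web-1": now - timedelta(days=10),
    "web-2": now - timedelta(days=2),
    "web-3": now - timedelta(days=8),
}
print(instances_due_for_rotation(fleet, now, timedelta(days=7), batch_size=1))
```

Rotating in small batches, oldest first, keeps most of the fleet in service while the selected instances are drained and replaced.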