EU - Increased Error Rates

Incident Report for Interact

Postmortem

As part of an automated build deployment, a faulty set of updated servers hosting the Interact application was put into rotation at approximately 2pm UTC.

The service was first affected from 2.10pm, with the peak issues occurring between 2.30pm and 3pm as hardware was being rotated. The EU, US and HIPAA pods were all affected to varying degrees. The EU pod in particular encountered a high level of errors and service degradation while we were ascertaining the root cause and triggering resolution steps. The US and HIPAA pods did not suffer the same level of service impact; they showed some similar symptoms, such as the timeline rendering empty, but with significantly fewer errors.

As soon as the issue was identified, engineers reverted to the last healthy configuration. Our investigation showed that a transform which controls some additional keys and settings (such as shared session state and other Amazon services utilised by the timeline feature) was not correctly applied to the new build. Unfortunately this was not flagged as an error during the build process and, as the build passed preliminary testing, it was subsequently fully rotated into service.

We believe the issue was caused by a build process dependency (Microsoft/NuGet) that has recently been updated: https://www.nuget.org/packages/Microsoft.Web.Xdt/. This external tool is downloaded with each build and is used to update the Interact settings specific to each pod. In this case a recent update to the tool appears to have been packaged with a beta DLL, which in turn caused any transform utilising it to fail.

To address this we have changed the build process so that it no longer relies on the external source. We will always utilise a fully tested version stored locally to the process, eliminating the external dependency. We are also updating the relevant build scripts to better flag and fail a build when similar issues occur in the future.
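
As an illustration of the kind of build check described above, the following is a minimal sketch only, not the actual Interact build script. It assumes the transformed output is a standard .NET config file with an appSettings section, and the key names used in the usage note are hypothetical placeholders. The idea is to inspect the transformed file after the transform step and fail the build if any expected setting is missing or empty.

```python
import sys
import xml.etree.ElementTree as ET


def missing_app_settings(config_xml, required_keys):
    """Return the required keys that are absent or empty in <appSettings>.

    config_xml is the transformed config as str or bytes; required_keys is
    a list of setting names the transform is expected to have written.
    """
    root = ET.fromstring(config_xml)
    # Map each <add key="..." value="..."/> under <appSettings> to its value.
    present = {
        add.get("key"): add.get("value", "")
        for add in root.findall("./appSettings/add")
    }
    # A key counts as missing if it is not present or its value is empty.
    return [key for key in required_keys if not present.get(key)]


def check_transformed_config(path, required_keys):
    """Return a process exit code: 0 if all keys are set, 1 otherwise."""
    with open(path, "rb") as fh:
        missing = missing_app_settings(fh.read(), required_keys)
    if missing:
        print(
            "Transform check failed, missing settings: " + ", ".join(missing),
            file=sys.stderr,
        )
        return 1
    return 0
```

A build step could call, for example, sys.exit(check_transformed_config("Web.config", ["SessionStateEndpoint", "TimelineQueueUrl"])) so that a silently skipped transform stops the build before servers are rotated into service. Both key names and the file path here are assumptions made for the example.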

Posted Feb 20, 2018 - 11:54 UTC

Resolved

Issues around increased error rates on EU public cloud are now resolved.

Posted Feb 19, 2018 - 15:58 UTC

Monitoring

Engineers have successfully rotated affected servers and error rates have subsided. We will continue to monitor this incident to ensure resolution.

Posted Feb 19, 2018 - 15:17 UTC

Update

Engineers have isolated an issue around the timeline feature which is causing increased error rates. Relevant servers are currently being replaced. Updates to follow.

Posted Feb 19, 2018 - 14:56 UTC

Identified

Engineers have isolated an issue around the timeline feature which is causing an increased error rate. Relevan

Posted Feb 19, 2018 - 14:55 UTC

Investigating

Engineers are investigating reports of increased error rates in the EU pod. Updates to follow.

Posted Feb 19, 2018 - 14:47 UTC