As part of an automated build deployment, a faulty set of updated servers hosting the Interact application was put into rotation at approximately 2pm UTC.
The service was first affected from 2.10pm, with the peak issues occurring between 2.30pm and 3pm as hardware was being rotated. The EU, US and HIPAA pods were all affected to varying degrees. The EU pod in particular encountered a high level of errors and service degradation while we were ascertaining the root cause and triggering resolution steps. The US and HIPAA pods did not suffer the same level of service impact, but they did encounter some similar symptoms, such as the timeline rendering empty, with significantly fewer errors.
As soon as the issue was identified, engineers reverted to the last healthy configuration. Our investigation showed that a transform which controls some additional keys and settings (such as shared session state and other Amazon services utilised by the timeline feature) was not correctly applied to the new build. Unfortunately, this was not flagged as an error during the build process and, as the build passed preliminary testing, it was subsequently fully rotated into service.
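A failure of this kind (a transform that silently leaves settings out) can in principle be caught by inspecting the transformed configuration before the build is rotated in. The following is a minimal sketch only, not the check used in the Interact build: it assumes the settings live under appSettings in a web.config-style file, and the key names are hypothetical because the real ones are not listed in this report.

```python
import xml.etree.ElementTree as ET

# Hypothetical key names for illustration; the real settings are not
# named in this report.
REQUIRED_KEYS = ["SessionStateStore", "TimelineQueueUrl"]


def missing_app_settings(config_xml, required_keys=REQUIRED_KEYS):
    """Return the required appSettings keys that are absent or empty."""
    root = ET.fromstring(config_xml)
    # Map each <add key="..." value="..."/> entry under <appSettings>.
    present = {
        add.get("key"): add.get("value", "")
        for add in root.findall("./appSettings/add")
    }
    return [key for key in required_keys if not present.get(key)]


def check_transformed_config(config_xml):
    """Return a process exit code: 0 if all settings exist, 1 otherwise."""
    missing = missing_app_settings(config_xml)
    if missing:
        print("Transform check failed; missing settings: " + ", ".join(missing))
        return 1
    return 0
```

A build step could pass the exit code to sys.exit so that a missing setting stops the build rather than letting it proceed to preliminary testing.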
We believe the issue was caused by a recently updated build process dependency (Microsoft/NuGet): https://www.nuget.org/packages/Microsoft.Web.Xdt/. This external tool/dependency is downloaded with each build to update Interact settings specific to each pod. In this case, a recent update to this tool appears to have been packaged with a beta DLL, which in turn caused any transform utilising it to fail.
To address this, we have changed the build process so that it no longer relies on the external source; we will always utilise a fully tested version stored locally to the process, eliminating the external dependency. We are also updating the relevant build scripts to better flag and fail a build when similar issues occur in the future.
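One way to make sure the locally stored copy is the fully tested one is to compare it against a recorded hash before each build. This is an illustrative sketch under our own assumptions (a SHA-256 value recorded when the tool was tested, and hypothetical function names); it is not a description of the actual build scripts.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_pinned_tool(path, expected_sha256):
    """Stop the build (non-zero exit) if the local tool is missing or altered."""
    path = Path(path)
    if not path.is_file():
        raise SystemExit(f"Pinned tool not found: {path}")
    actual = sha256_of(path)
    # expected_sha256 is assumed to be the hash recorded at test time.
    if actual != expected_sha256.lower():
        raise SystemExit(f"Pinned tool hash mismatch for {path}: got {actual}")
```

Raising SystemExit with a message makes the script exit with a non-zero status, so a calling build step treats the mismatch as a failure instead of continuing.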