Intermittent Service Disruption impacting a subset of our EU customers
Incident Report for Interact
Postmortem

Summary
Over recent weeks Interact engineers have been investigating short, intermittent episodes of service disruption across a subset of customers within the EU. Symptoms included increased page load times and service unavailable messages (503) for some end users of the service.

Investigations & Root Cause
Interact utilises an in-memory data store to hold session data. Sessions are the way in which web applications store state information for logged-in users between page requests. Interact's in-memory data store was under-provisioned for peak usage times and configured in a way that prevented the auto-scaling/failover of the service (currently fails over to read-only).
Interact took steps to address the stability of the service by isolating sign-in requests to dedicated servers. Whilst this recovered the service for most users, it inadvertently degraded the ability to sign into the service.
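
For readers unfamiliar with session stores, the following is a minimal illustrative sketch in Python of the kind of failure mode described above. It is not Interact's implementation: the class, its capacity limit and the read-only flag are hypothetical names chosen for illustration. It shows how a store that is full, or that has failed over to read-only, can keep serving existing sessions while rejecting the writes needed to create new ones.

```python
import time
import uuid


class CapacityError(Exception):
    """Raised when the store has no room for another session."""


class ReadOnlyError(Exception):
    """Raised when a write is attempted while the store is read-only."""


class SessionStore:
    """Hypothetical in-memory session store with a fixed capacity."""

    def __init__(self, max_sessions):
        self.max_sessions = max_sessions
        self.read_only = False  # set to True to simulate a read-only failover
        self._sessions = {}

    def create(self, user):
        # Creating a session is a write, so it fails in read-only mode.
        if self.read_only:
            raise ReadOnlyError("store is read-only; cannot create session")
        # An under-provisioned store runs out of room at peak usage.
        if len(self._sessions) >= self.max_sessions:
            raise CapacityError("session store is at capacity")
        session_id = uuid.uuid4().hex
        self._sessions[session_id] = {"user": user, "created": time.time()}
        return session_id

    def get(self, session_id):
        # Reads still succeed in read-only mode.
        return self._sessions.get(session_id)
```

In this sketch, a user who already holds a session can still be looked up with get() after read_only is set, whereas a call to create() for a new user raises an error.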

Resolution and Mitigation Steps
Interact has increased the capacity of its in-memory data store to 4 times its previous capacity. Additional changes will be made over the coming weeks to implement an active-active in-memory data store to prevent a recurrence.
Additionally, during this period, Interact has refreshed its underlying hardware and made improvements to the internal caching of assets to improve the responsiveness of the application.

Posted about 1 month ago. Dec 19, 2018 - 13:50 UTC

Resolved
This incident has been resolved.
Posted about 1 month ago. Dec 19, 2018 - 13:49 UTC
Update
Engineers identified the root cause of the issue yesterday and a fix was implemented overnight. We are monitoring the results closely and once the fix has been confirmed a full post mortem will be issued.

Once again we would like to thank you very much for your continued patience during this frustrating period.
Posted about 1 month ago. Dec 13, 2018 - 09:25 UTC
Update
Interact engineers have now identified the root cause and implemented a workaround, with the full permanent fix being implemented overnight. As soon as this is in place a full post mortem will be issued.
Once again thank you very much for your patience during this period and many apologies for the disruption this has caused.

Kind Regards

Victoria Hamblin
Head of Technical Support
Posted about 1 month ago. Dec 12, 2018 - 17:03 UTC
Update
We are continuing to monitor for any further issues.
Posted about 1 month ago. Dec 12, 2018 - 08:54 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted about 1 month ago. Dec 11, 2018 - 15:53 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted about 1 month ago. Dec 11, 2018 - 15:30 UTC
Update
We are continuing to investigate this issue.
Posted about 1 month ago. Dec 11, 2018 - 15:25 UTC
Investigating
Interact engineers are currently investigating ongoing issues within the EU which are causing intermittent episodes of service disruption across a subset of EU customers. Errors which users may experience include 'service unavailable' and '500 errors'.

Engineers are working relentlessly on a resolution as the highest priority and will continue to post updates on this page. As soon as engineers have identified a permanent solution we will issue a full post mortem.

We apologise for the inconvenience. If you would like to discuss this in more detail with an Interact engineer, please do not hesitate to contact us.

Once again we thank you for your continued patience while engineers work on a full resolution.

Interact Support
Tel: 0161 9273223
Posted about 1 month ago. Dec 11, 2018 - 15:25 UTC
This incident affected: Europe Public Cloud 1 and Europe Public Cloud 2.