Over recent weeks, Interact engineers have been investigating short, intermittent episodes of service disruption affecting a subset of customers within the EU. Symptoms included increased page load times and HTTP 503 (Service Unavailable) errors for some end users of the service.
Investigations & Root Cause
Interact utilises an in-memory data store to hold session data. Sessions are the way in which web applications store state information for logged-in users between page requests. Interact's in-memory data store was under-provisioned for peak usage times and was configured in a way that prevented the service from auto-scaling or failing over fully (at present it fails over to read-only).
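As a rough illustration of why capacity and a read-only failover matter for session data, the sketch below models a toy session store. It is not Interact's implementation: the class names, the `max_sessions` limit, the `read_only` flag and the choice to reject new sessions when full are all hypothetical assumptions made for the example.

```python
class ReadOnlyError(Exception):
    """Raised when a write is attempted while the store is read-only."""


class CapacityError(Exception):
    """Raised when the store has no room for a new session."""


class SessionStore:
    """Toy in-memory session store (hypothetical, for illustration only)."""

    def __init__(self, max_sessions):
        self.max_sessions = max_sessions  # assumed fixed capacity
        self.read_only = False            # assumed failover state
        self._sessions = {}

    def save(self, session_id, data):
        # Writes are needed whenever a user signs in or changes state.
        if self.read_only:
            raise ReadOnlyError("store is read-only")
        is_new = session_id not in self._sessions
        if is_new and len(self._sessions) >= self.max_sessions:
            raise CapacityError("session store is at capacity")
        self._sessions[session_id] = data

    def load(self, session_id):
        # Reads still work in read-only mode.
        return self._sessions.get(session_id)


store = SessionStore(max_sessions=2)
store.save("alice", {"user": "alice"})

store.read_only = True  # simulate a failover to read-only

existing = store.load("alice")  # an existing session can still be read
try:
    store.save("bob", {"user": "bob"})  # a new sign-in needs a write
    new_sign_in_ok = True
except ReadOnlyError:
    new_sign_in_ok = False
```

In this simplified model, users with an existing session can still be served after failover, while anything that requires writing a session fails until a writable store is available again.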
Interact took steps to address the stability of the service by isolating sign-in requests to dedicated servers. Whilst this recovered the service for most users, it inadvertently degraded the ability to sign in to the service.
Resolution and Mitigation Steps
Interact has increased the capacity of its in-memory data store to four times its previous capacity. Additional changes will be made over the coming weeks to implement an active-active in-memory data store to prevent a recurrence.
Additionally, during this period, Interact has refreshed its underlying hardware and made improvements to the internal caching of assets to improve the responsiveness of the application.