Intermittent Error 500 for a subset of users
Incident Report for Interact
Postmortem

SUMMARY OF IMPACT: Between 13:40 UTC and approximately 13:56 UTC on 25 Mar 2019, a subset of customers using Interact HIPAA Cloud may have experienced intermittent service availability issues.

ROOT CAUSE AND MITIGATION: At 13:40 UTC on 25 Mar 2019, a web application pool contained within a single node crashed because of what we believe to be a rare race condition in a third-party library. Over the next several minutes, the system attempted to heal by shifting the traffic load to other servers and removing the failing node from the web server cluster. During this incident, the load balancer responsible for traffic routing correctly stopped routing new traffic to the failing node, but it failed to drain the node's existing connections correctly. As a result, end users who had already been allocated to that server were impacted.
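
The two load balancer behaviors described above (stop routing new traffic to a failed node, and move the users already allocated to it) can be illustrated with a toy model. This is only a sketch under stated assumptions: the report does not describe how the Interact load balancer is implemented, and the names Node, Balancer, route and mark_failed are hypothetical. Calling mark_failed with drain=False reproduces the failure mode in this incident, where new traffic is routed correctly but existing sessions stay on the dead node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical web server node tracked by the balancer."""
    name: str
    healthy: bool = True
    sessions: set = field(default_factory=set)


class Balancer:
    """Toy load balancer: least-loaded routing plus session draining."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def route(self, session_id):
        # New traffic is only ever sent to healthy nodes.
        healthy = [n for n in self.nodes if n.healthy]
        if not healthy:
            raise RuntimeError("no healthy nodes available")
        target = min(healthy, key=lambda n: len(n.sessions))
        target.sessions.add(session_id)
        return target.name

    def mark_failed(self, name, drain=True):
        # Step 1: take the node out of rotation for new traffic.
        node = next(n for n in self.nodes if n.name == name)
        node.healthy = False
        if not drain:
            # Failure mode: existing sessions remain on the failed node.
            return []
        # Step 2: move the sessions already allocated to the failed node.
        stranded = sorted(node.sessions)
        node.sessions.clear()
        for session_id in stranded:
            self.route(session_id)
        return stranded
```

With two nodes "a" and "b", marking "a" as failed with draining enabled moves every session from "a" to "b". With drain=False, those sessions remain attached to "a" even though no new traffic reaches it.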

NEXT STEPS: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Interact service and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):

1. Fix the race condition; this work is currently in development

2. Improve the existing failed-node detection mechanism, which monitors overall service health
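
As an illustration of the second item, one common approach to failed-node detection is to mark a node as failed after a number of consecutive failed health probes. The sketch below is an assumption for illustration only, not a description of Interact's actual mechanism; the FailureDetector class and its threshold parameter are hypothetical.

```python
class FailureDetector:
    """Hypothetical detector: a node is failed after N consecutive bad probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = {}  # node name -> consecutive failed probe count

    def record(self, node, probe_ok):
        """Record one probe result; return True if the node counts as failed."""
        if probe_ok:
            # Any successful probe resets the counter.
            self.failures[node] = 0
        else:
            self.failures[node] = self.failures.get(node, 0) + 1
        return self.failures[node] >= self.threshold
```

A lower threshold removes a crashed node from rotation sooner, at the cost of more false positives when a single probe is lost.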

Posted 6 months ago. Mar 27, 2019 - 10:23 UTC

Resolved
This incident has now been resolved. A full postmortem will be published within 24 hours.

We apologise for the disruption caused.
Posted 6 months ago. Mar 25, 2019 - 14:00 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted 6 months ago. Mar 25, 2019 - 13:54 UTC
Update
We are continuing to investigate this issue.
Posted 6 months ago. Mar 25, 2019 - 13:44 UTC
Investigating
We are currently investigating an issue affecting customers accessing the Intranet on our HIPAA server. Engineers are working to resolve this issue as the highest priority.
Posted 6 months ago. Mar 25, 2019 - 13:42 UTC
This incident affected: HIPAA Cloud and HIPAA Cloud - Mobile app and API.