SUMMARY OF IMPACT: Between 13:40 UTC on 25 Mar 2019 and approximately 13:56 UTC on 25 Mar 2019, a subset of customers leveraging Interact HIPAA Cloud may have experienced intermittent service availability issues.
ROOT CAUSE AND MITIGATION: At 13:40 UTC on 25 Mar 2019, a web application pool contained with a single node crashed because of what we believe to be a rare race condition with a third party library. Over the next several minutes, the system attempted to balance this load to other servers. The system attempted to heal by shifting the traffic load to other servers, and remove the failing node from the web server cluster. Unfortunately, during this incident, the load balancer responsible for traffic routing correctly stopped routing new traffic to the failing node but importantly failed to drain connections correctly. This subsequently impacted end users that had been allocated to that server.
NEXT STEPS: We sincerely apologize for the impact to affected customers. We are continuously taking steps to improve the Interact service and our processes to help ensure such incidents do not occur in the future. In this case, this includes (but is not limited to):
1. Fix for the race condition which is currently in progress for development
2. Improve existing failed node detection mechanism which detects the overall service health