On Monday morning, June 14th, our cloud team was alerted to a higher-than-average number of HTTP 500 errors around 10:30 AM ET. Engineers quickly traced the issue to Interact's Network File System (NFS), which is responsible for storing file-based content. The team then restarted NFS while checking for any unhealthy web instances. Any servers in an unhealthy state were detached and replaced after an initial IIS reset.
By 11:00 AM ET, Interact's status page had notified customers of the known outage, and SREs were actively working on a fix. By this point, the team had already observed latency subsiding and web servers reporting as healthy. The team advised that systems were back up and operational at 12:30 PM ET.
We have a project underway to fully replace our NFS solution with a more robust AWS S3 solution for file storage. We hope to complete this project in Q3 2021; once it is in place, this type of issue should be minimized, if not eliminated.