On 15th April 2020 at 08:30am UTC, Interact's monitoring services alerted engineers to increasing error rates within the EU pod affecting a subset of customers. At the same time, Interact technical support engineers received a number of customer reports of widgets displaying blank content, while most pages remained unaffected and usable.
Investigation, Root Cause and Resolution
Interact Engineers were also alerted that other monitoring tools had not received data from inside the network since the failure at 08:30am UTC. Combined with the troubleshooting playbook steps, this led Interact Engineers to identify the issue at 9:10am UTC as a failure of outbound connectivity from within the EU cluster. The disruption was partial because most of the service does not require outbound connectivity to function. Further troubleshooting narrowed the issue down to a failure of the internal DNS services: DNS resolution errors were preventing any outbound connectivity. The sites remained mostly operational thanks to cached content and to services that rely on internal networking only.
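The distinction drawn above, between a DNS resolution failure and a general loss of outbound connectivity, can be illustrated with a small diagnostic. This is a minimal sketch and not the playbook Interact used: the function name, the returned labels and the idea of testing a known IP address directly are all assumptions made for illustration.

```python
import socket


def diagnose_outbound(hostname, fallback_ip, port=443, timeout=3.0,
                      resolve=socket.getaddrinfo,
                      connect=socket.create_connection):
    """Classify an outbound failure as DNS-related or network-related.

    `hostname` is resolved through the configured DNS servers, while
    `fallback_ip` is contacted directly, bypassing DNS. Both names and
    the returned labels are hypothetical.
    """
    # Step 1: can the host resolve a name at all?
    try:
        resolve(hostname, port)
        dns_ok = True
    except socket.gaierror:
        dns_ok = False

    # Step 2: can the host reach an external address without using DNS?
    try:
        conn = connect((fallback_ip, port), timeout)
        conn.close()
        ip_ok = True
    except OSError:
        ip_ok = False

    if dns_ok and ip_ok:
        return "healthy"
    if ip_ok:
        return "dns_failure"
    if dns_ok:
        return "network_failure"
    return "dns_and_network_failure"
```

A result of "dns_failure" would match the situation described above, where name resolution fails even though the network path itself is intact.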
After an attempt to restore the DNS Services failed, Interact Engineers restored DNS capability and outbound connectivity by failing over to AWS DNS Services using new DHCP settings, and by using Private Hosted Zones to replace the internal DNS name mappings inside the VPC.
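A failover of this kind could look roughly like the sketch below. It is not Interact's actual procedure. It assumes the two arguments `ec2` and `route53` are boto3 clients (for example `boto3.client("ec2")` and `boto3.client("route53")`), and the function name, VPC ID, region and zone name are hypothetical placeholders.

```python
import uuid


def fail_over_to_amazon_dns(ec2, route53, vpc_id, region, zone_name):
    """Point a VPC at the Amazon-provided resolver and create a private zone.

    Illustrative only: all argument values are supplied by the caller and
    nothing here reflects Interact's real configuration.
    """
    # Create a DHCP options set that hands out the Amazon-provided resolver.
    options = ec2.create_dhcp_options(
        DhcpConfigurations=[
            {"Key": "domain-name-servers", "Values": ["AmazonProvidedDNS"]},
        ]
    )
    options_id = options["DhcpOptions"]["DhcpOptionsId"]

    # Attach the new options set to the VPC.
    ec2.associate_dhcp_options(DhcpOptionsId=options_id, VpcId=vpc_id)

    # Create a private hosted zone to carry the internal name mappings.
    zone = route53.create_hosted_zone(
        Name=zone_name,
        VPC={"VPCRegion": region, "VPCId": vpc_id},
        CallerReference=str(uuid.uuid4()),  # must be unique per request
        HostedZoneConfig={"Comment": "Replacement for internal DNS names"},
    )
    return options_id, zone["HostedZone"]["Id"]
```

Records for the internal names would then be added to the new zone. Running hosts generally pick up changed DHCP options only when their lease renews or they restart.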
The affected internal services were rotated or rebooted to ensure that the new DNS Server settings were fetched and applied. This restored DNS capability and network connectivity, and service was restored for all customers by 9:28am UTC. Error rates and all other metrics returned to their respective healthy ranges.
At 10:18am UTC, Interact monitoring tools alerted engineers to an influx of intermittent errors (specifically Error 500 messages) for a select number of customers. This was investigated and identified as a problem with one of the database storage services, which hosts a subset of Interact’s EU customers. The problem was resolved by rebooting the service, after which the error rate stabilised and returned to its normal parameters. Interact monitored the issue for a significant amount of time, with the services reporting a stable and healthy error rate. The issue was confirmed resolved at 11:53am UTC.
At 12:57pm UTC, Interact experienced a recurrence of the symptoms first seen at 10:18am UTC. As before, the issue was intermittent and affected a small subset of customers. This triggered an in-depth investigation and diagnostic sweep of the related infrastructure. The fault was identified as a failing scheduled task which depended on a Directory Account to run. The directory services were unreachable from the affected hosts because of the earlier failure of the DNS Services and Domain Controllers. The job was swiftly remapped to a local account and resumed operation without issues, after which error rates returned to standard parameters and the issue was resolved. This was closely monitored into the evening to ensure scheduled executions remained successful and reported healthy.
After no further issues were identified and after several hours of stability, Interact Engineers marked the issue as resolved at 4:24pm UTC. Enhanced monitoring remained in place for the following 24 hours, and no further issues occurred in that time frame, with all systems running within their respective healthy parameters.
Interact Engineers have expanded the documentation to include troubleshooting and diagnostic steps which would result in a swift identification of this issue if it occurs again.
Additional monitoring and alerting has been implemented to directly identify the Domain Controller and DNS failures. Relevant recovery and failover steps have been documented and added to the Interact Playbook for a fast resolution. Automated measures are being evaluated.
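As an illustration of what a direct DNS health check might involve, the sketch below probes name resolution and raises an alert only after several consecutive failures, so that a single transient error does not page anyone. The class name, the probe hostname and the threshold of three are assumptions, not details of Interact's monitoring.

```python
import socket


class DnsHealthMonitor:
    """Track consecutive DNS resolution failures for one probe hostname."""

    def __init__(self, hostname, threshold=3, resolve=socket.getaddrinfo):
        self.hostname = hostname
        self.threshold = threshold  # hypothetical alerting threshold
        self.consecutive_failures = 0
        self._resolve = resolve

    def probe(self):
        """Run one probe and return True if an alert should fire."""
        try:
            self._resolve(self.hostname, None)
            self.consecutive_failures = 0  # any success resets the count
        except socket.gaierror:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.threshold
```

A scheduler would call `probe()` at a fixed interval and forward a `True` result to the alerting system.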
All recurring jobs have been hardened by issuing a fallback local account to ensure they are not dependent on a Directory Account to execute. This will allow core jobs to function, even if Domain Controllers and the Directory services fail.
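The fallback idea can be sketched as a job runner that tries each configured identity in order and uses the first one that authenticates. This is a simplified illustration: `IdentityUnavailable`, `run_job` and the authenticator callables are hypothetical names, and real scheduled tasks would be configured through the scheduler itself, not through code like this.

```python
class IdentityUnavailable(Exception):
    """Raised when an account cannot be authenticated."""


def run_job(job, authenticators):
    """Run `job` under the first identity that can be obtained.

    `authenticators` is an ordered list of (label, callable) pairs. Each
    callable returns an identity or raises IdentityUnavailable, for
    example when the directory service cannot be reached.
    """
    errors = []
    for label, authenticate in authenticators:
        try:
            identity = authenticate()
        except IdentityUnavailable as exc:
            # Record the failure and move on to the next identity.
            errors.append(f"{label}: {exc}")
            continue
        return label, job(identity)
    # Every identity failed, so surface all collected reasons.
    raise IdentityUnavailable("; ".join(errors))
```

Listing the directory account first and the local account second keeps normal behaviour unchanged while allowing the job to run during a directory outage.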
Interact is working closely with the AWS Team to perform a detailed analysis of the DNS Server failure, in order to understand its cause and identify better prevention measures for the future. Interact’s services have been fully migrated to use Amazon Default DNS Servers, and a secondary set of DNS Servers will be provisioned for failover and redundancy purposes.