Between 3:15 AM PST and 3:20 AM PST on November 23, 2020, some users may have experienced difficulties while working in Box. The issue was caused by a software bug in service-to-service communication that was triggered by a hardware failure. After about 2 minutes, a monitoring service that runs locally on the hosts detected the faulty service and restored it, which in turn restored site health.
At approximately 3:15 AM PST on November 23, 2020, a host crashed due to a hardware failure. At the time of the crash, this host was the quorum leader, so a new leader election was triggered. Leader election is normal behavior and is how the quorum stays highly available. Unfortunately, Nerve, our service registration agent, crashed due to the abrupt loss of connection to the quorum leader. As a result, the hosts where Nerve crashed were considered unavailable, and clients stopped sending traffic to them.
This incident revealed a problem in how Nerve handles an abrupt loss of connection to the quorum leader. We will fix this bug and follow up with comprehensive testing to make sure Nerve handles connection losses correctly.
The following corrective actions have been completed or are planned:
Introduce improved connection retry behavior in Nerve
Improve metrics and alerts to detect issues in Nerve
Comprehensively test Nerve against various connection loss scenarios
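As an illustration only, improved connection retry behavior of the kind listed above could look something like the following Python sketch. It is not Nerve's code: the names `connect_with_retry` and `LeaderConnectionLost`, and every parameter, are hypothetical, and the sketch assumes exponential backoff with random jitter, which this report does not specify.

```python
import random
import time


class LeaderConnectionLost(Exception):
    """Hypothetical error: the connection to the quorum leader dropped."""


def connect_with_retry(connect, max_attempts=5, base_delay=0.5,
                       max_delay=8.0, sleep=time.sleep):
    """Call connect() until it succeeds or max_attempts is reached.

    connect is a hypothetical callable that returns a session object or
    raises LeaderConnectionLost. Between attempts the function waits a
    random time, up to an exponentially growing cap, so that many agents
    do not reconnect at the same instant after a leader election.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except LeaderConnectionLost:
            if attempt == max_attempts:
                # Out of attempts: surface the error instead of hiding it.
                raise
            # Cap doubles each attempt: base_delay, 2x, 4x, ... up to max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
```

An agent built this way would call `connect_with_retry` again whenever its session is lost rather than exiting, so a brief leader election would not by itself take the host out of service. The `sleep` parameter exists so that tests can simulate connection loss scenarios without real waiting.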
We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here, and we would be happy to answer any questions you may still have regarding this matter.
The Box Team