[Outage] Issue Loading Box Site
Incident Report for Box
Postmortem
We recently addressed issues affecting Box on November 23, 2020. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

Between 3:15 AM PST and 3:20 AM PST on November 23, 2020, some users may have experienced difficulties while working in Box. The issue was caused by a software bug in service-to-service communication, which was triggered by a hardware failure. After about 2 minutes, a monitoring service that runs locally on the hosts detected the faulty service and restored it, which in turn restored site health.

Analysis 

At approximately 3:15 AM PST on November 23, 2020, a host crashed due to a hardware failure. At the time of the crash, this host was the quorum leader, and as a result a new leader election was triggered. Leader election is normal behavior and is how the quorum stays highly available. Unfortunately, Nerve, our service registration agent, crashed due to the abrupt loss of connection to the quorum leader. As a result, the hosts where Nerve crashed were considered unavailable and clients stopped sending traffic to them.

This issue revealed a problem in how Nerve handles abrupt loss of connection to the quorum leader. We will fix this bug and follow up with comprehensive testing to make sure Nerve handles connection losses correctly.

Corrective Actions

The following corrective actions have been completed or are planned:

  • Introduce improved connection retry behavior in Nerve

  • Improve metrics and alerts to detect issues in Nerve

  • Do comprehensive testing on various connection loss scenarios with Nerve
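
As an illustration only, and not Nerve's actual code, the retry behavior described in the first item could resemble the Python sketch below. It assumes a hypothetical `connect` callable that raises `ConnectionError` when the quorum leader is unreachable. The function name `retry_with_backoff` and all parameter values are assumptions chosen for the example. The idea is that an abrupt connection loss leads to a bounded series of delayed reconnect attempts rather than a crash of the agent.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds or max_attempts is exhausted.

    `connect` is a hypothetical stand-in for whatever opens the
    connection to the quorum; it is expected to raise ConnectionError
    on failure.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can decide
                # how to degrade instead of retrying forever.
                raise
            # Exponential backoff capped at max_delay. The random
            # jitter spreads reconnects from many hosts over time so
            # they do not all hit the newly elected leader at once.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

A caller would wrap its connection setup, for example `session = retry_with_backoff(open_quorum_session)`, where `open_quorum_session` is again a hypothetical name. Passing a custom `sleep` makes the delay logic testable without real waiting, which fits the connection-loss testing listed in the third item.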

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter. 

 

Sincerely,

The Box Team

Posted Nov 24, 2020 - 12:31 PST

Resolved
From approximately 3:14 AM to 3:21 AM US Pacific time, we observed issues loading the Box site. Our systems automatically detected and corrected the underlying issue. There is no current impact and no further updates will be provided here. If you are currently seeing any issues, please let us know at https://support.box.com.
Posted Nov 23, 2020 - 03:30 PST