[Minor] Some users unable to log in
Incident Report for Box
Postmortem

We recently addressed issues affecting Box Webapp, Public and Uploads & Downloads. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

Between 10:28 AM PST and 10:38 AM PST on December 13, 2023 and between 4:30 AM PST and 5:05 AM PST on December 15, 2023, some users may have experienced slowness and failures when interacting with parts of the Box Webapp and public API, including Uploads and Downloads. The cause was a problem with the deployment of a new DB Access Control service on our database fleet. We resolved the issue by temporarily disabling the problematic service. In addition, we are working to improve our testing and rollout process for services being deployed to our database fleet to prevent similar issues from occurring in the future.

Analysis

The DB Access Control service enables new user access controls to be applied automatically and frequently. This allows for faster development of new DB functionality that requires access control changes, resulting in faster team velocity and ultimately more stable and reliable infrastructure.

On the morning of December 13, an operator started the rollout of the DB Access Control service by deploying it to a single database pod. At that time, it was configured to execute its work every 10 minutes. Approximately 10 minutes after the deployment, we saw degradation on that pod, causing the impact seen on December 13. The operator suspected that the service frequency was the cause and remediated by configuring the service to execute its work once a day at a low-traffic time. The next steps should have been to validate the changes in a way that would not result in customer impact and then deploy the service again on a single pod. However, the standard process was inadvertently not followed in this case, and the change was instead deployed across the fleet, leading to the additional impact seen on December 15.
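As an illustration only, the intended sequence described above (deploy to a single pod, validate, and only then continue to the rest of the fleet) can be sketched as a small gate. This is not Box's actual tooling; the `Rollout` class, its methods, and the pod names are hypothetical and exist only to show the ordering.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """Hypothetical staged rollout: one canary pod first, the fleet only after validation."""

    pods: list[str]
    deployed: set[str] = field(default_factory=set)
    canary_validated: bool = False

    def next_targets(self) -> list[str]:
        """Return the pods that may be deployed to at the current stage."""
        if not self.deployed:
            # Nothing deployed yet: only the first pod (the canary) is allowed.
            return self.pods[:1]
        if not self.canary_validated:
            # Canary is out but not validated: block any further deployment.
            return []
        # Canary validated: the remaining pods are allowed.
        return [p for p in self.pods if p not in self.deployed]

    def mark_deployed(self, targets: list[str]) -> None:
        """Record that the service now runs on the given pods."""
        self.deployed.update(targets)

    def record_config_change(self) -> None:
        """A changed configuration is treated as a new rollout, starting again at one pod."""
        self.deployed.clear()
        self.canary_validated = False
```

In this sketch, changing the service configuration (for example, its run frequency) resets the rollout, so the next allowed target is again a single pod rather than the whole fleet.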

In addition to optimizing the DB Access Control service to be less resource-intensive, we intend to make some process changes in response to this issue. There is already a general standard process that should be followed when deploying services to the database fleet. However, to minimize the likelihood of similar situations occurring again, we are updating our documentation so that all operators have a comprehensive understanding of the standard process and follow it consistently.

Corrective Actions

The following corrective actions have been completed or are planned:

  • The relevant DB service was temporarily disabled to remediate the issue.
  • The relevant DB service has been optimized to be less resource-intensive and will automatically refuse to run during high-traffic times.
  • The rollout process for all DB services will be more thoroughly documented.
  • Operators will receive additional training to ensure that they have a comprehensive understanding of the documented rollout process.
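
As an illustration only, a service that refuses to run during high-traffic times could use a guard along the following lines. This is a minimal sketch, not the actual implementation; the function name, the window boundaries, and the load threshold are all hypothetical placeholder values.

```python
from datetime import time

# Hypothetical placeholder values; real thresholds would come from traffic data.
LOW_TRAFFIC_START = time(1, 0)   # assumed start of the low-traffic window
LOW_TRAFFIC_END = time(4, 0)     # assumed end of the low-traffic window
MAX_LOAD = 0.5                   # assumed fraction of capacity above which the job refuses to run


def may_run(now: time, current_load: float) -> bool:
    """Allow the job only inside the low-traffic window and while load is below the limit."""
    in_window = LOW_TRAFFIC_START <= now < LOW_TRAFFIC_END
    return in_window and current_load < MAX_LOAD
```

Checking measured load in addition to the clock means the job is also skipped if traffic is unexpectedly high during a nominally quiet period.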

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter. 

Sincerely,

The Box Team

Posted Jan 18, 2024 - 13:51 PST

Resolved
From approximately 10:45 AM to 10:50 AM US Pacific time, we had an issue impacting Logins to Box. Our systems automatically detected and corrected the underlying issue. There is no current impact and no further updates will be provided here. If you are currently seeing any issues, please let us know at https://support.box.com
Posted Dec 13, 2023 - 10:45 PST