[MAJOR] Issues with All Files page, Downloads, API
Incident Report for Box
Postmortem

Background

We recently addressed issues affecting the Box Webapp, Public API, and Uploads & Downloads. We would like to take this opportunity to explain these issues and the steps we have taken to keep them from recurring.

Between 4:30 AM PST and 5:05 AM PST on December 15, 2023, some users may have experienced slowness and failures when interacting with parts of the Box Webapp and public API, including Uploads and Downloads. The cause was a problem with the deployment of a new DB Access Control service on our database fleet. We resolved the issue by temporarily disabling the problematic service. In addition, we are working to improve our testing and rollout process for services deployed to our database fleet to prevent similar issues in the future.

Analysis

The DB Access Control service enables new user access controls to be applied automatically and frequently. This allows faster development of new DB functionality that requires access control changes, which increases team velocity and ultimately leads to more stable and reliable infrastructure.

On the morning of December 13, an operator started the rollout of the DB Access Control service by deploying it to a single database pod. At that time, it was configured to execute its work every 10 minutes. Approximately 10 minutes after the deployment, we saw degradation on that pod, causing the impact seen on December 13. The operator suspected that the service frequency was the cause and remediated by configuring the service to execute its work once a day at a low-traffic time. The next steps should have been to validate the change in a way that would not result in customer impact and then deploy it again on a single pod. However, the standard process was inadvertently not followed in this case, and the change was instead deployed across the fleet, leading to the additional impact seen on December 15.
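
As an illustration only, the intended sequence (deploy to a single pod, validate, and only then continue to the rest of the fleet) can be sketched as a simple gate. This is a minimal sketch under stated assumptions and does not represent Box's actual tooling: the names `staged_rollout`, `deploy`, and `is_healthy`, and the pod identifiers, are all hypothetical.

```python
from typing import Callable, List


def staged_rollout(
    canary: str,
    fleet: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to one pod first; continue to the fleet only if it stays healthy.

    All parameters are hypothetical stand-ins: `deploy` applies the change to
    a pod and `is_healthy` reports whether that pod shows no degradation.
    Returns the list of pods that received the change.
    """
    deployed = []

    # Step 1: deploy to a single pod only.
    deploy(canary)
    deployed.append(canary)

    # Step 2: validate before going any further. A failed check stops the
    # rollout, so impact stays limited to the single pod.
    if not is_healthy(canary):
        return deployed

    # Step 3: only after validation, deploy to the remaining pods.
    for pod in fleet:
        if pod == canary:
            continue
        deploy(pod)
        deployed.append(pod)
    return deployed
```

With a health check that fails on the first pod, the function returns after a single deployment, which is the behavior the standard process is meant to guarantee.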

In addition to optimizing the DB Access Control service to be less resource-intensive, we intend to make some process changes in response to this issue. There is already a standard process that should be followed when deploying services to the database fleet. However, to minimize the likelihood of similar situations occurring again, we are updating our documentation to better ensure that all operators have a comprehensive understanding of the standard process and that the process is consistently followed.

Next Steps and Corrective Actions

The following corrective actions have been completed or are planned:

  • The relevant DB service was temporarily disabled to remediate the issue.
  • The relevant DB service has been optimized to be less resource-intensive and will automatically refuse to run during high-traffic times.
  • The rollout process for all DB services will be more thoroughly documented.
  • Operators will receive additional training to ensure that they have a comprehensive understanding of the documented rollout process.
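
For illustration, a guard that makes a service refuse to run during high-traffic times could look like the sketch below. This is a hedged example, not the actual implementation: the window boundaries, the load threshold, and the name `may_run` are assumptions chosen for the example.

```python
from datetime import time

# Hypothetical values for illustration; real ones would come from traffic data.
LOW_TRAFFIC_START = time(1, 0)   # assumed start of the low-traffic window
LOW_TRAFFIC_END = time(3, 0)     # assumed end of the low-traffic window
MAX_QUERIES_PER_SECOND = 500.0   # assumed load ceiling for running the job


def may_run(now: time, current_qps: float) -> bool:
    """Return True only inside the low-traffic window and under the load ceiling."""
    in_window = LOW_TRAFFIC_START <= now < LOW_TRAFFIC_END
    under_ceiling = current_qps <= MAX_QUERIES_PER_SECOND
    return in_window and under_ceiling
```

Checking both the clock and the observed load means the job is also skipped if traffic is unexpectedly high during the nominally quiet window.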

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter. 

Sincerely,

The Box Team

Posted Jan 19, 2024 - 08:10 PST

Resolved
After further monitoring, this incident is now considered resolved. Service has been restored to full functionality. If you continue to experience any issues, please contact Box Support at https://support.box.com.
Posted Dec 15, 2023 - 06:35 PST
Update
Our team has taken steps to remediate this issue and our services should be returned to full functionality. We are continuing to monitor for any additional impact.
Posted Dec 15, 2023 - 05:34 PST
Monitoring
Our team has taken steps to remediate this issue and is seeing improvement for the services that have been affected. We are continuing to monitor for any additional impact.
Posted Dec 15, 2023 - 05:22 PST
Investigating
Our team is investigating an issue that could affect the All Files page, Downloads, and API calls. Users may see errors when attempting to preview or share files. We will provide additional information as it becomes available.
Posted Dec 15, 2023 - 05:12 PST
This incident affected: Box Platform / API (Content API) and Box Web Application (Login/SSO, Uploads/Downloads, Preview).