[Major] Issues with Logins
Incident Report for Box
Postmortem

We recently addressed issues affecting access to Box files. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

Between 1:00 PM PT and 1:40 PM PT on March 7, 2024, some users may have experienced difficulties while working in Box. During this time, affected users were not able to access files stored in Box folders. The issue occurred during the deployment of a data access service, part of our ongoing effort to improve the scalability of Box’s infrastructure. During this deployment, a central control plane component became overloaded and was unable to propagate new data access instance information to the applications. We resolved the issue by scaling up the control plane component. In addition, we are working to over-provision the control plane component and to improve monitoring and alerting to prevent similar issues from occurring in the future.

Analysis

Box uses a communication layer called "Service Mesh" to facilitate communication between its services. A recent increase in the number of applications onboarded to Service Mesh caused the control plane component to become overloaded, which delayed the propagation of instance changes for impacted services. Before the incident occurred, several core applications, including the Data Access Service, were deployed simultaneously, and the control plane failed to promptly propagate instance changes for this service. Because most user interactions with Box rely on connecting to the Data Access Service, this propagation delay left users unable to access files stored in Box folders.

To address this issue and prevent future control plane overload, we have tripled the control plane's capacity and implemented improved monitoring and alerting mechanisms.
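As a rough illustration of the overload described above, the following toy model tracks how many instance updates are still waiting to be propagated when a burst of simultaneous deployments exceeds what a control plane can process. This is a sketch under stated assumptions, not Box's implementation: the function name, the tick-based model, and every number in it are hypothetical and chosen only to show the effect of a threefold capacity increase.

```python
def propagation_backlog(arrivals_per_tick, capacity_per_tick):
    """Return the number of unpropagated instance updates after each tick.

    Toy model: each tick, new updates arrive and the control plane
    propagates at most `capacity_per_tick` of the pending updates.
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_tick:
        # Pending work grows by the arrivals and shrinks by the capacity,
        # but can never drop below zero.
        backlog = max(0, backlog + arrivals - capacity_per_tick)
        history.append(backlog)
    return history


# Hypothetical load: four ticks of simultaneous deployments, then quiet.
burst = [30, 30, 30, 30, 0, 0, 0, 0]

# Undersized control plane: the backlog keeps growing during the burst
# and is still draining well after the deployments have finished.
print(propagation_backlog(burst, capacity_per_tick=10))
# -> [20, 40, 60, 80, 70, 60, 50, 40]

# Three times the capacity: every update is propagated in the tick it arrives.
print(propagation_backlog(burst, capacity_per_tick=30))
# -> [0, 0, 0, 0, 0, 0, 0, 0]
```

In this simplified model, any service whose new instances sit in the backlog stays unreachable through the mesh until its update is propagated, which is why the backlog length is a natural quantity to monitor and alert on.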

Corrective Actions

The following corrective actions have been completed or are planned:

  • Scaling up control plane capacity to prevent future overload issues.
  • Enhancing monitoring and alerting for the control plane status.
  • Conducting benchmark and scalability testing of the control plane in the Pre-Prod environment.
  • Arranging detailed sessions with our technology vendor to optimize Box's Service Mesh infrastructure.

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter. 

Sincerely,

The Box Team

Posted Mar 25, 2024 - 10:04 PDT

Resolved
After further monitoring, this incident is now considered resolved. Box services have been restored to full functionality. Please contact Box Support at https://support.box.com/ if you continue to experience any issues.
Posted Mar 07, 2024 - 16:11 PST
Update
We are continuing to monitor for any further issues.
Posted Mar 07, 2024 - 15:00 PST
Monitoring
Our team has taken steps to remediate this issue and the login service should be returning to full functionality. We are continuing to monitor for any additional impact.
Posted Mar 07, 2024 - 13:57 PST
Update
We are continuing to investigate this issue.
Posted Mar 07, 2024 - 13:36 PST
Investigating
Our team is investigating an issue with logins within Box. Users attempting to log in to Box may see errors or timeouts. We will provide additional information as it becomes available.
Posted Mar 07, 2024 - 13:36 PST
This incident affected: Box Web Application (Login/SSO).