[Outage] Issues with Logins, Uploads, Downloads, API calls
Incident Report for Box
Postmortem

Background

On December 15, 2023, Box experienced a severe service degradation. During this time, users experienced slowness and failures when interacting with all parts of the Box webapp and public API, including Logins, Uploads, and Downloads. Additionally, some customers may have seen small subsets of the content they created during the incident become sporadically hidden.

The issue was triggered by a partial power loss that affected persistent disks in one of the cloud infrastructure availability zones in which we operate, degrading multiple active databases. Although our database infrastructure is architected with zonal redundancy, the unique nature of the failure surfaced a gap in our zonal health monitoring and presented unforeseen challenges for our automatic recovery tooling, which extended the resolution time. We ultimately resolved the issue by manually shifting traffic out of the impacted availability zone.

Analysis

Box’s database infrastructure is designed to be able to withstand problems impacting a single availability zone. Our database replicas are spread across multiple availability zones for redundancy and we have automated tooling in place that can detect faults and shift traffic to other replicas should that become necessary. Both the redundant replicas and automated tooling are exercised regularly as part of our normal operations.

In this case, the partial power loss impacted several of our databases in a way that compromised their ability to serve high traffic volumes. Because the impact varied widely from one database to another, this type of partial zonal failure was difficult to diagnose and mitigate, both for our automation and for our incident responders, which extended the time to resolution. Once we confirmed that the impact was limited to a single availability zone, we mitigated the issue by shifting all active databases out of the impacted zone and into other availability zones.
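To illustrate why a partial zonal failure can be hard to spot from per-database signals alone, the sketch below groups database health samples by availability zone and flags a zone when a meaningful fraction of its databases look degraded. It is an illustration only and does not represent Box's monitoring or recovery tooling: the `DbHealth` record, the `suspect_zones` function, and both thresholds are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DbHealth:
    """Hypothetical health sample for one database instance."""
    name: str
    zone: str
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def suspect_zones(samples, error_threshold=0.05, zone_fraction=0.3):
    """Return zones where at least `zone_fraction` of the databases
    exceed `error_threshold`. Both thresholds are illustrative assumptions."""
    total = defaultdict(int)
    degraded = defaultdict(int)
    for sample in samples:
        total[sample.zone] += 1
        if sample.error_rate > error_threshold:
            degraded[sample.zone] += 1
    # A zone is suspect when enough of its databases are degraded,
    # even if the severity differs widely between them.
    return sorted(
        zone for zone in total
        if degraded[zone] / total[zone] >= zone_fraction
    )


samples = [
    DbHealth("db1", "zone-a", 0.40),
    DbHealth("db2", "zone-a", 0.01),
    DbHealth("db3", "zone-a", 0.08),
    DbHealth("db4", "zone-a", 0.00),
    DbHealth("db5", "zone-b", 0.01),
    DbHealth("db6", "zone-b", 0.02),
]
print(suspect_zones(samples))
```

Here only two of the four databases in the hypothetical `zone-a` are degraded, and to very different degrees, yet grouping by zone still points at a single zone as the common factor.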

During the course of our mitigation efforts, replication topologies may have caused certain newly created files or file versions to intermittently not appear in a limited number of Box accounts. These intermittent visibility issues were resolved as replication topologies were normalized and caches were later cleared.

Next Steps and Corrective Actions

We are continuing to conduct a comprehensive engineering postmortem. Therefore, this report is subject to change based on our further analysis and findings.

The following corrective actions have been, or continue to be, implemented:

  • improve our ability to diagnose partial availability zone degradations like the one we encountered in this incident;
  • improve our automatic incident mitigation system to better handle partial zone degradations;
  • improve our testing strategies for the database tier mitigation systems to cover this type of failure scenario; and
  • conduct a comprehensive audit of our production systems to validate their resilience to zonal failures and degradations.

The above noted corrective actions will strengthen our efforts to safeguard against partial zone degradations, reduce mitigation timeframes and support enhanced testing and prevention efforts.

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter. 

Sincerely,

The Box Team

Posted Dec 15, 2023 - 19:16 PST

Resolved
Box's FTP Server has been restored to full functionality. All Box Services are now operating as expected, and this incident is considered resolved. If you continue to experience any issues, please contact Box Support at https://support.box.com.
Posted Dec 15, 2023 - 17:08 PST
Update
We are continuing to see errors subside for users connecting to Box's FTP Server. We are still actively monitoring for additional impact as services return to full functionality.
Posted Dec 15, 2023 - 15:08 PST
Update
After further monitoring, our teams identified some impact for users attempting to connect to Box's FTP Server, which resulted in errors or failed logins. Our teams have taken steps to swiftly remediate these errors and we are already seeing improvement. We are continuing to monitor for additional impact as services return to full functionality.
Posted Dec 15, 2023 - 14:00 PST
Update
All Box Services should now be restored. We are continuing to monitor for any additional impact.
Posted Dec 15, 2023 - 13:12 PST
Update
All Box Services should now be restored. We are continuing to monitor for any additional impact.
Posted Dec 15, 2023 - 12:06 PST
Update
We are continuing to monitor for any additional impact.
Posted Dec 15, 2023 - 11:02 PST
Update
Box Services continue to show rapid improvement, but users may experience some latency as services are restored to full functionality. We are continuing to monitor for any additional impact.
Posted Dec 15, 2023 - 10:25 PST
Monitoring
Our team has taken steps to remediate this issue and is seeing improvement across all Box Services. We are continuing to monitor for any additional impact as services return to full functionality.
Posted Dec 15, 2023 - 09:48 PST
Identified
We have identified the underlying cause of this issue and our teams continue to work towards a full restoration of services as soon as possible. We will provide additional updates as they become available.
Posted Dec 15, 2023 - 09:27 PST
Update
Service restoration efforts are still ongoing. We will provide additional information as it becomes available.
Posted Dec 15, 2023 - 09:07 PST
Update
We are continuing to investigate this issue. Box services remain completely inaccessible, but our teams are actively working to get services restored as soon as possible. We will provide additional information as it becomes available.
Posted Dec 15, 2023 - 08:36 PST
Update
Our team is continuing to investigate and has confirmed this issue is impacting all Box Services. Users attempting to use Box may see errors or timeouts, but for the most part services will be completely inaccessible. We will provide additional information as it becomes available.
Posted Dec 15, 2023 - 07:31 PST
Update
We are continuing to investigate this issue.
Posted Dec 15, 2023 - 07:14 PST
Investigating
Our team is investigating an issue with Logins, Uploads, Downloads, the All Files page, and API calls. Users attempting to use these services may see errors or timeouts. We will provide additional information as it becomes available.
Posted Dec 15, 2023 - 06:38 PST
This incident affected: Box Platform / API (Content API, Content Preview, Search, Uploads/Downloads), Box Web Application (Login/SSO, Uploads/Downloads, Search, Preview), Desktop Applications (Login/SSO, Box Sync, Box Drive), Mobile Applications (Login/SSO, Preview, Search, Uploads/Downloads), Box Notes (Web Application), and FTP.