[Critical] Issue Impacting All Services
Incident Report for Box
Postmortem

We recently addressed issues affecting Uploads, Web Logins, Downloads, Sign, and Public API. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.

Between 10:11AM and 10:32AM PDT on October 6, 2023, some users may have experienced difficulties while working in Box. During this time, users may not have been able to log in, upload/download content, or interact with some components of Sign or some API endpoints. The issue occurred as the result of a configuration change affecting internal routing and DNS infrastructure. We were able to resolve the issue by restoring the original configuration.

Analysis

This issue was caused by a configuration change intended to shift traffic between our cloud provider and on-premise infrastructure to a redundant path on the network. This configuration change inadvertently impacted our internal routing and DNS infrastructure, which certain Box services rely on for internal name resolution during network maintenance. The inability of services to resolve internal resources led to a cascading effect of errors across several services. After reviewing the incident, we discovered that the redundant path was missing the key route for our DNS infrastructure. As such, part of the network path between the public cloud and on-premise datacenter was temporarily disrupted.

As part of Box’s migration to public cloud, our traditional datacenter network was connected to our public cloud vendor with multiple high-bandwidth connections. As Box has completed its service migration to the public cloud, we are working to reduce the number of these connections being used. In connection with this effort, we migrated traffic off two links that were serving as a connection between the cloud and our DNS infrastructure. Due to a missing configuration, the DNS-specific route was not being propagated throughout the network as expected; as a result, when these two links were turned off, our cloud infrastructure temporarily lost access to our DNS infrastructure. This left our internal services unable to perform DNS lookups for one another. At 10:32AM PDT, these links were turned back on, which allowed us to restore the original configuration and remediate this issue. However, because our configuration pipelines were also impacted by this issue and unable to connect to our DNS infrastructure, these links had to be turned back on manually, which lengthened our time to remediation.
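
To illustrate the failure mode described above, the following is a minimal sketch of a pre-change check that asks whether critical addresses (such as DNS servers) would still be covered by a route after a set of links is turned off. It is not Box's actual tooling. The function name, link names, prefixes, and addresses are all hypothetical, and a real network would need to take its route data from the routers themselves rather than from a static table.

```python
import ipaddress


def unreachable_after_removal(routes_by_link, critical_addrs, links_to_remove):
    """Return critical addresses that no remaining link has a route for."""
    # Collect every prefix advertised on links that stay up.
    remaining = [
        ipaddress.ip_network(prefix)
        for link, prefixes in routes_by_link.items()
        if link not in links_to_remove
        for prefix in prefixes
    ]
    missing = []
    for addr in critical_addrs:
        ip = ipaddress.ip_address(addr)
        # An address is covered if any remaining prefix contains it.
        if not any(ip in net for net in remaining):
            missing.append(addr)
    return missing


# Hypothetical example data: only link-a and link-b carry the DNS-specific route.
routes_by_link = {
    "link-a": ["10.0.0.0/16", "10.53.0.0/24"],
    "link-b": ["10.0.0.0/16", "10.53.0.0/24"],
    "link-c": ["10.0.0.0/16"],
}
dns_servers = ["10.53.0.10", "10.53.0.11"]

print(unreachable_after_removal(routes_by_link, dns_servers, {"link-a", "link-b"}))
```

In this made-up data set, turning off both link-a and link-b leaves the DNS addresses without a covering route, which is the kind of gap a check like this is meant to surface before a change is applied.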

Box has been migrating to a public cloud platform over the course of 2022-2023 to improve overall performance and reliability. Our DNS infrastructure is one of the final components that will be migrated to be hosted in public cloud. This migration is expected to be completed in the coming quarter and will give our public cloud infrastructure the ability to have this DNS locally instead of relying on these connections.

Corrective Actions

The following corrective actions have been completed or are planned:

  • Improve the change procedure for routing changes that involve critical infrastructure for interim hybrid connectivity. This will provide extra oversight over changes of this type to minimize the likelihood of similar issues occurring in the future.
  • Finish the migration of DNS infrastructure into public cloud.
  • Improve resilience of our configuration pipelines so that impact to production does not also impact these pipelines.
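
One general way to keep tooling usable when name resolution is unavailable is to fall back to a small set of pinned addresses. The sketch below shows that idea only; it does not describe how Box's configuration pipelines are built or how they will be changed. The function name, the hostname, and the fallback address are hypothetical.

```python
import socket


def resolve_with_fallback(hostname, static_fallbacks, resolver=socket.gethostbyname):
    """Resolve a hostname, using a pinned address if the DNS lookup fails."""
    try:
        return resolver(hostname)
    except OSError:
        # socket.gaierror is a subclass of OSError, so failed lookups land here.
        if hostname in static_fallbacks:
            return static_fallbacks[hostname]
        # No pinned address is known, so surface the original failure.
        raise


# Hypothetical pinned address for a device the pipeline must always reach.
STATIC_FALLBACKS = {"edge-router.internal.example": "192.0.2.10"}
```

A caller would use `resolve_with_fallback("edge-router.internal.example", STATIC_FALLBACKS)` in place of a direct lookup. The trade-off is that pinned addresses must be kept current, so this approach suits only a short list of critical endpoints.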

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter. 

Sincerely,
The Box Team

Posted Oct 13, 2023 - 10:23 PDT

Resolved
After further monitoring, this incident is now considered resolved. All Box services have been restored to full functionality. Please contact Box Support at https://support.box.com/ if you continue to experience any issues.
Posted Oct 06, 2023 - 10:55 PDT
Update
We are continuing to monitor for any further issues.
Posted Oct 06, 2023 - 10:49 PDT
Monitoring
Users may experience issues when attempting to access or use Box. Our team has taken remediating action and is currently monitoring for any additional impact. We will provide additional information as it becomes available.
Posted Oct 06, 2023 - 10:49 PDT
This incident affected: Box Web Application (Login/SSO, Uploads/Downloads, Collaboration, Search, Preview, Sharing (Shared Links), Email Notifications, Admin Console & Functionality, Governance (Retention), Governance (Legal Holds), Workflows and Automations, Comments and Tasks, Accessible Site (a.box.com), Box Sign, Box Canvas, Box Shield (Threat Detection), Box Shield (Virus Detection), Box Shield (Auto Classification), Box Shuttle, Watermarking).