[Outage] Issues with Logins, Uploads, Downloads, API calls
Incident Report for Box
Postmortem

We recently addressed issues affecting most parts of the Box webapp and public API, including Logins, Uploads and Downloads. We would like to take this opportunity to explain these issues and the steps we have taken to keep them from happening again.

On January 9, 2024, between 11:10 AM PT and 12:15 PM PT, users experienced slowness and failures when interacting with most parts of the Box webapp and public API, including Logins, Uploads and Downloads. The issue was triggered by a change that impacted Kubernetes clusters in one of the regions where we run our critical services. The change was rolled back at 11:18 AM PT, and systems began to recover around 11:27 AM PT.

Analysis

On January 9, 2024, we made a configuration change to deploy a new daemonset to the dedicated Kubernetes clusters running our core data services. The daemonset caused resource exhaustion on the nodes in those clusters and the eviction of our core data services from them. The daemonset deployment pipeline erroneously allowed the change to be promoted to an entire region instead of following the planned gradual rollout, which resulted in a service interruption impacting most of our customers. To address this issue, we rolled back the daemonset deployment. After the rollback, the evicted core data services began to be restored and our systems began to recover; however, in two of our three clusters, this process was delayed by a scheduler limitation, which prolonged the impact. We adjusted the resource requests for our core data services to speed up recovery of these two clusters, restoring normal operations at 12:15 PM PT.

Corrective Actions

The following corrective actions have been completed or are planned:

  • We have paused the daemonset pipeline until we can ensure it employs safe incremental rollout processes.
  • We are updating the daemonset pipeline to validate resource allocations before deployment and ensure a gradual production rollout to reduce the likelihood of similar issues occurring in the future.
  • We have adjusted the priority of our application services, including the core data services, to be higher than daemonsets to ensure their capacity is protected.
  • We are adjusting our Kubernetes configurations to enable faster restoration of service through guaranteed capacity and buffers.
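
As a rough illustration of what a gradual rollout with a health gate can look like, here is a minimal sketch. It does not describe the actual Box pipeline: the functions plan_waves and rollout, the deploy and healthy callbacks, and the wave sizes are all hypothetical. The idea is to split the target clusters into successive waves and to stop promoting the change as soon as any cluster in a wave fails its health check.

```python
from typing import Callable, List, Sequence, Tuple


def plan_waves(clusters: Sequence[str], wave_sizes: Sequence[int]) -> List[List[str]]:
    """Split clusters into successive waves; the last wave takes whatever remains."""
    waves: List[List[str]] = []
    start = 0
    for size in wave_sizes:
        if start >= len(clusters):
            break
        waves.append(list(clusters[start:start + size]))
        start += size
    if start < len(clusters):
        waves.append(list(clusters[start:]))
    return waves


def rollout(
    waves: Sequence[Sequence[str]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> Tuple[List[str], List[str]]:
    """Deploy wave by wave and halt after the first wave with an unhealthy cluster.

    Returns (clusters deployed so far, unhealthy clusters that halted the rollout).
    The second list is empty when every wave passed its health check.
    """
    deployed: List[str] = []
    for wave in waves:
        for cluster in wave:
            deploy(cluster)
            deployed.append(cluster)
        unhealthy = [cluster for cluster in wave if not healthy(cluster)]
        if unhealthy:
            return deployed, unhealthy
    return deployed, []
```

With small early waves, a faulty change is confined to a few clusters before the gate stops it, rather than reaching an entire region at once.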

We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here and we would be happy to answer any questions you may still have regarding this matter.

Sincerely,
The Box Team

Posted Jan 11, 2024 - 07:33 PST

Resolved
After further monitoring, this incident is now considered resolved. All services have been restored to full functionality. If you continue to experience any issues, please contact Box Support at https://support.box.com.
Posted Jan 09, 2024 - 12:57 PST
Update
Our team has taken steps to remediate this issue and all services should be returned to full functionality. We are continuing to monitor for any additional impact.
Posted Jan 09, 2024 - 12:19 PST
Monitoring
Our team has taken steps to remediate this issue and we are seeing improvement across all Box Services. We are continuing to monitor for any additional impact as services return to full functionality.
Posted Jan 09, 2024 - 12:11 PST
Update
Our team continues remediation efforts to restore full functionality to affected services. We will provide additional updates as they become available.
Posted Jan 09, 2024 - 11:50 PST
Identified
Our team has identified the underlying cause of this issue and is working to take remediating steps. You may begin seeing improvement at this time. We will provide additional updates as they become available.
Posted Jan 09, 2024 - 11:30 PST
Investigating
Our team is investigating an issue with Logins, Uploads, Downloads, the All Files page, and API calls. Users attempting to use these services may see errors or timeouts. We will provide additional information as it becomes available.
Posted Jan 09, 2024 - 11:20 PST
This incident affected: Box Platform / API (Content API), Box Web Application (Login/SSO, Uploads/Downloads), and Box Website.