We recently addressed issues affecting the internal Monolith service. We would like to take the opportunity to further explain these issues and the steps we have taken to keep them from happening in the future.
Between 9:22am PDT and 9:50am PDT on August 20th, 2021, some users may have experienced difficulties while working in Box. During this time, responses from the Box User Events API endpoint may have been delayed. The issue occurred because an internal coordination and configuration system was in a degraded state. We resolved the issue by manually restarting the leader of the coordination system. To help prevent similar issues in the future, we are implementing an automatic restart mechanism for this condition.
Our internal messaging system at Box uses a common orchestration system to control various aspects of the service. This orchestration system became degraded due to resource contention, which negatively impacted other systems. As a result, the database cluster that powers the User Events API was delayed in returning responses to user requests. After restarting the impacted systems, we were able to restore service.
The following corrective actions have been completed or are planned:
Automatic detection and remediation of the impacted coordination service.
Reduction of extraneous logging to reduce resource contention.
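For readers curious about what automatic detection and remediation can look like in general, the following is a minimal illustrative sketch and not a description of our actual implementation. It assumes a watchdog that probes the coordination leader on a schedule and restarts it only after several consecutive failed probes. The class name `LeaderWatchdog`, the `check_healthy` and `restart_leader` hooks, and the threshold of 3 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LeaderWatchdog:
    """Restart a degraded leader after repeated failed health probes.

    Both callables are hypothetical hooks supplied by the caller; nothing
    here reflects a real coordination system's API.
    """

    check_healthy: Callable[[], bool]    # returns True when the leader responds normally
    restart_leader: Callable[[], None]   # performs the restart that was previously manual
    failure_threshold: int = 3           # assumed value; tune to avoid restarting on blips
    consecutive_failures: int = 0
    restarts: int = 0

    def tick(self) -> bool:
        """Run one probe. Return True if this probe triggered a restart."""
        if self.check_healthy():
            # A healthy probe clears the failure streak.
            self.consecutive_failures = 0
            return False

        self.consecutive_failures += 1
        if self.consecutive_failures < self.failure_threshold:
            return False

        # Enough consecutive failures: remediate and start counting afresh.
        self.restart_leader()
        self.restarts += 1
        self.consecutive_failures = 0
        return True


if __name__ == "__main__":
    # Simulated probe results: one healthy probe, three failures, then recovery.
    probes = iter([True, False, False, False, True])
    restarted = []
    watchdog = LeaderWatchdog(
        check_healthy=lambda: next(probes),
        restart_leader=lambda: restarted.append("leader"),
    )
    print([watchdog.tick() for _ in range(5)])
    print(f"restarts performed: {len(restarted)}")
```

Requiring several consecutive failures before restarting is a deliberate trade-off in this sketch: it delays remediation slightly but avoids restarting the leader on a single slow response.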
We are continuously working to improve Box and want to make sure we are delivering the best product and user experience we can. We hope we have provided some clarity here, and we would be happy to answer any questions you may still have regarding this matter.
The Box Team