BSODiag: A Global Diagnosis Framework for Batch Servers Outage in Large-scale Cloud Infrastructure Systems

January 31, 2025 Β· Declared Dead Β· πŸ› 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)

πŸ‘» CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Tao Duan, Runqing Chen, Pinghui Wang, Junzhou Zhao, Jiongzhou Liu, Shujie Han, Yi Liu, Fan Xu arXiv ID 2502.15728 Category cs.DC: Distributed Computing Citations 1 Venue 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP) Last Checked 4 months ago
Abstract
Cloud infrastructure is the collective term for all physical devices within cloud systems. Failures within the cloud infrastructure system can severely compromise the stability and availability of cloud services. Particularly, batch servers outage, which is the most fatal failure, could result in the complete unavailability of all upstream services. In this work, we focus on the batch servers outage diagnosis problem, aiming to accurately and promptly analyze the root cause of outages to facilitate troubleshooting. However, our empirical study conducted in a real industrial system indicates that it is a challenging task. Firstly, the collected single-modal coarse-grained failure monitoring data (i.e., alert, incident, or change) in the cloud infrastructure system is insufficient for a comprehensive failure profiling. Secondly, due to the intricate dependencies among devices, outages are often the cumulative result of multiple failures, but correlations between failures are difficult to ascertain. To address these problems, we propose BSODiag, an unsupervised and lightweight diagnosis framework for batch servers outage. BSODiag provides a global analytical perspective, thoroughly explores failure information from multi-source monitoring data, models the spatio-temporal correlations among failures, and delivers accurate and interpretable diagnostic results. Experiments conducted on the Alibaba Cloud infrastructure system show that BSODiag achieves 87.5% PR@3 and 46.3% PCR, outperforming baseline methods by 10.2% and 3.7%, respectively.
Community shame:
Not yet rated
Community Contributions

Found the code? Know the venue? Think something is wrong? Let us know!

πŸ“œ Similar Papers

In the same crypt β€” Distributed Computing

Died the same way β€” πŸ‘» Ghosted