BSODiag: A Global Diagnosis Framework for Batch Servers Outage in Large-scale Cloud Infrastructure Systems
January 31, 2025 Β· Declared Dead Β· π 2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Tao Duan, Runqing Chen, Pinghui Wang, Junzhou Zhao, Jiongzhou Liu, Shujie Han, Yi Liu, Fan Xu
arXiv ID
2502.15728
Category
cs.DC: Distributed Computing
Citations
1
Venue
2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)
Last Checked
4 months ago
Abstract
Cloud infrastructure is the collective term for all physical devices within cloud systems. Failures within the cloud infrastructure system can severely compromise the stability and availability of cloud services. Particularly, batch servers outage, which is the most fatal failure, could result in the complete unavailability of all upstream services. In this work, we focus on the batch servers outage diagnosis problem, aiming to accurately and promptly analyze the root cause of outages to facilitate troubleshooting. However, our empirical study conducted in a real industrial system indicates that it is a challenging task. Firstly, the collected single-modal coarse-grained failure monitoring data (i.e., alert, incident, or change) in the cloud infrastructure system is insufficient for a comprehensive failure profiling. Secondly, due to the intricate dependencies among devices, outages are often the cumulative result of multiple failures, but correlations between failures are difficult to ascertain. To address these problems, we propose BSODiag, an unsupervised and lightweight diagnosis framework for batch servers outage. BSODiag provides a global analytical perspective, thoroughly explores failure information from multi-source monitoring data, models the spatio-temporal correlations among failures, and delivers accurate and interpretable diagnostic results. Experiments conducted on the Alibaba Cloud infrastructure system show that BSODiag achieves 87.5% PR@3 and 46.3% PCR, outperforming baseline methods by 10.2% and 3.7%, respectively.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Distributed Computing
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Reproducing GW150914: the first observation of gravitational waves from a binary black hole merger
R.I.P.
π»
Ghosted
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems
R.I.P.
π»
Ghosted
Adaptive Federated Learning in Resource Constrained Edge Computing Systems
R.I.P.
π»
Ghosted
Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing
R.I.P.
π»
Ghosted
iFogSim: A Toolkit for Modeling and Simulation of Resource Management Techniques in Internet of Things, Edge and Fog Computing Environments
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted