Pinpointing Anomaly Events in Logs from Stability Testing -- N-Grams vs. Deep-Learning

February 18, 2022 · Declared Dead · 🏛 International Conference on Software Testing, Verification and Validation Workshops

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Mika Mäntylä, Martín Varela, Shayan Hashemi
arXiv ID: 2202.09214
Category: cs.SE (Software Engineering)
Citations: 11
Venue: International Conference on Software Testing, Verification and Validation Workshops
Last Checked: 4 months ago

Abstract

As stability testing execution logs can be very long, software engineers need help in locating anomalous events. We develop and evaluate two models for scoring individual log events for anomalousness, namely an N-Gram model and a Deep Learning model with LSTM (Long Short-Term Memory). Both are trained on normal log sequences only. We evaluate the models with long log sequences from Android stability testing at our case company and with short log sequences from the public HDFS (Hadoop Distributed File System) dataset. We evaluate next-event prediction accuracy and computational efficiency. The LSTM model is more accurate on the stability testing logs (0.865 vs 0.848 for the N-Gram model), whereas on the HDFS logs the N-Gram model is slightly more accurate (0.904 vs 0.900 for the LSTM model). The N-Gram model has far superior computational efficiency compared to the LSTM model (4 to 13 seconds vs 16 minutes to nearly 4 hours), making it the preferred choice for our case company. Scoring individual log events for anomalousness seems like a good aid for root cause analysis of failing test cases, and our case company plans to add it to its online services. Despite the recent surge in using deep learning for software system anomaly detection, we found limited benefits in doing so. However, future work should consider whether our finding holds with different LSTM hyper-parameters, with other datasets, and with other deep-learning approaches that promise better accuracy and computational efficiency than LSTM-based models.
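
To make the N-Gram idea in the abstract concrete, the following is a minimal sketch of a next-event predictor trained on normal sequences only. It is not the authors' implementation: the class name `NGramEventModel`, the `<s>` padding token, the default `n`, and the choice of `1 - P(event | context)` as the anomaly score are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class NGramEventModel:
    """Hypothetical N-gram next-event model; names and scoring are assumptions."""

    def __init__(self, n=3):
        self.n = n
        # Maps a context (tuple of the previous n-1 events) to next-event counts.
        self.counts = defaultdict(Counter)

    def _contexts(self, seq):
        # Pad the start so the first events also have a full-length context.
        padded = ["<s>"] * (self.n - 1) + list(seq)
        for i in range(self.n - 1, len(padded)):
            yield tuple(padded[i - self.n + 1:i]), padded[i]

    def fit(self, sequences):
        # Train on normal log sequences only.
        for seq in sequences:
            for ctx, event in self._contexts(seq):
                self.counts[ctx][event] += 1
        return self

    def predict(self, context):
        # Most frequent next event after this context, or None if unseen.
        seen = self.counts.get(tuple(context))
        return seen.most_common(1)[0][0] if seen else None

    def score(self, seq):
        # Per-event anomaly score: 1 - P(event | context); unseen contexts score 1.0.
        scores = []
        for ctx, event in self._contexts(seq):
            seen = self.counts.get(ctx)
            total = sum(seen.values()) if seen else 0
            p = seen[event] / total if total else 0.0
            scores.append(1.0 - p)
        return scores

    def accuracy(self, sequences):
        # Fraction of events where the top prediction matches the actual event.
        hits = total = 0
        for seq in sequences:
            for ctx, event in self._contexts(seq):
                hits += self.predict(ctx) == event
                total += 1
        return hits / total if total else 0.0


# Toy event sequences, invented for illustration only.
normal = [
    ["open", "read", "close"],
    ["open", "read", "read", "close"],
    ["open", "write", "close"],
]
model = NGramEventModel(n=2).fit(normal)
print(model.score(["open", "close", "read"]))
```

In this sketch, events with a score near 1.0 are the ones an engineer would inspect first. Both training and scoring take a single pass over the events with dictionary lookups, which is consistent with the short run times the abstract reports for the N-Gram model.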