A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments
October 07, 2020 Β· Declared Dead Β· π Journal of Systems and Software
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Mehdi Golzadeh, Alexandre Decan, Damien Legay, Tom Mens
arXiv ID
2010.03303
Category
cs.SE: Software Engineering
Cross-listed
cs.LG
Citations
89
Venue
Journal of Systems and Software
Last Checked
3 months ago
Abstract
Bots are frequently used in Github repositories to automate repetitive activities that are part of the distributed software development process. They communicate with human actors through comments. While detecting their presence is important for many reasons, no large and representative ground-truth dataset is available, nor are classification models to detect and validate bots on the basis of such a dataset. This paper proposes a ground-truth dataset, based on a manual analysis with high interrater agreement, of pull request and issue comments in 5,000 distinct Github accounts of which 527 have been identified as bots. Using this dataset we propose an automated classification model to detect bots, taking as main features the number of empty and non-empty comments of each account, the number of comment patterns, and the inequality between comments within comment patterns. We obtained a very high weighted average precision, recall and F1-score of 0.98 on a test set containing 40% of the data. We integrated the classification model into an open source command-line tool to allow practitioners to detect which accounts in a given Github repository actually correspond to bots.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Software Engineering
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Microservices: yesterday, today, and tomorrow
π
π
The Cartographer
A Survey of Machine Learning for Big Code and Naturalness
R.I.P.
π»
Ghosted
An Overview on Smart Contracts: Challenges, Advances and Platforms
R.I.P.
π»
Ghosted
Slither: A Static Analysis Framework For Smart Contracts
R.I.P.
π»
Ghosted
ContractFuzzer: Fuzzing Smart Contracts for Vulnerability Detection
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted