Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization
July 12, 2022 · Declared Dead · ESEC/SIGSOFT FSE
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Lin Shi, Fangwen Mu, Xiao Chen, Song Wang, Junjie Wang, Ye Yang, Ge Li, Xin Xia, Qing Wang
arXiv ID
2207.05579
Category
cs.SE: Software Engineering
Citations
75
Venue
ESEC/SIGSOFT FSE
Last Checked
1 month ago
Abstract
Code summarization, the task of generating useful comments given the code, has long been of interest. Most existing code summarization models are trained and validated on widely used code-comment benchmark datasets. However, little is known about the quality of these benchmark datasets, which are built from real-world projects. Are the benchmark datasets as good as expected? To bridge the gap, we conduct a systematic study to assess and improve the quality of four benchmark datasets widely used for code summarization tasks. First, we propose an automated code-comment cleaning tool that can accurately detect noisy data caused by inappropriate data preprocessing operations in existing benchmark datasets. Then, we apply the tool to assess the data quality of the four benchmark datasets, based on the detected noise. Finally, we conduct comparative experiments to investigate the impact of noisy data on the performance of code summarization models. The results show that this preprocessing noise is widespread in all four benchmark datasets, and that removing the noisy data leads to a significant improvement in the performance of code summarization. We believe that the findings and insights will enable a better understanding of data quality in code summarization tasks, and pave the way for relevant research and practice.
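The cleaning step described in the abstract can be pictured with a small rule-based filter. This is only a sketch under stated assumptions: the abstract does not list the tool's rules, so the noise categories here (empty comment, HTML residue, under-development marker, empty code, exact duplicate) and the names `detect_noise` and `clean_dataset` are hypothetical and chosen for illustration.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical noise rules for illustration only; the abstract does not
# specify which rules the authors' tool actually applies.
HTML_TAG = re.compile(r"<[^>]+>")
TODO_MARK = re.compile(r"\b(TODO|FIXME|XXX)\b")

Pair = Tuple[str, str]  # (code, comment)


def detect_noise(code: str, comment: str) -> List[str]:
    """Return the names of the illustrative rules that a pair triggers."""
    labels: List[str] = []
    if not comment.strip():
        labels.append("empty_comment")
    if HTML_TAG.search(comment):
        labels.append("html_residue")
    if TODO_MARK.search(comment):
        labels.append("under_development")
    if not code.strip():
        labels.append("empty_code")
    return labels


def clean_dataset(pairs: List[Pair]) -> Tuple[List[Pair], Dict[str, int]]:
    """Split pairs into a clean list and a per-rule count of noisy pairs."""
    clean: List[Pair] = []
    noise_counts: Dict[str, int] = {}
    seen = set()
    for code, comment in pairs:
        labels = detect_noise(code, comment)
        # An exact repeat of an earlier pair is treated as noise as well.
        if (code, comment) in seen:
            labels.append("duplicate")
        seen.add((code, comment))
        if labels:
            for label in labels:
                noise_counts[label] = noise_counts.get(label, 0) + 1
        else:
            clean.append((code, comment))
    return clean, noise_counts
```

The per-rule counts correspond loosely to the assessment step (how much of each kind of noise a dataset contains), and the clean list is what a model would be retrained on in the comparative experiments.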
Similar Papers
In the same crypt: Software Engineering
GraphCodeBERT: Pre-training Code Representations with Data Flow (Ghosted)
DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars (Ghosted)
Microservices: yesterday, today, and tomorrow (Ghosted)
Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks (Ghosted)
A Survey of Machine Learning for Big Code and Naturalness (Ghosted)
Died the same way: Ghosted
Language Models are Few-Shot Learners
PyTorch: An Imperative Style, High-Performance Deep Learning Library
XGBoost: A Scalable Tree Boosting System