Representation Learning for Stack Overflow Posts: How Far are We?
March 13, 2023 Β· Declared Dead Β· π ACM Transactions on Software Engineering and Methodology
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Junda He, Zhou Xin, Bowen Xu, Ting Zhang, Kisub Kim, Zhou Yang, Ferdian Thung, Ivana Irsan, David Lo
arXiv ID
2303.06853
Category
cs.SE: Software Engineering
Citations
30
Venue
ACM Transactions on Software Engineering and Methodology
Last Checked
4 months ago
Abstract
The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, thus motivating researchers to propose various solutions for analyzing its content.The performance of such solutions hinges significantly on the selection of representation model for Stack Overflow posts. As the volume of literature on Stack Overflow continues to burgeon, it highlights the need for a powerful Stack Overflow post representation model and drives researchers' interest in developing specialized representation models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon trendy neural networks such as convolutional neural network (CNN) and Transformer architecture (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill the research gap, we first empirically compare the performance of the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) in a wide range of related tasks, i.e., tag recommendation, relatedness prediction, and API recommendation. To find more suitable representation models for the posts, we further explore a diverse set of BERT-based models, including (1) general domain language models (RoBERTa and Longformer) and (2) language models built with software engineering-related textual artifacts (CodeBERT, GraphCodeBERT, and seBERT). However, it also illustrates the ``No Silver Bullet'' concept, as none of the models consistently wins against all the others. Inspired by the findings, we propose SOBERT, which employs a simple-yet-effective strategy to improve the best-performing model by continuing the pre-training phase with the textual artifact from Stack Overflow.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Software Engineering
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Microservices: yesterday, today, and tomorrow
π
π
The Cartographer
A Survey of Machine Learning for Big Code and Naturalness
R.I.P.
π»
Ghosted
An Overview on Smart Contracts: Challenges, Advances and Platforms
R.I.P.
π»
Ghosted
Slither: A Static Analysis Framework For Smart Contracts
R.I.P.
π»
Ghosted
ContractFuzzer: Fuzzing Smart Contracts for Vulnerability Detection
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted