Output Format Biases in the Evaluation of Large Language Models for Code Translation
March 25, 2024 Β· Declared Dead Β· π 2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (Forge) Conference Acronym:
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Marcos Macedo, Yuan Tian, Filipe R. Cogo, Bram Adams
arXiv ID
2403.17214
Category
cs.SE: Software Engineering
Cross-listed
cs.AI
Citations
32
Venue
2024 IEEE/ACM First International Conference on AI Foundation Models and Software Engineering (Forge) Conference Acronym:
Last Checked
4 months ago
Abstract
Code translation between programming languages (PLs) is a critical task in software engineering, facilitating the modernization of legacy systems, ensuring cross-platform compatibility, and enhancing software performance. Most existing studies instruct LLMs to perform code translation and evaluate their performance by either running the generated outputs through test suites or comparing them to reference outputs (ground truth). These outputs, however, may contain not only executable source code but also additional non-code elements, such as natural language explanations or formatting tokens. We refer to the combination of source code and non-code elements as the output format. It is crucial to understand and address variations in output format, as non-code elements can interfere with evaluation metrics, resulting in biased assessments of model performance and comparisons. We conduct an empirical analysis of the outputs from eleven instruct-tuned open-source LLMs, across five PLs: C, C++, Go, Java, and Python. The results show that between 26.4% and 73.7% of outputs produced by our evaluated LLMs necessitate post-processing. To mitigate output format bias, we propose a strategic combination of prompt engineering and regular expressions that effectively extracts source code from mixed-format outputs, enabling the eleven open-source models to achieve an average Code Extraction Success Rate (CSR) of 92.73%. Our empirical study confirms that output format bias affects widely used execution-based metrics, i.e., Computational Accuracy (CA), and text-based metrics, i.e., BLEU, CodeBLEU and CrystalBLEU. Additionally, we test five closed-source LLMs and observe that they also generate varying distributions of output formats, which could lead to output format biases. Our results highlight the need to mitigate the output format bias to enable reliable evaluations in LLMs for code translation.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Software Engineering
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Microservices: yesterday, today, and tomorrow
π
π
The Cartographer
A Survey of Machine Learning for Big Code and Naturalness
R.I.P.
π»
Ghosted
An Overview on Smart Contracts: Challenges, Advances and Platforms
R.I.P.
π»
Ghosted
Slither: A Static Analysis Framework For Smart Contracts
R.I.P.
π»
Ghosted
ContractFuzzer: Fuzzing Smart Contracts for Vulnerability Detection
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted