Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks
October 28, 2024 Β· Declared Dead Β· π arXiv.org
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Eitan Farchi, Shmulik Froimovich, Rami Katan, Orna Raz
arXiv ID
2410.21071
Category
cs.SE: Software Engineering
Citations
11
Venue
arXiv.org
Last Checked
4 months ago
Abstract
LLMs can be used in a variety of code related tasks such as translating from one programming language to another, implementing natural language requirements and code summarization. Artifacts generated by state of the art LLM technology are expected to be useful in the sense that a user will be able to use the LLM generated artifact after a small number of easy modifications. Quantifying this vague notion is challenging and it is thus hard to determine the quality of code related LLM solutions. We refer to evaluation of LLM solutions using LLM judgment as "LLM as a Judge", or LaaJ for short. In this work we introduce a methodology to generate and evaluate LaaJ implementations, utilizing an automatically generated benchmark. The purpose of the benchmark is two fold, namely, it is used both to develop and validate the LaaJs and to validate and test the LLM code related solution using the LaaJs. To that end, we developed an automated benchmark generation engine, which generates code in multiple programming languages for multiple code related tasks and which serves as the input for LaaJ evaluation. We utilize a graph representation, G, of the potential code related generations. The graph vertices are generated artifacts and edges represent possible generations, e.g., the generation of a Java program from its natural language requirements. Utilizing a chain of LLM agents and G we generate code related artifacts. Using cycles in G we formulate expectations on the generated artifacts. Taking advantage of these formulated expectations enables the development and testing of reliable LLM judgement for usefulness of the artifacts generated by the solution. Our approach enables the creation of high quality code task solutions.
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
π Similar Papers
In the same crypt β Software Engineering
R.I.P.
π»
Ghosted
R.I.P.
π»
Ghosted
Microservices: yesterday, today, and tomorrow
π
π
The Cartographer
A Survey of Machine Learning for Big Code and Naturalness
R.I.P.
π»
Ghosted
An Overview on Smart Contracts: Challenges, Advances and Platforms
R.I.P.
π»
Ghosted
Slither: A Static Analysis Framework For Smart Contracts
R.I.P.
π»
Ghosted
ContractFuzzer: Fuzzing Smart Contracts for Vulnerability Detection
Died the same way β π» Ghosted
R.I.P.
π»
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
π»
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
π»
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
π»
Ghosted