Incoherence as Oracle-less Measure of Error in LLM-Based Code Generation

June 26, 2025 · Declared Dead · 🏛 40th Annual AAAI Conference on Artificial Intelligence (AAAI), 2026

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Thomas Valentin, Ardi Madadi, Gaetano Sapia, Marcel Böhme
arXiv ID: 2507.00057
Category: cs.PL (Programming Languages)
Cross-listed: cs.AI, cs.LG, cs.SE
Citations: 2
Venue: 40th Annual AAAI Conference on Artificial Intelligence (AAAI), 2026
Last Checked: 4 months ago

Abstract

Generating code from a natural language programming task is one of the most successful applications of Large Language Models (LLMs). Yet, the generated program may be buggy. Without an oracle, such as an existing, correct implementation or a formal specification, can we somehow estimate how likely the generated program is correct? In this paper, we propose a measure of incorrectness, called *incoherence*, that can be estimated efficiently in the absence of an oracle and allows us to establish a lower bound on the error, i.e., the probability that the LLM-generated program for that specification is incorrect. In our experiments, our incoherence-based methodology can automatically identify about two-thirds of incorrect programs without reports of false positives for the average task. In fact, *an oracle-based evaluation of LLMs can be reliably replaced by an incoherence-based evaluation*. In particular, we find a very strong agreement between the ranking of LLMs by the number of programs deemed correct via an oracle (pass@1) and the ranking of LLMs by the number of programs deemed correct via incoherence.
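The abstract does not spell out how incoherence is computed. As a minimal sketch, assume it can be approximated by how often independently generated candidate programs for the same task disagree on the same inputs. Under that assumption, a disagreement on an input means at least one of the two candidates is wrong on it, so half the disagreement rate is a lower bound on the error rate (a union-bound argument). The helper names `safe_run`, `estimate_incoherence`, and `error_lower_bound`, the toy task, and the candidate programs are all hypothetical and are not taken from the paper.

```python
import itertools


def safe_run(program, x):
    """Run a candidate on one input; map crashes to a comparable marker."""
    try:
        return ("ok", program(x))
    except Exception as exc:  # a crash is treated as an observable outcome
        return ("error", type(exc).__name__)


def estimate_incoherence(programs, inputs):
    """Fraction of (candidate pair, input) combinations whose outcomes differ.

    Assumption: this pairwise disagreement rate stands in for incoherence.
    """
    pairs = list(itertools.combinations(programs, 2))
    total = len(pairs) * len(inputs)
    if total == 0:
        return 0.0
    disagreements = sum(
        safe_run(p, x) != safe_run(q, x) for p, q in pairs for x in inputs
    )
    return disagreements / total


def error_lower_bound(incoherence):
    """Union bound: a disagreement implies at least one of the two is wrong."""
    return incoherence / 2


# Hypothetical candidates for the task "return the absolute value of x".
candidates = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: x,  # buggy: wrong for negative inputs
]
inputs = list(range(-5, 6))

incoherence = estimate_incoherence(candidates, inputs)
print(f"incoherence ~ {incoherence:.3f}, error >= {error_lower_bound(incoherence):.3f}")
```

In this toy run the buggy candidate disagrees with each of the two correct ones on the five negative inputs, giving 10 disagreements out of 33 comparisons. No reference implementation is consulted at any point, which is what makes the estimate oracle-less. Zero disagreement would not prove correctness, since all candidates could share the same bug; the sketch only yields a lower bound.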

📜 Similar Papers

In the same crypt — Programming Languages

R.I.P. 👻 Ghosted

Ascertaining Uncertainty for Efficient Exact Cache Analysis

Valentin Touzeau, Claire Maïza, ... (+2 more)

cs.PL 🏛 CAV 📚 816 cites 8 years ago

R.I.P. 👻 Ghosted

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

Nicolas Vasilache, Oleksandr Zinenko, ... (+7 more)

cs.PL 🏛 arXiv 📚 472 cites 8 years ago

R.I.P. 👻 Ghosted

Glow: Graph Lowering Compiler Techniques for Neural Networks

Nadav Rotem, Jordan Fix, ... (+16 more)

cs.PL 🏛 arXiv 📚 318 cites 8 years ago

R.I.P. 👻 Ghosted

Learnable Programming: Blocks and Beyond

David Bau, Jeff Gray, ... (+3 more)

cs.PL 🏛 CACM 📚 298 cites 9 years ago

R.I.P. 👻 Ghosted

Scenic: A Language for Scenario Specification and Scene Generation

Daniel J. Fremont, Tommaso Dreossi, ... (+4 more)

cs.PL 🏛 ACM-SIGPLAN Symposium on Programming Language Design and Implementation 📚 297 cites 7 years ago

R.I.P. 👻 Ghosted

Vandal: A Scalable Security Analysis Framework for Smart Contracts

Lexi Brent, Anton Jurisevic, ... (+6 more)

cs.PL 🏛 arXiv 📚 296 cites 7 years ago

Died the same way — 👻 Ghosted

R.I.P. 👻 Ghosted

Federated Learning: Strategies for Improving Communication Efficiency

Jakub Konečný, H. Brendan McMahan, ... (+4 more)

cs.LG 🏛 arXiv 📚 5.2K cites 9 years ago

R.I.P. 👻 Ghosted

In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi, Cliff Young, ... (+73 more)

cs.AR 🏛 ISCA 📚 5.1K cites 9 years ago

R.I.P. 👻 Ghosted

Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning

Hoo-Chang Shin, Holger R. Roth, ... (+7 more)

cs.CV 🏛 IEEE TMI 📚 4.9K cites 10 years ago

R.I.P. 👻 Ghosted

Explanation in Artificial Intelligence: Insights from the Social Sciences

Tim Miller

cs.AI 🏛 AI 📚 4.9K cites 9 years ago