(Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs

November 18, 2023 · Declared Dead · 🏛 2024 IEEE/ACM 3rd International Conference on AI Engineering – Software Engineering for AI (CAIN)

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Wanqin Ma, Chenyang Yang, Christian Kästner
arXiv ID: 2311.11123
Category: cs.SE (Software Engineering)
Cross-listed: cs.CL
Citations: 27
Venue: 2024 IEEE/ACM 3rd International Conference on AI Engineering – Software Engineering for AI (CAIN)
Last Checked: 4 months ago

Abstract
Large Language Models (LLMs) are increasingly integrated into software applications. Downstream application developers often access LLMs through APIs provided as a service. However, LLM APIs are often updated silently and scheduled to be deprecated, forcing users to continuously adapt to evolving models. This can cause performance regression and affect prompt design choices, as evidenced by our case study on toxicity detection. Based on our case study, we emphasize the need for and re-examine the concept of regression testing for evolving LLM APIs. We argue that regression testing LLMs requires fundamental changes to traditional testing approaches, due to different correctness notions, prompting brittleness, and non-determinism in LLM APIs.
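
The kind of regression check the abstract argues for can be pictured with a minimal sketch. This is an illustration only, not the authors' method: `accuracy`, `has_regressed`, the tolerance, the number of runs, the tiny dataset, and both stand-in classifiers are hypothetical. A real setup would replace the stand-ins with calls to two versions of an LLM API. Repeating each evaluation several times is one simple way to account for non-deterministic outputs, and comparing aggregate accuracy rather than exact outputs reflects the looser notion of correctness.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, bool]          # (text, is_toxic label)
Classifier = Callable[[str], bool]  # stand-in for "prompt + model version"

def accuracy(classify: Classifier, data: List[Example], runs: int = 5) -> float:
    """Mean accuracy over several runs, to average out non-deterministic outputs."""
    correct = 0
    for _ in range(runs):
        correct += sum(classify(text) == label for text, label in data)
    return correct / (runs * len(data))

def has_regressed(old: Classifier, new: Classifier, data: List[Example],
                  tolerance: float = 0.05, runs: int = 5) -> bool:
    """Flag a regression when the new version is worse by more than `tolerance`."""
    return accuracy(old, data, runs) - accuracy(new, data, runs) > tolerance

# Hypothetical labeled examples for a toxicity-detection task.
DATA: List[Example] = [
    ("You are an idiot", True),
    ("Have a nice day", False),
    ("I hate you", True),
    ("Thanks for the help", False),
]

# Hypothetical stand-ins for two API versions (real code would call the API).
def old_model(text: str) -> bool:
    return any(word in text.lower() for word in ("idiot", "hate"))

def new_model(text: str) -> bool:
    return False  # simulates an update that stops flagging toxic text

if __name__ == "__main__":
    print("old accuracy:", accuracy(old_model, DATA))
    print("new accuracy:", accuracy(new_model, DATA))
    print("regression:", has_regressed(old_model, new_model, DATA))
```

The tolerance is a design choice: set too tightly, ordinary run-to-run variation triggers false alarms; set too loosely, real regressions slip through.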
