Retrieval-Augmented Test Generation: How Far Are We?

September 19, 2024 · Declared Dead · 🏛 arXiv.org

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors Jiho Shin, Nima Shiri Harzevili, Reem Aleithan, Hadi Hemmati, Song Wang arXiv ID 2409.12682 Category cs.SE: Software Engineering Cross-listed cs.AI Citations 13 Venue arXiv.org Last Checked 4 months ago

Abstract

Retrieval Augmented Generation (RAG) has advanced software engineering tasks but remains underexplored in unit test generation. To bridge this gap, we investigate the efficacy of RAG-based unit test generation for machine learning (ML/DL) APIs and analyze the impact of different knowledge sources on their effectiveness. We examine three domain-specific sources for RAG: (1) API documentation (official guidelines), (2) GitHub issues (developer-reported resolutions), and (3) StackOverflow Q&As (community-driven solutions). Our study focuses on five widely used Python-based ML/DL libraries, TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost, targeting the most-used APIs. We evaluate four state-of-the-art LLMs -- GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llama 3.1 405B -- across three strategies: basic instruction prompting, Basic RAG, and API-level RAG. Quantitatively, we assess syntactical and dynamic correctness and line coverage. While RAG does not enhance correctness, RAG improves line coverage by 6.5% on average. We found that GitHub issues result in the best improvement in line coverage by providing edge cases from various issues. We also found that these generated unit tests can help detect new bugs. Specifically, 28 bugs were detected, 24 unique bugs were reported to developers, ten were confirmed, four were rejected, and ten are awaiting developers' confirmation. Our findings highlight RAG's potential in unit test generation for improving test coverage with well-targeted knowledge sources. Future work should focus on retrieval techniques that identify documents with unique program states to optimize RAG-based unit test generation further.