Evaluating Large Language Models for Functional and Maintainable Code in Industrial Settings: A Case Study at ASML

September 15, 2025 · Declared Dead · 🏛 International Conference on Automated Software Engineering

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Yash Mundhra, Max Valk, Maliheh Izadi
arXiv ID: 2509.12395
Category: cs.SE (Software Engineering)
Cross-listed: cs.AI
Citations: 0
Venue: International Conference on Automated Software Engineering
Last Checked: 4 months ago

Abstract

Large language models have shown impressive performance in various domains, including code generation across diverse open-source domains. However, their applicability in proprietary industrial settings, where domain-specific constraints and code interdependencies are prevalent, remains largely unexplored. We present a case study conducted in collaboration with the leveling department at ASML to investigate the performance of LLMs in generating functional, maintainable code within a closed, highly specialized software environment. We developed an evaluation framework tailored to ASML's proprietary codebase and introduced a new benchmark. Additionally, we proposed a new evaluation metric, build@k, to assess whether LLM-generated code successfully compiles and integrates within real industrial repositories. We investigate various prompting techniques, compare the performance of generic and code-specific LLMs, and examine the impact of model size on code generation capabilities, using both match-based and execution-based metrics. The findings reveal that prompting techniques and model size have a significant impact on output quality, with few-shot and chain-of-thought prompting yielding the highest build success rates. The difference in performance between the code-specific LLMs and generic LLMs was less pronounced and varied substantially across different model families.
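
The abstract names build@k but does not give its formula. As a rough illustration only, the sketch below assumes build@k is computed like the widely used unbiased pass@k estimator, with "the sample builds" replacing "the sample passes the tests". The function names `build_at_k` and `mean_build_at_k` and the input layout are hypothetical and are not taken from the paper.

```python
from math import comb


def build_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples builds.

    Assumption: this mirrors the unbiased pass@k estimator,
    1 - C(n - c, k) / C(n, k), where n is the number of generated
    samples for a task and c is how many of them built successfully.
    The paper's actual definition of build@k may differ.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a success.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_build_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average build@k over tasks; each entry is (n_samples, n_built)."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(build_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for three tasks, ten samples each.
    tasks = [(10, 0), (10, 5), (10, 10)]
    print(mean_build_at_k(tasks, k=1))
```

Under this assumed definition, build@1 reduces to the plain fraction of samples that build, and larger k rewards a model that builds at least occasionally on each task.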