VLATest: Testing and Evaluating Vision-Language-Action Models for Robotic Manipulation

September 19, 2024 · Declared Dead · 🏛 Proc. ACM Softw. Eng.

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Zhijie Wang, Zhehua Zhou, Jiayang Song, Yuheng Huang, Zhan Shu, Lei Ma
arXiv ID: 2409.12894
Category: cs.SE (Software Engineering)
Cross-listed: cs.RO
Citations: 15
Venue: Proc. ACM Softw. Eng.
Last Checked: 4 months ago

Abstract

The rapid advancement of generative AI and multi-modal foundation models has shown significant potential for robotic manipulation. Vision-language-action (VLA) models, in particular, have emerged as a promising approach for visuomotor control by leveraging large-scale vision-language data and robot demonstrations. However, current VLA models are typically evaluated using a limited set of hand-crafted scenes, leaving their general performance and robustness in diverse scenarios largely unexplored. To address this gap, we present VLATest, a fuzzing framework designed to generate robotic manipulation scenes for testing VLA models. Based on VLATest, we conducted an empirical study to assess the performance of seven representative VLA models. Our results revealed that current VLA models lack the robustness necessary for practical deployment. Additionally, we investigated the impact of various factors, including the number of confounding objects, lighting conditions, camera poses, unseen objects, and task instruction mutations, on VLA model performance. Our findings highlight the limitations of existing VLA models, emphasizing the need for further research to develop reliable and trustworthy VLA applications.