Multimodal Framework for Explainable Autonomous Driving: Integrating Video, Sensor, and Textual Data for Enhanced Decision-Making and Transparency

July 10, 2025 · Declared Dead · 🏛 arXiv.org

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Abolfazl Zarghani, Amirhossein Ebrahimi, Amir Malekesfandiari
arXiv ID: 2507.07938
Category: cs.MM (Multimedia)
Citations: 2
Venue: arXiv.org
Last Checked: 3 months ago
Abstract
Autonomous vehicles (AVs) are poised to redefine transportation by enhancing road safety, minimizing human error, and optimizing traffic efficiency. The success of AVs depends on their ability to interpret complex, dynamic environments through diverse data sources, including video streams, sensor measurements, and contextual textual information. However, seamlessly integrating these multimodal inputs and ensuring transparency in AI-driven decisions remain formidable challenges. This study introduces a novel multimodal framework that synergistically combines video, sensor, and textual data to predict driving actions while generating human-readable explanations, fostering trust and regulatory compliance. By leveraging VideoMAE for spatiotemporal video analysis, a custom sensor fusion module for real-time data processing, and BERT for textual comprehension, our approach achieves robust decision-making and interpretable outputs. Evaluated on the BDD-X (21113 samples) and nuScenes (1000 scenes) datasets, our model reduces training loss from 5.7231 to 0.0187 over five epochs, attaining an action prediction accuracy of 92.5% and a BLEU-4 score of 0.75 for explanation quality, outperforming state-of-the-art methods. Ablation studies confirm the critical role of each modality, while qualitative analyses and human evaluations highlight the model's ability to produce contextually rich, user-friendly explanations. These advancements underscore the transformative potential of multimodal integration and explainability in building safe, transparent, and trustworthy AV systems, paving the way for broader societal adoption of autonomous driving technologies.
Community shame:
Not yet rated

📜 Similar Papers

In the same crypt: Multimedia

R.I.P. 👻 Ghosted

Video Generation From Text

Yitong Li, Martin Renqiang Min, ... (+3 more)

cs.MM · 🏛 AAAI · 📚 300 cites · 8 years ago

Died the same way: 👻 Ghosted