Joint Perception and Prediction for Autonomous Driving: A Survey

December 18, 2024 · The Cartographer · 🏛 IEEE transactions on intelligent transportation systems (Print)

Authors: Lucas Dal'Col, Miguel Oliveira, Vítor Santos
arXiv ID: 2412.14088
Category: cs.CV (Computer Vision)
Cross-listed: cs.RO
Citations: 8
Venue: IEEE transactions on intelligent transportation systems (Print)
Last Checked: 3 days ago

Abstract

Perception and prediction modules are critical components of autonomous driving systems, enabling vehicles to navigate safely through complex environments. The perception module is responsible for perceiving the environment, including static and dynamic objects, while the prediction module is responsible for predicting the future behavior of these objects. These modules are typically divided into three tasks: object detection, object tracking, and motion prediction. Traditionally, these tasks are developed and optimized independently, with outputs passed sequentially from one to the next. However, this approach has significant limitations: computational resources are not shared across tasks, the lack of joint optimization can amplify errors as they propagate throughout the pipeline, and uncertainty is rarely propagated between modules, resulting in significant information loss. To address these challenges, the joint perception and prediction paradigm has emerged, integrating perception and prediction into a unified model through multi-task learning. This strategy not only overcomes the limitations of previous methods, but also enables the three tasks to have direct access to raw sensor data, allowing richer and more nuanced environmental interpretations. This paper presents the first comprehensive survey of joint perception and prediction for autonomous driving. We propose a taxonomy that categorizes approaches based on input representation, scene context modeling, and output representation, highlighting their contributions and limitations. Additionally, we present a qualitative analysis and quantitative comparison of existing methods. Finally, we discuss future research directions based on identified gaps in the state-of-the-art.