Towards Safer and Understandable Driver Intention Prediction

Overall architecture of the proposed VCBM. The dual video encoder first generates spatio-temporal features (tubelet embeddings) for the paired ego-vehicle and eye-gaze input sequences. These tubelets are concatenated along the channel dimension and fed into the proposed learnable token-merging block, which produces K cluster centers based on composite distances. The clusters are then passed to a localized concept bottleneck that disentangles them and predicts the maneuver label along with one or more explanations justifying the maneuver decision.
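
To make the data flow concrete, here is a minimal PyTorch sketch of the pipeline described above. All module names, tensor shapes, and hyperparameters are illustrative assumptions, and plain Euclidean distance stands in for the composite distance; this is a sketch of the idea, not the authors' implementation.

```python
# Minimal sketch of the VCBM pipeline. Shapes, module names, and the distance
# used for token merging are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class TubeletEncoder(nn.Module):
    """Stand-in video encoder: a 3D conv that produces tubelet embeddings."""

    def __init__(self, dim=128):
        super().__init__()
        # Patchify each clip into spatio-temporal tubelets (2 frames x 16 x 16 pixels).
        self.proj = nn.Conv3d(3, dim, kernel_size=(2, 16, 16), stride=(2, 16, 16))

    def forward(self, video):                     # video: (B, 3, T, H, W)
        tubes = self.proj(video)                  # (B, dim, T', H', W')
        return tubes.flatten(2).transpose(1, 2)   # (B, N, dim) tubelet tokens


class TokenMerging(nn.Module):
    """Learnable token merging: soft-assign N tokens to K cluster centers."""

    def __init__(self, dim, k=8):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(k, dim))

    def forward(self, tokens):                    # tokens: (B, N, dim)
        # Plain Euclidean distance as a placeholder for the composite distance.
        dist = torch.cdist(tokens, self.centers.expand(tokens.size(0), -1, -1))
        assign = torch.softmax(-dist, dim=1)      # (B, N, K) soft assignment
        return assign.transpose(1, 2) @ tokens    # (B, K, dim) cluster tokens


class VCBM(nn.Module):
    def __init__(self, dim=128, k=8, n_maneuvers=5, n_concepts=20):
        super().__init__()
        self.ego_enc, self.gaze_enc = TubeletEncoder(dim), TubeletEncoder(dim)
        self.fuse = nn.Linear(2 * dim, dim)       # concat along channels, then mix
        self.merge = TokenMerging(dim, k)
        self.concepts = nn.Linear(dim, n_concepts)           # per-cluster concept logits
        self.head = nn.Linear(k * n_concepts, n_maneuvers)   # maneuver from concepts only

    def forward(self, ego, gaze):
        tokens = torch.cat([self.ego_enc(ego), self.gaze_enc(gaze)], dim=-1)
        clusters = self.merge(self.fuse(tokens))  # (B, K, dim)
        concept_logits = self.concepts(clusters)  # (B, K, n_concepts) explanations
        maneuver_logits = self.head(concept_logits.flatten(1))
        return maneuver_logits, concept_logits


ego = gaze = torch.randn(2, 3, 8, 64, 64)         # toy ego-vehicle / gaze clip pair
maneuver, concepts = VCBM()(ego, gaze)
print(maneuver.shape, concepts.shape)             # torch.Size([2, 5]) torch.Size([2, 8, 20])
```

Note that the maneuver head sees only the concept logits, so every prediction is forced through the explanation layer; that bottleneck is what makes the explanations intrinsic rather than post-hoc.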

Abstract

Autonomous driving (AD) systems increasingly rely on deep learning, but understanding why they make decisions remains a challenge. To address this, we introduce Driver Intention Prediction (DIP) — the task of anticipating driver maneuvers before they happen, with a focus on interpretability. We present DAAD-X, a new multimodal, egocentric video dataset offering high-level, hierarchical textual explanations grounded in both driver eye-gaze and vehicle perspective. To model this, we propose the Video Concept Bottleneck Model (VCBM) — a framework that inherently generates spatio-temporally coherent explanations without post-hoc methods. Through extensive experiments, we show that transformer-based models offer superior interpretability over CNNs. We also introduce a multilabel t-SNE visualization to highlight the causal structure among explanations.
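
For readers curious what a multilabel t-SNE plot might look like in practice, below is a small sketch using scikit-learn on synthetic data. The feature dimensions and explanation-label names are invented for illustration; the actual DAAD-X features and label set are not reproduced here.

```python
# Sketch of a multilabel t-SNE visualization of concept (explanation) activations.
# Data and label names are synthetic placeholders, not the paper's results.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 64))            # per-sample concept activations (assumed)
labels = rng.integers(0, 2, size=(300, 4))    # 4 binary explanation labels (assumed)

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(feats)

# Overlay one scatter layer per label so co-occurring explanations stay visible.
fig, ax = plt.subplots()
for j, name in enumerate(["lane-change", "brake-light", "pedestrian", "signal"]):
    mask = labels[:, j] == 1
    ax.scatter(emb[mask, 0], emb[mask, 1], s=12, alpha=0.5, label=name)
ax.legend()
plt.show()
```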

Video Presentation

Poster

BibTeX

BibTex Code Here