Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout

1Nanyang Technological University 2Hubei University 3Osaka University

Driver-WM moves beyond recognition-only driver monitoring and scene-centric world models by rolling out future in-cabin driver dynamics conditioned on synchronized out-cabin traffic observations. The page highlights controlled interventions where external traffic context is removed, swapped, or gated to probe how driver motion changes over the prediction horizon.

Abstract

Safe L2/L3 driving automation requires anticipating human-in-the-loop reactions during shared-control transitions. While most driving world models forecast the external environment, in-cabin intelligence remains strictly recognition-oriented and lacks multi-step rollout capabilities for driver dynamics. We introduce Driver-WM, a driver-centric latent world model that rolls out in-cabin dynamics causally conditioned on out-cabin traffic context. This formulation unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition.

Operating in a compact latent space constructed from frozen vision-language features, Driver-WM adopts a dual-stream architecture to separately encode external traffic and internal driver states. These streams are directionally coupled via a gated causal injection mechanism, which uses a learned vector gate to modulate external contextual perturbations while strictly enforcing temporal causality. Evaluations on a multi-task assistive driving benchmark demonstrate robust long-horizon geometric forecasting for reactive high-motion maneuvers and improved semantic alignment for both driver and traffic states. The explicit external-to-internal conditioning further enables controlled test-time interventions for systematic mechanism analysis.

Keywords: World model; Driver monitoring system; Vision-language model; Human-in-the-loop driving automation; Autonomous driving

Highlights

Driver-centric latent world model for dynamics rollout. Driver-WM performs multi-step forecasting of in-cabin driver dynamics, unifying kinematic trajectories and auxiliary semantic factors under traffic-conditioned temporal rollout.

Directionally coupled dual-stream dynamics with gated injection. External traffic context is injected into internal driver dynamics through a time-causal, learned gate, supporting controllable conditioning and intervention analysis.

Foundation-derived state interface with unified decoding. Frozen Qwen3-VL features provide a compact perceptual interface for geometric rollout, physical priors, and semantic regularization.

Method / Architecture

[Figure: Overall architecture of Driver-WM]

From synchronized in-/out-cabin videos, a frozen Qwen3-VL encoder extracts dual-stream latent features. Pooled external history perturbs the internal transition through gated causal injection, yielding autoregressive internal rollouts decoded into skeleton trajectories and auxiliary semantic predictions.
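The coupling described above can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions, not the authors' implementation: external history is pooled with a causal prefix mean instead of cross-attention, the weights are random stand-ins, and every name (`init_params`, `causal_mean_pool`, `gated_step`, `rollout`, `lam`) is hypothetical. It only shows how a vector gate can modulate an external perturbation of the internal transition while the context is built from observed history alone.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(dim, seed=0):
    """Random stand-ins for learned weights (hypothetical names and shapes)."""
    rng = np.random.default_rng(seed)
    return {
        "W_z": rng.normal(scale=0.3, size=(dim, dim)),      # internal transition
        "W_c": rng.normal(scale=0.3, size=(dim, dim)),      # external projection
        "W_g": rng.normal(scale=0.3, size=(dim, 2 * dim)),  # gate weights
        "b_g": np.zeros(dim),                               # gate bias
    }

def causal_mean_pool(ext_feats):
    """Prefix mean over time: row t averages external features from steps 0..t only."""
    counts = np.arange(1, ext_feats.shape[0] + 1)[:, None]
    return np.cumsum(ext_feats, axis=0) / counts

def gated_step(z, ctx, params, lam=None):
    """One internal transition perturbed by gated external context.

    z:   (D,) internal driver latent.
    ctx: (D,) pooled external context.
    lam: optional scalar that overrides the learned gate (assumed behaviour).
    """
    gate = sigmoid(params["W_g"] @ np.concatenate([z, ctx]) + params["b_g"])
    if lam is not None:
        gate = np.full_like(gate, lam)
    return np.tanh(params["W_z"] @ z) + gate * (params["W_c"] @ ctx)

def rollout(z0, ext_hist, params, horizon=5, lam=None):
    """Autoregressive rollout; context comes from observed external history only."""
    ctx = causal_mean_pool(ext_hist)[-1]
    z, states = z0, []
    for _ in range(horizon):
        z = gated_step(z, ctx, params, lam)
        states.append(z)
    return np.stack(states)  # (horizon, D)
```

In this sketch, passing `lam=0.0` removes any dependence on the external stream, and `lam=1.0` forces full injection regardless of the learned gate.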

Results / Demos

All MPJPE (px): 71.47
All d-nMPJPE: 3.24%
PCK@0.05: 71.66
Traffic Context F1: 90.15
| Model | All MPJPE (px) | All d-nMPJPE (%) | HM MPJPE (px) | PCK@0.05 | DBR F1 | DER F1 | TCR F1 | VCR F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MotionBERT | 73.51 | 3.34 | 141.53 | 78.01 | 56.70 | 52.71 | - | - |
| Static pooling | 68.50 | 3.11 | 134.56 | 72.55 | 54.04 | 60.07 | 26.39 | 14.00 |
| Cross-Attn only | 80.14 | 3.64 | 142.41 | 66.43 | 62.52 | 68.75 | 87.94 | 69.09 |
| Driver-WM (main) | 71.47 | 3.24 | 138.03 | 71.66 | 68.07 | 72.61 | 90.15 | 68.34 |
| Non-causal reference | 72.15 | 3.27 | 136.50 | 71.39 | 73.35 | 74.46 | 88.65 | 68.82 |

Main results on AIDE under the fixed 5→5 causal rollout protocol (five observed frames, five predicted frames). HM denotes the top 10% of clips by motion, ranked by joint displacement over the future window.
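As a rough illustration of how the MPJPE columns and the HM subset could be computed, the sketch below assumes 2D joints in pixels stored as arrays of shape (clips, future steps, joints, 2). The displacement measure (summed frame-to-frame joint movement) and the names `mpjpe` and `high_motion_mask` are assumptions; the exact definitions used for the table may differ.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error; pred and gt have shape (N, T, J, 2), in pixels."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def high_motion_mask(gt_future, top_frac=0.10):
    """Boolean mask selecting the top `top_frac` of clips by future joint displacement."""
    # Assumed displacement: frame-to-frame joint movement summed over steps and joints.
    step_moves = np.linalg.norm(np.diff(gt_future, axis=1), axis=-1)  # (N, T-1, J)
    disp = step_moves.sum(axis=(1, 2))                                # (N,)
    k = max(1, int(round(top_frac * len(disp))))
    mask = np.zeros(len(disp), dtype=bool)
    mask[np.argsort(disp)[-k:]] = True
    return mask
```

An HM score would then be `mpjpe(pred[mask], gt[mask])` with `mask = high_motion_mask(gt)`.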

[Figure: Controlled interventions and high-motion horizon curve]

Mechanism and Dynamics

Swapping external context or disabling injection changes reactive hand motion. Horizon-wise MPJPE shows Driver-WM mitigates the long-horizon degradation observed in motion-only baselines on the high-motion tail.
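A horizon-wise error curve of this kind can be obtained by averaging joint errors over clips and joints while keeping the time axis. The sketch below is illustrative only; the array layout (clips, future steps, joints, 2) and the function name `horizon_mpjpe` are assumptions.

```python
import numpy as np

def horizon_mpjpe(pred, gt):
    """Per-step MPJPE: inputs of shape (N, T, J, 2) give one error value per future step."""
    per_joint_error = np.linalg.norm(pred - gt, axis=-1)  # (N, T, J)
    return per_joint_error.mean(axis=(0, 2))              # (T,)
```

Applying it to the high-motion subset only, for each model, gives one curve per model that can be compared across steps 1 to 5.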

[Figure: Renderer rollouts under traffic-context interventions]

Renderer Rollout Under Interventions

Five-step visual rollouts compare ground truth, Driver-WM, and controlled interventions that remove external context or disable cross-attention injection, showing how traffic-conditioned cues shape in-cabin dynamics.

[Figure: Qualitative filmstrip of factual and intervened rollouts]

Qualitative Rollout and Intervention

Predicted skeletons for future frames f6-f10 are overlaid on ground-truth in-cabin frames. Removing external context yields a visibly different hand and upper-body trajectory.

Controlled Interventions

Driver-WM supports post-hoc mechanism probes at test time. External context can be swapped, removed, temporally shifted, or partially dropped; the injection path can also be clamped through the scalar override λ_CA. These interventions are not causal effect estimates, but they show whether the trained rollout actually uses traffic context and whether the gated pathway is necessary.
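The external-context interventions listed above can be sketched as simple array operations on a batch of external features with shape (clips, steps, feature dim). This is a hypothetical illustration: representing "removed" context as zeros, swapping by rolling the clip axis, padding a temporal shift with the earliest frame, and the function name `intervene_external` are all assumptions rather than the procedure used for the reported numbers.

```python
import numpy as np

def intervene_external(ext, kind, shift=1, drop_prob=0.5, rng=None):
    """Return an intervened copy of external features ext with shape (N, T, D)."""
    if rng is None:
        rng = np.random.default_rng(0)
    if kind == "swap_clip":
        # Each clip receives the external context of another clip (needs N > 1).
        return np.roll(ext, 1, axis=0)
    if kind == "empty":
        # Assumed encoding of "no external context": all-zero features.
        return np.zeros_like(ext)
    if kind == "time_shift":
        # Delay context by `shift` steps, padding with the earliest observed frame.
        shifted = np.roll(ext, shift, axis=1)
        shifted[:, :shift] = ext[:, :1]
        return shifted
    if kind == "partial_drop":
        # Zero out whole time steps independently with probability `drop_prob`.
        keep = rng.random(ext.shape[:2]) >= drop_prob
        return ext * keep[..., None]
    raise ValueError(f"unknown intervention: {kind}")
```

The intervened features are then fed to the same trained rollout in place of the factual ones, so any change in the predicted skeletons is attributable to the modified context.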

| Intervention | Target | ΔAll | ΔHM | Δh=5 | ΔHead | ΔHands |
| --- | --- | --- | --- | --- | --- | --- |
| Factual Driver-WM | None | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| do(Ext=Swap_clip) | Cross-video content swap | 5.363 | 4.794 | 8.042 | 5.462 | 3.502 |
| do(Ext=empty) | Remove out-cabin features | 12.953 | 8.691 | 15.208 | 7.385 | 12.937 |
| do(λ_CA=0) | Disable pathway | 89.641 | 62.232 | 116.751 | 36.783 | 95.666 |
| do(λ_CA=1) | Force injection | 26.733 | 11.480 | 42.170 | 15.861 | 26.727 |

Larger deviations indicate stronger sensitivity to the corresponding intervention. Disabling the injection pathway produces the largest rollout deviation, especially on hand-related joints.
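One plausible way to compute deviations of this kind is to measure the mean joint distance between the factual rollout and the intervened rollout, overall, at the last horizon step, and per joint group. The sketch below assumes exactly that; the Δ columns may instead be defined against ground truth, and the joint-group indices and the name `rollout_deviation` are hypothetical.

```python
import numpy as np

def rollout_deviation(factual, intervened, joint_groups):
    """Mean joint distance between two rollouts of shape (N, T, J, 2).

    joint_groups maps a group name to a list of joint indices (hypothetical layout).
    """
    dist = np.linalg.norm(intervened - factual, axis=-1)  # (N, T, J)
    report = {
        "all": float(dist.mean()),
        "h_last": float(dist[:, -1].mean()),  # deviation at the final horizon step
    }
    for name, idx in joint_groups.items():
        report[name] = float(dist[:, :, idx].mean())
    return report
```

By construction the factual rollout compared with itself yields zero in every entry, which matches the first row of the table.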

Media Coverage

Featured by 自动驾驶之心

Driver-WM was covered by 自动驾驶之心, a Chinese media platform focusing on autonomous driving research and industry.


BibTeX

@misc{chi2026driverwmdrivercentrictrafficconditionedlatent,
      title={Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout},
      author={Haozhuang Chi and Daosheng Qiu and Hao Su and Haochen Liu and Zirui Li and Haoruo Zhang and Chen Lv},
      year={2026},
      eprint={2605.05092},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.05092},
}

Acknowledgements


We thank the AUMOVIO-NTU Corporate Lab for its support and collaborative research environment.