Driver-WM

Abstract

Safe L2/L3 driving automation requires anticipating human-in-the-loop reactions during shared-control transitions. While most driving world models forecast the external environment, in-cabin intelligence remains strictly recognition-oriented and lacks multi-step rollout capabilities for driver dynamics. We introduce Driver-WM, a driver-centric latent world model that rolls out in-cabin dynamics causally conditioned on out-cabin traffic context. This formulation unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition.

Operating in a compact latent space constructed from frozen vision-language features, Driver-WM adopts a dual-stream architecture to separately encode external traffic and internal driver states. These streams are directionally coupled via a gated causal injection mechanism, which uses a learned vector gate to modulate external contextual perturbations while strictly enforcing temporal causality. Evaluations on a multi-task assistive driving benchmark demonstrate robust long-horizon geometric forecasting for reactive high-motion maneuvers and improved semantic alignment for both driver and traffic states. The explicit external-to-internal conditioning further enables controlled test-time interventions for systematic mechanism analysis.

Keywords: World model; Driver monitoring system; Vision-language model; Human-in-the-loop driving automation; Autonomous driving

Highlights

Driver-centric latent world model for dynamics rollout. Driver-WM performs multi-step forecasting of in-cabin driver dynamics, unifying kinematic trajectories and auxiliary semantic factors under traffic-conditioned temporal rollout.

Directionally coupled dual-stream dynamics with gated injection. External traffic context is injected into internal driver dynamics through a time-causal, learned gate, supporting controllable conditioning and intervention analysis.

Foundation-derived state interface with unified decoding. Frozen Qwen3-VL features provide a compact perceptual interface for geometric rollout, physical priors, and semantic regularization.

Method / Architecture

From synchronized in-/out-cabin videos, a frozen Qwen3-VL encoder extracts dual-stream latent features. Pooled external history perturbs the internal transition through gated causal injection, yielding autoregressive internal rollouts decoded into skeleton trajectories and auxiliary semantic predictions.
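The gated injection described above can be illustrated with a small toy rollout. This is a minimal sketch under stated assumptions, not the authors' implementation: the mean pooling, the `tanh` transition, the weight names `W_z`, `W_e`, `W_g`, and the `lam` argument (a stand-in for a scalar gate override) are all hypothetical, and the frozen Qwen3-VL encoder and the skeleton/semantic decoders are omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def init_params(dim, seed=0):
    """Random toy weights (assumption); a trained model would learn these."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]

    return {"W_z": mat(dim, dim), "W_e": mat(dim, dim), "W_g": mat(dim, 2 * dim)}


def pool_history(ext_history):
    """Mean-pool the observed external latents (past frames only)."""
    n = len(ext_history)
    return [sum(frame[d] for frame in ext_history) / n
            for d in range(len(ext_history[0]))]


def gated_step(z, ctx, params, lam=None):
    """One internal transition plus a gated external perturbation.

    lam is a hypothetical scalar override of the gate: 0 disables the
    injection pathway, 1 forces full injection, None uses the vector gate.
    """
    base = [math.tanh(a) for a in matvec(params["W_z"], z)]
    inject = matvec(params["W_e"], ctx)
    if lam is None:
        # Vector gate computed from the internal state and external context.
        gate = [sigmoid(a) for a in matvec(params["W_g"], z + ctx)]
    else:
        gate = [float(lam)] * len(z)
    return [b + g * i for b, g, i in zip(base, gate, inject)]


def rollout(z_last, ext_history, params, horizon=5, lam=None):
    """Autoregressive internal rollout conditioned on observed external history."""
    ctx = pool_history(ext_history)  # no future external frames are used
    z, states = list(z_last), []
    for _ in range(horizon):
        z = gated_step(z, ctx, params, lam)
        states.append(z)
    return states
```

In this sketch the coupling is one-directional: external context perturbs the internal state, but nothing flows back from the internal stream to the external one. Setting `lam=0` makes the rollout independent of `ext_history`, which is the property a "disable pathway" probe relies on.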

Results / Demos

All MPJPE (px): 71.47

All d-nMPJPE: 3.24%

PCK@0.05: 71.66

Traffic Context F1: 90.15

Model	All MPJPE	All d-nMPJPE	HM MPJPE	PCK@0.05	DBR F1	DER F1	TCR F1	VCR F1
MotionBERT	73.51	3.34	141.53	78.01	56.70	52.71	-	-
Static pooling	68.50	3.11	134.56	72.55	54.04	60.07	26.39	14.00
Cross-Attn only	80.14	3.64	142.41	66.43	62.52	68.75	87.94	69.09
Driver-WM (main)	71.47	3.24	138.03	71.66	68.07	72.61	90.15	68.34
Non-causal reference	72.15	3.27	136.50	71.39	73.35	74.46	88.65	68.82

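Main AIDE results under the fixed 5→5 causal rollout protocol (five observed frames, five predicted frames). HM denotes the top 10% high-motion clips ranked by future-window joint displacement. DBR, DER, TCR, and VCR denote the AIDE driver behavior, driver emotion, traffic context, and vehicle condition recognition tasks.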
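For readers unfamiliar with the geometric metrics in the table, the following sketch shows how MPJPE, PCK, and a top-10% high-motion subset could be computed. It is illustrative only: the reference length used to normalize PCK, the exact displacement measure used for ranking, and all function names are assumptions, and d-nMPJPE is not reproduced because its normalization is not specified here.

```python
import math


def joint_errors(pred, gt):
    """Per-joint Euclidean errors; pred and gt are [frames][joints][(x, y)]."""
    return [math.dist(p, g)
            for pred_frame, gt_frame in zip(pred, gt)
            for p, g in zip(pred_frame, gt_frame)]


def mpjpe(pred, gt):
    """Mean per-joint position error, in the units of the inputs (pixels here)."""
    errs = joint_errors(pred, gt)
    return sum(errs) / len(errs)


def pck(pred, gt, ref_length, alpha=0.05):
    """Percentage of joints within alpha * ref_length of the ground truth.

    ref_length is an assumed normalizer (e.g. an image or box dimension).
    """
    errs = joint_errors(pred, gt)
    return 100.0 * sum(e <= alpha * ref_length for e in errs) / len(errs)


def future_displacement(gt_future):
    """Mean joint displacement between consecutive future frames."""
    steps = [math.dist(a, b)
             for prev, nxt in zip(gt_future, gt_future[1:])
             for a, b in zip(prev, nxt)]
    return sum(steps) / len(steps)


def high_motion_ids(clips, fraction=0.10):
    """IDs of the top `fraction` of clips ranked by future-window displacement.

    clips maps a clip ID to its ground-truth future frames.
    """
    ranked = sorted(clips, key=lambda cid: future_displacement(clips[cid]),
                    reverse=True)
    return ranked[:max(1, math.ceil(fraction * len(ranked)))]
```

Lower MPJPE is better, while higher PCK is better, which is why a model can lead on one column and trail on the other.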

[Figure: Controlled interventions and high-motion horizon curve]

Mechanism and Dynamics

Swapping external context or disabling injection changes reactive hand motion. Horizon-wise MPJPE shows Driver-WM mitigates the long-horizon degradation observed in motion-only baselines on the high-motion tail.
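A horizon-wise error curve of the kind described above can be computed by averaging joint errors separately at each future step. The sketch below assumes predictions and ground truth are stored as nested lists of 2D joint coordinates; the function name and data layout are hypothetical.

```python
import math


def horizonwise_mpjpe(preds, gts):
    """MPJPE at each rollout step.

    preds and gts are [clips][steps][joints][(x, y)]; the result has one
    value per step, averaged over all clips and joints.
    """
    horizon = len(preds[0])
    curve = []
    for h in range(horizon):
        errs = [math.dist(p, g)
                for pred_clip, gt_clip in zip(preds, gts)
                for p, g in zip(pred_clip[h], gt_clip[h])]
        curve.append(sum(errs) / len(errs))
    return curve
```

Restricting `preds` and `gts` to the high-motion clips before calling the function gives the high-motion tail curve; a flatter curve indicates less degradation as the horizon grows.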

[Figure: Renderer rollouts under traffic-context interventions]

Renderer Rollout Under Interventions

Five-step visual rollouts compare ground truth, Driver-WM, and controlled interventions that remove external context or disable cross-attention injection, showing how traffic-conditioned cues shape in-cabin dynamics.

[Figure: Qualitative filmstrip of factual and intervened rollouts]

Qualitative Rollout and Intervention

Predicted skeletons for future frames f6-f10 are overlaid on ground-truth in-cabin frames. Removing external context yields a visibly different hand and upper-body trajectory.

Controlled Interventions

Driver-WM enables post-hoc test-time mechanism probes. External context can be swapped, removed, temporally shifted, or partially dropped; the injection path can also be clamped through the scalar override λ_CA. These interventions are not causal effect estimates, but they expose whether the trained rollout actually uses traffic context and whether the gated pathway is necessary.

Intervention	Target	ΔAll	ΔHM	Δh=5	ΔHead	ΔHands
Factual Driver-WM	None	0.000	0.000	0.000	0.000	0.000
do(Ext=Swap_clip)	Cross-video content swap	5.363	4.794	8.042	5.462	3.502
do(Ext=empty)	Remove out-cabin features	12.953	8.691	15.208	7.385	12.937
do(λ_CA=0)	Disable pathway	89.641	62.232	116.751	36.783	95.666
do(λ_CA=1)	Force injection	26.733	11.480	42.170	15.861	26.727

Larger deviations indicate stronger sensitivity to the corresponding intervention. Disabling the injection pathway produces the largest rollout deviation, especially on hand-related joints.
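The external-context interventions listed above can be expressed as simple transformations of the external feature sequence, with sensitivity measured as the deviation between factual and intervened rollouts. The sketch below is a hypothetical illustration: representing "empty" context as zeros, padding a temporal shift with the first frame, dropping whole time steps, and measuring deviation as mean joint distance are all assumptions rather than the protocol used for the table.

```python
import math
import random


def ext_empty(ext):
    """Remove out-cabin features by zeroing every time step (assumed encoding)."""
    return [[0.0] * len(frame) for frame in ext]


def ext_swap(ext, other_ext):
    """Replace the context with features taken from a different video."""
    return [list(frame) for frame in other_ext]


def ext_shift(ext, k):
    """Delay the context by k steps, padding the start with the first frame."""
    return [list(ext[max(0, t - k)]) for t in range(len(ext))]


def ext_drop(ext, p, rng=None):
    """Zero each time step independently with probability p."""
    rng = rng or random.Random(0)
    return [[0.0] * len(frame) if rng.random() < p else list(frame)
            for frame in ext]


def rollout_deviation(factual, intervened):
    """Mean joint distance between two rollouts of shape [steps][joints][(x, y)]."""
    dists = [math.dist(a, b)
             for f_frame, i_frame in zip(factual, intervened)
             for a, b in zip(f_frame, i_frame)]
    return sum(dists) / len(dists)
```

Under these assumptions, each row of the table would correspond to running the same trained model on transformed inputs (or with a clamped gate) and reporting `rollout_deviation` against the factual rollout, optionally restricted to a joint subset such as the head or hands.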

Media Coverage

Featured by 自动驾驶之心

Driver-WM was covered by 自动驾驶之心, a Chinese media platform focusing on autonomous driving research and industry.


BibTeX

@misc{chi2026driverwmdrivercentrictrafficconditionedlatent,
      title={Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout},
      author={Haozhuang Chi and Daosheng Qiu and Hao Su and Haochen Liu and Zirui Li and Haoruo Zhang and Chen Lv},
      year={2026},
      eprint={2605.05092},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.05092},
}

Acknowledgements

Supported by

We thank the AUMOVIO-NTU Corporate Lab for its support and collaborative research environment.