ACWM-Phys: Generalized Physical Interaction in Action-Conditioned Video World Models

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

A benchmark of 8 physics-rich environments spanning four interaction regimes — rigid-body, deformable, particle, and kinematics — paired with ACWM-DiT, a latent diffusion transformer trained with flow matching. We probe whether modern action-conditioned video models actually learn physics, or merely appearance.

8 physics environments · 4 interaction regimes · 15K+ simulated trajectories · InD / OoD controlled splits

ACWM-Phys is designed to answer two questions about action-conditioned video world models:

How well can ACWMs learn different types of physics?

Existing benchmarks are confined to egocentric navigation or narrow rigid-body manipulation. ACWM-Phys spans rigid-body, deformable, particle, and kinematic regimes so we can finally compare prediction quality across physical phenomena rather than within a single domain.

Can they generalize beyond the training distribution?

Every environment ships a physically meaningful, exactly reproducible distribution shift — doubled particle counts, unseen cube counts, larger cloth, expanded goal regions — so we can measure the InD → OoD gap and pinpoint where diffusion world models still rely on appearance shortcuts.

Category	Environment	InD MSE↓	InD SSIM↑	InD PSNR↑	OoD MSE↓	OoD SSIM↑	OoD PSNR↑	ΔSSIM (OoD − InD)
Rigid	Push Cube	2.92	0.955	25.35	2.95	0.954	25.30	−0.001
Rigid	Stack Cube	5.52	0.889	22.58	7.00	0.872	21.55	−0.017
Deformable	Push Rope	0.21	0.988	36.70	0.33	0.985	34.83	−0.003
Deformable	Cloth Move	10.67	0.920	19.72	23.82	0.864	16.23	−0.056
Particle	Push Sand	0.52	0.975	32.85	1.53	0.941	28.16	−0.034
Particle	Pour Water	2.63	0.911	25.80	3.49	0.874	24.57	−0.037
Kinematics	Robot Arm	1.43	0.969	28.43	6.56	0.902	21.83	−0.067
Kinematics	Reacher	0.26	0.992	35.85	0.27	0.992	35.65	0.000
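
The MSE and PSNR values in the table above appear to be linked by the standard definition PSNR = 10·log10(MAX²/MSE). The sketch below checks this on the in-distribution rows under two assumptions that the page does not state: pixel values lie in [0, 1], and MSE is reported in units of 10⁻³. The names `MSE_SCALE`, `MAX_PIXEL`, and `psnr_from_mse` are illustrative and not part of any ACWM-Phys code.

```python
import math

# In-distribution rows from the table above: (environment, MSE, reported PSNR in dB).
IND_ROWS = [
    ("Push Cube", 2.92, 25.35),
    ("Stack Cube", 5.52, 22.58),
    ("Push Rope", 0.21, 36.70),
    ("Cloth Move", 10.67, 19.72),
    ("Push Sand", 0.52, 32.85),
    ("Pour Water", 2.63, 25.80),
    ("Robot Arm", 1.43, 28.43),
    ("Reacher", 0.26, 35.85),
]

MSE_SCALE = 1e-3  # assumption: the MSE column is in units of 1e-3
MAX_PIXEL = 1.0   # assumption: frames are normalized to [0, 1]


def psnr_from_mse(mse, max_pixel=MAX_PIXEL):
    """PSNR in dB for a given mean squared error."""
    return 10.0 * math.log10(max_pixel ** 2 / mse)


# Recompute PSNR from the MSE column and compare with the reported value.
recomputed = {name: psnr_from_mse(mse * MSE_SCALE) for name, mse, _ in IND_ROWS}
residuals = {name: recomputed[name] - reported for name, _, reported in IND_ROWS}

for name, _, reported in IND_ROWS:
    print(f"{name:<11} reported {reported:5.2f} dB   recomputed {recomputed[name]:5.2f} dB")
```

Under these assumptions every recomputed value lands within about 0.1 dB of the reported one. The small residuals are consistent with rounding in the MSE column, or with PSNR being averaged per clip rather than derived from the averaged MSE.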

What actually moves the needle?

We probe four design axes: action conditioning, latent space, action dimensionality, and model scale. Each finding is anchored on the same evaluation protocol as the main results.

Action conditioning

Cross-attention scales with action dimensionality — AdaLN doesn't

At d_a = 2 (Push Cube, Push Rope), AdaLN and cross-attention tie. At d_a = 7 (Robot Arm), cross-attention gains +3.18 dB PSNR InD and +1.55 dB OoD. At d_a = 8 (Cloth Move), it slightly hurts InD but yields a striking +6.59 dB OoD: AdaLN's single global vector can't carry per-arm signals.

Environment	d_a	AdaLN · OoD PSNR	Cross-Attn · OoD PSNR	Δ
Push Cube	2	25.30	25.18	≈
Push Rope	2	34.83	34.77	≈
Robot Arm	7	21.83	23.38	+1.55
Cloth Move	8	16.23	22.82	+6.59
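
To make the contrast concrete, here is a minimal pure-Python sketch of the two conditioning styles. It is a generic illustration, not the ACWM-DiT layers: the page does not describe the actual architecture, so the token size, the random weights, the split of the 8-dimensional action into two 4-dimensional per-arm groups, and the omission of separate query/key/value projections are all assumptions. AdaLN turns the whole action vector into one (scale, shift) pair shared by every video token, while cross-attention keeps each action group as its own token that video tokens can attend to individually.

```python
import math
import random


def layer_norm(x, eps=1e-6):
    """Normalize one token to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight, x):
    """Matrix-vector product; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def adaln(tokens, action, w_scale, w_shift):
    """AdaLN: one global (scale, shift) from the full action, applied to all tokens."""
    scale = linear(w_scale, action)
    shift = linear(w_shift, action)
    return [
        [n * (1.0 + s) + b for n, s, b in zip(layer_norm(tok), scale, shift)]
        for tok in tokens
    ]


def cross_attention(tokens, action_groups, w_embed):
    """Cross-attention: each action group is embedded as its own key/value token."""
    keys = [linear(w_embed, group) for group in action_groups]  # keys double as values
    dim = len(tokens[0])
    outputs, weights_per_token = [], []
    for tok in tokens:
        scores = [sum(q * k for q, k in zip(tok, key)) / math.sqrt(dim) for key in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        mixed = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(dim)]
        outputs.append([t + m for t, m in zip(tok, mixed)])  # residual connection
        weights_per_token.append(weights)
    return outputs, weights_per_token


rng = random.Random(0)
dim = 8
tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(4)]  # 4 video tokens
action = [rng.uniform(-1.0, 1.0) for _ in range(8)]                     # d_a = 8
groups = [action[:4], action[4:]]  # hypothetical split: one 4-dim group per arm

w_scale, w_shift = rand_matrix(dim, 8, rng), rand_matrix(dim, 8, rng)
ada_out = adaln(tokens, action, w_scale, w_shift)
attn_out, attn_weights = cross_attention(tokens, groups, rand_matrix(dim, 4, rng))
```

In `adaln` the action can only rescale and shift every token in the same way, whereas `attn_weights` holds a separate distribution over the two action tokens for each video token. That difference is one plausible reading of why the gap widens as d_a grows.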

Latent space

Temporally-aware causal VAE > frame-independent VAE

A causal video VAE with 4× temporal compression outperforms a frame-independent image VAE on both InD and OoD — even on a highly stochastic particle task like Pour Water. Temporal coupling in latent space matters more than higher spatial fidelity per frame.

VAE	Temporal compression	Pour Water · InD PSNR	Pour Water · OoD PSNR
FluxVAE	1×	24.89	24.09
WanVAE (ours)	4×	25.80	24.57
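
As a rough illustration of what 1× versus 4× temporal compression means for sequence length, the sketch below counts latent time steps for a clip. It assumes a convention used by several causal video VAEs, where the first frame gets its own latent and each following group of frames shares one. Whether the WanVAE configuration used here follows exactly this rule is an assumption, and the 17-frame clip length is hypothetical.

```python
def latent_frames(num_frames, temporal_stride):
    """Number of latent time steps for a clip of num_frames frames.

    Assumes the causal convention: the first frame gets its own latent and each
    later group of temporal_stride frames shares one. With temporal_stride = 1
    this reduces to one latent per frame.
    """
    if num_frames < 1 or temporal_stride < 1:
        raise ValueError("num_frames and temporal_stride must be positive")
    if (num_frames - 1) % temporal_stride != 0:
        raise ValueError("num_frames - 1 must be divisible by temporal_stride")
    return 1 + (num_frames - 1) // temporal_stride


CLIP_FRAMES = 17  # hypothetical clip length, not taken from the benchmark
for name, stride in [("frame-independent (1x)", 1), ("causal video (4x)", 4)]:
    print(f"{name}: {latent_frames(CLIP_FRAMES, stride)} latent steps")
```

With fewer latent steps, each one has to summarize motion across several frames, which is the kind of temporal coupling the comparison above points to.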

Action dimensionality

Richer actions hurt InD but unlock OoD generalization

On Cloth Move, expanding the action space from a shared 3-DoF displacement to full 8-DoF per-arm control lifts OoD PSNR from 16.23 → 21.60 dB (a +5.37 dB jump), at a modest InD cost. Richer control signals also act as richer observations.

Action space	d_a	Cloth Move · InD PSNR	Cloth Move · OoD PSNR
Shared Δxyz	3	19.72	16.23
Per-arm Δpose + grasp	8	18.91	21.60
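
A PSNR difference maps to a multiplicative change in squared error: since PSNR = 10·log10(MAX²/MSE), a change of ΔPSNR dB corresponds to an MSE factor of 10^(−ΔPSNR/10). The sketch below applies that conversion to the two rows of the table above. It assumes the identity carries over to the reported averages, which is only approximate if PSNR is averaged per clip. The function name is illustrative.

```python
def mse_ratio_from_psnr_delta(delta_db):
    """Factor by which MSE changes when PSNR changes by delta_db (positive = better)."""
    return 10.0 ** (-delta_db / 10.0)


# Cloth Move PSNR (dB) from the table above.
shared = {"InD": 19.72, "OoD": 16.23}   # shared Δxyz, d_a = 3
per_arm = {"InD": 18.91, "OoD": 21.60}  # per-arm Δpose + grasp, d_a = 8

ratios = {}
for split in ("InD", "OoD"):
    delta = per_arm[split] - shared[split]
    ratios[split] = mse_ratio_from_psnr_delta(delta)
    print(f"{split}: {delta:+.2f} dB  ->  MSE factor {ratios[split]:.2f}")
```

Read this way, the InD cost is an error increase of roughly a fifth, while the OoD gain cuts squared error to a bit under a third of its previous value.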

Model scale

Bigger helps OoD more than InD — with diminishing returns

Scaling DiT-S → DiT-M → DiT-L (~200M → ~600M → ~800M parameters) yields the largest gains on the OoD split: Robot Arm OoD PSNR climbs 21.84 → 23.51 dB. The S→M jump dominates; M→L is incremental at the current data scale.

Model	Params	Cloth Move · InD	Cloth Move · OoD	Robot Arm · InD	Robot Arm · OoD
DiT-S	~200M	19.68	16.49	28.38	21.84
DiT-M	~600M	19.89	16.98	29.04	23.11
DiT-L	~800M	20.01	17.24	29.27	23.51
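
The two claims in this subsection can be read directly off the table above. The short sketch below computes the PSNR gain of each scaling step and the total S→L gain per split. All values are copied from the table; the helper name is illustrative.

```python
# PSNR (dB) from the table above, ordered DiT-S, DiT-M, DiT-L.
PSNR = {
    ("Cloth Move", "InD"): [19.68, 19.89, 20.01],
    ("Cloth Move", "OoD"): [16.49, 16.98, 17.24],
    ("Robot Arm", "InD"): [28.38, 29.04, 29.27],
    ("Robot Arm", "OoD"): [21.84, 23.11, 23.51],
}


def step_gains(values):
    """PSNR gain of each model over the next smaller one, rounded to 0.01 dB."""
    return [round(b - a, 2) for a, b in zip(values, values[1:])]


gains = {key: step_gains(vals) for key, vals in PSNR.items()}
totals = {key: round(vals[-1] - vals[0], 2) for key, vals in PSNR.items()}

for (env, split), (s_to_m, m_to_l) in gains.items():
    print(f"{env:<10} {split}: S->M {s_to_m:+.2f} dB, M->L {m_to_l:+.2f} dB, "
          f"total {totals[(env, split)]:+.2f} dB")
```

In all four columns the S→M gain exceeds the M→L gain, and for both environments the total OoD gain exceeds the total InD gain. Note that the S→M step also adds about twice as many parameters as M→L.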

BibTeX

@article{xue2026acwm,
  title={ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models},
  author={Xue, Haotian and Chen, Yipu and Ma, Liqian and Zhao, Zelin and Moukheiber, Lama and Zhu, Yuchen and Che, Yongxin},
  journal={arXiv preprint arXiv:2605.08567},
  year={2026}
}


ACWM-Phys: 8 environments, 4 physics regimes

Environment	Action space
Push Cube	a ∈ ℝ²
Stack Cube	a ∈ ℝ⁷
Push Rope	a ∈ ℝ²
Cloth Move	a ∈ ℝ³
Push Sand	a ∈ ℝ⁷
Pour Water	a ∈ ℝ⁴
Robot Arm	a ∈ ℝ⁷
Reacher	a ∈ ℝ²


What we learned about diffusion ACWMs & physics

Generalization tracks state dimensionality, not physics category.

Cross-attention beats AdaLN once actions are high-dimensional.

Temporal VAE + richer actions consistently improve OoD.

Models still capture visual statistics, not physical laws.
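
The "not physics category" half of the first takeaway above can be checked against the ΔSSIM column of the results table. The sketch below compares the spread of ΔSSIM within each category to the spread between category means. It does not test the state-dimensionality half, since the page lists no per-environment state sizes. The grouping and helper names are illustrative.

```python
# ΔSSIM (OoD minus InD) per environment, copied from the results table.
DELTA_SSIM = {
    "Rigid": {"Push Cube": -0.001, "Stack Cube": -0.017},
    "Deformable": {"Push Rope": -0.003, "Cloth Move": -0.056},
    "Particle": {"Push Sand": -0.034, "Pour Water": -0.037},
    "Kinematics": {"Robot Arm": -0.067, "Reacher": 0.000},
}


def spread(values):
    """Range between the largest and smallest value."""
    return max(values) - min(values)


# Spread inside each category versus spread across the category averages.
within = {cat: spread(list(envs.values())) for cat, envs in DELTA_SSIM.items()}
means = {cat: sum(envs.values()) / len(envs) for cat, envs in DELTA_SSIM.items()}
between = spread(list(means.values()))

for cat in DELTA_SSIM:
    print(f"{cat:<11} mean {means[cat]:+.4f}   within-category spread {within[cat]:.3f}")
print(f"spread between category means: {between:.4f}")
```

The Kinematics and Deformable categories each contain both a near-zero gap and one of the largest gaps, so their within-category spread exceeds the spread between category means. Knowing the physics category alone therefore says little about how well an environment generalizes.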

