Benchmark · Diffusion world model · Physics · Video

ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

A benchmark of 8 physics-rich environments spanning four interaction regimes — rigid-body, deformable, particle, and kinematics — paired with ACWM-DiT, a latent diffusion transformer trained with flow matching. We probe whether modern action-conditioned video models actually learn physics, or merely appearance.

8 physics environments
4 interaction regimes
15K+ simulated trajectories
InD · OoD controlled splits
ACWM-Phys teaser: diverse physical scenes and their InD / OoD variants.
ACWM-Phys covers four physical interaction regimes — rigid-body, deformable, particle, and kinematics — each with a controlled out-of-distribution shift (more / fewer water particles, additional cubes, larger cloth, expanded workspace).

ACWM-Phys is designed to answer two questions about action-conditioned video world models:

Q1

How well can ACWMs learn different types of physics?

Existing benchmarks are confined to egocentric navigation or narrow rigid-body manipulation. ACWM-Phys spans rigid-body, deformable, particle, and kinematic regimes so we can finally compare prediction quality across physical phenomena rather than within a single domain.

Q2

Can they generalize beyond the training distribution?

Every environment ships a physically meaningful, exactly reproducible distribution shift — doubled particle counts, unseen cube counts, larger cloth, expanded goal regions — so we can measure the InD → OoD gap and pinpoint where diffusion world models still rely on appearance shortcuts.


§ 1 — Benchmark

ACWM-Phys: 8 environments, 4 physics regimes

Each environment ships 1,000 InD training trajectories, 50 InD test trajectories, and 100 OoD test trajectories with a physically meaningful distribution shift — unseen cube counts, expanded workspaces, doubled particle counts, larger cloth, or shifted goal regions. Every shift is exactly reproducible inside the simulator. Videos below are the ground-truth simulator rollouts, side-by-side InD vs OoD, so the reader can see what the distribution shift is.

Rigid-body

Push Cube · a ∈ ℝ²

A disk pusher translates 1–5 colored cubes to a target.

InD: 1 cube
OoD: 4+ cubes
Shift: unseen workspace regions and cube counts above 3
Rigid-body

Stack Cube · a ∈ ℝ⁷

A Franka Panda picks the red cube and stacks it on the green one.

InD: seen target
OoD: unseen placement
Shift: target placement region withheld during training
Deformable

Push Rope · a ∈ ℝ²

A pole pusher deforms a flexible rope (PyFleX).

InD: trained length
OoD: unseen length
Shift: rope length values held out of training
Deformable

Cloth Move · a ∈ ℝ³

Two arms drag a cloth over a fixed sphere with a shared 3D displacement.

InD: trained cloth size
OoD: larger cloth
Shift: cloth dimensions outside the training distribution
Particle

Push Sand · a ∈ ℝ⁷

A board pusher rearranges granular sand particles (PyFleX).

InD: ≤ 73K particles
OoD: ~145K particles
Shift: particle count roughly doubled
Particle

Pour Water · a ∈ ℝ⁴

An arm pours water into a cup via tilt and translation.

InD: 20–32 fill layers
OoD: 14 / 48 layers
Shift: fluid quantity outside the trained range
Kinematics

Robot Arm · a ∈ ℝ⁷

A 7-DoF Franka Panda reaches targets via cuRobo planning (Isaac Sim).

InD: trained workspace
OoD: expanded workspace
Shift: goal workspace expanded beyond training extent
Kinematics

Reacher · a ∈ ℝ²

A 2-link MuJoCo arm reaches goals via joint torques.

InD: central goals
OoD: corner-sector goals
Shift: goals confined to corner sectors withheld during training

Each pane shows the ground-truth simulator rollout (left side of the original recording).

§ 2 — Model

ACWM-DiT

A bidirectional Diffusion Transformer that denoises future latent video tokens conditioned on the past frame and an action sequence, trained end-to-end with flow matching in the latent space of a frozen causal video VAE.
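
The flow-matching objective named above can be sketched in a few lines. This is a minimal illustration, assuming the common straight-line (rectified-flow style) path from a Gaussian noise sample at t = 0 to the clean latent at t = 1; the page does not state which path, time convention, or sampler ACWM-DiT uses. Plain Python lists stand in for latent tensors, the Euler sampler is an assumption (only the 50-step count comes from the results section), and `make_oracle` is a hypothetical stand-in for the DiT that exists only to check the math.

```python
import random

def interpolate(noise, clean, t):
    """Point at time t on the straight path from noise (t=0) to clean (t=1)."""
    return [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]

def flow_matching_loss(model, clean, action, rng):
    """Single-sample flow-matching loss: regress the path velocity clean - noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    t = rng.random()  # uniform in [0, 1)
    x_t = interpolate(noise, clean, t)
    target = [c - n for n, c in zip(noise, clean)]
    pred = model(x_t, t, action)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(clean)

def euler_sample(model, noise, action, num_steps=50):
    """Integrate dx/dt = model(x, t, action) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = model(x, i * dt, action)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def make_oracle(clean):
    """Hypothetical stand-in for the DiT: exact velocity toward a known latent."""
    def model(x_t, t, action):
        return [(c - x) / (1.0 - t) for c, x in zip(clean, x_t)]
    return model

rng = random.Random(0)
clean_latent = [0.5, -1.0, 2.0]
action = [0.1, -0.2]  # ignored by the oracle; a real model would condition on it
oracle = make_oracle(clean_latent)
loss = flow_matching_loss(oracle, clean_latent, action, rng)
start = [rng.gauss(0.0, 1.0) for _ in clean_latent]
sample = euler_sample(oracle, start, action)
print(loss, sample)
```

With the oracle, the loss is zero up to floating-point error and the sampler lands on `clean_latent`. A trained network only approximates this velocity field, which is where prediction errors enter.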

Figure: ACWM-DiT architecture. Noisy latent tokens are processed by stacked DiT blocks with alternating spatial and temporal self-attention, modulated by a joint timestep–action embedding.
§ 3 — Main results

In-distribution & out-of-distribution rollouts

For each environment we evaluate ACWM-DiT-S (100k steps, 50 denoising steps) on a matched InD and OoD split. The pattern is striking: simple geometry generalizes; deformation, particle physics, and high-DoF kinematics do not. Every clip shows ground truth (left) | prediction (right); InD on top, OoD below.

Rigid-body

Push Cube

Rigid-body trajectories transfer cleanly across the distribution shift. Cubes occasionally vanish in the final frames of 4-cube OoD scenes — an appearance-shortcut tell.

Rigid-body

Stack Cube

Coarse pick-and-place survives; the Franka end-effector blurs as target positions move outside the trained set.

Deformable

Push Rope

Coarse rope shape is preserved across unseen lengths, but the motion-region (masked) error roughly doubles — the model bends the rope plausibly while missing fine contact response.

Deformable

Cloth Move

The largest absolute OoD error increase in the benchmark: MSE rises from 10.67 to 23.82 (×10⁻³). Contact-rich, large-scale deformation outruns what an AdaLN-conditioned diffusion model captures from appearance alone.

Particle

Push Sand

Doubling particle count exposes a fine-grained redistribution gap — overall pile structure persists, micro-dynamics blur.

Particle

Pour Water

Pouring trajectories repeat, so error stays bounded — but under unseen volumes the model hallucinates plausible-but-wrong fill levels.

Kinematics

Robot Arm

7-DoF Franka articulation breaks first. AdaLN's single conditioning vector becomes a bottleneck — see the cross-attention ablation below.

Kinematics

Reacher

The cleanest generalizer in the suite: 2-link joint trajectories live on a low-dimensional manifold the model has truly learned.

Full quantitative results

ACWM-DiT-S, 100k training steps, 50 denoising steps. MSE values are scaled by 10⁻³.

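| Category | Environment | InD MSE↓ | InD SSIM↑ | InD PSNR↑ | OoD MSE↓ | OoD SSIM↑ | OoD PSNR↑ | ΔSSIM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rigid | Push Cube | 2.92 | 0.955 | 25.35 | 2.95 | 0.954 | 25.30 | −0.001 |
| Rigid | Stack Cube | 5.52 | 0.889 | 22.58 | 7.00 | 0.872 | 21.55 | −0.017 |
| Deformable | Push Rope | 0.21 | 0.988 | 36.70 | 0.33 | 0.985 | 34.83 | −0.003 |
| Deformable | Cloth Move | 10.67 | 0.920 | 19.72 | 23.82 | 0.864 | 16.23 | −0.056 |
| Particle | Push Sand | 0.52 | 0.975 | 32.85 | 1.53 | 0.941 | 28.16 | −0.034 |
| Particle | Pour Water | 2.63 | 0.911 | 25.80 | 3.49 | 0.874 | 24.57 | −0.037 |
| Kinematics | Robot Arm | 1.43 | 0.969 | 28.43 | 6.56 | 0.902 | 21.83 | −0.067 |
| Kinematics | Reacher | 0.26 | 0.992 | 35.85 | 0.27 | 0.992 | 35.65 | 0.000 |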
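
The MSE and PSNR columns above are linked by the standard identity PSNR = 10·log10(MAX² / MSE). The sketch below checks that link on the reported numbers, assuming pixel values in [0, 1] (so MAX = 1); the page does not state the pixel range or whether PSNR is averaged per clip. The dictionary is copied from the table, and the function names are illustrative.

```python
import math

def psnr_from_mse(mse, max_value=1.0):
    """Standard PSNR in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

# (InD MSE x1e-3, InD PSNR, OoD MSE x1e-3, OoD PSNR), copied from the table.
RESULTS = {
    "Push Cube": (2.92, 25.35, 2.95, 25.30),
    "Stack Cube": (5.52, 22.58, 7.00, 21.55),
    "Push Rope": (0.21, 36.70, 0.33, 34.83),
    "Cloth Move": (10.67, 19.72, 23.82, 16.23),
    "Push Sand": (0.52, 32.85, 1.53, 28.16),
    "Pour Water": (2.63, 25.80, 3.49, 24.57),
    "Robot Arm": (1.43, 28.43, 6.56, 21.83),
    "Reacher": (0.26, 35.85, 0.27, 35.65),
}

def max_psnr_mismatch():
    """Largest gap (dB) between reported PSNR and PSNR recomputed from MSE."""
    gaps = []
    for ind_mse, ind_psnr, ood_mse, ood_psnr in RESULTS.values():
        gaps.append(abs(psnr_from_mse(ind_mse * 1e-3) - ind_psnr))
        gaps.append(abs(psnr_from_mse(ood_mse * 1e-3) - ood_psnr))
    return max(gaps)

def psnr_drop(name):
    """InD minus OoD PSNR (dB) for one environment."""
    _, ind_psnr, _, ood_psnr = RESULTS[name]
    return round(ind_psnr - ood_psnr, 2)

print(f"max mismatch: {max_psnr_mismatch():.3f} dB")
for name in sorted(RESULTS, key=psnr_drop, reverse=True):
    print(f"{name:<11} PSNR drop {psnr_drop(name):.2f} dB")
```

Under that assumption every reported PSNR is reproduced to within 0.1 dB, which is what rounding the MSE column to two decimals would allow. Ranked by PSNR drop rather than ΔSSIM or absolute MSE, Robot Arm falls the most (6.60 dB) and Push Cube the least (0.05 dB).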
§ 4 — Ablations

What actually moves the needle?

We probe four design axes: action conditioning, latent space, action dimensionality, and model scale. Each finding is anchored on the same evaluation protocol as the main results.

Action conditioning

Cross-attention scales with action dimensionality — AdaLN doesn't

At d_a = 2 (Push Cube, Push Rope), AdaLN and cross-attention tie. At d_a = 7 (Robot Arm), cross-attention recovers +3.18 dB InD and +1.55 dB OoD. At d_a = 8 (Cloth Move), it slightly hurts InD but yields a striking +6.59 dB OoD: AdaLN's single global vector can't carry per-arm signals.

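| Environment | d_a | AdaLN · OoD PSNR | Cross-Attn · OoD PSNR | Δ |
| --- | --- | --- | --- | --- |
| Push Cube | 2 | 25.30 | 25.18 | |
| Push Rope | 2 | 34.83 | 34.77 | |
| Robot Arm | 7 | 21.83 | 23.38 | +1.55 |
| Cloth Move | 8 | 16.23 | 22.82 | +6.59 |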
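
The "single global vector" argument can be made concrete with a toy contrast between the two conditioning styles. This is a sketch under loud assumptions, not the ACWM-DiT implementation: mean pooling is a deliberately lossy stand-in for whatever embedding feeds AdaLN, all learned projections are replaced by the identity (so action tokens share the video token width), and there is a single attention head.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(tokens):
    return [sum(col) / len(tokens) for col in zip(*tokens)]

def adaln(video_tokens, action_tokens):
    """One pooled vector sets the same scale and shift for every video token."""
    pooled = mean_pool(action_tokens)
    scale, shift = pooled, pooled  # identity "projections", illustration only
    return [
        [n * (1.0 + s) + b for n, s, b in zip(layer_norm(tok), scale, shift)]
        for tok in video_tokens
    ]

def attention_weights(video_tokens, action_tokens):
    """Softmax weights of each video token (query) over action tokens (keys)."""
    inv_sqrt_d = 1.0 / math.sqrt(len(video_tokens[0]))
    return [
        softmax([inv_sqrt_d * sum(q * k for q, k in zip(query, key))
                 for key in action_tokens])
        for query in video_tokens
    ]

def cross_attention(video_tokens, action_tokens):
    """Residual update: each video token adds its own mix of action tokens."""
    out = []
    weights = attention_weights(video_tokens, action_tokens)
    for query, w_row in zip(video_tokens, weights):
        mix = [sum(w * tok[j] for w, tok in zip(w_row, action_tokens))
               for j in range(len(query))]
        out.append([q + m for q, m in zip(query, mix)])
    return out

video = [[4.0, 0.0, 0.0, 0.0], [0.0, 4.0, 0.0, 0.0]]
per_arm = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]   # two arms, different actions
averaged = [[0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]  # both arms, the average action

print(adaln(video, per_arm) == adaln(video, averaged))
print(cross_attention(video, per_arm) == cross_attention(video, averaged))
print(attention_weights(video, per_arm))
```

The first comparison prints `True`: after pooling, the two action sets are indistinguishable to the AdaLN path. The second prints `False`, because each video token attends mostly to the action token it matches. A learned embedding of the concatenated action vector loses less than mean pooling does, so the toy shows the direction of the effect rather than its size.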
Latent space

Temporally-aware causal VAE > frame-independent VAE

A causal video VAE with 4× temporal compression outperforms a frame-independent image VAE on both InD and OoD — even on a highly stochastic particle task like Pour Water. Temporal coupling in latent space matters more than higher spatial fidelity per frame.

| VAE | Temp. compression | InD PSNR | OoD PSNR |
| --- | --- | --- | --- |
| FluxVAE | none (frame-independent) | 24.89 | 24.09 |
| WanVAE (ours) | 4× | 25.80 | 24.57 |
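
To make "4× temporal compression" concrete, the sketch below counts latent frames for the two encoder types. It assumes the usual causal video VAE convention, where the first frame is encoded on its own and each following group of four frames shares one latent frame, giving 1 + (T − 1)/4 latent frames. The clip length used in ACWM-Phys is not stated on this page, so the 17-frame example is hypothetical, as are the function names.

```python
def causal_latent_frames(num_frames, temporal_stride=4):
    """Latent frame count for a causal video VAE: first frame alone,
    then one latent frame per `temporal_stride` input frames."""
    if num_frames < 1:
        raise ValueError("need at least one frame")
    if (num_frames - 1) % temporal_stride != 0:
        raise ValueError("num_frames must be 1 + k * temporal_stride")
    return 1 + (num_frames - 1) // temporal_stride

def framewise_latent_frames(num_frames):
    """A frame-independent image VAE keeps one latent frame per input frame."""
    return num_frames

clip_length = 17  # hypothetical clip length, not taken from the benchmark
print(causal_latent_frames(clip_length))     # 5
print(framewise_latent_frames(clip_length))  # 17
```

Each causal latent frame summarizes several consecutive input frames, so the diffusion model works on a shorter sequence whose tokens already encode short-range motion.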
Action dimensionality

Richer actions hurt InD but unlock OoD generalization

On Cloth Move, expanding the action space from a shared 3-DoF displacement to full 8-DoF per-arm control lifts OoD PSNR from 16.23 → 21.60 dB (a +5.37 dB jump), at a modest InD cost. Richer control signals also act as richer observations.

| Action space | d_a | InD PSNR | OoD PSNR |
| --- | --- | --- | --- |
| Shared Δxyz | 3 | 19.72 | 16.23 |
| Per-arm Δpose + grasp | 8 | 18.91 | 21.60 |
Model scale

Bigger helps OoD more than InD — with diminishing returns

Scaling DiT-S → DiT-M → DiT-L (200M → 600M → 800M) yields the largest gains on the OoD split: Robot Arm OoD PSNR climbs 21.84 → 23.51 dB. The S→M jump dominates; M→L is incremental at the current data scale.

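| Model | Params | Cloth Move · InD | Cloth Move · OoD | Robot Arm · InD | Robot Arm · OoD |
| --- | --- | --- | --- | --- | --- |
| DiT-S | ~200M | 19.68 | 16.49 | 28.38 | 21.84 |
| DiT-M | ~600M | 19.89 | 16.98 | 29.04 | 23.11 |
| DiT-L | ~800M | 20.01 | 17.24 | 29.27 | 23.51 |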
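
The two scaling claims (OoD gains exceed InD gains, and S→M dominates M→L) are simple differences over the table above. The sketch below works them out; the PSNR values are copied from the table, while the dictionary keys and function names are illustrative.

```python
# PSNR (dB) per model size, copied from the scaling table.
SCALING = {
    "cloth_move_ind": {"DiT-S": 19.68, "DiT-M": 19.89, "DiT-L": 20.01},
    "cloth_move_ood": {"DiT-S": 16.49, "DiT-M": 16.98, "DiT-L": 17.24},
    "robot_arm_ind": {"DiT-S": 28.38, "DiT-M": 29.04, "DiT-L": 29.27},
    "robot_arm_ood": {"DiT-S": 21.84, "DiT-M": 23.11, "DiT-L": 23.51},
}

def step_gains(column):
    """PSNR gain (dB) for S->M and M->L on one table column."""
    row = SCALING[column]
    return (round(row["DiT-M"] - row["DiT-S"], 2),
            round(row["DiT-L"] - row["DiT-M"], 2))

def total_gain(column):
    """PSNR gain (dB) from DiT-S to DiT-L on one table column."""
    row = SCALING[column]
    return round(row["DiT-L"] - row["DiT-S"], 2)

for column in SCALING:
    s_to_m, m_to_l = step_gains(column)
    print(f"{column:<15} S->M {s_to_m:+.2f}  M->L {m_to_l:+.2f}  "
          f"total {total_gain(column):+.2f}")
```

On Robot Arm OoD the steps are +1.27 dB and +0.40 dB, and in both environments the total OoD gain (0.75 and 1.67 dB) is roughly twice the InD gain (0.33 and 0.89 dB).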
§ 5 — Takeaways

What we learned about diffusion ACWMs & physics

  1. Generalization tracks state dimensionality, not physics category.

     Low-dimensional geometric dynamics (rigid-body translation, 2-link joints) transfer robustly. High-dimensional or stochastic dynamics (cloth, sand, 7-DoF arm) expose the model's reliance on appearance statistics.

  2. Cross-attention beats AdaLN once actions are high-dimensional.

     At d_a = 2 the two are indistinguishable. At d_a = 7–8, cross-attention recovers +3.18 dB InD on Robot Arm and +6.59 dB OoD on Cloth Move; AdaLN's single global vector becomes a per-joint bottleneck.

  3. Temporal VAE + richer actions consistently improve OoD.

     A causal VAE with 4× temporal compression beats a frame-independent encoder. Expanding Cloth Move to full per-arm control yields +5.37 dB OoD at a small InD cost; richer signals double as richer observations.

  4. Models still capture visual statistics, not physical laws.

     OoD failures (cubes vanishing, water levels hallucinated, articulated arms blurring) consistently surface where appearance shortcuts can no longer fake the answer. ACWM-Phys gives the community a clean lever to push diffusion world models toward physical structure.

Stress-test your world model on ACWM-Phys

Dataset, checkpoints, and evaluation code are released. Drop your model in — we report masked-MSE / SSIM / PSNR on the same InD & OoD splits.
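
For readers wiring up their own evaluation, a masked MSE restricts the error to pixels where something moves. The sketch below is one plausible definition, not the released evaluation code: it assumes the mask marks pixels whose ground-truth value changes between two frames by more than a threshold, and the threshold, function names, and nested-list frame format are all hypothetical.

```python
def motion_mask(prev_frame, next_frame, threshold=0.05):
    """True where the ground-truth pixel changes by more than `threshold`."""
    return [
        [abs(b - a) > threshold for a, b in zip(row_prev, row_next)]
        for row_prev, row_next in zip(prev_frame, next_frame)
    ]

def mse(pred, target):
    """Mean squared error over all pixels."""
    errors = [(p - t) ** 2
              for row_p, row_t in zip(pred, target)
              for p, t in zip(row_p, row_t)]
    return sum(errors) / len(errors)

def masked_mse(pred, target, mask):
    """Mean squared error over masked pixels only (0.0 if the mask is empty)."""
    errors = [(p - t) ** 2
              for row_p, row_t, row_m in zip(pred, target, mask)
              for p, t, m in zip(row_p, row_t, row_m) if m]
    return sum(errors) / len(errors) if errors else 0.0

# Toy 2x2 grayscale frames: one pixel moves, and the prediction is half right.
gt_prev = [[0.0, 0.0], [0.0, 0.0]]
gt_next = [[1.0, 0.0], [0.0, 0.0]]
pred_next = [[0.5, 0.0], [0.0, 0.0]]

mask = motion_mask(gt_prev, gt_next)
print(mse(pred_next, gt_next))               # 0.0625
print(masked_mse(pred_next, gt_next, mask))  # 0.25
```

The static background dilutes the full-frame error by a factor of four in this toy case, which is why a motion-region metric can expose failures that full-frame MSE hides.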

BibTeX

@article{xue2026acwm,
  title={ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models},
  author={Xue, Haotian and Chen, Yipu and Ma, Liqian and Zhao, Zelin and Moukheiber, Lama and Zhu, Yuchen and Che, Yongxin},
  journal={arXiv preprint arXiv:2605.08567},
  year={2026}
}