MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training
Abstract
Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision of temporal consistency, so low-loss predictions can still exhibit poor motion.
We introduce MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built on a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to distinguish real versus generated motion, coupled with a distribution-matching regularizer to preserve visual fidelity.
Applied to Wan2.1-T2V-1.3B, our 3-step model delivers better motion quality and comparable visual quality relative to the 50-step baseline, while running over fifteen times faster. On VBench, our method improves motion smoothness (+0.6%) and substantially restores dynamic degree relative to the distilled model (+15.7% over DMD), achieving the highest overall motion score (0.960 versus 0.915 for the 50-step model). Similar improvements hold on VideoJAM-Bench.
A human evaluation further shows that our model is preferred over both baselines for motion (52% versus 38% against the 50-step model; 56% versus 29% against DMD) and for visual quality. Overall, our approach substantially improves motion realism without sacrificing visual fidelity or efficiency.
A person swimming in the ocean
a car turning a corner
A woman performing an intricate dance on stage, illuminated by a single spotlight in the first frame. The woman dances Argentine flamenco.
MoGAN Motion Adversarial Post-Training
MoGAN augments a distilled few-step video diffusion model with a motion-focused adversarial objective in optical-flow space. A DiT-based discriminator operates on RAFT flow fields of real and generated videos, while a distribution-matching regularizer preserves the original model’s appearance distribution.
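The adversarial objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the frame-difference flow proxy (the paper uses RAFT flow fields), the hinge discriminator loss, and the hypothetical `lam` weight on the distribution-matching term are all assumptions made to keep the sketch self-contained.

```python
import numpy as np

def frame_difference_flow(video):
    # Stand-in for optical flow: temporal frame differences over a
    # (batch, time, H, W, C) video tensor. MoGAN instead feeds RAFT
    # flow fields of real and generated clips to a DiT discriminator.
    return video[:, 1:] - video[:, :-1]

def discriminator_hinge_loss(real_logits, fake_logits):
    # Hinge GAN loss on the flow discriminator's outputs:
    # push logits for real motion above +1 and for generated motion below -1.
    return (np.mean(np.maximum(0.0, 1.0 - real_logits))
            + np.mean(np.maximum(0.0, 1.0 + fake_logits)))

def generator_loss(fake_logits, dmd_reg, lam=1.0):
    # Generator objective: fool the motion discriminator while a
    # distribution-matching regularizer (dmd_reg) preserves appearance.
    # `lam` is an assumed trade-off weight, not a value from the paper.
    return -np.mean(fake_logits) + lam * dmd_reg
```

In practice the discriminator would be a DiT operating on flow fields, and `dmd_reg` the distribution-matching term computed against the teacher model; the loss structure, however, follows the standard adversarial post-training recipe.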
Motion and Visual Quality
On VBench and VideoJAM-Bench, MoGAN significantly improves motion smoothness and dynamic degree over the distilled baseline, while matching or slightly exceeding the frame-quality metrics of the original 50-step model.
Preference for Motion and Visual Quality
We conduct a human study comparing MoGAN with both the 50-step Wan2.1 model and the distilled DMD baseline. Annotators consistently prefer MoGAN for motion realism and also for overall visual quality, confirming that the motion-focused adversarial training does not sacrifice perceptual fidelity.