MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training
Abstract
Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision of temporal consistency, so low-loss predictions can still exhibit poor motion.
We introduce MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built on a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to distinguish real versus generated motion, coupled with a distribution-matching regularizer to preserve visual fidelity.
Applied to Wan2.1-T2V-1.3B, our 3-step model delivers better motion quality and comparable visual quality relative to the 50-step baseline, while running over fifteen times faster. On VBench, our method improves motion smoothness (+0.6%) and substantially restores dynamic degree relative to the distilled model (+15.7% over DMD), achieving the highest overall motion score (0.960 versus 0.915 for the 50-step model). Similar improvements hold on VideoJAM-Bench.
A human evaluation further shows that our model is preferred over both baselines for motion (52% versus 38% against the 50-step model; 56% versus 29% against DMD) and for visual quality. Overall, our approach substantially improves motion realism without sacrificing visual fidelity or efficiency.
A person swimming in the ocean
a car turning a corner
A woman performing an intricate dance on stage, illuminated by a single spotlight in the first frame. The woman dances Argentine flamenco.
MoGAN Motion Adversarial Post-Training
MoGAN augments a distilled few-step video diffusion model with a motion-focused adversarial objective in optical-flow space. A DiT-based discriminator operates on RAFT flow fields of real and generated videos, while a distribution-matching regularizer preserves the original model’s appearance distribution.
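The adversarial objective described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the frame-difference flow proxy (the paper uses RAFT flow fields), the hinge discriminator loss, and the hypothetical `lam` weight on the distribution-matching term are all assumptions made to keep the sketch self-contained.

```python
import numpy as np

def frame_difference_flow(video):
    # Stand-in for optical flow: temporal frame differences over a
    # (batch, time, H, W, C) video tensor. MoGAN instead feeds RAFT
    # flow fields of real and generated clips to a DiT discriminator.
    return video[:, 1:] - video[:, :-1]

def discriminator_hinge_loss(real_logits, fake_logits):
    # Hinge GAN loss on the flow discriminator's outputs:
    # push logits for real motion above +1 and for generated motion below -1.
    return (np.mean(np.maximum(0.0, 1.0 - real_logits))
            + np.mean(np.maximum(0.0, 1.0 + fake_logits)))

def generator_loss(fake_logits, dmd_reg, lam=1.0):
    # Generator objective: fool the motion discriminator while a
    # distribution-matching regularizer (dmd_reg) preserves appearance.
    # `lam` is an assumed trade-off weight, not a value from the paper.
    return -np.mean(fake_logits) + lam * dmd_reg
```

In practice the discriminator would be a DiT operating on flow fields, and `dmd_reg` the distribution-matching term computed against the teacher model; the loss structure, however, follows the standard adversarial post-training recipe.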
Motion and Visual Quality
On VBench and VideoJAM-Bench, MoGAN significantly improves motion smoothness and dynamic degree over the distilled baseline, while matching or slightly exceeding the frame-quality metrics of the original 50-step model.
Preference for Motion and Visual Quality
We conduct a human study comparing MoGAN with both the 50-step Wan2.1 model and the distilled DMD baseline. Annotators consistently prefer MoGAN for motion realism and also for overall visual quality, confirming that the motion-focused adversarial training does not sacrifice perceptual fidelity.