Enhancing Motion Dynamics of Image-to-Video Models via Adaptive Low-Pass Guidance

KAIST   

Overview


Text-to-video (T2V) models excel at producing high-quality, dynamic videos, and recent works have adapted these pre-trained T2V models for image-to-video (I2V) generation to enhance visual controllability. However, this adaptation often suppresses motion dynamics, yielding more static videos than their T2V counterparts.

In this work, we analyze this phenomenon and identify that:

The suppression of motion in I2V models stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image.

To address this, we propose Adaptive Low-pass Guidance (ALG), a simple modification of the I2V sampling process that generates more dynamic videos without compromising video quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying low-pass filtering to the input image during the early stages of denoising.

Our method (ALG) mitigates the motion suppression in I2V models by adaptively modulating the frequency content of the input image.

On the VBench-I2V benchmark, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.

Problem: Suppressed Motion in I2V Models


Image-to-Video (I2V) models offer enhanced visual control by animating a user-provided image from text prompts. However, these models, frequently adapted from Text-to-Video (T2V) architectures, often produce much more static videos than their T2V counterparts, even for dynamic descriptions.

Observation: T2V-I2V motion dynamics gap.

We first systematically quantify this "motion suppression". Specifically, we compare T2V models to their I2V derivatives in a controlled setup: videos are first generated by T2V models, and their initial frames are then used as I2V inputs with the same prompts. This isolates the difference in conditioning mechanism, ruling out other factors such as model architecture or training data.

Quantitative evaluation on VBench reveals a consistent and significant reduction in video dynamicness alone for I2V models (e.g., -18.6% for Wan 2.1), while other quality metrics remain stable. This indicates that the I2V conditioning mechanism itself is a primary contributor to the observed motion suppression.

Table comparing T2V and I2V motion dynamics
A significant drop in Dynamic Degree is seen for I2V models compared to their T2V counterparts, while other metrics remain largely unchanged; this suggests that the I2V conditioning mechanism is the problematic factor.

Hypothesis: Over-conditioning on high-frequency details leads to "shortcuts."

We hypothesize that this motion suppression stems from the I2V model's premature over-conditioning on high-frequency components (fine details, textures, sharp edges) present in the reference image.

To investigate further, we inspect the internal representations of the DiT denoiser (Wan 2.1) and visualize them using PCA. We observe that the model rapidly locks onto the static, fine-grained details, even after just one denoising step. This early completion ("shortcut") prematurely confines the generation trajectory and hinders the development of large, dynamic motion over time, which would otherwise emerge in a natural coarse-to-fine generation trajectory.

Visualization of shortcut effect in I2V generation
I2V generation shows fine details locking in very early (t=0.02; a single denoising step), limiting the flexibility of the sampling trajectory.
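The PCA visualization above can be reproduced generically: flatten the denoiser's per-token features, project them onto their top-3 principal components, and render the result as an RGB map. A minimal numpy sketch follows; the actual hook for extracting DiT features is model-specific and not shown here, so the random input stands in for real features.

```python
import numpy as np

def pca_rgb(features: np.ndarray) -> np.ndarray:
    """Project per-token features (N, D) onto their top-3 principal
    components and rescale each channel to [0, 1] for RGB display."""
    centered = features - features.mean(axis=0)
    # Principal directions come from the SVD of the centered matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                    # (N, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    return (proj - lo) / (hi - lo + 1e-8)

# Stand-in for denoiser features: 16x16 tokens, 64-dim each.
feats = np.random.randn(256, 64)
rgb = pca_rgb(feats).reshape(16, 16, 3)  # viewable as an RGB image
```

Comparing such maps across denoising steps reveals how quickly spatial structure (and thus appearance) is committed.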

Diagnosis: Low-pass filtering mitigates suppression at a cost.

To verify whether high-frequency detail is indeed the cause, we apply low-pass filters (e.g., image downsampling) of varying strengths to the input image before I2V generation. The results support our hypothesis: stronger low-pass filtering consistently increases the dynamic degree of the generated videos.

Effect of low-pass filtering on motion dynamics and quality
Low-pass filtering the input image improves motion but degrades quality. (a) Increasing filter strength boosts Dynamic Degree but reduces Aesthetic Quality (VBench). (b) Visual examples illustrate this trade-off.
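A downsample-then-upsample operation, as mentioned above, is one simple way to low-pass an image: block averaging discards high-frequency content, and the loss grows with the downsampling factor. A minimal numpy sketch (the paper's exact filter and strength schedule may differ):

```python
import numpy as np

def low_pass(image: np.ndarray, factor: int) -> np.ndarray:
    """Low-pass an (H, W, C) image by box-averaging factor x factor
    blocks, then nearest-neighbor upsampling back. factor=1 is identity.
    Assumes H and W are divisible by `factor` for simplicity."""
    if factor == 1:
        return image
    h, w, c = image.shape
    down = image.reshape(h // factor, factor,
                         w // factor, factor, c).mean(axis=(1, 3))
    return down.repeat(factor, axis=0).repeat(factor, axis=1)

img = np.random.rand(64, 64, 3)
blurred = low_pass(img, 8)  # stronger factor -> less high-frequency detail
```

Increasing `factor` plays the role of "filter strength" in the trade-off plot above: more motion, but a blurrier reference for the model to reconstruct.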

Low-pass filtering mitigates the "shortcut" effect.

Alongside this gain in dynamicness, we observe that the "shortcut" effect is eliminated (bottom row):

Visualization of shortcut effect in I2V generation
Low-pass filtering mitigates the "shortcut" effect and reverts it back to the natural coarse-to-fine generation trajectory.

This restores the natural coarse-to-fine trajectory and allows greater flexibility during sampling, which in turn yields more dynamic videos.

Method: Adaptive Low-Pass Guidance (ALG)


However, the simple low-pass filtering "solution" discussed above has an inherent trade-off: aggressive low-pass filtering degrades image fidelity, as the model is conditioned on a blurred reference (i.e., the original image becomes impossible to recover). This motivates a more nuanced sampling method:

If the early shortcut in the trajectory causes motion suppression, can we bypass it with low-pass filtering in early sampling steps, then reduce the filter strength later for image fidelity?
We apply low-pass filtering early on for dynamicness, and reduce filter strength later for image fidelity.

Our method, Adaptive Low-pass Guidance (ALG), does exactly this by adaptively modulating the frequency content of the conditioning image over the course of sampling.
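The core idea can be sketched as a time-dependent conditioning function: early denoising steps see a strongly low-passed image (avoiding the shortcut), and later steps see the original image (preserving fidelity). The switch point and filter strength below are illustrative values, not the paper's hyperparameters, and the box filter is one simple choice of low-pass operation.

```python
import numpy as np

def box_low_pass(image: np.ndarray, factor: int) -> np.ndarray:
    """Crude low-pass: box-downsample by `factor`, then nearest-neighbor
    upsample back. factor=1 is the identity; assumes divisible H, W."""
    if factor == 1:
        return image
    h, w, c = image.shape
    down = image.reshape(h // factor, factor,
                         w // factor, factor, c).mean(axis=(1, 3))
    return down.repeat(factor, axis=0).repeat(factor, axis=1)

def alg_condition(image: np.ndarray, t: float,
                  t_switch: float = 0.3, factor: int = 8) -> np.ndarray:
    """ALG sketch: condition on a low-passed image early in denoising
    (t < t_switch), then on the original image so fine details remain
    recoverable. `t` runs from 0 (start) to 1 (end of denoising)."""
    if t < t_switch:
        return box_low_pass(image, factor)
    return image

img = np.random.rand(64, 64, 3)
early = alg_condition(img, t=0.1)  # blurred reference: encourages motion
late = alg_condition(img, t=0.5)   # original reference: preserves fidelity
```

A hard switch is shown for clarity; a gradual decay of filter strength over `t` fits the same framework.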

Result


CFG (no filter, default): static motion, high input image fidelity
Constant low-pass filter: dynamic motion, low input image fidelity
ALG (Ours): dynamic motion, high input image fidelity

We observe that ALG (right) effectively enhances motion in generated videos without sacrificing the input image fidelity.


Evaluation under VBench-I2V shows on average a 36% increase in Dynamic Degree across 4 commonly used open-source I2V models without a significant drop in image fidelity or video quality.

More qualitative examples can be found on the Gallery page of our website.