Text-to-video (T2V) models excel at producing high-quality, dynamic videos, and recent works have adapted these pre-trained T2V models for image-to-video (I2V) generation to enhance visual controllability. However, this adaptation often suppresses motion dynamics, yielding more static videos than their T2V counterparts.
In this work, we analyze this phenomenon and identify that:
The suppression of motion in I2V models stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image.
To address this, we propose Adaptive Low-pass Guidance (ALG), a simple fix to the I2V sampling process that yields more dynamic videos without compromising video quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying low-pass filtering to the input image during the early stage of denoising.
On the VBench-I2V benchmark, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
Image-to-Video (I2V) models offer enhanced visual control by animating a user-provided image from text prompts. However, these models, frequently adapted from Text-to-Video (T2V) architectures, often produce much more static videos than their T2V counterparts, even for dynamic descriptions.
We first systematically quantify this "motion suppression". Specifically, we compare T2V models to their I2V derivatives in a controlled setup: we first generate videos with the T2V models, then use their initial frames as I2V inputs together with the same prompts. This isolates the difference in conditioning mechanism and rules out other factors such as model architecture or training data.
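A minimal sketch of this protocol is shown below. The `t2v_pipe` and `i2v_pipe` wrappers are hypothetical stand-ins for any matching T2V / I2V pipeline pair (e.g., a Wan 2.1 T2V checkpoint and its I2V derivative); this is an illustration of the setup, not our exact evaluation script.

```python
def compare_t2v_vs_i2v(t2v_pipe, i2v_pipe, prompts, num_frames=49, seed=0):
    """Pair each T2V video with an I2V video conditioned on its first frame."""
    paired_videos = []
    for prompt in prompts:
        # 1) Generate a reference video with the T2V model.
        t2v_video = t2v_pipe(prompt, num_frames=num_frames, seed=seed)  # (T, H, W, C)

        # 2) Reuse its first frame as the I2V conditioning image,
        #    keeping the text prompt identical.
        first_frame = t2v_video[0]
        i2v_video = i2v_pipe(image=first_frame, prompt=prompt,
                             num_frames=num_frames, seed=seed)

        # 3) Both videos are later scored with VBench metrics (dynamic degree,
        #    quality, etc.) to isolate the effect of image conditioning.
        paired_videos.append((t2v_video, i2v_video))
    return paired_videos
```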
Quantitative evaluation on VBench reveals a consistent and significant reduction only in video dynamicness for I2V models (e.g., -18.6% for Wan 2.1), while other quality metrics remain stable. This indicates that the I2V conditioning mechanism itself is a primary contributor to the observed motion suppression.
We hypothesize that this motion suppression stems from the I2V model's premature over-conditioning on high-frequency components (fine details, textures, sharp edges) present in the reference image.
To investigate further, we inspect the internal representations of the DiT denoiser (Wan 2.1) and visualize them using PCA. We observe that the model rapidly "locks in" on static, fine-grained details after just one denoising step. This early commitment ("shortcut") prematurely confines the generation trajectory and hinders the development of large, dynamic motion over time, which would otherwise emerge along a natural coarse-to-fine generation trajectory.
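A minimal sketch of this visualization, assuming `features` holds intermediate DiT activations collected at a given denoising step with shape `(num_frames, num_tokens, hidden_dim)`; which layer and step to probe follows the paper's setup, not this snippet.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_feature_maps(features, grid_h, grid_w):
    """Project per-token features onto 3 principal components for RGB display."""
    t, n, d = features.shape
    flat = features.reshape(t * n, d)
    proj = PCA(n_components=3).fit_transform(flat)          # (T*N, 3)
    # Normalize each component to [0, 1] so it can be rendered as RGB.
    proj = (proj - proj.min(0)) / (proj.max(0) - proj.min(0) + 1e-8)
    return proj.reshape(t, grid_h, grid_w, 3)               # one RGB map per frame
```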
To verify whether high-frequency detail is indeed the cause, we apply low-pass filters (e.g., image downsampling) of varying strengths to the input image before I2V generation. The results support our hypothesis: stronger low-pass filtering consistently increases the dynamic degree of the generated videos.
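A minimal sketch of this probe, using downsample-then-upsample as the low-pass filter and assuming a `(1, C, H, W)` image tensor in `[0, 1]`; `scale` controls the filter strength (smaller means stronger filtering).

```python
import torch
import torch.nn.functional as F

def lowpass_downsample(image, scale):
    """Low-pass filter an image by downsampling and upsampling back."""
    if scale >= 1.0:
        return image  # no filtering
    _, _, h, w = image.shape
    small = F.interpolate(image, scale_factor=scale,
                          mode="bilinear", align_corners=False)
    return F.interpolate(small, size=(h, w),
                         mode="bilinear", align_corners=False)
```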
Alongside this dynamicness enhancement, we observe the elimination of the "shortcut" effect (bottom row):
This restores a coarse-to-fine trajectory and allows greater flexibility during sampling, which in turn yields a more dynamic video.
However, the simple low-pass filtering "solution" discussed above has an inherent trade-off: aggressive low-pass filtering degrades image fidelity, since the model is conditioned on a blurred reference (i.e., the original image cannot be recovered). This motivates a more nuanced sampling method:
If the early shortcut in the trajectory causes motion suppression, can we bypass it with low-pass filtering in early sampling steps, then reduce the filter strength later for image fidelity?
Our method, Adaptive Low-pass Guidance (ALG), does exactly this by adaptively modulating the frequency content in images:
We observe that ALG (right) effectively enhances motion in generated videos without sacrificing input-image fidelity.
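A minimal sketch of ALG-style sampling is shown below. The `denoiser` and `scheduler` interfaces are placeholders, and the schedule (filter strength decaying over the first `alg_steps` steps, then no filtering) is illustrative; the exact schedule and filter strength follow the paper.

```python
def sample_with_alg(denoiser, scheduler, cond_image, prompt,
                    num_steps=50, alg_steps=10, init_scale=0.125):
    """Denoising loop where early steps see a low-passed conditioning image."""
    latents = scheduler.init_noise()
    for step in range(num_steps):
        if step < alg_steps:
            # Early steps: condition on a low-passed image (strength decays
            # toward no filtering) to avoid the appearance shortcut.
            scale = init_scale + (1.0 - init_scale) * step / alg_steps
            image_cond = lowpass_downsample(cond_image, scale)  # from the snippet above
        else:
            # Later steps: switch back to the original image to preserve fidelity.
            image_cond = cond_image
        noise_pred = denoiser(latents, image_cond, prompt, step)
        latents = scheduler.step(noise_pred, latents, step)
    return latents
```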
Evaluation on VBench-I2V shows an average 36% increase in Dynamic Degree across 4 commonly used open-source I2V models, without a significant drop in image fidelity or video quality.
More qualitative examples can be found in the Gallery page of our website.