Text-to-video (T2V) models excel at producing high-quality, dynamic videos, and recent works have adapted these pre-trained T2V models for image-to-video (I2V) generation to enhance visual controllability. However, this adaptation often suppresses motion dynamics, yielding more static videos than their T2V counterparts.
In this work, we analyze this phenomenon and identify that:
The suppression of motion in I2V models stems from the premature exposure to high-frequency details in the input image, which biases the sampling process toward a shortcut trajectory that overfits to the static appearance of the reference image.
To address this, we propose Adaptive Low-pass Guidance (ALG), a simple fix to the I2V sampling process that yields more dynamic videos without compromising video quality. Specifically, ALG adaptively modulates the frequency content of the conditioning image by applying low-pass filtering to the input image during the early stage of denoising.
On the VBench-I2V benchmark, ALG achieves an average improvement of 36% in dynamic degree without a significant drop in video quality or image fidelity.
Image-to-Video (I2V) models offer enhanced visual control by animating a user-provided image from text prompts. However, these models, frequently adapted from Text-to-Video (T2V) architectures, often produce much more static videos than their T2V counterparts, even for dynamic descriptions.
We first systematically quantify this "motion suppression". Specifically, we compare T2V models to their I2V derivatives in a controlled setup: we first generate videos with the T2V models, then use their initial frames as I2V inputs together with the same prompts. This isolates the difference in conditioning mechanism and rules out other factors such as model architecture or training data.
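A minimal sketch of this protocol is shown below. The `t2v_pipe` and `i2v_pipe` wrappers are hypothetical stand-ins for any matching T2V / I2V pipeline pair (e.g., a Wan 2.1 T2V checkpoint and its I2V derivative); this is an illustration of the setup, not our exact evaluation script.

```python
def compare_t2v_vs_i2v(t2v_pipe, i2v_pipe, prompts, num_frames=49, seed=0):
    """Pair each T2V video with an I2V video conditioned on its first frame."""
    paired_videos = []
    for prompt in prompts:
        # 1) Generate a reference video with the T2V model.
        t2v_video = t2v_pipe(prompt, num_frames=num_frames, seed=seed)  # (T, H, W, C)

        # 2) Reuse its first frame as the I2V conditioning image,
        #    keeping the text prompt identical.
        first_frame = t2v_video[0]
        i2v_video = i2v_pipe(image=first_frame, prompt=prompt,
                             num_frames=num_frames, seed=seed)

        # 3) Both videos are later scored with VBench metrics (dynamic degree,
        #    quality, etc.) to isolate the effect of image conditioning.
        paired_videos.append((t2v_video, i2v_video))
    return paired_videos
```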
Quantitative evaluation on VBench reveals a consistent and significant reduction only in video dynamicness for I2V models (e.g., -18.6% for Wan 2.1), while other quality metrics remain stable. This indicates that the I2V conditioning mechanism itself is a primary contributor to the observed motion suppression.
We hypothesize that this motion suppression stems from the I2V model's premature over-conditioning on high-frequency components (fine details, textures, sharp edges) present in the reference image.
To investigate further, we inspect the internal representations of the DiT denoiser (Wan 2.1) and visualize them using PCA. We observe that the model rapidly "locks in" on static, fine-grained details after just one denoising step. This early commitment ("shortcut") prematurely confines the generation trajectory and hinders the development of large, dynamic motion over time, which would otherwise emerge along a natural coarse-to-fine generation trajectory.
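A minimal sketch of this visualization, assuming `features` holds intermediate DiT activations collected at a given denoising step with shape `(num_frames, num_tokens, hidden_dim)`; which layer and step to probe follows the paper's setup, not this snippet.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_feature_maps(features, grid_h, grid_w):
    """Project per-token features onto 3 principal components for RGB display."""
    t, n, d = features.shape
    flat = features.reshape(t * n, d)
    proj = PCA(n_components=3).fit_transform(flat)          # (T*N, 3)
    # Normalize each component to [0, 1] so it can be rendered as RGB.
    proj = (proj - proj.min(0)) / (proj.max(0) - proj.min(0) + 1e-8)
    return proj.reshape(t, grid_h, grid_w, 3)               # one RGB map per frame
```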
To verify whether high-frequency detail is indeed the cause, we apply low-pass filters (e.g., image downsampling) of varying strengths to the input image before I2V generation. The results support our hypothesis: stronger low-pass filtering consistently increases the dynamic degree of the generated videos.
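A minimal sketch of this probe, using downsample-then-upsample as the low-pass filter and assuming a `(1, C, H, W)` image tensor in `[0, 1]`; `scale` controls the filter strength (smaller means stronger filtering).

```python
import torch
import torch.nn.functional as F

def lowpass_downsample(image, scale):
    """Low-pass filter an image by downsampling and upsampling back."""
    if scale >= 1.0:
        return image  # no filtering
    _, _, h, w = image.shape
    small = F.interpolate(image, scale_factor=scale,
                          mode="bilinear", align_corners=False)
    return F.interpolate(small, size=(h, w),
                         mode="bilinear", align_corners=False)
```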
Alongside this dynamicness enhancement, we observe the elimination of the "shortcut" effect (bottom row):
This restores a coarse-to-fine trajectory and allows greater flexibility during sampling, which in turn yields a more dynamic video.
However, the simple low-pass filtering "solution" discussed above has an inherent trade-off: aggressive low-pass filtering degrades image fidelity, since the model is conditioned on a blurred reference (i.e., the original image cannot be recovered). This motivates a more nuanced sampling method:
If the early shortcut in the trajectory causes motion suppression, can we bypass it with low-pass filtering in early sampling steps, then reduce the filter strength later for image fidelity?
Our method, Adaptive Low-pass Guidance (ALG), does exactly this by adaptively modulating the frequency content in images:
We observe that ALG (right) effectively enhances motion in generated videos without sacrificing input-image fidelity.
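A minimal sketch of ALG-style sampling is shown below. The `denoiser` and `scheduler` interfaces are placeholders, and the schedule (filter strength decaying over the first `alg_steps` steps, then no filtering) is illustrative; the exact schedule and filter strength follow the paper.

```python
def sample_with_alg(denoiser, scheduler, cond_image, prompt,
                    num_steps=50, alg_steps=10, init_scale=0.125):
    """Denoising loop where early steps see a low-passed conditioning image."""
    latents = scheduler.init_noise()
    for step in range(num_steps):
        if step < alg_steps:
            # Early steps: condition on a low-passed image (strength decays
            # toward no filtering) to avoid the appearance shortcut.
            scale = init_scale + (1.0 - init_scale) * step / alg_steps
            image_cond = lowpass_downsample(cond_image, scale)  # from the snippet above
        else:
            # Later steps: switch back to the original image to preserve fidelity.
            image_cond = cond_image
        noise_pred = denoiser(latents, image_cond, prompt, step)
        latents = scheduler.step(noise_pred, latents, step)
    return latents
```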
Evaluation on VBench-I2V shows an average 36% increase in Dynamic Degree across 4 commonly used open-source I2V models, without a significant drop in image fidelity or video quality.
More qualitative examples can be found in the Gallery page of our website.