In Image-to-Video (I2V) generation, a video is created using an input image as the first-frame condition. Existing I2V methods concatenate the full information of the conditional image with noisy latents to achieve high fidelity. However, the denoisers in these methods tend to shortcut the conditional image, which is known as conditional image leakage, leading to performance degradation issues such as slow motion and color inconsistency. In this work, we further clarify that conditional image leakage leads to overfitting to in-domain data and decreases the performance in out-of-domain scenarios. Moreover, we introduce Fourier-Guided Latent Shifting I2V, named FlashI2V, to prevent conditional image leakage. Concretely, FlashI2V consists of: (1) Latent Shifting. We modify the source and target distributions of flow matching by subtracting the conditional image information from the noisy latents, thereby incorporating the condition implicitly. (2) Fourier Guidance. We use high-frequency magnitude features obtained by the Fourier Transform to accelerate convergence and enable the adjustment of detail levels in the generated video. Experimental results show that our method effectively overcomes conditional image leakage and achieves the best generalization and performance on out-of-domain data among various I2V paradigms. With only 1.3B parameters, FlashI2V achieves a dynamic degree score of 53.01 on Vbench-I2V, surpassing CogVideoX1.5-5B-I2V and Wan2.1-I2V-14B-480P.
Existing I2V methods suffer from conditional image leakage. (a) Conditional image leakage causes performance degradation, shown here in videos sampled from Wan2.1-I2V-14B-480P with Vbench-I2V text-image pairs. (b) In the existing I2V paradigm, chunk-wise FVD on in-domain data increases over time, while chunk-wise FVD on out-of-domain data remains consistently high. This indicates that the generation law the existing paradigm learns on in-domain data fails to generalize to out-of-domain data.
Based on this finding, we propose FlashI2V, which introduces the condition implicitly. We extract features from the conditional image latents using a learnable projection, then apply latent shifting to obtain a renewed intermediate state that implicitly contains the condition. In parallel, the conditional image latents undergo the Fourier Transform to extract high-frequency magnitude features as guidance, which are concatenated with the noisy latents and fed into the DiT. During inference, we start from the shifted noise, progressively denoise it by following the ODE, and finally decode the video.
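The pipeline described above can be illustrated with a small NumPy sketch. It is one plausible reading of the description, not the released implementation. The function names, the per-channel linear map standing in for the learnable projection, the linear interpolation path with t=1 as noise and t=0 as data, the Euler solver, and the `cutoff` parameter are all assumptions made for illustration.

```python
import numpy as np

def project(image_latent, weight):
    # Hypothetical stand-in for the learnable projection: a per-channel linear map.
    # image_latent: (C, H, W), weight: (C, C)
    return np.einsum("dc,chw->dhw", weight, image_latent)

def shifted_state(video_latent, noise, shift, t):
    # Latent shifting (assumed form): subtract the projected image features
    # from both ends of the flow-matching path.
    # video_latent, noise: (C, T, H, W); shift: (C, H, W), broadcast over frames.
    s = shift[:, None, :, :]
    source = noise - s          # shifted noise (t = 1)
    target = video_latent - s   # shifted data (t = 0)
    x_t = (1.0 - t) * target + t * source
    velocity = source - target  # the shift cancels: equals noise - video_latent
    return x_t, velocity

def high_frequency_magnitude(image_latent, cutoff=0.25):
    # Fourier guidance (assumed form): spectrum magnitude with low frequencies zeroed.
    # cutoff is in cycles per sample (0.5 is Nyquist); raising it keeps less detail.
    _, H, W = image_latent.shape
    spectrum = np.fft.fft2(image_latent, axes=(-2, -1))
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    mask = np.sqrt(fx ** 2 + fy ** 2) >= cutoff
    return np.abs(spectrum) * mask

def denoiser_input(x_t, guidance):
    # Concatenate the guidance with the noisy latents along the channel axis,
    # repeating it across the T frames.
    g = np.broadcast_to(guidance[:, None, :, :], x_t.shape)
    return np.concatenate([x_t, g], axis=0)

def euler_sample(velocity_fn, noise, shift, steps=10):
    # Inference: start from the shifted noise and integrate the ODE from t=1 to t=0,
    # then undo the shift to recover the video latents.
    s = shift[:, None, :, :]
    x = noise - s
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = x + (t_next - t_cur) * velocity_fn(x, t_cur)
    return x + s
```

Under these assumptions the regression target is unchanged by the shift, so the condition enters only through the shifted state and the guidance channels. In a real model, `velocity_fn` would be the DiT evaluated on `denoiser_input(x, guidance)`.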
A curious cat peers intently into the lens, its wide eyes shimmering with intrigue. The monochromatic palette accentuates the feline’s delicate whiskers and soft fur ......
FlashI2V (1.3B)
Wan2.1-I2V-14B-480P
CogVideoX1.5-5B-I2V
Suspended in the vast silence of space, an astronaut drifts weightlessly above the glowing curve of Earth. The planet’s vibrant blues and swirling white clouds stretch beneath, illuminated by the soft ......
FlashI2V (1.3B)
Wan2.1-I2V-14B-480P
CogVideoX1.5-5B-I2V
A majestic hot-air balloon drifts gracefully above a sunlit desert, its striped envelope casting a gentle shadow over rugged hills and ancient rock formations. The vast expanse of golden sand stretches ......
FlashI2V (1.3B)
Wan2.1-I2V-14B-480P
CogVideoX1.5-5B-I2V
A powerful brown bear stands knee-deep in a shimmering river, its wet fur glistening in the sunlight. Clamped firmly in its jaws is a freshly caught fish, still dripping with water and struggling in vain ......
FlashI2V (1.3B)
Wan2.1-I2V-14B-480P
CogVideoX1.5-5B-I2V
A skilled rider navigates a rugged dirt bike course, expertly maneuvering over a massive concrete pipe and scattered rocks. Dressed in full protective gear and a vibrant helmet, the rider leans forward ......
FlashI2V (1.3B)
Wan2.1-I2V-14B-480P
CogVideoX1.5-5B-I2V
A vibrant red bus glides through a snow-blanketed city street, its headlights cutting through the swirling flakes and illuminating the icy road ahead. Towering buildings loom on either side ......
FlashI2V (1.3B)
Wan2.1-I2V-14B-480P
CogVideoX1.5-5B-I2V
A sleek red Alfa Romeo sports car tears down a winding road, its aerodynamic curves gleaming under the sunlight. The polished crimson body reflects the blur of trees and sky as it slices through the landscape with effortless speed ......
FlashI2V (1.3B)
Wan2.1-I2V-14B-480P
CogVideoX1.5-5B-I2V
A powerful steam locomotive thunders along narrow tracks, its iron body gleaming beneath a canopy of towering evergreens. Billowing clouds of smoke and steam swirl into the forest air ......
FlashI2V (1.3B)
Wan2.1-I2V-14B-480P
CogVideoX1.5-5B-I2V
Comparing the chunk-wise FVD variation patterns of different I2V paradigms on the training and validation sets, we observe that only FlashI2V exhibits the same time-increasing FVD pattern on both sets. This suggests that only FlashI2V can apply the generation law learned from in-domain data to out-of-domain data. FlashI2V also has the lowest out-of-domain FVD, demonstrating its performance advantage.
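As a rough illustration of how a chunk-wise curve of this kind could be computed, the sketch below evaluates a Fréchet distance per temporal chunk. It assumes that per-chunk features have already been extracted from real and generated videos by a video feature network (FVD is commonly computed on I3D features). The chunking protocol, array layout, and function names are assumptions, not details taken from the paper.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    # Fréchet distance between Gaussians fitted to two feature sets of shape (N, D).
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) via eigenvalues, which are real and non-negative
    # for a product of PSD matrices; clip to absorb small numerical noise.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = ((mu_a - mu_b) ** 2).sum()
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

def chunkwise_frechet(real_feats, fake_feats):
    # real_feats, fake_feats: (num_videos, num_chunks, feat_dim), assumed layout.
    # Returns one distance per temporal chunk, in chronological order.
    num_chunks = real_feats.shape[1]
    return [frechet_distance(real_feats[:, k], fake_feats[:, k])
            for k in range(num_chunks)]
```

Plotting the returned list against the chunk index for in-domain and out-of-domain sets would give curves comparable in spirit to those discussed above, although absolute values depend on the feature extractor and sample count.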
A man dressed in a sharp black suit and an ornate sombrero stands out in a lively outdoor setting. His voice rings out powerfully, capturing the energy and spirit of traditional Mexican celebration ......
A traveler sits atop a pale horse, guided steadily by a companion who leads the way across a windswept, sandy plain. Dust swirls around their boots as they move forward, framed by distant hills and a misty sky ......
A graceful woman stands elegantly in a flowing blue sari, her long hair cascading over her shoulder as she gently plays with its strands. Gold bangles adorn her wrist, adding a touch of shimmer to her poised look ......
An open book lies engulfed in vivid flames, its pages curling and blackening as fire consumes the words. Bright orange tongues of fire dance across the paper, casting dramatic shadows and illuminating ......
Towering flames leap into the darkness, casting a golden glow across the night. Thick logs crackle and pop as the bonfire roars, sending a shower of glowing embers swirling upward into the cool air ......
Dressed in a traditional Mexican charro suit adorned with intricate silver embroidery and a large, elegant bow tie, a musician stands confidently among lush greenery and rustic decor. He holds an acoustic guitar ......
A determined cyclist powers his mountain bike up a rugged, rocky hilltop, surrounded by sweeping views and dramatic skies. Wearing a backpack and dressed in athletic gear, he navigates the uneven terrain ......
A man sits on the worn stone steps of a rustic yellow house, strumming an acoustic guitar. Surrounded by overgrown grass and climbing vines, he creates a tranquil scene, his relaxed posture and casual attire ......
A rider grips the handlebars of a motorcycle, cruising down a winding, sun-dappled road flanked by lush green trees. The world blurs at the edges, capturing the exhilarating sense of speed and freedom ......
A rider in vibrant racing gear leans into a sharp turn, expertly maneuvering an ATV across a rugged dirt track. Dust billows dramatically behind the vehicle, capturing the intensity and speed of the moment ......
A woman with flowing, jet-black hair stands with her back slightly turned, exuding an air of quiet confidence. Sunlight softly illuminates her bare shoulder, revealed by a loosely draped beige sweater that slips ......
A young boy stands in an open field beneath a brilliant blue sky, his dark hair catching the sunlight. Behind him, a group of horses graze peacefully on the grass, their coats glowing in the afternoon light ......
A charming penguin waddles confidently along the shoreline, its sleek black-and-white plumage standing out against the dark, pebbled sand. Gentle waves lap at the beach just behind, creating a dynamic ......
Steam rises as hot water is carefully poured from a rustic kettle over fresh coffee grounds, blooming in a paper filter atop a glass pot. The rich aroma of brewing coffee mingles with the crisp outdoor air ......
A graceful sea turtle glides effortlessly through crystal-clear turquoise waters, its patterned shell catching the sunlight that filters down from above. With powerful, paddle-like flippers, it moves serenely ......
Two women share a lively meal at a cozy restaurant, savoring slices of fresh pizza. The table is set with elegant wine glasses and plates, creating an inviting atmosphere for conversation and laughter ......