Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) a tuning-free pipeline without tedious case-by-case finetuning, and (2) a frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose ConsisID, a tuning-free DiT-based controllable IPT2V model that keeps human identity consistent in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the transformer blocks, enhancing the model's ability to preserve fine-grained features. To leverage the frequency information for identity preservation, we propose a hierarchical training strategy, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, ConsisID achieves excellent results in generating high-quality, identity-preserving videos, making strides towards more effective IPT2V.
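The low-/high-frequency split described in the abstract can be illustrated with a toy example. The sketch below is not the paper's global or local facial extractor; it only shows, under simplifying assumptions, how a grayscale image (a list of rows of floats) can be separated into a smooth low-frequency part and a high-frequency residual. The function names `box_blur` and `split_frequencies` and the box-filter choice are hypothetical and used purely for illustration.

```python
def box_blur(img, radius=1):
    """Low-pass filter: mean over a (2r+1)x(2r+1) window, truncated at the image borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += img[yy][xx]
                        count += 1
            out[y][x] = total / count
    return out


def split_frequencies(img, radius=1):
    """Return (low, high) such that low + high reconstructs img.

    low  : blurred image (coarse structure, e.g. overall shape)
    high : residual img - low (edges and fine detail)
    """
    low = box_blur(img, radius)
    high = [
        [img[y][x] - low[y][x] for x in range(len(img[0]))]
        for y in range(len(img))
    ]
    return low, high
```

In this toy setting, a perfectly flat image has an all-zero high-frequency part, while a single bright pixel ends up mostly in the residual. A real system would use learned extractors rather than a fixed filter.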
Shallow (e.g., low-level, low-frequency) features are essential for pixel-level prediction tasks in diffusion models, as they ease model training. U-Net facilitates model convergence by aggregating shallow features to the decoder via long skip connections, a mechanism that DiT does not incorporate.
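The long-skip mechanism mentioned above can be sketched in a few lines. This is a toy control-flow illustration only, assuming layers are plain Python callables and features are simple values; `unet_like_forward`, `bottleneck`, and `merge` are hypothetical names, not part of any real U-Net or DiT implementation.

```python
def unet_like_forward(x, encoder_layers, bottleneck, decoder_layers, merge):
    """Toy U-Net-style pass: shallow encoder outputs are re-injected into the decoder."""
    skips = []
    for layer in encoder_layers:
        x = layer(x)
        skips.append(x)  # remember each encoder feature for later
    x = bottleneck(x)
    for layer in decoder_layers:
        # Deepest skip is consumed first; the shallowest one reaches the last decoder layer.
        x = layer(merge(x, skips.pop()))
    return x


def plain_stack_forward(x, layers):
    """Toy stack without long skips: shallow features only reach the output through every later layer."""
    for layer in layers:
        x = layer(x)
    return x
```

The contrast between the two functions mirrors the point of the finding: in the first, shallow features have a direct path to the output side; in the second, no such path exists unless it is added explicitly.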
Transformers have limited perception of high-frequency information, which is important for preserving facial features. The encoder-decoder architecture of U-Net naturally provides multi-scale features (e.g., rich high-frequency information), while DiT lacks a comparable structure.
Based on these findings about DiT, low-frequency facial information is embedded into the shallow layers, while high-frequency information is incorporated into the vision tokens within the attention blocks. The ID-preserving recipe is applied to ease training and improve generalization. Cross-face, DropToken, and Dropout are each applied with a given probability.
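The probability-gated operations in the caption can be pictured as independent random draws per training sample. The sketch below is an assumption-laden illustration: the probabilities, the `drop_ratio`, the function names `sample_recipe_flags` and `drop_tokens`, and the interpretation of DropToken as "keep a random subset of tokens" are all hypothetical and are not taken from the source.

```python
import random


def sample_recipe_flags(rng, p_cross_face=0.5, p_drop_token=0.1, p_dropout=0.1):
    """Decide, per training sample, which operations are active (hypothetical probabilities)."""
    return {
        "cross_face": rng.random() < p_cross_face,
        "drop_token": rng.random() < p_drop_token,
        "dropout": rng.random() < p_dropout,
    }


def drop_tokens(tokens, rng, drop_ratio=0.25):
    """Assumed DropToken behaviour: keep a random subset of tokens, preserving their order."""
    keep = max(1, round(len(tokens) * (1 - drop_ratio)))
    kept_indices = sorted(rng.sample(range(len(tokens)), keep))
    return [tokens[i] for i in kept_indices]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
flags = sample_recipe_flags(rng)
identity_tokens = list(range(8))  # stand-in for identity feature tokens
if flags["drop_token"]:
    identity_tokens = drop_tokens(identity_tokens, rng)
```

Gating each operation independently means a given sample may see none, some, or all of them, which is one common way such regularizers are scheduled during training.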
Qualitative comparison. Each prompt below was rendered by six methods, shown in this order: ConsisID, DreamVideo, ID-Animator, MagicMe, MotionBooth, Vidu.
The video depicts a woman sitting at an office desk, engaged in her work. She is dressed in a formal suit and appears to be focused on her computer screen. The office environment is well-organized, with shelves filled with binders and other office supplies neatly arranged ......
The video features a man sitting in the driver's seat of a car. He is wearing glasses and a dark-colored dress, and his hair is neatly styled. The interior of the car appears to be modern, with a light-colored dashboard and a steering wheel that has some buttons on it ......
A man gently clutching a bouquet of vibrant flowers, his eyes radiating a serene contentment as he glances at the camera. His slightly upturned lips convey a sense of calm joy, accompanied by a faint twinkle in his eye. The scene is set in a lush garden ......
The video depicts a young woman sitting at a wooden desk, deeply engrossed in her work. She is wearing glasses and has long hair that falls over her shoulders. The woman appears to be focused on a document or piece of paper in front of her, as she writes with a pen ......
The video features a man sitting at a desk in front of a large screen displaying an American flag. He is wearing a plaid shirt and appears to be delivering a news report or commentary. The background behind him consists of a large screen with the American flag displayed prominently ......
A man gazing thoughtfully into the distance, with a serene expression that reveals a slight furrowing of the brow and a softening around the eyes. His lips part subtly, as if caught in a moment of contemplation or inspiration. The open field around him is expansive ......
The video features a man standing in front of a large screen displaying the words "Tech Minute" and the logo for CNET. He is wearing a purple top and appears to be presenting or speaking about technology-related topics ......
The video features a young man who appears to be a content creator or streamer. He is wearing a green sleeveless top and red headphones. The background is illuminated with vibrant neon lights, predominantly in shades of purple and blue, creating a lively and energetic atmosphere ......
The video features a woman in exquisite hybrid armor adorned with iridescent gemstones, standing amidst gently falling cherry blossoms. Her piercing yet serene gaze hints at quiet determination, as a breeze catches a loose strand of her hair ......
The video features a baby wearing a bright superhero cape, standing confidently with arms raised in a powerful pose. The baby has a determined look on their face, with eyes wide and lips pursed in concentration, as if ready to take on a challenge ......
The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured ......
The video features a man standing at an easel, focused intently as his brush dances across the canvas. His expression is one of deep concentration, with a hint of satisfaction as each brushstroke adds color and form ......
A woman sitting by a window in a cozy cafe, enjoying a cup of coffee. Her eyes softly gaze out the window, reflecting a sense of contentment. Her lips part slightly into a serene smile, the corners of her mouth gently lifting ......
The video features a young man standing outdoors in a snowy park. He is wearing a colorful winter jacket with a floral pattern and a white knit hat, and he smiles and gives a thumbs-up gesture towards the camera ......
The video features a woman standing outdoors in a serene, natural setting. The woman has long blonde hair and is wearing a light-colored, long dress with a delicate pattern ......
The video features a young woman with long blonde hair standing in front of a lush, green bush adorned with white flowers. She is wearing a black top and appears to be enjoying the natural surroundings ......
The video features a man walking down a city street at night, engrossed in his smartphone. He is dressed in a formal suit and tie, suggesting he might be a professional or businessman ......
The video features a woman standing in front of a large screen displaying the words "Tech Minute" and the logo for CNET. She is wearing a purple top and appears to be presenting or speaking about technology-related topics ......
The video features a man standing next to an airplane, engaged in a conversation on his cell phone. He is wearing sunglasses and a black top, and he appears to be talking seriously ......
The video features a woman with blonde hair standing on a beach near the water's edge. She is wearing a black swimsuit and appears to be enjoying her time by the sea. The sky above is clear with some clouds ......
The video features a woman with blonde hair, wearing a blue tank top and holding a pink tank top on a hanger. She appears to be in a clothing store or a similar retail environment, as there are racks of clothes visible in the background ......
The video features a man sitting in a red armchair, enjoying a cup of coffee or tea. He is dressed in a light-colored outfit and has long dark hair. The setting appears to be indoors ......
The video shows a young boy sitting at a table, eating a piece of food. He appears to be enjoying his meal, as he takes a bite and chews it. The boy is wearing a blue shirt and has short hair. The background is dark ......
A woman with an anticipatory smile, her eyes twinkling with excitement as she holds a camera, poised to capture a perfect moment. Her face is animated with enthusiasm; her lips slightly parted as she concentrates on framing the shot ......
A woman wearing a colorful scarf and cozy sweater, her eyes sparkling with a hint of wonder as she looks around at the falling leaves. Her lips curl into a slight, content smile, adding a touch of warmth to the cool air ......
The video features a news reporter who is walking down a city street at night while holding a microphone and speaking to the camera. The reporter is wearing a white coat and a blue tie ......
The video features a man jogging along a grassy path next to a body of water, likely a lake or river. He is wearing a white sports bra and appears to be focused on his run. The background shows a serene outdoor setting with green trees ......
The video features a woman dressed as a mermaid, swimming underwater. She is wearing a silver tail and a matching top, which is adorned with colorful patterns. The woman has long, wavy hair that flows freely underwater ......
The video features a man dressed in a blue suit and tie, sitting in a newsroom setting. The background includes a large screen displaying various news graphics and text. The man appears to be a news anchor or host ......
The video depicts a young man engaged in an intense gaming session. He is seated in a gaming chair, which is designed to provide ergonomic support and comfort during extended periods of use. The chair has a distinctive design ......
The video features a young man walking through a park during sunset. He is wearing a sleeveless top with a geometric pattern and denim shorts. The man has long, dark hair that falls over his shoulders. In his hands, he holds a skateboard ......
The video depicts a young man sitting at a wooden desk, deeply engrossed in his work. He is wearing glasses and has long hair that falls over his shoulders. The man appears to be focused on a document or piece of paper in front of him ......
A woman adorned with a delicate flower crown is standing amidst a field of gently swaying wildflowers. Her eyes sparkle with a serene gaze, and a faint smile graces her lips, suggesting a moment of peaceful contentment ......
The video features a news reporter who is walking down a city street at night while holding a microphone and speaking to the camera. The reporter is wearing a white coat and a blue tie, and he appears to be reporting on a story ......
The video depicts a young girl walking through a greenhouse filled with lush green plants and vegetables. She is wearing a white dress and appears to be enjoying her time in the garden. The greenhouse has a wooden structure ......
The video features a man standing in front of the camera, his gaze focused and intense as he holds the basketball. He is dressed in athletic gear, his posture exuding concentration. The basketball is positioned firmly in his hands ......
The video features a man with dark hair, wearing a blue tank top and holding a pink tank top on a hanger. He appears to be in a clothing store or a similar retail environment, as there are racks of clothes visible in the background ......
The video features a woman walking down a city street at night, engrossed in her smartphone. She is dressed in a formal suit and tie, suggesting she might be a professional or businessman. The street is illuminated with various neon lights ......
The video features a woman dressed as a mermaid, swimming underwater. She is wearing a silver tail and a matching top, which is adorned with colorful patterns. The woman has long, wavy hair that flows freely underwater ......
The video features a news reporter who is walking down a city street at night while holding a microphone and speaking to the camera. The reporter is wearing a white coat and a blue tie, and he appears to be reporting on a story ......
The video features a woman standing outdoors in a serene, natural setting. She is leaning against a tall, white column that is part of a larger structure, possibly a gazebo or pavilion. The woman has long blonde hair ......
The video features a man walking down a city street at night, engrossed in his smartphone. He is dressed in a formal suit and tie, suggesting he might be a professional or businessman. The street is illuminated with various neon lights ......
The video depicts a man sitting at an office desk, engaged in his work. He is dressed in a formal suit and appears to be focused on his computer screen. The office environment is well-organized, with shelves filled with binders ......
The video depicts a young man sitting at a wooden desk, deeply engrossed in his work. He is wearing glasses and has long hair that falls over his shoulders. The man appears to be focused on a document or piece of paper in front of him ......
The video features a little girl with pigtails, a backpack, and a bright smile, skipping joyfully down a bustling city street. She holds a bunch of colorful balloons that sway as she moves, her eyes wide with excitement. The backdrop of vibrant buildings ......
The video features a girl sitting on a stool, playing a guitar with her fingers moving skillfully across the strings, creating a soulful melody. Her eyes are closed, and there's a slight smile on her face ......
The video features a man with a rugged beard, wearing a leather jacket, riding a vintage motorcycle along a desert highway. His expression is focused, eyes narrowed slightly against the wind, as the setting sun casts a warm glow ......
The video shows a man celebrating his birthday, holding a piece of cake while he prepares to blow out the candles. He is smiling warmly, and as he closes his eyes to make a wish, a hint of emotion crosses his face—perhaps a moment of nostalgia ......
@misc{yuan2024identitypreservingtexttovideogenerationfrequency,
title={Identity-Preserving Text-to-Video Generation by Frequency Decomposition},
author={Shenghai Yuan and Jinfa Huang and Xianyi He and Yunyuan Ge and Yujun Shi and Liuhan Chen and Jiebo Luo and Li Yuan},
year={2024},
eprint={2411.17440},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2411.17440},
}