OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

Peking University, University of Rochester, Rabbitpre AI

OpenS2V-Nexus delivers a robust infrastructure to accelerate S2V research!

Abstract

Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in video production. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench, which focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. To this end, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, incorporating both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore, and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 11 representative S2V models, highlighting their strengths and weaknesses across different content types. Moreover, we create the first open-source large-scale S2V generation dataset, OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triplets. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.

Seven Categories of Subject-to-Video Generation

To construct OpenS2V-Nexus for subject-to-video that incorporates diverse visual concepts, we divide this task into seven categories: ① single-face-to-video, ② single-body-to-video, ③ single-entity-to-video, ④ multi-face-to-video, ⑤ multi-body-to-video, ⑥ multi-entity-to-video, and ⑦ human-entity-to-video.
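
For bookkeeping during evaluation, the seven categories can be encoded as a simple enumeration. The following is a minimal Python sketch; the identifier names and the prompt schema are our own assumptions, not part of the released code.

from enum import Enum

class S2VCategory(Enum):
    """The seven subject-to-video categories used by OpenS2V-Eval."""
    SINGLE_FACE = "single-face-to-video"
    SINGLE_BODY = "single-body-to-video"
    SINGLE_ENTITY = "single-entity-to-video"
    MULTI_FACE = "multi-face-to-video"
    MULTI_BODY = "multi-body-to-video"
    MULTI_ENTITY = "multi-entity-to-video"
    HUMAN_ENTITY = "human-entity-to-video"

def group_by_category(prompts):
    """Group benchmark prompts by category; each prompt is assumed to be a
    dict with a "category" field holding one of the strings above."""
    groups = {cat: [] for cat in S2VCategory}
    for p in prompts:
        groups[S2VCategory(p["category"])].append(p)
    return groups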

Key Challenges for the S2V Model

Challenge 1

Poor generalization: These models often perform poorly when encountering subject categories not seen during training. For instance, a model trained exclusively on Western subjects typically performs worse when generating Asian subjects.

Challenge 2

Copy-paste issue: The model tends to directly transfer the pose, lighting, and contours from the reference image to the video, resulting in unnatural outcomes.

Challenge 3

Inadequate human fidelity: Current models often struggle to preserve human identity as effectively as they do non-human entities.

Regular Data vs Nexus Data

Unlike previous methods that rely solely on regular subject-text-video triples, where subject images are segmented from the training frames themselves and may therefore teach the model shortcuts rather than intrinsic knowledge, we enrich the training data with Nexus Data. We do so by (1) building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations, which addresses the three core challenges above at the data level.
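
A minimal sketch of the cross-video pairing idea, assuming subject crops have already been detected and embedded; the data schema, helper names, and similarity threshold are hypothetical and not the released pipeline.

import numpy as np

def pair_subjects_across_videos(crops, sim_threshold=0.85):
    """Pair subject crops that come from *different* videos but likely depict
    the same subject. `crops` is a list of dicts with keys "video_id",
    "embedding" (1-D np.ndarray), and "image" (assumed schema). Returns
    (reference_crop, target_crop) pairs so the reference never originates
    from the target video's own frames."""
    pairs = []
    for i, a in enumerate(crops):
        for b in crops[i + 1:]:
            if a["video_id"] == b["video_id"]:
                continue  # same-video pairs are Regular Data, not Nexus Data
            ea, eb = a["embedding"], b["embedding"]
            sim = float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-8))
            if sim > sim_threshold:  # likely the same subject seen in another video
                pairs.append((a, b))
    return pairs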

OpenS2V-Eval Pipeline

(Left) Our benchmark includes not only real subject images but also synthetic images constructed with GPT-Image-1, allowing for a more comprehensive evaluation. (Right) The metrics are tailored for subject-to-video generation, evaluating not only S2V-specific characteristics (e.g., consistency) but also basic video elements (e.g., motion).
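
To illustrate how the three automatic scores could be combined into a single leaderboard number, here is a hedged sketch; the weights and the assumption that each score lies in [0, 1] are placeholders, not the official aggregation.

def aggregate_scores(nexus_score, natural_score, gme_score,
                     weights=(0.4, 0.3, 0.3)):
    """Weighted combination of subject consistency (NexusScore), naturalness
    (NaturalScore), and text relevance (GmeScore). Inputs are assumed to lie
    in [0, 1]; the weights are illustrative only."""
    w_nexus, w_natural, w_gme = weights
    return w_nexus * nexus_score + w_natural * natural_score + w_gme * gme_score

# Example: a model with strong consistency but weaker text following.
print(aggregate_scores(0.82, 0.75, 0.61))  # -> 0.736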

OpenS2V-Eval Results

We visualize the evaluation results of various Subject-to-Video generation models across the Open-Domain, Human-Domain, and Single-Object settings. The values have been normalized for better chart readability.
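
The normalization mentioned above can be done with a simple per-metric min-max rescale; the exact scheme used for the charts is not specified here, so the sketch below is an assumption.

import numpy as np

def minmax_normalize(scores):
    """Rescale a 2-D array of raw scores (models x metrics) to [0, 1] per
    metric, preserving the relative ordering while making the charts readable."""
    scores = np.asarray(scores, dtype=float)
    lo = scores.min(axis=0, keepdims=True)
    hi = scores.max(axis=0, keepdims=True)
    return (scores - lo) / np.maximum(hi - lo, 1e-8)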

Open-Domain Evaluation Examples

Human-Domain Evaluation Examples

Single-Object Evaluation Examples

Automatic Metrics References

OpenS2V-5M Pipeline

First, we filter out low-quality videos based on scores such as aesthetics and motion, then utilize GroundingDINO and SAM2.1 to extract subject images and obtain Regular Data. Subsequently, we create Nexus Data through cross-video association and GPT-Image-1 to address the three core issues encountered by S2V models.
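
A hedged sketch of the filtering and extraction stage, assuming per-clip aesthetic and motion scores have already been computed; the thresholds, the clip schema, and the detect_subjects / segment_subject wrappers around GroundingDINO and SAM2.1 are placeholders, not the released code.

def build_regular_data(clips, detect_subjects, segment_subject,
                       aesthetic_min=4.5, motion_min=0.3):
    """Keep only clips whose aesthetic and motion scores pass the thresholds,
    then crop subject images from their frames to form Regular Data.

    `clips` is a list of dicts: {"frames": [...], "aesthetic": float,
    "motion": float, "caption": str} (assumed schema). `detect_subjects`
    returns bounding boxes for a frame; `segment_subject` returns a masked crop."""
    triplets = []
    for clip in clips:
        if clip["aesthetic"] < aesthetic_min or clip["motion"] < motion_min:
            continue  # drop low-quality videos
        frame = clip["frames"][0]  # e.g., take one representative frame
        for box in detect_subjects(frame):          # GroundingDINO-style detector (wrapper assumed)
            subject_img = segment_subject(frame, box)  # SAM2.1-style segmenter (wrapper assumed)
            triplets.append({"subject": subject_img,
                             "text": clip["caption"],
                             "video": clip["frames"]})
    return triplets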

OpenS2V-5M Statistics

The dataset covers a diverse range of categories, clip durations, and caption lengths, with most videos being of high quality (e.g., in resolution and aesthetics).
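
For readers who want to reproduce such summary statistics from the metadata, here is a minimal pandas sketch; the file name and column names are assumptions about the metadata schema, not the released format.

import pandas as pd

# Assumed: one JSON record per clip with fields such as "category",
# "duration" (seconds), "caption", "height", and "aesthetic".
df = pd.read_json("metadata.jsonl", lines=True)

print(df["category"].value_counts())                    # category distribution
print(df["duration"].describe())                        # clip-duration statistics
print(df["caption"].str.split().str.len().describe())   # caption length in words
print((df["height"] >= 720).mean())                     # fraction of clips at 720P or above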

Samples of OpenS2V-5M

Verification of OpenS2V-5M

All videos are generated with a fixed seed by MAGREF‡, which was trained on a subset of OpenS2V-5M containing about 0.5M high-quality samples.

BibTeX

@article{yuan2025opens2v,
  title={OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation},
  author={Yuan, Shenghai and He, Xianyi and Deng, Yufan and Ye, Yang and Huang, Jinfa and Luo, Jiebo and Yuan, Li},
  journal={arXiv preprint arXiv:2505.20292},
  year={2025}
}