OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

Peking University, University of Rochester, Rabbitpre AI

OpenS2V-Nexus delivers a robust infrastructure to accelerate S2V research!

Abstract

Subject-to-Video (S2V) generation aims to create videos that faithfully incorporate reference content, providing enhanced flexibility in video production. To establish the infrastructure for S2V generation, we propose OpenS2V-Nexus, consisting of (i) OpenS2V-Eval, a fine-grained benchmark, and (ii) OpenS2V-5M, a million-scale dataset. In contrast to existing S2V benchmarks inherited from VBench, which focus on global and coarse-grained assessment of generated videos, OpenS2V-Eval focuses on the model's ability to generate subject-consistent videos with natural subject appearance and identity fidelity. To this end, OpenS2V-Eval introduces 180 prompts from seven major categories of S2V, incorporating both real and synthetic test data. Furthermore, to accurately align human preferences with S2V benchmarks, we propose three automatic metrics, NexusScore, NaturalScore, and GmeScore, to separately quantify subject consistency, naturalness, and text relevance in generated videos. Building on this, we conduct a comprehensive evaluation of 11 representative S2V models, highlighting their strengths and weaknesses across different content types. Moreover, we create the first open-source large-scale S2V generation dataset, OpenS2V-5M, which consists of five million high-quality 720P subject-text-video triplets. Specifically, we ensure subject-information diversity in our dataset by (1) segmenting subjects and building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations. Through OpenS2V-Nexus, we deliver a robust infrastructure to accelerate future S2V generation research.

Seven Categories of Subject-to-Video Generation

To construct OpenS2V-Nexus for subject-to-video that incorporates diverse visual concepts, we divide this task into seven categories: ① single-face-to-video, ② single-body-to-video, ③ single-entity-to-video, ④ multi-face-to-video, ⑤ multi-body-to-video, ⑥ multi-entity-to-video, and ⑦ human-entity-to-video.
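
For bookkeeping during evaluation, the seven categories can be encoded as a simple enumeration. The following is a minimal Python sketch; the identifier names and the prompt schema are our own assumptions, not part of the released code.

from enum import Enum

class S2VCategory(Enum):
    """The seven subject-to-video categories used by OpenS2V-Eval."""
    SINGLE_FACE = "single-face-to-video"
    SINGLE_BODY = "single-body-to-video"
    SINGLE_ENTITY = "single-entity-to-video"
    MULTI_FACE = "multi-face-to-video"
    MULTI_BODY = "multi-body-to-video"
    MULTI_ENTITY = "multi-entity-to-video"
    HUMAN_ENTITY = "human-entity-to-video"

def group_by_category(prompts):
    """Group benchmark prompts by category; each prompt is assumed to be a
    dict with a "category" field holding one of the strings above."""
    groups = {cat: [] for cat in S2VCategory}
    for p in prompts:
        groups[S2VCategory(p["category"])].append(p)
    return groups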

Key Challenges for the S2V Model

Challenge 1

Poor generalization: These models often perform poorly when encountering subject categories not seen during training. For instance, a model trained exclusively on Western subjects typically performs worse when generating Asian subjects.

Challenge 2

Copy-paste issue: The model tends to directly transfer the pose, lighting, and contours from the reference image to the video, resulting in unnatural outcomes.

Challenge 3

Inadequate human fidelity: Current models often struggle to preserve human identity as effectively as they do non-human entities.

Regular Data vs Nexus Data

Unlike previous methods that rely solely on regular subject-text-video triples, where subject images are segmented from the training frames themselves and may therefore teach the model shortcuts rather than intrinsic knowledge, we enrich the training data with Nexus Data. We do so by (1) building pairing information via cross-video associations and (2) prompting GPT-Image-1 on raw frames to synthesize multi-view representations, which addresses the three core challenges above at the data level.
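
A minimal sketch of the cross-video pairing idea, assuming subject crops have already been detected and embedded; the data schema, helper names, and similarity threshold are hypothetical and not the released pipeline.

import numpy as np

def pair_subjects_across_videos(crops, sim_threshold=0.85):
    """Pair subject crops that come from *different* videos but likely depict
    the same subject. `crops` is a list of dicts with keys "video_id",
    "embedding" (1-D np.ndarray), and "image" (assumed schema). Returns
    (reference_crop, target_crop) pairs so the reference never originates
    from the target video's own frames."""
    pairs = []
    for i, a in enumerate(crops):
        for b in crops[i + 1:]:
            if a["video_id"] == b["video_id"]:
                continue  # same-video pairs are Regular Data, not Nexus Data
            ea, eb = a["embedding"], b["embedding"]
            sim = float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-8))
            if sim > sim_threshold:  # likely the same subject seen in another video
                pairs.append((a, b))
    return pairs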

OpenS2V-Eval Pipeline

(Left) Our benchmark includes not only real subject images but also synthetic images constructed with GPT-Image-1, allowing for a more comprehensive evaluation. (Right) The metrics are tailored for subject-to-video generation, evaluating not only S2V-specific characteristics (e.g., consistency) but also basic video elements (e.g., motion).
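
To illustrate how the three automatic scores could be combined into a single leaderboard number, here is a hedged sketch; the weights and the assumption that each score lies in [0, 1] are placeholders, not the official aggregation.

def aggregate_scores(nexus_score, natural_score, gme_score,
                     weights=(0.4, 0.3, 0.3)):
    """Weighted combination of subject consistency (NexusScore), naturalness
    (NaturalScore), and text relevance (GmeScore). Inputs are assumed to lie
    in [0, 1]; the weights are illustrative only."""
    w_nexus, w_natural, w_gme = weights
    return w_nexus * nexus_score + w_natural * natural_score + w_gme * gme_score

# Example: a model with strong consistency but weaker text following.
print(aggregate_scores(0.82, 0.75, 0.61))  # -> 0.736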

OpenS2V-Eval Results

We visualize the evaluation results of various Subject-to-Video generation models across the Open-Domain, Human-Domain, and Single-Object settings. The values have been normalized for better chart readability.
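
The normalization mentioned above can be done with a simple per-metric min-max rescale; the exact scheme used for the charts is not specified here, so the sketch below is an assumption.

import numpy as np

def minmax_normalize(scores):
    """Rescale a 2-D array of raw scores (models x metrics) to [0, 1] per
    metric, preserving the relative ordering while making the charts readable."""
    scores = np.asarray(scores, dtype=float)
    lo = scores.min(axis=0, keepdims=True)
    hi = scores.max(axis=0, keepdims=True)
    return (scores - lo) / np.maximum(hi - lo, 1e-8)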

Open-Domain Evaluation Examples

Human-Domain Evaluation Examples

Single-Object Evaluation Examples

Automatic Metrics References

OpenS2V-5M Pipeline

First, we filter out low-quality videos based on scores such as aesthetics and motion, then utilize GroundingDINO and SAM2.1 to extract subject images and obtain Regular Data. Subsequently, we create Nexus Data through cross-video association and GPT-Image-1 to address the three core issues encountered by S2V models.
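
A hedged sketch of the filtering and extraction stage, assuming per-clip aesthetic and motion scores have already been computed; the thresholds, the clip schema, and the detect_subjects / segment_subject wrappers around GroundingDINO and SAM2.1 are placeholders, not the released code.

def build_regular_data(clips, detect_subjects, segment_subject,
                       aesthetic_min=4.5, motion_min=0.3):
    """Keep only clips whose aesthetic and motion scores pass the thresholds,
    then crop subject images from their frames to form Regular Data.

    `clips` is a list of dicts: {"frames": [...], "aesthetic": float,
    "motion": float, "caption": str} (assumed schema). `detect_subjects`
    returns bounding boxes for a frame; `segment_subject` returns a masked crop."""
    triplets = []
    for clip in clips:
        if clip["aesthetic"] < aesthetic_min or clip["motion"] < motion_min:
            continue  # drop low-quality videos
        frame = clip["frames"][0]  # e.g., take one representative frame
        for box in detect_subjects(frame):          # GroundingDINO-style detector (wrapper assumed)
            subject_img = segment_subject(frame, box)  # SAM2.1-style segmenter (wrapper assumed)
            triplets.append({"subject": subject_img,
                             "text": clip["caption"],
                             "video": clip["frames"]})
    return triplets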

OpenS2V-5M Statistics

The dataset covers a diverse range of categories, clip durations, and caption lengths, with most videos being of high quality (e.g., in resolution and aesthetics).
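
For readers who want to reproduce such summary statistics from the metadata, here is a minimal pandas sketch; the file name and column names are assumptions about the metadata schema, not the released format.

import pandas as pd

# Assumed: one JSON record per clip with fields such as "category",
# "duration" (seconds), "caption", "height", and "aesthetic".
df = pd.read_json("metadata.jsonl", lines=True)

print(df["category"].value_counts())                    # category distribution
print(df["duration"].describe())                        # clip-duration statistics
print(df["caption"].str.split().str.len().describe())   # caption length in words
print((df["height"] >= 720).mean())                     # fraction of clips at 720P or above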

Samples of OpenS2V-5M

Verification of OpenS2V-5M

All videos are generated with a fixed seed by MAGREF‡, which was trained on a subset of OpenS2V-5M containing about 0.5M high-quality samples.

BibTeX

@article{yuan2025opens2v,
  title={OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation},
  author={Yuan, Shenghai and He, Xianyi and Deng, Yufan and Ye, Yang and Huang, Jinfa and Luo, Jiebo and Yuan, Li},
  journal={arXiv preprint arXiv:2505.20292},
  year={2025}
}