We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of T2V models (e.g., Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on the visual quality and text relevance of generated videos, ChronoMagic-Bench focuses on a model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities through free-form text control. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and corresponding real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization ensures a comprehensive evaluation of the models' capacity to handle diverse and complex transformations. To accurately align the benchmark with human preferences, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the metamorphic attributes and temporal coherence of the generated videos. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses temporal coherence, ensuring that the generated videos maintain logical progression and continuity. Based on ChronoMagic-Bench, we conduct comprehensive manual evaluations of eighteen representative T2V models, revealing their strengths and weaknesses across different categories of prompts and providing a thorough evaluation framework that addresses current gaps in video generation research. Furthermore, we create the large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions, each featuring high physical relevance and large metamorphic amplitude, which we expect to have a far-reaching impact on the video generation community.
We visualize the evaluation results of various open-source T2V generation models across ChronoMagic-Bench.
The values have been normalized for better readability of the chart: each set of performance values is min-max rescaled into the range [0.3, 0.8], i.e., 0.3 + 0.5 × (value − min_value) / (max_value − min_value).
We visualize the evaluation results of various closed-source T2V generation models across ChronoMagic-Bench.
The values are normalized in the same way as above (each set of performance values is min-max rescaled into the range [0.3, 0.8]).
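For reference, here is a minimal sketch of this normalization in Python (assuming a simple per-metric min-max rescale into [0.3, 0.8]; the actual plotting code is not shown here):

```python
def rescale(values, lo=0.3, hi=0.8):
    """Min-max rescale a list of metric values into [lo, hi] for chart readability."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:                      # avoid division by zero for constant metrics
        return [(lo + hi) / 2.0] * len(values)
    return [lo + (hi - lo) * (v - v_min) / (v_max - v_min) for v in values]

# Example: raw scores of one metric across several models
print(rescale([0.13, 0.22, 0.35, 0.49]))   # -> values spanning exactly 0.3 .. 0.8
```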
Example videos with increasing MTScore (≈ 0.2, 0.3, 0.4, 0.5).

Example videos with increasing CHScore (≈ 40.0, 55.0, 80.0, 100.0).
It is evident that the three proposed metrics, MTScore, CHScore, and GPT4o-MTScore, are consistent with human perception and accurately reflect the metamorphic amplitude and temporal coherence of T2V models. (τ and ρ denote Kendall's and Spearman's rank correlation coefficients, respectively; ↑ means higher is better.)
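As an illustration of how such rank correlations can be computed (a minimal sketch using SciPy; the scores and ratings below are placeholders for illustration only, not results from the paper):

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical example: automatic metric scores vs. averaged human ratings
# for the same set of generated videos (placeholder numbers).
metric_scores = [0.21, 0.35, 0.48, 0.31, 0.50]
human_ratings = [2.0, 3.1, 4.2, 2.8, 4.5]

tau, _ = kendalltau(metric_scores, human_ratings)   # Kendall's rank correlation
rho, _ = spearmanr(metric_scores, human_ratings)    # Spearman's rank correlation
print(f"Kendall's tau = {tau:.3f}, Spearman's rho = {rho:.3f}")
```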
Method | "an unidentified plant growing [...]" | "day-to-night cycle over a city [...]" | "showing varying traffic flow [...]" |
---|---|---|---|
Gen-2 |
MTScore=0.21, CHScore=85.80 |
MTScore=0.35, CHScore=73.03 |
MTScore=0.48, CHScore=53.96 |
PiKa-1.0 |
MTScore=0.20, CHScore=91.93 |
MTScore=0.41, CHScore=54.77 |
MTScore=0.50, CHScore=47.67 |
KeLing |
MTScore=0.19, CHScore=106.32 |
MTScore=0.38, CHScore=82.91 |
MTScore=0.47, CHScore=48.87 |
Dream Machine |
MTScore=0.31, CHScore=85.35 |
MTScore=0.43, CHScore=85.15 |
MTScore=0.54, CHScore=53.31 |
AnimateDiff-V3 |
MTScore=0.32, CHScore=47.39 |
MTScore=0.28, CHScore=79.66 |
MTScore=0.49, CHScore=38.24 |
VideoCrafter2 |
MTScore=0.27, CHScore=59.45 |
MTScore=0.35, CHScore=54.14 |
MTScore=0.50, CHScore=50.05 |
MagicTime |
MTScore=0.49, CHScore=63.96 |
MTScore=0.34, CHScore=44.28 |
MTScore=0.51, CHScore=73.56 |
OpenSoraPlan v1.1 |
MTScore=0.17, CHScore=93.38 |
MTScore=0.18, CHScore=40.14 |
MTScore=0.31, CHScore=44.59 |
OpenSora 1.1 |
MTScore=0.19, CHScore=79.39 |
MTScore=0.15, CHScore=58.01 |
MTScore=0.21, CHScore=49.07 |
OpenSora 1.2 |
MTScore=0.13, CHScore=85.89 |
MTScore=0.22, CHScore=38.45 |
MTScore=0.41, CHScore=38.40 |
EasyAnimate-V3 |
MTScore=0.22, CHScore=81.44 |
MTScore=0.42, CHScore=41.65 |
MTScore=0.44, CHScore=39.68 |
CogVideoX |
MTScore=0.30, CHScore=50.91 |
MTScore=0.45, CHScore=75.92 |
MTScore=0.46, CHScore=36.47 |
Both ChronoMagic-Bench and ChronoMagic-Pro are organized around four major categories of time-lapse phenomena. The "biological" category encompasses all content related to living organisms; the "human-created" category includes objects created or influenced by human activities; the "meteorological" category covers weather-related phenomena; and the "physical" category pertains to non-biological physical phenomena, such as water flow and volcanic eruptions.
This image showcases the word cloud and word-count distribution of the prompts in ChronoMagic-Bench. The prompts mainly describe videos with large metamorphic amplitude and long persistence.
Video clip statistics for (top) ChronoMagic-Pro and (bottom) ChronoMagic-ProH. Both datasets cover a diverse range of categories, durations, and caption lengths, with most videos at 720P resolution. ChronoMagic-ProH has higher quality and purity (e.g., a higher Aesthetic Score).
If you find our work useful, please consider citing our paper:
@article{yuan2024chronomagic,
title={ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation},
author={Yuan, Shenghai and Huang, Jinfa and Xu, Yongqi and Liu, Yaoyang and Zhang, Shaofeng and Shi, Yujun and Zhu, Ruijie and Cheng, Xinhua and Luo, Jiebo and Yuan, Li},
journal={arXiv preprint arXiv:2406.18522},
year={2024}
}