ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

✨NeurIPS 2024 D&B Spotlight✨
1Peking University, 2Peng Cheng Laboratory, 3Rabbitpre Intelligence, 4University of Rochester, 5Shanghai Jiao Tong University, 6National University of Singapore, 7University of California Santa Cruz

ChronoMagic-Bench reflects the physical-prior capacity of T2V models!

[Gallery: reference time-lapse videos from the four categories, Biological / Human-Created / Meteorological / Physical, with prompts such as "Time-lapse of microgreens germinating and growing ...", "Time-lapse of a modern house being constructed in ...", "Time-lapse of a solar eclipse showing the moon's ...", and "Time-lapse of an ice cube melting on a solid ...".]
| Benchmark | Type | Visual Quality | Text Relevance | Metamorphic Amplitude | Temporal Coherence |
| --- | --- | --- | --- | --- | --- |
| UCF-101 | General | ✔️ | ✔️ | | |
| Make-a-Video-Eval | General | ✔️ | ✔️ | | |
| MSR-VTT | General | ✔️ | ✔️ | | |
| FETV | General | ✔️ | ✔️ | | ✔️ |
| VBench | General | ✔️ | ✔️ | | ✔️ |
| T2VScore | General | ✔️ | ✔️ | | |
| ChronoMagic-Bench | Time-lapse | ✔️ | ✔️ | ✔️ | ✔️ |

Overview of existing T2V benchmarks. We propose ChronoMagic-Bench, a benchmark for metamorphic evaluation of text-to-time-lapse video generation, which provides a comprehensive evaluation system for T2V models. We specifically design four major categories for time-lapse videos (as shown above), including biological, human-created, meteorological, and physical videos, and extend these to 75 subcategories. Based on this, we construct ChronoMagic-Bench, comprising 1,649 prompts and their corresponding reference time-lapse videos. In contrast to existing benchmarks, ChronoMagic-Bench emphasizes generating videos with high persistence and strong variation, i.e., metamorphic videos with high physical prior content. Additionally, we develop MTScore for evaluating metamorphic amplitude and CHScore for temporal coherence to address the deficiencies in existing evaluation metrics and perspectives.

Abstract

We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of T2V models (e.g., Sora and Lumiere) in time-lapse video generation. Compared to existing benchmarks that focus on the visual quality and text relevance of generated videos, ChronoMagic-Bench focuses on a model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities under free-form text control. To this end, ChronoMagic-Bench introduces 1,649 prompts and real-world reference videos, categorized into four major types of time-lapse video: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization ensures a comprehensive evaluation of a model's capacity to handle diverse and complex transformations. To accurately align the benchmark with human preference, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the metamorphic attributes and temporal coherence of generated videos. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses temporal coherence, ensuring that generated videos maintain logical progression and continuity. Based on ChronoMagic-Bench, we conduct comprehensive manual evaluations of eighteen representative T2V models, revealing their strengths and weaknesses across different categories of prompts and providing a thorough evaluation framework that addresses current gaps in video generation research. More encouragingly, we create the large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions. Each video-caption pair exhibits high physical content and large metamorphic amplitude, which we expect to have a far-reaching impact on the video generation community.
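The exact MTScore and CHScore pipelines are described in the paper; purely as intuition for what "metamorphic amplitude" and "temporal coherence" capture, here is a minimal illustrative sketch. It is not the benchmark's actual implementation: the `embed_frame` stand-in and both proxy formulas are our own assumptions for illustration.

```python
import numpy as np

def embed_frame(frame: np.ndarray) -> np.ndarray:
    """Illustrative stand-in for a learned image encoder (e.g., a CLIP
    image tower); here we simply use the flattened, L2-normalized pixels."""
    v = frame.astype(np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)

def metamorphic_amplitude_proxy(frames: list[np.ndarray]) -> float:
    """Rough proxy: how different the video's end state is from its start.
    A melting ice cube scores high; a near-static shot scores near zero."""
    return 1.0 - cosine(embed_frame(frames[0]), embed_frame(frames[-1]))

def temporal_coherence_proxy(frames: list[np.ndarray]) -> float:
    """Rough proxy: average similarity between consecutive frames.
    Abrupt jumps and flicker lower the score."""
    embs = [embed_frame(f) for f in frames]
    return float(np.mean([cosine(a, b) for a, b in zip(embs, embs[1:])]))

# Toy usage with random arrays standing in for decoded video frames.
frames = [np.random.rand(64, 64, 3) for _ in range(8)]
print(metamorphic_amplitude_proxy(frames), temporal_coherence_proxy(frames))
```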

ChronoMagic-Bench Results of Open-Source T2V Models

We visualize the evaluation results of various open-source T2V generation models across ChronoMagic-Bench.

The values have been normalized for better readability of the chart: each set of performance values is affinely rescaled to a common range of [0.3, 0.8]. The formula used is 0.3 + 0.5 × (value - min_value) / (max_value - min_value), where (value - min_value) / (max_value - min_value) first maps the set to [0, 1].
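A minimal sketch of this rescaling (the function name and example values are ours, for illustration only):

```python
import numpy as np

def normalize_scores(values, lo=0.3, hi=0.8):
    """Affinely rescale a set of raw metric values into [lo, hi] so that
    all models remain readable on a shared chart."""
    v = np.asarray(values, dtype=np.float64)
    unit = (v - v.min()) / (v.max() - v.min() + 1e-12)  # map to [0, 1]
    return lo + (hi - lo) * unit                        # then to [lo, hi]

# Example with hypothetical raw scores for four models.
print(normalize_scores([0.13, 0.22, 0.35, 0.49]))  # endpoints land at 0.3 and 0.8
```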

ChronoMagic-Bench Results of Closed-Source T2V Models

We visualize the evaluation results of various closed-source T2V generation models across ChronoMagic-Bench.

The values have been normalized in the same way as above: each set of performance values is affinely rescaled to the common range [0.3, 0.8].

Leaderboard

Metamorphic Amplitude References

Evaluation examples with varying MTScore. A larger score denotes better performance.

[Grid of example videos at MTScore ≈ 0.2, 0.3, 0.4, and 0.5, one row per category: Biological, Human-Created, Meteorological, Physical.]

Temporal Coherence References

Evaluation examples with varying CHScore. A larger score denotes better performance.

[Grid of example videos at CHScore ≈ 40.0, 55.0, 80.0, and 100.0, one row per category: Biological, Human-Created, Meteorological, Physical.]

Validation of the Automatic Metrics

We invited 171 participants to evaluate the videos across four dimensions; to keep the task manageable and the experience satisfying, each participant chose among only five representative baseline results.

It is evident that the three proposed metrics, MTScore, CHScore, and GPT4o-MTScore, are consistent with human perception and accurately reflect the metamorphic amplitude and temporal coherence of T2V models. (τ and ρ denote Kendall's and Spearman's rank correlation coefficients, respectively; ↑ denotes higher is better.)
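As a sketch of how such agreement with human ratings can be computed (the score arrays below are hypothetical placeholders, not the study's data):

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical per-model scores: automatic metric vs. mean human rating.
mtscore = [0.21, 0.35, 0.48, 0.19, 0.50]
human   = [2.1,  3.0,  4.2,  1.8,  4.5]

tau, tau_p = kendalltau(mtscore, human)   # Kendall's rank correlation
rho, rho_p = spearmanr(mtscore, human)    # Spearman's rank correlation
print(f"Kendall's tau = {tau:.3f} (p={tau_p:.3g})")
print(f"Spearman's rho = {rho:.3f} (p={rho_p:.3g})")
```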

Evaluation Examples of Different Models

Method "an unidentified plant growing [...]" "day-to-night cycle over a city [...]" "showing varying traffic flow [...]"
Gen-2 Gen-2
MTScore=0.21, CHScore=85.80
Gen-2
MTScore=0.35, CHScore=73.03
Gen-2
MTScore=0.48, CHScore=53.96
PiKa-1.0 PiKa-1.0
MTScore=0.20, CHScore=91.93
PiKa-1.0
MTScore=0.41, CHScore=54.77
PiKa-1.0
MTScore=0.50, CHScore=47.67
KeLing KeLing
MTScore=0.19, CHScore=106.32
KeLing
MTScore=0.38, CHScore=82.91
KeLing
MTScore=0.47, CHScore=48.87
Dream Machine Dream Machine
MTScore=0.31, CHScore=85.35
Dream Machine
MTScore=0.43, CHScore=85.15
Dream Machine
MTScore=0.54, CHScore=53.31
AnimateDiff-V3 AnimateDiff-V3
MTScore=0.32, CHScore=47.39
AnimateDiff-V3
MTScore=0.28, CHScore=79.66
AnimateDiff-V3
MTScore=0.49, CHScore=38.24
VideoCrafter2 VideoCrafter2
MTScore=0.27, CHScore=59.45
VideoCrafter2
MTScore=0.35, CHScore=54.14
VideoCrafter2
MTScore=0.50, CHScore=50.05
MagicTime MagicTime
MTScore=0.49, CHScore=63.96
MagicTime
MTScore=0.34, CHScore=44.28
MagicTime
MTScore=0.51, CHScore=73.56
OpenSoraPlan v1.1 OpenSoraPlan v1.1
MTScore=0.17, CHScore=93.38
OpenSoraPlan v1.1
MTScore=0.18, CHScore=40.14
OpenSoraPlan v1.1
MTScore=0.31, CHScore=44.59
OpenSora 1.1 OpenSora 1.1
MTScore=0.19, CHScore=79.39
OpenSora 1.1
MTScore=0.15, CHScore=58.01
OpenSora 1.1
MTScore=0.21, CHScore=49.07
OpenSora 1.2 OpenSora 1.2
MTScore=0.13, CHScore=85.89
OpenSora 1.2
MTScore=0.22, CHScore=38.45
OpenSora 1.2
MTScore=0.41, CHScore=38.40
EasyAnimate-V3 EasyAnimate-V3
MTScore=0.22, CHScore=81.44
EasyAnimate-V3
MTScore=0.42, CHScore=41.65
EasyAnimate-V3
MTScore=0.44, CHScore=39.68
CogVideoX CogVideoX
MTScore=0.30, CHScore=50.91
CogVideoX
MTScore=0.45, CHScore=75.92
CogVideoX
MTScore=0.46, CHScore=36.47

ChronoMagic-Bench Statistics

Both ChronoMagic-Bench and ChronoMagic-Pro are organized around four major types of natural phenomena. The "biological" category encompasses all content related to living organisms; the "human-created" category includes objects created or influenced by human activity; the "meteorological" category covers weather-related phenomena; and the "physical" category pertains to non-biological physical phenomena, such as water flow and volcanic eruptions.

This image showcases the word cloud and word count range of the prompts in ChronoMagic-Bench, which mainly describe videos with large metamorphic amplitude and long persistence.

ChronoMagic-Pro/ProH Statistics

Statistics of video clips in (top) ChronoMagic-Pro and (bottom) ChronoMagic-ProH. The datasets include a diverse range of categories, durations, and caption lengths, with most videos at 720p resolution. ChronoMagic-ProH has higher quality and purity (e.g., a higher aesthetic score).

BibTeX

If you find our work useful, please consider citing our paper:

@article{yuan2024chronomagic,
  title={ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation},
  author={Yuan, Shenghai and Huang, Jinfa and Xu, Yongqi and Liu, Yaoyang and Zhang, Shaofeng and Shi, Yujun and Zhu, Ruijie and Cheng, Xinhua and Luo, Jiebo and Yuan, Li},
  journal={arXiv preprint arXiv:2406.18522},
  year={2024}
}