ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation

✨NeurIPS 2024 D&B Spotlight✨
1Peking University, 2Peng Cheng Laboratory, 3Rabbitpre Intelligence, 4University of Rochester, 5Shanghai Jiao Tong University, 6National University of Singapore, 7University of California Santa Cruz

ChronoMagic-Bench reflects the physical-prior capacity of T2V models!

[Gallery: reference time-lapse videos from the four categories, Biological / Human-Created / Meteorological / Physical, with prompts such as "Time-lapse of microgreens germinating and growing ...", "Time-lapse of a modern house being constructed in ...", "Time-lapse of a solar eclipse showing the moon's ...", and "Time-lapse of an ice cube melting on a solid ...".]
| Benchmark | Type | Visual Quality | Text Relevance | Metamorphic Amplitude | Temporal Coherence |
| --- | --- | --- | --- | --- | --- |
| UCF-101 | General | ✔️ | ✔️ | | |
| Make-a-Video-Eval | General | ✔️ | ✔️ | | |
| MSR-VTT | General | ✔️ | ✔️ | | |
| FETV | General | ✔️ | ✔️ | | ✔️ |
| VBench | General | ✔️ | ✔️ | | ✔️ |
| T2VScore | General | ✔️ | ✔️ | | |
| ChronoMagic-Bench | Time-lapse | ✔️ | ✔️ | ✔️ | ✔️ |

Overview of existing T2V benchmarks. We propose ChronoMagic-Bench, a benchmark for metamorphic evaluation of text-to-time-lapse video generation, which provides a comprehensive evaluation system for T2V models. We specifically design four major categories for time-lapse videos (as shown above), including biological, human-created, meteorological, and physical videos, and extend these to 75 subcategories. Based on this, we construct ChronoMagic-Bench, comprising 1,649 prompts and their corresponding reference time-lapse videos. In contrast to existing benchmarks, ChronoMagic-Bench emphasizes generating videos with high persistence and strong variation, i.e., metamorphic videos with high physical prior content. Additionally, we develop MTScore for evaluating metamorphic amplitude and CHScore for temporal coherence to address the deficiencies in existing evaluation metrics and perspectives.

Abstract

We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of T2V models (e.g., Sora and Lumiere) in time-lapse video generation. Compared to existing benchmarks that focus on the visual quality and text relevance of generated videos, ChronoMagic-Bench focuses on a model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities under free-form text control. To this end, ChronoMagic-Bench introduces 1,649 prompts and real-world reference videos, categorized into four major types of time-lapse video: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization ensures a comprehensive evaluation of a model's capacity to handle diverse and complex transformations. To accurately align the benchmark with human preference, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the metamorphic attributes and temporal coherence of generated videos. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses temporal coherence, ensuring that generated videos maintain logical progression and continuity. Based on ChronoMagic-Bench, we conduct comprehensive manual evaluations of eighteen representative T2V models, revealing their strengths and weaknesses across different categories of prompts and providing a thorough evaluation framework that addresses current gaps in video generation research. More encouragingly, we create the large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions. Each video-caption pair exhibits high physical content and large metamorphic amplitude, which we expect to have a far-reaching impact on the video generation community.
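The exact MTScore and CHScore pipelines are described in the paper; purely as intuition for what "metamorphic amplitude" and "temporal coherence" capture, here is a minimal illustrative sketch. It is not the benchmark's actual implementation: the `embed_frame` stand-in and both proxy formulas are our own assumptions for illustration.

```python
import numpy as np

def embed_frame(frame: np.ndarray) -> np.ndarray:
    """Illustrative stand-in for a learned image encoder (e.g., a CLIP
    image tower); here we simply use the flattened, L2-normalized pixels."""
    v = frame.astype(np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b)

def metamorphic_amplitude_proxy(frames: list[np.ndarray]) -> float:
    """Rough proxy: how different the video's end state is from its start.
    A melting ice cube scores high; a near-static shot scores near zero."""
    return 1.0 - cosine(embed_frame(frames[0]), embed_frame(frames[-1]))

def temporal_coherence_proxy(frames: list[np.ndarray]) -> float:
    """Rough proxy: average similarity between consecutive frames.
    Abrupt jumps and flicker lower the score."""
    embs = [embed_frame(f) for f in frames]
    return float(np.mean([cosine(a, b) for a, b in zip(embs, embs[1:])]))

# Toy usage with random arrays standing in for decoded video frames.
frames = [np.random.rand(64, 64, 3) for _ in range(8)]
print(metamorphic_amplitude_proxy(frames), temporal_coherence_proxy(frames))
```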

ChronoMagic-Bench Results of Open-Source T2V Models

We visualize the evaluation results of various open-source T2V generation models across ChronoMagic-Bench.

The values have been normalized for better readability of the chart: each set of performance values is affinely rescaled to a common range of [0.3, 0.8]. The formula used is 0.3 + 0.5 × (value - min_value) / (max_value - min_value), where (value - min_value) / (max_value - min_value) first maps the set to [0, 1].
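A minimal sketch of this rescaling (the function name and example values are ours, for illustration only):

```python
import numpy as np

def normalize_scores(values, lo=0.3, hi=0.8):
    """Affinely rescale a set of raw metric values into [lo, hi] so that
    all models remain readable on a shared chart."""
    v = np.asarray(values, dtype=np.float64)
    unit = (v - v.min()) / (v.max() - v.min() + 1e-12)  # map to [0, 1]
    return lo + (hi - lo) * unit                        # then to [lo, hi]

# Example with hypothetical raw scores for four models.
print(normalize_scores([0.13, 0.22, 0.35, 0.49]))  # endpoints land at 0.3 and 0.8
```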

ChronoMagic-Bench Results of Closed-Source T2V Models

We visualize the evaluation results of various closed-source T2V generation models across ChronoMagic-Bench.

The values have been normalized in the same way as above: each set of performance values is affinely rescaled to the common range [0.3, 0.8].

Leaderboard

Metamorphic Amplitude References

Evaluation examples with varying MTScore. A larger score denotes better performance.

[Grid of example videos at MTScore ≈ 0.2, 0.3, 0.4, and 0.5, one row per category: Biological, Human-Created, Meteorological, Physical.]

Temporal Coherence References

Evaluation examples with varying CHScore. A larger score denotes better performance.

[Grid of example videos at CHScore ≈ 40.0, 55.0, 80.0, and 100.0, one row per category: Biological, Human-Created, Meteorological, Physical.]

Validation of the Automatic Metrics

We invited 171 participants to evaluate the videos across four dimensions; to keep the task manageable and the experience satisfying, each participant chose among only five representative baseline results.

It is evident that the three proposed metrics, MTScore, CHScore, and GPT4o-MTScore, are consistent with human perception and accurately reflect the metamorphic amplitude and temporal coherence of T2V models. (τ and ρ denote Kendall's and Spearman's rank correlation coefficients, respectively; ↑ denotes higher is better.)
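As a sketch of how such agreement with human ratings can be computed (the score arrays below are hypothetical placeholders, not the study's data):

```python
from scipy.stats import kendalltau, spearmanr

# Hypothetical per-model scores: automatic metric vs. mean human rating.
mtscore = [0.21, 0.35, 0.48, 0.19, 0.50]
human   = [2.1,  3.0,  4.2,  1.8,  4.5]

tau, tau_p = kendalltau(mtscore, human)   # Kendall's rank correlation
rho, rho_p = spearmanr(mtscore, human)    # Spearman's rank correlation
print(f"Kendall's tau = {tau:.3f} (p={tau_p:.3g})")
print(f"Spearman's rho = {rho:.3f} (p={rho_p:.3g})")
```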

Evaluation Examples of Different Models

Method "an unidentified plant growing [...]" "day-to-night cycle over a city [...]" "showing varying traffic flow [...]"
Gen-2 Gen-2
MTScore=0.21, CHScore=85.80
Gen-2
MTScore=0.35, CHScore=73.03
Gen-2
MTScore=0.48, CHScore=53.96
PiKa-1.0 PiKa-1.0
MTScore=0.20, CHScore=91.93
PiKa-1.0
MTScore=0.41, CHScore=54.77
PiKa-1.0
MTScore=0.50, CHScore=47.67
KeLing KeLing
MTScore=0.19, CHScore=106.32
KeLing
MTScore=0.38, CHScore=82.91
KeLing
MTScore=0.47, CHScore=48.87
Dream Machine Dream Machine
MTScore=0.31, CHScore=85.35
Dream Machine
MTScore=0.43, CHScore=85.15
Dream Machine
MTScore=0.54, CHScore=53.31
AnimateDiff-V3 AnimateDiff-V3
MTScore=0.32, CHScore=47.39
AnimateDiff-V3
MTScore=0.28, CHScore=79.66
AnimateDiff-V3
MTScore=0.49, CHScore=38.24
VideoCrafter2 VideoCrafter2
MTScore=0.27, CHScore=59.45
VideoCrafter2
MTScore=0.35, CHScore=54.14
VideoCrafter2
MTScore=0.50, CHScore=50.05
MagicTime MagicTime
MTScore=0.49, CHScore=63.96
MagicTime
MTScore=0.34, CHScore=44.28
MagicTime
MTScore=0.51, CHScore=73.56
OpenSoraPlan v1.1 OpenSoraPlan v1.1
MTScore=0.17, CHScore=93.38
OpenSoraPlan v1.1
MTScore=0.18, CHScore=40.14
OpenSoraPlan v1.1
MTScore=0.31, CHScore=44.59
OpenSora 1.1 OpenSora 1.1
MTScore=0.19, CHScore=79.39
OpenSora 1.1
MTScore=0.15, CHScore=58.01
OpenSora 1.1
MTScore=0.21, CHScore=49.07
OpenSora 1.2 OpenSora 1.2
MTScore=0.13, CHScore=85.89
OpenSora 1.2
MTScore=0.22, CHScore=38.45
OpenSora 1.2
MTScore=0.41, CHScore=38.40
EasyAnimate-V3 EasyAnimate-V3
MTScore=0.22, CHScore=81.44
EasyAnimate-V3
MTScore=0.42, CHScore=41.65
EasyAnimate-V3
MTScore=0.44, CHScore=39.68
CogVideoX CogVideoX
MTScore=0.30, CHScore=50.91
CogVideoX
MTScore=0.45, CHScore=75.92
CogVideoX
MTScore=0.46, CHScore=36.47

ChronoMagic-Bench Statistics

Both ChronoMagic-Bench and ChronoMagic-Pro are organized around four major types of natural phenomena. The "biological" category encompasses all content related to living organisms; the "human-created" category includes objects created or influenced by human activity; the "meteorological" category covers weather-related phenomena; and the "physical" category pertains to non-biological physical phenomena, such as water flow and volcanic eruptions.

This image showcases the word cloud and word count range of the prompts in ChronoMagic-Bench, which mainly describe videos with large metamorphic amplitude and long persistence.

ChronoMagic-Pro/ProH Statistics

Statistics of video clips in (top) ChronoMagic-Pro and (bottom) ChronoMagic-ProH. The datasets include a diverse range of categories, durations, and caption lengths, with most videos at 720p resolution. ChronoMagic-ProH has higher quality and purity (e.g., a higher aesthetic score).

BibTeX

If you find our work useful, please consider citing our paper:

@article{yuan2024chronomagic,
  title={ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation},
  author={Yuan, Shenghai and Huang, Jinfa and Xu, Yongqi and Liu, Yaoyang and Zhang, Shaofeng and Shi, Yujun and Zhu, Ruijie and Cheng, Xinhua and Luo, Jiebo and Yuan, Li},
  journal={arXiv preprint arXiv:2406.18522},
  year={2024}
}