diff --git a/assets/corr.jpg b/assets/corr.jpg
new file mode 100644
index 0000000..cffda71
Binary files /dev/null and b/assets/corr.jpg differ
diff --git a/assets/dataset.mp4 b/assets/dataset.mp4
new file mode 100644
index 0000000..4a9257e
Binary files /dev/null and b/assets/dataset.mp4 differ
diff --git a/assets/domain_gap.mp4 b/assets/domain_gap.mp4
new file mode 100644
index 0000000..657bde7
Binary files /dev/null and b/assets/domain_gap.mp4 differ
diff --git a/assets/hf-logo.svg b/assets/hf-logo.svg
new file mode 100644
index 0000000..ab959d1
--- /dev/null
+++ b/assets/hf-logo.svg
@@ -0,0 +1,8 @@
+
diff --git a/assets/info.svg b/assets/info.svg
new file mode 100644
index 0000000..3047279
--- /dev/null
+++ b/assets/info.svg
@@ -0,0 +1,3 @@
+
diff --git a/assets/showlab.ico b/assets/showlab.ico
new file mode 100644
index 0000000..95610d3
Binary files /dev/null and b/assets/showlab.ico differ
diff --git a/assets/teaser.mp4 b/assets/teaser.mp4
new file mode 100644
index 0000000..fadda5a
Binary files /dev/null and b/assets/teaser.mp4 differ
diff --git a/assets/teaser/01.mp4 b/assets/teaser/01.mp4
new file mode 100644
index 0000000..65d1af2
Binary files /dev/null and b/assets/teaser/01.mp4 differ
diff --git a/assets/teaser/02.mp4 b/assets/teaser/02.mp4
new file mode 100644
index 0000000..f6a94aa
Binary files /dev/null and b/assets/teaser/02.mp4 differ
diff --git a/assets/teaser/03.mp4 b/assets/teaser/03.mp4
new file mode 100644
index 0000000..fd8bcb6
Binary files /dev/null and b/assets/teaser/03.mp4 differ
diff --git a/assets/teaser/04.mp4 b/assets/teaser/04.mp4
new file mode 100644
index 0000000..ded207f
Binary files /dev/null and b/assets/teaser/04.mp4 differ
diff --git a/assets/teaser/05.mp4 b/assets/teaser/05.mp4
new file mode 100644
index 0000000..006b71d
Binary files /dev/null and b/assets/teaser/05.mp4 differ
diff --git a/assets/teaser/06.mp4 b/assets/teaser/06.mp4
new file mode 100644
index 0000000..de9e0db
Binary files /dev/null and b/assets/teaser/06.mp4 differ
diff --git a/assets/text_alignment.gif b/assets/text_alignment.gif
new file mode 100644
index 0000000..1b7784b
Binary files /dev/null and b/assets/text_alignment.gif differ
diff --git a/assets/text_alignment.mp4 b/assets/text_alignment.mp4
new file mode 100644
index 0000000..694c043
Binary files /dev/null and b/assets/text_alignment.mp4 differ
diff --git a/assets/video_quality.jpg b/assets/video_quality.jpg
new file mode 100644
index 0000000..a1283f5
Binary files /dev/null and b/assets/video_quality.jpg differ
diff --git a/index.html b/index.html
new file mode 100644
index 0000000..80ca0c0
--- /dev/null
+++ b/index.html
@@ -0,0 +1,408 @@
+
+ Generative models have demonstrated remarkable capability in synthesizing high-quality text, images, and videos.
+ Contemporary text-to-video models exhibit impressive generative ability, crafting visually stunning videos.
+ Nonetheless, evaluating such videos poses significant challenges. Current research predominantly employs automated metrics such as FVD, IS, and CLIP Score.
+ However, these metrics provide an incomplete analysis, particularly in the temporal assessment of video content, rendering them unreliable indicators of true video quality.
+ Furthermore, while user studies have the potential to reflect human perception accurately, they are time-intensive and laborious, and their outcomes are often tainted by subjective bias.
+ In this paper, we investigate the limitations of existing metrics and introduce a novel evaluation pipeline, the Text-to-Video Score (T2VScore).
+ This metric integrates two pivotal criteria: (1) Text-Video Alignment, which scrutinizes the fidelity of the video in representing the given text description, and (2) Video Quality, which evaluates the video's overall production caliber with a mixture of experts.
+ Moreover, to evaluate the proposed metrics and facilitate future improvements on them, we present the TVGE dataset, collecting human judgments of 2,543 text-to-video generated videos on the two criteria.
+ Experiments on the TVGE dataset demonstrate that the proposed T2VScore offers a better metric for text-to-video generation. The code and dataset will be open-sourced.
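+ As a rough illustration of the temporal blind spot noted above, the sketch below computes a naive video CLIP Score by averaging per-frame image-text similarity. Because each frame is scored independently, shuffling the frames leaves the score unchanged. This is a minimal sketch assuming a standard CLIP checkpoint, not the paper's implementation.
+
+ # Hedged sketch: naive per-frame CLIP Score for a video.
+ # Frame order never enters the computation, so temporal artifacts
+ # (flicker, shuffled motion) cannot lower the score.
+ import torch
+ from PIL import Image
+ from transformers import CLIPModel, CLIPProcessor
+
+ model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
+ processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
+
+ def naive_clip_score(frames: list[Image.Image], prompt: str) -> float:
+     inputs = processor(text=[prompt], images=frames,
+                        return_tensors="pt", padding=True)
+     with torch.no_grad():
+         out = model(**inputs)
+     # Projected embeddings; re-normalize defensively before cosine similarity.
+     img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
+     txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
+     # One similarity per frame, then a temporal average.
+     return (img @ txt.T).squeeze(-1).mean().item()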
+ Domain Gap with Natural Videos. The common distortions in generated videos (as in the TVGE dataset) differ from those in natural videos, both spatially and temporally.
+ Score Distributions in TVGE. In general, the generated videos receive lower-than-average human ratings on both criteria, suggesting that these methods still need continuous improvement before they can reliably produce plausible videos.
+ Nevertheless, individual models do show decent proficiency on a single dimension, e.g., Pika achieves an average score of 3.45 on video quality.
+ Between the two criteria, we observe a very low correlation (0.223 Spearman's ρ, 0.152 Kendall's τ), indicating that the two dimensions are distinct and should be considered independently.
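+ As a hedged illustration (not the paper's released evaluation code), the rank correlations quoted above can be computed with SciPy; `quality` and `alignment` below are placeholders for the per-video human ratings in TVGE.
+
+ # Hedged sketch: rank correlation between the two TVGE rating dimensions.
+ # Placeholder arrays stand in for the real per-video mean opinion scores;
+ # on the actual data the paper reports 0.223 (Spearman) and 0.152 (Kendall).
+ import numpy as np
+ from scipy.stats import kendalltau, spearmanr
+
+ rng = np.random.default_rng(0)
+ quality = rng.uniform(1, 5, size=2543)    # placeholder quality ratings
+ alignment = rng.uniform(1, 5, size=2543)  # placeholder alignment ratings
+
+ rho, _ = spearmanr(quality, alignment)
+ tau, _ = kendalltau(quality, alignment)
+ print(f"Spearman's rho = {rho:.3f}, Kendall's tau = {tau:.3f}")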
+
+ @article{wu2024towards,
+ title={Towards A Better Metric for Text-to-Video Generation},
+ author={xxx},
+ journal={arXiv preprint arXiv:xxxx.xxxxx},
+ year={2024}
+ }
+
+