
UniVG: Towards UNIfied-modal Video Generation

Ludan Ruan, Lei Tian, Chuanwei Huang, Xu Zhang, Xinyan Xiao,

Baidu Inc.

*Best viewed in the highest resolution and with the sound on

Diffusion-based video generation has received extensive attention and achieved considerable success in both the academic and industrial communities. However, current efforts are mainly concentrated on single-objective or single-task video generation, such as generation driven by text, by image, or by a combination of text and image. This cannot fully meet the needs of real-world application scenarios, as users are likely to input image and text conditions flexibly, either individually or in combination. To address this, we propose a UNIfied-modal Video Generation system that is capable of handling arbitrary combinations of text and images as input conditions. To this end, we reclassify the multiple tasks of video generation models within the system into two categories: high-freedom generation and low-freedom generation. For high-freedom video generation, we use multi-condition cross attention to generate videos that align with the semantics of the input images or text. For low-freedom video generation, we introduce Biased Gaussian Noise to replace completely random Gaussian noise in order to better preserve the content of the input conditions. Our method achieves the lowest FVD on the MSR-VTT dataset, surpasses current open-source methods in human evaluations, and is on par with the current closed-source method Gen2.
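The core idea of Biased Gaussian Noise, as described in the abstract, is to start low-freedom generation from noise that retains information from the input condition rather than from fully random noise. A minimal toy sketch of this idea is shown below; the function name, the `bias` parameter, and the blending formula are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def biased_gaussian_noise(cond_latent, bias=0.5, rng=None):
    """Toy sketch: blend a conditioning latent with Gaussian noise.

    bias=0.0 recovers a fully random Gaussian initialization;
    bias=1.0 starts exactly at the conditioning latent.
    All names here are hypothetical, for illustration only.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    eps = rng.standard_normal(cond_latent.shape)
    # Square-root weights keep the two terms on comparable scales.
    return np.sqrt(bias) * cond_latent + np.sqrt(1.0 - bias) * eps

# Example: initialize denoising from a latent biased toward the condition.
z_cond = np.zeros((4, 8, 8))          # placeholder conditioning latent
z_init = biased_gaussian_noise(z_cond, bias=0.7)
```

In a real system the biased initialization would be fed to the diffusion sampler in place of pure noise, so that editing-style (low-freedom) tasks stay close to the input content.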


A cat is eating carrot.


A girl is floating with fish around.


A cat is drinking beer.

A cat is yawning

A cat reading a book at the snow mountain.


@article{ruan2024univg,
 title={UniVG: Towards UNIfied-modal Video Generation},
 author={Ludan Ruan and Lei Tian and Chuanwei Huang and Xu Zhang and Xinyan Xiao},
 journal={arXiv preprint arXiv:2401.09084},
 year={2024}
}