UniVG: Towards UNIfied-modal Video Generation
Diffusion-based video generation has received significant attention in both the academic and industrial communities. Despite recent exploration of diverse conditional inputs for better video generation control, existing methods, which primarily target individual tasks, often fall short in real-world scenarios where users may supply any form of conditioning, either individually or in combination. To address this, we propose a Unified-modal Video Generation system capable of handling multiple video generation tasks across different modalities. Our approach introduces the concept of generative freedom in the diffusion process, which allows us to reclassify video generation tasks into high-freedom and low-freedom categories based on the solution space given certain conditions. We then design a different diffusion paradigm for each category. For high-freedom video generation, we present a base model that is capable of handling varied semantic combinations of text and image. For low-freedom video generation, we propose Biased Gaussian Noise (BGN) to tackle the discrepancy in the diffusion process between the training and inference stages when using a strong conditional guidance strategy. Our proposed UniVG achieves superior objective results on public datasets, surpassing current open-source methods, and is on par with the current closed-source methods Gen2 and Pika in human evaluations.
News
We provide all the generated results used in the human evaluation in our work. They can be downloaded as follows:
Google Drive Link
BaiduYun Link
Comparison With Others
Ours (Left) vs. PYOCO (Right)
Ours (1280x720)
PROMPT: A cyborg koala DJ in front of a turntable, in heavy raining futuristic tokyo rooftop cyberpunk night, sci-fi, fantasy, intricate neon light, soft light, smooth, sharp focus, illustration.
PYOCO (1024x1024)
PROMPT: A cyborg koala DJ in front of a turntable, in heavy raining futuristic tokyo rooftop cyberpunk night, sci-fi, fantasy, intricate neon light, soft light, smooth, sharp focus, illustration.
Ours (1280x720)
PROMPT: A tiger is eating grass, wildlife photography, photograph, high quality, wildlife, f1.8, soft focus, 8k, award-winning photograph.
PYOCO (1024x1024)
PROMPT: A tiger is eating grass, wildlife photography, photograph, high quality, wildlife, f1.8, soft focus, 8k, award-winning photograph.
Ours (Left) vs. SVD (Right)
Ours (Left) vs. VideoCrafter-1 (Right)
Ours (1280x720)
PROMPT: Zoom in video of a robot warrior, ultra realistic, concept art, intricate details, highly detailed, photorealistic, 8k, sharp focus, volumetric lighting unreal engine.
VideoCrafter-1 (512x320)
PROMPT: Zoom in video of a robot warrior, ultra realistic, concept art, intricate details, highly detailed, photorealistic, 8k, sharp focus, volumetric lighting unreal engine.
Ours (1280x720)
PROMPT: A jar filled with fire, 4K video, 3D rendered, well-rendered.
VideoCrafter-1 (512x320)
PROMPT: A jar filled with fire, 4K video, 3D rendered, well-rendered.
Ours (Left) vs. ImagenVideo (Right)
Ours (1280x720)
PROMPT: A teddy bear running in New York City.
ImagenVideo (1280x768)
PROMPT: A teddy bear running in New York City.
Ours (1280x720)
PROMPT: Shoveling snow.
ImagenVideo (1280x768)
PROMPT: Shoveling snow.
Gallery
A cat is eating carrot.
A cat is eating carrot.
A girl is floating with fish around.
A girl is floating with fish around.
A cat is drinking beer.
A cat is yawning.
A cat reading a book at the snow mountain.