CascadeV | An Implementation of the Würstchen Architecture for High-Resolution Video Generation
News
[2024.07.17] We release the code and pretrained weights of a DiT-based video VAE, which supports video reconstruction with a high compression factor (1x32x32=1024). The T2V model is still on the way.
Introduction
CascadeV is a video generation pipeline built upon the Würstchen architecture. By using a highly compressed latent representation, we can generate longer videos with higher resolution.
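For intuition about the compression factor quoted above, the shape arithmetic works out as follows (a minimal illustration; the 8x1024x1024 input shape matches the reconstruction example in Section 3.3):

```python
# Shape arithmetic behind the 1x32x32=1024 compression factor (illustrative).
frames, height, width = 8, 1024, 1024   # input video, as in Section 3.3
ft, fh, fw = 1, 32, 32                  # temporal / spatial downsampling factors

latent_shape = (frames // ft, height // fh, width // fw)
compression = ft * fh * fw

print(latent_shape)  # (8, 32, 32) -- the latent space used in the VAE comparison
print(compression)   # 1024 pixels per latent position
```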
Video VAE
Comparison of Our Cascade Approach with Other VAEs (on Latent Space of Shape 8x32x32)
Video Reconstruction: Original (left) vs. Reconstructed (right) | Click to view the videos
1. Model Architecture
1.1 DiT
We use PixArt-Σ as our base model with the following modifications:
- Replace the original VAE (of SDXL) with the one from Stable Video Diffusion.
- Use the semantic compressor from StableCascade to provide the low-resolution latent input.
- Remove the text encoder and all multi-head cross-attention layers, since we do not use text conditioning.
- Replace all 2D attention layers with 3D attention. We find that 3D attention outperforms 2+1D (i.e., alternating spatial and temporal attention), especially in temporal consistency; see the sketch below the figure.
Comparison of 2+1D Attention (left) vs. 3D Attention (right)
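To make the 2+1D vs. 3D distinction concrete, here is a minimal PyTorch sketch of how the token sequence fed to self-attention differs between the two. This is an illustration only, not the repository's implementation; shapes and module names are assumptions.

```python
import torch
import torch.nn as nn

def attn_3d(x: torch.Tensor, mha: nn.MultiheadAttention) -> torch.Tensor:
    """Full 3D attention: every token attends to all T*H*W tokens jointly."""
    B, T, H, W, C = x.shape
    tokens = x.reshape(B, T * H * W, C)        # one joint spatio-temporal sequence
    out, _ = mha(tokens, tokens, tokens)
    return out.reshape(B, T, H, W, C)

def attn_2plus1d(x: torch.Tensor, spatial: nn.MultiheadAttention,
                 temporal: nn.MultiheadAttention) -> torch.Tensor:
    """2+1D attention: spatial attention per frame, then temporal per location."""
    B, T, H, W, C = x.shape
    s = x.reshape(B * T, H * W, C)             # each frame attends within itself
    s, _ = spatial(s, s, s)
    t = s.reshape(B, T, H * W, C).transpose(1, 2).reshape(B * H * W, T, C)
    t, _ = temporal(t, t, t)                   # each location attends across time
    return t.reshape(B, H * W, T, C).transpose(1, 2).reshape(B, T, H, W, C)

# Toy usage on an 8x32x32 latent grid (modules need batch_first=True).
x = torch.randn(1, 8, 32, 32, 64)
mha = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
y = attn_3d(x, mha)                            # (1, 8, 32, 32, 64)
```

In the 3D case every attention call spans T·H·W tokens at once (8·32·32 = 8192 here), which is what helps temporal consistency but also drives up the cost, motivating the grid attention in Section 1.2.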
1.2 Grid Attention
Using 3D attention requires far more compute than 2D or 2+1D attention, especially at higher resolutions. As a compromise, we replace some 3D attention layers with alternating spatial and temporal grid attention.
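The exact grid scheme is defined by the code; as one plausible reading, spatial grid attention restricts each attention call to a small spatial window per frame, cutting the sequence length from H·W to g·g. A sketch under that assumption:

```python
import torch
import torch.nn as nn

def spatial_grid_attention(x: torch.Tensor, mha: nn.MultiheadAttention,
                           g: int = 8) -> torch.Tensor:
    """Attention within non-overlapping g x g spatial windows, per frame.

    Per-call sequence length drops from H*W to g*g; a temporal variant can
    analogously group tokens along T. The actual partitioning in this
    repository may differ -- this is an assumption for illustration.
    """
    B, T, H, W, C = x.shape
    w = x.reshape(B * T, H // g, g, W // g, g, C)
    w = w.permute(0, 1, 3, 2, 4, 5).reshape(-1, g * g, C)   # one window per row
    w, _ = mha(w, w, w)
    w = w.reshape(B * T, H // g, W // g, g, g, C)
    w = w.permute(0, 1, 3, 2, 4, 5).reshape(B, T, H, W, C)
    return w

x = torch.randn(1, 8, 32, 32, 64)
mha = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
y = spatial_grid_attention(x, mha)             # (1, 8, 32, 32, 64)
```

Alternating this with a temporal counterpart (grouping tokens along T instead) recovers global context over successive layers at a fraction of the full 3D cost.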
2. Evaluation
Dataset: We perform a quantitative comparison with other baselines on Inter4K, sampling its first 200 videos to build an evaluation set at 1024x1024 resolution and 30 FPS.
Metrics: We use PSNR, SSIM, and LPIPS to evaluate per-frame quality (i.e., the similarity between the original and reconstructed videos), and VBench to evaluate video quality independently of the reference.
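For reference, the per-frame metrics can be computed along these lines (a sketch using scikit-image and the `lpips` package, not our exact evaluation script; video-level scores are means over frames):

```python
import numpy as np
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")             # LPIPS expects tensors in [-1, 1]

def frame_metrics(orig: np.ndarray, recon: np.ndarray):
    """PSNR/SSIM/LPIPS for a pair of uint8 HxWx3 frames."""
    psnr = peak_signal_noise_ratio(orig, recon, data_range=255)
    ssim = structural_similarity(orig, recon, channel_axis=2, data_range=255)
    to_t = lambda f: torch.from_numpy(f).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_fn(to_t(orig), to_t(recon)).item()
    return psnr, ssim, lp
```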
2.1 PSNR/SSIM/LPIPS
Diffusion-based VAEs (such as StableCascade and our model) perform poorly on reconstruction metrics: they produce videos with more fine-grained detail, but that detail is less similar to the original.
| Model / Compression Factor | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Open-Sora-Plan v1.1 / 4x8x8=256 | 25.7282 | 0.8000 | 0.1030 |
| EasyAnimate v3 / 4x8x8=256 | 28.8666 | 0.8505 | 0.0818 |
| StableCascade / 1x32x32=1024 | 24.3336 | 0.6896 | 0.1395 |
| Ours / 1x32x32=1024 | 23.7320 | 0.6742 | 0.1786 |
2.2 VBench
Our approach achieves performance comparable to previous VAEs in both frame-wise and temporal quality, even with a much larger compression factor.
| Model / Compression Factor | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Imaging Quality | Aesthetic Quality |
|---|---|---|---|---|---|---|
| Open-Sora-Plan v1.1 / 4x8x8=256 | 0.9519 | 0.9618 | 0.9573 | 0.9789 | 0.6791 | 0.5450 |
| EasyAnimate v3 / 4x8x8=256 | 0.9578 | 0.9695 | 0.9615 | 0.9845 | 0.6735 | 0.5535 |
| StableCascade / 1x32x32=1024 | 0.9490 | 0.9517 | 0.9430 | 0.9639 | 0.6811 | 0.5675 |
| Ours / 1x32x32=1024 | 0.9601 | 0.9679 | 0.9626 | 0.9837 | 0.6747 | 0.5579 |
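A minimal sketch of scoring these dimensions with the `vbench` package (API names follow the VBench repository's README; treat the paths and dimension list as assumptions to check against your installed version):

```python
import torch
from vbench import VBench                      # pip install vbench

device = torch.device("cuda")
# Paths below are placeholders; VBench ships a full_info JSON with the package.
bench = VBench(device, "VBench_full_info.json", "evaluation_results/")
bench.evaluate(
    videos_path="reconstructions/",            # folder of reconstructed videos
    name="cascadev_recon",
    dimension_list=[
        "subject_consistency", "background_consistency", "temporal_flickering",
        "motion_smoothness", "imaging_quality", "aesthetic_quality",
    ],
)
```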
3. Usage
3.1 Installation
We recommend using Conda:
```bash
conda create -n cascadev python==3.9.0
conda activate cascadev
```
Install PixArt-Σ:
```bash
bash install.sh
```
3.2 Download Pretrained Weights
```bash
bash pretrained/download.sh
```
3.3 Video Reconstruction
A sample script for video reconstruction with a compression factor of 32:
```bash
bash recon.sh
```
Results of Video Reconstruction: w/o LDM (left) vs. w/ LDM (right)
It takes about one minute to reconstruct a video of shape 8x1024x1024 on a single NVIDIA A800.
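Conceptually, the two settings compared above differ in whether the DiT refines the decoded result. The toy sketch below mirrors only the data flow; every function is a random stand-in, not the repository's API:

```python
import torch
import torch.nn.functional as F

# Toy, runnable sketch of the cascade data flow. The real components are the
# StableCascade semantic compressor, the DiT denoiser, and the SVD VAE decoder;
# here they are stand-ins so only the shapes and control flow match.

def semantic_encode(video):                    # (B, C, T, 1024, 1024) -> (B, C, T, 32, 32)
    return F.avg_pool3d(video, kernel_size=(1, 32, 32))

def denoise_step(x, t, cond):                  # stand-in for one DiT denoising step
    return x - 0.1 * (x - F.interpolate(cond, size=x.shape[2:]))

def vae_decode(latent):                        # stand-in for the SVD VAE decoder (8x up)
    return F.interpolate(latent, scale_factor=(1, 8, 8))

def reconstruct(video, use_ldm=True, num_steps=4):
    low = semantic_encode(video)               # 32x-downsampled semantic latent
    if not use_ldm:                            # "w/o LDM": upsample the latent directly
        return F.interpolate(low, size=video.shape[2:])
    # "w/ LDM": denoise a finer latent conditioned on the semantic latent, then
    # decode it. The 128x128 latent assumes the SVD VAE's 8x spatial downsampling.
    x = torch.randn(video.shape[0], video.shape[1], video.shape[2], 128, 128)
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t, cond=low)
    return vae_decode(x)

video = torch.randn(1, 3, 8, 1024, 1024)       # 8 frames at 1024x1024
print(reconstruct(video).shape)                # torch.Size([1, 3, 8, 1024, 1024])
```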
3.4 Train VAE
- Replace "video_list" in configs/s1024.effn-f32.py with your own video datasets
- Then run
```bash
bash train_vae.sh
```
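The config is a Python file, so the first step amounts to pointing `video_list` at your data. A hypothetical excerpt (the expected value format is an assumption; check the shipped config):

```python
# Hypothetical excerpt of configs/s1024.effn-f32.py -- verify the expected
# format against the shipped config; the path below is a placeholder.
video_list = "data/train_videos.txt"           # your own video dataset list
```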
Acknowledgement
- PixArt-Σ: The main codebase we built upon.
- StableCascade: The Würstchen architecture we used.
- Stable Video Diffusion: Thanks for its amazing video VAE.