Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models
Authors: Zhening Xing, Gereon Fox, Yanhong Zeng, Xingang Pan, Mohamed Elgharib, Christian Theobalt, Kai Chen † (†: corresponding author)
Key Features
- Uni-directional Temporal Attention with a Warmup Mechanism (see the sketch after this list)
- Multi-timestep KV-Cache for Temporal Attention during Inference
- Depth Prior for Better Structure Consistency
- Compatible with DreamBooth and LoRA for Various Styles
- TensorRT Supported
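To make the first two features more concrete, below is a minimal, hypothetical sketch of uni-directional temporal attention with a warmup window plus a per-denoising-timestep KV-cache. It is not the repository's implementation; the mask construction, the `MultiTimestepKVCache` class, and the `streaming_attention` helper are illustrative names, and the shapes are simplified to a single new frame per step.

```python
# Minimal sketch (NOT the repo's code) of uni-directional temporal attention
# with a warmup window, plus a per-denoising-timestep KV-cache for streaming.
import torch
import torch.nn.functional as F


def temporal_mask(num_frames: int, warmup: int) -> torch.Tensor:
    """Warmup frames attend to each other bidirectionally; every later frame
    attends only to the warmup frames and to earlier frames (causal)."""
    mask = torch.ones(num_frames, num_frames).tril().bool()
    mask[:warmup, :warmup] = True  # full attention inside the warmup window
    return mask


class MultiTimestepKVCache:
    """Stores keys/values per denoising timestep so that, at inference time,
    each new frame only computes its own K/V and attends to cached ones."""

    def __init__(self, max_frames: int):
        self.max_frames = max_frames
        self.cache = {}  # timestep -> (K, V), each shaped [heads, frames, dim]

    def append(self, t: int, k: torch.Tensor, v: torch.Tensor):
        if t not in self.cache:
            self.cache[t] = (k, v)
        else:
            K, V = self.cache[t]
            # Keep a sliding window of the most recent frames.
            K = torch.cat([K, k], dim=1)[:, -self.max_frames:]
            V = torch.cat([V, v], dim=1)[:, -self.max_frames:]
            self.cache[t] = (K, V)
        return self.cache[t]


def streaming_attention(q, k, v, cache: MultiTimestepKVCache, t: int):
    """q, k, v: [heads, 1, dim] for the newest frame at denoising timestep t.
    The newest frame attends to every frame already cached for this timestep,
    which realizes the uni-directional (past-only) attention pattern online."""
    K, V = cache.append(t, k, v)
    return F.scaled_dot_product_attention(q, K, V)
```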
The speed evaluation was conducted on Ubuntu 20.04.6 LTS with PyTorch 2.2.2, an RTX 4090 GPU, and an Intel(R) Xeon(R) Platinum 8352V CPU. The number of denoising steps is set to 2. (A minimal timing sketch follows the table below.)
| Resolution | TensorRT | FPS |
|---|---|---|
| 512 x 512 | On | 16.43 |
| 512 x 512 | Off | 6.91 |
| 768 x 512 | On | 12.15 |
| 768 x 512 | Off | 6.29 |
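For reference, throughput numbers like those above are typically obtained with a warm-up-then-time loop. The sketch below is a generic harness under that assumption, not the script used for the table; `pipe` is a placeholder for any per-frame stream translation callable.

```python
# Generic FPS-measurement sketch (hypothetical `pipe` callable, not the
# repository's benchmark script). Assumes the pipeline is already configured
# with 2 denoising steps and that each call translates a single frame.
import time
import torch


def measure_fps(pipe, frame, warmup_iters: int = 10, timed_iters: int = 100) -> float:
    for _ in range(warmup_iters):   # warm up CUDA kernels / TensorRT engines
        pipe(frame)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(timed_iters):
        pipe(frame)
    torch.cuda.synchronize()        # wait for all queued GPU work to finish
    return timed_iters / (time.perf_counter() - start)
```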
Real-Time Video2Video Demo
- Human Face (Web Camera Input)
- Anime Character (Screen Video Input)
Acknowledgements
The video and image demos in this GitHub repository were generated using LCM-LoRA. The stream batch technique from StreamDiffusion is used for model acceleration. The design of the video diffusion model is adopted from AnimateDiff. We use a third-party implementation of MiDaS that supports ONNX export. Our online demo is modified from Real-Time-Latent-Consistency-Model.
BibTeX
If you find our work helpful, please consider citing it:
@article{xing2024live2diff,
title={Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models},
author={Zhening Xing and Gereon Fox and Yanhong Zeng and Xingang Pan and Mohamed Elgharib and Christian Theobalt and Kai Chen},
journal={arXiv preprint arXiv:2407.08701},
year={2024}
}