Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution
Abstract
We propose an efficient diffusion-based tuning approach for text-to-video super-resolution (SR) that leverages the capacity already learned by a pixel-level image diffusion model to capture spatial information for video generation. To accomplish this, we design an efficient architecture by inflating the weights of the text-to-image SR model into our video generation framework. Additionally, we incorporate a temporal adapter to ensure temporal coherence across video frames. We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational cost and super-resolution quality. Empirical evaluation, both quantitative and qualitative, on the Shutterstock video dataset demonstrates that our approach can perform text-to-video SR generation with good visual quality and temporal consistency. To evaluate temporal coherence, we also present visualizations in video format at https://drive.google.com/drive/folders/1YVc-KMSJqOrEUdQWVaI-Yfu8Vsfu_1aO?usp=sharing .
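The abstract does not spell out how inflation or the temporal adapter is implemented, so the following is only a minimal NumPy sketch of one common reading of the idea, not the authors' method. It assumes that "inflating" means reusing an image layer's weights on every frame by folding the time axis into the batch axis, and that the adapter is a residual mixing along the frame axis. The zero initialisation of the adapter, the 1x1 convolution stand-in, and all function names (`spatial_layer`, `inflated_spatial_layer`, `temporal_adapter`) are illustrative assumptions.

```python
import numpy as np

def spatial_layer(x, weight):
    """Stand-in for a pretrained image layer: a 1x1 convolution on (N, C, H, W)."""
    return np.einsum("oc,nchw->nohw", weight, x)

def inflated_spatial_layer(video, weight):
    """Reuse the image weights on video (B, T, C, H, W) by folding T into the batch."""
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)
    out = spatial_layer(frames, weight)
    return out.reshape(b, t, -1, h, w)

def temporal_adapter(video, mix):
    """Residual mixing along the frame axis; mix has shape (T, T)."""
    return video + np.einsum("st,btchw->bschw", mix, video)

rng = np.random.default_rng(0)
video = rng.normal(size=(2, 4, 3, 8, 8))   # batch, frames, channels, height, width
weight = rng.normal(size=(5, 3))           # hypothetical pretrained image weights
mix = np.zeros((4, 4))                     # assumption: zero init, adapter starts as identity

out = temporal_adapter(inflated_spatial_layer(video, weight), mix)

# Reference: run the image layer on each frame independently.
per_frame = np.stack([spatial_layer(video[:, i], weight) for i in range(4)], axis=1)
```

Under these assumptions, the inflated model initially reproduces the image model frame by frame, and only the adapter weights (`mix` here) need to change for frames to exchange information, which is one way to see the trade-off between tuning cost and quality that the abstract mentions.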
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models (2023)
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution (2023)
- Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation (2023)
- Photorealistic Video Generation with Diffusion Models (2023)
- MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation (2023)