PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Abstract
The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figures 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into the Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-α's training speed markedly surpasses existing large-scale T2I models; e.g., PIXART-α takes only 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing CO2 emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.
Community
Disclaimer: AI-generated summary:
Objective
The paper introduces PIXART-α, a Transformer-based text-to-image diffusion model that achieves near state-of-the-art image generation quality while significantly reducing training costs and CO2 emissions compared to other models.
The key contributions are: 1) Training strategy decomposition into pixel dependency learning, text-image alignment, and aesthetic enhancement stages; 2) An efficient T2I Transformer architecture incorporating cross-attention and optimized normalization; 3) Using an auto-labeling pipeline with LLaVA to create a high-information-density text-image dataset.
Implementation
The model is based on Diffusion Transformer (DiT) with additional cross-attention modules to inject text conditions.
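To make that concrete, here is a minimal PyTorch sketch of a DiT-style block extended with a cross-attention layer over text-encoder tokens. It only illustrates the idea and is not the official PixArt-α implementation; all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class TextConditionedDiTBlock(nn.Module):
    """Hypothetical DiT-style block with added cross-attention for text conditioning."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text_emb):
        # Self-attention over the image latent tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: latent tokens attend to text-encoder tokens
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]
        # Position-wise feed-forward
        return x + self.mlp(self.norm3(x))

block = TextConditionedDiTBlock(dim=256)
latents = torch.randn(2, 64, 256)   # (batch, image tokens, dim)
text = torch.randn(2, 32, 256)      # (batch, text tokens, dim)
out = block(latents, text)          # same shape as latents
```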
Training is divided into 3 main stages:
Stage 1: Learn pixel distributions using a class-conditional model pretrained on ImageNet.
Stage 2: Learn text-image alignment using high-information captions labeled by LLaVA.
Stage 3: Enhance image aesthetics using high-quality datasets.
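For orientation, the stage decomposition can be summarized as the rough schedule below. This is only a sketch of the summary above; step counts, exact data mixes, and per-stage resolutions are not specified here.

```python
# Illustrative summary of the three-stage decomposition; labels are placeholders,
# not the paper's actual configuration.
TRAINING_STAGES = [
    {"stage": 1, "objective": "pixel dependency",
     "data": "ImageNet (class-conditional)",
     "init": "pretrained class-conditional DiT weights"},
    {"stage": 2, "objective": "text-image alignment",
     "data": "SAM images with dense LLaVA pseudo-captions",
     "init": "stage 1 weights"},
    {"stage": 3, "objective": "aesthetic quality",
     "data": "curated high-quality text-image pairs",
     "init": "stage 2 weights"},
]

for s in TRAINING_STAGES:
    print(f"Stage {s['stage']}: optimize {s['objective']} on {s['data']} (init: {s['init']})")
```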
An auto-labeling pipeline with LLaVA is used to create dense, precise captions for the SAM dataset.
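A captioning loop in that spirit could look like the following sketch using the Hugging Face transformers LLaVA integration. The checkpoint and prompt wording here are illustrative assumptions, not necessarily the exact setup used for the SAM dataset in the paper.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed checkpoint; the paper's exact LLaVA variant may differ.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative prompt asking for a dense, detailed description.
prompt = "USER: <image>\nDescribe this image and its style in a very detailed manner. ASSISTANT:"

def caption(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    return processor.decode(out[0], skip_special_tokens=True)
```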
Efficiency optimizations like shared normalization parameters (adaLN-single) are incorporated.
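The adaLN-single idea can be sketched as follows: one shared MLP maps the timestep embedding to the six modulation vectors (shift, scale, and gate for the attention and MLP sub-layers), and each block keeps only a small learnable offset instead of its own full adaLN MLP. This is a simplified, hypothetical rendering of the design, not the released code.

```python
import torch
import torch.nn as nn

class AdaLNSingle(nn.Module):
    """Shared modulation MLP, computed once per denoising step and reused by all blocks."""

    def __init__(self, dim: int):
        super().__init__()
        self.shared_mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, t_emb):
        # Returns 6 tensors: shift/scale/gate for attention, shift/scale/gate for MLP
        return self.shared_mlp(t_emb).chunk(6, dim=-1)

class BlockModulation(nn.Module):
    """Per-block learnable offsets added to the shared modulation (cheaper than per-block MLPs)."""

    def __init__(self, dim: int):
        super().__init__()
        self.offsets = nn.Parameter(torch.zeros(6, dim))

    def forward(self, shared):
        return [s + o for s, o in zip(shared, self.offsets)]

ada = AdaLNSingle(dim=256)
mod = BlockModulation(dim=256)
t_emb = torch.randn(2, 256)          # timestep embedding
shift_scale_gate = mod(ada(t_emb))   # 6 modulation tensors for one block
```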
Training uses the AdamW optimizer with a learning rate of 2e-5 and batch sizes of 64-178 on 64 V100 GPUs.
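The reported optimizer setup corresponds to something like the snippet below; only the optimizer type and learning rate come from the summary above, and the weight-decay value is an assumption for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # placeholder for the PixArt-α transformer

# AdamW with a constant learning rate of 2e-5, as reported; weight decay is assumed.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.03)
```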
Insights
Decomposing the training strategy into distinct stages (pixel, alignment, aesthetic) significantly improves efficiency.
Using auto-labeled, high-information captions is crucial for fast text-image alignment learning.
Compatibility with pretrained class-conditional model weights provides a useful initialization.
Architectural optimizations like cross-attention modules and shared normalization parameters improve efficiency.
The model achieves near state-of-the-art quality with only 2% of the training cost of other models.
Results
PIXART-α achieves image generation quality competitive with state-of-the-art models while reducing training costs by 98% and CO2 emissions by 90%.
We need the code…
code, bro
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models (2023)
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models (2023)
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack (2023)
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models (2023)
- InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
Hi, the code is released at https://github.com/PixArt-alpha/PixArt-alpha
And the project page is https://pixart-alpha.github.io/
The model is amazing
I made a full tutorial
Also opened feature requests on the Automatic1111 SD Web UI, Kohya Trainer scripts, and OneTrainer
We really need more details about how to train it
My tutorial and auto installers cover Windows and RunPod / Linux, and support 8-bit Text Encoder loading and CPU offload.
This model is definitely better than SDXL
PixArt-α: Revolutionizing Text-to-Image Synthesis with Low Training Costs!
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/
Models citing this paper: 9
Datasets citing this paper: 0