PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Abstract
The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO2 emissions. This paper introduces PIXART-α, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators (e.g., Imagen, SDXL, and even Midjourney), reaching near-commercial application standards. Additionally, it supports high-resolution image synthesis up to 1024px resolution with low training cost, as shown in Figures 1 and 2. To achieve this goal, three core designs are proposed: (1) Training strategy decomposition: We devise three distinct training steps that separately optimize pixel dependency, text-image alignment, and image aesthetic quality; (2) Efficient T2I Transformer: We incorporate cross-attention modules into the Diffusion Transformer (DiT) to inject text conditions and streamline the computation-intensive class-condition branch; (3) High-informative data: We emphasize the significance of concept density in text-image pairs and leverage a large Vision-Language model to auto-label dense pseudo-captions to assist text-image alignment learning. As a result, PIXART-α's training speed markedly surpasses existing large-scale T2I models; e.g., PIXART-α takes only 10.8% of Stable Diffusion v1.5's training time (675 vs. 6,250 A100 GPU days), saving nearly $300,000 ($26,000 vs. $320,000) and reducing CO2 emissions by 90%. Moreover, compared with a larger SOTA model, RAPHAEL, our training cost is merely 1%. Extensive experiments demonstrate that PIXART-α excels in image quality, artistry, and semantic control. We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.
Community
Disclaimer: AI-generated summary:
Objective
The paper introduces PIXART-α, a Transformer-based text-to-image diffusion model that achieves near state-of-the-art image generation quality while significantly reducing training costs and CO2 emissions compared to other models.
The key contributions are: 1) Training strategy decomposition into pixel dependency learning, text-image alignment, and aesthetic enhancement stages; 2) An efficient T2I Transformer architecture incorporating cross-attention and optimized normalization; 3) Using an auto-labeling pipeline with LLaVA to create a high-information-density text-image dataset.
Implementation
The model is based on Diffusion Transformer (DiT) with additional cross-attention modules to inject text conditions.
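To make that concrete, here is a minimal PyTorch sketch of a DiT-style block extended with a cross-attention layer over text-encoder tokens. It only illustrates the idea and is not the official PixArt-α implementation; all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class TextConditionedDiTBlock(nn.Module):
    """Hypothetical DiT-style block with added cross-attention for text conditioning."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text_emb):
        # Self-attention over the image latent tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: latent tokens attend to text-encoder tokens
        h = self.norm2(x)
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]
        # Position-wise feed-forward
        return x + self.mlp(self.norm3(x))

block = TextConditionedDiTBlock(dim=256)
latents = torch.randn(2, 64, 256)   # (batch, image tokens, dim)
text = torch.randn(2, 32, 256)      # (batch, text tokens, dim)
out = block(latents, text)          # same shape as latents
```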
Training is divided into 3 main stages:
Stage 1: Learn pixel distributions using a class-conditional model pretrained on ImageNet.
Stage 2: Learn text-image alignment using high-information captions labeled by LLaVA.
Stage 3: Enhance image aesthetics using high-quality datasets.
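For orientation, the stage decomposition can be summarized as the rough schedule below. This is only a sketch of the summary above; step counts, exact data mixes, and per-stage resolutions are not specified here.

```python
# Illustrative summary of the three-stage decomposition; labels are placeholders,
# not the paper's actual configuration.
TRAINING_STAGES = [
    {"stage": 1, "objective": "pixel dependency",
     "data": "ImageNet (class-conditional)",
     "init": "pretrained class-conditional DiT weights"},
    {"stage": 2, "objective": "text-image alignment",
     "data": "SAM images with dense LLaVA pseudo-captions",
     "init": "stage 1 weights"},
    {"stage": 3, "objective": "aesthetic quality",
     "data": "curated high-quality text-image pairs",
     "init": "stage 2 weights"},
]

for s in TRAINING_STAGES:
    print(f"Stage {s['stage']}: optimize {s['objective']} on {s['data']} (init: {s['init']})")
```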
An auto-labeling pipeline with LLaVA is used to create dense, precise captions for the SAM dataset.
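A captioning loop in that spirit could look like the following sketch using the Hugging Face transformers LLaVA integration. The checkpoint and prompt wording here are illustrative assumptions, not necessarily the exact setup used for the SAM dataset in the paper.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumed checkpoint; the paper's exact LLaVA variant may differ.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative prompt asking for a dense, detailed description.
prompt = "USER: <image>\nDescribe this image and its style in a very detailed manner. ASSISTANT:"

def caption(image_path: str) -> str:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    return processor.decode(out[0], skip_special_tokens=True)
```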
Efficiency optimizations like shared normalization parameters (adaLN-single) are incorporated.
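The adaLN-single idea can be sketched as follows: one shared MLP maps the timestep embedding to the six modulation vectors (shift, scale, and gate for the attention and MLP sub-layers), and each block keeps only a small learnable offset instead of its own full adaLN MLP. This is a simplified, hypothetical rendering of the design, not the released code.

```python
import torch
import torch.nn as nn

class AdaLNSingle(nn.Module):
    """Shared modulation MLP, computed once per denoising step and reused by all blocks."""

    def __init__(self, dim: int):
        super().__init__()
        self.shared_mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, t_emb):
        # Returns 6 tensors: shift/scale/gate for attention, shift/scale/gate for MLP
        return self.shared_mlp(t_emb).chunk(6, dim=-1)

class BlockModulation(nn.Module):
    """Per-block learnable offsets added to the shared modulation (cheaper than per-block MLPs)."""

    def __init__(self, dim: int):
        super().__init__()
        self.offsets = nn.Parameter(torch.zeros(6, dim))

    def forward(self, shared):
        return [s + o for s, o in zip(shared, self.offsets)]

ada = AdaLNSingle(dim=256)
mod = BlockModulation(dim=256)
t_emb = torch.randn(2, 256)          # timestep embedding
shift_scale_gate = mod(ada(t_emb))   # 6 modulation tensors for one block
```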
Training uses the AdamW optimizer with a learning rate of 2e-5 and batch sizes of 64-178 on 64 V100 GPUs.
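The reported optimizer setup corresponds to something like the snippet below; only the optimizer type and learning rate come from the summary above, and the weight-decay value is an assumption for illustration.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # placeholder for the PixArt-α transformer

# AdamW with a constant learning rate of 2e-5, as reported; weight decay is assumed.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.03)
```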
Insights
Decomposing the training strategy into distinct stages (pixel, alignment, aesthetic) significantly improves efficiency.
Using auto-labeled, high-information captions is crucial for fast text-image alignment learning.
Compatibility with pretrained class-conditional model weights provides a useful initialization.
Architectural optimizations like cross-attention modules and shared normalization parameters improve efficiency.
The model achieves near state-of-the-art quality with only 2% of the training cost of other models.
Results
PIXART-α achieves image generation quality competitive with state-of-the-art models while reducing training costs by 98% and CO2 emissions by 90%.
We need the code…
code, bro
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models (2023)
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models (2023)
- Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack (2023)
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models (2023)
- InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
Hi, the code is released at https://github.com/PixArt-alpha/PixArt-alpha
And the project page is https://pixart-alpha.github.io/
The model is amazing
I made a full tutorial
Also opened feature requests on the Automatic1111 SD Web UI, Kohya Trainer scripts, and OneTrainer
We really need more details about how to train it
My tutorial and auto installers cover Windows and RunPod / Linux, and support 8-bit Text Encoder loading and CPU offload.
This model is definitely better than SDXL
PixArt-α: Revolutionizing Text-to-Image Synthesis with Low Training Costs!
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/
Models citing this paper: 9
Datasets citing this paper: 0