AudioLDM 2
AudioLDM 2 was proposed in AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.
Inspired by Stable Diffusion, AudioLDM 2 is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from text embeddings. Two text encoder models are used to compute the text embeddings from a prompt input: the text-branch of CLAP and the encoder of Flan-T5. These text embeddings are then projected to a shared embedding space by an AudioLDM2ProjectionModel. A GPT2 language model (LM) is used to auto-regressively predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The UNet of AudioLDM 2 is unique in the sense that it takes two cross-attention embeddings, as opposed to one cross-attention conditioning, as in most other LDMs.
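The split into two text encoders, a projection model, a GPT2 language model and a dedicated UNet can be inspected directly on the pipeline object. A minimal sketch, assuming the base checkpoint is hosted on the Hub as `cvssp/audioldm2` and that the component attribute names match recent diffusers releases:

```python
from diffusers import AudioLDM2Pipeline

# Load the base text-to-audio checkpoint (repo id assumed to be "cvssp/audioldm2").
pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")

# The two text encoders: the CLAP model (only its text branch is used) and the Flan-T5 encoder.
print(type(pipe.text_encoder).__name__)    # e.g. ClapModel
print(type(pipe.text_encoder_2).__name__)  # e.g. T5EncoderModel

# Projection into the shared embedding space, and the GPT2 LM that autoregressively
# predicts the new embedding vectors used as cross-attention conditioning.
print(type(pipe.projection_model).__name__)  # AudioLDM2ProjectionModel
print(type(pipe.language_model).__name__)    # e.g. GPT2Model

# The UNet variant that accepts two cross-attention embeddings.
print(type(pipe.unet).__name__)  # AudioLDM2UNet2DConditionModel
```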
The abstract of the paper is the following:
Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at this https URL.
This pipeline was contributed by sanchit-gandhi and Nguyễn Công Tú Anh. The original codebase can be found at haoheliu/audioldm2.
Tips
Choosing a checkpoint
AudioLDM2 comes in several variants. Two checkpoints are applicable to the general task of text-to-audio generation, a third checkpoint is trained exclusively on text-to-music generation, and two further checkpoints target text-to-speech generation.
All checkpoints share the same model size for the text encoders and VAE. They differ in the size and depth of the UNet. See the table below for details on each checkpoint:
| Checkpoint | Task | UNet Model Size | Total Model Size | Training Data (hours) |
|---|---|---|---|---|
| audioldm2 | Text-to-audio | 350M | 1.1B | 1150k |
| audioldm2-large | Text-to-audio | 750M | 1.5B | 1150k |
| audioldm2-music | Text-to-music | 350M | 1.1B | 665k |
| audioldm2-gigaspeech | Text-to-speech | 350M | 1.1B | 10k |
| audioldm2-ljspeech | Text-to-speech | 350M | 1.1B | |
Constructing a prompt
- Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g. "water stream in a forest" instead of "stream").
- It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with.
- Using a negative prompt can significantly improve the quality of the generated waveform, by guiding the generation away from terms that correspond to poor quality audio. Try using a negative prompt of "Low quality."
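A minimal sketch putting these prompting tips together (the `cvssp/audioldm2` repo id, the availability of a CUDA device, and the 16 kHz output sample rate are assumptions about the base text-to-audio checkpoint):

```python
import torch
from scipy.io import wavfile
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Descriptive, context-specific prompt with quality adjectives, plus a negative prompt.
prompt = "The sound of a dog barking in a park, clear and high quality"
negative_prompt = "Low quality."

audio = pipe(prompt, negative_prompt=negative_prompt).audios[0]

# The vocoder outputs waveforms at 16 kHz.
wavfile.write("dog_bark.wav", rate=16000, data=audio)
```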
Controlling inference
- The quality of the predicted audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
- The length of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
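A short sketch of both arguments, continuing from the pipeline loaded in the previous snippet:

```python
# Reusing the `pipe` object from the previous snippet.
audio = pipe(
    "Water stream in a forest, high quality",
    negative_prompt="Low quality.",
    num_inference_steps=200,   # more denoising steps -> higher quality, slower inference
    audio_length_in_s=10.0,    # length of the generated clip in seconds
).audios[0]

print(audio.shape)  # roughly 10.0 s x 16,000 samples/s = 160,000 samples
```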
Evaluating generated waveforms
- The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation.
- Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring will be performed between the generated waveforms and the prompt text, and the audios ranked from best to worst accordingly.
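A sketch of both points, continuing from the pipeline loaded above and assuming the ranked candidates are returned best-first:

```python
import torch

# Fix the seed so a satisfactory generation can be reproduced later.
generator = torch.Generator("cuda").manual_seed(0)

result = pipe(
    "Techno music with a strong, upbeat tempo and high melodic riffs",
    negative_prompt="Low quality.",
    num_waveforms_per_prompt=3,  # generate three candidate waveforms in one call
    generator=generator,
)

# Candidates are ranked against the prompt from best to worst, so take the first one.
best_audio = result.audios[0]
```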
The aforementioned tips apply to music and speech generation alike. The following sketch puts them together for the music case.
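A compact sketch, assuming the music checkpoint is hosted on the Hub as `cvssp/audioldm2-music` and that a CUDA device is available:

```python
import torch
from scipy.io import wavfile
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2-music", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Descriptive, context-specific prompt and a "Low quality." negative prompt.
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
negative_prompt = "Low quality."

# Fixed seed, several candidate waveforms, more denoising steps and an explicit clip length.
generator = torch.Generator("cuda").manual_seed(0)
audio = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,
    generator=generator,
).audios[0]  # best-ranked candidate

wavfile.write("techno.wav", rate=16000, data=audio)
```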
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
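For instance, a different scheduler can be swapped in with the generic diffusers pattern; DPMSolverMultistepScheduler is used below purely as an illustration, not as a recommendation specific to AudioLDM 2:

```python
from diffusers import AudioLDM2Pipeline, DPMSolverMultistepScheduler

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2")

# Rebuild a different scheduler from the existing scheduler's config.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
```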
AudioLDM2Pipeline
[[autodoc]] AudioLDM2Pipeline
	- all
	- __call__
AudioLDM2ProjectionModel
[[autodoc]] AudioLDM2ProjectionModel
	- forward
AudioLDM2UNet2DConditionModel
[[autodoc]] AudioLDM2UNet2DConditionModel
	- forward
AudioPipelineOutput
[[autodoc]] pipelines.AudioPipelineOutput