# AudioGen: Textually-guided audio generation

AudioCraft provides the code and a model re-implementing AudioGen, a [textually-guided audio generation][audiogen_arxiv]
model that performs text-to-sound generation.

The provided AudioGen reimplementation follows the LM model architecture introduced in [MusicGen][musicgen_arxiv]
and is a single-stage auto-regressive Transformer model trained over a 16kHz
<a href="https://github.com/facebookresearch/encodec">EnCodec tokenizer</a> with 4 codebooks sampled at 50 Hz.
This model variant reaches audio quality similar to that of the original implementation introduced in the AudioGen publication
while providing faster generation speed given the smaller frame rate.

**Important note:** The provided models are NOT the original models used to report numbers in the
[AudioGen publication][audiogen_arxiv]. Refer to the model card to learn more about architectural changes.

Listen to samples from the **original AudioGen implementation** in our [sample page][audiogen_samples].

## Model Card

See [the model card](../model_cards/AUDIOGEN_MODEL_CARD.md).

## Installation

Please follow the AudioCraft installation instructions from the [README](../README.md).

AudioCraft requires a GPU with at least 16 GB of memory to run inference with the medium-sized model (~1.5B parameters).

## API and usage

We provide a simple API and one pre-trained model for AudioGen:

`facebook/audiogen-medium`: 1.5B model, text to sound - [🤗 Hub](https://huggingface.co/facebook/audiogen-medium)

You can play with AudioGen by running the jupyter notebook at [`demos/audiogen_demo.ipynb`](../demos/audiogen_demo.ipynb) locally (if you have a GPU).

Below is a quick example of how to use the API:

```python
import torchaudio
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)  # generate 5 seconds.
descriptions = ['dog barking', 'siren of an emergency vehicle', 'footsteps in a corridor']
wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
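The API also supports continuing a recorded audio prompt. The snippet below is a minimal sketch assuming the `generate_continuation(prompt, prompt_sample_rate, descriptions)` helper exposed by AudioCraft generation models; the prompt path is a placeholder.

```python
import torchaudio
from audiocraft.models import AudioGen
from audiocraft.data.audio import audio_write

model = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=8)  # total output duration; must exceed the prompt length.

# Load a short audio prompt to continue from ('path/to/prompt.wav' is a placeholder).
prompt, prompt_sr = torchaudio.load('path/to/prompt.wav')
# Continue the prompt, conditioned on a text description.
wav = model.generate_continuation(prompt, prompt_sr, descriptions=['dog barking'])

audio_write('continuation', wav[0].cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```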
## Training

The [AudioGenSolver](../audiocraft/solvers/audiogen.py) implements AudioGen's training pipeline
used to develop the released model. Note that this may not fully reproduce the results presented in the paper.
Similarly to MusicGen, it defines an autoregressive language modeling task over multiple streams of
discrete tokens extracted from a pre-trained EnCodec model (see the [EnCodec documentation](./ENCODEC.md)
for more details on how to train such a model), with dataset-specific changes for environmental sound
processing.

Note that **we do NOT provide any of the datasets** used for training AudioGen.

### Example configurations and grids

We provide configurations to reproduce the released models and our research.
AudioGen solver configurations are available in [config/solver/audiogen](../config/solver/audiogen).
The base training configuration used for the released model is the following:
[`solver=audiogen/audiogen_base_16khz`](../config/solver/audiogen/audiogen_base_16khz.yaml)

Please find some example grids to train AudioGen at
[audiocraft/grids/audiogen](../audiocraft/grids/audiogen/).

```shell
# text-to-sound
dora grid audiogen.audiogen_base_16khz
```
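To launch a single training run instead of a full grid, you can point `dora run` at the base solver configuration. The command below is a sketch only: `dset=audio/my_sounds` is a placeholder for your own dataset configuration (no dataset is provided), and the `model/lm/model_scale` override follows the pattern used in the MusicGen training docs.

```shell
# single (non-grid) run on the base configuration; adjust the placeholder overrides
dora run solver=audiogen/audiogen_base_16khz dset=audio/my_sounds model/lm/model_scale=medium
```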
### Sound dataset and metadata

AudioGen's underlying dataset is an AudioDataset augmented with description metadata.
The AudioGen dataset implementation expects the metadata to be available as `.json` files
at the same location as the audio files, or through a specified external folder.
Learn more in the [datasets section](./DATASETS.md).

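For reference, the per-file metadata is a small JSON document stored next to the audio file (e.g. `dog_bark.json` next to `dog_bark.wav`). The sketch below is a minimal assumption about its content, with the `description` field carrying the text used for conditioning:

```json
{
  "description": "a dog barking repeatedly while birds chirp in the background"
}
```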
### Evaluation stage

By default, the evaluation stage computes only the cross-entropy and the perplexity over the
evaluation dataset, as the objective metrics used for evaluation can be costly to run
or require extra dependencies. Please refer to the [metrics documentation](./METRICS.md)
for more details on the requirements for each metric.

We provide an off-the-shelf configuration to enable running the objective metrics
for audio generation in
[config/solver/audiogen/evaluation/objective_eval](../config/solver/audiogen/evaluation/objective_eval.yaml).

Evaluation can then be activated as follows:

```shell
# using the configuration
dora run solver=audiogen/debug solver/audiogen/evaluation=objective_eval
# specifying each of the fields, e.g. to activate KL computation
dora run solver=audiogen/debug evaluate.metrics.kld=true
```
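Other objective metrics follow the same field-based pattern. The command below is a sketch only; the `fad` and `text_consistency` field names are assumptions, so check the [metrics documentation](./METRICS.md) for the exact fields and their extra dependencies:

```shell
# hypothetical example: enabling several objective metrics field by field
dora run solver=audiogen/debug evaluate.metrics.fad=true evaluate.metrics.kld=true evaluate.metrics.text_consistency=true
```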

See [an example evaluation grid](../audiocraft/grids/audiogen/audiogen_pretrained_16khz_eval.py).

### Generation stage

The generation stage allows generating samples conditionally and/or unconditionally and performing
audio continuation (from a prompt). We currently support greedy sampling (argmax), sampling
from the softmax with a given temperature, and top-K and top-P (nucleus) sampling. The number of samples
generated and the batch size used are controlled by the `dataset.generate` configuration,
while the other generation parameters are defined in `generate.lm`.

```shell
# control sampling parameters
dora run solver=audiogen/debug generate.lm.gen_duration=5 generate.lm.use_sampling=true generate.lm.top_k=15
```
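The number of generated samples and the batch size mentioned above are configured under `dataset.generate`. The field names in the sketch below (`num_samples`, `batch_size`) are assumptions based on the default dataset configuration and may need adjusting:

```shell
# hypothetical example: generate a few samples with nucleus (top-P) sampling
dora run solver=audiogen/debug dataset.generate.num_samples=4 dataset.generate.batch_size=4 generate.lm.use_sampling=true generate.lm.top_p=0.95
```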

## More information

Refer to [MusicGen's instructions](./MUSICGEN.md).

### Learn more

Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md).

## Citation

AudioGen
```
@article{kreuk2022audiogen,
  title={Audiogen: Textually guided audio generation},
  author={Kreuk, Felix and Synnaeve, Gabriel and Polyak, Adam and Singer, Uriel and D{\'e}fossez, Alexandre and Copet, Jade and Parikh, Devi and Taigman, Yaniv and Adi, Yossi},
  journal={arXiv preprint arXiv:2209.15352},
  year={2022}
}
```

MusicGen
```
@article{copet2023simple,
  title={Simple and Controllable Music Generation},
  author={Jade Copet and Felix Kreuk and Itai Gat and Tal Remez and David Kant and Gabriel Synnaeve and Yossi Adi and Alexandre Défossez},
  year={2023},
  journal={arXiv preprint arXiv:2306.05284},
}
```

## License

See license information in the [model card](../model_cards/AUDIOGEN_MODEL_CARD.md).

[audiogen_arxiv]: https://arxiv.org/abs/2209.15352
[musicgen_arxiv]: https://arxiv.org/abs/2306.05284
[audiogen_samples]: https://felixkreuk.github.io/audiogen/