|
# MAGNeT: Masked Audio Generation using a Single Non-Autoregressive Transformer
|
|
|
|
AudioCraft provides the code and models for MAGNeT, [Masked Audio Generation using a Single Non-Autoregressive Transformer][arxiv].
|
|
|
|
MAGNeT is a text-to-music and text-to-sound model capable of generating high-quality audio samples conditioned on text descriptions.
It is a masked generative non-autoregressive Transformer trained over a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz.
Unlike prior work on masked generative audio Transformers, such as [SoundStorm](https://arxiv.org/abs/2305.09636) and [VampNet](https://arxiv.org/abs/2307.04686),
MAGNeT requires neither semantic token conditioning, model cascading, nor audio prompting, and performs the full text-to-audio generation with a single non-autoregressive Transformer.
|
|
|
|
Check out our [sample page][magnet_samples] or test the available demo!
|
|
|
|
We use 16K hours of licensed music to train MAGNeT. Specifically, we rely on an internal dataset
|
|
of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data.
|
|
|
|
|
|
## Model Card
|
|
|
|
See [the model card](../model_cards/MAGNET_MODEL_CARD.md).
|
|
|
|
|
|
## Installation
|
|
|
|
Please follow the AudioCraft installation instructions from the [README](../README.md).
|
|
|
|
AudioCraft requires a GPU with at least 16 GB of memory for running inference with the medium-sized models (~1.5B parameters).
|
|
|
|
## Usage
|
|
|
|
We currently offer two ways to interact with MAGNeT:
1. You can use the Gradio demo locally by running [`python -m demos.magnet_app --share`](../demos/magnet_app.py).
2. You can play with MAGNeT by running the Jupyter notebook at [`demos/magnet_demo.ipynb`](../demos/magnet_demo.ipynb) locally (if you have a GPU).
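For reference, the corresponding shell commands are (assuming a local checkout of the repository and, for the notebook, a working Jupyter installation):

```shell
# Launch the local Gradio demo; --share additionally exposes a shareable public link.
python -m demos.magnet_app --share

# Open the demo notebook (a GPU is needed to actually generate audio).
jupyter notebook demos/magnet_demo.ipynb
```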
|
|
|
|
## API
|
|
|
|
We provide a simple API and 6 pre-trained models. The pre-trained models are:
|
|
- `facebook/magnet-small-10secs`: 300M model, text to music, generates 10-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-small-10secs)
|
|
- `facebook/magnet-medium-10secs`: 1.5B model, text to music, generates 10-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-medium-10secs)
|
|
- `facebook/magnet-small-30secs`: 300M model, text to music, generates 30-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-small-30secs)
|
|
- `facebook/magnet-medium-30secs`: 1.5B model, text to music, generates 30-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-medium-30secs)
|
|
- `facebook/audio-magnet-small`: 300M model, text to sound-effect - [🤗 Hub](https://huggingface.co/facebook/audio-magnet-small)
|
|
- `facebook/audio-magnet-medium`: 1.5B model, text to sound-effect - [🤗 Hub](https://huggingface.co/facebook/audio-magnet-medium)
|
|
|
|
In order to use MAGNeT locally **you must have a GPU**. We recommend 16GB of memory, especially for
the medium-sized models.

Below is a quick example of how to use the API.
|
|
|
|
```python
import torchaudio
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

model = MAGNeT.get_pretrained('facebook/magnet-small-10secs')
descriptions = ['disco beat', 'energetic EDM', 'funky groove']
wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
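Generation can also be tuned before calling `generate`. The sketch below follows the parameters exposed by the MAGNeT model class and demo notebook; treat the exact argument names and defaults as assumptions and verify them against `audiocraft/models/magnet.py`:

```python
from audiocraft.models import MAGNeT

model = MAGNeT.get_pretrained('facebook/magnet-small-10secs')
# Illustrative values only -- check audiocraft/models/magnet.py for the exact signature and defaults.
model.set_generation_params(
    use_sampling=True,
    top_k=0,
    top_p=0.9,
    temperature=3.0,
    max_cfg_coef=10.0,                # classifier-free guidance weight at the first (heavily masked) iterations
    min_cfg_coef=1.0,                 # classifier-free guidance weight at the final iterations
    decoding_steps=[20, 10, 10, 10],  # iterative decoding steps for each of the 4 codebook levels
)
wav = model.generate(['happy rock'])  # tensor of shape [batch, channels, time] at model.sample_rate
```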
|
|
|
|
## 🤗 Transformers Usage
|
|
|
|
Coming soon...
|
|
|
|
## Training
|
|
|
|
The [MagnetSolver](../audiocraft/solvers/magnet.py) implements MAGNeT's training pipeline.
It defines a masked generation task over multiple streams of discrete tokens
extracted from a pre-trained EnCodec model (see the [EnCodec documentation](./ENCODEC.md)
for more details on how to train such a model).
|
|
|
|
Note that **we do NOT provide any of the datasets** used for training MAGNeT.
|
|
We provide a dummy dataset containing just a few examples for illustrative purposes.
|
|
|
|
Please first read the [TRAINING documentation](./TRAINING.md), in particular the Environment Setup section.
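Before diving into the grids below, a single run can serve as a smoke test. This is only a sketch: the `dset=audio/example` override is an assumption referring to the dummy example dataset configuration shipped with AudioCraft, and should be adapted to your own dataset definition.

```shell
# Hypothetical smoke test: small text-to-music MAGNeT on the bundled example dataset.
dora run solver=magnet/magnet_32khz model/lm/model_scale=small conditioner=text2music dset=audio/example
```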
|
|
|
|
|
|
### Example configurations and grids
|
|
|
|
We provide configurations to reproduce the released models and our research.
MAGNeT solver configurations are available in [config/solver/magnet](../config/solver/magnet),
in particular:
* MAGNeT model for text-to-music: [`solver=magnet/magnet_32khz`](../config/solver/magnet/magnet_32khz.yaml)
* MAGNeT model for text-to-sound: [`solver=magnet/audio_magnet_16khz`](../config/solver/magnet/audio_magnet_16khz.yaml)
|
|
|
|
We provide 3 different model scales: `model/lm/model_scale=small` (300M), `medium` (1.5B), and `large` (3.3B).
|
|
|
|
Please find some example grids to train MAGNeT at
|
|
[audiocraft/grids/magnet](../audiocraft/grids/magnet/).
|
|
|
|
```shell
# text-to-music
dora grid magnet.magnet_32khz --dry_run --init

# text-to-sound
dora grid magnet.audio_magnet_16khz --dry_run --init

# Remove the `--dry_run --init` flags to actually schedule the jobs once everything is setup.
```
|
|
|
|
### Dataset and metadata
|
|
Learn more in the [datasets section](./DATASETS.md).
|
|
|
|
#### Music Models
|
|
MAGNeT's underlying dataset is an AudioDataset augmented with music-specific metadata.
|
|
The MAGNeT dataset implementation expects the metadata to be available as `.json` files
|
|
at the same location as the audio files.
|
|
|
|
#### Sound Models
|
|
Audio-MAGNeT's underlying dataset is an AudioDataset augmented with description metadata.
The Audio-MAGNeT dataset implementation expects the metadata to be available as `.json` files
at the same location as the audio files, or through a specified external folder.
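As an illustrative sketch of such a sidecar file for either the music or sound datasets (only `description` is shown as the essential field; the other keys are assumptions, see the [datasets section](./DATASETS.md) for the exact schema):

```python
import json

# Hypothetical example: dataset/clips/dog_bark.wav gets a sibling dataset/clips/dog_bark.json.
meta = {
    "description": "a dog barking twice in the distance",  # text used for conditioning
    "duration": 4.2,                                       # assumed field, in seconds
    "sample_rate": 16000,                                  # assumed field
}
with open("dataset/clips/dog_bark.json", "w") as f:
    json.dump(meta, f)
```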
|
|
|
|
### Audio tokenizers
|
|
|
|
See [MusicGen](./MUSICGEN.md)
|
|
|
|
### Fine tuning existing models
|
|
|
|
You can initialize your model from one of the pretrained models by using the `continue_from` argument, in particular
|
|
|
|
```bash
# Using a pretrained MAGNeT model.
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium continue_from=//pretrained/facebook/magnet-medium-10secs conditioner=text2music

# Using another model you already trained, with a Dora signature SIG.
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium continue_from=//sig/SIG conditioner=text2music

# Or manually providing a path.
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium continue_from=/checkpoints/my_other_xp/checkpoint.th
```
|
|
|
|
**Warning:** You are responsible for selecting the other parameters accordingly, in a way that makes them compatible
with the model you are fine-tuning. Configuration is NOT automatically inherited from the model you continue from. In particular, make sure to select the proper `conditioner` and `model/lm/model_scale`.
|
|
|
|
**Warning:** We currently do not support fine-tuning a model with slightly different layers. If you decide
to change some parts, like the conditioning or other parts of the model, you are responsible for manually crafting a checkpoint file from which we can safely run `load_state_dict`.
If you decide to do so, make sure your checkpoint is saved with `torch.save` and contains a dict
`{'best_state': {'model': model_state_dict_here}}`. Directly give the path to `continue_from` without a `//pretrained/` prefix.
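For instance, a minimal hand-crafted checkpoint matching that structure could be produced along these lines (`my_modified_lm` is a hypothetical stand-in for the model whose weights you prepared):

```python
import torch
from torch import nn

# Hypothetical stand-in for the (modified) LM you actually want to fine-tune from.
my_modified_lm = nn.Linear(8, 8)

# The checkpoint only needs to expose {'best_state': {'model': <state_dict>}}.
state = {'best_state': {'model': my_modified_lm.state_dict()}}
torch.save(state, 'my_crafted_checkpoint.th')
# Then pass the raw path, e.g. continue_from=/path/to/my_crafted_checkpoint.th (no //pretrained/ prefix).
```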
|
|
|
|
### Evaluation stage
|
|
For the 6 pretrained MAGNeT models, the objective metrics can be reproduced using the following grids:
|
|
|
|
```shell
# text-to-music
REGEN=1 dora grid magnet.magnet_pretrained_32khz_eval --dry_run --init

# text-to-sound
REGEN=1 dora grid magnet.audio_magnet_pretrained_16khz_eval --dry_run --init

# Remove the `--dry_run --init` flags to actually schedule the jobs once everything is setup.
```
|
|
|
|
See [MusicGen](./MUSICGEN.md) for more details.
|
|
|
|
### Generation stage
|
|
|
|
See [MusicGen](./MUSICGEN.md)
|
|
|
|
### Playing with the model
|
|
|
|
Once you have launched some experiments, you can easily get access
|
|
to the Solver with the latest trained model using the following snippet.
|
|
|
|
```python
from audiocraft.solvers.magnet import MagnetSolver

solver = MagnetSolver.get_eval_solver_from_sig('SIG', device='cpu', batch_size=8)
solver.model        # the trained model loaded by the solver.
solver.dataloaders  # the dataloaders defined by the solver configuration.
```
|
|
|
|
### Importing / Exporting models
|
|
|
|
We currently do not support loading a model from the Hugging Face implementation or exporting to it.
If you want to export your model in a way that is compatible with the `audiocraft.models.MAGNeT`
API, you can run:
|
|
|
|
```python
from audiocraft.utils import export
from audiocraft import train

xp = train.main.get_xp_from_sig('SIG_OF_LM')
export.export_lm(xp.folder / 'checkpoint.th', '/checkpoints/my_audio_lm/state_dict.bin')
# You also need to bundle the EnCodec model you used!
## Case 1) you trained your own
xp_encodec = train.main.get_xp_from_sig('SIG_OF_ENCODEC')
export.export_encodec(xp_encodec.folder / 'checkpoint.th', '/checkpoints/my_audio_lm/compression_state_dict.bin')
## Case 2) you used a pretrained model. Give the name you used without the //pretrained/ prefix.
## This will not dump the actual model, simply a pointer to the right model to download.
export.export_pretrained_compression_model('facebook/encodec_32khz', '/checkpoints/my_audio_lm/compression_state_dict.bin')
```
|
|
|
|
Now you can load your custom model with:
|
|
```python
import audiocraft.models

magnet = audiocraft.models.MAGNeT.get_pretrained('/checkpoints/my_audio_lm/')
```
|
|
|
|
|
|
### Learn more
|
|
|
|
Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md).
|
|
|
|
## FAQ
|
|
|
|
#### What are top-k, top-p, temperature and classifier-free guidance?
|
|
|
|
Check out [@FurkanGozukara tutorial](https://github.com/FurkanGozukara/Stable-Diffusion/blob/main/Tutorials/AI-Music-Generation-Audiocraft-Tutorial.md#more-info-about-top-k-top-p-temperature-and-classifier-free-guidance-from-chatgpt).
|
|
|
|
#### Should I use FSDP or autocast?

The two are mutually exclusive (because FSDP does autocast on its own).
You can use autocast up to 1.5B (medium), if you have enough memory on your GPU.
FSDP makes everything more complex but will free up some memory for the actual
activations by sharding the optimizer state.
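As a hedged sketch of what this looks like in practice (the `autocast` and `fsdp.use` override names are assumptions about AudioCraft's training configuration; verify them in the [TRAINING documentation](./TRAINING.md) before relying on them):

```shell
# Assumed override names -- double-check the training configuration and TRAINING.md.
# Medium (1.5B) model with mixed precision through autocast:
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium autocast=true fsdp.use=false

# Large (3.3B) model with FSDP sharding instead of autocast:
dora run solver=magnet/magnet_32khz model/lm/model_scale=large autocast=false fsdp.use=true
```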
|
|
|
|
## Citation
|
|
```
@misc{ziv2024masked,
      title={Masked Audio Generation using a Single Non-Autoregressive Transformer},
      author={Alon Ziv and Itai Gat and Gael Le Lan and Tal Remez and Felix Kreuk and Alexandre Défossez and Jade Copet and Gabriel Synnaeve and Yossi Adi},
      year={2024},
      eprint={2401.04577},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```
|
|
|
|
## License
|
|
|
|
See license information in the [model card](../model_cards/MAGNET_MODEL_CARD.md).
|
|
|
|
[arxiv]: https://arxiv.org/abs/2401.04577
|
|
[magnet_samples]: https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT/
|
|
|