# EnCodec: High Fidelity Neural Audio Compression

AudioCraft provides the training code for EnCodec, a state-of-the-art deep learning
based audio codec supporting both mono and stereo audio, presented in the
[High Fidelity Neural Audio Compression][arxiv] paper.
Check out our [sample page][encodec_samples].

## Original EnCodec models

The EnCodec models presented in High Fidelity Neural Audio Compression can be accessed
and used with the [EnCodec repository](https://github.com/facebookresearch/encodec).

**Note**: We do not guarantee compatibility between the AudioCraft and EnCodec codebases
and released checkpoints at this stage.

## Installation

Please follow the AudioCraft installation instructions from the [README](../README.md).

## Training

The [CompressionSolver](../audiocraft/solvers/compression.py) implements the audio reconstruction
task to train an EnCodec model. Specifically, it trains an encoder-decoder with a quantization
bottleneck - a SEANet encoder-decoder with a Residual Vector Quantization bottleneck for EnCodec -
using a combination of objective reconstruction losses and perceptual losses in the form
of discriminators.

The default configuration matches a causal EnCodec training at a single bandwidth.
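
For reference, training with this default configuration can also be launched as a single
run rather than a full grid:

```shell
# single run with the default causal, single-bandwidth configuration
dora run solver=compression/encodec_base_24khz
```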

### Example configuration and grids

We provide sample configuration and grids for training EnCodec models.

The compression configurations are defined in
[config/solver/compression](../config/solver/compression).

The example grids are available at
[audiocraft/grids/compression](../audiocraft/grids/compression).

```shell
# base causal encodec on monophonic audio sampled at 24 kHz
dora grid compression.encodec_base_24khz
# encodec model used for MusicGen on monophonic audio sampled at 32 kHz
dora grid compression.encodec_musicgen_32khz
```

### Training and validation stages

The model is trained using a combination of objective and perceptual losses.
More specifically, EnCodec is trained with the MS-STFT discriminator along with
objective losses, relying on a loss balancer to weight the different losses in
an intuitive manner.
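
The balancer renormalizes the gradient each loss contributes with respect to the decoder
output, so the configured weights directly control each loss's share of the total
gradient. A conceptual sketch of this idea (not the actual `audiocraft` balancer API):

```python
import torch

def balanced_backward(losses: dict, weights: dict, model_output: torch.Tensor):
    """Backpropagate a weighted sum of per-loss gradients, each renormalized
    so that `weights` sets its exact share of the gradient on `model_output`."""
    total = sum(weights[name] for name in losses)
    grad = torch.zeros_like(model_output)
    for name, loss in losses.items():
        # Gradient of this loss with respect to the decoder output only.
        (g,) = torch.autograd.grad(loss, [model_output], retain_graph=True)
        grad += weights[name] / total * g / (g.norm() + 1e-12)
    # Propagate the balanced gradient through the rest of the model.
    model_output.backward(grad)
```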

### Evaluation stage

Evaluation metrics for audio generation:

* SI-SNR: Scale-Invariant Signal-to-Noise Ratio.
* ViSQOL: Virtual Speech Quality Objective Listener.
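
SI-SNR compares the reconstruction to the reference waveform after optimally rescaling
the target component. A minimal sketch of the standard definition (the solver ships its
own implementation, which may differ in details):

```python
import torch

def si_snr(ref: torch.Tensor, est: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-Invariant SNR in dB, computed over the last (time) axis."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference to isolate the target component.
    dot = (est * ref).sum(dim=-1, keepdim=True)
    target = dot * ref / (ref.pow(2).sum(dim=-1, keepdim=True) + eps)
    noise = est - target
    return 10 * torch.log10(
        (target.pow(2).sum(dim=-1) + eps) / (noise.pow(2).sum(dim=-1) + eps))
```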

Note: the path to the ViSQOL binary (compiled with Bazel) needs to be provided in
order to run the ViSQOL metric on the reference and degraded signals.
The metric is disabled by default.
Please refer to the [metrics documentation](../METRICS.md) to learn more.
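
As a sketch, enabling ViSQOL for the evaluation stage could look like the following;
the override names below are assumptions, so check the metrics documentation and the
solver configuration for the exact keys:

```shell
# assumed override names -- verify against ../METRICS.md and config/solver/compression
dora run solver=compression/encodec_base_24khz evaluate.metrics.visqol=true metrics.visqol.bin=/path/to/visqol
```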

### Generation stage

The generation stage consists of generating the reconstructed audio from samples
with the current model. The number of samples generated and the batch size used are
controlled by the `dataset.generate` configuration. The output path and audio formats
are defined in the generate stage configuration.

```shell
# generate samples every 5 epochs
dora run solver=compression/encodec_base_24khz generate.every=5
# write the generated samples to a different path inside the dora XP folder
dora run solver=compression/encodec_base_24khz generate.path=<PATH_IN_DORA_XP_FOLDER>
# limit the number of samples or use a different batch size
dora run solver=compression/encodec_base_24khz dataset.generate.num_samples=10 dataset.generate.batch_size=4
```

### Playing with the model

Once you have a model trained, it is possible to get the entire solver, or just
the trained model, with the following functions:

```python
from audiocraft.solvers import CompressionSolver

# If you trained a custom model with signature SIG.
model = CompressionSolver.model_from_checkpoint('//sig/SIG')
# If you want to get one of the pretrained models with the `//pretrained/` prefix.
model = CompressionSolver.model_from_checkpoint('//pretrained/facebook/encodec_32khz')
# Or load from a custom checkpoint path.
model = CompressionSolver.model_from_checkpoint('/my_checkpoints/foo/bar/checkpoint.th')


# If you only want to use a pretrained model, you can also directly get it
# from the CompressionModel base model class.
from audiocraft.models import CompressionModel

# Here do not put the `//pretrained/` prefix!
model = CompressionModel.get_pretrained('facebook/encodec_32khz')
model = CompressionModel.get_pretrained('dac_44khz')


# Finally, you can also retrieve the full Solver object, with its dataloader etc.
from audiocraft import train
from pathlib import Path
import logging
import os
import sys

# Uncomment the following line if you want some detailed logs when loading a Solver.
# logging.basicConfig(stream=sys.stderr, level=logging.INFO)

# You must always run the following function from the root directory.
os.chdir(Path(train.__file__).parent.parent)


# You can also get the full solver (only for your own experiments).
# You can provide some overrides to the parameters to make things more convenient.
solver = train.get_solver_from_sig('SIG', {'device': 'cpu', 'dataset': {'batch_size': 8}})
solver.model
solver.dataloaders
```
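
Once loaded, the compression model maps waveforms to discrete codes and back. A minimal
round-trip sketch (using random audio for illustration):

```python
import torch
from audiocraft.models import CompressionModel

model = CompressionModel.get_pretrained('facebook/encodec_32khz')
model.eval()

# One second of (random) audio at the model sample rate, shaped [batch, channels, time].
wav = torch.randn(1, model.channels, model.sample_rate)
with torch.no_grad():
    codes, scale = model.encode(wav)    # codes: [batch, n_codebooks, frames] integer tokens
    recon = model.decode(codes, scale)  # reconstructed waveform
```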

### Importing / Exporting models

At the moment we do not have a definitive workflow for exporting EnCodec models, for
instance to Hugging Face (HF). We are working on supporting automatic conversion between
the AudioCraft and Hugging Face implementations.

We do however provide some support for fine-tuning an EnCodec model coming from HF in
AudioCraft, using for instance `continue_from=//pretrained/facebook/encodec_32khz`.
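
A sketch of such a fine-tuning run, assuming the 32 kHz solver configuration matches the
checkpoint you start from:

```shell
# fine-tune starting from the pretrained 32 kHz checkpoint
dora run solver=compression/encodec_musicgen_32khz continue_from=//pretrained/facebook/encodec_32khz
```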

An AudioCraft checkpoint can be exported in a more compact format (excluding the optimizer etc.)
using `audiocraft.utils.export.export_encodec`. For instance, you could run
```python
|
|
from audiocraft.utils import export
|
|
from audiocraft import train
|
|
xp = train.main.get_xp_from_sig('SIG')
|
|
export.export_encodec(
|
|
xp.folder / 'checkpoint.th',
|
|
'/checkpoints/my_audio_lm/compression_state_dict.bin')
|
|
|
|
|
|
from audiocraft.models import CompressionModel
|
|
model = CompressionModel.get_pretrained('/checkpoints/my_audio_lm/compression_state_dict.bin')
|
|
|
|
from audiocraft.solvers import CompressionSolver
|
|
# The two are strictly equivalent, but this function supports also loading from non-already exported models.
|
|
model = CompressionSolver.model_from_checkpoint('//pretrained//checkpoints/my_audio_lm/compression_state_dict.bin')
|
|
```

The [MusicGen documentation](./MUSICGEN.md) then shows how to use this model as a
tokenizer for MusicGen/AudioGen.

### Learn more

Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md).

## Citation

```
@article{defossez2022highfi,
  title={High Fidelity Neural Audio Compression},
  author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
  journal={arXiv preprint arXiv:2210.13438},
  year={2022}
}
```

## License

See license information in the [README](../README.md).

[arxiv]: https://arxiv.org/abs/2210.13438
[encodec_samples]: https://ai.honu.io/papers/encodec/samples.html