Advanced long-form generation
#12 opened by skroed
Is there a way to do advanced long-form generation, similar to what is described here: https://github.com/suno-ai/bark/blob/main/notebooks/long_form_generation.ipynb, by passing a temperature and min_eos_p?
You can use the same logic as in the original notebook you linked:
import nltk  # we'll use this to split the text into sentences
import numpy as np
import torch
from transformers import BarkModel, AutoProcessor

nltk.download('punkt')

device = "cuda"

# load the model in half precision to reduce memory usage
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to(device)
# BetterTransformer: faster attention via PyTorch's optimized kernels
model = model.to_bettertransformer()
processor = AutoProcessor.from_pretrained("suno/bark")

sampling_rate = model.generation_config.sample_rate
silence = np.zeros(int(0.25 * sampling_rate))  # quarter second of silence

voice_preset = "v2/en_speaker_6"
BATCH_SIZE = 12

# split into sentences; TEXT_TO_GENERATE is your long input text
model_input = nltk.sent_tokenize(TEXT_TO_GENERATE)

pieces = []
for i in range(0, len(model_input), BATCH_SIZE):
    # `i` already advances by BATCH_SIZE, so slice directly from i to i + BATCH_SIZE
    inputs = model_input[i : i + BATCH_SIZE]
    if len(inputs) != 0:
        inputs = processor(inputs, voice_preset=voice_preset)
        speech_output, output_lengths = model.generate(**inputs.to(device), return_output_lengths=True, min_eos_p=0.2)
        # trim each sample to its actual length before moving it to CPU
        speech_output = [output[:length].cpu().numpy() for (output, length) in zip(speech_output, output_lengths)]
        print(f"batch starting at sentence {i} generated")
        # you can already play `speech_output` or wait for the whole generation
        pieces += [*speech_output, silence.copy()]

whole_output = np.concatenate(pieces)
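Regarding the temperature you asked about: BarkModel.generate forwards extra keyword arguments to its sub-models, either globally or with a semantic_/coarse_/fine_ prefix, so sampling parameters can be passed alongside min_eos_p. A minimal sketch of the generate call (the values are illustrative, not tuned):

# a sketch: min_eos_p controls early stopping, while prefixed kwargs
# (e.g. semantic_temperature) are routed to the matching sub-model's
# generation config; the values below are illustrative, not tuned
speech_output, output_lengths = model.generate(
    **inputs.to(device),
    return_output_lengths=True,
    min_eos_p=0.2,             # stop a sample once end-of-speech probability exceeds 0.2
    do_sample=True,
    semantic_temperature=0.6,  # temperature for the semantic (text-to-token) sub-model
)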
This approach, however, is also faster, because it generates the outputs in batches!
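If you want to keep the result, here is a short sketch of saving it to disk with scipy (the filename is just an example):

from scipy.io import wavfile

# write the concatenated waveform as a float32 WAV file ("bark_long_form.wav" is an example name)
wavfile.write("bark_long_form.wav", rate=sampling_rate, data=whole_output.astype(np.float32))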
Thanks, I will give this a try.