---
license: apache-2.0
datasets:
- jan-hq/instruction-speech-v1
language:
- en
tags:
- sound language model
---
## Model Details
We have developed and released the [Jan-Llama3](https://huggingface.co/collections/jan-hq/jan-llama3-668e4dad446c8736208dca4f) family of models. Models in this family natively understand both audio and text input.
We continue to extend [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) with sound understanding capabilities by training on the 700M-token [Instruction Speech v1](https://huggingface.co/datasets/jan-hq/instruction-speech-v1) dataset.
**Model developers** Homebrew Research.
**Input** Text and sound.
**Output** Text.
**Model Architecture** Llama-3.
**Language(s):** English.
## Intended Use
**Intended Use Cases** This family is primarily intended for research applications. This version aims to further improve the LLM's sound understanding capabilities.
**Out-of-scope** The use of Llama-3-Sound in any manner that violates applicable laws or regulations is strictly prohibited.
## How to Get Started with the Model
```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline


# Audio to Sound Tokens
def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device="cuda"):
    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(target_bandwidth)
    model.to(device)

    wav, sr = torchaudio.load(audio_path)
    wav = convert_audio(wav, sr, model.sample_rate, model.channels)
    wav = wav.unsqueeze(0).to(device)

    with torch.no_grad():
        encoded_frames = model.encode(wav)
    codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1)

    audio_code1, audio_code2 = codes[0][0], codes[0][1]
    flatten_tokens = torch.stack((audio_code1, audio_code2), dim=1).flatten().tolist()
    result = ''.join(f'<|sound_{num}|>' for num in flatten_tokens)
    return f'<|sound_start|>{result}<|sound_end|>'


# LLM Pipeline Setup
def setup_pipeline(model_path, use_4bit=True):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model_kwargs = {"device_map": "auto"}
    if use_4bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
    return pipeline("text-generation", model=model, tokenizer=tokenizer)


# Text Generation
def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }
    output = pipe(messages, **generation_args)
    return output[0]['generated_text']


# Main process
def audio_to_text(audio_path, model_path, use_4bit=True):
    # Convert audio to sound tokens
    sound_tokens = audio_to_sound_tokens(audio_path)
    # Setup LLM pipeline
    pipe = setup_pipeline(model_path, use_4bit)
    # Generate text
    messages = [{"role": "user", "content": sound_tokens}]
    return generate_text(pipe, messages)


# Usage example
audio_path = "/path/to/your/audio/file"
model_path = "jan-hq/Jan-Llama3-0708"
generated_text = audio_to_text(audio_path, model_path)
```
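The token layout that `audio_to_sound_tokens` produces can be illustrated without loading any model. The following is a minimal pure-Python sketch, assuming two codebooks per frame (which is what Encodec's 24 kHz model yields at a 1.5 kbps target bandwidth, at 75 frames per second). The helper names `interleave_codes`, `to_sound_string`, `parse_sound_tokens`, and `estimate_sound_tokens` are hypothetical and not part of this repository.

```python
import re

# Matches <|sound_123|> but not <|sound_start|> or <|sound_end|>.
SOUND_TOKEN = re.compile(r"<\|sound_(\d+)\|>")


def interleave_codes(code1, code2):
    """Mirror torch.stack((c1, c2), dim=1).flatten(): c1[0], c2[0], c1[1], c2[1], ..."""
    if len(code1) != len(code2):
        raise ValueError("both codebooks must cover the same number of frames")
    return [code for pair in zip(code1, code2) for code in pair]


def to_sound_string(codes):
    """Wrap a flat list of codes in the sound-token markup used in the prompt."""
    body = "".join(f"<|sound_{code}|>" for code in codes)
    return f"<|sound_start|>{body}<|sound_end|>"


def parse_sound_tokens(text):
    """Recover the two per-codebook sequences from a sound-token string."""
    flat = [int(match) for match in SOUND_TOKEN.findall(text)]
    return flat[0::2], flat[1::2]


def estimate_sound_tokens(seconds, frame_rate=75, n_codebooks=2):
    """Rough prompt-length estimate (assumed frame rate and codebook count)."""
    return int(seconds * frame_rate * n_codebooks)


prompt = to_sound_string(interleave_codes([12, 7, 900], [3, 1023, 44]))
print(prompt)
print(parse_sound_tokens(prompt))
print(estimate_sound_tokens(10))
```

Under these assumptions, each second of audio adds roughly 150 sound tokens to the prompt, which is worth keeping in mind when choosing clip lengths relative to the model's context window.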
## Training process
**Training Metrics Image**: Below is a snapshot of the training loss curve.
![train_loss_curve/png](https://cdn-uploads.huggingface.co/production/uploads/65713d70f56f9538679e5a56/9bv-kpnqrTxaBhiYrVHN7.png)
### Hardware
**GPU Configuration**: Cluster of 8x NVIDIA H100-SXM-80GB.
**GPU Usage**:
- **Continual Training**: 8 hours.
### Training Arguments
| Parameter | Continual Training |
|----------------------------|-------------------------|
| **Epoch** | 1 |
| **Global batch size** | 128 |
| **Learning Rate** | 5e-5 |
| **Learning Rate Scheduler** | Cosine with warmup |
| **Optimizer** | [Adam-mini](https://arxiv.org/abs/2406.16793) |
| **Warmup Ratio** | 0.1 |
| **Weight Decay** | 0.01 |
| **beta1** | 0.9 |
| **beta2** | 0.98 |
| **epsilon** | 1e-6 |
| **Gradient Clipping** | 1.0 |
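To make the learning-rate settings in the table concrete, here is a minimal sketch of a cosine schedule with linear warmup, using the peak learning rate (5e-5) and warmup ratio (0.1) listed above. The exact schedule implementation used for training is not stated in this card, so the formula below (linear warmup to the peak, then cosine decay to zero) is an assumption, and `total_steps = 1000` is a hypothetical value chosen only for illustration.

```python
import math


def cosine_with_warmup_lr(step, total_steps, peak_lr=5e-5, warmup_ratio=0.1):
    """Learning rate at a given 0-indexed optimizer step.

    Assumed shape: linear warmup from 0 to peak_lr, then cosine decay to 0.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; the real step count is not given in this card
for step in (0, 50, 100, 550, 1000):
    print(step, f"{cosine_with_warmup_lr(step, total_steps):.2e}")
```

With these values the rate climbs linearly over the first 10% of steps, peaks at 5e-5, and decays along a half cosine for the remainder of the run.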
### Accelerate FSDP Config
```
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
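The config above runs 8 processes (one per GPU), while the training arguments list a global batch size of 128. The card does not state how that batch is split between per-device micro-batches and gradient accumulation, so the sketch below only enumerates the combinations that are arithmetically consistent with those two numbers. The function name `batch_layouts` is hypothetical.

```python
def batch_layouts(global_batch_size=128, num_processes=8):
    """List (micro_batch_per_device, grad_accumulation_steps) pairs that satisfy
    micro_batch * grad_accumulation * num_processes == global_batch_size."""
    if global_batch_size % num_processes != 0:
        raise ValueError("global batch size must be divisible by num_processes")
    per_process = global_batch_size // num_processes
    return [
        (micro, per_process // micro)
        for micro in range(1, per_process + 1)
        if per_process % micro == 0
    ]


for micro, accum in batch_layouts():
    print(f"micro-batch {micro:>2} per device x {accum:>2} accumulation steps")
```

Each process handles 16 samples per optimizer step under these numbers; which of the listed splits was actually used is not documented here.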
## Examples
1. Good example:
```
# Question:
Does the following review have a positive or negative opinion of the movie?
i thought the relationships were wonderful ,
Options: A). negative; B). positive;
# Expected Output:
The review has a B). positive opinion about the movie because they think the relationships were wonderful.
# Model's Output:
The review is positive.
```
2. Misunderstanding example:
```
# Question:
Translate the following sentence to Russian:
6 Work in JOBS and careers after 40
Russian:
# Expected Output:
6 Работать в РАБОТАХ и карьере после 40
Russian (Cyrillic): 6 Работать в РАБОТАХ и карьере после 40
Let me explain this to you like you're five years old: This sentence means that people can still work and have careers, even if they're older than 40 years old. The words are in a different language, called Russian.
# Model's Output:
В случае, когда вы хотите перевести фразу "The first step is to identify the language of the text." на русский язык, вы можете использовать следующую фразу: "Первый шаг - определить язык текста."
```
3. Off-track example:
```
# Question:
Answer the following question: - The roots of the growing plants start to break up the rock - The plant acids dissolve the rock - The rock is broken down into smaller pieces - Erosion begins. What might be the first step of the process?
Answer:
# Expected Output:
The first step of the process might be: The roots of the growing plants start to break up the rock.
# Model's Output:
Step 1: Identify the key elements in the premise and hypothesis.
Premise: "A man in a blue shirt and a red hat is holding a sign."
Key elements: man, blue shirt, red hat, holding, sign
Hypothesis: "A man is holding a sign."
Key elements: man, holding, sign
Step 2: Compare the key elements in the premise and hypothesis.
In the premise, we have a man wearing a blue shirt and a red hat, while in the hypothesis, we only have a man. The premise also mentions that the man is holding a sign, which is also mentioned in the hypothesis.
Step 3: Determine if the hypothesis is entailed by the premise.
Since the hypothesis states that a man is holding a sign, and the premise confirms that a man is holding a sign, we can conclude that the hypothesis is entailed by the premise. The additional information about the man's clothing in the premise does not contradict or negate the
```
Despite being undertrained, the model demonstrates an emerging grasp of sound-text semantics.
## Citation Information
**BibTeX:**
```
@article{llama3sound2024,
  title={Llama-3-Sound: Sound Instruction LLM},
  author={Homebrew Research},
  year={2024},
  month={July},
  url={https://huggingface.co/jan-hq/Jan-Llama3-0708}
}
```
## Acknowledgement
- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**
- **[Encodec](https://github.com/facebookresearch/encodec)**
- **[Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)**