---
license: apache-2.0
language:
- ug
base_model:
- lucio/xls-r-uyghur-cv7
pipeline_tag: automatic-speech-recognition
---

# Model Card for xls-r-uyghur-cv18

This model is a fine-tuned version of lucio/xls-r-uyghur-cv7, which is itself based on facebook/wav2vec2-xls-r-300m. The mozilla-foundation/common_voice_18_0 (ug) dataset was used for fine-tuning.

It achieves the following result:

- Loss: 1.0882

## Model Details

For details of the underlying architecture, see facebook/wav2vec2-xls-r-300m.

### Model Description

The model vocabulary consists of the alphabetic characters of the Perso-Arabic script for the Uyghur language, with punctuation removed.

### Intended uses & limitations

This model is expected to be of some utility for low-fidelity use cases such as:

- drafting video captions
- indexing recorded broadcasts

The model is not reliable enough to serve as a substitute for live captions for accessibility purposes, and it should not be used in a manner that would infringe the privacy of contributors to the Common Voice dataset or of any other speakers.

### Training and evaluation data

The combination of the train and dev splits of the official Common Voice release was used as training data. The official test split was reserved for final evaluation.
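The fine_tune.py script further down loads only the train split, so the train+dev combination described above is not shown there. Below is a minimal sketch of how it could be built with `datasets.concatenate_datasets`, assuming the standard Common Voice 18.0 split names (`train`, `validation`, `test`); it is not part of the original training script:

```
from datasets import load_dataset, concatenate_datasets

# Official train and dev (validation) splits of Common Voice 18.0, Uyghur subset
train_split = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="train")
dev_split = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="validation")

# Combine them into a single training set, as described above
train_data = concatenate_datasets([train_split, dev_split])

# The official test split is held out for final evaluation
test_data = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="test")
```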
### Training procedure

The featurization layers of the XLS-R model are frozen while a final CTC/LM layer is tuned on the Uyghur CV18 example sentences.

### Training hyperparameters

The following hyperparameters were used during training:

- group_by_length=True
- per_device_train_batch_size=8
- eval_strategy="no"
- num_train_epochs=3
- fp16=True
- save_steps=500
- eval_steps=500
- logging_steps=500
- learning_rate=1e-4
- warmup_steps=500
- save_total_limit=2

### How to train

Create a Python file named fine_tune.py with the following contents:

```
import torchaudio
import torch
import librosa
from datasets import load_dataset, Audio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from transformers import TrainingArguments, Trainer
from dataclasses import dataclass
from typing import Dict, List, Union

# Load the dataset (Common Voice 18.0, Uyghur subset)
dataset = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="train")
dataset = dataset.cast_column("path", Audio())

# Load the processor
processor = Wav2Vec2Processor.from_pretrained("lucio/xls-r-uyghur-cv7")

def preprocess_function(batch):
    audio = batch["path"]
    # Resample to 16 kHz if necessary
    if audio["sampling_rate"] != 16000:
        resampler = torchaudio.transforms.Resample(audio["sampling_rate"], 16000)
        waveform = torch.tensor(audio["array"], dtype=torch.float32)
        audio["array"] = resampler(waveform).numpy()
    # Pad or trim all clips to the same length
    audio_array = librosa.util.fix_length(audio["array"], size=200000)
    # Convert the audio array to a tensor
    audio_tensor = torch.from_numpy(audio_array).float()
    inputs = processor(
        audio_tensor,
        sampling_rate=16000,
        return_tensors="pt",
        padding="longest"
    )
    # Tokenize the transcription into label ids
    labels = processor(text=batch["sentence"]).input_ids
    batch["input_values"] = inputs.input_values[0]  # drop the batch dimension
    batch["labels"] = labels
    return batch

# Apply the preprocessing
dataset = dataset.map(preprocess_function, remove_columns=["path", "sentence"])

model = Wav2Vec2ForCTC.from_pretrained(
    "lucio/xls-r-uyghur-cv7",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id
)

# Freeze the feature extractor parameters
model.freeze_feature_encoder()

training_args = TrainingArguments(
    output_dir="./wav2vec2_finetune",
    group_by_length=True,
    per_device_train_batch_size=8,
    eval_strategy="no",  # no eval dataset is passed to the Trainer
    num_train_epochs=3,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4,
    warmup_steps=500,
    save_total_limit=2,
)

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Extract all input_values and convert them to tensors
        input_features = [torch.tensor(feature["input_values"]) for feature in features]
        # Find the shortest sequence length
        min_length = min(map(len, input_features))
        # Truncate input_values to that length
        input_features = [feature[:min_length] for feature in input_features]
        # Pad input_values
        input_features = torch.nn.utils.rnn.pad_sequence(input_features, batch_first=True)
        # Collect the label sequences and convert them to tensors
        label_features = [torch.tensor(feature["labels"]) for feature in features]
        # Pad the labels with -100 so padding is ignored by the CTC loss
        labels_batch = torch.nn.utils.rnn.pad_sequence(label_features, batch_first=True, padding_value=-100)
        batch = {
            "input_values": input_features,
            "labels": labels_batch,
        }
        return batch

# Use the custom data collator
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=processor.feature_extractor,
    data_collator=data_collator
)

trainer.train()

model.save_pretrained("fine_tuned_wav2vec2_UGASR_model")  # name of the fine-tuned model
processor.save_pretrained("fine_tuned_wav2vec2_UGASR_model")
# Fine-tuning is now complete; the saved "fine_tuned_wav2vec2_UGASR_model" can be evaluated further.
```

Above is the full content of fine_tune.py.
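For the final evaluation on the official test split mentioned above, a word-error-rate check could look like the minimal sketch below. It is not the author's original evaluation script: it assumes the `evaluate` and `jiwer` packages are installed and mirrors the greedy-decoding inference steps used in asr.py further down.

```
import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the fine-tuned model and the official test split
model = Wav2Vec2ForCTC.from_pretrained("fine_tuned_wav2vec2_UGASR_model")
processor = Wav2Vec2Processor.from_pretrained("fine_tuned_wav2vec2_UGASR_model")
test_set = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="test")
test_set = test_set.cast_column("path", Audio(sampling_rate=16000))  # decode and resample on access

wer_metric = evaluate.load("wer")  # requires the jiwer package
predictions, references = [], []

for example in test_set:
    inputs = processor(example["path"]["array"], sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy CTC decoding
    predicted_ids = torch.argmax(logits, dim=-1)
    predictions.append(processor.batch_decode(predicted_ids)[0])
    references.append(example["sentence"])

print(f"WER: {wer_metric.compute(predictions=predictions, references=references):.3f}")
```

Greedy argmax decoding is used here; a language-model-backed CTC decoder would typically lower the measured WER.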
- **Developed by:** Mamajtan Abudkader, 2024.9.10
- **Model type:** ASR
- **Language(s) (NLP):** Uyghur
- **License:** Apache 2.0
- **Finetuned from model:** lucio/xls-r-uyghur-cv7

## Uses

This model is used for automatic speech recognition of the Uyghur language in Perso-Arabic script.

## How to Get Started with the Model

Use the code below to get started with the model. Create a Python file named asr.py with the following contents:

```
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import torch
import time

stt = time.time()

# Path of the model
model_path = "mamatjan/xls-r-uyghur-cv18"

# Load the model and the processor
model = Wav2Vec2ForCTC.from_pretrained(model_path)
processor = Wav2Vec2Processor.from_pretrained(model_path)

# Read the audio file and resample it to 16 kHz.
# "example.mp3" is the audio file to transcribe; make sure it is in the
# same directory as asr.py, or give its full path.
audio_input, sampling_rate = librosa.load("example.mp3", sr=None)
if sampling_rate != 16000:
    audio_input = librosa.resample(audio_input, orig_sr=sampling_rate, target_sr=16000)
    sampling_rate = 16000

# Process the audio data with the processor
inputs = processor(audio_input, return_tensors="pt", sampling_rate=sampling_rate, padding=True)

# Run inference with the model
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Decode the predictions with the CTC decoder
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

waqit = time.time() - stt
print("======سەرىپ قىلغان ۋاقىت===============")  # header line: "time spent"
print(f"ۋاقىت: {waqit:.2f} سىكۇنت")  # prints the elapsed time ("time: *.** seconds")
print(transcription[0])  # print the transcribed Uyghur text; asr.py ends here
```

Above is the full content of asr.py.

## Hardware

An NVIDIA GeForce RTX 3060 Ti was used for training, on a Windows 10 system, for 14 hours.