---
license: apache-2.0
language:
- ug
base_model:
- lucio/xls-r-uyghur-cv7
pipeline_tag: automatic-speech-recognition
---

# Model Card for xls-r-uyghur-cv18

This model is a fine-tuned version of lucio/xls-r-uyghur-cv7, which is itself based on facebook/wav2vec2-xls-r-300m. The mozilla-foundation/common_voice_18_0 (ug) dataset was used for fine-tuning.

It achieves the following result:

- Loss: 1.0882

## Model Details

For details of the underlying architecture, see facebook/wav2vec2-xls-r-300m.

### Model Description

The model vocabulary consists of the alphabetic characters of the Perso-Arabic script for the Uyghur language, with punctuation removed.

### Intended uses & limitations

This model is expected to be of some utility for low-fidelity use cases such as:

- drafting video captions
- indexing recorded broadcasts

The model is not reliable enough to serve as a substitute for live captions for accessibility purposes, and it should not be used in a manner that would infringe the privacy of contributors to the Common Voice dataset or of any other speakers.

### Training and evaluation data

The combination of the train and dev splits of the official Common Voice release was used as training data. The official test split was reserved for final evaluation.
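The fine_tune.py script further down loads only the train split, so the train+dev combination described above is not shown there. Below is a minimal sketch of how it could be built with `datasets.concatenate_datasets`, assuming the standard Common Voice 18.0 split names (`train`, `validation`, `test`); it is not part of the original training script:

```
from datasets import load_dataset, concatenate_datasets

# Official train and dev (validation) splits of Common Voice 18.0, Uyghur subset
train_split = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="train")
dev_split = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="validation")

# Combine them into a single training set, as described above
train_data = concatenate_datasets([train_split, dev_split])

# The official test split is held out for final evaluation
test_data = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="test")
```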
### Training procedure

The featurization layers of the XLS-R model are frozen while a final CTC/LM layer is tuned on the Uyghur CV18 example sentences.

### Training hyperparameters

The following hyperparameters were used during training:

- group_by_length=True
- per_device_train_batch_size=8
- eval_strategy="no"
- num_train_epochs=3
- fp16=True
- save_steps=500
- eval_steps=500
- logging_steps=500
- learning_rate=1e-4
- warmup_steps=500
- save_total_limit=2

### How to train

Create a Python file named fine_tune.py with the following contents:

```
import torchaudio
import torch
import librosa
from datasets import load_dataset, Audio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from transformers import TrainingArguments, Trainer
from dataclasses import dataclass
from typing import Dict, List, Union

# Load the dataset (Common Voice 18.0, Uyghur subset)
dataset = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="train")
dataset = dataset.cast_column("path", Audio())

# Load the processor
processor = Wav2Vec2Processor.from_pretrained("lucio/xls-r-uyghur-cv7")

def preprocess_function(batch):
    audio = batch["path"]
    # Resample to 16 kHz if necessary
    if audio["sampling_rate"] != 16000:
        resampler = torchaudio.transforms.Resample(audio["sampling_rate"], 16000)
        waveform = torch.tensor(audio["array"], dtype=torch.float32)
        audio["array"] = resampler(waveform).numpy()
    # Pad or trim all clips to the same length
    audio_array = librosa.util.fix_length(audio["array"], size=200000)
    # Convert the audio array to a tensor
    audio_tensor = torch.from_numpy(audio_array).float()
    inputs = processor(
        audio_tensor,
        sampling_rate=16000,
        return_tensors="pt",
        padding="longest"
    )
    # Tokenize the transcription into label ids
    labels = processor(text=batch["sentence"]).input_ids
    batch["input_values"] = inputs.input_values[0]  # drop the batch dimension
    batch["labels"] = labels
    return batch

# Apply the preprocessing
dataset = dataset.map(preprocess_function, remove_columns=["path", "sentence"])

model = Wav2Vec2ForCTC.from_pretrained(
    "lucio/xls-r-uyghur-cv7",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id
)

# Freeze the feature extractor parameters
model.freeze_feature_encoder()

training_args = TrainingArguments(
    output_dir="./wav2vec2_finetune",
    group_by_length=True,
    per_device_train_batch_size=8,
    eval_strategy="no",  # no eval dataset is passed to the Trainer
    num_train_epochs=3,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    learning_rate=1e-4,
    warmup_steps=500,
    save_total_limit=2,
)

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Extract all input_values and convert them to tensors
        input_features = [torch.tensor(feature["input_values"]) for feature in features]
        # Find the shortest sequence length
        min_length = min(map(len, input_features))
        # Truncate input_values to that length
        input_features = [feature[:min_length] for feature in input_features]
        # Pad input_values
        input_features = torch.nn.utils.rnn.pad_sequence(input_features, batch_first=True)
        # Collect the label sequences and convert them to tensors
        label_features = [torch.tensor(feature["labels"]) for feature in features]
        # Pad the labels with -100 so padding is ignored by the CTC loss
        labels_batch = torch.nn.utils.rnn.pad_sequence(label_features, batch_first=True, padding_value=-100)
        batch = {
            "input_values": input_features,
            "labels": labels_batch,
        }
        return batch

# Use the custom data collator
data_collator = DataCollatorCTCWithPadding(processor=processor, padding=True)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=processor.feature_extractor,
    data_collator=data_collator
)

trainer.train()

model.save_pretrained("fine_tuned_wav2vec2_UGASR_model")  # name of the fine-tuned model
processor.save_pretrained("fine_tuned_wav2vec2_UGASR_model")
# Fine-tuning is now complete; the saved "fine_tuned_wav2vec2_UGASR_model" can be evaluated further.
```

Above is the full content of fine_tune.py.
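For the final evaluation on the official test split mentioned above, a word-error-rate check could look like the minimal sketch below. It is not the author's original evaluation script: it assumes the `evaluate` and `jiwer` packages are installed and mirrors the greedy-decoding inference steps used in asr.py further down.

```
import torch
import evaluate
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the fine-tuned model and the official test split
model = Wav2Vec2ForCTC.from_pretrained("fine_tuned_wav2vec2_UGASR_model")
processor = Wav2Vec2Processor.from_pretrained("fine_tuned_wav2vec2_UGASR_model")
test_set = load_dataset("mozilla-foundation/common_voice_18_0", "ug", split="test")
test_set = test_set.cast_column("path", Audio(sampling_rate=16000))  # decode and resample on access

wer_metric = evaluate.load("wer")  # requires the jiwer package
predictions, references = [], []

for example in test_set:
    inputs = processor(example["path"]["array"], sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    # Greedy CTC decoding
    predicted_ids = torch.argmax(logits, dim=-1)
    predictions.append(processor.batch_decode(predicted_ids)[0])
    references.append(example["sentence"])

print(f"WER: {wer_metric.compute(predictions=predictions, references=references):.3f}")
```

Greedy argmax decoding is used here; a language-model-backed CTC decoder would typically lower the measured WER.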
- **Developed by:** Mamajtan Abudkader, 2024.9.10
- **Model type:** ASR
- **Language(s) (NLP):** Uyghur
- **License:** Apache 2.0
- **Finetuned from model:** lucio/xls-r-uyghur-cv7

## Uses

This model is used for automatic speech recognition of the Uyghur language in Perso-Arabic script.

## How to Get Started with the Model

Use the code below to get started with the model. Create a Python file named asr.py with the following contents:

```
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import torch
import time

stt = time.time()

# Path of the model
model_path = "mamatjan/xls-r-uyghur-cv18"

# Load the model and the processor
model = Wav2Vec2ForCTC.from_pretrained(model_path)
processor = Wav2Vec2Processor.from_pretrained(model_path)

# Read the audio file and resample it to 16 kHz.
# "example.mp3" is the audio file to transcribe; make sure it is in the
# same directory as asr.py, or give its full path.
audio_input, sampling_rate = librosa.load("example.mp3", sr=None)
if sampling_rate != 16000:
    audio_input = librosa.resample(audio_input, orig_sr=sampling_rate, target_sr=16000)
    sampling_rate = 16000

# Process the audio data with the processor
inputs = processor(audio_input, return_tensors="pt", sampling_rate=sampling_rate, padding=True)

# Run inference with the model
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Decode the predictions with the CTC decoder
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

waqit = time.time() - stt
print("======سەرىپ قىلغان ۋاقىت===============")  # header line: "time spent"
print(f"ۋاقىت: {waqit:.2f} سىكۇنت")  # prints the elapsed time ("time: *.** seconds")
print(transcription[0])  # print the transcribed Uyghur text; asr.py ends here
```

Above is the full content of asr.py.

## Hardware

An NVIDIA GeForce RTX 3060 Ti was used for training, on a Windows 10 system, for 14 hours.