---
language:
- ru
library_name: transformers
tags:
- asr
- whisper
- russian
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
---

# Model Details

This is a version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) finetuned for better support of the Russian language.

The dataset used for finetuning is the Russian part of Common Voice 11.0.

The original dataset was preprocessed by pooling its train, test, and validation splits and re-splitting them into new train and test sets at a 0.95/0.05 ratio. After this preprocessing, the original Whisper v3 has a WER of 9.2, while the finetuned version shows 6.31 (so far).
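
As a rough illustration of the two steps above, the sketch below pools examples, re-splits them 95/5, and computes word error rate in plain Python. The function names (`resplit`, `wer`), the fixed seed, and the whitespace tokenization are assumptions made for illustration. The card does not say which tooling or text normalization was actually used, so figures from this sketch would not necessarily match the ones reported.

```python
import random


def resplit(examples, test_fraction=0.05, seed=42):
    """Shuffle pooled examples and cut them into new train/test lists."""
    pooled = list(examples)
    random.Random(seed).shuffle(pooled)
    n_test = round(len(pooled) * test_fraction)
    return pooled[n_test:], pooled[:n_test]


def wer(references, hypotheses):
    """Corpus-level WER in percent: word edits / reference words * 100."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        r, h = ref.split(), hyp.split()
        # Word-level Levenshtein distance, computed one row at a time.
        prev = list(range(len(h) + 1))
        for i, rw in enumerate(r, start=1):
            cur = [i]
            for j, hw in enumerate(h, start=1):
                cur.append(min(
                    prev[j] + 1,               # deletion
                    cur[j - 1] + 1,            # insertion
                    prev[j - 1] + (rw != hw),  # substitution or match
                ))
            prev = cur
        errors += prev[-1]
        total += len(r)
    return 100.0 * errors / total


# Stand-in integers play the role of pooled dataset rows.
train, test = resplit(range(1000))
print(len(train), len(test))

# One substituted word out of three reference words.
print(round(wer(["привет как дела"], ["привет как дело"]), 2))
```

Pooling before splitting means the new test set draws on all three original splits, so the WER figures above are only comparable with each other, not with results reported on the official Common Voice test split.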

## Usage

```
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor, pipeline

torch_dtype = torch.bfloat16  # set your preferred type here

# pick the best available device
device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
elif torch.backends.mps.is_available():
    device = 'mps'
    setattr(torch.distributed, "is_initialized", lambda: False)  # monkey patching
device = torch.device(device)

whisper = WhisperForConditionalGeneration.from_pretrained(
    "antony66/whisper-large-v3-russian",
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
    # add attn_implementation="flash_attention_2" if your GPU supports it
)

processor = WhisperProcessor.from_pretrained("antony66/whisper-large-v3-russian")

asr_pipeline = pipeline(
    "automatic-speech-recognition",
    model=whisper,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# read your wav file into the variable wav as raw bytes
# (the pipeline accepts a file path, raw bytes, or a numpy array). For example:
with open('call.wav', 'rb') as f:
    wav = f.read()

# get the transcription
asr = asr_pipeline(wav, generate_kwargs={"language": "russian", "max_new_tokens": 256}, return_timestamps=False)

print(asr['text'])
```
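
The pipeline above is constructed with `return_timestamps=True`, so dropping the `return_timestamps=False` override in the call should make the result carry a `chunks` list next to `text`, with each entry holding a `(start, end)` tuple in seconds. The helper below is a small sketch for printing such a result. `format_chunks` is a hypothetical name, and the sample dictionary is made-up data that only mimics that shape, not real model output.

```python
def format_chunks(result):
    """Render each timestamped chunk as '[start - end] text'."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # The end time of the final chunk can be missing (None).
        end_str = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f} - {end_str}] {chunk['text'].strip()}")
    return "\n".join(lines)


# Made-up data in the shape returned when timestamps are enabled.
sample = {
    "text": " Добрый день. Чем могу помочь?",
    "chunks": [
        {"timestamp": (0.0, 1.4), "text": " Добрый день."},
        {"timestamp": (1.4, None), "text": " Чем могу помочь?"},
    ],
}
print(format_chunks(sample))
```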

## Work in progress

This model is still a work in progress. The goal is to finetune it as far as possible for speech recognition of phone calls. If you want to contribute and know of or have a good dataset, please let me know. Your help will be much appreciated.