
OWSM-CTC (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC. It is trained on 180k hours of public audio data for multilingual speech recognition, any-to-any speech translation, and language identification, following the design of the Open Whisper-style Speech Model (OWSM) project.

This model is initialized with OWSM-CTC v3.1 and then fine-tuned on v3.2 data for 225k steps.

Currently, the code for OWSM-CTC has not been merged into the ESPnet main branch. Instead, it is available in the owsm-ctc branch of https://github.com/pyf98/espnet, which is installed via the requirements below.

To use the pre-trained model, you need to install espnet and espnet_model_zoo. The requirements are listed below; an example install command follows the list:

librosa
torch
espnet @ git+https://github.com/pyf98/espnet@owsm-ctc
espnet_model_zoo
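
One way to satisfy these requirements with pip is sketched below; the exact commands are not part of the original card, and installing the owsm-ctc branch last ensures it replaces any espnet version pulled in as a dependency:

pip install librosa torch espnet_model_zoo
pip install "espnet @ git+https://github.com/pyf98/espnet@owsm-ctc"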

We use FlashAttention during training, but it is not required for inference. If you want to install it anyway, run:

pip install flash-attn --no-build-isolation

Example script for short-form ASR/ST

import soundfile as sf
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch


s2t = Speech2TextGreedySearch.from_pretrained(
    "pyf98/owsm_ctc_v3.2_ft_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

# Load a 16 kHz mono recording
speech, rate = sf.read("xxx.wav")
# Pad or trim to 30 seconds of audio at 16 kHz
speech = librosa.util.fix_length(speech, size=(16000 * 30))

res = s2t(speech)[0]
print(res)
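
The script above assumes 16 kHz audio (the padding length is 16000 * 30 samples). If your file has a different sampling rate or multiple channels, a minimal preprocessing sketch (not part of the original card, assuming librosa's resampler is acceptable for your use case) is:

import soundfile as sf
import librosa

speech, rate = sf.read("xxx.wav")
if speech.ndim > 1:
    # Convert multi-channel audio to mono by averaging channels
    speech = speech.mean(axis=1)
if rate != 16000:
    # Resample to the 16 kHz rate expected by the model
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
    rate = 16000

The resulting array can then be padded and decoded exactly as in the example above.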

Example script for long-form ASR/ST

import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch


if __name__ == "__main__":
    context_len_in_secs = 4   # left and right context when doing buffered inference
    batch_size = 32   # depends on the GPU memory
    s2t = Speech2TextGreedySearch.from_pretrained(
        "pyf98/owsm_ctc_v3.2_ft_1B",
        device='cuda' if torch.cuda.is_available() else 'cpu',
        generate_interctc_outputs=False,
        lang_sym='<eng>',
        task_sym='<asr>',
    )

    # Load the long-form recording (16 kHz mono)
    speech, rate = sf.read("xxx.wav")

    text = s2t.decode_long_batched_buffered(
        speech,
        batch_size=batch_size,
        context_len_in_secs=context_len_in_secs,
        frames_per_sec=12.5,        # 80ms shift, model-dependent, don't change
    )
    print(text)

Example for CTC forced alignment using ctc-segmentation

CTC segmentation can be applied efficiently to audio of arbitrary length. For downloading the model, please refer to https://github.com/espnet/espnet?tab=readme-ov-file#ctc-segmentation-demo
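
One possible way to obtain the checkpoint locally is to download the model repository with huggingface_hub; this is a sketch, not part of the original card, and assumes the downloaded snapshot contains the exp/ directory referenced in the script below:

from huggingface_hub import snapshot_download

# Download the model repository (config and checkpoint) and print its local path;
# point s2t_model_file in the script below at the .pth file inside this directory.
local_dir = snapshot_download(repo_id="pyf98/owsm_ctc_v3.2_ft_1B")
print(local_dir)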

import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation


if __name__ == "__main__":
    ## Please download model first
    aligner = CTCSegmentation(
        s2t_model_file="exp/s2t_train_s2t_multitask-ctc_ebf27_conv2d8_size1024_raw_bpe50000/valid.total_count.ave_5best.till45epoch.pth",
        fs=16000,
        ngpu=1,
        batch_size=16,    # batched parallel decoding; reduce it if your GPU memory is smaller
        kaldi_style_text=True,
        time_stamps="fixed",
        samples_to_frames_ratio=1280,   # 80ms time shift; don't change as it depends on the pre-trained model
        lang_sym="<eng>",
        task_sym="<asr>",
        context_len_in_secs=2,  # left and right context in buffered decoding
        frames_per_sec=12.5,    # 80ms time shift; don't change as it depends on the pre-trained model
    )

    # Load the audio to be aligned (16 kHz mono)
    speech, rate = sf.read("example.wav")
    print(f"speech duration: {len(speech) / rate : .2f} seconds")
    text = '''
utt1 hello there
utt2 welcome to this repo
'''

    segments = aligner(speech, text)
    print(segments)
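
As a small follow-up to the script above (not part of the original card), the alignment can also be saved to a file; this continues from the segments object returned by the aligner and assumes its string form, as shown by print(segments), is the listing you want to keep:

# Save the alignment result shown by print(segments) to a file
with open("aligned_segments", "w") as f:
    f.write(str(segments))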