How to get Word and Verbose level transcription?

#11
by souvik0306 - opened

Large-v3 is very fast with batching, as shown here: https://huggingface.co/openai/whisper-large-v3

Batching speeds up transcription considerably. The only reason I want to use faster_whisper is that it provides features like verbose output and word-level transcription.

It also supports various input params like best_of, beam_size, etc., all of which are supported by whisper: https://github.com/openai/whisper/blob/main/whisper/transcribe.py

Requesting word-level timestamps, as specified in the Model Card:

result = pipe(sample, return_timestamps="word")
print(result["chunks"])

This should give you word-level timestamps, something like:

{
    "text": " the",
    "timestamp": [
        187.6,
        188.64
    ]
},
{
    "text": " fact",
    "timestamp": [
        188.64,
        188.88
    ]
},
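As a minimal sketch of how you might consume this output: assuming `result["chunks"]` has the shape shown above, a small (hypothetical, not part of transformers) helper can flatten the chunks into `(word, start, end)` tuples:

```python
# Sample chunks copied from the pipeline output shown above.
chunks = [
    {"text": " the", "timestamp": (187.6, 188.64)},
    {"text": " fact", "timestamp": (188.64, 188.88)},
]

def words_with_times(chunks):
    """Yield (word, start, end) for each word-level chunk."""
    for c in chunks:
        start, end = c["timestamp"]
        yield c["text"].strip(), start, end

for word, start, end in words_with_times(chunks):
    print(f"{start:7.2f}-{end:7.2f}  {word}")
```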

This is very similar to what you get with the whisper model, which looks something like:

{
    "id": 0,
    "seek": 0,
    "start": 0.0,
    "end": 3.0,
    "text": " Okay, so I've started recording.",
    "tokens": [50364, 1033, 11, ..., 13, 50524],
    "temperature": 0.0,
    "avg_logprob": -0.43806132332223363,
    "compression_ratio": 1.2953020134228188,
    "no_speech_prob": 0.1916283816099167,
    "words": [
        {
            "word": " Okay,",
            "start": 0.0,
            "end": 0.56,
            "probability": 0.12234115600585938
        },
        ...
        {
            "word": " recording.",
            "start": 2.44,
            "end": 3.0,
            "probability": 0.8062686920166016
        }
    ]
},

While the whisper model does return more information, and some other input params may not yet be available, word-level timestamps are currently possible.

I actually got an error running:

result = pipe(sample, return_timestamps="word")
print(result["chunks"])

The error occurs on GPU; the pipeline code is the same as in the model card.

[screenshot of the error attached]

Use batch_size=1:

results = pipe(sample, batch_size=1, return_timestamps="word")
