Upload tokenizer.json (#16)
opened by jonatanklosko
Generated with:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/whisper-large-v3")
assert tokenizer.is_fast
tokenizer.save_pretrained("...")
```
Looks good to me! cc @sanchit-gandhi
As discussed with @ArthurZ on the PR, the fast tokenizer can always be loaded from the slow one: https://github.com/huggingface/transformers/pull/27338/files#r1384935617. So there's no issue with not having the tokenizer.json. That said, happy to merge this PR to improve clarity for the Hub weights.
@sanchit-gandhi yeah, the thing is that the Rust huggingface/tokenizers library can only load tokenizer.json. In the Elixir ecosystem we have bindings to huggingface/tokenizers and so rely solely on fast tokenizers :)
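To illustrate the point above: tokenizer.json is a single self-contained serialization (vocab, normalizer, pre-tokenizer, etc.) that the Rust huggingface/tokenizers library reads back directly, with no slow-tokenizer fallback. A minimal sketch using a tiny word-level tokenizer as a stand-in for a real one (the file name and vocab here are just for illustration):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Build a tiny word-level tokenizer as a stand-in for a real one.
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Serialize the whole tokenizer to the single-file format that the
# Rust library (and hence its Elixir bindings) can load back directly.
tokenizer.save("tokenizer.json")
reloaded = Tokenizer.from_file("tokenizer.json")

print(reloaded.encode("hello world").ids)  # expect [1, 2]
```

This is the same file format that the `save_pretrained` call above writes out as `tokenizer.json` for fast tokenizers.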
Thanks for the explanation! Makes sense - let's merge this one then @ArthurZ @patrickvonplaten
patrickvonplaten changed pull request status to merged