# Tokenizer generator

```elixir
Mix.install([
  {:kino, "~> 0.10.0"},
  {:req, "~> 0.4.3"}
])
```

## Info

```elixir
Kino.Markdown.new("""
## Background

HuggingFace repositories store tokenizers in two flavours:

  1. "slow tokenizer" - corresponds to a tokenizer implemented in Python
     and stored as `tokenizer_config.json`

  2. "fast tokenizer" - corresponds to a tokenizer implemented in Rust
     and stored as `tokenizer.json`

Many repositories only include files for 1., but the `transformers` library
automatically converts a "slow tokenizer" to a "fast tokenizer" whenever possible.

Bumblebee relies on the Rust bindings and therefore always requires the `tokenizer.json` file.
This app generates that file for any repository that only has the "slow tokenizer".
""")
```

## Generator

```elixir
Kino.Markdown.new("## Converter")
```

```elixir
# Requires a `python` executable with the `transformers` package installed.
{version, 0} =
  System.cmd("python", ["-c", "import transformers; print(transformers.__version__, end='')"])

Kino.Markdown.new("""
`transformers: #{version}`
""")
```

```elixir
repo_input = Kino.Input.text("HuggingFace repo")
```

```elixir
repo = Kino.Input.read(repo_input)

if repo == "" do
  Kino.interrupt!(:normal, "Enter repository.")
end
```

```elixir
# Ask the Hub whether tokenizer.json is already present on the main branch.
response =
  Req.post!("https://huggingface.co/api/models/#{repo}/paths-info/main",
    json: %{paths: ["tokenizer.json"]}
  )

case response do
  %{status: 200, body: []} ->
    :ok

  %{status: 200, body: [%{"path" => "tokenizer.json"}]} ->
    Kino.interrupt!(:error, "The tokenizer.json file already exists in the given repository.")

  _ ->
    Kino.interrupt!(:error, "The repository does not exist or requires authentication.")
end
```

```elixir
output_dir = Path.join(System.tmp_dir!(), repo)
```

````elixir
# Load the tokenizer with transformers, which converts it to the fast
# variant when possible, then save it (including tokenizer.json) to output_dir.
script = """
import sys
from transformers import AutoTokenizer

repo = sys.argv[1]
output_dir = sys.argv[2]

try:
  tokenizer = AutoTokenizer.from_pretrained(repo)
  assert tokenizer.is_fast
  tokenizer.save_pretrained(output_dir)
except Exception as error:
  print(error)
  sys.exit(1)
"""

case System.cmd("python", ["-c", script, repo, output_dir]) do
  {_, 0} ->
    :ok

  {output, _} ->
    Kino.Markdown.new("""
    ```
    #{output}
    ```
    """)
    |> Kino.render()

    Kino.interrupt!(:error, "Tokenizer conversion failed.")
end
````

```elixir
tokenizer_path = Path.join(output_dir, "tokenizer.json")

Kino.Download.new(
  fn -> File.read!(tokenizer_path) end,
  filename: "tokenizer.json",
  label: "tokenizer.json"
)
```

`````elixir
Kino.Markdown.new("""
### Next steps

1. Go to https://huggingface.co/#{repo}/upload/main.

2. Upload the `tokenizer.json` file.

3. Add the following description:

    ````markdown
    Generated with:

    ```python
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("#{repo}")
    assert tokenizer.is_fast
    tokenizer.save_pretrained("...")
    ```
    ````

4. Submit the PR.
""")
`````
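
The repository check earlier in this notebook pattern-matches directly on the HTTP response, so it is hard to exercise without a network call. The following is a minimal sketch of the same decision as a pure function. The `TokenizerCheck` module, the `classify/1` function, and the returned atoms are hypothetical names introduced for illustration and are not part of the original notebook.

```elixir
defmodule TokenizerCheck do
  # Hypothetical helper for illustration only; the module name, function
  # name, and returned atoms are assumptions, not part of the notebook.

  # A 200 with an empty list means tokenizer.json is missing from the
  # repository, so generating it is worthwhile.
  def classify(%{status: 200, body: []}), do: :missing

  # A 200 that lists tokenizer.json means there is nothing to generate.
  def classify(%{status: 200, body: [%{"path" => "tokenizer.json"}]}), do: :present

  # Any other response is treated as a nonexistent or private repository,
  # mirroring the catch-all clause of the original check.
  def classify(_response), do: :unavailable
end

# Plain maps stand in for the response struct here, so no request is made.
IO.inspect(TokenizerCheck.classify(%{status: 200, body: []}))
# => :missing
IO.inspect(TokenizerCheck.classify(%{status: 200, body: [%{"path" => "tokenizer.json"}]}))
# => :present
IO.inspect(TokenizerCheck.classify(%{status: 404, body: ""}))
# => :unavailable
```

Because map patterns also match structs, the same function should accept the value returned by `Req.post!/2` unchanged; the notebook cell could then branch on the returned atom and call `Kino.interrupt!/2` only at the edge.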