<!-- livebook:{"app_settings":{"access_type":"public","auto_shutdown_ms":5000,"multi_session":true,"output_type":"rich","show_source":true,"slug":"tokenizer-generator"}} -->

# Tokenizer generator

```elixir
Mix.install([
  {:kino, "~> 0.10.0"},
  {:req, "~> 0.4.3"}
])
```
## Info

```elixir
Kino.Markdown.new("""
## Background

HuggingFace repositories store tokenizers in two flavours:

1. "slow tokenizer" - a tokenizer implemented in Python
   and stored as `tokenizer_config.json`

2. "fast tokenizer" - a tokenizer implemented in Rust
   and stored as `tokenizer.json`

Many repositories only include the files for 1., but the `transformers` library
automatically converts a "slow tokenizer" to a "fast tokenizer" whenever possible.

Bumblebee relies on the Rust bindings and therefore always requires the
`tokenizer.json` file. This app generates that file for any repository that only
ships the "slow tokenizer".
""")
```
## Generator

```elixir
Kino.Markdown.new("## Converter")
```

```elixir
{version, 0} =
  System.cmd("python", ["-c", "import transformers; print(transformers.__version__, end='')"])

Kino.Markdown.new("""
`transformers: #{version}`
""")
```

```elixir
repo_input = Kino.Input.text("HuggingFace repo")
```
```elixir
repo = Kino.Input.read(repo_input)

if repo == "" do
  Kino.interrupt!(:normal, "Enter repository.")
end
```
```elixir
response =
  Req.post!("https://huggingface.co/api/models/#{repo}/paths-info/main",
    json: %{paths: ["tokenizer.json"]}
  )

case response do
  %{status: 200, body: []} ->
    :ok

  %{status: 200, body: [%{"path" => "tokenizer.json"}]} ->
    Kino.interrupt!(:error, "The tokenizer.json file already exists in the given repository.")

  _ ->
    Kino.interrupt!(:error, "The repository does not exist or requires authentication.")
end
```
```elixir
output_dir = Path.join(System.tmp_dir!(), repo)
```
````elixir
script = """
import sys
from transformers import AutoTokenizer

repo = sys.argv[1]
output_dir = sys.argv[2]

try:
    tokenizer = AutoTokenizer.from_pretrained(repo)
    assert tokenizer.is_fast
    tokenizer.save_pretrained(output_dir)
except Exception as error:
    print(error)
    exit(1)
"""

case System.cmd("python", ["-c", script, repo, output_dir]) do
  {_, 0} ->
    :ok

  {output, _} ->
    Kino.Markdown.new("""
    ```
    #{output}
    ```
    """)
    |> Kino.render()

    Kino.interrupt!(:error, "Tokenizer conversion failed.")
end
````
```elixir
tokenizer_path = Path.join(output_dir, "tokenizer.json")

Kino.Download.new(
  fn -> File.read!(tokenizer_path) end,
  filename: "tokenizer.json",
  label: "tokenizer.json"
)
```
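Before uploading, it can be worth a quick sanity check that the generated file parses as JSON and carries the top-level `"model"` key that Rust `tokenizer.json` files include. A minimal Python sketch (the `looks_like_fast_tokenizer` helper is hypothetical, not part of this app or of `transformers`):

```python
import json


def looks_like_fast_tokenizer(path):
    """Heuristic check: the file parses as JSON and has the top-level
    "model" key found in Rust tokenizer.json files. A sketch only;
    real files carry further keys such as "added_tokens"."""
    with open(path) as f:
        data = json.load(f)
    return isinstance(data, dict) and "model" in data
```

A `tokenizer_config.json` (the "slow tokenizer" config) has no `"model"` key, so this also guards against uploading the wrong file.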
`````elixir
Kino.Markdown.new("""
### Next steps

1. Go to https://huggingface.co/#{repo}/upload/main.

2. Upload the `tokenizer.json` file.

3. Add the following description:

   ````markdown
   Generated with:

   ```python
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("#{repo}")
   assert tokenizer.is_fast
   tokenizer.save_pretrained("...")
   ```
   ````

4. Submit the PR.
""")
`````