<!-- livebook:{"app_settings":{"access_type":"public","auto_shutdown_ms":5000,"multi_session":true,"output_type":"rich","show_source":true,"slug":"tokenizer-generator"}} -->
# Tokenizer generator
```elixir
Mix.install([
{:kino, "~> 0.10.0"},
{:req, "~> 0.4.3"}
])
```
## Info
```elixir
Kino.Markdown.new("""
## Background
HuggingFace repositories store tokenizers in two flavours:

1. "slow tokenizer" - corresponds to a tokenizer implemented in Python
   and stored as `tokenizer_config.json`
2. "fast tokenizer" - corresponds to a tokenizer implemented in Rust
   and stored as `tokenizer.json`

Many repositories only include the files for 1., but the `transformers` library
automatically converts a "slow tokenizer" into a "fast tokenizer" whenever possible.

Bumblebee relies on the Rust bindings and therefore always requires the
`tokenizer.json` file. This app generates that file for any repository that only
ships a "slow tokenizer".
""")
```
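As a quick illustration of the background above, the cell below renders (in the same style as the info card) how a repository that already ships `tokenizer.json` is consumed from Bumblebee. This is a sketch only: it assumes `{:bumblebee, "~> 0.4"}` would be added as a dependency, and `bert-base-uncased` is used purely as an example repository.

````elixir
Kino.Markdown.new("""
For reference, once `tokenizer.json` is available, Bumblebee can load the fast
tokenizer straight from the repository:

```elixir
# Assumes {:bumblebee, "~> 0.4"} in Mix.install/1; "bert-base-uncased" is only an example
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})
Bumblebee.apply_tokenizer(tokenizer, ["Hello world!"])
```
""")
````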
## Generator
```elixir
Kino.Markdown.new("## Tokenizer generator")
```
```elixir
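# Report the installed transformers version (it performs the slow-to-fast conversion)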
{version, 0} =
System.cmd("python", ["-c", "import transformers; print(transformers.__version__, end='')"])
Kino.Markdown.new("""
`transformers: #{version}`
""")
```
```elixir
repo_input = Kino.Input.text("HuggingFace repo")
```
```elixir
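# Halt evaluation until a repository name is entered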
repo = Kino.Input.read(repo_input)
if repo == "" do
  Kino.interrupt!(:normal, "Enter a repository name.")
end
```
```elixir
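# Check via the Hub paths-info API whether tokenizer.json already exists in the repository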
response =
Req.post!("https://huggingface.co/api/models/#{repo}/paths-info/main",
json: %{paths: ["tokenizer.json"]}
)
case response do
%{status: 200, body: []} ->
:ok
%{status: 200, body: [%{"path" => "tokenizer.json"}]} ->
    Kino.interrupt!(:error, "The tokenizer.json file already exists in the given repository.")
_ ->
Kino.interrupt!(:error, "The repository does not exist or requires authentication.")
end
```
```elixir
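# Repository-specific directory under the system temporary directory for the converted files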
output_dir = Path.join(System.tmp_dir!(), repo)
```
````elixir
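# Python script that loads the tokenizer with transformers (converting a slow
# tokenizer to a fast one when possible) and saves it to the output directory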
script = """
import sys
from transformers import AutoTokenizer
repo = sys.argv[1]
output_dir = sys.argv[2]
try:
    # AutoTokenizer converts a slow tokenizer to a fast one whenever possible
    tokenizer = AutoTokenizer.from_pretrained(repo)
    assert tokenizer.is_fast
    # Saving a fast tokenizer writes tokenizer.json into the output directory
    tokenizer.save_pretrained(output_dir)
except Exception as error:
    print(error)
    sys.exit(1)
"""
case System.cmd("python", ["-c", script, repo, output_dir]) do
{_, 0} ->
:ok
{output, _} ->
Kino.Markdown.new("""
```
#{output}
```
""")
|> Kino.render()
Kino.interrupt!(:error, "Tokenizer conversion failed.")
end
````
```elixir
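# Offer the generated tokenizer.json file for download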
tokenizer_path = Path.join(output_dir, "tokenizer.json")
Kino.Download.new(
fn -> File.read!(tokenizer_path) end,
filename: "tokenizer.json",
label: "tokenizer.json"
)
```
`````elixir
Kino.Markdown.new("""
### Next steps
1. Go to https://huggingface.co/#{repo}/upload/main.
2. Upload the `tokenizer.json` file.
3. Add the following description:
````markdown
Generated with:
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("#{repo}")
assert tokenizer.is_fast
tokenizer.save_pretrained("...")
```
````
4. Submit the PR.
""")
`````