Update README.md

Commit 430ce7c by alasdairforsythe (1 parent: 98ae018)

README.md CHANGED
@@ -5,38 +5,23 @@ license: mit

 The documentation and code is available on Github [alasdairforsythe/tokenmonster](https://github.com/alasdairforsythe/tokenmonster).

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-|-----------------|------------|--------|-------------
-| english-100256 | 100256 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/english-100256.vocab)
-| english-65536 | 65536 | UTF-8 | in-progress
-| english-50256 | 50256 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/english-50256.vocab)
-| english-40000 | 40000 | UTF-8 | in-progress
-| english-32000 | 32000 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/english-32000.vocab)
-| english-24000 | 24000 | UTF-8 | in-progress
-| code-100256 | 100256 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/code-100256.vocab)
-| code-65536 | 65536 | UTF-8 | in-progress
-| code-50256 | 50256 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/code-50256.vocab)
-| code-40000 | 40000 | UTF-8 | in-progress
-| code-32000 | 32000 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/code-32000.vocab)
-| code-24000 | 24000 | UTF-8 | in-progress
-
-in-progress vocabularies will be released 1 per day.
+The prebuilt vocabularies are all available for download [here](https://huggingface.co/alasdairforsythe/tokenmonster/tree/main/vocabs).
+
+**July 3:** TokenMonster v1.0 has been released. The "420" prebuilt vocabularies are being released as they are completed, at a rate of around 10 per day. Let me know if there's one you want and I can prioritize it.
+
+Choose a dataset from:
+`code` `english` `englishcode` `fiction`
+
+Choose a vocab size from:
+`1024` `2048` `4096` `8000` `16000` `24000` `32000` `40000` `50256` `65536` `100256`
+
+Choose an optimization mode from:
+`unfiltered` `clean` `balanced` `consistent` `strict`
+
+For a capcode disabled vocabulary add:
+`nocapcode`
+
+And finally add the version number:
+`v1`
+
+Examples: `fiction-24000-consistent-v1` `code-4096-clean-nocapcode-v1`
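
The naming scheme described in the added lines can be sketched as a small helper. This is only an illustration: `vocab_name` and the constant names are hypothetical and not part of TokenMonster, and the component order (dataset, size, mode, optional `nocapcode`, version) is inferred from the two examples in the README.

```python
# Hypothetical helper, not part of TokenMonster: builds a prebuilt vocabulary
# name from the options listed in the README. The hyphen-joined order is an
# assumption inferred from the README's two examples.
DATASETS = ("code", "english", "englishcode", "fiction")
VOCAB_SIZES = (1024, 2048, 4096, 8000, 16000, 24000,
               32000, 40000, 50256, 65536, 100256)
MODES = ("unfiltered", "clean", "balanced", "consistent", "strict")


def vocab_name(dataset, size, mode, capcode=True, version="v1"):
    """Validate each choice against the README's lists and join with hyphens."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if size not in VOCAB_SIZES:
        raise ValueError(f"unknown vocab size: {size!r}")
    if mode not in MODES:
        raise ValueError(f"unknown optimization mode: {mode!r}")

    parts = [dataset, str(size), mode]
    if not capcode:
        # Capcode-disabled vocabularies carry an extra "nocapcode" component.
        parts.append("nocapcode")
    parts.append(version)
    return "-".join(parts)


if __name__ == "__main__":
    print(vocab_name("fiction", 24000, "consistent"))
    print(vocab_name("code", 4096, "clean", capcode=False))
```

Run as a script, this prints the two example names from the README. Whether a given combination has actually been published yet is not something the helper can know, since the vocabularies are being released gradually.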