alasdairforsythe committed on
Commit
430ce7c
1 Parent(s): 98ae018

Update README.md

Files changed (1)
  1. README.md +20 -35
README.md CHANGED
@@ -5,38 +5,23 @@ license: mit

  The documentation and code is available on Github [alasdairforsythe/tokenmonster](https://github.com/alasdairforsythe/tokenmonster).

- Trained models can be downloaded from here:
-
- #### With capcode
- | Name | Vocab Size | Charset | Availablity
- |-------------------------|------------|-------|--------------
- | english-100256-capcode | 100256 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/english-100256-capcode.vocab)
- | english-65536-capcode | 65536 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/english-65536-capcode.vocab)
- | english-50256-capcode | 50256 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/english-50256-capcode.vocab)
- | english-40000-capcode | 40000 | UTF-8 | in-progress
- | english-32000-capcode | 32000 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/english-32000-capcode.vocab)
- | english-24000-capcode | 24000 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/english-24000-capcode.vocab)
- | code-100256-capcode | 100256 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/code-100256-capcode.vocab)
- | code-65536-capcode | 65536 | UTF-8 | in-progress
- | code-50256-capcode | 50256 | UTF-8 | in-progress
- | code-40000-capcode | 40000 | UTF-8 | in-progress
- | code-32000-capcode | 32000 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/code-32000-capcode.vocab)
- | code-24000-capcode | 24000 | UTF-8 | in-progress
-
- #### Without capcode
- | Name | Vocab Size | Charset | Availablity
- |-----------------|------------|--------|-------------
- | english-100256 | 100256 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/english-100256.vocab)
- | english-65536 | 65536 | UTF-8 | in-progress
- | english-50256 | 50256 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/english-50256.vocab)
- | english-40000 | 40000 | UTF-8 | in-progress
- | english-32000 | 32000 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/english-32000.vocab)
- | english-24000 | 24000 | UTF-8 | in-progress
- | code-100256 | 100256 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/code-100256.vocab)
- | code-65536 | 65536 | UTF-8 | in-progress
- | code-50256 | 50256 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/code-50256.vocab)
- | code-40000 | 40000 | UTF-8 | in-progress
- | code-32000 | 32000 | UTF-8 | [download](https://huggingface.co/alasdairforsythe/tokenmonster/resolve/main/code-32000.vocab)
- | code-24000 | 24000 | UTF-8 | in-progress
-
- in-progress vocabularies will be released 1 per day.
+ The prebuilt vocabularies are all available for download [here](https://huggingface.co/alasdairforsythe/tokenmonster/tree/main/vocabs).
+
+ **July 3:** TokenMonster v1.0 has been released. The "420" prebuilt vocabularies are being released as they are completed, at a rate of around 10 per day. Let me know if there's one you want and I can prioritize it.
+
+ Choose a dataset from:
+ `code` `english` `englishcode` `fiction`
+
+ Choose a vocab size from:
+ `1024` `2048` `4096` `8000` `16000` `24000` `32000` `40000` `50256` `65536` `100256`
+
+ Choose an optimization mode from:
+ `unfiltered` `clean` `balanced` `consistent` `strict`
+
+ For a capcode disabled vocabulary add:
+ `nocapcode`
+
+ And finally add the version number:
+ `v1`
+
+ Examples: `fiction-24000-consistent-v1` `code-4096-clean-nocapcode-v1`
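
The naming scheme introduced by the added lines can be sketched as a small helper. This is only an illustration and not part of TokenMonster: the function `vocab_name` and its parameters are hypothetical, and the component order (dataset, vocab size, optimization mode, optional `nocapcode`, version) is inferred from the two examples in the new README text.

```python
# Hypothetical helper, not part of TokenMonster: assembles a prebuilt
# vocabulary name from the choices listed in the updated README.
DATASETS = ("code", "english", "englishcode", "fiction")
VOCAB_SIZES = (1024, 2048, 4096, 8000, 16000, 24000,
               32000, 40000, 50256, 65536, 100256)
MODES = ("unfiltered", "clean", "balanced", "consistent", "strict")


def vocab_name(dataset: str, size: int, mode: str,
               capcode: bool = True, version: str = "v1") -> str:
    """Join the components with hyphens, in the order the README examples use."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if size not in VOCAB_SIZES:
        raise ValueError(f"unknown vocab size: {size!r}")
    if mode not in MODES:
        raise ValueError(f"unknown optimization mode: {mode!r}")

    parts = [dataset, str(size), mode]
    if not capcode:
        # Capcode is the default; only the disabled variant is marked.
        parts.append("nocapcode")
    parts.append(version)
    return "-".join(parts)


if __name__ == "__main__":
    print(vocab_name("fiction", 24000, "consistent"))
    print(vocab_name("code", 4096, "clean", capcode=False))
```

The two calls in the `__main__` block reproduce the README's own examples. The helper only checks that each component is one of the listed choices; whether a given combination has actually been published is not something it can know, so the vocabs directory linked above remains the place to confirm availability.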