Fixed tokenizer.json so it matches Llama-3.1-8B-Instruct's tokenizer.json
#5
opened by Joseph717171
No description provided.
Since you guys trained on top of Llama-3.1-8B-Instruct, I found it odd that your tokenizer.json files were different: Llama-3.1-SuperNova-Lite/tokenizer.json is missing some things that are present in meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json. This PR fixes that.
a="/Users/jsarnecki/opt/Workspace/arcee-ai/Llama-3.1-SuperNova-Lite/tokenizer.json"
b="/Users/jsarnecki/opt/Workspace/meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json"
diff "$a" "$b"
2332,2335c2332,2394
<     "type": "ByteLevel",
<     "add_prefix_space": true,
<     "trim_offsets": false,
<     "use_regex": true
---
>     "type": "Sequence",
>     "processors": [
>       {
>         "type": "ByteLevel",
>         "add_prefix_space": true,
>         "trim_offsets": false,
>         "use_regex": true
>       },
>       {
>         "type": "TemplateProcessing",
>         "single": [
>           {
>             "SpecialToken": {
>               "id": "<|begin_of_text|>",
>               "type_id": 0
>             }
>           },
>           {
>             "Sequence": {
>               "id": "A",
>               "type_id": 0
>             }
>           }
>         ],
>         "pair": [
>           {
>             "SpecialToken": {
>               "id": "<|begin_of_text|>",
>               "type_id": 0
>             }
>           },
>           {
>             "Sequence": {
>               "id": "A",
>               "type_id": 0
>             }
>           },
>           {
>             "SpecialToken": {
>               "id": "<|begin_of_text|>",
>               "type_id": 1
>             }
>           },
>           {
>             "Sequence": {
>               "id": "B",
>               "type_id": 1
>             }
>           }
>         ],
>         "special_tokens": {
>           "<|begin_of_text|>": {
>             "id": "<|begin_of_text|>",
>             "ids": [
>               128000
>             ],
>             "tokens": [
>               "<|begin_of_text|>"
>             ]
>           }
>         }
>       }
>     ]
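For context (not part of this PR), here is a minimal sketch of how the practical effect of the missing post-processor can be checked, assuming the tokenizers Python package is installed; the two file paths are placeholders for the same local checkouts diffed above. With only the ByteLevel post-processor, encoding does not prepend <|begin_of_text|> (token id 128000); with Meta's Sequence + TemplateProcessing post-processor, it does.

from tokenizers import Tokenizer

# Placeholder paths: point these at the same two tokenizer.json files diffed above.
supernova = Tokenizer.from_file("arcee-ai/Llama-3.1-SuperNova-Lite/tokenizer.json")
reference = Tokenizer.from_file("meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json")

for name, tok in (("SuperNova-Lite (pre-fix)", supernova),
                  ("Meta-Llama-3.1-8B-Instruct", reference)):
    # encode() applies the post_processor; only the TemplateProcessing variant
    # prepends the <|begin_of_text|> BOS token (id 128000).
    ids = tok.encode("Hello world").ids
    print(f"{name}: first ids = {ids[:3]}, starts with BOS = {ids[0] == 128000}")

After this PR, both files should produce identical ids.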
Crystalcareai changed pull request status to open
Crystalcareai changed pull request status to merged