davanstrien HF staff committed on
Commit
f403786
1 Parent(s): f12f979

Add language information to model metadata

Thanks for sharing this incredible model! I've suggested language tags for the metadata section of the model based on the languages outlined in https://blog.salesforceairesearch.com/xgen/:

> For Wikipedia, we cover 22 languages: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk, ja, zh, more than LLaMA (20 languages) and MPT (English only).

Since most tokens in the training data are English, you might prefer to list only English. I also didn't see in your blog post whether you evaluated downstream performance for non-English languages, so you may prefer a different subset of languages from the one I have selected.
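
For anyone who wants to re-derive the proposed tags rather than retype them, here is a minimal sketch that pulls the two-letter codes out of the quoted sentence and renders them as a YAML `language:` list with English first, matching the order used in this change. The helper names `extract_language_codes` and `front_matter_language_block` are hypothetical, and this is not necessarily how the list in this commit was produced.

```python
import re

# Sentence quoted from the XGen blog post, as cited in the description above.
QUOTE = (
    "For Wikipedia, we cover 22 languages: bg, ca, cs, da, de, en, es, fr, "
    "hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk, ja, zh, more than "
    "LLaMA (20 languages) and MPT (English only)."
)


def extract_language_codes(text: str) -> list[str]:
    """Return the two-letter codes listed after the first colon, in order.

    Hypothetical helper: it assumes every code is a lowercase two-letter
    token immediately followed by a comma, which holds for QUOTE.
    """
    listing = text.split(":", 1)[1]
    return re.findall(r"\b([a-z]{2}),", listing)


def front_matter_language_block(codes: list[str]) -> str:
    """Render a YAML `language:` list, moving `en` to the front."""
    ordered = ["en"] + [code for code in codes if code != "en"]
    lines = ["language:"] + [f"- {code}" for code in ordered]
    return "\n".join(lines)


if __name__ == "__main__":
    print(front_matter_language_block(extract_language_codes(QUOTE)))
```

The printed block can be pasted between the `---` delimiters of the model card's front matter, and trimming the list (for example to `en` only) is a one-line change to the input.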

Files changed (1)
  1. README.md +24 -1
README.md CHANGED
@@ -1,5 +1,28 @@
  ---
  license: apache-2.0
+ language:
+ - en
+ - bg
+ - ca
+ - cs
+ - da
+ - de
+ - es
+ - fr
+ - hr
+ - hu
+ - it
+ - nl
+ - pl
+ - pt
+ - ro
+ - ru
+ - sl
+ - sr
+ - sv
+ - uk
+ - ja
+ - zh
  ---

  # XGen-7B-8K-Base
@@ -60,4 +83,4 @@ print(tokenizer.decode(sample[0]))
  year={2023},
  url={https://blog.salesforceairesearch.com/xgen}
  }
- ```
+ ```