sho-takase committed on
Commit 6b4e753
2 Parent(s): 14d5c52 8c0be10
Files changed (1)
  1. README.md +72 -70
README.md CHANGED
---
license: mit
language:
- ja
- en
---

# Sarashina2-7B

This repository provides large language models trained by [SB Intuitions](https://www.sbintuitions.co.jp/).


## How to use

Please set **use_fast=False** to use our tokenizer properly.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed

model = AutoModelForCausalLM.from_pretrained("sbintuitions/sarashina2-7b", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-7b", use_fast=False)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
set_seed(123)

text = generator(
    "おはようございます、今日の天気は",
    max_length=30,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    num_return_sequences=3,
)

for t in text:
    print(t)

# These examples were generated by the sarashina2-7b model.
# {'generated_text': 'おはようございます、今日の天気は晴れです。ちょっと風が強い。\n昨日は、久しぶりにゆっくりとしていました。\n2週間位間があいてしまったかも、でもその間に'}
# {'generated_text': 'おはようございます、今日の天気は曇。朝は曇っていてどんよりしていましたね。昼からは晴れそうですが。気温は徐々に上昇しています。昨日は春らしい陽気でした。'}
# {'generated_text': 'おはようございます、今日の天気はくもり、少し寒気がします。 この土日に、家族で一泊二日で旅行に行ってきました。といっても、100キロ'}
```
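The pipeline call above is the documented entry point. If you prefer to drive generation directly, an equivalent `model.generate` call looks roughly like the following minimal sketch; the sampling settings (`max_new_tokens`, `temperature`) are illustrative choices, not values from this model card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Same checkpoint and tokenizer settings as above.
model = AutoModelForCausalLM.from_pretrained(
    "sbintuitions/sarashina2-7b", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-7b", use_fast=False)

# Encode a raw prompt and place it on the same device as the model's first layer.
inputs = tokenizer("おはようございます、今日の天気は", return_tensors="pt").to(model.device)

# Sample one continuation; max_new_tokens and temperature are illustrative values.
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=32,
        do_sample=True,
        temperature=0.8,
        pad_token_id=tokenizer.pad_token_id,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```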

## Configuration

| Parameters | Vocab size | Training tokens | Architecture | Position type | Layers | Hidden dim | Attention heads |
| :-----: | :-----------: | :-------------: | :------------ | :-----------: | :----: | :--------: | :-------------: |
| [7B](https://huggingface.co/sbintuitions/sarashina2-7b) | 102400 | 2.1T | Llama2 | RoPE | 32 | 4096 | 32 |
| [13B](https://huggingface.co/sbintuitions/sarashina2-13b) | 102400 | 2.1T | Llama2 | RoPE | 40 | 5120 | 40 |
| 70B (TBA) | | | | | | | |
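To cross-check the table against a downloaded checkpoint, the architecture values can be read from the model configuration. This is a minimal sketch assuming the repository ships a standard Llama-style `config.json`; the field names below follow the usual `transformers` Llama config.

```python
from transformers import AutoConfig

# Downloads only config.json, not the weights.
config = AutoConfig.from_pretrained("sbintuitions/sarashina2-7b")

print(config.vocab_size)           # 102400 per the table above
print(config.num_hidden_layers)    # 32 for the 7B model
print(config.hidden_size)          # 4096 for the 7B model
print(config.num_attention_heads)  # 32 for the 7B model
```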

## Training Corpus

For the Japanese training data, we used the Japanese portion of the [Common Crawl corpus](https://commoncrawl.org/), which is the largest Web corpus.
To clean the training corpus, we used [CCNet](https://github.com/facebookresearch/cc_net) and [HojiChar](https://github.com/HojiChar/HojiChar).
After cleaning, our Japanese training data contains about 1T tokens.

For the English training data, we extracted English documents from [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B), but we removed the books3 corpus due to copyright infringement.

## Tokenization

We use a [sentencepiece](https://github.com/google/sentencepiece) tokenizer with a unigram language model and byte-fallback.
We do not apply pre-tokenization with a Japanese tokenizer.
Thus, a user may directly feed raw sentences into the tokenizer, as in the sketch below.
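Because no Japanese pre-tokenization is applied, raw text goes straight into the tokenizer. A minimal sketch follows; the exact token boundaries in the output depend on the learned unigram model and are not guaranteed by this card.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sbintuitions/sarashina2-7b", use_fast=False)

# Raw sentences can be fed directly; no word segmentation is required beforehand.
print(tokenizer.tokenize("おはようございます、今日の天気は晴れです。"))

# Byte-fallback covers characters missing from the 102,400-entry vocabulary,
# so encoding followed by decoding round-trips the input text.
ids = tokenizer.encode("おはようございます、今日の天気は晴れです。", add_special_tokens=False)
print(tokenizer.decode(ids))
```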


## Ethical Considerations and Limitations

Sarashina2 has not been tuned to follow instructions yet.
Therefore, sarashina2 might generate meaningless sequences, inaccurate statements, or biased/objectionable outputs.
Before using sarashina2, we would like developers to tune the models based on human preferences and safety considerations.

## License

[MIT License](https://huggingface.co/sbintuitions/sarashina2-7b/blob/main/LICENSE)