---
license: apache-2.0
datasets:
- heegyu/kowikitext
- heegyu/kowiki-sentences
language:
- ko
- en
library_name: transformers
tags:
- pytorch
---

Experimental Repository :)

Contents will be updated without notice. If you plan to use this repository, please pin a specific `revision` (git commit hash) when loading.
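
For example, a minimal sketch of pinning a revision with `transformers` (`abc1234` below is a placeholder, not a real commit hash from this repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pin an exact commit so future updates to this repo cannot change behavior.
# 'abc1234' is a placeholder -- replace it with a real commit hash.
model = AutoModelForCausalLM.from_pretrained(
    'beomi/Mistral-Ko-Inst-dev',
    revision='abc1234',
)
tokenizer = AutoTokenizer.from_pretrained(
    'beomi/Mistral-Ko-Inst-dev',
    revision='abc1234',
)
```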

This experiment aims to:

- Maintain the NLU capability of the Mistral-Instruct model (mistralai/Mistral-7B-Instruct-v0.1)
- Adapt a new Korean vocabulary seamlessly (a token-count comparison sketch follows this list)
- Use a minimal dataset (Korean Wikipedia only)
- Remain computationally efficient
- Let the model answer using its English knowledge and NLU capability even when the question and answer are in Korean only
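
A quick way to see the effect of the vocabulary adaptation is to compare token counts between the base Mistral tokenizer and the extended one in this repo. This is a minimal sketch, assuming both tokenizers load as shown; exact counts will vary by revision:

```python
from transformers import AutoTokenizer

# Base Mistral tokenizer vs. the Korean-extended tokenizer from this repo.
base = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.1')
ko = AutoTokenizer.from_pretrained('beomi/Mistral-Ko-Inst-dev')

text = "μŠ€νƒ€λ²…μŠ€μ™€ μŠ€νƒ€λ²…μŠ€ μ½”λ¦¬μ•„μ˜ μ°¨μ΄λŠ”?"  # "What is the difference between Starbucks and Starbucks Korea?"

# Fewer tokens per Korean sentence means cheaper generation and a longer
# effective context window for Korean input.
print("base tokens:", len(base.tokenize(text)))
print("ko tokens:  ", len(ko.tokenize(text)))
```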

Here's a quick test:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model = AutoModelForCausalLM.from_pretrained(
    'beomi/Mistral-Ko-Inst-dev',
    torch_dtype='auto',
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('beomi/Mistral-Ko-Inst-dev')

pipe = pipeline(
    'text-generation', 
    model=model, 
    tokenizer=tokenizer, 
    do_sample=True,
    max_new_tokens=350, 
    return_full_text=False,
    no_repeat_ngram_size=6,
    eos_token_id=1, # model is not yet tuned to emit </s>; stop on <s> (token id 1) instead.
)


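# Helper: wrap a single user turn in the tokenizer's chat template and
# print only the newly generated text (return_full_text=False above).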
def gen(x):
    chat = tokenizer.apply_chat_template([
        {"role": "user", "content": x},
        # {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
        # {"role": "user", "content": "Do you have mayonnaise recipes? please say in Korean."}
    ], tokenize=False)
    print(pipe(chat)[0]['generated_text'].strip())

gen("μŠ€νƒ€λ²…μŠ€μ™€ μŠ€νƒ€λ²…μŠ€ μ½”λ¦¬μ•„μ˜ μ°¨μ΄λŠ”?")

# (sample generation)
# μŠ€νƒ€λ²…μŠ€λŠ” μ „ μ„Έκ³„μ μœΌλ‘œ μš΄μ˜ν•˜κ³  μžˆλŠ” 컀피 전문사이닀. ν•œκ΅­μ—λŠ” μŠ€νƒ€λ²…μŠ€ μ½”λ¦¬μ•„λΌλŠ” μ΄λ¦„μœΌλ‘œ 운영되고 μžˆλ‹€.
# μŠ€νƒ€λ²…μŠ€ μ½”λ¦¬μ•„λŠ” λŒ€ν•œλ―Όκ΅­μ— μž…μ ν•œ 이후 2009λ…„κ³Ό 2010년에 두 μ°¨λ‘€μ˜ λΈŒλžœλ“œκ³Όμ˜ μž¬κ²€ν†  및 μƒˆλ‘œμš΄ λ””μžμΈμ„ 톡해 μƒˆλ‘œμš΄ λΈŒλžœλ“œλ‹€. 컀피 μ „λ¬Έμ˜ 프리미엄 이미지λ₯Ό μœ μ§€ν•˜κ³  있고, μŠ€νƒ€λ²…μŠ€ μ½”λ¦¬μ•„λŠ” ν•œκ΅­μ„ λŒ€ν‘œν•˜λŠ” 프리미엄 컀피 μ „λ¬Έ λΈŒλžœλ“œμ„ λ§Œλ“€κ³  μžˆλ‹€.
# Rough English translation: "Starbucks is a specialty coffee company that operates worldwide.
# In Korea it operates under the name Starbucks Korea. Since entering Korea, Starbucks Korea has,
# through two rounds of brand review and redesign in 2009 and 2010, become a new brand. It maintains
# a premium specialty-coffee image, and Starbucks Korea is building a premium coffee brand that
# represents Korea."
```