---
license: apache-2.0
datasets:
- heegyu/kowikitext
- heegyu/kowiki-sentences
language:
- ko
- en
library_name: transformers
tags:
- pytorch
---
Experimental Repository :)

Contents will be updated without any notice. If you plan to use this repository, please pin a specific `revision` (git commit hash).
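For example (a minimal sketch; the hash below is a placeholder, substitute a real commit hash from this repo's history):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pin both the model and the tokenizer to an exact commit so that
# later updates to this repo do not silently change your results.
model = AutoModelForCausalLM.from_pretrained(
    'beomi/Mistral-Ko-Inst-dev',
    revision='<git-commit-hash>',  # placeholder, not a real hash
)
tokenizer = AutoTokenizer.from_pretrained(
    'beomi/Mistral-Ko-Inst-dev',
    revision='<git-commit-hash>',  # placeholder, not a real hash
)
```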
This experiment aims to:
- Maintain the NLU capability of the Mistral-Instruct model (mistralai/Mistral-7B-Instruct-v0.1)
- Adapt a new Korean vocabulary seamlessly (a rough sketch of the usual approach follows this list)
- Use a minimal dataset (Korean Wikipedia only)
- Stay computationally efficient
- Let the model answer from its English knowledge and NLU capability even when the question and answer are in Korean only
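The card itself does not document the adaptation procedure, so the following is only a minimal sketch of the usual vocabulary-extension recipe in `transformers`; the token list is illustrative, and in practice the new subwords would come from a tokenizer trained on the Korean corpus:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = 'mistralai/Mistral-7B-Instruct-v0.1'
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Illustrative tokens only; real ones would be learned from Korean Wikipedia.
num_added = tokenizer.add_tokens(['안녕하세요', '대한민국', '스타벅스'])

# Grow the embedding (and LM head) matrix to cover the added vocabulary.
# Training can then focus on the new rows, which keeps the method cheap.
model.resize_token_embeddings(len(tokenizer))
```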
Here is a quick test:
```python
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    'beomi/Mistral-Ko-Inst-dev',
    torch_dtype='auto',
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('beomi/Mistral-Ko-Inst-dev')

pipe = pipeline(
    'text-generation',
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    max_new_tokens=350,
    return_full_text=False,
    no_repeat_ngram_size=6,
    eos_token_id=1,  # not yet tuned to generate </s>; use <s> instead
)

def gen(x):
    chat = tokenizer.apply_chat_template([
        {"role": "user", "content": x},
        # {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
        # {"role": "user", "content": "Do you have mayonnaise recipes? please say in Korean."}
    ], tokenize=False)
    print(pipe(chat)[0]['generated_text'].strip())
gen("μ€νλ²
μ€μ μ€νλ²
μ€ μ½λ¦¬μμ μ°¨μ΄λ?")
# (μμ± μμ)
# μ€νλ²
μ€λ μ μΈκ³μ μΌλ‘ μ΄μνκ³ μλ μ»€νΌ μ λ¬Έμ¬μ΄λ€. νκ΅μλ μ€νλ²
μ€ μ½λ¦¬μλΌλ μ΄λ¦μΌλ‘ μ΄μλκ³ μλ€.
# μ€νλ²
μ€ μ½λ¦¬μλ λνλ―Όκ΅μ μ
μ ν μ΄ν 2009λ
κ³Ό 2010λ
μ λ μ°¨λ‘μ λΈλλκ³Όμ μ¬κ²ν λ° μλ‘μ΄ λμμΈμ ν΅ν΄ μλ‘μ΄ λΈλλλ€. μ»€νΌ μ λ¬Έμ ν리미μ μ΄λ―Έμ§λ₯Ό μ μ§νκ³ μκ³ , μ€νλ²
μ€ μ½λ¦¬μλ νκ΅μ λννλ ν리미μ μ»€νΌ μ λ¬Έ λΈλλμ λ§λ€κ³ μλ€.
``` |
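One quick way to check that the new vocabulary took effect is to compare token counts against the base tokenizer (a rough sketch, assuming the dev tokenizer already includes the Korean subwords):

```python
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-Instruct-v0.1')
ko_tok = AutoTokenizer.from_pretrained('beomi/Mistral-Ko-Inst-dev')

text = "스타벅스와 스타벅스 코리아의 차이는?"
print(len(base_tok(text)['input_ids']))  # base tokenizer: many byte-level pieces per Hangul syllable
print(len(ko_tok(text)['input_ids']))    # extended tokenizer: fewer, Korean-aware tokens
```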