metadata
language:
- en
- ko
license: llama3
library_name: transformers
base_model:
- meta-llama/Meta-Llama-3-8B
Hansung Bllossom | Demo | Developer 김태민 (Taemin Kim) | Github
We are releasing Hansung-Bllossom-8B, a model trained on Hansung University question-answering (QA) data.
It was trained on top of MLP-KTLim/llama-3-Korean-Bllossom-8B.
The Bllossom language model is a Korean-English bilingual language model based on the open-source Llama 3. It strengthens the link between Korean and English knowledge and has the following features:
- Knowledge Linking: Linking Korean and English knowledge through additional training
- Vocabulary Expansion: Expanding the Korean vocabulary to improve Korean expressiveness
- Instruction Tuning: Tuning on custom-made instruction-following data specialized for the Korean language and Korean culture
- Human Feedback: Direct Preference Optimization (DPO) has been applied
- Vision-Language Alignment: Aligning the vision transformer with this language model
Example code
Install Dependencies
pip install torch transformers==4.40.0 accelerate
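After installing, it can help to confirm which versions actually ended up in the environment, since the examples here pin transformers to 4.40.0. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of this model card or of any of the listed packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("torch", "transformers", "accelerate")):
    """Map each package name to its installed version, or None if it is missing."""
    report = {}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            report[name] = None
    return report


report = installed_versions()
for name, found in report.items():
    print(f"{name}: {found or 'not installed'}")
```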
Python code with Pipeline
import transformers
import torch
model_id = "kfkas/Hansung-Bllossom-8B"
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
pipeline.model.eval()
# System prompt in Korean followed by the same instruction in English.
PROMPT = '''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.
You are a helpful AI assistant, you'll need to answer users' queries in a friendly and accurate manner.'''
# "What festivals or events are held at Hansung University?"
instruction = "한성대학교에서는 어떤 축제나 행사가 열리나요?"
messages = [
    {"role": "system", "content": f"{PROMPT}"},
    {"role": "user", "content": f"{instruction}"}
]
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = pipeline(
    prompt,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)
print(outputs[0]["generated_text"][len(prompt):])
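To make the `apply_chat_template` step less opaque, here is a rough sketch of the prompt string it is expected to produce. It assumes this checkpoint ships with the stock Llama 3 chat template, which is consistent with the `<|eot_id|>` terminator used above but is not confirmed by this card. The function `build_llama3_prompt` is a hypothetical illustration; the tokenizer's own template remains the authoritative source.

```python
def build_llama3_prompt(messages, add_generation_prompt=True):
    """Render chat messages in the Llama 3 prompt layout (illustrative only)."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the trimmed content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model to start its reply here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example = build_llama3_prompt([
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "When was Hansung University founded?"},
])
print(example)
```

Because every turn, including the model's reply, ends with `<|eot_id|>`, that token is listed as a terminator alongside the tokenizer's regular end-of-sequence token.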
Python code with AutoModel
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = 'kfkas/Hansung-Bllossom-8B'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()
# System prompt in Korean followed by the same instruction in English.
PROMPT = '''당신은 유용한 AI 어시스턴트입니다. 사용자의 질의에 대해 친절하고 정확하게 답변해야 합니다.
You are a helpful AI assistant, you'll need to answer users' queries in a friendly and accurate manner.'''
# "When was Hansung University founded?"
instruction = "한성대학교는 언제 설립되었나요?"
messages = [
    {"role": "system", "content": f"{PROMPT}"},
    {"role": "user", "content": f"{instruction}"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
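Both examples sample with `do_sample=True`, `temperature=0.6`, and `top_p=0.9`. The sketch below illustrates, in plain Python, what those two settings do to a next-token distribution. It is a simplified illustration of temperature scaling and nucleus (top-p) filtering, not the code that transformers runs internally, and the function name `top_p_filter` and the toy logits are made up for the example.

```python
import math


def top_p_filter(logits, temperature=0.6, top_p=0.9):
    """Return {token_index: probability} for the tokens kept by top-p sampling."""
    # Temperature below 1.0 sharpens the distribution toward the likeliest tokens.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    return {i: probs[i] / mass for i in kept}


nucleus = top_p_filter([2.0, 1.0, 0.0, -1.0])
print(nucleus)
```

With these toy logits only the two most likely tokens survive the filter, which is why lowering `top_p` or `temperature` makes answers more deterministic, while raising them allows more varied output.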
Citation
Language Model
@misc{bllossom,
  author = {ChangSu Choi and Yongbin Jeong and Seoyoon Park and InHo Won and HyeonSeok Lim and SangMin Kim and Yejee Kang and Chanhyuk Yoon and Jaewan Park and Yiseul Lee and HyeJin Lee and Younggyun Hahm and Hansaem Kim and KyungTae Lim},
  title = {Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean},
  year = {2024},
  journal = {LREC-COLING 2024},
  paperLink = {\url{https://arxiv.org/pdf/2403.10882}},
}
Vision-Language Model
@misc{bllossom-V,
  author = {Dongjae Shin and Hyunseok Lim and Inho Won and Changsu Choi and Minjun Kim and Seungwoo Song and Hangyeol Yoo and Sangmin Kim and Kyungtae Lim},
  title = {X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment},
  year = {2024},
  publisher = {GitHub},
  journal = {NAACL 2024 findings},
  paperLink = {\url{https://arxiv.org/pdf/2403.11399}},
}
Contact
- 김태민 (Taemin Kim), Intelligent System.
[email protected]
Contributor
- 김태민 (Taemin Kim), Intelligent System.
[email protected]