--- library_name: peft license: apache-2.0 language: - mn - en tags: - Mongolian - QLora - Llama3 - Instructed-model pipeline_tag: text-generation --- ## Mongolian-Llama3 ![ Alt Text](Llama.jpg) ### Model Description Mongolian-Llama3 implementation in Chat UI [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1LC0xx4i9xqFmwn9l8T6vw25RIr-BP0Tq?usp=sharing]) Mongolian-Llama3 is the first open source instruction-tuned language model for Mongolian & English users with various abilities such as roleplaying & tool-using built upon the quantized Meta-Llama-3-8B model. Developed by: Dorjzodovsuren License: Llama-3 License Base Model: llama-3-8b-bnb-4bit Model Size: 4.65B Context length: 8K ## Bias, Risks, and Limitations To combat fake news, current strategies rely heavily on synthetic and translated data. However, these approaches have inherent biases, risks, and limitations: 1. **Synthetic Data Bias**: Algorithms may inadvertently perpetuate biases present in training data. 2. **Translation Inaccuracy**: Translations can distort meaning or lose context, leading to misinformation. 3. **Cultural Nuances**: Synthetic and translated data may miss cultural intricacies, risking amplification of stereotypes. 4. **Algorithmic Limits**: Effectiveness is constrained by algorithm capabilities and training data quality. 5. **Dependency on Data**: Accuracy hinges on quality and representativeness of training data. 6. **Adversarial Attacks**: Malicious actors can exploit vulnerabilities to manipulate content. 7. **Different answer based on language**: Answer might be a bit different based on language. ### Recommendations Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. Due to hallucinations and pretraining datasets characteristics, some information might be misleading, and answer might be a bit different based on language. Please ask in Mongolian if possible. ## How to Get Started with the Model Use the code below to get started with the model. ```python import torch import gradio as gr from threading import Thread from peft import PeftModel, PeftConfig from unsloth import FastLanguageModel from transformers import TextStreamer from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer config = PeftConfig.from_pretrained("Dorjzodovsuren/Mongolian_llama3") model = AutoModelForCausalLM.from_pretrained("unsloth/llama-3-8b-bnb-4bit", torch_dtype = torch.float16) model = PeftModel.from_pretrained(model, "Dorjzodovsuren/Mongolian_llama3") #load tokenizer tokenizer = AutoTokenizer.from_pretrained("Dorjzodovsuren/Mn_llama3") alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: {} ### Input: {} ### Response: {}""" # Enable native 2x faster inference FastLanguageModel.for_inference(model) # Create a text streamer text_streamer = TextStreamer(tokenizer, skip_prompt=False,skip_special_tokens=True) # Get the device based on GPU availability device = 'cuda' if torch.cuda.is_available() else 'cpu' # Move model into device model = model.to(device) class StopOnTokens(StoppingCriteria): def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool: stop_ids = [29, 0] for stop_id in stop_ids: if input_ids[0][-1] == stop_id: return True return False # Current implementation does not support conversation based on previous conversation. # Highly recommend to experiment on various hyper parameters to compare qualities. def predict(message, history): stop = StopOnTokens() messages = alpaca_prompt.format( message, "", "", ) model_inputs = tokenizer([messages], return_tensors="pt").to(device) streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True) generate_kwargs = dict( model_inputs, streamer=streamer, max_new_tokens=1024, top_p=0.95, temperature=0.001, repetition_penalty=1.1, stopping_criteria=StoppingCriteriaList([stop]) ) t = Thread(target=model.generate, kwargs=generate_kwargs) t.start() partial_message = "" for new_token in streamer: if new_token != '<': partial_message += new_token yield partial_message gr.ChatInterface(predict).launch(debug=True, share=True, show_api=True) ```