---
language:
- en
- es
---

# Model Card for Carpincho-30b

This is the Carpincho-30B QLoRA 4-bit checkpoint, an instruction-tuned LLM based on LLaMA-30B. It is trained to answer in colloquial Argentine Spanish. It was trained on two RTX 3090 GPUs (48 GB total) for 120 hours using the Hugging Face QLoRA code (4-bit quantization).

## Model Details

The model is provided as a LoRA adapter (PEFT format); a sketch for merging the adapter into the base weights appears at the end of this card.

## Usage

Here is example inference code. You will need to install the following requirements:

```
bitsandbytes==0.39.0
transformers @ git+https://github.com/huggingface/transformers.git
peft @ git+https://github.com/huggingface/peft.git
accelerate @ git+https://github.com/huggingface/accelerate.git
einops==0.6.1
evaluate==0.4.0
scikit-learn==1.2.2
sentencepiece==0.1.99
wandb==0.15.3
```

```
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, LlamaTokenizer

model_name = "models/huggyllama_llama-30b/"
adapters_name = "carpincho-30b-qlora"

print(f"Starting to load the model {model_name} into memory")

# Load the base model with 4-bit quantization, then apply the LoRA adapter.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map="sequential",
)

print(f"Loading {adapters_name} into memory")

model = PeftModel.from_pretrained(model, adapters_name)

tokenizer = LlamaTokenizer.from_pretrained(model_name)
tokenizer.bos_token_id = 1

print(f"Successfully loaded the model {model_name} into memory")

def main(tokenizer):
    # Alpaca-style instruction prompt used for fine-tuning.
    prompt = '''Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
%s

### Response:
''' % "Hola, como estas?"

    batch = tokenizer(prompt, return_tensors="pt")
    batch = {k: v.cuda() for k, v in batch.items()}

    with torch.no_grad():
        generated = model.generate(
            inputs=batch["input_ids"],
            do_sample=True,
            use_cache=True,
            repetition_penalty=1.1,
            max_new_tokens=100,
            temperature=0.9,
            top_p=0.95,
            top_k=40,
            return_dict_in_generate=True,
        )
    result_text = tokenizer.decode(generated["sequences"].cpu().tolist()[0])
    print(result_text)

main(tokenizer)
```

### Model Description

- **Developed by:** Alfredo Ortega (@ortegaalfredo)
- **Model type:** 30B LLM fine-tuned with QLoRA
- **Language(s) (NLP):** English and colloquial Argentine Spanish
- **License:** Free for non-commercial use, but I'm not the police.
- **Finetuned from model:** https://huggingface.co/huggyllama/llama-30b

### Model Sources

- **Repository:** https://huggingface.co/huggyllama/llama-30b
- **Paper:** https://arxiv.org/abs/2302.13971

## Uses

This is a generic LLM chatbot that can be used to interact directly with humans. For interactive use, responses can also be streamed token by token; see the streaming sketch at the end of this card.

## Bias, Risks, and Limitations

This bot is uncensored and may provide shocking answers. It also reflects biases present in the training material.

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.

## Model Card Contact

Contact the creator, @ortegaalfredo, on Twitter or GitHub.
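
## Streaming Responses

For interactive chat, `transformers`' `TextIteratorStreamer` can print tokens as they are generated instead of waiting for the full completion. The sketch below is illustrative rather than part of the original inference code: it assumes the `model` and `tokenizer` objects from the Usage section are already loaded, and the helper name `stream_response` is hypothetical.

```
from threading import Thread

from transformers import TextIteratorStreamer

def stream_response(model, tokenizer, instruction):
    # Same Alpaca-style prompt as in the Usage section.
    prompt = '''Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
%s

### Response:
''' % instruction
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # generate() blocks until done, so run it in a background thread and
    # consume decoded text chunks from the streamer in the main thread.
    thread = Thread(target=model.generate, kwargs=dict(
        **inputs,
        streamer=streamer,
        do_sample=True,
        max_new_tokens=100,
        temperature=0.9,
        top_p=0.95,
        top_k=40,
        repetition_penalty=1.1,
    ))
    thread.start()
    for text in streamer:
        print(text, end="", flush=True)
    print()
    thread.join()

stream_response(model, tokenizer, "Hola, como estas?")
```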
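
## Merging the LoRA Adapter

Since this card ships only the adapter, you can optionally fold it into the base weights for standalone deployment using `peft`'s `merge_and_unload()`. This is a minimal sketch, assuming the same paths as the Usage section; the output directory name `carpincho-30b-merged` is illustrative. Note that merging requires loading the base model in half precision (not 4-bit), which needs roughly 60 GB of memory for a 30B model.

```
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, LlamaTokenizer

base_name = "models/huggyllama_llama-30b/"
adapters_name = "carpincho-30b-qlora"

# Load the base model in fp16; merging cannot be done on 4-bit weights.
base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapters_name)

# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()
merged.save_pretrained("carpincho-30b-merged")
LlamaTokenizer.from_pretrained(base_name).save_pretrained("carpincho-30b-merged")
```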