+---
+tags:
+    - text-generation
+    - pytorch
+inference: false
+license: cc-by-4.0
+language:
+    - pt
+pipeline_tag: text-generation
+library_name: transformers
+---
+<p align="center">
+  <img width="250" alt="Canarim Logo" src="https://raw.githubusercontent.com/DominguesM/Canarim-Instruct-PTBR/main/assets/canarim.png">
+</p>
+<hr>
+# `Canarim-7B`
+Canarim-7B is a Portuguese language model developed by [Maicon Domingues](https://nlp.rocks).
+## Model description
+The model was pretrained on 16 billion tokens from the Portuguese subset of [CommonCrawl 2023-23](https://huggingface.co/datasets/dominguesm/CC-MAIN-2023-23), starting from the weights of LLaMA2-7B. The pretraining data has a cutoff of mid-2023.
+## Key Features
+-   **Language:** Specialized in understanding and generating Portuguese text, making it ideal for applications targeting Portuguese-speaking audiences.
+-   **Architecture:** Inherits the LLaMA2-7B architecture, since pretraining started from the LLaMA2-7B weights.
+-   **Diverse Dataset:** The pretraining dataset includes a wide range of topics and writing styles, enhancing the model's ability to understand various contexts and nuances in Portuguese.
+## Applications
+Canarim-7B was trained solely on a language modeling objective and has not been fine-tuned for instruction following. It is therefore better suited to few-shot tasks than to zero-shot tasks: the model tends to perform better when the prompt includes a few examples of the desired outcome. Here are some practical applications:
+-   **Natural Language Understanding (NLU):** Efficient in tasks such as sentiment analysis, topic classification, and entity recognition in Portuguese text, especially when relevant examples are provided.
+-   **Natural Language Generation (NLG):** Capable of generating coherent and contextually relevant text, useful for content creation, chatbots, and more, with improved results when provided examples of the desired style or format.
+-   **Language Translation:** Suitable for translation between Portuguese and other languages, especially when examples of the desired translations are included in the prompt or used for fine-tuning.
+### Tips for Efficient Use
+-   **Few-shot Learning:** When using Canarim-7B for specific tasks, it is beneficial to provide a few relevant examples. This helps the model better understand the context and purpose of the task.
+-   **Contextualization:** Including additional context in the input can significantly improve the quality of the model’s predictions and text generation.
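The few-shot tip above can be sketched as a small prompt builder. This is a minimal illustration, not a format the model requires: the function name `build_few_shot_prompt`, the `Texto:` / `Sentimento:` field labels, and the sample sentences are all assumptions made for this example. The idea is to place a few solved examples before the new input and leave the last field empty so that the model completes it.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format (text, label) pairs and a final query as one plain-text prompt.

    The "Texto:" / "Sentimento:" labels are hypothetical; any consistent
    pattern should work, since the model simply continues the text.
    """
    blocks = []
    if instruction:
        blocks.append(instruction.strip())
    for text, label in examples:
        blocks.append(f"Texto: {text}\nSentimento: {label}")
    # Leave the final label empty so the model fills it in.
    blocks.append(f"Texto: {query}\nSentimento:")
    return "\n\n".join(blocks)


# Illustrative examples (not taken from any dataset).
examples = [
    ("Adorei o filme, recomendo muito!", "positivo"),
    ("O atendimento foi péssimo.", "negativo"),
]
prompt = build_few_shot_prompt(
    examples,
    "A comida estava deliciosa.",
    instruction="Classifique o sentimento de cada texto.",
)
print(prompt)
```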
+---
+## Getting Started
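+To start using Canarim-7B with the Transformers library, first install it if you haven't already, together with PyTorch and Accelerate (the example below imports `torch` and uses `device_map="auto"`, which requires Accelerate):
+```bash
+pip install transformers torch accelerate
+```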
+You can then load the model using the Transformers library. Here's a simple example of how to use the model for text generation using the `pipeline` function:
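+```python
+from transformers import AutoTokenizer, pipeline
+import torch
+
+model_id = "dominguesm/canarim-7b"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+# Load the model in half precision and let Accelerate place it on the available devices.
+pipe = pipeline(
+    "text-generation",
+    model=model_id,
+    torch_dtype=torch.float16,
+    device_map="auto",
+)
+
+# Canarim-7B is a base model, so the prompt is plain text for it to continue.
+# (Illustrative prompt; replace it with your own.)
+prompt = "A capital do Brasil é"
+
+sequences = pipe(
+    prompt,
+    do_sample=True,
+    num_return_sequences=1,
+    eos_token_id=tokenizer.eos_token_id,
+    max_length=2048,  # upper bound on prompt tokens plus generated tokens
+    temperature=0.9,
+    top_p=0.6,
+    repetition_penalty=1.15,
+)
+
+# Each result is a dict whose "generated_text" holds the prompt and its continuation.
+for seq in sequences:
+    print(seq["generated_text"])
+```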
+This code snippet demonstrates how to generate text with Canarim-7B. You can customize the input text and adjust parameters like `max_length` according to your requirements.
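By default the `text-generation` pipeline returns a list of dicts whose `"generated_text"` contains the prompt followed by the continuation. A base model also tends to keep writing past the answer you wanted. The sketch below shows one way to post-process that output. The helper name `extract_continuation` and the choice of a blank line as the stop string are assumptions for illustration, and the `sequences` value is a hand-written stand-in rather than real model output.

```python
def extract_continuation(generated_text, prompt, stop="\n\n"):
    """Return only the text generated after the prompt, cut at the first stop string."""
    # The pipeline echoes the prompt by default, so drop it when present.
    if generated_text.startswith(prompt):
        continuation = generated_text[len(prompt):]
    else:
        continuation = generated_text
    # Keep only the text before the first stop string (here, a blank line).
    if stop:
        continuation = continuation.split(stop, 1)[0]
    return continuation.strip()


# Hand-written stand-in for the list of dicts returned by `pipe(prompt, ...)`.
prompt = "Texto: A comida estava deliciosa.\nSentimento:"
sequences = [
    {"generated_text": prompt + " positivo\n\nTexto: O hotel era barulhento."}
]
answer = extract_continuation(sequences[0]["generated_text"], prompt)
print(answer)  # prints: positivo
```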
+## License
+Canarim-7B is released under the [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). This license allows others to copy, distribute, remix, adapt, and build upon the work, even commercially, as long as they credit the original creation.
