Italia 9B - Instruct v0.1
Introduction
For more details on Italia and iGenius, please visit our website and read our release blog post.
Subscribe to our newsletter to receive updates on our latest AI model advancements.
Italia is a family of Open Source large language models developed by iGenius, designed for companies operating in the public and private sectors.
The first model in our series is Italia 9B, a foundational LLM with a 9-billion-parameter Transformer architecture, developed in collaboration with Cineca and released under the MIT license.
The Italia family of models has been designed for companies operating in highly regulated sectors, such as financial services or public administration. Even in its first version, it is a unique LLM: although specialized in a single language, the high number of parameters combined with the quality of the training process makes it the ideal choice for the most critical use cases in the enterprise world, where the reliability of generated content is of paramount importance.
As the name suggests, Italia is equipped with excellent linguistic formulation capabilities in Italian. This doesn't just include vocabulary and sentence structure, but also cultural and historical knowledge of the country, which are essential for applications requiring advanced proficiency in the Italian language.
Data security and information reliability have always been priorities for iGenius. We have invested in building a high-quality Italian dataset to develop a truly open, transparent, and secure language model, in compliance with European AI regulations such as the AI Act.
Terms of Use: link
Authors: The iGenius Team
Model release date: 04 July 2024
Status: Visit the iGenius website for more information and updates. This is a static model trained on an offline dataset. Future versions of the fine-tuned models will be released as we improve model safety based on community feedback.
Hardware and Software
Thanks to the partnership with Cineca, we had the opportunity to train and fine-tune Italia 9B on a large scale using thousands of GPUs on the Leonardo supercomputer, one of the most advanced and high-performing computing infrastructures in the world.
Training
Italia 9B was trained from scratch on trillions of tokens, using a heterogeneous mix of data: public sources, synthetic data, and domain-specific content provided by our commercial partners. Trained on native Italian text rather than on translations from English, Italia 9B can capture Italian linguistic and cultural nuances with unprecedented precision.
More than 90% of the pre-training data for Italia consists of Italian text, with the remaining portion in English. This enables Italia to be fully proficient in English and perform well in translation tasks. Additionally, the model has undergone a post-training process that includes both supervised fine-tuning and direct preference optimization to enhance instruction-following capabilities and ensure robust safety measures.
The pretraining data has a cutoff date of December 2023: all textual information used to train the model was collected up to that point, so the model reflects the most recent linguistic and contextual knowledge available at the time of training.
Benchmarks
Most existing benchmarks for evaluating the performance of language models are designed for the English-speaking ecosystem: their questions reflect elements, concepts, and structures typical of American and British cultures, which are not represented in native Italian training sources. We are collaborating with leading institutions in Italy to develop a benchmarking system tailored specifically to native Italian models. Nevertheless, Italia demonstrated nearly state-of-the-art performance among models of a similar size when assessed against benchmarks testing common sense, language understanding, and logical reasoning. Here are the benchmark results generated with EleutherAI's lm-evaluation-harness.
| Benchmark | Italia 9B - Instruct - v0.1 |
| --- | --- |
| xcopa_it | 0.73 |
| lambada_openai_mt_it (perplexity) | 40.6 |
| lambada_openai_mt_it (acc) | 0.43 |
| m_mmlu_it (5-shot) | 0.42 |
| arc_it (5-shot) | 0.43 |
| belebele_ita_Latn (5-shot) | 0.46 |
| hellaswag_it (5-shot) | 0.55 |
| truthfulqa_it_mc1 (0-shot) | 0.30 |
| truthfulqa_it_mc2 (0-shot) | 0.42 |
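As a rough sketch, the 5-shot scores above should be reproducible with the harness's Python API (assuming lm-eval v0.4 or later; task availability and exact scores depend on the harness version and hardware):

import lm_eval

# Evaluate the published checkpoint on the 5-shot Italian tasks from the table.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=iGeniusAI/Italia-9B-Instruct-v0.1,trust_remote_code=True",
    tasks=["m_mmlu_it", "arc_it", "hellaswag_it"],
    num_fewshot=5,
)
print(results["results"])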
How to use
Use with transformers
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

model_id = "iGeniusAI/Italia-9B-Instruct-v0.1"

# The model uses a custom architecture, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

t_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
    return_full_text=False,
    top_p=0.95,
    top_k=50,
)

# System prompt (Italian): "Your name is Modello Italia. You are an artificial
# intelligence, a natural language model trained by iGenius on Leonardo, one of
# the most powerful supercomputers in the world."
SYSTEM_PROMPT = """Il tuo nome è Modello Italia. Tu sei un'intelligenza artificiale, un modello di linguaggio naturale addestrato da iGenius su Leonardo, uno dei supercomputer più potenti al mondo."""
TEMPERATURE = 0.3
MAX_NEW_TOKENS = 250

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Ciao come stai?"},  # "Hi, how are you?"
]

# Render the conversation with the model's chat template (see Chat Format
# below); add_generation_prompt appends the <|assistant|> tag so the model
# knows it should answer.
conv_template = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

outputs = t_pipeline(
    conv_template,
    max_new_tokens=MAX_NEW_TOKENS,
    do_sample=True,
    temperature=TEMPERATURE,
    num_return_sequences=1,
)
print(outputs[0]["generated_text"])
Chat Format
Italia 9B Instruct is fine-tuned to follow instructions provided by a user, so for best results it is necessary to use the chat format as follows:
<|system|>
Your system prompt.</s>
<|user|>
user request.</s>
<|assistant|>
For example (here the user asks for a Python function that generates random numbers):
<|system|>
Il tuo nome è Modello Italia. Tu sei un'intelligenza artificiale, un modello di linguaggio naturale addestrato da iGenius su Leonardo, uno dei supercomputer più potenti al mondo.</s>
<|user|>
Scrivi una funzione python che genera numeri random.</s>
<|assistant|>
where the model generates its reply after <|assistant|>, and </s> is the EOS token.
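As a minimal sketch, the documented format can also be built by hand instead of via apply_chat_template (the tags and </s> come from the template above; exact whitespace is an assumption, so prefer the tokenizer's built-in template in practice):

# Build the documented chat format manually. The build_prompt helper is
# hypothetical; newline placement is an assumption based on the layout above.
def build_prompt(system: str, user: str) -> str:
    return (
        f"<|system|>\n{system}</s>\n"
        f"<|user|>\n{user}</s>\n"
        f"<|assistant|>\n"
    )

prompt = build_prompt(
    "Il tuo nome è Modello Italia.",  # system prompt (Italian)
    "Scrivi una funzione python che genera numeri random.",  # user request
)
# The model completes the text after <|assistant|> and stops at </s>.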
Intended Use
Italia 9B is a large language model (LLM) designed for both commercial and research purposes, focusing on the Italian language. It is versatile and adaptable, making it suitable for a wide range of applications across different domains and industries. Whether used for automated content generation or domain-specific research, this model can be fine-tuned for a variety of natural language processing tasks. It excels in enterprise environments, providing secure, efficient, and accurate AI solutions for business problem-solving. This LLM is ideal for use cases that demand high reliability and precision, making it a valuable tool for companies seeking advanced AI capabilities.
In the field of Natural Language Processing (NLP) research, Italia serves as a foundation for researchers to experiment with NLP techniques, develop algorithms, and contribute to the advancement of the field. This opens up numerous opportunities for academic and practical innovations.
Users can use Italia as a base model for text generation or fine-tune it for specific downstream tasks; a minimal fine-tuning sketch follows the list below. However, they should consider several key aspects in accordance with the MIT license:
- Attribution: The MIT license requires you to include the full text of the license and the copyright notice in any distributed files. The copyright notice and the MIT license must be incorporated in any projects utilizing the model.
- Limitation of Liability: The MIT license contains a disclaimer clause that limits the liability of the authors or contributors for any damages resulting from the use of the software. This implies that no warranty or liability is provided for the software’s use.
- Sharing Modifications: The MIT license does not mandate sharing modifications made to the software. Users are free to modify the model for fine-tuning without an obligation to share these modifications with the community.
- Compatibility: The MIT license is highly permissive and compatible with many other open-source licenses. Nonetheless, it is crucial to verify the compatibility of the MIT license with any other software or libraries used alongside the model.
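As a hypothetical sketch of parameter-efficient fine-tuning, the checkpoint can be wrapped with LoRA adapters via the peft library (the target_modules value below is an assumption and must be matched to the attention layer names in Italia's actual architecture):

# Hypothetical LoRA fine-tuning setup; requires `pip install peft transformers`.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "iGeniusAI/Italia-9B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

lora_config = LoraConfig(
    r=8,                                 # low-rank adapter dimension
    lora_alpha=16,                       # adapter scaling factor
    target_modules=["query_key_value"],  # assumption: adjust to the model's layer names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable

The wrapped model can then be trained with any standard causal-LM training loop (for example the transformers Trainer) on a task-specific dataset, keeping the base weights frozen.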
Out of Scope
Italia should not be used for applications related to the following categories:
- Violations of law: Any use that may violate local, national, or international laws and regulations.
- Infringement of privacy: Any use that may compromise the privacy or personal data of individuals without their consent.
- Malicious activities: Applications intended to harm, deceive, or exploit individuals or groups, including but not limited to phishing, fraud, or cyberattacks.
- Misinformation: Spreading false or misleading information, particularly in sensitive contexts such as health, safety, and public policy.
- Discriminatory practices: Uses that contribute to discrimination or unfair treatment of individuals based on race, gender, age, nationality, or other protected characteristics.
- Coding tasks: Tasks related to generating or interpreting source code.
Limitations
The core values of iGenius are openness, helpfulness, and fairness. We aim to serve everyone and address a wide range of use cases. Italia is designed to be accessible to people from diverse backgrounds, experiences, and perspectives. It respects all users, emphasizing the importance of free thought and expression, which drive innovation and progress.
However, Italia is a new technology, and there are risks associated with its use. Testing conducted to date has not been able to cover all scenarios. For these reasons, as with all LLMs, Italia’s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses. We recommend that developers perform safety testing before deploying any applications based on Italia.
Contributors
The iGenius Team.
Special thanks to Cineca and their team for their invaluable support and the use of the Leonardo supercomputer in developing our model. This collaboration shows how valuable partnerships can benefit society, businesses, and individuals alike, fostering creativity and innovation to help our country thrive.