	---
	language:
	- it
	library_name: transformers
	tags:
	- pretrained
	- biomedical
	- text-generation
	- medical
	base_model: sapienzanlp/Minerva-1B-base-v1.0
	datasets:
	- IVN-RIN/BioBERT_Italian
	- Detsutut/medmcqa-ita
	pipeline_tag: text-generation
	widget:
	- text: 'I batteri della famiglia Bacteroides sono importanti per '
	  example_title: Example 1
	license: apache-2.0
	extra_gated_prompt: >-
	  This is a pretrained model that should be fine-tuned to perform downstream
	  tasks. You agree to not use the model to conduct experiments that cause harm
	  to human subjects, or to perform any medical-related task.
	extra_gated_fields:
	  Company: text
	  Country: country
	  Specific date: date_picker
	  I want to use this model for:
	    type: select
	    options:
	    - Research
	    - Education
	    - label: Other
	      value: other
	  I have read and understood the 'Bias, Risks, and Limitations' section of the model card: checkbox
	extra_gated_heading: Acknowledge terms and conditions to accept the repository
	extra_gated_description: Our team may take 2-3 days to process your request
	extra_gated_button_content: Acknowledge
	---

	# Igea-1B-v0.1 ⚕️🩺

	Igea is a biomedical Small Language Model (SLM) for Italian, continually pretrained from [Minerva](https://huggingface.co/sapienzanlp/Minerva-1B-base-v1.0) on [PubMed abstracts translated into Italian with neural machine translation (NMT)](https://huggingface.co/datasets/IVN-RIN/BioBERT_Italian).

	🔓: Access to the model is granted only after you explicitly acknowledge that you have read the 'Bias, Risks, and Limitations' section of this model card.

	This is ongoing research. Do not use it for any medical-related tasks.

	Preprint: [Igea: a Decoder-Only Language Model for Biomedical Text Generation in Italian](https://arxiv.org/abs/2407.06011).

	## How to use Igea with Hugging Face transformers

	```python
	import transformers
	import torch

	model_id = "bmi-labmedinfo/Igea-1B-v0.1"

	# Initialize the pipeline.
	pipeline = transformers.pipeline(
	    "text-generation",
	    model=model_id,
	    model_kwargs={"torch_dtype": torch.bfloat16},
	    device_map="auto",
	)

	# Input text for the model.
	input_text = "Il fegato è "

	# Compute the outputs.
	output = pipeline(
	    input_text,
	    max_new_tokens=128,
	)

	# Output:
	# [{'generated_text': "Il fegato è una ghiandola fondamentale per il metabolismo umano, la più [...]"}]
	```

	## 🚨⚠️🚨 Bias, Risks, and Limitations 🚨⚠️🚨
	This section identifies foreseeable harms and misunderstandings.

	This model is the result of continued pretraining of a foundation model and has not undergone alignment. The model may:

	- Overrepresent some viewpoints and underrepresent others
	- Contain stereotypes
	- Contain personal information
	- Generate:
	  - Racist and sexist content
	  - Hateful, abusive, or violent language
	  - Discriminatory or prejudicial language
	  - Content that may not be appropriate for all settings, including sexual content
	- Make errors, including presenting incorrect information or historical claims as if they were factual
	- Generate irrelevant or repetitive outputs

	We are aware of the biases and potentially problematic or toxic content that current pretrained large language models exhibit. More specifically, as probabilistic models of (Italian and English) language, they reflect and amplify the biases of their training data.

	The biomedical setting poses additional threats, including:

	- Disparities in research focus, demographic representation, and reporting standards
	- Reinforcement of existing medical paradigms while overlooking emerging or alternative viewpoints, which can hinder innovation and comprehensive care
	- Generation of incorrect information and false claims, potentially leading to incorrect medical decisions

	This model is therefore not intended to be used as is for any medical-related task.

	## Training and evaluation data

	The model achieves the following results on the evaluation set:
	- Loss: 1.6976
	- Accuracy: 0.6011

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 8
	- eval_batch_size: 8
	- seed: 42
	- distributed_type: multi-GPU
	- num_devices: 4
	- gradient_accumulation_steps: 2
	- total_train_batch_size: 64
	- total_eval_batch_size: 32
	- optimizer: Adam with betas=(0.9,0.95) and epsilon=1e-08
	- lr_scheduler_type: cosine
	- lr_scheduler_warmup_ratio: 0.02
	- num_epochs: 1
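
As a sanity check on the list above, the total batch sizes follow from the per-device values, and the learning rate schedule can be sketched in a few lines. This is a minimal sketch, not the training code: `lr_at` is a hypothetical helper that mirrors the usual linear-warmup-then-cosine-decay shape, and `total_steps=50_000` is an illustrative assumption (roughly the last step logged for this run), not a value stated in this card.

```python
import math

# Values copied from the hyperparameter list above.
train_batch_size = 8             # per-device training batch size
eval_batch_size = 8              # per-device evaluation batch size
num_devices = 4
gradient_accumulation_steps = 2

# Gradient accumulation only applies to training, not to evaluation.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices

print(total_train_batch_size)  # 64
print(total_eval_batch_size)   # 32


def lr_at(step, total_steps, base_lr=5e-5, warmup_ratio=0.02):
    """Hypothetical helper: linear warmup followed by cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# total_steps is an assumption made for illustration only.
for step in (0, 500, 1_000, 25_000, 50_000):
    print(step, lr_at(step, total_steps=50_000))
```

Under that assumption, warmup would cover about the first 1,000 optimizer steps, after which the learning rate decays smoothly from 5e-05 towards zero.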

	### Training results

	| Training Loss | Epoch  | Step  | Validation Loss | Accuracy |
	|:-------------:|:------:|:-----:|:---------------:|:--------:|
	| 1.8964        | 0.0989 | 5000  | 1.8924          | 0.5713   |
	| 1.8265        | 0.1978 | 10000 | 1.8264          | 0.5809   |
	| 1.7883        | 0.2966 | 15000 | 1.7892          | 0.5866   |
	| 1.7652        | 0.3955 | 20000 | 1.7626          | 0.5905   |
	| 1.7415        | 0.4944 | 25000 | 1.7418          | 0.5939   |
	| 1.7259        | 0.5933 | 30000 | 1.7253          | 0.5965   |
	| 1.7106        | 0.6922 | 35000 | 1.7126          | 0.5985   |
	| 1.703         | 0.7910 | 40000 | 1.7037          | 0.6000   |
	| 1.6969        | 0.8899 | 45000 | 1.6989          | 0.6009   |
	| 1.6963        | 0.9888 | 50000 | 1.6976          | 0.6011   |
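
The validation losses above can be easier to interpret as perplexities. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated explicitly in this card; `perplexity` is a helper introduced here for illustration.

```python
import math

# (step, validation loss) pairs copied from the table above.
validation_loss = [
    (5000, 1.8924),
    (10000, 1.8264),
    (15000, 1.7892),
    (20000, 1.7626),
    (25000, 1.7418),
    (30000, 1.7253),
    (35000, 1.7126),
    (40000, 1.7037),
    (45000, 1.6989),
    (50000, 1.6976),
]


def perplexity(loss_nats):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(loss_nats)


for step, loss in validation_loss:
    print(f"step {step:>5}: loss={loss:.4f}  perplexity={perplexity(loss):.2f}")
```

Under this assumption, the final validation loss of 1.6976 corresponds to a perplexity of about 5.46.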


	### Framework versions

	- Transformers 4.40.2
	- Pytorch 2.3.0+cu121
	- Datasets 2.19.1
	- Tokenizers 0.19.1

	### Recommendations

	Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed to make further recommendations.

	## Evaluation

	The table below reports normalized accuracy for the Igea models on biomedical and general-domain benchmarks translated into Italian. The best-performing Minerva checkpoint is included for comparison.

	| Dataset               | Domain  | Minerva 3B (best base) | Igea 350M | Igea 1B | Igea 3B |
	|:---------------------:|:-------:|:----------------------:|:---------:|:-------:|:-------:|
	| MedMCQA-ITA (0-shot)  | Biomed  | 0.293                  | 0.250     | 0.307   | 0.313   |
	| Hellaswag-IT (0-shot) | General | 0.519                  | 0.303     | 0.357   | 0.491   |
	| ARC-IT (0-shot)       | General | 0.305                  | 0.244     | 0.270   | 0.287   |
	| MMLU-IT (5-shot)      | General | 0.261                  | 0.254     | 0.255   | 0.252   |
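
This card does not describe how normalized accuracy was computed. A common convention for multiple-choice benchmarks is to score each candidate answer by its log-likelihood divided by its length, so that longer answers are not penalized simply for having more tokens. The sketch below illustrates that convention only; the function names, the answer strings, and the log-likelihood values are all made up for illustration and are not taken from the evaluation above.

```python
def raw_choice(loglikelihoods):
    """Index of the answer with the highest unnormalized log-likelihood."""
    return max(range(len(loglikelihoods)), key=loglikelihoods.__getitem__)


def normalized_choice(choices, loglikelihoods):
    """Index of the answer with the highest log-likelihood per UTF-8 byte."""
    scores = [
        ll / len(text.encode("utf-8"))
        for text, ll in zip(choices, loglikelihoods)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(predictions, gold):
    """Fraction of questions where the predicted index matches the gold index."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical question with four answers of different lengths.
choices = ["fegato", "ghiandola surrenale", "rene", "pancreas esocrino"]
loglikelihoods = [-9.0, -19.0, -8.0, -20.0]  # made-up summed token log-probs

print(raw_choice(loglikelihoods))                  # 2: the shortest answer wins
print(normalized_choice(choices, loglikelihoods))  # 1: best score per byte
```

In this toy example the two scoring rules pick different answers, which is why plain and normalized accuracy can diverge on the same benchmark.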

	## Credits

	Developed by [Tommaso M. Buonocore](https://huggingface.co/Detsutut) and [Simone Rancati](https://huggingface.co/SimoRancati).

	Thanks to [Michele Montebovi](https://huggingface.co/DeepMount00) for his valuable advice.