m42-health/Llama3-Med42-8B

@@ -1,201 +1,211 @@
 ---
-library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
+language:
+- en
+license: llama3
+tags:
+- m42
+- health
+- healthcare
+- clinical-llm
+pipeline_tag: text-generation
+inference: false
+license_name: llama3
 ---
+# **Med42-v2 - A Suite of Clinically-aligned Large Language Models**
+Med42-v2 is a suite of open-access clinical large language models (LLMs), instruction- and preference-tuned by M42 to expand access to medical knowledge. Built on Llama3 and comprising either 8 or 70 billion parameters, these generative AI systems provide high-quality answers to medical questions.
+## Key performance metrics:
+- Med42-v2-70B outperforms GPT-4.0 on most of the multiple-choice question answering (MCQA) tasks.
+- Med42-v2-70B achieves a MedQA zero-shot performance of 79.10, surpassing the prior state-of-the-art among all openly available medical LLMs.
+- Med42-v2-70B sits at the top of the Clinical Elo Rating Leaderboard.
+|Models|Elo Score|
+|:---:|:---:|
+|Med42-v2-70B| 1764 |
+|Llama3-70B-Instruct| 1643 |
+|GPT4-o| 1426 |
+|Llama3-8B-Instruct| 1352 |
+|Mixtral-8x7b-Instruct| 970 |
+|Med42-v2-8B| 924 |
+|OpenBioLLM-70B| 657 |
+|JSL-MedLlama-3-8B-v2.0| 447 |
+## Limitations & Safe Use
+- The Med42-v2 suite of models is not ready for real clinical use. Extensive human evaluation, which is required to ensure safety, is still under way.
+- Potential for generating incorrect or harmful information.
+- Risk of perpetuating biases in training data.
+Use this suite of models responsibly! Do not rely on them for medical use without rigorous safety testing.
+## Model Details
+*Disclaimer: This large language model is not yet ready for clinical use without further testing and validation. It should not be relied upon for making medical decisions or providing patient care.*
+Starting from the Llama3 models, the Med42-v2 models were instruction-tuned using a dataset of ~1B tokens compiled from different open-access and high-quality sources, including medical flashcards, exam questions, and open-domain dialogues.
+**Model Developers:** M42 Health AI Team
+**Finetuned from model:** Llama3 - 8B & 70B Instruct
+**Context length:** 8k tokens
+**Input:** Text only data
+**Output:** Model generates text only
+**Status:** This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we enhance the model's performance.
+**License:** Llama 3 Community License Agreement
+**Research Paper:** *Coming soon*
+## Intended Use
+The Med42-v2 suite of models is being made available for further testing and assessment as AI assistants to enhance clinical decision-making and improve access to LLMs for healthcare use. Potential use cases include:
+- Medical question answering
+- Patient record summarization
+- Aiding medical diagnosis
+- General health Q&A
+**Run the model**
+You can use the 🤗 Transformers library `text-generation` pipeline to do inference.
+```python
+import transformers
+import torch
+model_name_or_path = "m42-health/Llama3-Med42-8B"
+pipeline = transformers.pipeline(
+    "text-generation",
+    model=model_name_or_path,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+messages = [
+    {
+        "role": "system",
+        "content": (
+            "You are a helpful, respectful and honest medical assistant. You are a second version of Med42 developed by the AI team at M42, UAE. "
+            "Always answer as helpfully as possible, while being safe. "
+            "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. "
+            "Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. "
+            "If you don’t know the answer to a question, please don’t share false information."
+        ),
+    },
+    {"role": "user", "content": "What are the symptoms of diabetes?"},
+]
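+# Render the chat messages into one prompt string. add_generation_prompt=True
+# appends the assistant header so the model starts its reply directly.
+prompt = pipeline.tokenizer.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)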
+# Stop at either the tokenizer's EOS token or the Llama3 end-of-turn token.
+stop_tokens = [
+    pipeline.tokenizer.eos_token_id,
+    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
+]
+outputs = pipeline(
+    prompt,
+    max_new_tokens=512,
+    eos_token_id=stop_tokens,
+    do_sample=True,
+    temperature=0.4,
+    top_k=150,
+    top_p=0.75,
+)
+# The pipeline returns prompt + completion; print only the completion.
+print(outputs[0]["generated_text"][len(prompt) :])
+```
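To make the role of `apply_chat_template` and the `<|eot_id|>` stop token more concrete, here is a minimal sketch of how a Llama 3 style chat prompt is laid out as plain text. The function `render_llama3_chat` is a hypothetical helper written for illustration, not part of Transformers, and it only approximates the template shipped with the tokenizer, which remains the authoritative source.

```python
# Simplified re-implementation of the Llama 3 chat layout, for illustration only.
# The authoritative template is the one stored in the model's tokenizer config.


def render_llama3_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a single prompt string."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn is a role header followed by the content and an end-of-turn token.
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # Open an assistant turn so the model starts writing the answer directly.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


demo_messages = [
    {"role": "system", "content": "You are a helpful medical assistant."},
    {"role": "user", "content": "What are the symptoms of diabetes?"},
]
demo_prompt = render_llama3_chat(demo_messages)
print(demo_prompt)
```

Every turn ends with `<|eot_id|>`, which is why that token is passed as a stop token alongside the tokenizer's EOS token: the model emits it when its own turn is finished.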
+## Hardware and Software
+Training was conducted on an NVIDIA DGX cluster with H100 GPUs, using PyTorch's Fully Sharded Data Parallel (FSDP) framework.
+## Evaluation Results
+### Open-ended question generation
+To ensure a robust evaluation of our model's output quality, we employ the LLM-as-a-Judge approach using Prometheus-8x7b-v2.0. Our assessment uses 4,000 carefully curated, publicly accessible healthcare-related questions, for which we generate responses from various models. We then use Prometheus to conduct pairwise comparisons of the answers. Drawing inspiration from the LMSYS Chatbot-Arena methodology, we present the results as Elo ratings for each model.
+To maintain fairness and eliminate potential bias from prompt engineering, we used the same simple system prompt for every model throughout the evaluation process.
+Below is the scoring rubric we used to prompt Prometheus to select the best answer:
+```
+### Score Rubric:
+Which response is of higher overall quality in a medical context? Consider:
+* Relevance: Does it directly address the question?
+* Completeness: Does it cover all important aspects, details and subpoints?
+* Safety: Does it avoid unsafe practices and address potential risks?
+* Ethics: Does it maintain confidentiality and avoid biases?
+* Clarity: Is it professional, clear and easy to understand?
+```
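As a rough illustration of how pairwise judgments like those described above can be turned into Elo ratings, the sketch below applies the standard sequential Elo update. The initial rating of 1000, the K-factor of 32, the model names, and the judgment list are all assumptions made for this example; the card does not state the exact fitting procedure or constants used.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one pairwise result in place; k is an assumed K-factor."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical models and judge verdicts, given as (winner, loser) pairs.
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
judgments = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
for winner, loser in judgments:
    update_elo(ratings, winner, loser)

# Print the resulting leaderboard, highest rating first.
for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

Sequential updates depend on the order of the comparisons, so leaderboards built this way typically shuffle or bootstrap the judgments, or fit all ratings jointly.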
+#### Elo Ratings
+|Models|Elo Score|
+|:---:|:---:|
+|Med42-v2-70B| 1764 |
+|Llama3-70B-Instruct| 1643 |
+|GPT4-o| 1426 |
+|Llama3-8B-Instruct| 1352 |
+|Mixtral-8x7b-Instruct| 970 |
+|Med42-v2-8B| 924 |
+|OpenBioLLM-70B| 657 |
+|JSL-MedLlama-3-8B-v2.0| 447 |
+#### Win-rate
+![plot](./pairwise_model_comparison.svg)
+### MCQA Evaluation
+Med42-v2 improves performance on every clinical benchmark compared to our previous version, including MedQA, MedMCQA, USMLE, MMLU clinical topics and the MMLU Pro clinical subset. For all evaluations reported so far, we use [EleutherAI's evaluation harness library](https://github.com/EleutherAI/lm-evaluation-harness) and report zero-shot accuracies (unless otherwise stated). We integrated chat templates into the harness and computed the likelihood of the full answer instead of only the tokens "a.", "b.", "c." or "d.".
+|Model|MMLU Pro|MMLU|MedMCQA|MedQA|USMLE|
+|---:|:---:|:---:|:---:|:---:|:---:|
+|Med42-v2-70B|64.36|87.12|73.20|79.10|83.80|
+|Med42-v2-8B|54.30|75.76|61.34|62.84|67.04|
+|OpenBioLLM|64.24|90.40|73.18|76.90|79.01|
+|GPT-4.0<sup>&dagger;</sup>|-|87.00|69.50|78.90|84.05|
+|MedGemini*|-|-|-|84.00|-|
+|Med-PaLM-2(5-shot)*|-|87.77|71.30|79.70|-|
+|Med42|-|76.72|60.90|61.50|71.85|
+|ClinicalCamel-70B|-|69.75|47.00|53.40|54.30|
+|GPT-3.5<sup>&dagger;</sup>|-|66.63|50.10|50.80|53.00|
+|Llama3-8B-Instruct|48.24|72.89|59.65|61.64|60.38|
+|Llama3-70B-Instruct|64.24|85.99|72.03|78.88|83.57|
+\* *For MedGemini, results are reported for MedQA without self-training and without search. We note that 0-shot performance is not reported for Med-PaLM 2. Further details can be found at [https://github.com/m42health/med42](https://github.com/m42health/med42).*
+<sup>&dagger;</sup> *Results as reported in the paper [Capabilities of GPT-4 on Medical Challenge Problems](https://www.microsoft.com/en-us/research/uploads/prod/2023/03/GPT-4_medical_benchmarks.pdf)*.
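The MCQA scoring described above, which ranks options by the likelihood of the full answer text rather than of the option letter alone, can be sketched as follows. The per-token log-probabilities here are made-up numbers standing in for model outputs, the helper names are hypothetical, and the sketch uses an unnormalized sum; whether the actual evaluation applies length normalization is not stated in this card.

```python
def score_choice(token_logprobs):
    """Log-likelihood of a whole answer string: the sum of its token log-probs."""
    return sum(token_logprobs)


def pick_answer(choice_token_logprobs):
    """Return the label of the highest-scoring choice and all scores."""
    scores = {
        label: score_choice(logprobs)
        for label, logprobs in choice_token_logprobs.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical log-probs for the tokens of each full answer, conditioned on the question.
example = {
    "a": [-2.1, -0.9, -1.4],
    "b": [-0.4, -0.3, -0.6],
    "c": [-1.8, -2.2],
    "d": [-3.0, -0.2, -0.5, -0.7],
}
best, scores = pick_answer(example)
print(best, scores)
```

Scoring the full answer lets the model be credited for the content of an option and not only for predicting the right letter, at the cost of making longer answers accumulate more negative log-probability.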
+## Accessing Med42 and Reporting Issues
+Please report any software bugs or other problems through one of the following means:
+- Reporting issues with the model: [https://github.com/m42health/med42](https://github.com/m42health/med42)
+- Reporting risky content generated by the model, bugs and/or any security concerns: [https://forms.office.com/r/fPY4Ksecgf](https://forms.office.com/r/fPY4Ksecgf)
+- M42’s privacy policy is available at [https://m42.ae/privacy-policy/](https://m42.ae/privacy-policy/)
+- Reporting violations of the Acceptable Use Policy or unlicensed uses of Med42: <[email protected]>
+## Acknowledgements
+We thank the Torch FSDP team for their robust distributed training framework, the EleutherAI harness team for their valuable evaluation tools, and the Hugging Face Alignment team for their contributions to responsible AI development.
+## Citation
+```
+@article{christophe2024med42,
+  title={Med42-v2 - A Suite of Clinically-aligned Large Language Models},
+  author={Christophe, Cl{\'e}ment and Raha, Tathagata and Hayat, Nasir and Kanithi, Praveen and Al-Mahrooqi, Ahmed and Munjal, Prateek and Saadi, Nada and Javed, Hamza and Salman, Umar and Pimentel, Marco and Rajan, Ronnie and Khan, Shadab},
+  journal={M42},
+  year={2024}
+}
+```