---
language:
- en
- de
- es
- ar
- ja
- ko
- zh
license: cc-by-nc-sa-4.0
library_name: transformers
datasets:
- wi_locness
- matejklemen/falko_merlin
- paws
- paws-x
- facebook/asset
metrics:
- bleu
- rouge
- sari
- accuracy
pipeline_tag: text-generation
---
# Model Card for mEdIT-xxl
The `medit-xxl` model was obtained by fine-tuning the `MBZUAI/bactrian-x-llama-13b-lora` model on the mEdIT dataset.
**Paper:** mEdIT: Multilingual Text Editing via Instruction Tuning
**Authors:** Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar
## Model Details
### Model Description
- **Language(s) (NLP)**: Arabic, Chinese, English, German, Japanese, Korean, Spanish
- **Finetuned from model:** `MBZUAI/bactrian-x-llama-13b-lora`
### Model Sources
- **Repository:** https://github.com/vipulraheja/medit
- **Paper:** https://arxiv.org/abs/2402.16472v1
## How to use
Given an edit instruction and an original text, our model can generate the edited version of the text.<br>
![task_specs](https://cdn-uploads.huggingface.co/production/uploads/60985a0547dc3dbf8a976607/816ZY2t0XPCpMMd6Z072K.png)
Specifically, our models support both multilingual and cross-lingual text revision. Note that the input and output texts are always in the same language; the monolingual
vs. cross-lingual setting is determined by the language of the edit instruction relative to the language of the input text. For example, a Japanese instruction paired with an English input text constitutes a cross-lingual revision.
### Instruction format
Adherence to the following instruction format is essential; failure to do so may result in the model producing less-than-ideal results.
```python
instruction_tokens = [
    "Instruction",
    "Anweisung",
    ...
]

input_tokens = [
    "Input",
    "Aporte",
    ...
]

output_tokens = [
    "Output",
    "Produzione",
    ...
]

task_descriptions = [
    "Fix grammatical errors in this sentence",  # <-- GEC task
    "Umschreiben Sie den Satz",  # <-- Paraphrasing
    ...
]
```
**The entire list of possible instructions, input/output tokens, and task descriptions can be found in the Appendix of our paper.**
```python
prompt_template = """### <instruction_token>:\n<task_description>\n### <input_token>:\n<input>\n### <output_token>:\n\n"""
```
Note that the tokens and the task description need not be in the language of the input (in the case of cross-lingual revision).
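For illustration, here is a minimal sketch (our addition, not from the paper; `build_prompt` is a hypothetical helper) of assembling a prompt from the template using tokens and a task description from the lists above:
```python
def build_prompt(instruction_token, task_description, input_token, input_text, output_token):
    # Fill prompt_template with the chosen tokens and task description.
    return (
        f"### {instruction_token}:\n{task_description}\n"
        f"### {input_token}:\n{input_text}\n"
        f"### {output_token}:\n\n"
    )

# Cross-lingual GEC: English instruction tokens and task description,
# German input text (the edited output will also be in German).
prompt = build_prompt(
    "Instruction",
    "Fix grammatical errors in this sentence",
    "Input",
    "Ich haben eines kleines Katze ,",
    "Output",
)
```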
### Run the model
**Make sure you have the following libraries installed:**
```
- peft
- protobuf
- sentencepiece
- tokenizers
- torch
- transformers
```
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "grammarly/medit-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
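# Optional (an assumption, not shown in the card): since this is a 13B model,
# loading in half precision on a GPU is common, e.g.:
#   model = AutoModelForCausalLM.from_pretrained(
#       model_id, torch_dtype=torch.float16, device_map="auto")
# (device_map="auto" additionally requires the `accelerate` package)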
# English GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nI has small cat ,\n### 出力:\n\n'
inputs = tokenizer(prompt, return_tensors='pt')
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# --> I have a small cat ,
# German GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nIch haben eines kleines Katze ,\n### 出力:\n\n'
# ...
# --> Ich habe eine kleine Katze ,
```
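Note that `generate` returns the prompt tokens followed by the completion, so the decoded string repeats the prompt. A small sketch (our addition, not from the card) for keeping only the generated revision:
```python
# Slice off the prompt tokens so only the newly generated text remains.
prompt_length = inputs["input_ids"].shape[1]
revision = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(revision)  # --> I have a small cat ,
```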
#### Software
https://github.com/vipulraheja/medit
## Citation
**BibTeX:**
```
@article{raheja2023medit,
  title={mEdIT: Multilingual Text Editing via Instruction Tuning},
  author={Vipul Raheja and Dimitris Alikaniotis and Vivek Kulkarni and Bashar Alhafni and Dhruv Kumar},
  year={2024},
  eprint={2402.16472v1},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
**APA:**
Raheja, V., Alikaniotis, D., Kulkarni, V., Alhafni, B., & Kumar, D. (2024). mEdIT: Multilingual Text Editing via Instruction Tuning. arXiv. https://arxiv.org/abs/2402.16472