Commit b6e8292 by jncraton (1 parent: 8657778)

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,286 @@
1
+ ---
2
+ language:
3
+ - en
4
+ - fr
5
+ - ro
6
+ - de
7
+ - multilingual
8
+
9
+ widget:
10
+ - text: "Continue the dialogue as a task-oriented dialogue system called SYSTEM. The answer of SYSTEM should follow the ACTION provided next while answering the USER's last utterance: \n<USER> Hello, I am looking for a restaurant in Cambridge. I believe it is called Golden Wok. \n<EXTERNAL KNOWLEDGE> ACTION: {'Restaurant-Inform': [['address', '191 Histon Road Chesterton']]}"
11
+ example_title: "Dialog Act to Response Generation"
12
+ - text: "Translate to German: My name is Arthur"
13
+ example_title: "Translation"
14
+ - text: "Please answer the following question. Who is going to be the next Ballon d'or?"
15
+ example_title: "Question Answering"
16
+ - text: "Q: Can Geoffrey Hinton have a conversation with George Washington? Give the rationale before answering."
17
+ example_title: "Logical reasoning"
18
+ - text: "Please answer the following question. What is the boiling point of Nitrogen?"
19
+ example_title: "Scientific knowledge"
20
+ - text: "Answer the following yes/no question. Can you write 200 words in a single tweet?"
21
+ example_title: "Yes/no question"
22
+ - text: "Answer the following yes/no question by reasoning step-by-step. Can you write 200 words in a single tweet?"
23
+ example_title: "Reasoning task"
24
+ - text: "Q: Is the statement ( `Jianguo is a research scientist at Salesforce AI` and `Jianguo is a student at UIC` ) True or False? A: Let's think step by step"
25
+ example_title: "Boolean Expressions"
26
+ - text: "The square root of x is the cube root of y. What is y to the power of 2, if x = 4?"
27
+ example_title: "Math reasoning"
28
+ - text: "Premise: At my age you will probably have learnt one lesson. Hypothesis: It's not certain how many lessons you'll learn by your thirties. Does the premise entail the hypothesis?"
29
+ example_title: "Premise and hypothesis"
30
+
31
+ inference:
32
+ parameters:
33
+ max_length: 256
34
+
35
+ tags:
36
+ - text2text-generation
37
+ - dialog
38
+
39
+ datasets:
40
+ - Salesforce/dialogstudio
41
+ - flan
42
+
43
+
44
+ license: apache-2.0
45
+ ---
46
+
47
+ # Model Card for DialogStudio-T5 large
48
+
49
+ <img src="https://huggingface.co/datasets/Salesforce/dialogstudio/resolve/main/logo.png"
50
+ alt="drawing" width="510"/>
51
+
52
+ # Table of Contents
53
+
54
+ 0. [TL;DR](#tldr)
55
+ 1. [Model Details](#model-details)
56
+ 2. [Usage](#usage)
57
+ 3. [Uses](#uses)
58
+ 4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
59
+ 5. [Training Details](#training-details)
60
+ 6. [Evaluation](#evaluation)
61
+ 7. [Environmental Impact](#environmental-impact)
62
+ 8. [Citation](#citation)
63
+ 9. [Model Card Authors](#model-card-authors)
64
+
65
+ # TL;DR
66
+
67
+ If you already know T5 and Flan-T5, DialogStudio-T5 performs better on many tasks. With the same number of parameters, the models are fine-tuned on a selected set of dialogues from [DialogStudio](https://github.com/salesforce/DialogStudio) and on 1000 additional tasks.
68
+
69
+
70
+ **Disclaimer**: The content of **this** model card is adapted from content written by the Hugging Face team, and parts of it were copied from the [T5 model card](https://huggingface.co/t5-large) and the [Flan-T5 model card](https://huggingface.co/google/flan-t5-large).
71
+
72
+
73
+ **Follow the [DialogStudio](https://github.com/salesforce/DialogStudio) GitHub repository for the latest information.**
74
+
75
+
76
+ # Model Details
77
+ ## Data
78
+
79
+ We sample a small number of dialogues from each commercially supported dataset under three categories of [DialogStudio](https://huggingface.co/datasets/Salesforce/dialogstudio), i.e., KG-Dial, TOD, and Open-Domain dialogues. Additionally, we sample at most 150 examples for each non-translation task from [FLAN](https://github.com/google-research/FLAN/tree/main/flan/v2).
80
+
81
+
82
+ **Note** that this model version 1.0 does not incorporate datasets utilized for training large-scale models (>=7B) like Alpaca, ShareGPT, GPT4ALL, UltraChat from OpenAI's 'GPT-3.5/4', or other datasets such as OASST1 and WizardCoder.
83
+
84
+
85
+ <img src="https://huggingface.co/datasets/Salesforce/dialogstudio/resolve/main/DialogStudio_Stats.jpg"
86
+ alt="drawing" width="700"/>
87
+
88
+
89
+
90
+ ## Model Description
91
+
92
+
93
+ - **Model type:** Language model
94
+ - **Language(s) (NLP):** English, Spanish, Japanese, Persian, Hindi, French, Chinese, Bengali, Gujarati, German, Telugu, Italian, Arabic, Polish, Tamil, Marathi, Malayalam, Oriya, Panjabi, Portuguese, Urdu, Galician, Hebrew, Korean, Catalan, Thai, Dutch, Indonesian, Vietnamese, Bulgarian, Filipino, Central Khmer, Lao, Turkish, Russian, Croatian, Swedish, Yoruba, Kurdish, Burmese, Malay, Czech, Finnish, Somali, Tagalog, Swahili, Sinhala, Kannada, Zhuang, Igbo, Xhosa, Romanian, Haitian, Estonian, Slovak, Lithuanian, Greek, Nepali, Assamese, Norwegian
95
+ - **License:** Apache 2.0
96
+ - **Related Models:** [All DialogStudio-T5 Checkpoints](https://huggingface.co/models?search=dialogstudio-t5)
97
+ - **Resources for more information:**
98
+ - [Research paper](https://arxiv.org/abs/2307.10172)
99
+ - [GitHub Repo](https://github.com/salesforce/DialogStudio)
100
+ - **Maximum model length:**
101
+ - Maximum input length: 1200
102
+ - Maximum output length: 256
103
+ - **Training formats:**
104
+ - We process dialogue data into the input formats below:
105
+ - With instruction and external knowledge: ```Instruction: your instruction <USER> user utterance 1 <SYSTEM> system utterance 1 ... <USER> user utterance N <EXTERNAL KNOWLEDGE> your external knowledge```
106
+ - Without instruction: ```<USER> user utterance 1 <SYSTEM> system utterance 1 ... <USER> user utterance N <EXTERNAL KNOWLEDGE> your external knowledge```
107
+ - Without external knowledge: ```Instruction: your instruction <USER> user utterance 1 <SYSTEM> system utterance 1 ... <USER> user utterance N```
108
+ - Without both: ```<USER> user utterance 1 <SYSTEM> system utterance 1 ... <USER> user utterance N```
109
+ - Note: the output is the final system response; `<USER>`, `<SYSTEM>` and `<EXTERNAL KNOWLEDGE>` are special tokens
110
+ - For sampled FLAN data:
111
+ - We follow their original data format, i.e., we did not set special tokens to separate in-context learning examples.
112
+ - In summary:
113
+ - We recommend that you use our format and add our special tokens (such as `<USER>` and `<SYSTEM>`) to get better performance. However, you do not necessarily need to follow our format exactly if you do not observe random behaviors.
114
+ - We found that T5-series models such as Flan-T5 and DialogStudio-T5 may generate repetitive tokens during inference. If you see such repetition, you can set `repetition_penalty` in `model.generate()` to a value such as 1.5 to mitigate it. Note that `repetition_penalty=1.0` by default.
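The training formats listed above can be assembled programmatically. The following is a minimal sketch, not the authors' preprocessing code: the helper name `build_prompt`, the `(speaker, text)` turn representation, and the single-space separator between segments are assumptions made for illustration, based only on the format strings in this model card.

```python
from typing import List, Optional, Tuple

# Special tokens named in the model card.
USER = "<USER>"
SYSTEM = "<SYSTEM>"
KNOWLEDGE = "<EXTERNAL KNOWLEDGE>"


def build_prompt(
    turns: List[Tuple[str, str]],
    instruction: Optional[str] = None,
    knowledge: Optional[str] = None,
) -> str:
    """Build a DialogStudio-style input string (hypothetical helper).

    `turns` is a list of (speaker, text) pairs where speaker is "user" or
    "system". The last turn must come from the user, because the model is
    expected to produce the next system response.
    """
    if not turns or turns[-1][0] != "user":
        raise ValueError("the dialogue must end with a user turn")

    parts = []
    if instruction:
        parts.append(f"Instruction: {instruction}")
    for speaker, text in turns:
        if speaker not in ("user", "system"):
            raise ValueError(f"unknown speaker: {speaker!r}")
        tag = USER if speaker == "user" else SYSTEM
        parts.append(f"{tag} {text}")
    if knowledge:
        parts.append(f"{KNOWLEDGE} {knowledge}")
    # Assumption: segments are separated by a single space.
    return " ".join(parts)


prompt = build_prompt(
    [("user", "Hello, I am looking for a restaurant in Cambridge.")],
    instruction="Continue the dialogue as a task-oriented dialogue system.",
    knowledge="ACTION: {'Restaurant-Inform': [['address', '191 Histon Road Chesterton']]}",
)
print(prompt)

# The resulting string can then be tokenized and passed to the model, e.g.
#   input_ids = tokenizer(prompt, return_tensors="pt").input_ids
#   outputs = model.generate(input_ids, max_new_tokens=256, repetition_penalty=1.5)
# where repetition_penalty=1.5 is the example value suggested above.
```

Omitting `instruction` or `knowledge` yields the "without instruction", "without external knowledge", and "without both" variants.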
115
+ # Usage
116
+
117
+ Below are some example scripts showing how to use the model with `transformers`:
118
+
119
+ ## Using the PyTorch model
120
+
121
+ ### Running the model on a CPU
122
+
123
+ <details>
124
+ <summary> Click to expand </summary>
125
+
126
+ ```python
127
+
128
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
129
+
130
+ tokenizer = AutoTokenizer.from_pretrained("Salesforce/dialogstudio-t5-large-v1.0")
131
+ model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/dialogstudio-t5-large-v1.0")
132
+
133
+ input_text = "Answer the following yes/no question by reasoning step-by-step. Can you write 200 words in a single tweet?"
134
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids
135
+
136
+ outputs = model.generate(input_ids, max_new_tokens=256)
137
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
138
+ ```
139
+
140
+ </details>
141
+
142
+ ### Running the model on a GPU
143
+
144
+ <details>
145
+ <summary> Click to expand </summary>
146
+
147
+ ```python
148
+ # pip install accelerate
149
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
150
+
151
+ tokenizer = AutoTokenizer.from_pretrained("Salesforce/dialogstudio-t5-large-v1.0")
152
+ model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/dialogstudio-t5-large-v1.0", device_map="auto")
153
+
154
+ input_text = "Answer the following yes/no question by reasoning step-by-step. Can you write 200 words in a single tweet?"
155
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
156
+
157
+ outputs = model.generate(input_ids, max_new_tokens=256)
158
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
159
+ ```
160
+
161
+ </details>
162
+
163
+ ### Running the model on a GPU using different precisions
164
+
165
+ #### FP16
166
+
167
+ <details>
168
+ <summary> Click to expand </summary>
169
+
170
+ ```python
171
+ # pip install accelerate
172
+ import torch
173
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
174
+
175
+ tokenizer = AutoTokenizer.from_pretrained("Salesforce/dialogstudio-t5-large-v1.0")
176
+ model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/dialogstudio-t5-large-v1.0", device_map="auto", torch_dtype=torch.float16)
177
+
178
+ input_text = "Answer the following yes/no question by reasoning step-by-step. Can you write 200 words in a single tweet?"
179
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
180
+
181
+ outputs = model.generate(input_ids, max_new_tokens=256)
182
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
183
+ ```
184
+
185
+ </details>
186
+
187
+ #### INT8
188
+
189
+ <details>
190
+ <summary> Click to expand </summary>
191
+
192
+ ```python
193
+ # pip install bitsandbytes accelerate
194
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
195
+
196
+ tokenizer = AutoTokenizer.from_pretrained("Salesforce/dialogstudio-t5-large-v1.0")
197
+ model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/dialogstudio-t5-large-v1.0", device_map="auto", load_in_8bit=True)
198
+
199
+ input_text = "Answer the following yes/no question by reasoning step-by-step. Can you write 200 words in a single tweet?"
200
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
201
+
202
+ outputs = model.generate(input_ids, max_new_tokens=256)
203
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
204
+ ```
205
+
206
+ </details>
207
+
208
+ # Uses
209
+
210
+ ## Direct Use and Downstream Use
211
+
212
+ <!-- The authors write in [the original paper's model card](https://arxiv.org/pdf/2210.11416.pdf) that: -->
213
+
214
+ > The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as dialogue response generation, reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models
215
+
216
+
217
+ ## Out-of-Scope Use
218
+
219
+ More information needed.
220
+
221
+ # Bias, Risks, and Limitations
222
+
223
+ The information in this section is copied and adapted from Flan-T5's model card:
224
+
225
+ > Language models, including DialogStudio-T5, can potentially be used for language generation in a harmful way, according to Rae et al. (2021). DialogStudio-T5 should not be used directly in any application, without a prior assessment of safety and fairness concerns specific to the application.
226
+
227
+ ## Ethical considerations and risks
228
+
229
+ > DialogStudio-T5 is fine-tuned on a large corpus of text data that was not filtered for explicit content or assessed for existing biases. As a result the model itself is potentially vulnerable to generating equivalently inappropriate content or replicating inherent biases in the underlying data.
230
+
231
+ ## Known Limitations
232
+
233
+ > DialogStudio-T5 has not been tested in real world applications.
234
+
235
+ ## Sensitive Use
236
+
237
+ > DialogStudio-T5 should not be applied for any unacceptable use cases, e.g., generation of abusive speech.
238
+
239
+ # Training Details
240
+
241
+ ## Training Data
242
+
243
+ We sample a small number of dialogues from each commercially supported dataset under three categories of [DialogStudio](https://huggingface.co/datasets/Salesforce/dialogstudio), i.e., KG-Dial, TOD, and Open-Domain dialogues. Additionally, we sample at most 150 examples for each non-translation task from [FLAN](https://github.com/google-research/FLAN/tree/main/flan/v2).
244
+
245
+ **Note:**
246
+
247
+ Model version 1.0 is built on small-scale pre-trained models. This version does not incorporate datasets utilized for training large-scale models (>=7B) like Alpaca, ShareGPT, GPT4ALL, UltraChat from OpenAI's 'GPT-3.5/4', or other datasets such as OASST1 and WizardCoder. As a result, it has certain limitations in terms of writing and creative capabilities. Our initial focus is to update the model versions to enhance existing abilities. Further improvements, including expansion to other capabilities, are part of our roadmap and will be responsive to community requests.
248
+
249
+
250
+ See **Training formats** above for details of the training formats.
251
+
252
+ ## Training Procedure
253
+
254
+
255
+ > These models are based on Flan-T5 and are fine-tuned with instructions for better zero-shot and few-shot performance. There is one fine-tuned DialogStudio model per T5 model size.
256
+
257
+ The model was trained on 16 A100 GPUs, each with 40 GB of memory, using the public [transformers](https://github.com/huggingface/transformers) codebase.
258
+
259
+
260
+ # Evaluation
261
+
262
+ ## Testing Data, Factors & Metrics
263
+
264
+ The authors evaluated the model on several dialogue tasks and general tasks such as 0-shot/5-shot MMLU and 3-shot BBH.
265
+
266
+ ## Results
267
+
268
+ For full results for DialogStudio, see the [research paper](https://arxiv.org/abs/2307.10172).
269
+
270
+ # Environmental Impact
271
+ More information needed.
272
+
273
+ # Citation
274
+
275
+ **BibTeX:**
276
+
277
+ ```bibtex
278
+ @misc{zhang2023dialogstudio,
279
+ title={DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI},
280
+ author={Jianguo Zhang and Kun Qian and Zhiwei Liu and Shelby Heinecke and Rui Meng and Ye Liu and Zhou Yu and Huan Wang and Silvio Savarese and Caiming Xiong},
281
+ year={2023},
282
+ eprint={2307.10172},
283
+ archivePrefix={arXiv},
284
+ primaryClass={cs.CL}
285
+ }
286
+ ```
config.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "add_source_bos": false,
3
+ "add_source_eos": false,
4
+ "bos_token": "<pad>",
5
+ "decoder_start_token": "<pad>",
6
+ "eos_token": "</s>",
7
+ "layer_norm_epsilon": null,
8
+ "unk_token": "<unk>"
9
+ }
model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:806bf6111192fc182579a57f010f8005a96b3bfc35797683fc55637a2d731dd5
3
+ size 786248651
shared_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff
 
special_tokens_map.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<USER>",
4
+ "<SYSTEM>",
5
+ "<EXTERNAL KNOWLEDGE>"
6
+ ],
7
+ "eos_token": "</s>",
8
+ "pad_token": "<pad>",
9
+ "unk_token": "<unk>"
10
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,112 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<extra_id_0>",
4
+ "<extra_id_1>",
5
+ "<extra_id_2>",
6
+ "<extra_id_3>",
7
+ "<extra_id_4>",
8
+ "<extra_id_5>",
9
+ "<extra_id_6>",
10
+ "<extra_id_7>",
11
+ "<extra_id_8>",
12
+ "<extra_id_9>",
13
+ "<extra_id_10>",
14
+ "<extra_id_11>",
15
+ "<extra_id_12>",
16
+ "<extra_id_13>",
17
+ "<extra_id_14>",
18
+ "<extra_id_15>",
19
+ "<extra_id_16>",
20
+ "<extra_id_17>",
21
+ "<extra_id_18>",
22
+ "<extra_id_19>",
23
+ "<extra_id_20>",
24
+ "<extra_id_21>",
25
+ "<extra_id_22>",
26
+ "<extra_id_23>",
27
+ "<extra_id_24>",
28
+ "<extra_id_25>",
29
+ "<extra_id_26>",
30
+ "<extra_id_27>",
31
+ "<extra_id_28>",
32
+ "<extra_id_29>",
33
+ "<extra_id_30>",
34
+ "<extra_id_31>",
35
+ "<extra_id_32>",
36
+ "<extra_id_33>",
37
+ "<extra_id_34>",
38
+ "<extra_id_35>",
39
+ "<extra_id_36>",
40
+ "<extra_id_37>",
41
+ "<extra_id_38>",
42
+ "<extra_id_39>",
43
+ "<extra_id_40>",
44
+ "<extra_id_41>",
45
+ "<extra_id_42>",
46
+ "<extra_id_43>",
47
+ "<extra_id_44>",
48
+ "<extra_id_45>",
49
+ "<extra_id_46>",
50
+ "<extra_id_47>",
51
+ "<extra_id_48>",
52
+ "<extra_id_49>",
53
+ "<extra_id_50>",
54
+ "<extra_id_51>",
55
+ "<extra_id_52>",
56
+ "<extra_id_53>",
57
+ "<extra_id_54>",
58
+ "<extra_id_55>",
59
+ "<extra_id_56>",
60
+ "<extra_id_57>",
61
+ "<extra_id_58>",
62
+ "<extra_id_59>",
63
+ "<extra_id_60>",
64
+ "<extra_id_61>",
65
+ "<extra_id_62>",
66
+ "<extra_id_63>",
67
+ "<extra_id_64>",
68
+ "<extra_id_65>",
69
+ "<extra_id_66>",
70
+ "<extra_id_67>",
71
+ "<extra_id_68>",
72
+ "<extra_id_69>",
73
+ "<extra_id_70>",
74
+ "<extra_id_71>",
75
+ "<extra_id_72>",
76
+ "<extra_id_73>",
77
+ "<extra_id_74>",
78
+ "<extra_id_75>",
79
+ "<extra_id_76>",
80
+ "<extra_id_77>",
81
+ "<extra_id_78>",
82
+ "<extra_id_79>",
83
+ "<extra_id_80>",
84
+ "<extra_id_81>",
85
+ "<extra_id_82>",
86
+ "<extra_id_83>",
87
+ "<extra_id_84>",
88
+ "<extra_id_85>",
89
+ "<extra_id_86>",
90
+ "<extra_id_87>",
91
+ "<extra_id_88>",
92
+ "<extra_id_89>",
93
+ "<extra_id_90>",
94
+ "<extra_id_91>",
95
+ "<extra_id_92>",
96
+ "<extra_id_93>",
97
+ "<extra_id_94>",
98
+ "<extra_id_95>",
99
+ "<extra_id_96>",
100
+ "<extra_id_97>",
101
+ "<extra_id_98>",
102
+ "<extra_id_99>"
103
+ ],
104
+ "clean_up_tokenization_spaces": true,
105
+ "eos_token": "</s>",
106
+ "extra_ids": 100,
107
+ "model_max_length": 512,
108
+ "pad_token": "<pad>",
109
+ "sp_model_kwargs": {},
110
+ "tokenizer_class": "T5Tokenizer",
111
+ "unk_token": "<unk>"
112
+ }