Wanfq committed on
Commit
9784f66
1 Parent(s): 3684748

Update README.md

  1. README.md +317 -1
README.md CHANGED
@@ -1,3 +1,319 @@
1
  ---
2
- license: apache-2.0
 
 
3
  ---
1
  ---
2
+ license: cc-by-nc-4.0
3
+ language:
4
+ - en
5
  ---
6
+ <p align="center" width="100%">
7
+ </p>
8
+
9
+ <div id="top" align="center">
10
+
11
+ **Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration**
12
+
13
+
14
+ <h4> |<a href="https://arxiv.org/abs/2305.xxxxx"> 📑 Paper </a> |
15
+ <a href="https://huggingface.co/datasets?sort=trending&search=Explore_Instruct"> 🤗 Data </a> |
16
+ <a href="https://huggingface.co/models?sort=trending&search=Explore-LM"> 🤗 Model </a> |
17
+ <a href="https://github.com/fanqiwan/Explore-Instruct"> 🐱 Github Repo </a> |
18
+ </h4>
19
+
20
+ <!-- **Authors:** -->
21
+
22
+ _**Fanqi Wan<sup>†</sup>, Xinting Huang<sup>‡</sup>, Tao Yang<sup>†</sup>, Xiaojun Quan<sup>†</sup>, Wei Bi<sup>‡</sup>, Shuming Shi<sup>‡</sup>**_
23
+
24
+
25
+ <!-- **Affiliations:** -->
26
+
27
+
28
+ _<sup>†</sup> Sun Yat-sen University,
29
+ <sup>‡</sup> Tencent AI Lab_
30
+
31
+ </div>
32
+
33
+
34
+ ## News
35
+ - **Oct 16, 2023:** 🔥 We're excited to announce that the Explore-Instruct datasets in brainstorming, rewriting, and math domains are now available on 🤗 [Huggingface Datasets](https://huggingface.co/datasets?sort=trending&search=Explore_Instruct)! Additionally, we've released Explore-LM models that have been initialized with LLaMA-7B and fine-tuned with the Explore-Instruct data in each domain. You can find these models on 🤗 [Huggingface Models](https://huggingface.co/models?sort=trending&search=Explore-LM). Happy exploring and instructing!
36
+
37
+ ## Contents
38
+
39
+ - [Overview](#overview)
40
+ - [Data Release](#data-release)
41
+ - [Model Release](#model-release)
42
+ - [Data Generation Process](#data-generation-process)
43
+ - [Fine-tuning](#fine-tuning)
44
+ - [Evaluation](#evaluation)
45
+ - [Limitations](#limitations)
46
+ - [License](#license)
47
+ - [Citation](#citation)
48
+ - [Acknowledgments](#acknowledgments)
49
+
50
+ ## Overview
51
+
52
+ We propose Explore-Instruct, a novel approach to enhancing domain-specific instruction coverage. We posit that the domain space is inherently structured like a tree, reminiscent of ontologies in cognitive science. Drawing on classical search algorithms and the capabilities of LLMs, Explore-Instruct actively traverses the domain space and generates instruction-tuning data **without** requiring a predefined tree structure. Specifically, Explore-Instruct employs two exploration operations, lookahead and backtracking:
53
+
54
+ - **Lookahead** delves into a multitude of potential fine-grained sub-tasks, thereby mapping out a complex network of tasks.
55
+
56
+ - **Backtracking** seeks alternative branches to widen the search boundary, hence extending the domain spectrum.
57
+
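The two operations can be pictured on a toy task tree. The sketch below is illustrative only: `propose_subtasks` is a hypothetical stand-in for the LLM call, and the `TaskNode` layout, the function names, and the breadth and depth values are assumptions, not the repository's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    name: str
    parent: "TaskNode | None" = None
    children: list = field(default_factory=list)


def propose_subtasks(task_name, existing, k):
    """Hypothetical stand-in for an LLM call: returns k new sub-task names."""
    start = len(existing)
    return [f"{task_name}/sub{start + i}" for i in range(k)]


def lookahead(node, breadth, depth):
    """Go deeper: expand `node` into `breadth` sub-tasks, repeated for `depth` levels."""
    if depth == 0:
        return
    existing = [c.name for c in node.children]
    for name in propose_subtasks(node.name, existing, breadth):
        child = TaskNode(name, parent=node)
        node.children.append(child)
        lookahead(child, breadth, depth - 1)


def backtrack(node, extra):
    """Go wider: return to the parent of `node` and add `extra` sibling branches."""
    parent = node.parent
    if parent is None:
        return []
    existing = [c.name for c in parent.children]
    siblings = [TaskNode(n, parent=parent) for n in propose_subtasks(parent.name, existing, extra)]
    parent.children.extend(siblings)
    return siblings


def count_nodes(node):
    """Total number of tasks in the subtree rooted at `node`."""
    return 1 + sum(count_nodes(c) for c in node.children)


root = TaskNode("math")
lookahead(root, breadth=2, depth=2)                  # root + 2 children + 4 grandchildren
new_branches = backtrack(root.children[0], extra=1)  # adds one more depth-1 branch
print(count_nodes(root), [n.name for n in new_branches])
```

In this toy version, lookahead grows the tree downward from a task, while backtracking adds alternatives next to a task that has already been explored.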
58
+ <p align="center">
59
+ <img src="https://github.com/fanqiwan/Explore-Instruct/blob/main/assets/fig2.png?raw=true" width="95%"> <br>
60
+ </p>
61
+
62
+ ## Data Release
63
+
64
+ We release the Explore-Instruct data in brainstorming, rewriting, and math domains on 🤗 [Huggingface Datasets](https://huggingface.co/datasets?sort=trending&search=Explore_Instruct). Each domain includes two versions of the dataset: a basic version and an extended version. The basic version contains 10k instruction-tuning examples, and the extended versions contain 16k, 32k, and 64k examples for the brainstorming, rewriting, and math domains, respectively. Each dataset is a structured data file in JSON format, consisting of a list of dictionaries, each with the following fields:
65
+
66
+ - `instruction`: `str`, describes the task the model should perform.
67
+ - `input`: `str`, optional context or input for the task.
68
+ - `output`: `str`, ground-truth output text for the task and input text.
69
+
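A minimal sketch of reading such a file and checking the three fields follows. The file name, the example record, and the assumption that `input` is always present (possibly as an empty string) are illustrative and not taken from the released data.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FIELDS = ("instruction", "input", "output")


def load_instruction_data(path):
    """Load a JSON list of records and check that each has the three string fields."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of dictionaries")
    for i, rec in enumerate(records):
        for key in REQUIRED_FIELDS:
            if not isinstance(rec.get(key), str):
                raise ValueError(f"record {i}: field {key!r} missing or not a string")
    return records


# Made-up record written to a temporary file, for demonstration only.
sample = [
    {
        "instruction": "Rewrite the sentence formally.",
        "input": "gonna be late",
        "output": "I will be late.",
    }
]
tmp_path = Path(tempfile.mkdtemp()) / "demo.json"
tmp_path.write_text(json.dumps(sample), encoding="utf-8")

records = load_instruction_data(tmp_path)
print(len(records), records[0]["instruction"])
```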
70
+ The results of data-centric analysis are shown as follows:
71
+
72
+ <p align="left">
73
+ <img src="https://github.com/fanqiwan/Explore-Instruct/blob/main/assets/fig1.png?raw=true" width="50%"> <br>
74
+ </p>
75
+
76
+ | Method | Brainstorming Unique<br/>V-N pairs | Rewriting Unique<br/>V-N pairs | Math Unique<br/>V-N pairs |
77
+ |:--------------------------------|:----------------------------------:|:------------------------------:|:-------------------------:|
78
+ | _Domain-Specific Human-Curated_ | 2 | 8 | 3 |
79
+ | _Domain-Aware Self-Instruct_ | 781 | 1715 | 451 |
80
+ | Explore-Instruct | **790** | **2015** | **917** |
81
+
82
+ ## Model Release
83
+
84
+ We release the Explore-LM models in brainstorming, rewriting, and math domains on 🤗 [Huggingface Models](https://huggingface.co/models?sort=trending&search=Explore-LM). Each domain includes two versions of the model, basic and extended, each trained on the corresponding version of the dataset.
85
+
86
+ The results of automatic and human evaluation in three domains are shown as follows:
87
+
88
+ - Automatic evaluation:
89
+
90
+ | Automatic Comparison in the Brainstorming Domain | Win:Tie:Lose | Beat Rate |
91
+ |:-------------------------------------------------|:------------:|:---------:|
92
+ | Explore-LM vs Domain-Curated-LM | 194:1:13 | 93.72 |
93
+ | Explore-LM-Ext vs Domain-Curated-LM | 196:1:11 | 94.69 |
94
+ | Explore-LM vs Domain-Instruct-LM | 114:56:38 | 75.00 |
95
+ | Explore-LM-Ext vs Domain-Instruct-LM | 122:55:31 | 79.74 |
96
+ | Explore-LM vs ChatGPT | 52:71:85 | 37.96 |
97
+ | Explore-LM-Ext vs ChatGPT | 83:69:56 | 59.71 |
98
+
99
+
100
+ | Automatic Comparison in the Rewriting Domain | Win:Tie:Lose | Beat Rate |
101
+ |:---------------------------------------------|:------------:|:---------:|
102
+ | Explore-LM vs Domain-Curated-LM | 50:38:6 | 89.29 |
103
+ | Explore-LM-Ext vs Domain-Curated-LM | 53:37:4 | 92.98 |
104
+ | Explore-LM vs Domain-Instruct-LM | 34:49:11 | 75.56 |
105
+ | Explore-LM-Ext vs Domain-Instruct-LM | 35:53:6 | 85.37 |
106
+ | Explore-LM vs ChatGPT | 11:59:24 | 31.43 |
107
+ | Explore-LM-Ext vs ChatGPT | 12:56:26 | 31.58 |
108
+
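The Beat Rate column is not defined in this README, but every row in the two tables above is consistent with counting wins over decided comparisons, that is, win / (win + lose) with ties excluded. A small sketch under that assumption:

```python
def beat_rate(win, tie, lose):
    """Percentage of decided comparisons (ties excluded) that were wins."""
    return 100.0 * win / (win + lose)


# Rows taken from the tables above: (Win, Tie, Lose).
brainstorming_vs_curated = beat_rate(194, 1, 13)
rewriting_vs_chatgpt = beat_rate(11, 59, 24)
print(f"{brainstorming_vs_curated:.2f} {rewriting_vs_chatgpt:.2f}")
```

The `tie` argument is accepted only so that a full Win:Tie:Lose triple can be passed; it does not enter the computation.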
109
+
110
+ | Automatic Comparison in the Math Domain | Accuracy Rate |
111
+ |:----------------------------------------|:-------------:|
112
+ | Domain-Curated-LM | 3.4 |
113
+ | Domain-Instruct-LM | 4.0 |
114
+ | Explore-LM | 6.8 |
115
+ | Explore-LM-Ext | 8.4 |
116
+ | ChatGPT | 34.8 |
117
+
118
+ - Human evaluation:
119
+
120
+ <p align="left">
121
+ <img src="https://github.com/fanqiwan/Explore-Instruct/blob/main/assets/fig5.png?raw=true" width="95%"> <br>
122
+ </p>
123
+
124
+ ## Data Generation Process
125
+
126
+ To generate the domain-specific instruction-tuning data, run the following commands in order:
127
+
128
+ ### Domain Space Exploration
129
+ ```
130
+ python3 generate_instruction.py \
131
+ --action extend \
132
+ --save_dir ./en_data/demo_domain \ # input dir containing the current domain tree for exploration
133
+ --out_dir ./en_data/demo_domain_exploration \ # output dir for the newly explored domain tree
134
+ --lang <LANGUAGE> \ # currently supports 'en'
135
+ --domain demo_domain \ # domain for exploration
136
+ --extend_nums <TASK_NUMBER_DEPTH_0>,...,<TASK_NUMBER_DEPTH_MAX_DEPTH-1> \ # exploration breadth at each depth
137
+ --max_depth <MAX_DEPTH> \ # exploration depth
138
+ --assistant_name <ASSISTANT_NAME> # currently supports openai and claude
139
+ ```
140
+
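The inline `# ...` annotations in these commands are explanatory. In bash, a backslash followed by a space and a comment does not continue the line, so the annotated form cannot be pasted verbatim. One way to avoid that is to assemble the arguments programmatically. The sketch below uses only the flags shown above; the helper name `build_extend_command` and the example breadth values are assumptions.

```python
import shlex


def build_extend_command(save_dir, out_dir, domain, extend_nums, assistant_name, lang="en"):
    """Assemble the `--action extend` call as an argv list, without inline comments."""
    return [
        "python3", "generate_instruction.py",
        "--action", "extend",
        "--save_dir", save_dir,
        "--out_dir", out_dir,
        "--lang", lang,
        "--domain", domain,
        # One breadth value per depth 0..MAX_DEPTH-1, so MAX_DEPTH is the list length.
        "--extend_nums", ",".join(str(n) for n in extend_nums),
        "--max_depth", str(len(extend_nums)),
        "--assistant_name", assistant_name,
    ]


cmd = build_extend_command(
    save_dir="./en_data/demo_domain",
    out_dir="./en_data/demo_domain_exploration",
    domain="demo_domain",
    extend_nums=[3, 2],  # illustrative breadths, not recommended settings
    assistant_name="openai",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root, or the printed line can be pasted into a shell.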
141
+ ### Instruction-Tuning Data Generation
142
+ ```
143
+ python3 generate_instruction.py \
144
+ --action enrich \
145
+ --save_dir ./en_data/demo_domain_exploration \ # input dir containing the current domain tree for data generation
145
+ --out_dir ./en_data/demo_domain_generation \ # output dir for the domain tree with generated data
146
+ --lang <LANGUAGE> \ # currently supports 'en'
147
+ --domain demo_domain \ # domain for exploration
148
+ --enrich_nums <DATA_NUMBER_DEPTH_0>,...,<DATA_NUMBER_DEPTH_MAX_DEPTH> \ # number of examples per task at each depth
149
+ --enrich_batch_size <BATCH_SIZE> \ # batch size for data generation
150
+ --assistant_name <ASSISTANT_NAME> # currently supports openai and claude
152
+ ```
153
+
154
+ ### Task Pruning
155
+ ```
156
+ python3 generate_instruction.py \
157
+ --action prune \
158
+ --save_dir ./en_data/demo_domain_generation \ # input dir containing the current domain tree for task pruning
158
+ --out_dir ./en_data/demo_domain_pruning \ # output dir for the domain tree with the 'pruned_subtasks_name.json' file
159
+ --lang <LANGUAGE> \ # currently supports 'en'
160
+ --domain demo_domain \ # domain for exploration
161
+ --pruned_file ./en_data/demo_domain_pruning/pruned_subtasks_name.json \ # file listing the pruned tasks
162
+ --prune_threshold <PRUNE_THRESHOLD> \ # ROUGE-L overlap threshold between task names
163
+ --assistant_name <ASSISTANT_NAME> # currently supports openai and claude
165
+ ```
166
+
167
+ ### Data Filtering
168
+ ```
169
+ python3 generate_instruction.py \
170
+ --action filter \
171
+ --save_dir ./en_data/demo_domain_pruning \ # input dir containing the current domain tree for data filtering
171
+ --out_dir ./en_data/demo_domain_filtering \ # output dir for the domain tree with filtered data
172
+ --lang <LANGUAGE> \ # currently supports 'en'
173
+ --domain demo_domain \ # domain for exploration
174
+ --pruned_file ./en_data/demo_domain_pruning/pruned_subtasks_name.json \ # file listing the pruned tasks
175
+ --filter_threshold <FILTER_THRESHOLD> \ # ROUGE-L overlap threshold between instructions
176
+ --assistant_name <ASSISTANT_NAME> # currently supports openai and claude
178
+ ```
179
+
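Both pruning and filtering are described as thresholding ROUGE-L overlap, between task names and between instructions respectively. The sketch below shows what such a filter could look like. Whitespace tokenization, the F1 variant of ROUGE-L, and the greedy keep-first strategy are assumptions, and the repository may compute the score differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(candidate, reference):
    """ROUGE-L F1 over lowercase whitespace tokens."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_by_overlap(instructions, threshold):
    """Keep an instruction only if its ROUGE-L with every kept one is below threshold."""
    kept = []
    for text in instructions:
        if all(rouge_l_f(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


# Made-up instructions for demonstration.
demo = [
    "Solve the equation for x",
    "Solve the equation for y",
    "Summarize the paragraph in one sentence",
]
kept = filter_by_overlap(demo, threshold=0.7)
print(kept)
```

With this toy input, the second instruction overlaps too strongly with the first and is dropped, while the third is kept.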
180
+ ### Data Sampling
181
+ ```
182
+ python3 generate_instruction.py \
183
+ --action sample \
184
+ --save_dir ./en_data/demo_domain_filtering \ # input dir containing the current domain tree for data sampling
184
+ --out_dir ./en_data/demo_domain_sampling \ # output dir for the domain tree with sampled data
185
+ --lang <LANGUAGE> \ # currently supports 'en'
186
+ --domain demo_domain \ # domain for exploration
187
+ --pruned_file ./en_data/demo_domain_filtering/pruned_subtasks_name.json \ # file listing the pruned tasks
188
+ --sample_example_num <SAMPLE_EXAMPLES_NUM> \ # number of examples to sample
189
+ --sample_max_depth <SAMPLE_MAX_DEPTH> \ # max depth for data sampling
190
+ --sample_use_pruned \ # do not sample from pruned tasks
191
+ --assistant_name <ASSISTANT_NAME> # currently supports openai and claude
193
+ ```
194
+
195
+ ## Fine-tuning
196
+
197
+ We fine-tune LLaMA-7B with the following hyperparameters:
198
+
199
+ | Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
200
+ |:----------------|-------------------:|---------------:|--------:|------------:|--------------:|
201
+ | LLaMA 7B | 128 | 2e-5 | 3 | 2048 | 0 |
202
+
203
+ To reproduce the training procedure, please use the following command:
204
+
205
+ ```
206
+ deepspeed --num_gpus=8 ./train/train.py \
207
+ --deepspeed ./deepspeed_config/deepspeed_zero3_offload_config.json \
208
+ --model_name_or_path decapoda-research/llama-7b-hf \
209
+ --data_path ./en_data/demo_domain_sampling \
210
+ --fp16 True \
211
+ --output_dir ./training_results/explore-lm-7b-demo-domain \
212
+ --num_train_epochs 3 \
213
+ --per_device_train_batch_size 2 \
214
+ --per_device_eval_batch_size 2 \
215
+ --gradient_accumulation_steps 8 \
216
+ --evaluation_strategy "no" \
217
+ --model_max_length 512 \
218
+ --save_strategy "steps" \
219
+ --save_steps 2000 \
220
+ --save_total_limit 1 \
221
+ --learning_rate 2e-5 \
222
+ --weight_decay 0. \
223
+ --warmup_ratio 0.03 \
224
+ --lr_scheduler_type "cosine" \
225
+ --logging_steps 1 \
226
+ --prompt_type alpaca \
227
+ 2>&1 | tee ./training_logs/explore-lm-7b-demo-domain.log
228
+
229
+ python3 ./train/zero_to_fp32.py \
230
+ --checkpoint_dir ./training_results/explore-lm-7b-demo-domain \
231
+ --output_file ./training_results/explore-lm-7b-demo-domain/pytorch_model.bin
232
+ ```
233
+
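The global batch size of 128 in the table follows from the flags in the command above: the number of GPUs, times the per-device batch size, times the gradient accumulation steps. A quick check (the function name is illustrative):

```python
def global_batch_size(num_gpus, per_device_batch_size, gradient_accumulation_steps):
    """Effective number of examples contributing to each optimizer step."""
    return num_gpus * per_device_batch_size * gradient_accumulation_steps


effective = global_batch_size(
    num_gpus=8,                     # --num_gpus=8
    per_device_batch_size=2,        # --per_device_train_batch_size 2
    gradient_accumulation_steps=8,  # --gradient_accumulation_steps 8
)
print(effective)
```

When training on fewer GPUs, raising `--gradient_accumulation_steps` by the same factor keeps this product unchanged.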
234
+ ## Evaluation
235
+
236
+ The evaluation datasets for different domains are as follows:
237
+ - Brainstorming and Rewriting: From the corresponding categories in the translated test set of BELLE. ([en_eval_set.jsonl](./eval/question/en_eval_set.jsonl))
238
+ - Math: 500 questions randomly selected from the test set of MATH. ([MATH_eval_set_sample.jsonl](./eval/question/MATH_eval_set_sample.jsonl))
239
+
240
+ The evaluation metrics for different domains are as follows:
241
+ - Brainstorming and Rewriting: Both automatic and human evaluations following Vicuna.
242
+ - Math: Accuracy Rate metric in solving math problems.
243
+
244
+ The automatic evaluation commands for different domains are as follows:
245
+
246
+ ```
247
+ # Brainstorming and Rewriting Domain
248
+
249
+ # 1. Inference
250
+ python3 ./eval/generate.py \
251
+ --model_id <MODEL_ID> \
252
+ --model_path <MODEL_PATH> \
253
+ --question_file ./eval/question/en_eval_set.jsonl \
254
+ --answer_file ./eval/answer/<MODEL_ID>.jsonl \
255
+ --num_gpus 8 \
256
+ --num_beams 1 \
257
+ --temperature 0.7 \
258
+ --max_new_tokens 512 \
259
+ --prompt_type alpaca \
260
+ --do_sample
261
+
262
+ # 2. Evaluation
263
+ python3 ./eval/chatgpt_score.py \
264
+ --baseline_file ./eval/answer/<MODEL_1>.jsonl \ # answer of baseline model to compare with
265
+ --answer_file ./eval/answer/<MODEL_2>.jsonl \ # answer of evaluation model
266
+ --review_file ./eval/review/<MODEL_1>_cp_<MODEL_2>_<DOMAIN>.jsonl \ # review from chatgpt
267
+ --prompt_file ./eval/prompt/en_review_prompt_compare.jsonl \ # evaluation prompt for chatgpt
268
+ --target_classes <DOMAIN> \ # evaluation domain
269
+ --batch_size <BATCH_SIZE> \
270
+ --review_model "gpt-3.5-turbo-0301"
271
+ ```
272
+
273
+ ```
274
+ # Math Domain
275
+
276
+ # 1. Inference
277
+ python3 ./eval/generate.py \
278
+ --model_id <MODEL_ID> \
279
+ --model_path <MODEL_PATH> \
280
+ --question_file ./eval/question/MATH_eval_set_sample.jsonl \
281
+ --answer_file ./eval/answer/<MODEL_ID>.jsonl \
282
+ --num_gpus 8 \
283
+ --num_beams 10 \
284
+ --temperature 1.0 \
285
+ --max_new_tokens 512 \
286
+ --prompt_type alpaca
287
+
288
+ # 2. Evaluation
289
+ python3 ./eval/auto_eval.py \
290
+ --question_file ./eval/question/MATH_eval_set_sample.jsonl \
291
+ --answer_file ./eval/answer/<MODEL_ID>.jsonl
292
+ ```
293
+
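As a rough illustration of the Accuracy Rate metric, the sketch below extracts the final `\boxed{...}` answer from a model response and compares it with the reference. Whether `auto_eval.py` works this way is an assumption; the exact-match comparison after removing spaces and the example strings are illustrative only.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth, out = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces


def is_correct(prediction, reference):
    """Exact match on the boxed answer, ignoring spaces."""
    answer = extract_boxed(prediction)
    return answer is not None and answer.replace(" ", "") == reference.replace(" ", "")


def accuracy_rate(predictions, references):
    """Percentage of predictions whose boxed answer matches the reference."""
    correct = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Made-up responses and reference answers.
predictions = [
    "The answer is \\boxed{\\frac{1}{2}}.",
    "So x = \\boxed{3}",
    "no final answer",
]
references = ["\\frac{1}{2}", "4", "7"]
acc = accuracy_rate(predictions, references)
print(f"{acc:.1f}")
```

On the 500-question sample used here, each correct answer contributes 0.2 points, so an accuracy of 6.8 corresponds to 34 correct answers.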
294
+ ## Limitations
295
+
296
+ Explore-Instruct is still under development and needs considerable improvement. Our work focuses on enhancing domain-specific instruction coverage and does not address other aspects of instruction-tuning, such as generating complex and challenging instructions or mitigating toxic and harmful instructions. Future work is needed to explore the potential of our approach in these areas.
297
+
298
+ ## License
299
+
300
+ Explore-Instruct is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes. The weights of Explore-LM models are also CC BY NC 4.0 (allowing only non-commercial use).
301
+
302
+ ## Citation
303
+
304
+ If you find this work relevant to your research or applications, please feel free to cite it:
305
+ ```
306
+ @misc{wan2023explore,
307
+ title={Explore-Instruct: Enhancing Domain-Specific Instruction Coverage through Active Exploration},
308
+ author={Wan, Fanqi and Huang, Xinting and Yang, Tao and Quan, Xiaojun and Bi, Wei and Shi, Shuming},
309
+ year={2023},
310
+ eprint={2305.xxxxx},
311
+ archivePrefix={arXiv},
312
+ primaryClass={cs.CL}
313
+ }
314
+ ```
315
+
316
+ ## Acknowledgments
317
+
318
+ This repo benefits from [Stanford-Alpaca](https://github.com/tatsu-lab/stanford_alpaca) and [Vicuna](https://github.com/lm-sys/FastChat). Thanks for their wonderful work!
319
+