ghh001 committed on
Commit
7470d35
1 Parent(s): 0bea106

Rename README_CN.md to README_EN.md

Files changed (2)
  1. README_CN.md +0 -145
  2. README_EN.md +308 -0
README_CN.md DELETED
@@ -1,145 +0,0 @@
- - [1. Differences from knowlm-13b-zhixi](#1-differences-from-knowlm-13b-zhixi)
- - [2. Information Extraction Templates](#2-information-extraction-templates)
- - [3. Common Relation Types](#3-common-relation-types)
- - [4. Conversion Scripts](#4-conversion-scripts)
- - [5. Ready-made Datasets](#5-ready-made-datasets)
- - [6. Usage](#6-usage)
- - [7. Evaluation](#7-evaluation)
-
-
- # 1. Differences from knowlm-13b-zhixi
-
- Compared to zjunlp/knowlm-13b-zhixi, zjunlp/knowlm-13b-ie is slightly more practical for information extraction, at the cost of reduced general applicability.
-
- zjunlp/knowlm-13b-ie samples roughly 10% of the data from Chinese and English information extraction datasets and then applies negative sampling. For example, if dataset A has the label set [a, b, c, d, e, f], we first sample 10% of the data from A. A given sample s may contain only labels a and b. We then randomly add relations it does not actually contain, such as c and d drawn from a specified list of candidate relations. When the model encounters these added relations, it should output text such as 'NAN'. This approach gives the model some ability to produce 'NAN' outputs, strengthening its information extraction capability while weakening its generalization ability.
-
-
-
- # 2. Information Extraction Templates
- Relation extraction (RE) supports the following templates:
-
- ```python
- relation_template_zh = {
-     0:'已知候选的关系列表:{s_schema},请你根据关系列表,从以下输入中抽取出可能存在的头实体与尾实体,并给出对应的关系三元组。请按照{s_format}的格式回答。',
-     1:'我将给你个输入,请根据关系列表:{s_schema},从输入中抽取出可能包含的关系三元组,并以{s_format}的形式回答。',
-     2:'我希望你根据关系列表从给定的输入中抽取可能的关系三元组,并以{s_format}的格式回答,关系列表={s_schema}。',
-     3:'给定的关系列表是{s_schema}\n根据关系列表抽取关系三元组,在这个句子中可能包含哪些关系三元组?请以{s_format}的格式回答。',
- }
-
- relation_int_out_format_zh = {
-     0:['"(头实体,关系,尾实体)"', relation_convert_target0],
-     1:['"头实体是\n关系是\n尾实体是\n\n"', relation_convert_target1],
-     2:['"关系:头实体,尾实体\n"', relation_convert_target2],
-     3:["JSON字符串[{'head':'', 'relation':'', 'tail':''}, ]", relation_convert_target3],
- }
-
- relation_template_en = {
-     0:'Identify the head entities (subjects) and tail entities (objects) in the following text and provide the corresponding relation triples from relation list {s_schema}. Please provide your answer as a list of relation triples in the form of {s_format}.',
-     1:'From the given text, extract the possible head entities (subjects) and tail entities (objects) and give the corresponding relation triples. The relations are {s_schema}. Please format your answer as a list of relation triples in the form of {s_format}.',
- }
-
- relation_int_out_format_en = {
-     0:['(Subject, Relation, Object)', relation_convert_target0_en],
-     1:["{'head':'', 'relation':'', 'tail':''}", relation_convert_target1_en],
- }
-
- ```
-
-
- The schema ({s_schema}) and output format ({s_format}) placeholders are embedded in these templates and must be specified by the user.
- For a more comprehensive understanding of the templates, please refer to the files [ner_template.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/ner_template.py), [re_template.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/re_template.py), and [ee_template.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/ee_template.py).
-
-
-
- # 3. Common Relation Types
-
- ```python
- {
-     '组织': ['别名', '位于', '类型', '成立时间', '解散时间', '成员', '创始人', '事件', '子组织', '产品', '成就', '运营'],
-     '医学': ['别名', '病因', '症状', '可能后果', '包含', '发病部位'],
-     '事件': ['别名', '类型', '发生时间', '发生地点', '参与者', '主办方', '提名者', '获奖者', '赞助者', '获奖作品', '获胜者', '奖项'],
-     '运输': ['别名', '位于', '类型', '属于', '途径', '开通时间', '创建时间', '车站等级', '长度', '面积'],
-     '人造物件': ['别名', '类型', '受众', '成就', '品牌', '产地', '长度', '宽度', '高度', '重量', '价值', '制造商', '型号', '生产时间', '材料', '用途', '发现者或发明者'],
-     '生物': ['别名', '学名', '类型', '分布', '父级分类单元', '主要食物来源', '用途', '长度', '宽度', '高度', '重量', '特征'],
-     '建筑': ['别名', '类型', '位于', '临近', '名称由来', '长度', '宽度', '高度', '面积', '创建时间', '创建者', '成就', '事件'],
-     '自然科学': ['别名', '类型', '性质', '生成物', '用途', '组成', '产地', '发现者或发明者'],
-     '地理地区': ['别名', '类型', '所在行政领土', '接壤', '事件', '面积', '人口', '行政中心', '产业', '气候'],
-     '作品': ['别名', '类型', '受众', '产地', '成就', '导演', '编剧', '演员', '平台', '制作者', '改编自', '包含', '票房', '角色', '作曲者', '作词者', '表演者', '出版时间', '出版商', '作者'],
-     '人物': ['别名', '籍贯', '国籍', '民族', '朝代', '出生时间', '出生地点', '死亡时间', '死亡地点', '专业', '学历', '作品', '职业', '职务', '成就', '所属组织', '父母', '配偶', '兄弟姊妹', '亲属', '同事', '参与'],
-     '天文对象': ['别名', '类型', '坐标', '发现者', '发现时间', '名称由来', '属于', '直径', '质量', '公转周期', '绝对星等', '临近']
- }
- ```
-
- Here, [schema](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/schema.py) provides 12 text topics and the common relation types under each topic.
-
- # 4. Conversion Scripts
-
- Two scripts, [convert.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert.py) and [convert_test.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert_test.py), are provided to convert data uniformly into instructions that can be fed directly to KnowLM. Before running convert.py, please refer to the [data](https://github.com/zjunlp/DeepKE/tree/main/example/llm/InstructKGC/data) directory, which contains the expected data format for each task.
-
- ```bash
- python kg2instruction/convert.py \
-     --src_path data/NER/sample.json \
-     --tgt_path data/NER/processed.json \
-     --schema_path data/NER/schema.json \
-     --language zh \  # The templates and conversion scripts differ by language
-     --task NER \  # One of the three tasks ['RE', 'NER', 'EE']
-     --sample 0 \  # If -1, randomly sample one of the 4 instruction and 4 output formats; otherwise use the specified instruction format, -1<=sample<=3
-     --all  # Whether to set the extraction type list in the instruction to the full schema
- ```
-
- [convert_test.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert_test.py) does not require the data to have label fields (`entity`, `relation`, `event`); it only needs an `input` field and a `schema_path`, making it suitable for processing test data.
-
- ```bash
- python kg2instruction/convert_test.py \
-     --src_path data/NER/sample.json \
-     --tgt_path data/NER/processed.json \
-     --schema_path data/NER/schema.json \
-     --language zh \
-     --task NER \
-     --sample 0
- ```
-
-
- # 5. Ready-made Datasets
-
- Below are some ready-made, processed datasets:
-
- | Name | Download | Quantity | Description |
- | ---- | -------- | -------- | ----------- |
- | KnowLM-IE.json | [Google drive](https://drive.google.com/file/d/1hY_R6aFgW4Ga7zo41VpOVOShbTgBqBbL/view?usp=sharing) <br/> [HuggingFace](https://huggingface.co/datasets/zjunlp/KnowLM-IE) | 281860 | The dataset described in [InstructIE](https://arxiv.org/abs/2305.11527) |
- | KnowLM-ke | [HuggingFace](https://huggingface.co/datasets/zjunlp/knowlm-ke) | XXXX | All instruction data (general, IE, code, CoT, etc.) used to train [zjunlp/knowlm-13b-zhixi](https://huggingface.co/zjunlp/knowlm-13b-zhixi) |
-
-
- `KnowLM-IE.json`: contains the fields `'id'` (unique identifier), `'cate'` (text topic), `'instruction'` (extraction instruction), `'input'` (input text), `'output'` (output text), and `'relation'` (relation triples). Extraction instructions and outputs can be freely constructed from `'relation'`. `'instruction'` comes in 16 formats (4 prompts × 4 output formats), and `'output'` is text generated in the output format specified by `'instruction'`.
-
-
- `KnowLM-ke`: contains only the fields `'instruction'`, `'input'`, and `'output'`. The files `ee-en.json`, `ee_train.json`, `ner-en.json`, `ner_train.json`, `re-en.json`, and `re_train.json` in its directory are Chinese and English IE instruction data.
-
-
-
- # 6. Usage
- We provide a script, [inference.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/src/inference.py), for running inference directly with the `zjunlp/knowlm-13b-ie` model. Please refer to [README.md](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/README.md) for environment configuration and other details.
-
- ```bash
- CUDA_VISIBLE_DEVICES="0" python src/inference.py \
-     --model_name_or_path 'models/knowlm-13b-ie' \
-     --model_name 'llama' \
-     --input_file 'data/NER/processed.json' \
-     --output_file 'results/ner_test.json' \
-     --fp16
- ```
-
- If GPU memory is insufficient, you can use `--bits 8` or `--bits 4`.
-
-
- # 7. Evaluation
- We provide a script at [evaluate.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/evaluate.py) to convert the model's string output into lists and compute the F1 score.
-
- ```bash
- python kg2instruction/evaluate.py \
-     --standard_path data/NER/processed.json \
-     --submit_path data/NER/processed.json \
-     --task ner \
-     --language zh
- ```
-
README_EN.md ADDED
@@ -0,0 +1,308 @@
+ ---
+ license: apache-2.0
+ ---
+
+
+ - [1.Differences from knowlm-13b-zhixi](#1differences-from-knowlm-13b-zhixi)
+ - [2. Information Extraction Template](#2-information-extraction-template)
+ - [3.Common relationship types](#3common-relationship-types)
+ - [4.Datasets](#4datasets)
+ - [5.Convert script](#5convert-script)
+ - [6.Usage](#6usage)
+ - [7.Evaluate](#7evaluate)
+
+
+
+ # 1.Differences from knowlm-13b-zhixi
+ Compared to zjunlp/knowlm-13b-zhixi, zjunlp/knowlm-13b-ie exhibits slightly stronger practicality in information extraction, but with a decrease in general applicability.
+
+ zjunlp/knowlm-13b-ie samples around 10% of the data from Chinese and English information extraction datasets and then applies negative sampling. For instance, if dataset A contains the labels [a, b, c, d, e, f], we first sample 10% of the data from A. A given sample s might contain only labels a and b. We randomly add relations it does not originally have, such as c and d, drawn from the specified list of candidate relations. When encountering these additional relations, the model may output text such as 'NAN'. This method equips the model with some ability to generate 'NAN' outputs, enhancing its information extraction capability while weakening its generalization ability.
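The negative sampling described above can be sketched as follows. This is a minimal illustration, not the repository's actual implementation; the function and field names (`negative_sample`, `num_neg`) are assumptions made for the example:

```python
import random

def negative_sample(example, candidate_relations, num_neg=2):
    """Add relations absent from the example; their target output is 'NAN'."""
    present = {t["relation"] for t in example["relation"]}
    absent = [r for r in candidate_relations if r not in present]
    negatives = random.sample(absent, min(num_neg, len(absent)))
    # The instruction lists both real and negative relations; the gold
    # output maps each negative relation to 'NAN'.
    targets = {r: "NAN" for r in negatives}
    for t in example["relation"]:
        targets.setdefault(t["relation"], f"({t['head']}, {t['relation']}, {t['tail']})")
    return negatives, targets

example = {"input": "some text", "relation": [{"head": "A", "relation": "a", "tail": "B"}]}
negs, targets = negative_sample(example, ["a", "b", "c", "d", "e", "f"])
```

Training on such augmented samples teaches the model to emit 'NAN' for relations that the input does not support.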
+
+
+
+
+ # 2. Information Extraction Template
+ A template is used to construct the `instruction` that is input to the model. It consists of three parts:
+ 1. Task description
+ 2. List of candidate labels {s_schema} (optional)
+ 3. Structural output format {s_format}
+
+
+ Templates with a specified list of candidate labels:
+ ```json
+ NER: "You are an expert specialized in entity extraction. With the candidate entity types list: {s_schema}, please extract possible entities from the input below, outputting NAN if a certain entity does not exist. Respond in the format {s_format}."
+ RE: "You are an expert in extracting relation triples. With the candidate relation list: {s_schema}, please extract the possible head entities and tail entities from the input below and provide the corresponding relation triples. If a relation does not exist, output NAN. Please answer in the {s_format} format."
+ EE: "You are a specialist in event extraction. Given the candidate event dictionary: {s_schema}, please extract any possible events from the input below. If an event does not exist, output NAN. Please answer in the format of {s_format}."
+ EET: "As an event analysis specialist, you need to review the input and determine possible events based on the event type directory: {s_schema}. All answers should be based on the {s_format} format. If the event type does not match, please mark with NAN."
+ EEA: "You are an expert in event argument extraction. Given the event dictionary: {s_schema1}, and the event type and trigger words: {s_schema2}, please extract possible arguments from the following input. If an event argument does not exist, output NAN. Please respond in the {s_format} format."
+ ```
+
+
+ Templates without a specified list of candidate labels:
+ ```json
+ NER: "Analyze the text content and extract the clear entities. Present your findings in the {s_format} format, skipping any ambiguous or uncertain parts."
+ RE: "Please extract all the relation triples from the text and present the results in the format of {s_format}. Ignore those entities that do not conform to the standard relation template."
+ EE: "Please analyze the following text, extract all identifiable events, and present them in the specified format {s_format}. If certain information does not constitute an event, simply skip it."
+ EET: "Examine the following text content and extract any events you deem significant. Provide your findings in the {s_format} format."
+ EEA: "Please extract possible arguments based on the event type and trigger word {s_schema2} from the input below. Answer in the format of {s_format}."
+ ```
+
+ <details>
+ <summary><b>Candidate Labels {s_schema}</b></summary>
+
+
+ ```json
+ NER(Ontonotes): ["date", "organization", "person", "geographical social political", "national religious political", "facility", "cardinal", "location", "work of art", ...]
+ RE(NYT): ["ethnicity", "place lived", "geographic distribution", "company industry", "country of administrative divisions", "administrative division of country", ...]
+ EE(ACE2005): {"declare bankruptcy": ["organization"], "transfer ownership": ["artifact", "place", "seller", "buyer", "beneficiary"], "marry": ["person", "place"], ...}
+ EET(GENIA): ["cell type", "cell line", "protein", "RNA", "DNA"]
+ EEA(ACE2005): {"declare bankruptcy": ["organization"], "transfer ownership": ["artifact", "place", "seller", "buyer", "beneficiary"], "marry": ["person", "place"], ...}
+ ```
+ </details>
+
+
+
+ <details>
+ <summary><b>Structural Output Format {s_format}</b></summary>
+
+
+ ```json
+ NER: (Entity,Entity Type)
+ RE: (Subject,Relation,Object)
+ EE: (Event Trigger,Event Type,Argument1#Argument Role1;Argument2#Argument Role2)
+ EET: (Event Trigger,Event Type)
+ EEA: (Event Trigger,Event Type,Argument1#Argument Role1;Argument2#Argument Role2)
+ ```
+
+ </details>
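Putting the three parts together, an instruction is produced by substituting the candidate-label list and the output format into a template's placeholders. A minimal sketch (the template text is abbreviated and `build_instruction` is an illustrative name, not the repository's API):

```python
def build_instruction(template: str, schema, s_format: str) -> str:
    # Substitute the candidate-label list and output format into the
    # {s_schema} / {s_format} placeholders of a template string.
    return template.format(s_schema=str(schema), s_format=s_format)

ner_template = ("With the candidate entity types list: {s_schema}, please extract "
                "possible entities from the input below. Respond in the format {s_format}.")
instruction = build_instruction(ner_template, ["person", "location"], "(Entity,Entity Type)")
```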
+
+
+
+ For a more comprehensive understanding of the templates, please refer to the files [ner_converter.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert/converter/ner_converter.py), [re_converter.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert/converter/re_converter.py), [ee_converter.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert/converter/ee_converter.py), [eet_converter.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert/converter/eet_converter.py), [eea_converter.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert/converter/eea_converter.py) and [configs](https://github.com/zjunlp/DeepKE/tree/main/example/llm/InstructKGC/configs).
+
+
+
+ # 3.Common relationship types
+
+
+ ```python
+ wiki_cate_schema_en = {
+     'Person': ['place of birth', 'date of birth', 'country of citizenship', 'occupation', 'work', 'achievement', 'ancestral home', 'position', 'spouse', 'parent', 'alternative name', 'affiliated organization', 'date of death', 'sibling', 'place of death'],
+     'Geographic_Location': ['located in', 'alternative name', 'population', 'capital', 'area', 'achievement', 'length', 'width', 'elevation'],
+     'Building': ['located in', 'alternative name', 'achievement', 'event', 'creation time', 'width', 'length', 'creator', 'height', 'area', 'named after'],
+     'Works': ['author', 'publication date', 'alternative name', 'country of origin', 'based on', 'cast member', 'publisher', 'achievement', 'performer', 'director', 'producer', 'screenwriter', 'tracklist', 'composer', 'lyricist', 'production company', 'box office', 'publishing platform'],
+     'Creature': ['distribution', 'parent taxon', 'length', 'main food source', 'alternative name', 'taxon name', 'weight', 'width', 'height'],
+     'Artificial_Object': ['alternative name', 'brand', 'production date', 'made from material', 'country of origin', 'has use', 'manufacturer', 'discoverer or inventor'],
+     'Natural_Science': ['alternative name', 'properties', 'composition', 'product', 'has use', 'country of origin', 'discoverer or inventor', 'causes'],
+     'Organization': ['located in', 'alternative name', 'has subsidiary', 'date of incorporation', 'product', 'achievement', 'member', 'founded by', 'dissolution time', 'event'],
+     'Transport': ['located in', 'inception', 'connecting line', 'date of official opening', 'pass', 'area', 'alternative name', 'length', 'width', 'achievement', 'class of station'],
+     'Event': ['participant', 'scene', 'occurrence time', 'alternative name', 'sponsor', 'casualties', 'has cause', 'has effect', 'organizer', 'award received', 'winner'],
+     'Astronomy': ['alternative name', 'of', 'time of discovery or invention', 'discoverer or inventor', 'name after', 'absolute magnitude', 'diameter', 'mass'],
+     'Medicine': ['symptoms', 'alternative name', 'affected body part', 'possible consequences', 'etiology']
+ }
+ ```
+
+ Here [schema](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/schema.py) provides 12 text topics and the common relationship types under each topic.
+
+
+
+
+ # 4.Datasets
+
+ | Name | Download | Quantity | Description |
+ | ---- | -------- | -------- | ----------- |
+ | InstructIE-train | [Google drive](https://drive.google.com/file/d/1VX5buWC9qVeVuudh_mhc_nC7IPPpGchQ/view?usp=drive_link) <br/> [HuggingFace](https://huggingface.co/datasets/zjunlp/KnowLM-IE) <br/> [Baidu Netdisk](https://pan.baidu.com/s/1xXVrjkinw4cyKKFBR8BwQw?pwd=x4s7) | 300k+ | InstructIE train set, which is constructed by weak supervision and may contain some noisy data |
+ | InstructIE-valid | [Google drive](https://drive.google.com/file/d/1EMvqYnnniKCGEYMLoENE1VD6DrcQ1Hhj/view?usp=drive_link) <br/> [HuggingFace](https://huggingface.co/datasets/zjunlp/KnowLM-IE) <br/> [Baidu Netdisk](https://pan.baidu.com/s/11u_f_JT30W6B5xmUPC3enw?pwd=71ie) | 2000+ | InstructIE validation set |
+ | InstructIE-test | [Google drive](https://drive.google.com/file/d/1WdG6_ouS-dBjWUXLuROx03hP-1_QY5n4/view?usp=drive_link) <br/> [HuggingFace](https://huggingface.co/datasets/zjunlp/KnowLM-IE) <br/> [Baidu Netdisk](https://pan.baidu.com/s/1JiRiOoyBVOold58zY482TA?pwd=cyr9) | 2000+ | InstructIE test set |
+ | train.json, valid.json | [Google drive](https://drive.google.com/file/d/1vfD4xgToVbCrFP2q-SD7iuRT2KWubIv9/view?usp=sharing) | 5,000 | Preliminary training set and test set for the task "Instruction-Driven Adaptive Knowledge Graph Construction" in [CCKS2023 Open Knowledge Graph Challenge](https://tianchi.aliyun.com/competition/entrance/532080/introduction), randomly selected from instruct_train.json |
+
+
+ - `InstructIE-train` contains two files, `InstructIE-zh.json` and `InstructIE-en.json`, each with the fields `'id'` (unique identifier), `'cate'` (text category), `'entity'`, and `'relation'` (triples). Extraction instructions and outputs can be freely constructed from `'entity'` and `'relation'`.
+ - `InstructIE-valid` and `InstructIE-test` are the validation and test sets, respectively, covering both `zh` and `en`.
+ - `train.json`: same fields as `KnowLM-IE.json`; `'instruction'` and `'output'` have only one format, and extraction instructions and outputs can also be freely constructed from `'relation'`.
+ - `valid.json`: same fields as `train.json`, but with more accurate annotations achieved through crowdsourcing.
+
+
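Since extraction outputs can be freely constructed from the `'relation'` field, a gold output string for the "(Subject,Relation,Object)" format can be rendered with a few lines. A minimal sketch, assuming triples shaped like the dataset's `{'head', 'relation', 'tail'}` records (the helper name is illustrative):

```python
def triples_to_output(relations):
    # Render gold triples in the "(Subject,Relation,Object)" output format,
    # one triple per line.
    return "\n".join(f"({r['head']},{r['relation']},{r['tail']})" for r in relations)

relations = [
    {"head": "Dionisio Pérez Gutiérrez", "relation": "place of birth", "tail": "Grazalema"},
    {"head": "Dionisio Pérez Gutiérrez", "relation": "place of death", "tail": "Madrid"},
]
output = triples_to_output(relations)
```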
+ <details>
+ <summary><b>Explanation of each field</b></summary>
+
+
+ | Field | Description |
+ | :---------: | :----------------------------------------------------------: |
+ | id | Unique identifier |
+ | cate | Text topic of the input (12 topics in total) |
+ | input | Model input text (all triples involved within need to be extracted) |
+ | instruction | Instruction for the model to perform the extraction task |
+ | output | Expected model output |
+ | entity | Entities (entity, entity_type) involved in the input |
+ | relation | Relation triples (head, relation, tail) involved in the input |
+
+
+ </details>
+
+
+ <details>
+ <summary><b>Example of data</b></summary>
+
+
+ ```json
+ {
+     "id": "6e4f87f7f92b1b9bd5cb3d2c3f2cbbc364caaed30940a1f8b7b48b04e64ec403",
+     "cate": "Person",
+     "input": "Dionisio Pérez Gutiérrez (born 1872 in Grazalema (Cádiz) - died 23 February 1935 in Madrid) was a Spanish writer, journalist, and gastronome. He has been called \"one of Spain's most authoritative food writers\" and was an early adopter of the term Hispanidad.\nHis pen name, \"Post-Thebussem\", was chosen as a show of support for Mariano Pardo de Figueroa, who went by the handle \"Dr. Thebussem\".",
+     "entity": [
+         {"entity": "Dionisio Pérez Gutiérrez", "entity_type": "human"},
+         {"entity": "Post-Thebussem", "entity_type": "human"},
+         {"entity": "Grazalema", "entity_type": "geographic_region"},
+         {"entity": "Cádiz", "entity_type": "geographic_region"},
+         {"entity": "Madrid", "entity_type": "geographic_region"},
+         {"entity": "gastronome", "entity_type": "event"},
+         {"entity": "Spain", "entity_type": "geographic_region"},
+         {"entity": "Hispanidad", "entity_type": "architectural_structure"},
+         {"entity": "Mariano Pardo de Figueroa", "entity_type": "human"},
+         {"entity": "23 February 1935", "entity_type": "time"}
+     ],
+     "relation": [
+         {"head": "Dionisio Pérez Gutiérrez", "relation": "country of citizenship", "tail": "Spain"},
+         {"head": "Dionisio Pérez Gutiérrez", "relation": "place of birth", "tail": "Grazalema"},
+         {"head": "Dionisio Pérez Gutiérrez", "relation": "place of death", "tail": "Madrid"},
+         {"head": "Mariano Pardo de Figueroa", "relation": "country of citizenship", "tail": "Spain"},
+         {"head": "Dionisio Pérez Gutiérrez", "relation": "alternative name", "tail": "Post-Thebussem"},
+         {"head": "Dionisio Pérez Gutiérrez", "relation": "date of death", "tail": "23 February 1935"}
+     ]
+ }
+ ```
+
+ </details>
+
+
+ # 5.Convert script
+
+ A script named [convert.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert.py) is provided to facilitate the uniform conversion of data into KnowLM instructions. Before running convert.py, please refer to the [data](https://github.com/zjunlp/DeepKE/tree/main/example/llm/InstructKGC/data) directory, which contains the expected data format for each task.
+
+
+
+ ```bash
+ python kg2instruction/convert.py \
+     --src_path data/NER/sample.json \
+     --tgt_path data/NER/processed.json \
+     --schema_path data/NER/schema.json \
+     --language zh \  # Specifies the language for the conversion script and template; options are ['zh', 'en']
+     --task NER \  # Specifies the task type: one of ['RE', 'NER', 'EE', 'EET', 'EEA']
+     --sample -1 \  # If -1, randomly samples one of 20 instruction and 4 output formats; if a specific number, uses the corresponding instruction format; range is -1<=sample<20
+     --neg_ratio 1 \  # Sets the negative sampling ratio for all samples
+     --neg_schema 1 \  # Sets the negative sampling ratio from the schema
+     --random_sort  # Determines whether to randomly sort the list of schemas in the instruction
+ ```
+
+
+ The `schema_path` specifies the path to a schema file (a JSON file). The schema file consists of three lines of JSON strings, organized in a fixed format. Taking Named Entity Recognition (NER) as an example, the meanings of each line are as follows:
+
+ ```
+ ["BookTitle", "Address", "Movie", ...]   # List of entity types
+ []                                       # Empty list
+ {}                                       # Empty dictionary
+ ```
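Because each of the three lines is an independent JSON value, such a file can be parsed line by line. A minimal sketch assuming the NER layout above (illustrative code, not the repository's loader):

```python
import json

# A three-line schema file: each line is an independent JSON value.
schema_text = '["BookTitle", "Address", "Movie"]\n[]\n{}'

# Parse one JSON value per line: entity types, argument roles, type dictionary.
labels, roles, type_dict = (json.loads(line) for line in schema_text.splitlines())
```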
+
+
+ <details>
+ <summary><b>More</b></summary>
+
+
+
+ ```
+ For Relation Extraction (RE) tasks:
+ []   # Empty list
+ ["Founder", "Number", "RegisteredCapital", ...]   # List of relation types
+ {}   # Empty dictionary
+
+
+ For Event Extraction (EE) tasks:
+ ["Social Interaction-Thanks", "Organizational Action-OpeningCeremony", "Competition Action-Withdrawal", ...]   # List of event types
+ ["DismissingParty", "TerminatingParty", "Reporter", "ArrestedPerson"]   # List of argument roles
+ {"OrganizationalRelation-Layoff": ["LayoffParty", "NumberLaidOff", "Time"], "LegalAction-Sue": ["Plaintiff", "Defendant", "Time"], ...}   # Dictionary of event types
+
+
+ For EET tasks:
+ ["Social Interaction-Thanks", "Organizational Action-OpeningCeremony", "Competition Action-Withdrawal", ...]   # List of event types
+ []   # Empty list
+ {}   # Empty dictionary
+
+
+ For Event Extraction with Arguments (EEA) tasks:
+ ["Social Interaction-Thanks", "Organizational Action-OpeningCeremony", "Competition Action-Withdrawal", ...]   # List of event types
+ ["DismissingParty", "TerminatingParty", "Reporter", "ArrestedPerson"]   # List of argument roles
+ {"OrganizationalRelation-Layoff": ["LayoffParty", "NumberLaidOff", "Time"], "LegalAction-Sue": ["Plaintiff", "Defendant", "Time"], ...}   # Dictionary of event types
+ ```
+
+ </details>
+
+
+
+ [convert_test.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/convert_test.py) does not require the data to have label fields (`entity`, `relation`, `event`); it only needs an `input` field and a `schema_path`, making it suitable for processing test data.
+
+ ```bash
+ python kg2instruction/convert_test.py \
+     --src_path data/NER/sample.json \
+     --tgt_path data/NER/processed.json \
+     --schema_path data/NER/schema.json \
+     --language zh \
+     --task NER \
+     --sample 0
+ ```
+
+
+ Here is an example of data conversion for a Named Entity Recognition (NER) task:
+
+ ```json
+ Before Transformation:
+ {
+     "input": "In contrast, the rain-soaked battle between Qingdao Sea Bulls and Guangzhou Songri Team, although also ended in a 0:0 draw, was uneventful.",
+     "entity": [{"entity": "Guangzhou Songri Team", "entity_type": "Organizational Structure"}, {"entity": "Qingdao Sea Bulls", "entity_type": "Organizational Structure"}]
+ }
+
+ After Transformation:
+ {
+     "id": "e88d2b42f8ca14af1b77474fcb18671ed3cacc0c75cf91f63375e966574bd187",
+     "instruction": "Please identify and list the entity types mentioned in the given text ['Organizational Structure', 'Person', 'Geographical Location']. If a type doesn't exist, please indicate it as NAN. Provide your answer in the format (entity, entity type).",
+     "input": "In contrast, the rain-soaked battle between Qingdao Sea Bulls and Guangzhou Songri Team, although also ended in a 0:0 draw, was uneventful.",
+     "output": "(Qingdao Sea Bulls,Organizational Structure)\n(Guangzhou Songri Team,Organizational Structure)\nNAN\nNAN"
+ }
+ ```
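The transformation above can be sketched in a few lines. This is a simplified illustration of the conversion logic, not the convert.py implementation; the name `convert_ner` is hypothetical, and template sampling is omitted:

```python
import hashlib

def convert_ner(sample, schema):
    # Build the instruction from the candidate entity-type list, then render
    # gold entities as "(entity, entity type)" lines, with one NAN line per
    # candidate type that does not occur in the sample.
    instruction = (f"Please identify and list the entity types mentioned in the given text "
                   f"{schema}. If a type doesn't exist, please indicate it as NAN. "
                   f"Provide your answer in the format (entity, entity type).")
    found_types = {e["entity_type"] for e in sample["entity"]}
    lines = [f"({e['entity']},{e['entity_type']})" for e in sample["entity"]]
    lines += ["NAN"] * sum(1 for t in schema if t not in found_types)
    return {
        "id": hashlib.sha256(sample["input"].encode()).hexdigest(),
        "instruction": instruction,
        "input": sample["input"],
        "output": "\n".join(lines),
    }

sample = {
    "input": "In contrast, the rain-soaked battle between Qingdao Sea Bulls and Guangzhou Songri Team, although also ended in a 0:0 draw, was uneventful.",
    "entity": [
        {"entity": "Qingdao Sea Bulls", "entity_type": "Organizational Structure"},
        {"entity": "Guangzhou Songri Team", "entity_type": "Organizational Structure"},
    ],
}
record = convert_ner(sample, ["Organizational Structure", "Person", "Geographical Location"])
```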
+
+ Before conversion, the data format needs to adhere to the structure specified in the `DeepKE/example/llm/InstructKGC/data` directory for each task (such as NER, RE, EE). Taking the NER task as an example, the input text should be labeled as the `input` field, and the annotated data should be labeled as the `entity` field, a list of dictionaries containing `entity` and `entity_type` key-value pairs.
+
+ After data conversion, you will obtain structured data containing the `input` text, `instruction` (providing detailed instructions about the candidate entity types ['Organization', 'Person', 'Location'] and the expected output format), and `output` (listing all entity information recognized in the `input` in the form of (entity, entity type)).
+
+
+
+
+ # 6.Usage
+ We provide a script, [inference.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/src/inference.py), for direct inference using the `zjunlp/knowlm-13b-ie` model. Please refer to the [README.md](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/README.md) for environment configuration and other details.
+
+ ```bash
+ CUDA_VISIBLE_DEVICES="0" python src/inference.py \
+     --model_name_or_path 'models/knowlm-13b-ie' \
+     --model_name 'llama' \
+     --input_file 'data/NER/processed.json' \
+     --output_file 'results/ner_test.json' \
+     --fp16
+ ```
+
+ If GPU memory is insufficient, you can use `--bits 8` or `--bits 4`.
+
+
+
+ # 7.Evaluate
+
+ We provide a script at [evaluate.py](https://github.com/zjunlp/DeepKE/blob/main/example/llm/InstructKGC/kg2instruction/evaluate.py) to convert the string output of the model into lists and calculate the F1 score.
+
+ ```bash
+ python kg2instruction/evaluate.py \
+     --standard_path data/NER/processed.json \
+     --submit_path data/NER/processed.json \
+     --task NER \
+     --language zh
+ ```
+
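The evaluation step, parsing "(head,relation,tail)" output strings back into sets of triples and scoring them, can be sketched as follows. This is a simplified illustration, not the evaluate.py implementation:

```python
def parse_triples(text):
    # Parse lines like "(head,relation,tail)" into a set of tuples; skip NAN lines.
    triples = set()
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("(") and line.endswith(")"):
            triples.add(tuple(part.strip() for part in line[1:-1].split(",")))
    return triples

def f1_score(gold_text, pred_text):
    # Exact-match precision/recall over triple sets, combined into F1.
    gold, pred = parse_triples(gold_text), parse_triples(pred_text)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = "(A,founder,B)\n(C,located in,D)"
pred = "(A,founder,B)\nNAN"
score = f1_score(gold, pred)  # precision 1.0, recall 0.5
```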