	---
	language:
	- en
	- zh
	tags:
	- GENIUS
	- conditional text generation
	- sketch-based text generation
	- data augmentation

	license: apache-2.0
	datasets:
	- c4
	- beyond/chinese_clean_passages_80m


	widget:
	- text: "[MASK]酸菜鱼火锅[MASK]很美味，味道绝了[MASK]周末真开心[MASK]"
	  example_title: "Sketch 1"
	- text: "自然语言处理[MASK]谷歌公司[MASK]通用人工智能[MASK]"
	  example_title: "Sketch 2"
	- text: "[MASK]疫情[MASK]公园[MASK]散步[MASK]"
	  example_title: "Sketch 3"

	inference:
	  parameters:
	    max_length: 100
	    num_beams: 3
	    do_sample: True
	---

	# GENIUS: generating text using sketches!


	- Paper: [GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation](https://arxiv.org/abs/2211.10330)
	- GitHub: [GENIUS, Pre-training/Data Augmentation Tutorial](https://github.com/beyondguo/genius)



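	The Chinese version of GENIUS takes a sketch you provide and expands it into full sentences by filling in the masked gaps. A sketch can be:
	- a combination of keywords, e.g. "今天[MASK]篮球[MASK]学校[MASK]" (today, basketball, school)
	- a combination of phrases, e.g. "自然语言处理[MASK]谷歌[MASK]通用人工智能[MASK]" (natural language processing, Google, artificial general intelligence)
	- a combination of short sentences, e.g. "我昨天做了一个梦[MASK]又遇见了她[MASK]曾经那段时光让人怀恋[MASK]" (I had a dream yesterday, I met her again, those days are dearly missed)
	- any mixture of the above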
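
The sketch formats listed above all follow the same pattern: text fragments separated by `[MASK]`, usually with a trailing `[MASK]` and sometimes a leading one. A minimal sketch of a helper that assembles such strings is shown below. The function name `build_sketch` and its parameters are hypothetical and are not part of the GENIUS code base.

```python
MASK = "[MASK]"

def build_sketch(fragments, mask_start=False, mask_end=True):
    """Join text fragments with [MASK] placeholders (hypothetical helper)."""
    # Drop empty fragments and surrounding whitespace
    parts = [f.strip() for f in fragments if f.strip()]
    sketch = MASK.join(parts)
    if mask_start:
        sketch = MASK + sketch   # let the model write text before the first fragment
    if mask_end:
        sketch = sketch + MASK   # let the model continue after the last fragment
    return sketch

print(build_sketch(["今天", "篮球", "学校"]))
# 今天[MASK]篮球[MASK]学校[MASK]
print(build_sketch(["疫情", "公园", "散步"], mask_start=True))
# [MASK]疫情[MASK]公园[MASK]散步[MASK]
```

Keywords, phrases, and short sentences can be mixed freely in `fragments`, since the helper only concatenates strings.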

	### How to use
	```python
	# genius-chinese
	import torch
	from transformers import BertTokenizer, BartForConditionalGeneration, Text2TextGenerationPipeline

	checkpoint = 'beyond/genius-base-chinese'
	tokenizer = BertTokenizer.from_pretrained(checkpoint)
	genius_model = BartForConditionalGeneration.from_pretrained(checkpoint)
	# Use the first GPU if one is available, otherwise fall back to CPU (device=-1)
	device = 0 if torch.cuda.is_available() else -1
	genius_generator = Text2TextGenerationPipeline(genius_model, tokenizer, device=device)

	sketches = [
	    "今天[MASK]篮球[MASK]学校[MASK]",
	    "自然语言处理[MASK]谷歌[MASK]通用人工智能[MASK]",
	    "我昨天做了一个梦[MASK]又遇见了她[MASK]曾经那段时光让人怀恋[MASK]",
	    "[MASK]疫情[MASK]公园[MASK]散步[MASK]",
	    "[MASK]酸菜鱼火锅[MASK]很美味，味道绝了[MASK]周末真开心[MASK]",
	]
	for sketch in sketches:
	    print('input sketch:\n>>> ', sketch)
	    output = genius_generator(sketch, max_length=100, do_sample=True, num_beams=3)[0]['generated_text']
	    # The decoded text has spaces between Chinese characters; remove them
	    print('genius-chinese output:\n>>> ', output.replace(' ', ''), '\n')
	```
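
For data augmentation it can be convenient to collect several cleaned outputs per sketch instead of printing one. The following is a rough sketch under stated assumptions: `expand_sketches` is a hypothetical helper, and `fake_generator` is a stand-in that mimics the list-of-dicts return shape of a text2text pipeline so the example runs without downloading the model.

```python
def expand_sketches(generator, sketches, **gen_kwargs):
    """Run a text2text generator over sketches and return cleaned outputs.

    `generator` is assumed to return a list of {"generated_text": str}
    dicts for a single input string, as a text2text pipeline does.
    """
    results = {}
    for sketch in sketches:
        outputs = generator(sketch, **gen_kwargs)
        # Remove the spaces left between Chinese characters after decoding
        results[sketch] = [o["generated_text"].replace(" ", "") for o in outputs]
    return results

# Stand-in generator (assumption, for illustration only)
def fake_generator(text, **kwargs):
    n = kwargs.get("num_return_sequences", 1)
    return [{"generated_text": "今 天 " + text.replace("[MASK]", " ")} for _ in range(n)]

augmented = expand_sketches(fake_generator, ["篮球[MASK]学校[MASK]"], num_return_sequences=2)
for sketch, texts in augmented.items():
    print(sketch, "->", texts)
```

With the real pipeline, `genius_generator` would be passed in place of `fake_generator`, together with the generation arguments used above (`max_length=100, do_sample=True, num_beams=3`). Note that removing every space also strips spaces from any English words in the output.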

	## Model variations

	| Model | #params | Language | Comment |
	|-------|---------|----------|---------|
	| [`genius-large`](https://huggingface.co/beyond/genius-large) | 406M | English | The version used in the paper |
	| [`genius-large-k2t`](https://huggingface.co/beyond/genius-large-k2t) | 406M | English | keywords-to-text |
	| [`genius-base`](https://huggingface.co/beyond/genius-base) | 139M | English | smaller version |
	| [`genius-base-ps`](https://huggingface.co/beyond/genius-base) | 139M | English | pre-trained on both paragraphs and short sentences |
	| [`genius-base-chinese`](https://huggingface.co/beyond/genius-base-chinese) | 116M | Chinese | pre-trained on 10 million clean Chinese passages |


	## Comparison
	The examples below compare [BART-base-chinese](https://huggingface.co/fnlp/bart-base-chinese) with our proposed [GENIUS-base-chinese](https://huggingface.co/beyond/genius-base-chinese) on expanding sketches into text:

	```
	input sketch:
	>>> 今天[MASK]篮球[MASK]上海财经大学[MASK]
	BART-chinese output:
	>>> 今天的篮球是上海财经大学篮球
	GENIUS-chinese output:
	>>> 今天，我们邀请到了中国篮球联盟主席、上海财经大学校长孙建国先生作为主题发言。

	input sketch:
	>>> 自然语言处理[MASK]谷歌[MASK]通用人工智能[MASK]
	BART-chinese output:
	>>> 自然语言处理是谷歌的通用人工智能技术
	GENIUS-chinese output:
	>>> 自然语言处理是谷歌在通用人工智能领域的一个重要研究方向，其目的是为了促进人类智能的发展。

	input sketch:
	>>> 我昨天做了一个梦[MASK]又遇见了她[MASK]曾经那段时光让人怀恋[MASK]
	BART-chinese output:
	>>> 我昨天做了一个梦今天又遇见了她我曾经那段时光让人怀恋不已
	GENIUS-chinese output:
	>>> 我昨天做了一个梦，梦见了我的妈妈，又遇见了她，我知道她曾经那段时光让人怀恋，但是现在，我不知道该怎么回事了，我只是想告诉她，不要再回去了。

	input sketch:
	>>> [MASK]疫情[MASK]公园[MASK]漫步[MASK]
	BART-chinese output:
	>>> 在疫情防控公园内漫步徜徉
	GENIUS-chinese output:
	>>> 为了防止疫情扩散，公园内还设置了漫步区。

	input sketch:
	>>> [MASK]酸菜鱼火锅[MASK]很美味，味道绝了[MASK]周末真开心[MASK]
	BART-chinese output:
	>>> 这酸菜鱼火锅真的很美味，味道绝了这周末真开心啊
	GENIUS-chinese output:
	>>> 这个酸菜鱼火锅真的很美味，味道绝了，吃的时间也长了，周末真开心，吃完以后就回家了，很满意的一次，很喜欢的一个品牌。
	```

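	As the examples show, BART fills in only a few simple words and does not connect the fragments coherently, whereas GENIUS expands them into coherent sentences or even paragraphs.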


	---

	If you find our paper/code/demo useful, please cite our paper:
	```
	@article{guo2022genius,
	title={GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation},
	author={Guo, Biyang and Gong, Yeyun and Shen, Yelong and Han, Songqiao and Huang, Hailiang and Duan, Nan and Chen, Weizhu},
	journal={arXiv preprint arXiv:2211.10330},
	year={2022}
	}
	```