---
language:
- zh
tags:
- t5
- pytorch
- zh
- Text2Text-Generation
license: "apache-2.0"
widget:
- text: "对联:丹枫江冷人初去"
---

# T5 for Chinese Couplet (t5-chinese-couplet) Model

A T5 model for generating Chinese couplets.

Performance of `t5-chinese-couplet` on the couplet **test** data:

| prefix | input_text | target_text | pred |
|:--|:--|:--|:--|
| 对联: | 春回大地,对对黄莺鸣暖树 | 日照神州,群群紫燕衔新泥 | 福至人间,家家紫燕舞和风 |

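On the couplet test set, the generated lines meet the formal requirements: they match the input in character count, part-of-speech alignment, and word-level alignment. They do not yet achieve rigorous semantic parallelism or conform to the tonal (平仄, pingze) patterns.
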
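The character-count requirement can be checked mechanically. The following is a minimal sketch, not part of the model or of textgen: the helper name `check_form` and the punctuation set are assumptions chosen for illustration. It covers only surface form; part-of-speech alignment, semantic parallelism, and tonal patterns are out of scope.

```python
# Full-width punctuation that may appear inside a couplet line.
# This set is an assumption for illustration; extend it as needed.
PUNCTUATION = set(",。、;:!?")


def check_form(first_line: str, second_line: str) -> dict:
    """Check two surface-level constraints between the lines of a couplet."""
    # Both lines must contain the same number of characters.
    same_length = len(first_line) == len(second_line)
    # Punctuation should occur at the same positions in both lines.
    first_marks = [(i, ch) for i, ch in enumerate(first_line) if ch in PUNCTUATION]
    second_marks = [(i, ch) for i, ch in enumerate(second_line) if ch in PUNCTUATION]
    return {
        "same_length": same_length,
        "punctuation_aligned": first_marks == second_marks,
    }


# Input and prediction taken from the evaluation table.
print(check_form("春回大地,对对黄莺鸣暖树", "福至人间,家家紫燕舞和风"))
```

Both checks pass for the example row in the table, which is consistent with the observation that the model already handles the formal constraints.
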
T5 network architecture (original T5):

![arch](t5.png)

## Usage

This model is released as part of the text generation project [textgen](https://github.com/shibing624/textgen), which supports T5 models. Use it as follows:

Install package:
```shell
pip install -U textgen
```

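```python
from textgen import T5Model

# Load the couplet model from the Hugging Face Hub
model = T5Model("t5", "shibing624/t5-chinese-couplet")
# The input is the task prefix "对联:" followed by the first line of the couplet
r = model.predict(["对联:丹枫江冷人初去"])
print(r)  # ['白石矶寒客不归']
```
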
## Usage (HuggingFace Transformers)
Without [textgen](https://github.com/shibing624/textgen), you can use the model like this:

First, pass your input through the transformer model; then decode the generated sentence.

Install package:
```
pip install transformers
```

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("shibing624/t5-chinese-couplet")
model = T5ForConditionalGeneration.from_pretrained("shibing624/t5-chinese-couplet")


def batch_generate(input_texts, max_length=64):
    # padding=True lets inputs of different lengths share one batch tensor
    features = tokenizer(input_texts, padding=True, return_tensors='pt')
    outputs = model.generate(input_ids=features['input_ids'],
                             attention_mask=features['attention_mask'],
                             max_length=max_length)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


r = batch_generate(["对联:丹枫江冷人初去"])
print(r)
```

Output:
```shell
['白石矶寒客不归']
```

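`batch_generate` tokenizes the whole input list in one call, so a very long list may not fit in memory. One possible workaround is to feed it fixed-size slices. The helper below is a sketch under that assumption: `generate_in_chunks` and its `batch_size` default are hypothetical names and values, and `fake_generate` is only a stand-in so the snippet runs without downloading the model.

```python
def generate_in_chunks(texts, generate_fn, batch_size=32):
    """Call generate_fn on successive slices of texts and concatenate the results."""
    results = []
    for start in range(0, len(texts), batch_size):
        results.extend(generate_fn(texts[start:start + batch_size]))
    return results


# Stand-in for batch_generate (hypothetical): reverses each string.
def fake_generate(batch):
    return [text[::-1] for text in batch]


print(generate_in_chunks(["ab", "cd", "ef"], fake_generate, batch_size=2))
```

With the real model, the call would look like `generate_in_chunks(first_lines, batch_generate, batch_size=16)`, where `first_lines` is a list of inputs that already carry the `对联:` prefix.
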
Model files:
```
t5-chinese-couplet
├── config.json
├── model_args.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
├── spiece.model
└── vocab.txt
```

### Training dataset
#### Chinese couplet dataset

- Data: [couplet dataset on GitHub](https://github.com/wb14123/couplet-dataset), [cleaned couplet dataset on GitHub](https://github.com/v-zich/couplet-clean-dataset)
- Related resources
  - [Huggingface](https://huggingface.co/)
  - Langboat Chinese [Mengzi T5 pretrained model](https://huggingface.co/Langboat/mengzi-t5-base) and [paper](https://arxiv.org/pdf/2110.06696.pdf)
  - [textgen](https://github.com/shibing624/textgen)

Data format:

```text
head -n 1 couplet_files/couplet/train/in.txt
晚 风 摇 树 树 还 挺

head -n 1 couplet_files/couplet/train/out.txt
晨 露 润 花 花 更 红
```

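The raw files separate characters with spaces, while the evaluation table shows unspaced text in `prefix`, `input_text`, and `target_text` columns. A minimal sketch of one way to bridge the two is shown below. The function name `to_rows` and the space-stripping step are assumptions for illustration; the actual preprocessing in textgen may differ.

```python
PREFIX = "对联:"  # task prefix shown in the evaluation table and usage examples


def to_rows(in_lines, out_lines, prefix=PREFIX):
    """Pair first and second lines and remove the spaces between characters."""
    rows = []
    for first, second in zip(in_lines, out_lines):
        rows.append({
            "prefix": prefix,
            # str.split() with no argument also drops the trailing newline
            "input_text": "".join(first.split()),
            "target_text": "".join(second.split()),
        })
    return rows


# The two sample lines from in.txt and out.txt.
rows = to_rows(["晚 风 摇 树 树 还 挺\n"], ["晨 露 润 花 花 更 红\n"])
print(rows[0])
```
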
To train a T5 model, see [https://github.com/shibing624/textgen/blob/main/docs/%E5%AF%B9%E8%81%94%E7%94%9F%E6%88%90%E6%A8%A1%E5%9E%8B%E5%AF%B9%E6%AF%94.md](https://github.com/shibing624/textgen/blob/main/docs/%E5%AF%B9%E8%81%94%E7%94%9F%E6%88%90%E6%A8%A1%E5%9E%8B%E5%AF%B9%E6%AF%94.md)

## Citation

```latex
@software{textgen,
  author = {Xu Ming},
  title = {textgen: Implementation of Text Generation models},
  year = {2022},
  url = {https://github.com/shibing624/textgen},
}
```