metadata
language:
  - zh
thumbnail: url to a thumbnail used in social sharing
tags:
  - bart-large-chinese
datasets:
  - Chinese Persona Chat (CPC)
  - LCCC
  - Emotional STC (ESTC)
  - KdConv

dialogue-bart-large-chinese

This is a seq2seq model obtained by fine-tuning bart-large-chinese on several Chinese dialogue datasets.

Datasets

We use four Chinese dialogue datasets from LUGE:

| Dataset | Count | Domain |
| --- | --- | --- |
| Chinese Persona Chat (CPC) | 23,000 | Open |
| LCCC | 11,987,759 | Open |
| Emotional STC (ESTC) | 899,207 | Open |
| KdConv | 3,000 | Movie, Music, Travel |

Data format

Input: [CLS] 对话历史:<history> 知识:<knowledge> [SEP]

Output: [CLS] <response> [SEP]
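
The input format above can be assembled with a small helper. This is a minimal sketch, not the authors' preprocessing code: `build_input` is a hypothetical name, and joining multiple knowledge items with `[SEP]` is an assumption carried over from how the history turns are joined. The outer `[CLS]` and `[SEP]` are left out because `BertTokenizer` adds them when encoding.

```python
SEP_TOKEN = "[SEP]"  # default separator token of BertTokenizer


def build_input(history, knowledge=None, sep_token=SEP_TOKEN):
    """Assemble the input string: 对话历史:<history> 知识:<knowledge>.

    history   -- list of utterances, oldest first
    knowledge -- optional list of knowledge strings (hypothetical usage)
    """
    # Join the dialogue turns with the separator and add the history prefix.
    text = "对话历史:" + sep_token.join(history)
    if knowledge:
        # Assumption: knowledge items are joined the same way as turns.
        text += " 知识:" + sep_token.join(knowledge)
    return text


if __name__ == "__main__":
    # Made-up utterances, for illustration only.
    print(build_input(["你好", "你好 呀"], knowledge=["小明 喜欢 音乐"]))
```

The returned string can be passed to the tokenizer directly, for example `tokenizer(build_input(history), return_tensors='pt')`.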

Example

```python
from transformers import BertTokenizer, BartForConditionalGeneration

# Note: the tokenizer is a BertTokenizer, not a BartTokenizer
tokenizer = BertTokenizer.from_pretrained("HIT-TMG/dialogue-bart-large-chinese")
model = BartForConditionalGeneration.from_pretrained("HIT-TMG/dialogue-bart-large-chinese")

# An example from the CPC dev data
history = ["可以 认识 一下 吗 ?", "当然 可以 啦 , 你好 。", "嘿嘿 你好 , 请问 你 最近 在 忙 什么 呢 ?", "我 最近 养 了 一只 狗狗 , 我 在 训练 它 呢 。"]
# Join the turns with the separator token and prepend the history prefix
history_str = "对话历史:" + tokenizer.sep_token.join(history)
# The tokenizer adds the outer [CLS] and [SEP] tokens automatically
input_ids = tokenizer(history_str, return_tensors='pt').input_ids
# Generate a response and decode the first returned sequence
output_ids = model.generate(input_ids)[0]
print(tokenizer.decode(output_ids))
```
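
The decoded string follows the output format, so it still contains `[CLS]` and `[SEP]`, and `BertTokenizer` separates tokens with spaces. Passing `skip_special_tokens=True` to `tokenizer.decode` drops the special tokens. The sketch below is a plain-string alternative that also removes the spaces. `clean_response` is a hypothetical helper, and it assumes the response is purely Chinese, since stripping all whitespace would also merge English words.

```python
import re

# Special tokens that may appear in the decoded string.
SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]")


def clean_response(decoded):
    """Strip special tokens and inter-token spaces from a decoded response."""
    for token in SPECIAL_TOKENS:
        decoded = decoded.replace(token, "")
    # Assumption: the text is Chinese only, so all whitespace can be dropped.
    return re.sub(r"\s+", "", decoded)


if __name__ == "__main__":
    # Made-up string in the documented output format, not real model output.
    print(clean_response("[CLS] 真 不错 , 你 养 的 什么 狗 ? [SEP]"))
```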