---
language: 
  - zh
thumbnail: "url to a thumbnail used in social sharing"
tags:
- bart-large-chinese
datasets:
- Chinese Persona Chat (CPC)
- LCCC
- Emotional STC (ESTC)
- KdConv
---

# dialogue-bart-large-chinese
This is a seq2seq model fine-tuned from bart-large-chinese on several Chinese dialogue datasets.


# Datasets
We use four Chinese dialogue datasets from [LUGE](https://www.luge.ai/#/):

| Dataset                      | Count      | Domain                |
| ----                         | ----       | ----                  |
| Chinese Persona Chat (CPC)   | 23,000     | Open                  |
| LCCC                         | 11,987,759 | Open                  |
| Emotional STC (ESTC)         | 899,207    | Open                  |
| KdConv                       | 3,000      | Movie, Music, Travel  |


# Data format
Input: `[CLS] 对话历史:<history> 知识:<knowledge> [SEP]`

Output: `[CLS] <response> [SEP]`
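
The input template can be assembled with a small helper. The following is a minimal sketch, not the authors' preprocessing code: `build_input` is a hypothetical name, and two details are assumptions that this card does not specify, namely that multiple knowledge items are joined with the separator token in the same way as history turns, and that the `知识:` field is simply omitted when no knowledge is available. The `[CLS]` and `[SEP]` at the two ends are not added here because the tokenizer inserts them itself.

```python
from typing import List, Optional


def build_input(
    history: List[str],
    knowledge: Optional[List[str]] = None,
    sep_token: str = "[SEP]",
) -> str:
    """Build the text part of `对话历史:<history> 知识:<knowledge>`.

    Hypothetical helper. Joining knowledge items with `sep_token` and
    dropping the knowledge field when it is empty are assumptions.
    """
    # History turns are joined with the separator token, as in the example
    # in this card (tokenizer.sep_token.join(history)).
    text = "对话历史:" + sep_token.join(history)
    if knowledge:
        # Assumption: knowledge items are serialized like history turns.
        text += " 知识:" + sep_token.join(knowledge)
    return text


# Usage with placeholder strings; pass tokenizer.sep_token in real code.
print(build_input(["turn 1", "turn 2"], knowledge=["fact 1"]))
```

The returned string can be passed to the tokenizer in the same way as `history_str` in the example below.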


# Example
```python
from transformers import BertTokenizer, BartForConditionalGeneration

# Note: the tokenizer is a BertTokenizer, not a BartTokenizer
tokenizer = BertTokenizer.from_pretrained("HIT-TMG/dialogue-bart-large-chinese")
model = BartForConditionalGeneration.from_pretrained("HIT-TMG/dialogue-bart-large-chinese")

# An example from the CPC dev set
history = ["可以 认识 一下 吗 ?", "当然 可以 啦 , 你好 。", "嘿嘿 你好 , 请问 你 最近 在 忙 什么 呢 ?", "我 最近 养 了 一只 狗狗 , 我 在 训练 它 呢 。"]
history_str = "对话历史:" + tokenizer.sep_token.join(history)
input_ids = tokenizer(history_str, return_tensors='pt').input_ids
output_ids = model.generate(input_ids)[0]
print(tokenizer.decode(output_ids))
```
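
Because the output follows the `[CLS] <response> [SEP]` format, the decoded string still contains the special tokens, and tokens are separated by spaces. A minimal post-processing sketch follows, assuming the response is pure Chinese text so that all whitespace can be dropped safely; `clean_response` and `SPECIAL_TOKENS` are hypothetical names, not part of this model or of `transformers`.

```python
# Special tokens of the BERT-style vocabulary that may appear in decoded text.
SPECIAL_TOKENS = ("[CLS]", "[SEP]", "[PAD]")


def clean_response(decoded: str) -> str:
    """Strip special tokens and inter-token spaces from a decoded response.

    Hypothetical helper. Removing all whitespace is an assumption that only
    suits Chinese-only text; it would also glue English words together.
    """
    for token in SPECIAL_TOKENS:
        decoded = decoded.replace(token, "")
    # Split on any whitespace and re-join without separators.
    return "".join(decoded.split())


# Usage with a placeholder string in the documented output format.
print(clean_response("[CLS] 你 好 [SEP]"))
```

Alternatively, passing `skip_special_tokens=True` to `tokenizer.decode` removes the special tokens but keeps the spaces between tokens.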