init commit

Commit 7db2d26 by xiaowenbin (1 parent: 182c8c2)

Files changed:
- .gitattributes +1 -0
- 1_Pooling/config.json +7 -0
- README.md +105 -0
- config.json +31 -0
- config_sentence_transformers.json +7 -0
- modules.json +14 -0
- pytorch_model.bin +3 -0
- sentence_bert_config.json +4 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- vocab.txt +0 -0
.gitattributes
CHANGED
@@ -26,3 +26,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zstandard filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json
ADDED
@@ -0,0 +1,7 @@
{
  "word_embedding_dimension": 768,
  "pooling_mode_cls_token": false,
  "pooling_mode_mean_tokens": true,
  "pooling_mode_max_tokens": false,
  "pooling_mode_mean_sqrt_len_tokens": false
}
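The pooling config above enables only `pooling_mode_mean_tokens`, so the sentence vector is the average of the token vectors, with padding positions left out. A minimal pure-Python sketch of that idea follows; the function name `mean_pool` and the toy 2-dimensional vectors are illustrative assumptions (the real model works on 768-dimensional tensors).

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [t / count for t in total]


# Toy input: three tokens of dimension 2; the last token is padding (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is the point of weighting by the attention mask.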
README.md
ADDED
@@ -0,0 +1,105 @@
---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- semantic-search
- chinese
---

# DMetaSoul/sbert-chinese-qmc-finance-v1-distill

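This model is a lightweight, distilled version (only 4 BERT layers) of our previously released [open-source financial question matching model](https://huggingface.co/DMetaSoul/sbert-chinese-qmc-finance-v1). It is intended for **question matching in the financial domain**, for example:

- 8千日利息400元? VS 10000元日利息多少钱 ("Daily interest of 400 yuan on 8,000?" vs. "How much is the daily interest on 10,000 yuan?")
- 提前还款是按全额计息 VS 还款扣款不成功怎么还款? ("Is early repayment charged interest on the full amount?" vs. "How do I repay if the repayment deduction failed?")
- 为什么我借钱交易失败 VS 刚申请的借款为什么会失败 ("Why did my borrowing transaction fail?" vs. "Why did the loan I just applied for fail?")

Serving a large, offline-trained model directly for online inference places heavy demands on compute resources, and it is hard to meet the latency and throughput requirements of a production environment. We therefore use distillation to make the large model lightweight. After distilling the 12-layer BERT down to 4 layers, the parameter count shrinks to 44% of the original, latency is roughly halved, throughput roughly doubles, and accuracy drops by about 5 percentage points (see the Evaluation section below for details).
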
# Usage

## 1. Sentence-Transformers

Use the model through the [sentence-transformers](https://www.SBERT.net) framework. First install it:

```
pip install -U sentence-transformers
```

Then load the model and extract sentence embeddings with the following code:

```python
from sentence_transformers import SentenceTransformer
sentences = ["到期不能按时还款怎么办", "剩余欠款还有多少?"]

model = SentenceTransformer('DMetaSoul/sbert-chinese-qmc-finance-v1-distill')
embeddings = model.encode(sentences)
print(embeddings)
```

## 2. HuggingFace Transformers

If you prefer not to use [sentence-transformers](https://www.SBERT.net), you can also load the model with HuggingFace Transformers and extract sentence embeddings:

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["到期不能按时还款怎么办", "剩余欠款还有多少?"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('DMetaSoul/sbert-chinese-qmc-finance-v1-distill')
model = AutoModel.from_pretrained('DMetaSoul/sbert-chinese-qmc-finance-v1-distill')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```

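Either route above yields one vector per sentence, but the README stops short of the matching step itself. A common way to compare two question embeddings is cosine similarity; the sketch below shows that step in pure Python. The 3-dimensional vectors are made-up stand-ins for the model's 768-dimensional output, and `cosine_similarity` and `best_match` are hypothetical helper names, not part of either library.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_vec: Sequence[float], candidate_vecs: List[Sequence[float]]) -> Tuple[int, float]:
    """Return the index and score of the candidate most similar to the query."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical embeddings: one user question and two stored questions.
query = [0.2, 0.9, 0.1]
candidates = [[0.1, 0.8, 0.2], [0.9, 0.1, 0.0]]
index, score = best_match(query, candidates)
print(index, round(score, 3))
```

In a real deployment the candidate vectors would be the embeddings of known questions, and a score threshold (to be tuned on your own data) would decide whether the best candidate counts as a match.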
## Evaluation

The comparison here is mainly against the teacher model before distillation:

*Performance:*

|            | Teacher               | Student             | Gap   |
| ---------- | --------------------- | ------------------- | ----- |
| Model      | BERT-12-layers (102M) | BERT-4-layers (45M) | 0.44x |
| Cost       | 23s                   | 12s                 | -47%  |
| Latency    | 38ms                  | 20ms                | -47%  |
| Throughput | 418 sentences/s       | 791 sentences/s     | 1.9x  |

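The Gap column can be re-derived from the Teacher and Student columns. The short sketch below does that arithmetic; the dictionary keys are illustrative names, and because the table entries are already rounded, the recomputed values agree with the published Gap column only to within about one point.

```python
# Values copied from the performance table above.
teacher = {"params_m": 102, "cost_s": 23, "latency_ms": 38, "throughput": 418}
student = {"params_m": 45, "cost_s": 12, "latency_ms": 20, "throughput": 791}


def ratio(key: str) -> float:
    """Student value expressed as a multiple of the teacher value."""
    return student[key] / teacher[key]


def relative_change(key: str) -> float:
    """Signed fractional change from teacher to student (negative means smaller)."""
    return (student[key] - teacher[key]) / teacher[key]


print(f"model size: {ratio('params_m'):.2f}x")
print(f"cost:       {relative_change('cost_s'):+.1%}")
print(f"latency:    {relative_change('latency_ms'):+.1%}")
print(f"throughput: {ratio('throughput'):.2f}x")
```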
*Accuracy:*

|                | **csts_dev** | **csts_test** | **afqmc** | **lcqmc** | **bqcorpus** | **pawsx** | **xiaobu** | **Avg** |
| -------------- | ------------ | ------------- | --------- | --------- | ------------ | --------- | ---------- | ------- |
| **Teacher**    | 77.40%       | 74.55%        | 36.00%    | 75.75%    | 73.24%       | 11.58%    | 54.75%     | 57.61%  |
| **Student**    | 75.02%       | 71.99%        | 32.40%    | 67.06%    | 66.35%       | 7.57%     | 49.26%     | 52.80%  |
| **Gap** (abs.) | -            | -             | -         | -         | -            | -         | -          | -4.81%  |

*Measured on 10,000 samples on a V100 GPU with batch_size=16 and max_seq_len=256.*

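The accuracy table leaves the per-dataset gaps as "-". They follow directly from the two rows, as the sketch below shows using only the numbers in the table. The recomputed student average and overall gap may differ from the published 52.80% and -4.81% in the last digit, presumably because the published averages were taken over unrounded scores; that explanation is an assumption.

```python
# Scores (in percent) copied from the accuracy table above.
datasets = ["csts_dev", "csts_test", "afqmc", "lcqmc", "bqcorpus", "pawsx", "xiaobu"]
teacher = [77.40, 74.55, 36.00, 75.75, 73.24, 11.58, 54.75]
student = [75.02, 71.99, 32.40, 67.06, 66.35, 7.57, 49.26]

# Absolute gap per dataset: student minus teacher, in percentage points.
gaps = {name: round(s - t, 2) for name, t, s in zip(datasets, teacher, student)}

avg_teacher = sum(teacher) / len(teacher)
avg_student = sum(student) / len(student)
avg_gap = avg_student - avg_teacher

for name, gap in gaps.items():
    print(f"{name:10s} {gap:+.2f}")
print(f"{'Avg':10s} {avg_gap:+.2f}")
```

On these figures the largest drops are on lcqmc and bqcorpus, while the two csts splits lose the least.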
## Citing & Authors

E-mail: [email protected]
config.json
ADDED
@@ -0,0 +1,31 @@
{
  "_name_or_path": "releases/sbert-chinese-qmc-finance-v1-distill/",
  "architectures": [
    "BertModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 4,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.16.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 21128
}
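The README's "45M" and "102M" figures can be sanity-checked against this config. The sketch below counts the weights of a standard BERT encoder (embeddings, per-layer attention and feed-forward blocks, and the pooler) from the sizes in `config.json`. The function name is hypothetical, and the teacher count assumes the 12-layer model shares every other setting with this config, which the commit does not confirm.

```python
def bert_param_count(
    layers: int,
    vocab_size: int = 21128,
    hidden: int = 768,
    intermediate: int = 3072,
    max_positions: int = 512,
    type_vocab: int = 2,
) -> int:
    """Count weights and biases of a standard BERT encoder with a pooler head."""
    layer_norm = 2 * hidden  # scale and bias
    # Word, position and token-type embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (up- and down-projection), followed by one LayerNorm.
    feed_forward = (
        (hidden * intermediate + intermediate)
        + (intermediate * hidden + hidden)
        + layer_norm
    )
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


student_params = bert_param_count(layers=4)
teacher_params = bert_param_count(layers=12)  # assumption: same config, 12 layers
print(f"student: {student_params / 1e6:.1f}M")
print(f"teacher: {teacher_params / 1e6:.1f}M")
print(f"ratio:   {student_params / teacher_params:.2f}")
```

Because the embedding table is shared overhead that does not shrink with depth, cutting the layers to one third leaves well over one third of the parameters.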
config_sentence_transformers.json
ADDED
@@ -0,0 +1,7 @@
{
  "__version__": {
    "sentence_transformers": "2.1.0",
    "transformers": "4.16.0",
    "pytorch": "1.10.2"
  }
}
modules.json
ADDED
@@ -0,0 +1,14 @@
[
  {
    "idx": 0,
    "name": "0",
    "path": "",
    "type": "sentence_transformers.models.Transformer"
  },
  {
    "idx": 1,
    "name": "1",
    "path": "1_Pooling",
    "type": "sentence_transformers.models.Pooling"
  }
]
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c7695f0057b19b353edf6d92db9028b4e0e76b2372bce89f6e6b5a89f82fa7d7
size 182288973
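What is committed for `pytorch_model.bin` is a Git LFS pointer, not the weights themselves: three `key value` lines giving the spec version, the SHA-256 of the real file, and its size in bytes. The sketch below parses such a pointer and converts the size into an approximate float32 value count; `parse_lfs_pointer` is a hypothetical helper, and dividing by 4 bytes ignores any serialization overhead in the checkpoint.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:c7695f0057b19b353edf6d92db9028b4e0e76b2372bce89f6e6b5a89f82fa7d7
size 182288973
"""

info = parse_lfs_pointer(pointer_text)
approx_values = info["size_bytes"] / 4  # float32 takes 4 bytes per value
print(f"{info['size_bytes'] / 2**20:.1f} MiB, about {approx_values / 1e6:.1f}M float32 values")
```

The resulting count is in line with the roughly 45M parameters quoted in the README for the 4-layer student.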
sentence_bert_config.json
ADDED
@@ -0,0 +1,4 @@
{
  "max_seq_length": 256,
  "do_lower_case": false
}
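With `max_seq_length` set to 256, longer inputs are cut off before encoding. The sketch below illustrates the effect under the usual BERT convention that `[CLS]` and `[SEP]` count toward the limit, which leaves 254 positions for content tokens. `truncate_for_bert` is a hypothetical helper that mimics the behavior on a plain token list; it is not the tokenizer's actual code path.

```python
from typing import List, Sequence


def truncate_for_bert(tokens: Sequence[str], max_seq_length: int = 256) -> List[str]:
    """Keep as many tokens as fit once [CLS] and [SEP] are added."""
    budget = max_seq_length - 2  # reserve two positions for the special tokens
    return ["[CLS]"] + list(tokens[:budget]) + ["[SEP]"]


# A 300-character question; with a character-level Chinese vocabulary this is
# roughly one token per character (a simplifying assumption for this sketch).
long_question = list("还" * 300)
encoded = truncate_for_bert(long_question)
print(len(encoded))  # 256
```

Anything past the cut-off never reaches the model, so very long questions are effectively matched on their leading tokens only.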
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"do_lower_case": false, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "releases/sbert-chinese-qmc-finance-v1-distill/", "tokenizer_class": "BertTokenizer"}
vocab.txt
ADDED
The diff for this file is too large to render.