sihaochen committed on
Commit
684bce1
1 Parent(s): 77e11ce

Create README.md

  1. README.md +69 -0
README.md ADDED
@@ -0,0 +1,69 @@
+ ---
+ license: cc
+ language:
+ - en
+ ---
+ This is the proposition segmentation model from "Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations" by Chen et al. (2023).
+
+ ## Usage
+ The prompt to the model is formatted as `segment sentence: {input_sentence}`.
+
+ For each sentence, the model outputs its propositions as a single string, joined by `[sep]`.
+
+ For example, if we use the example code below to segment `"Dracula is a novel by Bram Stoker featuring Count Dracula as the protagonist."`, the decoded model output will be `['Dracula is a novel by Bram Stoker.[sep]Count Dracula is the protagonist of Dracula.']`
+
+ ```
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+
+ # generation settings: single-beam decoding, up to 256 new tokens
+ gen_kwargs = {
+     "length_penalty": 0,
+     "max_new_tokens": 256,
+     "min_length": 10,
+     "no_repeat_ngram_size": 0,
+     "num_beams": 1,
+ }
+
+ SEGMENT5_PROMPT = "segment sentence: {}"
+ SEGMENT5_SEP_TOKEN = "[sep]"
+
+ model = AutoModelForSeq2SeqLM.from_pretrained("sihaochen/SegmenT5-large")
+ tokenizer = AutoTokenizer.from_pretrained("sihaochen/SegmenT5-large")
+
+ model.eval()
+
+ # define an example input sentence
+ example_sentence = "Dracula is a novel by Bram Stoker featuring Count Dracula as the protagonist."
+ example_input = SEGMENT5_PROMPT.format(example_sentence)
+
+ # tokenize the prompt, padding/truncating to 512 tokens
+ input_ids = tokenizer(example_input,
+                       return_tensors="pt",
+                       padding="max_length",
+                       max_length=512,
+                       truncation=True).input_ids
+
+ # generate() returns token ids (not logits); decode them back to text
+ output_ids = model.generate(input_ids, **gen_kwargs)
+ outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
+
+ # split the decoded string into individual propositions
+ output = outputs[0].split(SEGMENT5_SEP_TOKEN)
+
+ print(output)
+ # Output: ['Dracula is a novel by Bram Stoker.', 'Count Dracula is the protagonist of Dracula.']
+ ```
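
The single-sentence example above could be extended to several sentences at once. The following is a minimal sketch rather than part of the released code: the helper names `build_prompts`, `parse_propositions`, and `segment_batch` are hypothetical, and two choices are assumptions made here, not settings confirmed by the authors: padding with `padding=True` (pad to the longest prompt in the batch) and passing `attention_mask` to `generate` explicitly. `model` and `tokenizer` are expected to be the objects loaded with `from_pretrained` as in the example above.

```python
from typing import List

SEGMENT5_PROMPT = "segment sentence: {}"
SEGMENT5_SEP_TOKEN = "[sep]"


def build_prompts(sentences: List[str]) -> List[str]:
    """Wrap each raw sentence in the prompt template described above."""
    return [SEGMENT5_PROMPT.format(s) for s in sentences]


def parse_propositions(decoded: str) -> List[str]:
    """Split one decoded output string on `[sep]`, dropping empty pieces."""
    pieces = (p.strip() for p in decoded.split(SEGMENT5_SEP_TOKEN))
    return [p for p in pieces if p]


def segment_batch(sentences: List[str], model, tokenizer, **gen_kwargs) -> List[List[str]]:
    """Segment several sentences with one generate() call (hypothetical helper)."""
    inputs = tokenizer(build_prompts(sentences),
                       return_tensors="pt",
                       padding=True,      # assumption: pad to the longest prompt only
                       max_length=512,
                       truncation=True)
    output_ids = model.generate(input_ids=inputs.input_ids,
                                attention_mask=inputs.attention_mask,
                                **gen_kwargs)
    decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    # one list of propositions per input sentence, in input order
    return [parse_propositions(text) for text in decoded]
```

With the objects from the example above, a call would look like `segment_batch([example_sentence], model, tokenizer, **gen_kwargs)`, which returns one list of propositions per input sentence.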
+
+ ## Sub-Sentence Encoder
+ For model checkpoints and code for the sub-sentence encoders, check out: https://github.com/schen149/sub-sentence-encoder/
+
+ ## Citation
+ ```
+ @article{chen2023subsentence,
+   title={Sub-Sentence Encoder: Contrastive Learning of Propositional Semantic Representations},
+   author={Sihao Chen and Hongming Zhang and Tong Chen and Ben Zhou and Wenhao Yu and Dian Yu and Baolin Peng and Hongwei Wang and Dan Roth and Dong Yu},
+   journal={arXiv preprint arXiv:2311.04335},
+   year={2023},
+   url={https://arxiv.org/pdf/2311.04335.pdf}
+ }
+ ```
+