robvanderg committed on
Commit
1a42625
1 Parent(s): 378e92c

Create README.md

README.md ADDED
@@ -0,0 +1,23 @@
+ ---
+ language:
+ - multilingual
+ tags:
+ - hack
+ datasets:
+ - Wikipedia
+ ---
+
+
+ ## bert-base-multilingual-cased-segment1BERT
+
+ This is a version of multilingual BERT (bert-base-multilingual-cased) in which the segment embedding for segment 1 is copied into the embedding for segment 0. Yes, that's all there is to it. We have found that this substantially improves performance on word-level tasks in low-resource setups (e.g. an average improvement of 2.5 LAS across a variety of UD treebanks). More details will be released in our LREC 2022 paper, "Frustratingly Easy Performance Improvements for Cross-lingual Transfer: A Tale on BERT and Segment Embeddings".
+
+ These embeddings are generated by the following code:
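+ ```
+ import torch
+ from transformers import AutoModel
+ baseEmbeddings = AutoModel.from_pretrained("bert-base-multilingual-cased")
+ # Detached copy of the segment (token type) embedding matrix, one row per segment id
+ tte = baseEmbeddings.embeddings.token_type_embeddings.weight.clone().detach()
+ # Overwrite the segment-0 row with the segment-1 row; no_grad permits the in-place write
+ with torch.no_grad():
+     baseEmbeddings.embeddings.token_type_embeddings.weight[0,:] = tte[1,:]
+ ```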
+
+ More details and other variants can be found in the repo: https://bitbucket.org/robvanderg/segmentembeds/
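
The effect of the row copy described in the README can be illustrated without downloading the model. The following is a minimal sketch in plain Python, assuming a toy 2 x 4 table in place of the real segment embedding matrix. The helper `copy_segment_one_into_zero` and all values are hypothetical and are not taken from the repo.

```python
from copy import deepcopy


def copy_segment_one_into_zero(token_type_table):
    """Return a copy of a 2-row segment embedding table whose row 0 equals row 1."""
    if len(token_type_table) != 2:
        raise ValueError("expected one row per segment id (0 and 1)")
    patched = deepcopy(token_type_table)  # leave the caller's table untouched
    patched[0] = list(patched[1])         # segment 0 now uses segment 1's vector
    return patched


# Toy stand-in for the segment embedding matrix (made-up values, hidden size 4).
toy_table = [
    [0.1, 0.2, 0.3, 0.4],  # row 0: segment id 0
    [0.9, 0.8, 0.7, 0.6],  # row 1: segment id 1
]
patched_table = copy_segment_one_into_zero(toy_table)

# After the copy, both segment ids look up the same vector.
same_vector = patched_table[0] == patched_table[1]
print(same_vector)
```

After the copy, inputs that only use segment id 0 are embedded with the vector originally learned for segment 1, and the two segment ids become indistinguishable at the embedding layer.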