---
license: mit
language:
- ru
tags:
- PyTorch
- Transformers
---

# ruELECTRA large model (cased) for embeddings in the Russian language

The model architecture design, pretraining, and evaluation are documented in our preprint: [**A Family of Pretrained Transformer Language Models for Russian**](https://arxiv.org/abs/2309.10931).

## Usage (HuggingFace Models Repository)

You can use the model directly from the model repository to compute sentence embeddings. For better quality, average the token embeddings (mean pooling), taking the attention mask into account so that padding tokens are ignored:

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    # Broadcast the mask over the hidden dimension so padding tokens contribute zero
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    # Clamp to avoid division by zero for fully masked rows
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask


# Sentences we want sentence embeddings for
# ("Hello! How are you?", "Is it true that 42 is your favorite number?")
sentences = ['Привет! Как твои дела?',
             'А правда, что 42 твое любимое число?']

# Load AutoModel from the Hugging Face model repository
tokenizer = AutoTokenizer.from_pretrained("ai-forever/ruElectra-large")
model = AutoModel.from_pretrained("ai-forever/ruElectra-large")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, max_length=24, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
```
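
A common next step is to compare the resulting sentence embeddings. The snippet below is a minimal sketch of one way to do that with cosine similarity. It is not part of the original model card: `cosine_similarity_matrix` is a hypothetical helper name, and `example_embeddings` is a small stand-in tensor used so the snippet runs without downloading the model.

```python
import torch
import torch.nn.functional as F


def cosine_similarity_matrix(embeddings: torch.Tensor) -> torch.Tensor:
    """Return pairwise cosine similarities for a (n_sentences, hidden_size) tensor."""
    # L2-normalize each row so that a plain dot product equals cosine similarity
    normalized = F.normalize(embeddings, p=2, dim=1)
    return normalized @ normalized.T


# Stand-in for `sentence_embeddings` (assumption: any float tensor of shape
# (n_sentences, hidden_size) is handled the same way)
example_embeddings = torch.tensor([[1.0, 0.0, 1.0],
                                   [1.0, 0.0, 1.0],
                                   [0.0, 2.0, 0.0]])

similarity = cosine_similarity_matrix(example_embeddings)
print(similarity)
```

With the real model, passing `sentence_embeddings` instead of `example_embeddings` should give a matrix whose entry `[i, j]` scores how close sentence `i` is to sentence `j`.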

# Authors

+ [SaluteDevices](https://sberdevices.ru/) RnD Team.
+ Aleksandr Abramov: [HF profile](https://huggingface.co/Andrilko), [Github](https://github.com/Ab1992ao), [Kaggle Competitions Master](https://www.kaggle.com/andrilko);
+ Mark Baushenko: [HF profile](https://huggingface.co/e0xexrazy);
+ Artem Snegirev: [HF profile](https://huggingface.co/artemsnegirev)

# Cite us

```
@misc{zmitrovich2023family,
      title={A Family of Pretrained Transformer Language Models for Russian},
      author={Dmitry Zmitrovich and Alexander Abramov and Andrey Kalmykov and Maria Tikhonova and Ekaterina Taktasheva and Danil Astafurov and Mark Baushenko and Artem Snegirev and Tatiana Shavrina and Sergey Markov and Vladislav Mikhailov and Alena Fenogenova},
      year={2023},
      eprint={2309.10931},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```