roberta-large-1160k
Intended uses
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
How to use
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='AI-Sweden-Models/roberta-large-1160k')
>>> unmasker("Huvudstaden i Sverige är <mask>.")
[{'score': 0.5841221213340759,
'token': 1945,
'token_str': ' Stockholm',
'sequence': 'Huvudstaden i Sverige är Stockholm.'},
{'score': 0.06775698810815811,
'token': 5007,
'token_str': ' Göteborg',
'sequence': 'Huvudstaden i Sverige är Göteborg.'},
{'score': 0.05057400465011597,
'token': 5761,
'token_str': ' Malmö',
'sequence': 'Huvudstaden i Sverige är Malmö.'},
{'score': 0.021936343982815742,
'token': 21449,
'token_str': ' Norrköping',
'sequence': 'Huvudstaden i Sverige är Norrköping.'},
{'score': 0.017798304557800293,
'token': 5658,
'token_str': ' Uppsala',
'sequence': 'Huvudstaden i Sverige är Uppsala.'}]
>>> unmasker("Hovedstaden i Norge er <mask>.")
[{'score': 0.6792309284210205,
'token': 5158,
'token_str': ' Oslo',
'sequence': 'Hovedstaden i Norge er Oslo.'},
{'score': 0.09379775077104568,
'token': 15456,
'token_str': ' Trondheim',
'sequence': 'Hovedstaden i Norge er Trondheim.'},
{'score': 0.052535850554704666,
'token': 11370,
'token_str': ' Bergen',
'sequence': 'Hovedstaden i Norge er Bergen.'},
{'score': 0.03465486690402031,
'token': 29407,
'token_str': ' hovedstaden',
'sequence': 'Hovedstaden i Norge er hovedstaden.'},
{'score': 0.03017985075712204,
'token': 33311,
'token_str': ' Kristiansand',
'sequence': 'Hovedstaden i Norge er Kristiansand.'}]
>>> unmasker("Danmarks hovedstad er <mask>.")
[{'score': 0.11624140292406082,
'token': 4794,
'token_str': ' København',
'sequence': 'Danmarks hovedstad er København.'},
{'score': 0.045051511377096176,
'token': 7680,
'token_str': ' død',
'sequence': 'Danmarks hovedstad er død.'},
{'score': 0.02936543896794319,
'token': 10795,
'token_str': ' lukket',
'sequence': 'Danmarks hovedstad er lukket.'},
{'score': 0.026030730456113815,
'token': 13580,
'token_str': ' Odense',
'sequence': 'Danmarks hovedstad er Odense.'},
{'score': 0.02130937948822975,
'token': 16347,
'token_str': ' Roskilde',
'sequence': 'Danmarks hovedstad er Roskilde.'}]
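As the outputs above show, the pipeline returns a list of dictionaries with `score`, `token`, `token_str` and `sequence` keys, and `token_str` carries a leading space. The top Danish prediction is correct but scores only about 0.12, so it can be worth checking the score before trusting a prediction. The following is a minimal sketch of such a check, not part of the model or the library: the helper name `best_token` and the default threshold of 0.5 are hypothetical choices for illustration, and the sample data is copied from the first two Danish predictions above.

```python
# First two predictions copied from the Danish example above.
danish_predictions = [
    {"score": 0.11624140292406082, "token": 4794, "token_str": " København",
     "sequence": "Danmarks hovedstad er København."},
    {"score": 0.045051511377096176, "token": 7680, "token_str": " død",
     "sequence": "Danmarks hovedstad er død."},
]


def best_token(predictions, min_score=0.5):
    """Return the highest-scoring token string, or None if it scores below min_score.

    Hypothetical helper; the 0.5 default is an arbitrary illustrative threshold.
    """
    if not predictions:
        return None
    top = max(predictions, key=lambda p: p["score"])
    if top["score"] < min_score:
        return None
    # token_str includes the leading space produced by the tokenizer.
    return top["token_str"].strip()


print(best_token(danish_predictions))                 # rejected at the default threshold
print(best_token(danish_predictions, min_score=0.1))  # accepted at a looser threshold
```

With real pipeline output, the same helper could be called as `best_token(unmasker("Danmarks hovedstad er <mask>."))`.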
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('AI-Sweden-Models/roberta-large-1160k')
model = RobertaModel.from_pretrained('AI-Sweden-Models/roberta-large-1160k')
text = "Replace me by any text you'd like."
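# Tokenize the text and return PyTorch tensors
encoded_input = tokenizer(text, return_tensors='pt')
# Forward pass; output.last_hidden_state holds the per-token features
output = model(**encoded_input)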
Training data
The model was trained on the Scandinavian subset of the Nordic Pile (Swedish, Norwegian, Danish), which consists of 414 962 688 text samples.
Training procedure
The model was trained with the optimum-habana framework on 8 Intel® Gaudi® 2 AI accelerators managed by Intel Sweden AB.
The weights from https://huggingface.co/FacebookAI/roberta-large were used as initialization, and the tokenizer was trained from scratch.
This model is a checkpoint taken at step 1 160 000 of 1 350 790. The full run is 5 epochs; this checkpoint corresponds to epoch 4.29.
A batch size of 1536 was used.
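The figures above are mutually consistent if each training step consumes one batch of 1536 samples. That interpretation is an assumption here, not something the card states explicitly, but the arithmetic below reproduces both the total of 1 350 790 steps and the epoch value of 4.29.

```python
# Figures quoted in the training sections above.
samples = 414_962_688        # text samples in the training data
batch_size = 1536            # samples per step (assumed: one batch per step)
epochs = 5                   # length of the full run
checkpoint_step = 1_160_000  # step at which this checkpoint was saved

steps_per_epoch = samples // batch_size
total_steps = steps_per_epoch * epochs
checkpoint_epoch = checkpoint_step / steps_per_epoch

print(steps_per_epoch)             # 270158
print(total_steps)                 # 1350790
print(round(checkpoint_epoch, 2))  # 4.29
```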
Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results:
rank | da_rank | no_rank | sv_rank | dansk | angry_tweets | scala_da | scandiqa_da | norne_nb | norne_nn | norec | scala_nb | scala_nn | norquad | suc3 | swerec | scala_sv | scandiqa_sv |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1.3 | 1.33 | 1.34 | 1.23 | 74.16 | 51.2 | 73.87 | 49.34 | 92.01 | 87.17 | 60.11 | 72.85 | 65.56 | 60.38 | 82.65 | 77.25 | 77.9 | 49.64 |
As of 2024/03/26, it is ranked #2 on ScandEval, after gpt-4-0613.