File size: 2,940 Bytes
5d00664 177980f 5d00664 177980f 5d00664 84e392f 5d00664 5c4046f 5d00664 84e392f 5d00664 d889a42 5d00664 d889a42 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 |
---
language:
- ar
- az
- bg
- de
- el
- en
- es
- fr
- hi
- it
- ja
- nl
- pl
- pt
- ru
- sw
- th
- tr
- ur
- vi
- zh
license: cc-by-nc-4.0
tags:
- language detect
pipeline_tag: text-classification
---
# Multilingual Language Detection Model
## Model Description
This repository contains a multilingual language detection model based on the XLM-RoBERTa base architecture. The model is capable of distinguishing between 21 different languages including Arabic, Azerbaijani, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, and Chinese.
## How to Use
You can use this model directly with a pipeline for text classification, or you can use it with the `transformers` library for more custom usage, as shown in the example below.
### Quick Start
First, install the transformers library if you haven't already:
```bash
pip install transformers
```
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("LocalDoc/language_detection")
# Prepare text
text = "Əlqasım oğulları vorzakondu"
encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
# Prediction
model.eval()
with torch.no_grad():
outputs = model(**encoded_input)
# Process the outputs
logits = outputs.logits
probabilities = torch.nn.functional.softmax(logits, dim=-1)
predicted_class_index = probabilities.argmax().item()
labels = ["az", "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja", "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh"]
predicted_label = labels[predicted_class_index]
print(f"Predicted Language: {predicted_label}")
```
Training Performance
The model was trained over three epochs, showing consistent improvement in accuracy and loss:
Epoch 1: Training Loss: 0.0127, Validation Loss: 0.0174, Accuracy: 0.9966, F1 Score: 0.9966
Epoch 2: Training Loss: 0.0149, Validation Loss: 0.0141, Accuracy: 0.9973, F1 Score: 0.9973
Epoch 3: Training Loss: 0.0001, Validation Loss: 0.0109, Accuracy: 0.9984, F1 Score: 0.9984
Test Results
The model achieved the following results on the test set:
Loss: 0.0133
Accuracy: 0.9975
F1 Score: 0.9975
Precision: 0.9975
Recall: 0.9975
Evaluation Time: 17.5 seconds
Samples per Second: 599.685
Steps per Second: 9.424
License
The dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license. This license allows you to freely share and redistribute the dataset with attribution to the source but prohibits commercial use and the creation of derivative works.
Contact information
If you have any questions or suggestions, please contact us at [[email protected]]. |