---
library_name: transformers
license: cc-by-nc-4.0
language:
- az
pipeline_tag: token-classification
tags:
- NER
- Named Entity Recognition
widget:
- text: >-
    İyunun 11-i saat 20:55 radələrində Oğuz rayonu Tayıflı, Şirvanlı, Xalxal
    kəndlərinə diametri 10 mm olan dolu düşüb.
datasets:
- LocalDoc/azerbaijani-ner-dataset
---

# Azerbaijani Named Entity Recognition (NER) Model

This repository contains the code and model for Named Entity Recognition (NER) in the Azerbaijani language. The model is built on the XLM-RoBERTa architecture and fine-tuned on a custom dataset (`LocalDoc/azerbaijani-ner-dataset`).

## Model Description

The model recognizes the following entity types:

- LABEL_0: **O**: Outside any named entity
- LABEL_1: **PERSON**: Names of individuals
- LABEL_2: **LOCATION**: Geographical locations, both man-made and natural
- LABEL_3: **ORGANISATION**: Names of companies, institutions
- LABEL_4: **DATE**: Dates or periods
- LABEL_5: **TIME**: Times of the day
- LABEL_6: **MONEY**: Monetary values
- LABEL_7: **PERCENTAGE**: Percentage values
- LABEL_8: **FACILITY**: Buildings, airports, etc.
- LABEL_9: **PRODUCT**: Products and goods
- LABEL_10: **EVENT**: Events and occurrences
- LABEL_11: **ART**: Artworks, titles of books, songs
- LABEL_12: **LAW**: Legal documents
- LABEL_13: **LANGUAGE**: Languages
- LABEL_14: **GPE**: Countries, cities, states
- LABEL_15: **NORP**: Nationalities or religious or political groups
- LABEL_16: **ORDINAL**: Ordinal numbers
- LABEL_17: **CARDINAL**: Cardinal numbers
- LABEL_18: **DISEASE**: Diseases and medical conditions
- LABEL_19: **CONTACT**: Contact information, e.g., phone numbers, emails
- LABEL_20: **ADAGE**: Proverbs, sayings
- LABEL_21: **QUANTITY**: Measurements and quantities
- LABEL_22: **MISCELLANEOUS**: Miscellaneous entities
- LABEL_23: **POSITION**: Professional or social positions
- LABEL_24: **PROJECT**: Names of projects or programs
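
The label list above can be captured as a small lookup, which is useful when the model reports raw `LABEL_n` names. This is a minimal sketch: `ENTITY_TYPES`, `ID2LABEL`, `LABEL2ID`, and `label_to_name` are hypothetical names that are not part of the model repository, and the ordering is assumed to match the list above.

```python
# Entity types in the order listed above (list index == label id).
ENTITY_TYPES = [
    "O", "PERSON", "LOCATION", "ORGANISATION", "DATE", "TIME", "MONEY",
    "PERCENTAGE", "FACILITY", "PRODUCT", "EVENT", "ART", "LAW", "LANGUAGE",
    "GPE", "NORP", "ORDINAL", "CARDINAL", "DISEASE", "CONTACT", "ADAGE",
    "QUANTITY", "MISCELLANEOUS", "POSITION", "PROJECT",
]

# Lookups in both directions (hypothetical helpers, not shipped with the model).
ID2LABEL = dict(enumerate(ENTITY_TYPES))
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def label_to_name(raw_label: str) -> str:
    """Convert a raw label such as 'LABEL_14' to its entity type name."""
    prefix, _, index = raw_label.rpartition("_")
    if prefix != "LABEL" or not index.isdigit():
        raise ValueError(f"unexpected label format: {raw_label!r}")
    return ID2LABEL[int(index)]


print(label_to_name("LABEL_14"))  # GPE
```

Keeping the names in a single ordered list means the index-to-name and name-to-index lookups cannot drift apart.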

## Installation

To use the model, install the required libraries with `pip`. The usage example loads the model through `XLMRobertaForTokenClassification`, which is a PyTorch class, so PyTorch is needed as well:

```bash
pip install transformers
pip install datasets
pip install torch
```
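
To confirm the installation before loading the model, a quick check along these lines may help. It is only a sketch: `REQUIRED` and `missing_packages` are hypothetical names, and `torch` is included on the assumption that PyTorch is the backend in use.

```python
from importlib.util import find_spec

# Packages the usage example is assumed to need (torch is an assumption).
REQUIRED = ("transformers", "datasets", "torch")


def missing_packages(names=REQUIRED):
    """Return the top-level package names from `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```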

## Usage

The example below loads the model and tokenizer, runs the NER pipeline on a sample sentence, and maps the raw `LABEL_n` outputs to entity type names:

```python
from transformers import pipeline, XLMRobertaTokenizerFast, XLMRobertaForTokenClassification

# Load the model and tokenizer
tokenizer = XLMRobertaTokenizerFast.from_pretrained("LocalDoc/ner_azerbaijan")
model = XLMRobertaForTokenClassification.from_pretrained("LocalDoc/ner_azerbaijan")

# Create NER pipeline; "simple" aggregation groups adjacent tokens that share a label
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example text
example = "Komitədən bildirilib ki, sovet dövründə Azərbaycanda cəmi 17 məscid fəaliyyət göstərirdisə, dövlət müstəqilliyinin bərpasından sonra ölkədə 814 məscid tikilib."

# Perform NER
ner_results = nlp(example)

# Mapping of label indices to their descriptions
label_mapping = {
    0: "O",
    1: "PERSON",
    2: "LOCATION",
    3: "ORGANISATION",
    4: "DATE",
    5: "TIME",
    6: "MONEY",
    7: "PERCENTAGE",
    8: "FACILITY",
    9: "PRODUCT",
    10: "EVENT",
    11: "ART",
    12: "LAW",
    13: "LANGUAGE",
    14: "GPE",
    15: "NORP",
    16: "ORDINAL",
    17: "CARDINAL",
    18: "DISEASE",
    19: "CONTACT",
    20: "ADAGE",
    21: "QUANTITY",
    22: "MISCELLANEOUS",
    23: "POSITION",
    24: "PROJECT"
}

# Print results with mapped entity types
for result in ner_results:
    # Groups are reported as "LABEL_<index>"; take the index after the underscore
    entity_group = result['entity_group']
    entity_description = label_mapping[int(entity_group.split('_')[-1])]
    print({
        'entity_group': entity_description,
        'score': result['score'],
        'word': result['word'],
        'start': result['start'],
        'end': result['end']
    })
```
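
The results printed above can also be grouped by entity type, which is often easier to consume downstream. The following is a minimal sketch, not part of the model repository: `summarize_entities` and the `min_score` threshold of 0.5 are hypothetical, and the sample data is hand-written in a trimmed version of the pipeline's output shape rather than taken from real model output.

```python
from collections import defaultdict


def summarize_entities(results, label_mapping, min_score=0.5):
    """Group recognized words by entity type, skipping 'O' and low-score items."""
    grouped = defaultdict(list)
    for item in results:
        # Labels are assumed to look like "LABEL_<index>", as in the example above.
        index = int(item["entity_group"].split("_")[-1])
        name = label_mapping[index]
        if name == "O" or item["score"] < min_score:
            continue
        grouped[name].append(item["word"])
    return dict(grouped)


# Hand-written sample (not real model output); start/end offsets omitted.
sample_results = [
    {"entity_group": "LABEL_14", "score": 0.98, "word": "Azərbaycanda"},
    {"entity_group": "LABEL_17", "score": 0.91, "word": "17"},
    {"entity_group": "LABEL_0", "score": 0.99, "word": "məscid"},
    {"entity_group": "LABEL_17", "score": 0.30, "word": "814"},
]
sample_mapping = {0: "O", 14: "GPE", 17: "CARDINAL"}

print(summarize_entities(sample_results, sample_mapping))
```

With real pipeline output, `ner_results` and the full `label_mapping` from the example above would take the place of the sample data.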

## License

This model is licensed under the CC BY-NC-ND 4.0 license.

Key terms of this license:

- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made.
- Non-Commercial: You may not use the material for commercial purposes.
- No Derivatives: If you remix, transform, or build upon the material, you may not distribute the modified material.

For more information, please refer to the <a target="_blank" href="https://creativecommons.org/licenses/by-nc-nd/4.0/">CC BY-NC-ND 4.0 license</a>.

## Contact

For more information, questions, or issues, please contact LocalDoc at [[email protected]].