---
language: fa
tags:
- persian
- mobilebert
license: apache-2.0
pipeline_tag: fill-mask
mask_token: '[MASK]'
widget:
- text: 'در همین لحظه که شما مشغول [MASK] این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم.'
---


### Shiraz Language Model

Welcome to Shiraz, the repository for Lifeweb's language model. The first versions of our models are all trained on our own dataset, **Divan**, which contains more than **164 million documents** and more than **10B tokens**. The dataset has been meticulously normalized and deduplicated to keep it rich and comprehensive. A better dataset leads to a better model!

# Use Model

You can load and use the model with the sample code below.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, FillMaskPipeline

# v1.0
model_name = "lifeweb-ai/shiraz"

# Load the tokenizer and the masked-language model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Tokenize a sample sentence
text = "در همین لحظه که شما مشغول خواندن این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم."
print(tokenizer.tokenize(text))
# ['در', 'همین', 'لحظه', 'که', 'شما', 'مشغول', 'خواندن', 'این', 'متن', 'هستید،', 'میلیون', '[zwnj]', 'ها', 'دیتا', 'در', 'فضای', 'انلاین', 'در', 'حال', 'تولید', 'است', '.', 'ما', 'در', 'لایف', 'وب', 'به', 'جمع', '[zwnj]', 'اوری', '##،', 'پردازش', 'و', 'تحلیل', 'این', 'کلان', 'داده', '(', 'big', 'data', ')', 'می', '[zwnj]', 'پردازیم', '.', '.']

# Fill-mask task: the same sentence with one word replaced by [MASK]
text = "در همین لحظه که شما مشغول [MASK] این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم."
fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer)
result = fill_mask(text)

# Print the highest-scoring prediction for the masked position
print(result[0])
# {'score': 0.3584367036819458, 'token': 5764, 'token_str': 'خواندن', 'sequence': 'در همین لحظه که شما مشغول خواندن این متن هستید، میلیون ها دیتا در فضای انلاین در حال تولید است. ما در لایف وب به جمع اوری، پردازش و تحلیل این کلان داده ( big data ) می پردازیم.'}
```

# Results

Shiraz is evaluated on three downstream NLP tasks: **NER**, **Sentiment Analysis**, and **Emotion Detection**.
Shiraz is considerably faster, while its accuracy remains highly competitive. According to the [**MobileBERT paper**](https://arxiv.org/pdf/2004.02984.pdf), the MobileBERT architecture, on which Shiraz is based, is 4.3× smaller and 5.5× faster than BERT-base. The table below reports the macro F1 score for each task, together with the Colab code for each task, which you can use as a tutorial.

| Model | NER: Arman | NER: Peyma | Sentiment: Sentipers (multi) | Sentiment: Snappfood | Emotion: Arman |
|---|---|---|---|---|---|
| lifeweb-ai/tehran | 71.87% | 90.79% | 63.75% | 88.74% | 77.73% |
| lifeweb-ai/shiraz | 67.62% (Colab Code) | 86.24% (Colab Code) | 59.17% (Colab Code) | 88.01% (Colab Code) | 66.97% (Colab Code) |
| sbunlp/fabert | 71.23% (Colab Code) | 88.53% (Colab Code) | 58.51% (Colab Code) | 88.60% (Colab Code) | 72.65% |
| ViraIntelligentDataMining/AriaBERT | 69.12% (Colab Code) | 87.15% (Colab Code) | 59.26% (Colab Code) | 87.96% (Colab Code) | 69.11% |
| HooshvareLab/bert-fa-zwnj-base | 67.49% (Colab Code) | 85.73% (Colab Code) | 59.61% (Colab Code) | 87.58% (Colab Code) | 59.27% (Colab Code) |
| HooshvareLab/roberta-fa-zwnj-base | 69.73% (Colab Code) | 86.21% (Colab Code) | 56.23% (Colab Code) | 87.19% (Colab Code) | 57.96% (Colab Code) |

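
The scores above are macro F1 values. As a rough illustration of what that metric means, the sketch below computes macro F1 for a single-label classification task such as sentiment or emotion detection: F1 is computed per class and then averaged with equal weight per class. The function name `macro_f1` and the sample labels are hypothetical and not taken from the evaluation notebooks. NER is often scored at the entity level with a dedicated library instead, and this card does not state which variant was used.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Count true positives, false positives and false negatives per class
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_label, pred_label in zip(y_true, y_pred):
        if true_label == pred_label:
            tp[true_label] += 1
        else:
            fp[pred_label] += 1
            fn[true_label] += 1

    # F1 per class = 2*TP / (2*TP + FP + FN); 0.0 when the class never occurs
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)

    # Every class contributes equally, regardless of how frequent it is
    return sum(scores) / len(scores)


# Made-up labels for illustration only
gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "neu"]
print(f"{macro_f1(gold, pred):.4f}")
```

Because each class is weighted equally, a model that does poorly on a rare class is penalized more under macro F1 than under plain accuracy.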

If you have tested our models on a public dataset and want to add your results to the table above, open a pull request or contact us. Please also make your code available online so that we can add a reference to it.

# Cite

You are welcome to use our language models in your work or research. If you do, we kindly ask you to cite them using the following entry:

```
@misc{Shiraz,
  author = {Mehrdad Azizi and Reza Salehi Chegeni and Parisa Mousavi and Iman Hashemi},
  title = {[Optimizing Pre-trained BERT-based Models for Persian Language Processing]},
  year = {2024},
  publisher = {LifeWeb}
}
```

# Contributors

- Mehrdad Azizi: [**LinkedIn**](https://www.linkedin.com/in/mehrdad-azizi-50839489/), [**GitHub**](https://github.com/mehrazi)
- Reza Salehi Chegeni: [**LinkedIn**](https://www.linkedin.com/in/reza-salehi-chegeni-6988ba271/), [**GitHub**](https://github.com/rezasalehichegeni)
- Parisa Mousavi: [**LinkedIn**](https://www.linkedin.com/in/seyede-parisa-mousavi/), [**GitHub**](https://github.com/Mousavi-Parisa)
- Iman Hashemi: [**LinkedIn**](https://www.linkedin.com/in/iman-hashemi-403738a5), [**GitHub**](https://github.com/hashemiiman)
- Lifeweb: [**HuggingFace**](https://huggingface.co/lifeweb-ai), [**Official Website**](https://lifewebco.com/), [**LinkedIn**](https://www.linkedin.com/company/lifewebir/mycompany/)

# Releases

**v1.0 (2024-03-09)**

First version of the **Shiraz** model, trained on **Divan**.