---
language:
- bn
license: apache-2.0
datasets:
- uonlp/CulturaX
- wikipedia
pipeline_tag: text-generation
---
# TituLM-1B-BN-V1
TituLM-1B-BN-V1 is a large language model trained specifically for generating and understanding Bangla text. Built on a decoder-only transformer architecture, it was trained on a dataset of 4.51 billion Bangla tokens. This model is part of Hishab's iterative effort to train and release Bangla LLMs.
## Training
Training was managed with MosaicML's [llm-foundry](https://github.com/mosaicml/llm-foundry) framework. During training, titulm-1b-bn-v1 underwent a total of 59 iterations, allowing for iterative refinement and optimization.
Notable training configs (see the sketch after this list for how to check them on the released checkpoint):
- n_heads: 16
- n_layers: 24
- max_seq_len: 2048
- vocab_size: 72000
- attn_impl: flash
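These hyperparameters are exposed on the published checkpoint's config, so they can be verified directly. A minimal sketch, assuming the MPT-style attribute names that llm-foundry exports (`n_heads`, `n_layers`, `max_seq_len`, `vocab_size`, `attn_config`):

```py
import transformers

# Inspect the released checkpoint's config; attribute names assume the
# MPT-style config exported by llm-foundry.
config = transformers.AutoConfig.from_pretrained(
    'hishab/titulm-1b-bn-v1', trust_remote_code=True
)
print(config.n_heads, config.n_layers, config.max_seq_len, config.vocab_size)
print(config.attn_config['attn_impl'])
```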
__Training evaluation status__
- Evaluation cross-entropy loss
  Final loss: 3.11
  <img src="https://cdn-uploads.huggingface.co/production/uploads/5f40b34279c1ba4c353d0c7a/Mr0yAg9AfXTm15GATgSTN.png" alt="Evaluation cross-entropy loss curve" width="620" height="620">
- Language perplexity
  Final perplexity: 22.562
  <img src="https://cdn-uploads.huggingface.co/production/uploads/5f40b34279c1ba4c353d0c7a/B-ZC1LfFZdCTO25Twcyth.png" alt="Language perplexity curve" width="620" height="620">
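As a consistency check, perplexity is the exponential of the cross-entropy loss, so the two final numbers above agree up to rounding of the reported loss:

```py
import math

# exp(cross-entropy loss) gives perplexity: exp(3.11) ≈ 22.42, in line
# with the reported final perplexity of 22.562 (the loss is rounded).
print(math.exp(3.11))
```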
## Datasets
We collected Bangla text datasets from several sources, including:
- CulturaX
- Books
- Bangla Wikipedia
- Banglapedia
- News articles
The deduplicated dataset totals 58 GB of text, amounting to 4.51 billion tokens when tokenized with our SentencePiece model.
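For reference, a minimal sketch of how a corpus token count like the 4.51 billion figure can be reproduced with the released tokenizer; `corpus.txt` is a hypothetical placeholder path, not a published artifact:

```py
from transformers import AutoTokenizer

# Count tokens in a text corpus with the model's own tokenizer.
# 'corpus.txt' is a placeholder for the (non-released) training text.
tokenizer = AutoTokenizer.from_pretrained('hishab/titulm-1b-bn-v1')

total_tokens = 0
with open('corpus.txt', encoding='utf-8') as f:
    for line in f:
        total_tokens += len(tokenizer.encode(line, add_special_tokens=False))
print(total_tokens)
```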
## How to Use
Generating text with this model is straightforward. Install the following libraries before running the code:
- pip install transformers
- pip install einops
- pip install accelerate
```py
import transformers
from transformers import pipeline

name = 'hishab/titulm-1b-bn-v1'

# Load the model config and set the maximum sequence length.
config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.max_seq_len = 2048

# Load the model, allowing the custom modeling code from the repository.
model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    config=config,
    trust_remote_code=True
)
tokenizer = transformers.AutoTokenizer.from_pretrained(name)

# Build a text-generation pipeline on the first GPU.
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')

# Generate up to 100 new tokens with sampling enabled.
output = pipe('আমি বাংলায় গান\n',
              max_new_tokens=100,
              do_sample=True,
              use_cache=True)
print(output)
```
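Because `do_sample=True`, each call produces different text. Standard `transformers` sampling parameters such as `temperature`, `top_k`, and `top_p` can be passed through the pipeline to trade diversity against coherence; the values below are illustrative, not tuned recommendations:

```py
# Reuses the pipeline built above; sampling values are illustrative only.
output = pipe('আমি বাংলায় গান\n',
              max_new_tokens=100,
              do_sample=True,
              temperature=0.8,
              top_k=50,
              top_p=0.95,
              use_cache=True)
print(output[0]['generated_text'])
```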