---
language:
- bn
license: apache-2.0
datasets:
- uonlp/CulturaX
- wikipedia
pipeline_tag: text-generation
---

# TituLM-1B-BN-V1

TituLM-1B-BN-V1 is a large language model trained specifically for generating and understanding Bangla text. Built on a decoder-style transformer architecture, the model was trained on a dataset comprising 4.51 billion Bangla tokens. It is part of Hishab's iterative effort to train and release Bangla LLMs.

## Training
Training was managed with MosaicML's [llm-foundry](https://github.com/mosaicml/llm-foundry) framework. Over the course of training, titulm-1b-bn-v1 went through a total of 59 iterations, allowing for iterative refinement and optimization.
Notable training configurations:

- n_heads: 16
- n_layers: 24
- max_sequence_length: 2048
- vocab_size: 72000
- attn_impl: flash
- Trained on 8 H100 GPUs on GCP
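
Since the model was trained with llm-foundry, the released checkpoint uses a custom (MPT-style) configuration. As a quick sanity check, the hyperparameters above can be read back from the published config; the attribute names below (`n_heads`, `n_layers`, `max_seq_len`, `vocab_size`) assume the MPT config convention and are illustrative rather than guaranteed:

```py
from transformers import AutoConfig

# Load the released config; trust_remote_code is required for the custom architecture.
config = AutoConfig.from_pretrained('hishab/titulm-1b-bn-v1', trust_remote_code=True)

# Attribute names assume the MPT-style config used by llm-foundry.
print(config.n_heads, config.n_layers, config.max_seq_len, config.vocab_size)
# Expected, per the list above: 16 24 2048 72000
```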

__Training evaluation status__

- Evaluation CrossEntropy Loss

  Final loss: 3.11
  <img src="https://cdn-uploads.huggingface.co/production/uploads/5f40b34279c1ba4c353d0c7a/Mr0yAg9AfXTm15GATgSTN.png" alt="alt text" width="620" height="620">

- Language Perplexity

  Final Perplexity: 22.562
  <img src="https://cdn-uploads.huggingface.co/production/uploads/5f40b34279c1ba4c353d0c7a/B-ZC1LfFZdCTO25Twcyth.png" alt="alt text" width="620" height="620">
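
Assuming the perplexity is computed as the exponential of the same cross-entropy loss (the usual convention), the two reported numbers are consistent with each other; the small gap comes from the loss being rounded to 3.11:

```py
import math

# Perplexity = exp(cross-entropy loss); exp(3.11) ≈ 22.4, in line with the reported 22.562.
print(math.exp(3.11))
```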

## Datasets
We compiled Bangla text datasets from several sources, including:

- CulturaX
- Books
- Bangla Wikipedia
- Banglapedia
- News articles

The total dataset size is 58 GB of deduplicated text, amounting to 4.51 billion tokens when tokenized with our SentencePiece model.
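
The SentencePiece-based tokenizer ships with the model on the Hugging Face Hub, so its vocabulary and tokenization behavior can be inspected directly. A minimal sketch (the printed vocabulary size may differ from 72,000 by a handful of added special tokens):

```py
from transformers import AutoTokenizer

# Load the SentencePiece-based tokenizer released with the model.
tokenizer = AutoTokenizer.from_pretrained('hishab/titulm-1b-bn-v1')

# Vocabulary size, expected to match the vocab_size of 72000 listed above.
print(tokenizer.vocab_size)

# Tokenize a short Bangla sentence and count the resulting tokens.
ids = tokenizer('আমি বাংলায় গান গাই')['input_ids']
print(len(ids), ids)
```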


## How to Use
Generating text with this model is straightforward. Follow the code below.

Install the following libraries before running the code:

```sh
pip install transformers
pip install einops
pip install accelerate
```

```py
import transformers
from transformers import pipeline

model_name = 'hishab/titulm-1b-bn-v1'

# Load the model config; trust_remote_code is required for the custom architecture.
config = transformers.AutoConfig.from_pretrained(model_name, trust_remote_code=True)
config.max_seq_len = 2048

model = transformers.AutoModelForCausalLM.from_pretrained(
  model_name,
  config=config,
  trust_remote_code=True
)

tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)

# Generate Bangla text from a short prompt.
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')
output = pipe('আমি বাংলায় গান',
              max_new_tokens=100,
              do_sample=True,
              use_cache=True)

print(output)
```
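
If you prefer to call the model directly rather than through the `pipeline` helper, a minimal sketch reusing the `model` and `tokenizer` objects loaded above might look like this:

```py
import torch

# Move the model to the GPU and tokenize the prompt.
model = model.to('cuda:0')
inputs = tokenizer('আমি বাংলায় গান', return_tensors='pt').to('cuda:0')

# Sample up to 100 new tokens, mirroring the pipeline call above.
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=100, do_sample=True)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```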


## Citation
```bibtex
@misc{hishab_2024_titulm_1b_bn_v1,
  author = {Hishab Technologies Ltd.},
  title = {TituLM-1B-BN-V1},
  year = {2024},
  publisher = {HuggingFace Models},
  howpublished = {https://huggingface.co/hishab/titulm-1b-bn-v1},
}
```