sagorsarker committed • Commit 60c73ee • Parent(s): 0c58c9f • Update README.md

README.md (updated section):
TituLM-1B-BN-V1 is a large language model trained specifically for generating and understanding Bangla text. Built on a decoder-style transformer architecture, it has been trained on a dataset comprising 4.51 billion Bangla tokens. The model is part of Hishab's iterative effort to train and release Bangla LLMs.

## Training

The training process was managed using MosaicML's [llm-foundry](https://github.com/mosaicml/llm-foundry) framework. Throughout the training phase, titulm-1b-bn-v1 underwent a total of 59 iterations, allowing for iterative refinement and optimization.

Notable training configurations:

- n_heads: 16
- n_layers: 24
- max_sequence_length: 2048
- vocab_size: 72000
- attn_impl: flash
## Datasets
We added Bangla text datasets from several sources, including:

- CulturaX
- Books
- Bangla Wikipedia
- Banglapedia
- News articles

Our total data size is 58 GB of deduplicated text, which amounts to 4.51 billion tokens when tokenized with our SentencePiece model.
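As a rough illustration of tokenization, the sketch below loads the tokenizer through 🤗 Transformers and counts tokens for a Bangla sentence. The hub repository id `hishab/titulm-1b-bn-v1` is an assumption and may differ from the actual path.

```py
from transformers import AutoTokenizer

# Assumed hub repository id; replace with the actual model path if it differs.
tokenizer = AutoTokenizer.from_pretrained("hishab/titulm-1b-bn-v1")

# Example Bangla sentence: "Bangla is an Indo-Aryan language of South Asia."
text = "বাংলা ভাষা দক্ষিণ এশিয়ার একটি ইন্দো-আর্য ভাষা।"
token_ids = tokenizer.encode(text)
print(f"{len(token_ids)} tokens")
```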
## How to Use
Generating text with this model is simple. Follow the code below to get started.

Install the following libraries before running the code:

- pip install transformers
- pip install einops
- pip install accelerate
```py
# Code will be added soon.
```
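Until the official snippet is added above, here is a minimal sketch of what loading and generation could look like with 🤗 Transformers. The repository id `hishab/titulm-1b-bn-v1` and the use of `trust_remote_code=True` (commonly required for llm-foundry/MPT-style checkpoints) are assumptions, not confirmed details from this card.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed Hugging Face repository id; adjust if the actual path differs.
model_id = "hishab/titulm-1b-bn-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,   # llm-foundry (MPT-style) models usually ship custom modeling code
    torch_dtype=torch.bfloat16,
    device_map="auto",        # uses accelerate to place weights automatically
)

prompt = "বাংলাদেশের রাজধানী"  # Bangla prompt: "The capital of Bangladesh"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.8,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The sampling parameters (top_k, top_p, temperature) are illustrative defaults; set do_sample=False for deterministic greedy decoding.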