shivendrra committed
Commit 2846342
1 Parent(s): 013d4b1

Update README.md

Files changed (1)
  1. README.md +115 -0
README.md CHANGED

---
license: mit
datasets:
- HuggingFaceTB/cosmopedia
- bigcode/starcoderdata
- shivendrra/consolidated-datasets
language:
- en
tags:
- transformers
- bert
- decoder-only
- encoder-decoder
- mixture of experts
- moe
- MoE
- aiva-500m
- transformer model
- llm
- small scale model
---

# aiva-4x500m

## Model Details
This is a transformer-based model trained on the [cosmopedia] and [starcoder] datasets. It can generate new sequences and classify emotions and sentiments in text. It uses a Mixture-of-Experts (MoE) architecture, like Mistral's 8x7B, but with 4 experts of roughly 500 million parameters each.
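
To make the MoE idea concrete, here is a minimal, hypothetical sketch of top-2 routing over 4 feed-forward experts in PyTorch. The class names, expert design, and routing details are assumptions for illustration and are not taken from this repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A simple feed-forward expert (hypothetical size and expansion)."""
    def __init__(self, d_model):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return self.net(x)

class SparseMoE(nn.Module):
    """Hypothetical top-2 routing over 4 experts, in the spirit of Mixtral-style MoE."""
    def __init__(self, d_model, n_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(Expert(d_model) for _ in range(n_experts))
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):
        # score every expert for every token, keep only the top-k choices
        scores = self.gate(x)                                  # (batch, seq, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)  # (batch, seq, top_k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # run each expert only on the tokens routed to it, then mix by gate weight
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                        # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# quick shape check with dummy data
moe = SparseMoE(d_model=512)
print(moe(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```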

For now it only includes the language model, but I'm working on vision and audio models, which will be uploaded soon.
### Model Description

- **Developed by:** [Shivendra Singh](https://twitter.com/shivendrra_)
- **License:** [MIT]
- **Train loss:** 0.2035
- **Accuracy:** not yet determined (for next-token prediction)

### Model Sources

- **Repository:** [github/aiva-4x500m](https://github.com/shivendrra/AIVA-4x500m)
- **Papers:** None

## Uses
For now, the language model can be used to generate new tokens, for masked-token prediction, and for sentiment analysis. In the future, it will be paired with the audio and vision models to work like AVA from *Ex Machina*: it could listen to a human, talk to them, and understand sentiments, emotions, and actions using its vision and audio capabilities.

## Training Details

### Training Data
---
Training data was drawn from these datasets: [cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia), [shivendrra/consolidated-datasets](https://huggingface.co/datasets/shivendrra/consolidated-datasets), [starcoderdata](https://huggingface.co/datasets/bigcode/starcoderdata)
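
For reference, a minimal sketch of how these datasets can be streamed with the `datasets` library; the subset/config names and field names are assumptions, and this is not the exact loading code used for training.

```python
from datasets import load_dataset

# stream the corpora instead of downloading them in full;
# the "stories" config and "python" data_dir are assumptions for illustration
cosmopedia = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", streaming=True)
starcoder = load_dataset("bigcode/starcoderdata", data_dir="python", split="train", streaming=True)

# peek at one sample from each (field names assumed: "text" for cosmopedia, "content" for starcoder)
print(next(iter(cosmopedia))["text"][:200])
print(next(iter(starcoder))["content"][:200])
```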

### Training Procedure
---
The transformer-based model was trained for 35k iterations on 3.5 billion tokens, taking around 25 hours on Google Colab's T4 GPU. I had access to a lot more data, but I didn't train it further because of budget issues and technical limitations.

#### Functions:
Training follows a basic procedure: `get_batch()` generates batches of data, `estimate_loss()` estimates the train and validation losses, and `train()` is the master function that calls the others every iteration or at set intervals.

```python
import torch

def get_batch(split):
    # generate a small batch of inputs x and targets y (targets are shifted by one token)
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    # average the loss over eval_iters batches for both splits
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

for iter in range(max_iters):
    # periodically report train/val loss
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch, compute the loss, and take an optimizer step
    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
```
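
The loop above assumes `train_data`, `val_data`, `model`, and `optimizer` already exist. A hedged sketch of that preparation, using `tiktoken` as mentioned in the architecture section below (the encoding name, corpus file, and 90/10 split are assumptions):

```python
import torch
import tiktoken

# values taken from base/config.json
batch_size, block_size, learning_rate = 10, 256, 3e-5
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# encode the raw corpus with tiktoken; the encoding name and corpus file are assumptions
enc = tiktoken.get_encoding("cl100k_base")
with open("corpus.txt", encoding="utf-8") as f:
    data = torch.tensor(enc.encode(f.read()), dtype=torch.long)

# 90/10 train/validation split (the actual split ratio is not documented)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

# the model itself comes from the repository; AdamW is a common choice and assumed here
# optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
```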

#### Training Hyperparameters

Configurations are saved in the `base/config.json` file and are suitable for a 500-million-parameter encoder-decoder model.

```json
{
  "batch_size": 10,
  "block_size": 256,
  "max_iters": 5000,
  "eval_interval": 50,
  "learning_rate": 3e-5,
  "eval_iters": 100,
  "d_model": 512,
  "n_head": 18,
  "n_layer": 12,
  "dropout": 0.2,
  "norm_eps": 1e-5
}
```
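
A small sketch of reading this config in Python; the key names follow the JSON above, but the loading code itself is illustrative rather than the repository's.

```python
import json

# read the hyperparameters from base/config.json
with open("base/config.json") as f:
    config = json.load(f)

batch_size = config["batch_size"]        # 10
block_size = config["block_size"]        # 256 (context length)
learning_rate = config["learning_rate"]  # 3e-5
d_model, n_head, n_layer = config["d_model"], config["n_head"], config["n_layer"]
dropout, norm_eps = config["dropout"], config["norm_eps"]
```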

### Model Architecture and Objective

There is one trained model uploaded for now: a 536-million-parameter transformer trained for over 35k iterations. It uses RMS normalization and has a context size of only 256 tokens. `tiktoken` is used for tokenization, and the tokenization file, configured to match the trained model, is also included.
The decoder-based model isn't uploaded yet; it's a little hard to train due to its complexity, but it will be uploaded soon.

### Highlights
1. **RMS Normalization & Pre-normalization:** Both models use RMS normalization, as implemented in LLaMA-2, and apply pre-normalization for training stability (see the sketch after this list).
2. **Self-Attention Layer:** The encoder and final attention layers have no masking, and the keys, queries, and values have biases added to them. The decoder attention layer has a triangular mask applied, without any biases. Also, encoder attention adds relative positional embeddings to the attention matrix before the `softmax`.
3. **FeedForward:** A basic feed-forward network with two linear layers and an expansion factor of 5. GELU is used as the activation function instead of ReLU.
4. **Generation:** The token-generation function uses top_k, top_p, and beam search along with temperature scaling, but there is a bug and it isn't working as it's supposed to. I'll try to correct it and then upload it again.
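
A minimal sketch of the RMS normalization and feed-forward blocks described in points 1 and 3, written in PyTorch to match those descriptions; exact layer names and dropout placement in the repository may differ.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMS normalization (LLaMA-style): rescale by the root-mean-square, no mean subtraction."""
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class FeedForward(nn.Module):
    """Two linear layers with a 5x expansion and GELU, as in point 3; dropout placement is an assumption."""
    def __init__(self, d_model, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, 5 * d_model),
            nn.GELU(),
            nn.Linear(5 * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

# pre-normalization: normalize before the sub-layer, then add the residual
x = torch.randn(2, 256, 512)
print((x + FeedForward(512)(RMSNorm(512)(x))).shape)  # torch.Size([2, 256, 512])
```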