NeuralNovel committed
Commit cf7184f • 1 Parent(s): 15d92e0
Update README.md
README.md CHANGED
@@ -119,6 +119,12 @@ In the boundless sands ..
 A model to test how MoE will route without square expansion.
 
 ## "[What is a Mixture of Experts (MoE)?](https://huggingface.co/blog/moe)"
+
+[Join our Discord!](https://discord.gg/rJXGjmxqzS)
+
+<a href='https://ko-fi.com/S6S2UH2TC' target='_blank'><img height='36' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi1.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>
+
+
 ### (from the MistralAI papers...click the quoted question above to navigate to it directly.)
 
 The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.
@@ -136,10 +142,6 @@ At every layer, for every token, a router network chooses two of these groups (t
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/up_I0R2TQGjqTShZp_1Sz.png)
 
 
-[Join our Discord!](https://discord.gg/rJXGjmxqzS)
-
-<a href='https://ko-fi.com/S6S2UH2TC' target='_blank'><img height='36' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi1.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>
-
 Switch Layer
 MoE layer from the [Switch Transformers paper](https://arxiv.org/abs/2101.03961)
 
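For readers skimming the diff: the routing idea the README refers to (per the hunk context above, at every layer, for every token, a router network chooses two of the expert groups) can be sketched in a few lines. The block below is a minimal, hypothetical PyTorch sketch of top-2 expert routing, not the code behind this model; the names used here (Top2MoELayer, d_model, d_ff, num_experts) are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Top2MoELayer(nn.Module):
    """Hypothetical sketch of top-2 expert routing; not this repository's implementation."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8):
        super().__init__()
        # Router (gating) network: scores every expert for each token.
        self.router = nn.Linear(d_model, num_experts)
        # Each expert is an independent feed-forward block.
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); route each token independently.
        tokens = x.reshape(-1, x.size(-1))
        gate_logits = self.router(tokens)                 # (n_tokens, num_experts)
        top_vals, top_idx = gate_logits.topk(2, dim=-1)   # choose two experts per token
        gates = F.softmax(top_vals, dim=-1)               # normalise the two gate weights

        out = torch.zeros_like(tokens)
        for slot in range(2):                             # first and second choice
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e              # tokens whose slot routes to expert e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)


# Example usage (shapes only):
layer = Top2MoELayer(d_model=512, d_ff=2048, num_experts=8)
y = layer(torch.randn(2, 16, 512))  # -> (2, 16, 512)
```

A Switch layer, as in the Switch Transformers paper linked above, routes each token to a single expert rather than two; the same sketch covers that case by taking the top-1 gate instead of top-2.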