NeuralNovel committed on
Commit cf7184f
1 Parent(s): 15d92e0

Update README.md

Files changed (1)
  1. README.md +6 -4
README.md CHANGED
@@ -119,6 +119,12 @@ In the boundless sands ..
 A model to test how MoE will route without square expansion.
 
 ## "[What is a Mixture of Experts (MoE)?](https://huggingface.co/blog/moe)"
+
+[Join our Discord!](https://discord.gg/rJXGjmxqzS)
+
+<a href='https://ko-fi.com/S6S2UH2TC' target='_blank'><img height='36' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi1.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>
+
+
 ### (from the MistralAI papers...click the quoted question above to navigate to it directly.)
 
 The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.
@@ -136,10 +142,6 @@ At every layer, for every token, a router network chooses two of these groups (t
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/up_I0R2TQGjqTShZp_1Sz.png)
 
 
-[Join our Discord!](https://discord.gg/rJXGjmxqzS)
-
-<a href='https://ko-fi.com/S6S2UH2TC' target='_blank'><img height='36' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi1.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>
-
 Switch Layer
 MoE layer from the [Switch Transformers paper](https://arxiv.org/abs/2101.03961)
 
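The routing described in the README's second hunk context (a router network picks two of the expert groups for every token at every layer and combines their outputs additively) can be illustrated with a short sketch. This is a hypothetical, minimal top-2 router written for this note, not code from this repository or from the Mixtral/Switch Transformers releases; the class name `TopTwoMoELayer` and the parameters `hidden_dim`, `ffn_dim`, and `num_experts` are illustrative assumptions.

```python
# Hypothetical sketch of top-2 expert routing as described in the linked MoE blog post.
# Not from this repository; names and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoELayer(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int):
        super().__init__()
        # One feed-forward "expert" per group, plus a linear router that scores experts per token.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, hidden_dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(hidden_dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden_dim). The router keeps the two best-scoring experts per token
        # and combines their outputs additively, weighted by the renormalized router scores.
        logits = self.router(x)                               # (tokens, experts)
        weights, indices = torch.topk(logits, k=2, dim=-1)    # two experts per token
        weights = F.softmax(weights, dim=-1)                  # combination weights over the top 2
        out = torch.zeros_like(x)
        for slot in range(2):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                  # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Example: route 8 tokens of width 16 across 4 experts.
layer = TopTwoMoELayer(hidden_dim=16, ffn_dim=32, num_experts=4)
print(layer(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```

Production MoE layers such as the Switch Transformers layer linked above also add load-balancing losses and expert capacity limits; the sketch omits those for brevity.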