training data word choice fix
README.md
CHANGED
@@ -277,9 +277,9 @@ Granite-3.0-2B-Base is based on a decoder-only dense transformer architecture. C
 | # Training tokens | **12T** | 12T | 10T | 10T |
 
 **Training Data:**
-This model is trained on a mix of open source and proprietary data following a two-
-* Stage 1 data: The data for
-* Stage 2 data: The data for
+This model is trained on a mix of open source and proprietary data following a two-stage training strategy.
+* Stage 1 data: The data for stage 1 is sourced from diverse domains, such as: web, code, academic sources, books, and math data.
+* Stage 2 data: The data for stage 2 comprises a curated mix of high-quality data from the same domains, plus multilingual and instruction data. The goal of this second training phase is to enhance the model’s performance on specific tasks.
 
 **Infrastructure:**
 We train Granite 3.0 Language Models using IBM's super computing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.