Commit 8088938 by mysticbeing (parent: aa4f24d): Update README.md

README.md CHANGED
@@ -51,6 +51,17 @@ By accessing this model, you are agreeing to the Llama 3.1 terms and conditions
Quantized version of [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) with the updated 8 KV-heads.
It achieves an average score of [TBD] on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 86.79.
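As a rough illustration of why the 8 KV-heads matter: with grouped-query attention, the KV cache scales with the number of KV heads rather than the number of query heads. The sketch below assumes Llama-3.1-70B's published shape (80 layers, head dimension 128, 64 query heads); these figures are back-of-the-envelope estimates, not measurements of this checkpoint.

```python
# Rough KV-cache size for a single sequence, in GiB.
# Assumed model shape (Llama-3.1-70B): 80 layers, head_dim 128 -- not from this README.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    # 2x for the separate key and value tensors in each layer
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / 2**30

# 8 KV-heads (grouped-query attention) vs. a hypothetical 64 (one KV head per
# query head), at a 128k-token context with FP16 (2-byte) cache entries:
gqa = kv_cache_gib(80, 8, 128, 131072, 2)   # ~40 GiB
mha = kv_cache_gib(80, 64, 128, 131072, 2)  # ~320 GiB
print(f"8 KV-heads: {gqa:.0f} GiB, 64 KV-heads: {mha:.0f} GiB")
```

The 8x reduction in cache size is what makes long contexts feasible on a modest number of GPUs.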

### Quantized models are eco-friendly and cost-effective
FP8 quantized models require significantly less storage compared to traditional 32-bit (FP32) or even 16-bit (FP16) models.
This reduction can be seen in the total file size comparison, where the FP8 model set is nearly half the size of the higher-precision set.
This efficiency enables easier distribution, storage, and access to powerful AI models, even on devices with limited capacity.
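The storage claim is easy to sanity-check with simple arithmetic: checkpoint size is roughly parameter count times bytes per weight. The estimate below ignores non-weight tensors, per-tensor quantization scales, and file-format overhead, so real checkpoints will be slightly larger.

```python
# Approximate checkpoint size for a 70B-parameter model at different precisions.
# Ignores non-weight tensors, quantization scales, and file-format overhead.
def model_size_gb(n_params, bytes_per_weight):
    return n_params * bytes_per_weight / 1e9

n = 70e9  # 70 billion parameters
for name, width in [("FP32", 4), ("FP16", 2), ("FP8", 1)]:
    print(f"{name}: ~{model_size_gb(n, width):.0f} GB")
```

At one byte per weight, the FP8 weights come to roughly 70 GB, about half the ~140 GB of an FP16 checkpoint, which matches the "nearly half the size" comparison above.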

Lower hardware requirements mean reduced costs for businesses and public institutions adopting AI solutions.
Small businesses, startups, and government entities, which may lack extensive AI budgets, can leverage high-performance FP8 quantized models to solve problems with half the infrastructure cost.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6590c65952dc1046ca0f13fe/YfP2hvWReX8T6hPr_7Enl.png)

[Base model description - Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF):

Llama-3.1-Nemotron-70B-Instruct-HF is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries.