Commit 8088938 by mysticbeing (parent: aa4f24d): Update README.md

README.md CHANGED
@@ -51,6 +51,17 @@ By accessing this model, you are agreeing to the Llama 3.1 terms and conditions
Quantized version of [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF) with the updated 8 KV-heads.
It achieves an average score of [TBD] on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) benchmark (version 1), whereas the unquantized model achieves 86.79.
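As a rough illustration of why the 8 KV-heads matter: with grouped-query attention, the KV cache scales with the number of KV heads rather than the number of query heads. The sketch below assumes Llama-3.1-70B's published shape (80 layers, head dimension 128, 64 query heads); these figures are back-of-the-envelope estimates, not measurements of this checkpoint.

```python
# Rough KV-cache size for a single sequence, in GiB.
# Assumed model shape (Llama-3.1-70B): 80 layers, head_dim 128 -- not from this README.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    # 2x for the separate key and value tensors in each layer
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / 2**30

# 8 KV-heads (grouped-query attention) vs. a hypothetical 64 (one KV head per
# query head), at a 128k-token context with FP16 (2-byte) cache entries:
gqa = kv_cache_gib(80, 8, 128, 131072, 2)   # ~40 GiB
mha = kv_cache_gib(80, 64, 128, 131072, 2)  # ~320 GiB
print(f"8 KV-heads: {gqa:.0f} GiB, 64 KV-heads: {mha:.0f} GiB")
```

The 8x reduction in cache size is what makes long contexts feasible on a modest number of GPUs.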

### Quantized models are eco-friendly and cost-effective
FP8 quantized models require significantly less storage compared to traditional 32-bit (FP32) or even 16-bit (FP16) models.
This reduction can be seen in the total file size comparison, where the FP8 model set is nearly half the size of the higher-precision set.
This efficiency enables easier distribution, storage, and access to powerful AI models, even on devices with limited capacity.
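The storage claim is easy to sanity-check with simple arithmetic: checkpoint size is roughly parameter count times bytes per weight. The estimate below ignores non-weight tensors, per-tensor quantization scales, and file-format overhead, so real checkpoints will be slightly larger.

```python
# Approximate checkpoint size for a 70B-parameter model at different precisions.
# Ignores non-weight tensors, quantization scales, and file-format overhead.
def model_size_gb(n_params, bytes_per_weight):
    return n_params * bytes_per_weight / 1e9

n = 70e9  # 70 billion parameters
for name, width in [("FP32", 4), ("FP16", 2), ("FP8", 1)]:
    print(f"{name}: ~{model_size_gb(n, width):.0f} GB")
```

At one byte per weight, the FP8 weights come to roughly 70 GB, about half the ~140 GB of an FP16 checkpoint, which matches the "nearly half the size" comparison above.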

Lower hardware requirements mean reduced costs for businesses and public institutions adopting AI solutions.
Small businesses, startups, and government entities, which may lack extensive AI budgets, can leverage high-performance FP8 quantized models to solve problems with half the infrastructure cost.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6590c65952dc1046ca0f13fe/YfP2hvWReX8T6hPr_7Enl.png)

[Base model description - Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF):

Llama-3.1-Nemotron-70B-Instruct-HF is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses to user queries.