alvarobartt (HF staff) committed
Commit: cfb8846
1 Parent(s): b5422fe

Update README.md

Files changed (1)
  1. README.md +19 -4
README.md CHANGED
@@ -162,18 +162,15 @@ curl 0.0.0.0:8080/v1/chat/completions \
 }'
 ```
 
-Or programmatically via the `huggingface_hub` Python client as follows (TGI is fully compatible with OpenAI so its `openai` SDK can also be used):
+Or programmatically via the `huggingface_hub` Python client as follows:
 
 ```python
 import os
-# Instead of `from openai import OpenAI`
 from huggingface_hub import InferenceClient
 
-# Instead of `client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key=os.getenv("OPENAI_API_KEY"))`
 client = InferenceClient(base_url="http://0.0.0.0:8080", api_key=os.getenv("HF_TOKEN", "-"))
 
 chat_completion = client.chat.completions.create(
-    # Instead of `model="tgi"`
     model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
     messages=[
         {"role": "system", "content": "You are a helpful assistant."},
@@ -183,6 +180,24 @@ chat_completion = client.chat.completions.create(
 )
 ```
 
+Alternatively, the OpenAI Python client can also be used (see [installation notes](https://github.com/openai/openai-python?tab=readme-ov-file#installation)) as follows:
+
+```python
+import os
+from openai import OpenAI
+
+client = OpenAI(base_url="http://0.0.0.0:8080/v1", api_key=os.getenv("OPENAI_API_KEY", "-"))
+
+chat_completion = client.chat.completions.create(
+    model="tgi",
+    messages=[
+        {"role": "system", "content": "You are a helpful assistant."},
+        {"role": "user", "content": "What is Deep Learning?"},
+    ],
+    max_tokens=128,
+)
+```
+
 ### vLLM
 
 To run vLLM with Llama 3.1 70B Instruct AWQ in INT4, you will need to have Docker installed (see [installation notes](https://docs.docker.com/engine/install/)) and run the latest vLLM Docker container as follows:
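
The Docker command that this trailing context line introduces falls outside the hunk, so it is not shown in this diff. As a rough sketch only (not part of this commit), launching vLLM's OpenAI-compatible server for this model with the official `vllm/vllm-openai` image typically looks something like the following; the port, cache mount, and the `--tensor-parallel-size`/`--max-model-len` values are illustrative assumptions, not taken from the README:

```bash
# Hypothetical sketch: the actual command lives in the README section truncated by this hunk.
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
    --quantization awq \
    --tensor-parallel-size 4 \
    --max-model-len 4096
```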