yinsong1986 committed
Commit: d1f0846
1 Parent(s): 9f347ce

Update README.md

Files changed (1): README.md +45 -19
README.md CHANGED
@@ -83,11 +83,13 @@ there were some limitations on its performance on longer context. Motivated by i
## How to Use MistralLite from Python Code ##
### Install the necessary packages

- Requires: [transformers](https://pypi.org/project/transformers/) 4.34.0 or later, and [flash-attn](https://pypi.org/project/flash-attn/) 2.3.1.post1 or later.
+ Requires: [transformers](https://pypi.org/project/transformers/) 4.34.0 or later, [flash-attn](https://pypi.org/project/flash-attn/) 2.3.1.post1 or later,
+ and [accelerate](https://pypi.org/project/accelerate/) 0.23.0 or later.

```shell
pip install transformers==4.34.0
pip install flash-attn==2.3.1.post1 --no-build-isolation
+ pip install accelerate==0.23.0
```
### You can then try the following example code

@@ -112,7 +114,7 @@ prompt = "<|prompter|>What are the main challenges to support a long context for

sequences = pipeline(
    prompt,
-     max_new_tokens=200,
+     max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
@@ -225,31 +227,55 @@ Use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggi
Example Docker parameters:

```shell
- --model-id amazon/MistralLite --port 3000 --max-input-length 8192 --max-total-tokens 16384 --max-batch-prefill-tokens 16384
+ docker run -d --gpus all --shm-size 1g -p 443:80 ghcr.io/huggingface/text-generation-inference:1.1.0 \
+     --model-id amazon/MistralLite \
+     --max-input-length 8192 \
+     --max-total-tokens 16384 \
+     --max-batch-prefill-tokens 16384
```

### Perform Inference ###
- Example Python code for inference with TGI (requires huggingface-hub 0.17.0 or later):
+ Example Python code for inference with TGI (requires `text_generation` 0.6.1 or later):

```shell
- pip3 install huggingface-hub==0.17.0
+ pip install text_generation==0.6.1
```

```python
- from huggingface_hub import InferenceClient
-
- endpoint_url = "https://your-endpoint-url-here"
-
- prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"
-
- client = InferenceClient(endpoint_url)
- response = client.text_generation(prompt,
-     max_new_tokens=100,
-     do_sample=False,
-     temperature=None,
- )
-
- print(f"Model output: {response}")
+ from text_generation import Client
+
+ SERVER_PORT = 443
+ SERVER_HOST = "localhost"
+ SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
+ tgi_client = Client(f"http://{SERVER_URL}", timeout=60)
+
+ def invoke_falconlite(prompt,
+                       random_seed=1,
+                       max_new_tokens=250,
+                       print_stream=True,
+                       assist_role=True):
+     if (assist_role):
+         prompt = f"<|prompter|>{prompt}</s><|assistant|>"
+     output = ""
+     for response in tgi_client.generate_stream(
+         prompt,
+         do_sample=False,
+         max_new_tokens=max_new_tokens,
+         typical_p=0.2,
+         temperature=None,
+         truncate=None,
+         seed=random_seed,
+     ):
+         if hasattr(response, "token"):
+             if not response.token.special:
+                 snippet = response.token.text
+                 output += snippet
+                 if (print_stream):
+                     print(snippet, end='', flush=True)
+     return output
+
+ prompt = "What are the main challenges to support a long context for LLM?"
+ result = invoke_falconlite(prompt)
```

**Important** - When using MistralLite for inference for the first time, it may require a brief 'warm-up' period that can take tens of seconds. However, subsequent inferences should be faster and return results in a more timely manner. This warm-up period is normal and should not affect the overall performance of the system once the initialisation period has been completed.
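The first two hunks only show fragments of the README's `transformers` example: the new `accelerate` requirement and the bump from `max_new_tokens=200` to `400`. A minimal sketch of how those pieces fit together is below; everything outside the changed lines (the model-loading arguments, `device_map="auto"`, which is typically what requires `accelerate`, and the `eos_token_id` handling) is an assumption rather than part of this commit.

```python
# Sketch only: fills in the parts of the transformers example that the diff does
# not show. Arguments outside the changed lines are assumptions.
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amazon/MistralLite"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=True,  # flash-attn 2.3.1.post1 or later
    device_map="auto",           # usually the reason accelerate is needed
)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"

sequences = pipeline(
    prompt,
    max_new_tokens=400,          # raised from 200 in this commit
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(seq["generated_text"])
```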
 
 
237
  ### Perform Inference ###
238
+ Example Python code for inference with TGI (requires `text_generation` 0.6.1 or later):
239
 
240
  ```shell
241
+ pip install text_generation==0.6.1
242
  ```
243
 
244
  ```python
245
+ from text_generation import Client
246
+
247
+ SERVER_PORT = 443
248
+ SERVER_HOST = "localhost"
249
+ SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
250
+ tgi_client = Client(f"http://{SERVER_URL}", timeout=60)
251
+
252
+ def invoke_falconlite(prompt,
253
+ random_seed=1,
254
+ max_new_tokens=250,
255
+ print_stream=True,
256
+ assist_role=True):
257
+ if (assist_role):
258
+ prompt = f"<|prompter|>{prompt}<|/s|><|assistant|>"
259
+ output = ""
260
+ for response in tgi_client.generate_stream(
261
+ prompt,
262
+ do_sample=False,
263
+ max_new_tokens=max_new_tokens,
264
+ typical_p=0.2,
265
+ temperature=None,
266
+ truncate=None,
267
+ seed=random_seed,
268
+ ):
269
+ if hasattr(response, "token"):
270
+ if not response.token.special:
271
+ snippet = response.token.text
272
+ output += snippet
273
+ if (print_stream):
274
+ print(snippet, end='', flush=True)
275
+ return output
276
+
277
+ prompt = "What are the main challenges to support a long context for LLM?"
278
+ result = invoke_falconlite(prompt)
279
  ```
280
 
281
  **Important** - When using MistralLite for inference for the first time, it may require a brief 'warm-up' period that can take 10s of seconds. However, subsequent inferences should be faster and return results in a more timely manner. This warm-up period is normal and should not affect the overall performance of the system once the initialisation period has been completed.
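One way to see the warm-up behaviour described in the note above is to time a few consecutive requests; the first is typically the slow one. The loop below is only an illustration and reuses the `invoke_falconlite` helper defined in the TGI example.

```python
# Illustrates the warm-up note: the first request after the server starts is
# usually much slower than the ones that follow. Reuses the invoke_falconlite()
# helper defined in the TGI example above.
import time

question = "What are the main challenges to support a long context for LLM?"
for i in range(3):
    start = time.time()
    invoke_falconlite(question, print_stream=False)
    print(f"request {i + 1}: {time.time() - start:.1f} s")
```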