Commit 9ec27a3
Parent: 8695c51
Linking to the blog
notebooks/jais_tgi_inference_endpoints.ipynb
CHANGED
@@ -2,69 +2,11 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "id": "
+   "id": "db41d8ba-71c0-4951-9a88-e1ae01a282ec",
    "metadata": {},
    "source": [
     "# Introduction\n",
-    "
-    "I want [jais-13B](https://huggingface.co/core42/jais-13b-chat) deployed with an API quickly and easily. I'm also scared of mice so ideally I can just use my keyboard. \n",
-    "\n",
-    "## Approach\n",
-    "There are lots of options out there that are \"1-click\" which is really cool! I would like to do even better and make a \"0-click\". This is great for those that are musophobic (scared of mice) or want scripts that can run without human intervention.\n",
-    "\n",
-    "We will be using [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) as our serving toolkit as it is robust and configurable. For our hardware we will be using [Inference Endpoints](https://huggingface.co/inference-endpoints) as it makes the deployment procedure really easy! We will be using the API to reach our aforementioned \"0-click\" goal."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "2086a136-6710-45af-b2b1-7224b5cbbca7",
-   "metadata": {},
-   "source": [
-    "# Pre-requisites\n",
-    "Deploying LLMs is a tough process. There are a number of challenges! \n",
-    "- These models are huge\n",
-    " - Slow to load \n",
-    " - Won't fit on convenient HW\n",
-    "- Generative transformers require iterative decoding\n",
-    "- Many of the optimizations are not consolidated\n",
-    "\n",
-    "TGI solves many of these, and while I don't want to dedicate this blog to TGI, there are a few concepts we need to cover to properly understand how to configure our deployment.\n",
-    "\n",
-    "\n",
-    "## Prefilling Phase\n",
-    "> In the prefill phase, the LLM processes the input tokens to compute the intermediate states (keys and values), which are used to generate the “first” new token. Each new token depends on all the previous tokens, but because the full extent of the input is known, at a high level this is a matrix-matrix operation that’s highly parallelized. It effectively saturates GPU utilization.\n",
-    "\n",
-    "~[Nvidia Blog](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)\n",
-    "\n",
-    "Prefilling is relatively fast.\n",
-    "\n",
-    "## Decoding Phase\n",
-    "> In the decode phase, the LLM generates output tokens autoregressively one at a time, until a stopping criteria is met. Each sequential output token needs to know all the previous iterations’ output states (keys and values). This is like a matrix-vector operation that underutilizes the GPU compute ability compared to the prefill phase. The speed at which the data (weights, keys, values, activations) is transferred to the GPU from memory dominates the latency, not how fast the computation actually happens. In other words, this is a memory-bound operation.\n",
-    "\n",
-    "~[Nvidia Blog](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/)\n",
-    "\n",
-    "Decoding is relatively slow.\n",
-    "\n",
-    "## Example\n",
-    "Let's take an example of sentiment analysis:\n",
-    "\n",
-    "Below we have input tokens that the LLM will pre-fill. Note that we know what the next token is during the pre-filling phase. We can use this to our advantage.\n",
-    "```text\n",
-    "### Instruction: What is the sentiment of the input?\n",
-    "### Examples\n",
-    "I wish the screen was bigger - Negative\n",
-    "I hate the battery - Negative\n",
-    "I love the default applications - Positive\n",
-    "### Input\n",
-    "I am happy with this purchase - \n",
-    "### Response\n",
-    "```\n",
-    "\n",
-    "Below we have the output tokens generated during the decoding phase. Despite being few in this example, we don't know what the next token will be until we have generated it.\n",
-    "\n",
-    "```text\n",
-    "Positive\n",
-    "```"
+    "Please check out my [blog post](https://datavistics.github.io/posts/jais-inference-endpoints/) for more details!"
    ]
   },
   {
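The prefill/decode split described in the removed cell is easy to see empirically: the first token has to wait for the whole prefill, while every later token only pays the decode step. Below is a minimal sketch (not part of the notebook) that measures this against a TGI endpoint you have already deployed; the endpoint URL, prompt, and `HF_TOKEN` environment variable are placeholder assumptions.

```python
import os
import time

from huggingface_hub import InferenceClient

# Placeholder values: point this at your own deployed TGI endpoint.
ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"
client = InferenceClient(model=ENDPOINT_URL, token=os.environ.get("HF_TOKEN"))

start = time.perf_counter()
arrival_times = []
for token in client.text_generation(
    "What is the sentiment of: I am happy with this purchase -",
    max_new_tokens=16,
    stream=True,
):
    arrival_times.append(time.perf_counter() - start)

# First token waits on the full prefill; later tokens only pay per-step decode cost.
print(f"time to first token: {arrival_times[0]:.3f}s")
if len(arrival_times) > 1:
    per_token = (arrival_times[-1] - arrival_times[0]) / (len(arrival_times) - 1)
    print(f"average time per decoded token: {per_token:.3f}s")
```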
@@ -142,7 +84,7 @@
    "metadata": {},
    "source": [
     "## Config\n",
-    "
+    "Choose your `ENDPOINT_NAME` if you like."
    ]
   },
   {
@@ -191,7 +133,7 @@
    "source": [
     "Some users might have payment registered in an organization. This allows you to connect to an organization (that you are a member of) with a payment method.\n",
     "\n",
-    "Leave it blank
+    "Leave it blank if you want to use your username."
    ]
   },
   {
@@ -267,32 +209,6 @@
     ")"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "bbc82ce5-d7fa-4167-adc1-b25e567f5559",
-   "metadata": {},
-   "source": [
-    "This is one of the most important parts of this tutorial to understand well. It's important that we choose the deployment settings that best represent our needs and our hardware. I'll just leave some high-level information here and we can go deeper in a future tutorial. It would be interesting to show the difference in how you would optimize your deployment between a chat application and RAG.\n",
-    "\n",
-    "`MAX_BATCH_PREFILL_TOKENS` | [docs](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxbatchprefilltokens) |\n",
-    "> Limits the number of tokens for the prefill operation. Since this operation take the most memory and is compute bound, it is interesting to limit the number of requests that can be sent\n",
-    "\n",
-    "`MAX_INPUT_LENGTH` | [docs](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxinputlength) |\n",
-    "> This is the maximum allowed input length (expressed in number of tokens) for users. The larger this value, the longer prompt users can send which can impact the overall memory required to handle the load. Please note that some models have a finite range of sequence they can handle\n",
-    "\n",
-    "I left this quite large as I want to give a lot of freedom to the user more than I want to trade performance. It's important in RAG applications to give more freedom here. But for few-turn chat applications you can be more restrictive.\n",
-    "\n",
-    "`MAX_TOTAL_TOKENS` | [docs](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#maxtotaltokens) | \n",
-    "> This is the most important value to set as it defines the \"memory budget\" of running clients requests. Clients will send input sequences and ask to generate `max_new_tokens` on top. with a value of `1512` users can send either a prompt of `1000` and ask for `512` new tokens, or send a prompt of `1` and ask for `1511` max_new_tokens. The larger this value, the larger amount each request will be in your RAM and the less effective batching can be.\n",
-    "\n",
-    "`TRUST_REMOTE_CODE` This is set to `true` as jais requires it.\n",
-    "\n",
-    "`QUANTIZE` | [docs](https://huggingface.co/docs/text-generation-inference/basic_tutorials/launcher#quantize) |\n",
-    "> Whether you want the model to be quantized\n",
-    "\n",
-    "With jais, you really only have the bitsandbytes option. The tradeoff is that inference is a bit slower, but you can use much smaller GPUs (~3x smaller) without noticeably losing performance. It's one of the better reads IMO and I recommend checking out the [paper](https://arxiv.org/abs/2208.07339)."
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": 7,
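The launcher settings discussed in the removed cell are passed to TGI as environment variables. The notebook's actual deployment cell is not shown in this diff, so the following is only a hedged sketch of how such a config can be supplied through `custom_image` with `huggingface_hub.create_inference_endpoint`; the endpoint name, instance choices, and numeric limits are illustrative assumptions, not the notebook's exact values.

```python
from huggingface_hub import create_inference_endpoint

# Hedged sketch only: values below are illustrative, not the notebook's configuration.
endpoint = create_inference_endpoint(
    "jais-13b-chat-demo",               # hypothetical ENDPOINT_NAME
    repository="core42/jais-13b-chat",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x1",
    instance_type="nvidia-a10g",
    # namespace="my-org",               # optional: bill to an organization instead of your username
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:latest",
        "env": {
            "MODEL_ID": "/repository",
            "MAX_BATCH_PREFILL_TOKENS": "2048",
            "MAX_INPUT_LENGTH": "2000",
            "MAX_TOTAL_TOKENS": "2048",
            "TRUST_REMOTE_CODE": "true",      # required by jais
            "QUANTIZE": "bitsandbytes",       # lets the 13B model fit on a much smaller GPU
        },
    },
)
endpoint.wait()          # block until the endpoint reports "running"
print(endpoint.url)
```

With bitsandbytes quantization the model fits on a single A10G-class card; without it you would need roughly 3x more GPU memory, as the removed cell notes.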
@@ -461,10 +377,10 @@
    "id": "41abea64-379d-49de-8d9a-355c2f4ce1ac",
    "metadata": {},
    "source": [
-    "
+    "## Analyze Usage\n",
     "1. Go to your `dashboard_url` printed below\n",
-    "1.
-    "1.
+    "1. Check the dashboard\n",
+    "1. Analyze the Usage & Cost tab"
    ]
   },
   {
@@ -493,8 +409,7 @@
    "id": "b953d5be-2494-4ff8-be42-9daf00c99c41",
    "metadata": {},
    "source": [
-    "
-    "We should see a `200` if everything went correctly."
+    "## Delete Endpoint"
    ]
   },
   {
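The removed cell checked for a `200` response and the new heading covers deleting the endpoint. Here is a hedged sketch of both steps with `huggingface_hub` and `requests`; the endpoint name carries over from the hypothetical sketch above, and the `HF_TOKEN` environment variable is an assumption.

```python
import os

import requests
from huggingface_hub import get_inference_endpoint

# Assumes the endpoint from the earlier sketch exists and is already running.
endpoint = get_inference_endpoint("jais-13b-chat-demo")

response = requests.post(
    endpoint.url,
    headers={
        "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
        "Content-Type": "application/json",
    },
    json={"inputs": "I am happy with this purchase - ", "parameters": {"max_new_tokens": 4}},
)
print(response.status_code)  # we should see a 200 if everything went correctly

# Tear the endpoint down so the GPU stops billing.
endpoint.delete()
```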