{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "db41d8ba-71c0-4951-9a88-e1ae01a282ec",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "Please check out my [blog post](https://datavistics.github.io/posts/jais-inference-endpoints/) for more details!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2534669-003d-490c-9d7a-32607fa5f404",
   "metadata": {},
   "source": [
    "# Setup"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3c830114-dd88-45a9-81b9-78b0e3da7384",
   "metadata": {},
   "source": [
    "## Requirements"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "35386f72-32cb-49fa-a108-3aa504e20429",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%pip install -q \"huggingface-hub>=0.20\" ipywidgets"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6f72042-173d-4a72-ade1-9304b43b528d",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99f60998-0490-46c6-a8e6-04845ddda7be",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from huggingface_hub import login, whoami, create_inference_endpoint\n",
    "from getpass import getpass"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5eece903-64ce-435d-a2fd-096c0ff650bf",
   "metadata": {},
   "source": [
    "## Config\n",
    "Choose your `ENDPOINT_NAME` if you like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "dcd7daed-6aca-4fe7-85ce-534bdcd8bc87",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "ENDPOINT_NAME = \"jais13b-demo\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0ca1140c-3fcc-4b99-9210-6da1505a27b7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "login()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f4ba0a8-0a6c-4705-a73b-7be09b889610",
   "metadata": {},
   "source": [
    "If your payment method is registered with an organization rather than your personal account, enter that organization's name here. You must be a member of it.\n",
    "\n",
    "Leave it blank to use your own username."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "88cdbd73-5923-4ae9-9940-b6be935f70fa",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdin",
     "output_type": "stream",
     "text": [
      "What is your Hugging Face 🤗 username or organization? (with an added payment method) ········\n"
     ]
    }
   ],
   "source": [
    "who = whoami()\n",
    "organization = getpass(prompt=\"What is your Hugging Face 🤗 username or organization? (with an added payment method)\")\n",
    "\n",
    "namespace = organization or who['name']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93096cbc-81c6-4137-a283-6afb0f48fbb9",
   "metadata": {},
   "source": [
    "# Inference Endpoints\n",
    "## Create Inference Endpoint\n",
    "We are going to use the [API](https://huggingface.co/docs/inference-endpoints/api_reference) to create an [Inference Endpoint](https://huggingface.co/inference-endpoints). This has a few main benefits:\n",
    "- It's convenient (no clicking)\n",
    "- It's repeatable (we have the code to run it again easily)\n",
    "- It's cheaper (no idle time while you wait for it to load, and it can be shut down automatically)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1cf8334d-6500-412e-9d6d-58990c42c110",
   "metadata": {},
   "source": [
    "Here is a convenient table of instance details you can use when selecting a GPU. Once you have chosen a GPU in Inference Endpoints, you can use the corresponding `instanceType` and `instanceSize`.\n",
    "\n",
    "| hw_desc             | instanceType   | instanceSize | vRAM  |\n",
    "|---------------------|----------------|--------------|-------|\n",
    "| 1x Nvidia Tesla T4  | g4dn.xlarge    | small        | 16GB  |\n",
    "| 4x Nvidia Tesla T4  | g4dn.12xlarge  | large        | 64GB  |\n",
    "| 1x Nvidia A10G      | g5.2xlarge     | medium       | 24GB  |\n",
    "| 4x Nvidia A10G      | g5.12xlarge    | xxlarge      | 96GB  |\n",
    "| 1x Nvidia A100      | p4de           | xlarge       | 80GB  |\n",
    "| 2x Nvidia A100      | p4de           | 2xlarge      | 160GB |\n",
    "\n",
    "Note: To use a node (multiple GPUs) you will need a sharded version of jais. I'm not sure whether one is currently available on the Hub."
   ]
  },
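  {
   "cell_type": "markdown",
   "id": "hw-lookup-note",
   "metadata": {},
   "source": [
    "As a convenience, the table above can be kept as a small lookup so the `instance_type` and `instance_size` values are not retyped by hand. This is only a sketch: `HW_TABLE` and `pick_hardware` are hypothetical names, not part of `huggingface_hub`, and the rows are copied from the table, so they may go out of date."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "hw-lookup-code",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Hypothetical helper: rows copied from the table above, not fetched from the API\n",
    "HW_TABLE = {\n",
    "    \"1x Nvidia Tesla T4\": (\"g4dn.xlarge\", \"small\"),\n",
    "    \"4x Nvidia Tesla T4\": (\"g4dn.12xlarge\", \"large\"),\n",
    "    \"1x Nvidia A10G\": (\"g5.2xlarge\", \"medium\"),\n",
    "    \"4x Nvidia A10G\": (\"g5.12xlarge\", \"xxlarge\"),\n",
    "    \"1x Nvidia A100\": (\"p4de\", \"xlarge\"),\n",
    "    \"2x Nvidia A100\": (\"p4de\", \"2xlarge\"),\n",
    "}\n",
    "\n",
    "\n",
    "def pick_hardware(hw_desc):\n",
    "    # Return the instance_type / instance_size keyword arguments for one table row\n",
    "    instance_type, instance_size = HW_TABLE[hw_desc]\n",
    "    return dict(instance_type=instance_type, instance_size=instance_size)\n",
    "\n",
    "\n",
    "pick_hardware(\"1x Nvidia A100\")"
   ]
  },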
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "89c7cc21-3dfe-40e6-80ff-1dcc8558859e",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "hw_dict = dict(\n",
    "    accelerator=\"gpu\",\n",
    "    vendor=\"aws\",\n",
    "    region=\"us-east-1\",\n",
    "    type=\"protected\",\n",
    "    instance_type=\"p4de\",\n",
    "    instance_size=\"xlarge\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "f4267bce-8516-4f3a-b1cc-8ccd6c14a9c7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "tgi_env = {\n",
    "    \"MAX_BATCH_PREFILL_TOKENS\": \"2048\",\n",
    "    \"MAX_INPUT_LENGTH\": \"2000\",\n",
    "    \"TRUST_REMOTE_CODE\": \"true\",\n",
    "    \"QUANTIZE\": \"bitsandbytes\",\n",
    "    \"MODEL_ID\": \"/repository\",\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74fd83a0-fef0-4e47-8ff1-f4ba7aed131d",
   "metadata": {},
   "source": [
    "A couple of notes on my choices here:\n",
    "- I used `derek-thomas/jais-13b-chat-hf` because that repo has the SafeTensors weights merged, which makes the TGI container load faster\n",
    "- I'm using the latest TGI container as of the time of writing (1.3.4)\n",
    "- `min_replica=0` allows [zero scaling](https://huggingface.co/docs/inference-endpoints/autoscaling#scaling-to-0), which is really useful for your wallet. Think through whether it makes sense for your use case, though, as there will be loading times when the endpoint scales back up\n",
    "- `max_replica` allows you to handle high throughput. Make sure you read through the [docs](https://huggingface.co/docs/inference-endpoints/autoscaling#scaling-criteria) to understand how this scales"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "9e59de46-26b7-4bb9-bbad-8bba9931bde7",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "endpoint = create_inference_endpoint(\n",
    "    ENDPOINT_NAME,\n",
    "    repository=\"derek-thomas/jais-13b-chat-hf\",\n",
    "    framework=\"pytorch\",\n",
    "    task=\"text-generation\",\n",
    "    **hw_dict,\n",
    "    min_replica=0,\n",
    "    max_replica=1,\n",
    "    namespace=namespace,\n",
    "    custom_image={\n",
    "        \"health_route\": \"/health\",\n",
    "        \"env\": tgi_env,\n",
    "        \"url\": \"ghcr.io/huggingface/text-generation-inference:1.3.4\",\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96d173b2-8980-4554-9039-c62843d3fc7d",
   "metadata": {},
   "source": [
    "## Wait until it's running"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f3a8bd2-753c-49a8-9452-899578beddc5",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "endpoint.wait()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "189b26f0-d404-4570-a1b9-e2a9d486c1f7",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'POSITIVE'"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "endpoint.client.text_generation(\"\"\"\n",
    "### Instruction: What is the sentiment of the input?\n",
    "### Examples\n",
    "I wish the screen was bigger - Negative\n",
    "I hate the battery - Negative\n",
    "I love the default applications - Positive\n",
    "### Input\n",
    "I am happy with this purchase - \n",
    "### Response\n",
    "\"\"\",\n",
    "                               do_sample=True,\n",
    "                               repetition_penalty=1.2,\n",
    "                               top_p=0.9,\n",
    "                               temperature=0.3)"
   ]
  },
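  {
   "cell_type": "markdown",
   "id": "prompt-builder-note",
   "metadata": {},
   "source": [
    "The few-shot prompt above follows a fixed layout, so it can be assembled from a list of examples instead of being edited by hand. A minimal sketch, assuming the same `### Instruction` / `### Examples` / `### Input` / `### Response` layout; `build_prompt` is a hypothetical helper, not part of `huggingface_hub`. The resulting string can be passed to `endpoint.client.text_generation` in place of the literal prompt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "prompt-builder-code",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def build_prompt(instruction, examples, text):\n",
    "    # examples: list of (sentence, label) pairs used as few-shot demonstrations\n",
    "    shots = \"\\n\".join(f\"{sentence} - {label}\" for sentence, label in examples)\n",
    "    return (\n",
    "        f\"\\n### Instruction: {instruction}\\n\"\n",
    "        f\"### Examples\\n{shots}\\n\"\n",
    "        f\"### Input\\n{text} - \\n\"\n",
    "        \"### Response\\n\"\n",
    "    )\n",
    "\n",
    "\n",
    "prompt = build_prompt(\n",
    "    \"What is the sentiment of the input?\",\n",
    "    [\n",
    "        (\"I wish the screen was bigger\", \"Negative\"),\n",
    "        (\"I hate the battery\", \"Negative\"),\n",
    "        (\"I love the default applications\", \"Positive\"),\n",
    "    ],\n",
    "    \"I am happy with this purchase\",\n",
    ")\n",
    "print(prompt)"
   ]
  },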
  {
   "cell_type": "markdown",
   "id": "bab97c7b-7bac-4bf5-9752-b528294dadc7",
   "metadata": {},
   "source": [
    "## Pause Inference Endpoint\n",
    "Now that we have finished, let's pause the endpoint so we don't incur any extra charges. This will also allow us to analyze the cost."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "540a0978-7670-4ce3-95c1-3823cc113b85",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Endpoint Status: paused\n"
     ]
    }
   ],
   "source": [
    "endpoint = endpoint.pause()\n",
    "\n",
    "print(f\"Endpoint Status: {endpoint.status}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "41abea64-379d-49de-8d9a-355c2f4ce1ac",
   "metadata": {},
   "source": [
    "## Analyze Usage\n",
    "1. Go to your `dashboard_url` printed below\n",
    "1. Check the dashboard\n",
    "1. Analyze the Usage & Cost tab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16815445-3079-43da-b14e-b54176a07a62",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "dashboard_url = f'https://ui.endpoints.huggingface.co/{namespace}/endpoints/{ENDPOINT_NAME}/analytics'\n",
    "print(dashboard_url)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b953d5be-2494-4ff8-be42-9daf00c99c41",
   "metadata": {},
   "source": [
    "## Delete Endpoint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "c310c0f3-6f12-4d5c-838b-3a4c1f2e54ad",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Endpoint deleted successfully\n"
     ]
    }
   ],
   "source": [
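    "# delete() returns None and raises on failure, so catch the error instead of testing the return value\n",
    "try:\n",
    "    endpoint.delete()\n",
    "    print('Endpoint deleted successfully')\n",
    "except Exception as err:\n",
    "    print(f'Could not delete the endpoint ({err}); delete it manually from the dashboard')"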
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "611e1345-8d8c-46b1-a9f8-cff27eecb426",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}