{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "28e3460d-59b1-4d4c-b62e-510987fb2f28",
   "metadata": {},
   "source": [
    "# Introduction\n",
    "The purpose of this notebook is to show how to launch the TGI Benchmark tool. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0de3cc9-c6cd-45b3-9dd0-84b3cb2fc8b2",
   "metadata": {},
   "source": [
    "Here we can see the different settings for TGI Benchmark. \n",
    "\n",
    "Here are some of the more important ones:\n",
    "\n",
    "- `--tokenizer-name` This is required so the tool knows what tokenizer to use\n",
    "- `--batch-size` This is important for load testing. We should use more and more values to see what happens to throughput and latency\n",
    "- `--sequence-length` AKA input tokens, it is important to match your use-case needs\n",
    "- `--decode-length` AKA output tokens, it is important to match your use-case needs\n",
    "- `--runs` 10 is the default\n",
    "\n",
    "<blockquote style=\"border-left: 5px solid #80CBC4; background: #263238; color: #CFD8DC; padding: 0.5em 1em; margin: 1em 0;\">\n",
    "  <strong>💡 Tip:</strong> Use a low number for <code style=\"background: #37474F; color: #FFFFFF; padding: 2px 4px; border-radius: 4px;\">--runs</code> when you are exploring but a higher number as you finalize to get more precise statistics\n",
    "</blockquote>\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "694df6d6-a521-4dab-977b-2828d4250781",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Text Generation Benchmarking tool\n",
      "\n",
      "\u001B[1m\u001B[4mUsage:\u001B[0m \u001B[1mtext-generation-benchmark\u001B[0m [OPTIONS] \u001B[1m--tokenizer-name\u001B[0m <TOKENIZER_NAME>\n",
      "\n",
      "\u001B[1m\u001B[4mOptions:\u001B[0m\n",
      "  \u001B[1m-t\u001B[0m, \u001B[1m--tokenizer-name\u001B[0m <TOKENIZER_NAME>\n",
      "          The name of the tokenizer (as in model_id on the huggingface hub, or local path) [env: TOKENIZER_NAME=]\n",
      "      \u001B[1m--revision\u001B[0m <REVISION>\n",
      "          The revision to use for the tokenizer if on the hub [env: REVISION=] [default: main]\n",
      "  \u001B[1m-b\u001B[0m, \u001B[1m--batch-size\u001B[0m <BATCH_SIZE>\n",
      "          The various batch sizes to benchmark for, the idea is to get enough batching to start seeing increased latency, this usually means you're moving from memory bound (usual as BS=1) to compute bound, and this is a sweet spot for the maximum batch size for the model under test\n",
      "  \u001B[1m-s\u001B[0m, \u001B[1m--sequence-length\u001B[0m <SEQUENCE_LENGTH>\n",
      "          This is the initial prompt sent to the text-generation-server length in token. Longer prompt will slow down the benchmark. Usually the latency grows somewhat linearly with this for the prefill step [env: SEQUENCE_LENGTH=] [default: 10]\n",
      "  \u001B[1m-d\u001B[0m, \u001B[1m--decode-length\u001B[0m <DECODE_LENGTH>\n",
      "          This is how many tokens will be generated by the server and averaged out to give the `decode` latency. This is the *critical* number you want to optimize for LLM spend most of their time doing decoding [env: DECODE_LENGTH=] [default: 8]\n",
      "  \u001B[1m-r\u001B[0m, \u001B[1m--runs\u001B[0m <RUNS>\n",
      "          How many runs should we average from [env: RUNS=] [default: 10]\n",
      "  \u001B[1m-w\u001B[0m, \u001B[1m--warmups\u001B[0m <WARMUPS>\n",
      "          Number of warmup cycles [env: WARMUPS=] [default: 1]\n",
      "  \u001B[1m-m\u001B[0m, \u001B[1m--master-shard-uds-path\u001B[0m <MASTER_SHARD_UDS_PATH>\n",
      "          The location of the grpc socket. This benchmark tool bypasses the router completely and directly talks to the gRPC processes [env: MASTER_SHARD_UDS_PATH=] [default: /tmp/text-generation-server-0]\n",
      "      \u001B[1m--temperature\u001B[0m <TEMPERATURE>\n",
      "          Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TEMPERATURE=]\n",
      "      \u001B[1m--top-k\u001B[0m <TOP_K>\n",
      "          Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TOP_K=]\n",
      "      \u001B[1m--top-p\u001B[0m <TOP_P>\n",
      "          Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TOP_P=]\n",
      "      \u001B[1m--typical-p\u001B[0m <TYPICAL_P>\n",
      "          Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TYPICAL_P=]\n",
      "      \u001B[1m--repetition-penalty\u001B[0m <REPETITION_PENALTY>\n",
      "          Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: REPETITION_PENALTY=]\n",
      "      \u001B[1m--frequency-penalty\u001B[0m <FREQUENCY_PENALTY>\n",
      "          Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: FREQUENCY_PENALTY=]\n",
      "      \u001B[1m--watermark\u001B[0m\n",
      "          Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: WATERMARK=]\n",
      "      \u001B[1m--do-sample\u001B[0m\n",
      "          Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: DO_SAMPLE=]\n",
      "      \u001B[1m--top-n-tokens\u001B[0m <TOP_N_TOKENS>\n",
      "          Generation parameter in case you want to specifically test/debug particular decoding strategies, for full doc refer to the `text-generation-server` [env: TOP_N_TOKENS=]\n",
      "  \u001B[1m-h\u001B[0m, \u001B[1m--help\u001B[0m\n",
      "          Print help (see more with '--help')\n",
      "  \u001B[1m-V\u001B[0m, \u001B[1m--version\u001B[0m\n",
      "          Print version\n"
     ]
    }
   ],
   "source": [
    "!text-generation-benchmark -h"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "42d9561b-1aea-4c8c-9fe8-e36af43482fe",
   "metadata": {},
   "source": [
    "Here is an example command. Notice that I add the batch sizes of interest repeatedly to make sure all of them are used by the benchmark tool. I'm also considering which batch sizes are important based on estimated user activity.\n",
    "\n",
    "<blockquote style=\"border-left: 5px solid #FFAB91; background: #37474F; color: #FFCCBC; padding: 0.5em 1em; margin: 1em 0;\">\n",
    "  <strong>⚠️ Warning:</strong> Please note that the TGI Benchmark tool is designed to work in a terminal, not a jupyter notebook. This means you will need to copy/paste the command in a jupyter terminal tab. I am putting them here for convenience.\n",
    "</blockquote>\n",
    "\n",
    "```bash\n",
    "text-generation-benchmark \\\n",
    "--tokenizer-name astronomer/Llama-3-8B-Instruct-GPTQ-8-Bit \\\n",
    "--sequence-length 70 \\\n",
    "--decode-length 50 \\\n",
    "--batch-size 1 \\\n",
    "--batch-size 2 \\\n",
    "--batch-size 4 \\\n",
    "--batch-size 8 \\\n",
    "--batch-size 16 \\\n",
    "--batch-size 32 \\\n",
    "--batch-size 64 \\\n",
    "--batch-size 128 \n",
    "```\n",
    "\n",
    "Hit `q` to stop the tool."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13ac475b-44e1-47e4-85ce-def2db6879c9",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}