Group size 128 or -1 for the main branch?
According to the README for the main branch (https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ#provided-files):

Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
---|---|---|---|---|---|---|---|
main | 4 | 128 | False | 35.33 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |

But the actual file name is gptq_model-4bit--1g.safetensors rather than gptq_model-4bit--128g.safetensors. So which group size is correct?
Sorry, the README is wrong - the main branch is group size -1 (i.e. no grouping). I'll fix that.
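For anyone double-checking, the group size for each branch is recorded in that branch's quantize_config.json, so it can be verified without downloading the model weights. A minimal sketch, assuming huggingface_hub is installed (the second branch name is taken from the provided-files table):

```python
import json
from huggingface_hub import hf_hub_download

repo = "TheBloke/Llama-2-70B-chat-GPTQ"

# Each branch carries the quantize_config.json written by AutoGPTQ at quantization time.
for branch in ["main", "gptq-4bit-128g-actorder_True"]:
    path = hf_hub_download(repo, "quantize_config.json", revision=branch)
    with open(path) as f:
        cfg = json.load(f)
    # group_size == -1 means no grouping; desc_act is the act-order flag.
    print(f"{branch}: bits={cfg['bits']} group_size={cfg['group_size']} desc_act={cfg['desc_act']}")
```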
Could you clarify whether only the main branch is compatible with GPTQ-for-LLaMa, since the other branches don't seem to work with it? I've used TGI to start gptq_model-4bit--1g.safetensors, which worked fine, but starting it on 2 GPUs failed because the group size needs to be >= 2. I'm looking for a version with a group size >= 2, yet my attempts to start the other branches through TGI have failed.
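For context, the launch goes through the TGI launcher, roughly as in the sketch below; the model path and port are placeholders, and the actual arguments appear in the log further down:

```python
import subprocess

# Rough equivalent of the launcher invocation (placeholder path and port).
# With --num-shard 1 (or the flag omitted) the main branch starts fine;
# --num-shard 2 is where it fails.
subprocess.run([
    "text-generation-launcher",
    "--model-id", "/tmp/datadrive/Llama-2-70B-chat-GPTQ",  # local copy of the branch being tested
    "--quantize", "gptq",
    "--num-shard", "2",
    "--port", "1234",
    "--max-input-length", "4096",
    "--max-total-tokens", "8192",
    "--json-output",
], check=True)
```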
That's confusing. I thought it was the exact opposite - that the main branch wouldn't work with TGI because I used an old GPTQ-for-LLaMa version for this model, and that all the others would work because they were made with AutoGPTQ. Actually no, I made all of these with AutoGPTQ, so I would expect them all to work.
What problems do you have with the ones in the other branches?
Just to note, I'm using TGI v0.9.4.
I get a 'ShardCannotStart' error, yet it works fine when I start the main branch on a single GPU.
For example, with 'gptq-4bit-128g-actorder_True' and 2 GPUs:
{"timestamp":"2023-08-18T09:05:32.861563Z","level":"INFO","fields":{"message":"Args { model_id: \"/tmp/datadrive/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 4096, max_total_tokens: 8192, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 8192, max_batch_total_tokens: Some(8192), max_waiting_tokens: 20, hostname: \"0.0.0.0\", port: 1234, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: Some(\"/tmp/datadrive/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True\"), disable_custom_kernels: false, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:05:32.861603Z","level":"INFO","fields":{"message":"Sharding model on 2 processes"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:05:32.861714Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-08-18T09:05:42.632849Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:05:44.881424Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-08-18T09:05:44.881694Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:05:44.881742Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:08:45.037129Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:08:45.037129Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:08:55.044955Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:08:55.044955Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:09:04.654342Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nYou are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 78, in serve\n server.serve(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 180, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 647, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 150, in serve_inner\n create_exllama_buffers()\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/exllama.py\", line 52, in create_exllama_buffers\n prepare_buffers(DEVICE, temp_state, temp_dq)\n\nTypeError: prepare_buffers(): incompatible function arguments. The following argument types are supported:\n 1. (arg0: torch.device, arg1: torch.Tensor, arg2: torch.Tensor) -> None\n\nInvoked with: None, tensor([[0.]], dtype=torch.float16), tensor([[0.]], dtype=torch.float16)\n"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:09:04.745107Z","level":"ERROR","fields":{"message":"Shard 1 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:09:04.745148Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:09:04.986644Z","level":"INFO","fields":{"message":"Shard terminated"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart
OK, if it works on one GPU then I don't think it's an issue with my GPTQs. I don't know exactly what sharding requires. Could you raise it on the TGI GitHub?
Thanks @TheBloke, the problem was resolved after I updated to the latest TGI code.
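(For anyone else hitting the same ShardCannotStart error: once the shards come up, a quick way to confirm the server is healthy is to hit the /generate route. A minimal sketch, assuming the port 1234 used in the launch args above:)

```python
import requests

# Smoke test against TGI's /generate endpoint (port 1234 as in the launcher args above).
resp = requests.post(
    "http://localhost:1234/generate",
    json={"inputs": "[INST] Say hello. [/INST]", "parameters": {"max_new_tokens": 32}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```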
What about the group size in the main branch of Llama-2-13B-chat-GPTQ? Since there is another branch called gptq-4bit-128g-actorder_True, is the only difference between these two branches the act-order setting?
Yes, that's correct. The model with act-order = True has higher quality, but in the past combining act-order with group_size has caused performance problems for some GPTQ clients.
That may now be resolved, and I don't know whether it ever affected TGI.
So try 128g + act-order True first, and only use 128g + act-order False if performance seems slow. In future I may make 128g + True the 'main' model, or even drop 128g + False entirely, if the performance issues are confirmed to be resolved.
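If you want to try a specific branch outside TGI, each one can be loaded by passing the branch name as the revision. A minimal sketch with Transformers (this assumes a recent Transformers release with GPTQ support plus optimum and auto-gptq installed; the branch names come from the repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TheBloke/Llama-2-13B-chat-GPTQ"
branch = "gptq-4bit-128g-actorder_True"  # or "main" for 128g without act-order

tokenizer = AutoTokenizer.from_pretrained(repo, revision=branch)
# Transformers picks up quantize_config.json from the chosen branch automatically.
model = AutoModelForCausalLM.from_pretrained(repo, revision=branch, device_map="auto")

prompt = "[INST] Explain act-order in GPTQ in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```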
@TheBloke Do you have any recommendation on which hyperparameters give the fastest inference speed for GPTQ models? I ran an experiment on TGI with both quantized and non-quantized Llama-2 models, and I'm confused why the GPTQ models are always slower for the same request body, while GPU memory usage is almost identical across every model on TGI. FYI, I'm using A100 80GB for testing.
Model | No. of GPU(s) | Parameters | Quantization Method | Bits | GPTQ Group Size | ExLlama Compatible? | Processing time / request | GPU Memory Used | Sharded |
---|---|---|---|---|---|---|---|---|---|
Llama-2-7b-chat-hf | 2 | 7B | - | 16 | - | - | 4.00 s | 147.1 GB | - |
Llama-2-7b-chat-hf | 1 | 7B | - | 16 | - | - | 3.10 s | 78.7 GB | - |
Llama-2-7b-chat-hf | 1 | 7B | - | 16 | - | - | 11.30 s | 78.8 GB | False |
Llama-2-7b-Chat-GPTQ (main) | 1 | 7B | GPTQ | 4 | 128 | Yes | 4.50 s | 79.3 GB | - |
Llama-2-7b-Chat-GPTQ (main) | 1 | 7B | GPTQ | 4 | 128 | Yes | 11.35 s | 79.3 GB | False |
Llama-2-13b-chat-hf | 1 | 13B | - | 16 | - | - | 5.35 s | 78.4 GB | - |
Llama-2-13B-chat-GPTQ (main) | 1 | 13B | GPTQ | 4 | 128 | Yes | 8.80 s | 79.1 GB | - |
Llama-2-13B-chat-GPTQ (gptq-4bit-128g-actorder_True) | 1 | 13B | GPTQ | 4 | 128 | Yes | - | Not Enough Memory | - |
Llama-2-13B-chat-GPTQ (gptq-8bit--1g-actorder_True) | 1 | 13B | GPTQ | 8 | -1 | No | 11.35 s | 78.8 GB | - |
Llama-2-13B-chat-GPTQ (gptq-8bit-128g-actorder_False) | 1 | 13B | GPTQ | 8 | 128 | No | 11.75 s | 78.7 GB | - |
Llama-2-70b-chat-hf | 2 | 70B | - | 16 | - | - | 11.4 s | 159.5 GB | - |
Llama-2-70b-chat-hf | 1 | 70B | bitsandbytes | 4 | - | - | 35.5 s | 74.2 GB | - |
Llama-2-70B-chat-GPTQ (main) | 1 | 70B | GPTQ | 4 | -1 | Yes | 23.95 s | 77.8 GB | - |
Llama-2-70B-chat-GPTQ (main) | 1 | 70B | GPTQ | 4 | -1 | Yes | 23.95 s | 77.8 GB | False |
Llama-2-70B-chat-GPTQ (gptq-4bit-32g-actorder_True) | 2 | 70B | GPTQ | 4 | 32 | Yes | 33.8 s | 86.12 GB | - |
Llama-2-70B-chat-GPTQ (gptq-4bit-128g-actorder_True) | 1 | 70B | GPTQ | 4 | 128 | Yes | - | Not Enough Memory | - |
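As a point of reference, per-request latency figures like the "Processing time / request" column above can be collected with a simple timing loop against the TGI endpoint. A minimal sketch (the URL, prompt, and token budget are placeholders, not necessarily the settings used for the table):

```python
import time
import requests

URL = "http://localhost:1234/generate"  # placeholder TGI endpoint
BODY = {
    "inputs": "[INST] Summarise the plot of Hamlet in three sentences. [/INST]",
    "parameters": {"max_new_tokens": 256},
}

# One warm-up call, then average wall-clock time over a few identical requests.
requests.post(URL, json=BODY, timeout=300).raise_for_status()
samples = []
for _ in range(5):
    start = time.perf_counter()
    requests.post(URL, json=BODY, timeout=300).raise_for_status()
    samples.append(time.perf_counter() - start)
print(f"mean processing time / request: {sum(samples) / len(samples):.2f} s")
```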