Group size 128 or -1 for the main branch?
According to the README for the main branch (https://huggingface.co/TheBloke/Llama-2-70B-chat-GPTQ#provided-files):

Branch | Bits | Group Size | Act Order (desc_act) | File Size | ExLlama Compatible? | Made With | Description |
---|---|---|---|---|---|---|---|
main | 4 | 128 | False | 35.33 GB | True | AutoGPTQ | Most compatible option. Good inference speed in AutoGPTQ and GPTQ-for-LLaMa. Lower inference quality than other options. |

But the actual file name is gptq_model-4bit--1g.safetensors rather than gptq_model-4bit--128g.safetensors. So which group size is correct?
Sorry, the README is wrong - the main branch is group size -1 (i.e. no grouping). I'll fix that.
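For anyone double-checking, the group size for each branch is recorded in that branch's quantize_config.json, so it can be verified without downloading the model weights. A minimal sketch, assuming huggingface_hub is installed (the second branch name is taken from the provided-files table):

```python
import json
from huggingface_hub import hf_hub_download

repo = "TheBloke/Llama-2-70B-chat-GPTQ"

# Each branch carries the quantize_config.json written by AutoGPTQ at quantization time.
for branch in ["main", "gptq-4bit-128g-actorder_True"]:
    path = hf_hub_download(repo, "quantize_config.json", revision=branch)
    with open(path) as f:
        cfg = json.load(f)
    # group_size == -1 means no grouping; desc_act is the act-order flag.
    print(f"{branch}: bits={cfg['bits']} group_size={cfg['group_size']} desc_act={cfg['desc_act']}")
```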
Could you clarify whether only the main branch is compatible with GPTQ-for-LLaMa, since the other branches don't seem to work with it? I've used TGI to start gptq_model-4bit--1g.safetensors, which worked fine, but starting it on 2 GPUs failed because the group size needs to be >= 2. I'm looking for a version with a group size >= 2, yet my attempts to start the other branches through TGI have failed.
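For context, the launch goes through the TGI launcher, roughly as in the sketch below; the model path and port are placeholders, and the actual arguments appear in the log further down:

```python
import subprocess

# Rough equivalent of the launcher invocation (placeholder path and port).
# With --num-shard 1 (or the flag omitted) the main branch starts fine;
# --num-shard 2 is where it fails.
subprocess.run([
    "text-generation-launcher",
    "--model-id", "/tmp/datadrive/Llama-2-70B-chat-GPTQ",  # local copy of the branch being tested
    "--quantize", "gptq",
    "--num-shard", "2",
    "--port", "1234",
    "--max-input-length", "4096",
    "--max-total-tokens", "8192",
    "--json-output",
], check=True)
```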
That's confusing. I thought it was the exact opposite - that the main branch wouldn't work with TGI because I used an old GPTQ-for-LLaMa version for this model, and that all the others would work because they were made with AutoGPTQ. Actually no, I made all of these with AutoGPTQ, so I would expect them all to work.
What problems do you have with the ones in the other branches?
Just to note, I'm using TGI v0.9.4.
I get a 'ShardCannotStart' error, yet it works fine when I start the main branch on a single GPU.
For example, with 'gptq-4bit-128g-actorder_True' and 2 GPUs:
{"timestamp":"2023-08-18T09:05:32.861563Z","level":"INFO","fields":{"message":"Args { model_id: \"/tmp/datadrive/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 4096, max_total_tokens: 8192, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 8192, max_batch_total_tokens: Some(8192), max_waiting_tokens: 20, hostname: \"0.0.0.0\", port: 1234, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: Some(\"/tmp/datadrive/Llama-2-70B-chat-GPTQ-gptq-4bit-128g-actorder_True\"), disable_custom_kernels: false, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:05:32.861603Z","level":"INFO","fields":{"message":"Sharding model on 2 processes"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:05:32.861714Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-08-18T09:05:42.632849Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:05:44.881424Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-08-18T09:05:44.881694Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:05:44.881742Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:08:45.037129Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:08:45.037129Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:08:55.044955Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:08:55.044955Z","level":"INFO","fields":{"message":"Waiting for shard to be ready..."},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:09:04.654342Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nYou are using a model of type llama to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 78, in serve\n server.serve(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 180, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 647, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 150, in serve_inner\n create_exllama_buffers()\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/gptq/exllama.py\", line 52, in create_exllama_buffers\n prepare_buffers(DEVICE, temp_state, temp_dq)\n\nTypeError: prepare_buffers(): incompatible function arguments. The following argument types are supported:\n 1. (arg0: torch.device, arg1: torch.Tensor, arg2: torch.Tensor) -> None\n\nInvoked with: None, tensor([[0.]], dtype=torch.float16), tensor([[0.]], dtype=torch.float16)\n"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2023-08-18T09:09:04.745107Z","level":"ERROR","fields":{"message":"Shard 1 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:09:04.745148Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2023-08-18T09:09:04.986644Z","level":"INFO","fields":{"message":"Shard terminated"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart
OK, if it works on one GPU then I don't think it's an issue with my GPTQs. I don't know exactly what sharding requires. Could you raise it on the TGI GitHub?
Thanks @TheBloke, the problem was resolved after I updated to the latest TGI code.
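(For anyone else hitting the same ShardCannotStart error: once the shards come up, a quick way to confirm the server is healthy is to hit the /generate route. A minimal sketch, assuming the port 1234 used in the launch args above:)

```python
import requests

# Smoke test against TGI's /generate endpoint (port 1234 as in the launcher args above).
resp = requests.post(
    "http://localhost:1234/generate",
    json={"inputs": "[INST] Say hello. [/INST]", "parameters": {"max_new_tokens": 32}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```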
What about the group size in the main branch of Llama-2-13B-chat-GPTQ? Since there is another branch called gptq-4bit-128g-actorder_True, is the only difference between these two branches the act-order setting?
Yes, that's correct. The model with act-order = True has higher quality, but in the past combining act-order with group_size has caused performance problems for some GPTQ clients.
That may now be resolved, and I don't know whether it ever affected TGI.
So try 128g + act-order True first, and only use 128g + act-order False if performance seems slow. In future I may make 128g + True the 'main' model, or even drop 128g + False entirely, if the performance issues are confirmed to be resolved.
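If you want to try a specific branch outside TGI, each one can be loaded by passing the branch name as the revision. A minimal sketch with Transformers (this assumes a recent Transformers release with GPTQ support plus optimum and auto-gptq installed; the branch names come from the repo):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TheBloke/Llama-2-13B-chat-GPTQ"
branch = "gptq-4bit-128g-actorder_True"  # or "main" for 128g without act-order

tokenizer = AutoTokenizer.from_pretrained(repo, revision=branch)
# Transformers picks up quantize_config.json from the chosen branch automatically.
model = AutoModelForCausalLM.from_pretrained(repo, revision=branch, device_map="auto")

prompt = "[INST] Explain act-order in GPTQ in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```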
@TheBloke Do you have any recommendation on which hyperparameters give the fastest inference speed for GPTQ models? I ran an experiment on TGI with both quantized and non-quantized Llama-2 models, and I'm confused why the GPTQ models are always slower for the same request body, while GPU memory usage is almost identical across every model on TGI. FYI, I'm using A100 80GB for testing.
Model | No. of GPU(s) | Parameters | Quantization Method | Bits | GPTQ Group Size | ExLlama Compatible? | Processing time / request | GPU Memory Used | Sharded |
---|---|---|---|---|---|---|---|---|---|
Llama-2-7b-chat-hf | 2 | 7B | - | 16 | - | - | 4.00 s | 147.1 GB | - |
Llama-2-7b-chat-hf | 1 | 7B | - | 16 | - | - | 3.10 s | 78.7 GB | - |
Llama-2-7b-chat-hf | 1 | 7B | - | 16 | - | - | 11.30 s | 78.8 GB | False |
Llama-2-7b-Chat-GPTQ (main) | 1 | 7B | GPTQ | 4 | 128 | Yes | 4.50 s | 79.3 GB | - |
Llama-2-7b-Chat-GPTQ (main) | 1 | 7B | GPTQ | 4 | 128 | Yes | 11.35 s | 79.3 GB | False |
Llama-2-13b-chat-hf | 1 | 13B | - | 16 | - | - | 5.35 s | 78.4 GB | - |
Llama-2-13B-chat-GPTQ (main) | 1 | 13B | GPTQ | 4 | 128 | Yes | 8.80 s | 79.1 GB | - |
Llama-2-13B-chat-GPTQ (gptq-4bit-128g-actorder_True) | 1 | 13B | GPTQ | 4 | 128 | Yes | - | Not Enough Memory | - |
Llama-2-13B-chat-GPTQ (gptq-8bit--1g-actorder_True) | 1 | 13B | GPTQ | 8 | -1 | No | 11.35 s | 78.8 GB | - |
Llama-2-13B-chat-GPTQ (gptq-8bit-128g-actorder_False) | 1 | 13B | GPTQ | 8 | 128 | No | 11.75 s | 78.7 GB | - |
Llama-2-70b-chat-hf | 2 | 70B | - | 16 | - | - | 11.4 s | 159.5 GB | - |
Llama-2-70b-chat-hf | 1 | 70B | bitsandbytes | 4 | - | - | 35.5 s | 74.2 GB | - |
Llama-2-70B-chat-GPTQ (main) | 1 | 70B | GPTQ | 4 | -1 | Yes | 23.95 s | 77.8 GB | - |
Llama-2-70B-chat-GPTQ (main) | 1 | 70B | GPTQ | 4 | -1 | Yes | 23.95 s | 77.8 GB | False |
Llama-2-70B-chat-GPTQ (gptq-4bit-32g-actorder_True) | 2 | 70B | GPTQ | 4 | 32 | Yes | 33.8 s | 86.12 GB | - |
Llama-2-70B-chat-GPTQ (gptq-4bit-128g-actorder_True) | 1 | 70B | GPTQ | 4 | 128 | Yes | - | Not Enough Memory | - |
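As a point of reference, per-request latency figures like the "Processing time / request" column above can be collected with a simple timing loop against the TGI endpoint. A minimal sketch (the URL, prompt, and token budget are placeholders, not necessarily the settings used for the table):

```python
import time
import requests

URL = "http://localhost:1234/generate"  # placeholder TGI endpoint
BODY = {
    "inputs": "[INST] Summarise the plot of Hamlet in three sentences. [/INST]",
    "parameters": {"max_new_tokens": 256},
}

# One warm-up call, then average wall-clock time over a few identical requests.
requests.post(URL, json=BODY, timeout=300).raise_for_status()
samples = []
for _ in range(5):
    start = time.perf_counter()
    requests.post(URL, json=BODY, timeout=300).raise_for_status()
    samples.append(time.perf_counter() - start)
print(f"mean processing time / request: {sum(samples) / len(samples):.2f} s")
```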