Deployment on Azure 4 x A100 (80 GB) hangs
#23 · opened by hugging-face-infrax
Hi abacusai team,
We're currently trying to deploy the abacusai/Smaug-72B-v0.1 model on Azure with the following specifications:
GPU: 4 x A100 (80 GB)
CPU: 96 vCPU
Memory: 880 GB
Nvidia CUDA: 12.3
Nvidia Driver Version: v545.23.08
Text Generation Inference: v1.4.2
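For reference, here is a sketch of how we verify the environment above before starting the service (the `--entrypoint` override is an assumption that the NVIDIA container toolkit exposes `nvidia-smi` inside the TGI image, which it normally does when `--gpus` is set):

```shell
# On the host: driver 545.23.08 and all four A100 80GB cards should be listed.
nvidia-smi

# From inside the TGI container: confirm the container runtime can see the GPUs.
docker run --rm --gpus all \
  --entrypoint nvidia-smi \
  ghcr.io/huggingface/text-generation-inference:1.4.2
```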
Our docker-compose file:
```yaml
services:
  text-generation-inference:
    image: ghcr.io/huggingface/text-generation-inference:1.4.2
    container_name: text-generation-inference
    command: >
      --model-id abacusai/Smaug-72B-v0.1
      --max-total-tokens 16384
      --max-input-length 8192
      --num-shard 4
      --sharded true
      --max-top-n-tokens 1
      --max-best-of 1
      --disable-custom-kernels
      --trust-remote-code
      --max-stop-sequences 1
      --validation-workers 1
      --waiting-served-ratio 0
      --max-batch-total-tokens 16384
      --max-waiting-tokens 8192
      --cuda-memory-fraction 0.8
      --max-concurrent-requests 512
      --max-batch-prefill-tokens 16384
      --json-output
    volumes:
      - ./data:/data
    ports:
      - 8080:80
    shm_size: '1gb'
    restart: always
    env_file:
      - .env
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
      interval: 30s
      timeout: 45s
      start_period: 180s
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```
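In case it helps, this is a sketch of the diagnostics we can run against the compose file above (`NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` are standard NCCL environment variables; the service name matches the compose file):

```shell
# Follow the launcher logs to see where startup stalls
# (weight download, shard loading, or NCCL warmup).
docker compose logs -f text-generation-inference

# Check whether the shards are busy on GPU or stuck at 0% utilization;
# refresh every 5 seconds.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5

# Re-running with NCCL debug output often shows which collective hangs.
# Add these to the .env file referenced in the compose file:
#   NCCL_DEBUG=INFO
#   NCCL_DEBUG_SUBSYS=ALL
```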
We expected the model to run smoothly with these specs, but the deployment hangs. Could you please recommend some troubleshooting steps?