Issue with Tokenizer when deploying with TGI
Hey all, I am having an issue with tokenization when trying to deploy Llama 3.1 70B.
I am running this command from the CLI on 4x A100 40GB:
sudo docker run --gpus all --shm-size 128g -p 8080:80 \
  -v $volume:/data \
  -v $host_cache_dir:/root/.cache/huggingface \
  -e HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN \
  -e RANK=0 -e WORLD_SIZE=1 -e MASTER_ADDR=localhost -e MASTER_PORT=29500 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  ghcr.io/huggingface/text-generation-inference:2.2.0 \
  --model-id $model --sharded true --quantize eetq --cuda-memory-fraction 0.6
Deployment arguments:
Args { model_id: "meta-llama/Meta-Llama-3.1-70B-Instruct", revision: None, validation_workers: 2, sharded: Some(true), num_shard: None, quantize: Some(Eetq), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 0.3, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0541246362b8", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 0.6, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, otlp_service_name: "text-generation-inference.router", cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4, lora_adapters: None, disable_usage_stats: false, disable_crash_reports: false }
This runs properly, but whenever I try to generate a response (I am streaming from /generate_stream), I get the same token repeated over and over:
data: {"index":1,"token":{"id":52107,"text":"\\\\","logprob":0.0,"special":false},"top_tokens":[{"id":52107,"text":"\\\\","logprob":0.0,"special":false}],"generated_text":null,"details":null}
data: {"index":2,"token":{"id":52107,"text":"\\\\","logprob":0.0,"special":false},"top_tokens":[{"id":52107,"text":"\\\\","logprob":0.0,"special":false}],"generated_text":null,"details":null}
...
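For reference, the streamed output above comes from a request along these lines; the prompt and parameter values here are illustrative, not my exact ones (top_n_tokens is set, which is why top_tokens appears in the response):

curl -N http://localhost:8080/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "My name is Olivier and I like to bake.", "parameters": {"max_new_tokens": 64, "top_n_tokens": 1}}'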
My initial thought was an issue with the tokenizer, so I tested the /tokenize endpoint.
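The request looked roughly like this (the input string here is illustrative, not my exact text):

curl http://localhost:8080/tokenize \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "My name is Olivier and I like to bake."}'

and it returns: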
[
  { "id": 5159,  "text": "", "start": 0,  "stop": 0 },
  { "id": 836,   "text": "", "start": 2,  "stop": 2 },
  { "id": 374,   "text": "", "start": 7,  "stop": 7 },
  { "id": 78118, "text": "", "start": 10, "stop": 10 },
  { "id": 323,   "text": "", "start": 18, "stop": 18 },
  ...
]
Okay, so the empty text fields (and the zero-width start/stop offsets) told me there must be an issue with the tokenizer. I then downloaded tokenizer.json from the model files and set it manually at launch using --tokenizer-config-path, but that still did not fix the issue.
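For reference, this is roughly how I passed the flag; the file sits in the volume already mounted at /data (the exact path is illustrative):

sudo docker run --gpus all --shm-size 128g -p 8080:80 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.2.0 \
  --model-id $model --sharded true --quantize eetq --cuda-memory-fraction 0.6 \
  --tokenizer-config-path /data/tokenizer.json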
Has anyone run into a similar issue? Any help would be greatly appreciated.
Yes, I can confirm this is an issue. Note that the file you probably need to specify is tokenizer_config.json rather than tokenizer.json; however, TGI does not seem to be using the file for some reason.
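Two quick checks that might narrow it down (the container name and path are placeholders for your own setup):

# is the file actually visible inside the container at the path given to the flag?
docker exec <container-id> ls -l /data/tokenizer_config.json
# did the launcher pick up the flag? it should appear as
# tokenizer_config_path: Some(...) rather than None in the startup args
docker logs <container-id> 2>&1 | grep -i tokenizer_config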