Can I run it on CPU?
Yes, you can.
Check out llama.cpp or ChatLLM.cpp.
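If you just want a quick local test on CPU, the llama-cpp-python bindings are one option. A minimal sketch, assuming you have a quantized GGUF build of the model; the filename below is a placeholder, not an official artifact:

from llama_cpp import Llama

# Point model_path at your own GGUF file (e.g. a community Q4_K_M quantization of Llama 3 8B Instruct).
llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q4_K_M.gguf", n_ctx=4096, n_threads=8)

out = llm("Why is the sky blue?", max_tokens=128)
print(out["choices"][0]["text"])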
Thank you! I was trying to run it with TGI, but I am getting the following error:
(base) compute:data hadra002$ docker run --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference --model-id $model --disable-custom-kernels
2024-04-19T20:40:42.937880Z INFO text_generation_launcher: Args { model_id: "meta-llama/Meta-Llama-3-8B-Instruct", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "4fe31fe89102", port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: true, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
2024-04-19T20:40:42.939590Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-04-19T20:40:43.309344Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-04-19T20:40:43.309463Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-04-19T20:40:43.309480Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-04-19T20:40:43.309492Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-19T20:40:43.309949Z INFO download: text_generation_launcher: Starting download process.
2024-04-19T20:40:52.029571Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-04-19T20:40:53.146755Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-19T20:40:53.148336Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-19T20:41:00.442091Z WARN text_generation_launcher: We're not using custom kernels.
2024-04-19T20:41:00.464385Z WARN text_generation_launcher: Could not import Flash Attention enabled models: CUDA is not available
2024-04-19T20:41:01.703605Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 71, in serve
    from text_generation_server import server
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 16, in <module>
    from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
    from text_generation_server.models.flash_mistral import (
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_mistral.py", line 18, in <module>
    from text_generation_server.models.custom_modeling.flash_mistral_modeling import (
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 29, in <module>
    from text_generation_server.utils import paged_attention, flash_attn
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/utils/flash_attn.py", line 12, in <module>
    raise ImportError("CUDA is not available")
ImportError: CUDA is not available
rank=0
Error: ShardCannotStart
2024-04-19T20:41:01.809661Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-19T20:41:01.809775Z INFO text_generation_launcher: Shutting down shards
(base) compute:data hadra002$
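The traceback shows the container failing on its flash-attention import because no CUDA device is visible, so this ghcr.io/huggingface/text-generation-inference image won't start a shard on a CPU-only host; for a pure-CPU setup the llama.cpp route suggested above avoids this entirely. A minimal sanity-check sketch (not TGI-specific code) to confirm what PyTorch sees on your machine or inside the container:

import torch

# The shard raises ImportError("CUDA is not available") when this returns False.
print(torch.cuda.is_available())   # False on a CPU-only host or without GPU passthrough (e.g. docker's --gpus all)
print(torch.version.cuda)          # CUDA toolkit version the installed wheel was built against (None for CPU-only builds)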
import transformers
import torch

model_id = "meta-llama/Meta-Llama-3-8B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
pipeline("hi")
Why does it crash without giving a response?
I am running it on Colab.
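One likely cause (an assumption, since no traceback is shown): on the free Colab tier the 8B model in bfloat16 needs roughly 16 GB just for the weights, which can exhaust the runtime's RAM and kill the session before any output appears. A rough sketch of a lower-memory load using 4-bit quantization, which needs a GPU runtime (e.g. the free T4) and the bitsandbytes package installed:

import torch
import transformers
from transformers import BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B"

# 4-bit quantization cuts the weight footprint to around 5 GB.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"quantization_config": bnb_config},
    device_map="auto",
)
print(pipeline("hi", max_new_tokens=32))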
I think it's one of the worst ideas to run a model like Llama on a CPU machine.