I'm running out of memory while generating on an RTX A5000 (24 GB).
It runs out of memory every time I run generation with the code below:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler
model_name_or_path = "microsoft/Phi-3-mini-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1048,
    return_full_text=False,
    temperature=0.3,
    do_sample=True,
)
llm = HuggingFacePipeline(pipeline=pipe)
torch.cuda.empty_cache()
handler = StdOutCallbackHandler()
qa_with_sources_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=ensemble_retriever,
    callbacks=[handler],
    chain_type_kwargs={"prompt": custom_prompt},
    return_source_documents=True,
)
Install flash-attn
!pip install flash-attn --no-build-isolation
Add attn_implementation="flash_attention_2" to your AutoModelForCausalLM.from_pretrained arguments:
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True, attn_implementation="flash_attention_2")
Flash Attention helps reduce memory usage; it cut my VRAM usage by about 10 GB when quantizing with GPTQ.
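One thing to watch out for (an assumption on my side, since the snippet above loads the model without an explicit dtype): the FlashAttention-2 kernels only work with fp16/bf16 inputs, so it's worth passing torch_dtype explicitly when enabling it, and loading in half precision also halves the weight memory compared to fp32. A minimal sketch combining the two suggestions, assuming a recent transformers and flash-attn:
# Minimal sketch: load Phi-3 in bf16 so the FlashAttention-2 kernels can be used.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,                 # FA2 kernels require fp16/bf16
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
)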
Also facing a similar issue with a V100 GPU.
I tried attn_implementation="flash_attention_2" as suggested but I am getting the following error:
ValueError: The current flash attention version does not support sliding window attention.
Based on my research you are supposed to install flash-attn separately, but I already did that (and restarted my kernel) and I'm still getting the error.
Name: flash-attn
Version: 2.5.9.post1
from transformers.utils import is_flash_attn_2_available
is_flash_attn_2_available()
True
Uninstall and reinstall flash-attn:
pip list | grep flash
pip uninstall ...
pip install flash-attn --no-build-isolation
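After reinstalling, it may also be worth confirming which flash-attn version the notebook kernel actually imports (a stale build in a different environment would explain the mismatch). A quick check:
# Confirm the flash-attn version the current kernel imports.
import flash_attn
print(flash_attn.__version__)   # should match what `pip list | grep flash` reports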
Looking at the Phi3 modeling implementation in transformers, it seems to be failing in this check (the comment mentions output_attentions, but the ValueError is actually raised when the installed flash-attn build does not support sliding window attention):
# Phi3FlashAttention2 attention does not support output_attentions
if not _flash_supports_window_size:
    logger.warning_once(
        "The current flash attention version does not support sliding window attention. Please use `attn_implementation='eager'` or upgrade flash-attn library."
    )
    raise ValueError("The current flash attention version does not support sliding window attention.")
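If it helps with debugging: as far as I can tell, that _flash_supports_window_size flag is just a signature check on the installed flash-attn. A quick way to reproduce it in the same kernel (assuming flash-attn imports cleanly there):
# Reproduce the check: does the installed flash_attn_func accept a
# `window_size` argument (needed for Phi-3's sliding window attention)?
import inspect
from flash_attn import flash_attn_func

print("window_size" in inspect.signature(flash_attn_func).parameters)
# False would explain the ValueError even though flash-attn itself is installed.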
Are you trying to run Phi3 causally, or with some other method like Seq2Seq?
Thank you! I loaded an entirely new kernel and got that part resolved, but then discovered that my NVIDIA V100 GPU is not supported by Flash Attention.
I am using Phi3 causally for this. Well, I guess I will continue researching on my own and perhaps open a new conversation, as I don't want to hijack the OP's post.
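In case it helps anyone else landing here with a V100: FlashAttention-2 requires Ampere or newer GPUs, so on Volta the fallback is the one the error message itself suggests, attn_implementation="eager" (or the built-in "sdpa"). A minimal sketch, reusing the model name from the original post:
# Possible fallback on a V100 (pre-Ampere, so FlashAttention-2 is unavailable).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="auto",
    torch_dtype=torch.float16,       # V100 has no bf16 support, so use fp16
    trust_remote_code=True,
    attn_implementation="eager",     # or "sdpa"
)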