llava-hf/llava-v1.6-vicuna-13b-hf

Commit 7daabef (1 parent: d1bfb9f), committed by nielsr (HF staff) on Mar 21

Update README.md

Files changed (1): README.md +27 -0

@@ -55,6 +55,33 @@ output = model.generate(**inputs, max_new_tokens=100)
 print(processor.decode(output[0], skip_special_tokens=True))
 ```
+### Model optimization
+#### 4-bit quantization through the `bitsandbytes` library
+First install `bitsandbytes` with `pip install bitsandbytes`, and make sure you have access to a CUDA-compatible GPU. Then change the snippet above as follows:
+```diff
+model = LlavaNextForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=torch.float16,
+    low_cpu_mem_usage=True,
++   load_in_4bit=True
+)
+```
+#### Use Flash-Attention 2 to further speed up generation
+First install `flash-attn`; see the [original Flash Attention repository](https://github.com/Dao-AILab/flash-attention) for installation instructions. Then change the snippet above as follows:
+```diff
+model = LlavaNextForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=torch.float16,
+    low_cpu_mem_usage=True,
++   use_flash_attention_2=True
+).to(0)
+```
 ### BibTeX entry and citation info
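
The two options added in this diff can be combined behind one loader. The following is a minimal sketch, not part of the commit: `build_load_kwargs` and `load_model` are hypothetical helper names, and the flags are passed exactly as the diff writes them (`load_in_4bit`, `use_flash_attention_2`). Newer `transformers` releases prefer `quantization_config=BitsAndBytesConfig(load_in_4bit=True)` and `attn_implementation="flash_attention_2"`, so treat the keyword names as dependent on the installed version.

```python
from typing import Any, Dict


def build_load_kwargs(use_4bit: bool = False, use_flash_attn: bool = False) -> Dict[str, Any]:
    """Collect the optional from_pretrained keyword arguments shown in the diff.

    Hypothetical helper: it only assembles a dict and imports nothing heavy.
    """
    kwargs: Dict[str, Any] = {"low_cpu_mem_usage": True}
    if use_4bit:
        # Requires `pip install bitsandbytes` and a CUDA-compatible GPU.
        kwargs["load_in_4bit"] = True
    if use_flash_attn:
        # Requires the `flash-attn` package to be installed.
        kwargs["use_flash_attention_2"] = True
    return kwargs


def load_model(model_id: str, use_4bit: bool = False, use_flash_attn: bool = False):
    """Load the model with the selected optimizations (hypothetical helper)."""
    # Imported lazily so build_load_kwargs stays usable without torch installed.
    import torch
    from transformers import LlavaNextForConditionalGeneration

    model = LlavaNextForConditionalGeneration.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        **build_load_kwargs(use_4bit=use_4bit, use_flash_attn=use_flash_attn),
    )
    # As in the diff, only the non-quantized model is moved with .to(0);
    # a 4-bit model is placed on the GPU during loading.
    if not use_4bit:
        model = model.to(0)
    return model
```

Under these assumptions, `load_model("llava-hf/llava-v1.6-vicuna-13b-hf", use_flash_attn=True)` mirrors the second snippet in the diff, and `use_4bit=True` mirrors the first.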