Update README.md

16d8f63 verified 5 months ago

6.02 kB

	---
	license: llama3
	language:
	- en
	pipeline_tag: image-text-to-text
	tags:
	- text-generation-inference

	extra_gated_fields:
	First Name: text
	Last Name: text
	Country: country
	Affiliation: text
	I want to use this model for:
	type: select
	options:
	- Research
	- Education
	- label: Other
	value: other
	I agree to use this model in accordance to META LLAMA 3 COMMUNITY LICENSE AGREEMENT: checkbox
	---

	# Dragonfly Model Card

	Note: Users are permitted to use this model in accordance with the Llama 3 Community License Agreement.

	## Model Details

	Dragonfly is a multimodal visual-language model, trained by instruction tuning on Llama 3.

	- Developed by: [Together AI](https://www.together.ai/)
	- Model type: An autoregressive visual-language model based on the transformer architecture
	- License: [Llama 3 Community License Agreement](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)
	- Finetuned from model: [Llama 3](https://github.com/meta-llama/llama3)

	### Model Sources

	- Repository: https://github.com/togethercomputer/Dragonfly
	- Blog: https://www.together.ai/blog/dragonfly-v1
	- Paper: https://arxiv.org/abs/2406.00977

	## Uses

	The primary use of Dragonfly is research on large visual-language models.
	It is primarily intended for researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.


	## How to Get Started with the Model

	### 💿 Installation

	Create a conda environment and install necessary packages
	```bash
	conda env create -f environment.yml
	conda activate dragonfly_env
	```

	Install flash attention
	```bash
	pip install flash-attn --no-build-isolation
	```

	As a final step, please run the following command.
	```bash
	pip install --upgrade -e .
	```

	### 🧠 Inference

	If you have successfully completed the installation process, then you should be able to follow the steps below.

	Question: Summarize the visual content of the image.

	![Skateboard](skateboard.png)

	Load necessary packages
	```python
	import torch
	from PIL import Image
	from transformers import AutoProcessor, AutoTokenizer

	from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
	from dragonfly.models.processing_dragonfly import DragonflyProcessor
	from pipeline.train.train_utils import random_seed
	```

	Instantiate the tokenizer, processor, and model.
	```python
	device = torch.device("cuda:0")

	tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3-8B-Dragonfly-v1")
	clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
	image_processor = clip_processor.image_processor
	processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd")

	model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3-8B-Dragonfly-v1")
	model = model.to(torch.bfloat16)
	model = model.to(device)
	```

	Now, lets load the image and process them.
	```python
	image = Image.open("./test_images/skateboard.png")
	image = image.convert("RGB")
	images = [image]
	# images = [None] # if you do not want to pass any images

	text_prompt = "<\|start_header_id\|>user<\|end_header_id\|>\n\nSummarize the visual content of the image.<\|eot_id\|><\|start_header_id\|>assistant<\|end_header_id\|>\n\n"

	inputs = processor(text=[text_prompt], images=images, max_length=2048, return_tensors="pt", is_generate=True)
	inputs = inputs.to(device)
	```

	Finally, let us generate the responses from the model
	```python
	temperature = 0

	with torch.inference_mode():
	generation_output = model.generate(**inputs, max_new_tokens=1024, eos_token_id=tokenizer.encode("<\|eot_id\|>"), do_sample=temperature > 0, temperature=temperature, use_cache=True)

	generation_text = processor.batch_decode(generation_output, skip_special_tokens=False)
	```

	An example response.
	```plaintext
	In the heart of a vibrant skatepark, a skateboarder is caught in a moment of pure exhilaration. The skateboarder, dressed in a black t-shirt adorned with a yellow graphic and black pants, is suspended in mid-air, performing an impressive trick on a concrete ramp. The skateboarder's arms are outstretched, adding balance to the daring stunt.

	The skatepark itself is a concrete playground, with the skateboarder's ramp being the main focus. In the background, palm trees sway gently, adding a touch of nature to the urban setting. A few spectators can be seen in the distance, their attention riveted on the airborne skateboarder.

	The image captures not just a moment, but a story of skill, courage, and the joy of skateboarding.<\|eot_id\|>
	```

	## Training Details

	See more details in the "Implementation" section of our [paper](https://arxiv.org/abs/2406.00977).

	## Evaluation

	See more details in the "Results" section of our [paper](https://arxiv.org/abs/2406.00977).

	## 🏆 Credits

	We would like to acknowledge the following resources that were instrumental in the development of Dragonfly:

	- [Meta Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B): We utilized the Llama 3 model as our foundational language model.
	- [CLIP](https://huggingface.co/openai/clip-vit-base-patch32): Our vision backbone is CLIP model from OpenAI.
	- Our codebase is built upon the following two codebases:
	- [Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://github.com/Luodian/Otter)
	- [LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images](https://github.com/thunlp/LLaVA-UHD)

	## 📚 BibTeX

	```bibtex
	@misc{chen2024dragonfly,
	title={Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model},
	author={Kezhen Chen and Rahul Thapa and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou},
	year={2024},
	eprint={2406.00977},
	archivePrefix={arXiv},
	primaryClass={cs.CV}
	}
	```

	## Model Card Authors
	Rahul Thapa, Kezhen Chen, Rahul Chalamala

	## Model Card Contact
	Rahul Thapa ([email protected]), Kezhen Chen ([email protected])