IDEFICS2-OCR

A fine-tune of Idefics2-8b, trained with fp16 weight updates on the nielsr/docvqa_1200_examples_donut dataset of document VQA pairs.

Usage

import torch
from transformers import BitsAndBytesConfig, AutoModelForVision2Seq, AutoProcessor
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("smishr-18/Idefics2-OCR", do_image_splitting=False)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = AutoModelForVision2Seq.from_pretrained(
    "smishr-18/Idefics2-OCR",
    quantization_config=bnb_config,
    device_map=device,
    low_cpu_mem_usage=True
    )

image = load_image("https://images.pokemontcg.io/pl1/1_hires.png")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Explain."},
            {"type": "image"},
            {"type": "text", "text": "What is the reflex energy in the image?"}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text.strip()], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Generate texts
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# The reflex energy in the image is 70.
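
The decoded string usually contains the prompt as well as the reply. The sketch below shows one way to keep only the reply. It assumes the decoded text marks the model's turn with "Assistant:", as the Idefics2 chat template normally does; extract_answer and the sample string are hypothetical names and data for illustration, not part of this model's API.

```python
def extract_answer(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the last marker, or the whole string if the marker is absent."""
    # rpartition splits on the LAST occurrence of the marker, so earlier
    # turns in a multi-turn prompt are discarded along with the user text.
    _, found, tail = decoded.rpartition(marker)
    return tail.strip() if found else decoded.strip()


# Hypothetical decoded output, shaped like a typical chat-template transcript.
sample = (
    "User: Explain. What is the reflex energy in the image? \n"
    "Assistant: The reflex energy in the image is 70."
)
print(extract_answer(sample))
```

In the usage code above, this would be applied to each element of generated_texts.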

Limitations

The model was fine-tuned on a resource-constrained T4 GPU. It could be fine-tuned with more adapters on Ampere or newer GPUs, that is, devices where torch.cuda.get_device_capability()[0] >= 8.
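
The capability check mentioned above can be sketched as follows. The helper names pick_compute_dtype and detect_major are hypothetical, and suggesting bfloat16 on capability 8 and above is an assumption for illustration (bfloat16 is supported natively from Ampere onward, while a T4 reports capability 7.5); it is not necessarily the configuration used to train this model.

```python
from typing import Optional


def pick_compute_dtype(major: int) -> str:
    """Suggest a compute dtype name from the CUDA compute-capability major version."""
    # Assumption: prefer bfloat16 on Ampere or newer (major >= 8), else float16.
    return "bfloat16" if major >= 8 else "float16"


def detect_major() -> Optional[int]:
    """Return the current GPU's capability major version, or None without CUDA."""
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()[0]


major = detect_major()
if major is None:
    print("No CUDA device detected")
else:
    print(f"Capability major {major}: suggested dtype {pick_compute_dtype(major)}")
```

The returned name could then be mapped to a torch dtype, for example with getattr(torch, name), when building the BitsAndBytesConfig.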

  • Developed by: Shubh Mishra, Aug 2024
  • Model Type: VLM
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: HuggingFaceM4/idefics2-8b
