Batch: inefficient memory

#50 by SinanAkkoyun

A batch size of 10 eats 40GB of VRAM!

VRAM Allocated: 3147.43 MB
VRAM Reserved: 39532.00 MB
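
For reference, those two numbers are read from PyTorch's CUDA memory stats (a minimal sketch, not my exact logging code):

import torch

def print_vram(tag=""):
    # Memory currently held by live tensors
    allocated_mb = torch.cuda.memory_allocated() / 1024**2
    # Memory reserved by PyTorch's caching allocator (roughly what nvidia-smi shows)
    reserved_mb = torch.cuda.memory_reserved() / 1024**2
    print(f"{tag}VRAM Allocated: {allocated_mb:.2f} MB")
    print(f"{tag}VRAM Reserved: {reserved_mb:.2f} MB")

The setup and batched generation code: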
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "microsoft/Florence-2-large-ft"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to('cuda')
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)


def generate_batch(prompts, images):
    # Tokenize the prompts and preprocess the images as a single batch
    inputs = processor(text=prompts, images=images, return_tensors="pt").to('cuda')

    # Generate with beam search (3 beams), up to 1024 new tokens per sample
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        do_sample=False,
        num_beams=3
    )

    # Decode the whole batch, then run the <OD> post-processor per image
    generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=False)
    parsed_answers = [
        processor.post_process_generation(text, task="<OD>", image_size=(img.width, img.height))
        for text, img in zip(generated_texts, images)
    ]

    return parsed_answers, generated_ids
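
This is roughly how I call it; the image paths below are just placeholders:

from PIL import Image

# Placeholder paths; in practice the images come from my dataset
images = [Image.open(f"img_{i}.jpg").convert("RGB") for i in range(10)]
prompts = ["<OD>"] * len(images)

parsed_answers, generated_ids = generate_batch(prompts, images)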

Even when I reuse the Colab notebook exactly as-is, VRAM usage seems to grow linearly with batch size.
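
To check that scaling, I do a simple sweep over batch sizes (again a sketch, reusing the placeholder prompts and images from the call above):

import torch

for bs in (1, 2, 4, 8, 10):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    generate_batch(prompts[:bs], images[:bs])
    # Peak memory reserved by the caching allocator during this batch
    peak_gb = torch.cuda.max_memory_reserved() / 1024**3
    print(f"batch={bs}: peak reserved {peak_gb:.1f} GB")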

Any help is greatly appreciated
