[BUG] Unable to run inference with batchsize=4

#2
by alexyywwdd - opened

Hi, I tested your open-sourced code and model a few days ago. It works quite well with batchsize = 1, but fails to produce normal output when batchsize > 1. More precisely, for every sample in the batch it generates content that looks very similar to the output for sample[0].

The problem with the current code probably lies in Kosmos2_5VisionLayer https://huggingface.co/kirp/kosmos2_5/blob/bef6ac6ae6e461316affd896206a106abf8cdb3e/modeling_kosmos2_5.py#L867-L874

        self_attention_outputs, _ = self.attention(
            hidden_states,
            attention_mask=attention_mask,
            layer_head_mask=head_mask,
            output_attentions=output_attentions,
        )
        attention_output = self_attention_outputs[0]
        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights

The attention module, for instance Kosmos2_5VisionAttention, returns a tuple:

class Kosmos2_5VisionAttention(nn.Module):
    # ...

    def forward(
        self,
        hidden_states,
        attention_mask=None,
        position_bias=None,
        layer_head_mask=None,
        output_attentions=False,
    ):
        return attn_output, attn_weights

However, Kosmos2_5VisionLayer discards attn_weights when it unpacks the result, and then indexes attn_output as if it were still the tuple (attn_output, attn_weights). As a result, self_attention_outputs[0] selects only the first sample in the batch, with shape (4096, 1536). Broadcasting in the residual connection expands it back to the batch shape, so no error is raised, but every sample ends up with the attention output of sample[0], which is incorrect.
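To make the silent broadcast concrete, here is a minimal sketch. It is not the repo's code: fake_attention, buggy_layer and fixed_layer are hypothetical stand-ins, the attention is replaced by a simple multiplication, and the shapes are shrunk for readability.

```python
import torch


def fake_attention(hidden_states):
    """Hypothetical stand-in for the attention module: returns (attn_output, attn_weights)."""
    attn_output = hidden_states * 2.0  # any per-sample transformation will do here
    attn_weights = None
    return attn_output, attn_weights


def buggy_layer(hidden_states):
    # Unpacking already yields the tensor, so [0] selects batch sample 0.
    self_attention_outputs, _ = fake_attention(hidden_states)
    attention_output = self_attention_outputs[0]  # shape (seq, dim): batch dim is lost
    # The residual add broadcasts (seq, dim) against (batch, seq, dim) without error.
    return attention_output + hidden_states


def fixed_layer(hidden_states):
    # Keep the tuple, so [0] selects attn_output with its batch dimension intact.
    self_attention_outputs = fake_attention(hidden_states)
    attention_output = self_attention_outputs[0]  # shape (batch, seq, dim)
    return attention_output + hidden_states


# Small shapes for illustration; the thread reports (4096, 1536) per sample.
batch, seq, dim = 4, 3, 2
x = torch.arange(batch * seq * dim, dtype=torch.float32).reshape(batch, seq, dim)

buggy = buggy_layer(x)
fixed = fixed_layer(x)
```

Both results have the same shape, which is why nothing fails at runtime; only sample 0 matches between the two versions.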
To fix it, change:

self_attention_outputs, _ = self.attention(
👇
self_attention_outputs = self.attention(

If a PR is needed, I'm willing to open one.

Please let me know whether I got this right. Thanks!

Did you run the code?! I can't even get the code to run. Have you resolved this issue by any chance? I would appreciate it if you could let me know.

Traceback (most recent call last):
  File "/shared/workspace/koshug/hug_me.py", line 17, in <module>
    inputs = processor(text=prompt, images=image, return_tensors="pt")
  File "/root/anaconda3/envs/kos/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2945, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/root/anaconda3/envs/kos/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3053, in _call_one
    return self.encode_plus(
  File "/root/anaconda3/envs/kos/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 3127, in encode_plus
    return self._encode_plus(
  File "/root/anaconda3/envs/kos/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 601, in _encode_plus
    batched_output = self._batch_encode_plus(
TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'images'
Owner

This repo is just for testing. I haven't finished batch generation yet.

As for Kosmos2_5VisionLayer, I will check this later. Thank you for the reminder.

It seems you haven't successfully loaded Kosmos2_5Processor. I ran the code in this repo and it works well (except for batchsize > 1). Maybe @kirp can help.
BR.

Thank you. Following your hint, I solved the issue by importing the processor like this: from kosmos2_5.processing_kosmos2_5 import Kosmos2_5Processor

@alexyywwdd Batch generation is now supported. You need to run pip install git+https://github.com/tic-top/transformers.git --upgrade

# batch generate
inputs = processor(text=[prompt, prompt], images=[image, image], return_tensors="pt")

# Get the original width and height
raw_width, raw_height = image.size

# NOTE: If the processor receives a single image, it returns an int; if it receives a batch of images, it returns a List[int].
height, width = inputs.pop("height"), inputs.pop("width")

# Here we use height[0] and width[0] to get the resized height and width of the first image
scale_height = raw_height / height[0]
scale_width = raw_width / width[0]
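
If the images in a batch differ in size, the same idea extends to one scale pair per image. A minimal sketch, assuming height and width come back as plain Python numbers or lists as the note above says; as_list and per_image_scales are hypothetical helpers, not part of the processor:

```python
def as_list(value):
    """Wrap a single number in a list; pass lists and tuples through as lists."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


def per_image_scales(raw_sizes, heights, widths):
    """Compute (scale_height, scale_width) for each image.

    raw_sizes: list of (raw_width, raw_height) pairs, e.g. from PIL's image.size.
    heights, widths: resized sizes, either a single number or a list of numbers.
    """
    heights, widths = as_list(heights), as_list(widths)
    if not (len(raw_sizes) == len(heights) == len(widths)):
        raise ValueError("expected one height and one width per image")
    return [
        (raw_height / height, raw_width / width)
        for (raw_width, raw_height), height, width in zip(raw_sizes, heights, widths)
    ]
```

For the batch above this would be called as per_image_scales([image.size, image.size], height, width).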

Thanks, I'm closing this issue now.

alexyywwdd changed discussion status to closed
