Sample code for inference in Google Colab? RuntimeError: "slow_conv2d_cuda" not implemented for 'Byte'

#12
by sanjeev-bhandari01 - opened

Hi, I want to test inference with this model in Google Colab (free tier). I have tried several approaches, but none of them worked. Here is one of the scripts I tried:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import BitsAndBytesConfig
import torch
from transformers.generation import GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_kwargs = dict(
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quantization_config,
)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL", **model_kwargs).eval()

query = tokenizer.from_list_format([
    {'image': 'https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'},
    {'text': 'Generate the caption in English with grounding:'},
])
inputs = tokenizer(query, return_tensors='pt')
inputs = inputs.to(model.device)
pred = model.generate(**inputs)

But it raises an error at the line pred = model.generate(**inputs):

 RuntimeError: "slow_conv2d_cuda" not implemented for 'Byte'

I think quantization_config is causing the issue. You probably just need to pass load_in_4bit=True directly to AutoModelForCausalLM.from_pretrained instead.
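A minimal sketch of that suggestion, untested against this error: it drops the BitsAndBytesConfig object and passes load_in_4bit=True as a plain keyword argument. The helper names build_load_kwargs and load_model are hypothetical, introduced only for illustration. It also assumes the installed transformers version still accepts load_in_4bit directly in from_pretrained; newer releases deprecate that argument in favor of quantization_config, so this may not apply everywhere.

```python
def build_load_kwargs():
    """Keyword arguments for from_pretrained without a BitsAndBytesConfig.

    Hypothetical helper: it only mirrors the suggestion above, it is not
    taken from the Qwen-VL documentation.
    """
    return dict(
        device_map="auto",
        trust_remote_code=True,
        load_in_4bit=True,  # assumption: accepted directly by this transformers version
    )


def load_model(model_id="Qwen/Qwen-VL"):
    """Load the tokenizer and the 4-bit model (needs a CUDA GPU and bitsandbytes)."""
    # Imported here so the kwargs helper can be inspected without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, **build_load_kwargs()).eval()
    return tokenizer, model
```

Calling tokenizer, model = load_model() in Colab would replace the model-loading lines of the original script; the query and generate steps stay the same.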
