How can I get dense and ColBERT embeddings with transformers?
#80 · opened by Calvinnncy97
Given
from transformers import AutoModel, AutoTokenizer
from torch import Tensor
import torch
model_path = 'BAAI/bge-m3'
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
test_sentence = ["this is a test sentence"]
batch_dict = tokenizer(test_sentence, return_tensors='pt', max_length=128, padding=True, truncation=True)
outputs = model(**batch_dict)
I get a BaseModelOutputWithPoolingAndCrossAttentions with pooler_output and last_hidden_state keys. Is pooler_output the CLS embedding, and last_hidden_state all the token embeddings? Kindly clarify. Thank you.
@Calvinnncy97, the CLS embedding (the dense vector) is the first vector in last_hidden_state; the remaining vectors (everything except the first) are the ColBERT embeddings.
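Building on that answer, here is a minimal sketch of how you could slice the outputs yourself. Two caveats: the L2 normalization and padding mask below are my own assumptions, and as far as I know the official FlagEmbedding wrapper also applies a learned colbert_linear projection to the ColBERT vectors that plain AutoModel does not load, so these values will not exactly match BGEM3FlagModel's output.

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_path = 'BAAI/bge-m3'
model = AutoModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

batch_dict = tokenizer(["this is a test sentence"], return_tensors='pt',
                       max_length=128, padding=True, truncation=True)

with torch.no_grad():
    last_hidden = model(**batch_dict).last_hidden_state   # (batch, seq_len, hidden)

# Dense embedding: the first position (CLS token), L2-normalized (assumption).
dense_vecs = F.normalize(last_hidden[:, 0], dim=-1)       # (batch, hidden)

# ColBERT embeddings: every position except the first, with padding masked out.
token_mask = batch_dict['attention_mask'][:, 1:].unsqueeze(-1).float()   # (batch, seq_len-1, 1)
colbert_vecs = F.normalize(last_hidden[:, 1:], dim=-1) * token_mask      # (batch, seq_len-1, hidden)

You could then score dense similarity with a dot product of the dense_vecs of two texts, and the ColBERT score with the usual MaxSim over their colbert_vecs.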