Upload ONNX weights
Conversion code:
import os
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
queries = [
    "What are some ways to reduce stress?",
    "What are the benefits of drinking green tea?",
]
queries = [query_prompt + query for query in queries]
# docs do not need any prompts
docs = [
    "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
    "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]

# The path of your model after cloning it
model_dir = "./stella_en_400M_v5"
vector_dim = 1024
vector_linear_directory = f"2_Dense_{vector_dim}"
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True, use_memory_efficient_attention=False, unpad_inputs=False).eval()
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)

# Load the dense head that projects hidden states to the final embedding size
vector_linear = torch.nn.Linear(in_features=model.config.hidden_size, out_features=vector_dim)
vector_linear_dict = {
    k.replace("linear.", ""): v
    for k, v in torch.load(os.path.join(model_dir, f"{vector_linear_directory}/pytorch_model.bin"), map_location=torch.device("cpu")).items()
}
vector_linear.load_state_dict(vector_linear_dict)
vector_linear.eval()
model.vector_linear = vector_linear

# Patch forward so the exported graph returns sentence embeddings directly:
# mean-pool the last hidden state over non-padding tokens, then project.
original_forward = model.forward
def patched_forward(input_ids, attention_mask, token_type_ids):
    last_hidden_state = original_forward(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[0]
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    query_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    return model.vector_linear(query_vectors)
model.forward = patched_forward

# Embed the queries
with torch.no_grad():
    input_data = tokenizer(queries, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    outputs = model(**input_data)
    query_vectors = normalize(outputs.cpu().numpy())

# Embed the documents
with torch.no_grad():
    input_data = tokenizer(docs, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    outputs = model(**input_data)
    docs_vectors = normalize(outputs.cpu().numpy())

print(query_vectors.shape, docs_vectors.shape)
# (2, 1024) (2, 1024)

similarities = query_vectors @ docs_vectors.T
print(similarities)
# [[0.8397531  0.29900077]
#  [0.32818374 0.80954516]]
Followed by:
input_data = tokenizer(queries, padding="longest", truncation=True, max_length=512, return_tensors="pt")

# Export the model
torch.onnx.export(
    model,                        # model being run (with the patched forward)
    (input_data["input_ids"], input_data["attention_mask"], input_data["token_type_ids"]),  # model inputs
    "model.onnx",                 # where to save the model (file path or file-like object)
    export_params=True,           # store the trained parameter weights inside the model file
    opset_version=14,             # the ONNX opset version to export to
    do_constant_folding=True,     # whether to execute constant folding for optimization
    input_names=["input_ids", "attention_mask", "token_type_ids"],   # the model's input names
    output_names=["sentence_embedding"],                             # the model's output names
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "token_type_ids": {0: "batch_size", 1: "sequence_length"},
        "sentence_embedding": {0: "batch_size"},
    },
)
and then simplified with ONNXSlim.
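For reference, a minimal sketch of that simplification step, assuming ONNXSlim's Python slim() helper (the CLI form onnxslim model.onnx model_slim.onnx should be equivalent):

import onnx
from onnxslim import slim

# Simplify the exported graph; slim() accepts a path or ModelProto and
# returns the simplified onnx.ModelProto.
slimmed = slim("model.onnx")
onnx.save(slimmed, "model_slim.onnx")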
@Xenova Would it be possible to publish this to your account so that we users could use it without waiting for the PR to be merged? Or, if you could kindly tell me how to pull a PR to my local machine, that would be greatly appreciated! (I have tried pulling the refs/pr/3 branch but it didn't work.)
Thank you very much in advance!
@netw0rkf10w In your code, you should be able to specify revision='refs/pr/3'. Which library are you running the model with? If Transformers.js, you can specify { revision: 'refs/pr/3' } as an option.
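For the Python transformers case, that would look something like this (a sketch; the repo id here is illustrative):

from transformers import AutoModel

# revision can point at an unmerged PR branch, addressable as refs/pr/<number>
model = AutoModel.from_pretrained(
    "dunzhang/stella_en_400M_v5",
    revision="refs/pr/3",
    trust_remote_code=True,
)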
@Xenova Thanks for the prompt reply!
I'm using https://github.com/huggingface/text-embeddings-inference and I would like to download the model to a specific local folder before loading it for inference. I'm downloading your ONNX files manually for now, but it would be great if HF could provide a way to check out a PR branch.
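In the meantime, huggingface_hub can materialize a PR revision into a local folder; a hedged sketch (the repo id and target folder are illustrative):

from huggingface_hub import snapshot_download

# Download the files of the PR branch into local_dir so TEI can load from disk
snapshot_download(
    repo_id="dunzhang/stella_en_400M_v5",
    revision="refs/pr/3",
    local_dir="/data/stella_en_400M_v5",
)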
@Xenova Unfortunately I obtained an error when loading your ONNX files:
2024-07-25T12:40:25.316131Z INFO text_embeddings_router: router/src/lib.rs:241: Starting model backend
Error: Could not create backend
Caused by:
Could not start backend: Failed to create ONNX Runtime session: Load model from /data/stella_en_400M_v5/onnx/model.onnx failed:/home/runner/work/onnxruntime-build/onnxruntime-build/onnxruntime/onnxruntime/core/graph/model.cc:179 onnxruntime::Model::Model(onnx::ModelProto&&, const PathString&, const IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&, const onnxruntime::ModelOptions&) Unsupported model IR version: 10, max supported IR version: 9
I guess this is because you used IR version 10 to create the ONNX file, is that correct?
Right, you just need to upgrade your version of onnxruntime/onnx :)
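You can confirm the IR version of the exported file with the standard onnx API:

import onnx

# ModelProto carries the IR version of the serialized graph
m = onnx.load("model.onnx")
print(m.ir_version)  # 10 here; older onnxruntime builds only support up to 9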
@netw0rkf10w I got around the unsupported model IR version error in text-embeddings-inference by updating ort from version 2.0.0-rc.2 to 2.0.0-rc.4 in backends/ort/Cargo.toml, then rebuilding the docker container using the project's Dockerfile.
However, I now get the following error when it tries to load the model:
Unknown output keys: [Output { name: "sentence_embedding", output_type: Tensor { ty: Float32, dimensions: [-1, 1024] } }]
Does anyone know how to get around this, or has anyone gotten this ONNX export to run under TEI?
@randai2 Yes, upgrading the ort package to 2.0.0-rc.4 seems to be the way to go. I also posted this suggestion in the TEI repo: https://github.com/huggingface/text-embeddings-inference/issues/355
Unfortunately I'm taken up by some other urgent stuff, so I haven't tried it yet, but I would suggest asking the question in the TEI repo.
@randai2 Let's vote for the support of this model in TEI: https://github.com/huggingface/text-embeddings-inference/issues/359
@netw0rkf10w @randai2 I made a PR #361 that supports model IR version 10!
In the meantime, TEI gets the output from the layer named last_hidden_state (or token_embeddings) for the embedding model; you can check the code. So, to run the ONNX model with TEI, the output_names should be last_hidden_state (i.e., the output of the original_forward in the above code) instead of sentence_embedding, I guess.
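A hedged sketch of what that re-export could look like, reusing the variables from the conversion code above. Note this exports the unpatched backbone, so the 2_Dense_1024 projection is not baked into the graph; it only illustrates the output naming TEI expects:

import torch

# Undo the patch so the graph ends at the raw token embeddings
model.forward = original_forward

class LastHiddenState(torch.nn.Module):
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, input_ids, attention_mask, token_type_ids):
        # [0] is the last hidden state, shape (batch, seq, hidden)
        return self.backbone(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[0]

torch.onnx.export(
    LastHiddenState(model),
    (input_data["input_ids"], input_data["attention_mask"], input_data["token_type_ids"]),
    "model.onnx",
    export_params=True,
    opset_version=14,
    do_constant_folding=True,
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "token_type_ids": {0: "batch_size", 1: "sequence_length"},
        "last_hidden_state": {0: "batch_size", 1: "sequence_length"},
    },
)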
Hi, I still get the same error as @randai2 even after building the docker image for TEI from the latest code. I have checked that the ort version is already 2.0.0-rc.4:
Unknown output keys: [Output { name: "sentence_embedding", output_type: Tensor { ty: Float32, dimensions: [-1, 1024] } }]
Is there any solution for this?