'--- pipeline_tag: sentence-similarity tags: - ctranslate2 - int8 - float16 - sentence-transformers - feature-extraction - sentence-similarity language: en license: apache-2.0 datasets: - s2orc - flax-sentence-embeddings/stackexchange_xml - MS Marco - gooaq - yahoo_answers_topics - code_search_net - search_qa - eli5 - snli - multi_nli - wikihow - natural_questions - trivia_qa - embedding-data/sentence-compression - embedding-data/flickr30k-captions - embedding-data/altlex - embedding-data/simple-wiki - embedding-data/QQP - embedding-data/SPECTER - embedding-data/PAQ_pairs - embedding-data/WikiAnswers --- # # Fast-Inference with Ctranslate2 Speedup inference while reducing memory by 2x-4x using int8 inference in C++ on CPU or GPU. quantized version of [sentence-transformers/all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) ```bash pip install hf-hub-ctranslate2>=2.12.0 ctranslate2>=3.17.1 ``` ```python # from transformers import AutoTokenizer model_name = "michaelfeil/ct2fast-all-MiniLM-L12-v2" model_name_orig="sentence-transformers/all-MiniLM-L12-v2" from hf_hub_ctranslate2 import EncoderCT2fromHfHub model = EncoderCT2fromHfHub( # load in int8 on CUDA model_name_or_path=model_name, device="cuda", compute_type="int8_float16" ) outputs = model.generate( text=["I like soccer", "I like tennis", "The eiffel tower is in Paris"], max_length=64, ) # perform downstream tasks on outputs outputs["pooler_output"] outputs["last_hidden_state"] outputs["attention_mask"] # alternative, use SentenceTransformer Mix-In # for end-to-end Sentence embeddings generation # (not pulling from this CT2fast-HF repo) from hf_hub_ctranslate2 import CT2SentenceTransformer model = CT2SentenceTransformer( model_name_orig, compute_type="int8_float16", device="cuda" ) embeddings = model.encode( ["I like soccer", "I like tennis", "The eiffel tower is in Paris"], batch_size=32, convert_to_numpy=True, normalize_embeddings=True, ) print(embeddings.shape, embeddings) scores = (embeddings @ embeddings.T) * 100 # Hint: you can also host this code via REST API and # via github.com/michaelfeil/infinity ``` Checkpoint compatible to [ctranslate2>=3.17.1](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2>=2.12.0](https://github.com/michaelfeil/hf-hub-ctranslate2) - `compute_type=int8_float16` for `device="cuda"` - `compute_type=int8` for `device="cpu"` Converted on 2023-10-13 using ``` LLama-2 -> removed token. ``` # Licence and other remarks: This is just a quantized version. Licence conditions are intended to be idential to original huggingface repo. # Original description