SentenceTransformer based on BAAI/bge-base-en-v1.5

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: BAAI/bge-base-en-v1.5
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
Language: en
License: apache-2.0

Model Sources

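Documentation: Sentence Transformers Documentation (https://sbert.net)
Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)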

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
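
The module list above reads as a three-stage pipeline: the BERT encoder produces one 768-dimensional vector per token, the pooling layer keeps only the first ([CLS]) token's vector (pooling_mode_cls_token is the only mode set to True), and Normalize() rescales that vector to unit length. A minimal sketch of the last two stages in plain Python follows. The toy token vectors (4 dimensions instead of 768) and the function names are illustrative assumptions, not part of the library.

```python
import math

def cls_pool(token_embeddings):
    """Return the vector of the first token ([CLS]) and ignore all other tokens."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length, as a Normalize() layer would."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)  # an all-zero vector is left unchanged
    return [x / norm for x in vector]

# Hypothetical encoder output: 3 tokens x 4 dimensions.
token_embeddings = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS]
    [1.0, 1.0, 1.0, 1.0],  # a word piece
    [0.5, 0.0, 2.0, 0.0],  # [SEP]
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)  # [0.6, 0.8, 0.0, 0.0]
```

Because every output vector has unit length, the cosine similarity of two embeddings equals their dot product.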

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("MugheesAwan11/bge-base-securiti-dataset-1-v14")
# Run inference
sentences = [
    "office of the \u200b\u200bFederal Commissioner for Data Protection and Freedom of Information, with its headquarters in the city of Bonn. It is led by a Federal Commissioner, elected via a vote by the German Bundestag. Eligibility criteria include being at least 35 years old, appropriate qualifications in the field of data protection law gained through relevant professional experience. The Commissioner's term is for five years, which can be extended once. The Commissioner has the responsibility to act as the primary office responsible for enforcing the Federal Data Protection Act within Germany. Some of the office's key responsibilities include: Advising the Bundestag, the Bundesrat, and the Federal Government on administrative and legislative measures related to data protection within the country; To oversee and implement both the GDPR and Federal Data Protection Act within Germany; To promote awareness within the public related to the risks, rules, safeguards, and rights concerning the processing of personal data; To handle all,  within Germany. It supplements and aligns with the requirements of the EU GDPR. Yes, Germany is covered by GDPR (General Data Protection Regulation). GDPR is a regulation that applies uniformly across all EU member states, including Germany. The Federal Data Protection Act established the office of the \u200b\u200bFederal Commissioner for Data Protection and Freedom of Information, with its headquarters in the city of Bonn. It is led by a Federal Commissioner, elected via a vote by the German Bundestag. Germany's interpretation is the Bundesdatenschutzgesetz (BDSG), the German Federal Data Protection Act. It mirrors the GDPR in all key areas while giving local German regulatory authorities the power to enforce it more efficiently nationally. ## Join Our Newsletter Get all the latest information, law updates and more delivered to your inbox ### Share Copy 14 ### More Stories that May Interest You View More",
    'What are the main responsibilities of the Federal Commissioner for Data Protection and Freedom of Information in enforcing data protection laws in Germany, including the GDPR and the Federal Data Protection Act?',
    'What is the collection and use of personal information by businesses?',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
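
Since the model was trained with MatryoshkaLoss at 768, 512, 256, 128, and 64 dimensions (see Training Details), an embedding can be shortened to one of those sizes by keeping its leading components and re-normalizing. The sketch below shows that operation in plain Python on made-up 8-dimensional vectors, so it runs without downloading the model; the helper names are hypothetical. Recent versions of Sentence Transformers also accept a truncate_dim argument when constructing SentenceTransformer; check the documentation for your installed version.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 8-dimensional embeddings standing in for 768-dimensional ones.
query = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
passage = [0.5, 0.5, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0]

full_score = cosine_similarity(query, passage)
short_score = cosine_similarity(
    truncate_and_normalize(query, 4),
    truncate_and_normalize(passage, 4),
)
print(round(full_score, 4), round(short_score, 4))
```

Scores from truncated vectors generally differ from full-size scores, so compare only scores computed at the same dimensionality. The Evaluation section reports retrieval quality at each size.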

Evaluation

Metrics

Information Retrieval

Dataset: dim_768
Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.6804
cosine_accuracy@3	0.9072
cosine_accuracy@5	0.9485
cosine_accuracy@10	0.9691
cosine_precision@1	0.6804
cosine_precision@3	0.3024
cosine_precision@5	0.1897
cosine_precision@10	0.0969
cosine_recall@1	0.6804
cosine_recall@3	0.9072
cosine_recall@5	0.9485
cosine_recall@10	0.9691
cosine_ndcg@10	0.8366
cosine_mrr@10	0.7925
cosine_map@100	0.7937
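
For readers unfamiliar with the @k metrics in these tables, the following is a minimal sketch of how accuracy, precision, recall, and reciprocal rank at k can be computed per query and then averaged. It is not the InformationRetrievalEvaluator implementation; the document IDs, queries, and function names are invented for illustration, and NDCG and MAP are left out for brevity.

```python
def hits_in_top_k(ranked, relevant, k):
    """Number of relevant documents among the first k ranked results."""
    return sum(1 for doc in ranked[:k] if doc in relevant)

def accuracy_at_k(ranked, relevant, k):
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if hits_in_top_k(ranked, relevant, k) > 0 else 0.0

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return hits_in_top_k(ranked, relevant, k) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return hits_in_top_k(ranked, relevant, k) / len(relevant)

def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant document in the top k, or 0.0 if none."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical results: (ranked document IDs, set of relevant IDs) per query.
results = [
    (["d3", "d1", "d7"], {"d3"}),  # relevant document at rank 1
    (["d4", "d2", "d9"], {"d2"}),  # relevant document at rank 2
    (["d5", "d6", "d8"], {"d0"}),  # relevant document not retrieved
]

k = 3
summary = {
    "accuracy@3": mean(accuracy_at_k(r, rel, k) for r, rel in results),
    "precision@3": mean(precision_at_k(r, rel, k) for r, rel in results),
    "recall@3": mean(recall_at_k(r, rel, k) for r, rel in results),
    "mrr@3": mean(reciprocal_rank_at_k(r, rel, k) for r, rel in results),
}
print(summary)
```

When every query has exactly one relevant document, accuracy@k equals recall@k and precision@k equals accuracy@k divided by k. The reported values are consistent with that pattern (for example, 0.9072 / 3 ≈ 0.3024).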

Information Retrieval

Dataset: dim_512
Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.6907
cosine_accuracy@3	0.8763
cosine_accuracy@5	0.9278
cosine_accuracy@10	0.9691
cosine_precision@1	0.6907
cosine_precision@3	0.2921
cosine_precision@5	0.1856
cosine_precision@10	0.0969
cosine_recall@1	0.6907
cosine_recall@3	0.8763
cosine_recall@5	0.9278
cosine_recall@10	0.9691
cosine_ndcg@10	0.833
cosine_mrr@10	0.7889
cosine_map@100	0.7896

Information Retrieval

Dataset: dim_256
Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.6907
cosine_accuracy@3	0.8557
cosine_accuracy@5	0.8969
cosine_accuracy@10	0.9278
cosine_precision@1	0.6907
cosine_precision@3	0.2852
cosine_precision@5	0.1794
cosine_precision@10	0.0928
cosine_recall@1	0.6907
cosine_recall@3	0.8557
cosine_recall@5	0.8969
cosine_recall@10	0.9278
cosine_ndcg@10	0.8132
cosine_mrr@10	0.7759
cosine_map@100	0.7795

Information Retrieval

Dataset: dim_128
Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.5979
cosine_accuracy@3	0.7732
cosine_accuracy@5	0.8247
cosine_accuracy@10	0.8866
cosine_precision@1	0.5979
cosine_precision@3	0.2577
cosine_precision@5	0.1649
cosine_precision@10	0.0887
cosine_recall@1	0.5979
cosine_recall@3	0.7732
cosine_recall@5	0.8247
cosine_recall@10	0.8866
cosine_ndcg@10	0.7462
cosine_mrr@10	0.701
cosine_map@100	0.7047

Information Retrieval

Dataset: dim_64
Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.5155
cosine_accuracy@3	0.6907
cosine_accuracy@5	0.7113
cosine_accuracy@10	0.7732
cosine_precision@1	0.5155
cosine_precision@3	0.2302
cosine_precision@5	0.1423
cosine_precision@10	0.0773
cosine_recall@1	0.5155
cosine_recall@3	0.6907
cosine_recall@5	0.7113
cosine_recall@10	0.7732
cosine_ndcg@10	0.6471
cosine_mrr@10	0.6064
cosine_map@100	0.6137
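
To compare the five embedding sizes at a glance, each truncated size can be expressed as a fraction of the 768-dimensional baseline. The short script below does this arithmetic for cosine_ndcg@10 and cosine_map@100, using only values copied from the tables above; the variable names are illustrative.

```python
# Values copied from the Information Retrieval tables above.
reported = {
    768: {"ndcg@10": 0.8366, "map@100": 0.7937},
    512: {"ndcg@10": 0.833, "map@100": 0.7896},
    256: {"ndcg@10": 0.8132, "map@100": 0.7795},
    128: {"ndcg@10": 0.7462, "map@100": 0.7047},
    64: {"ndcg@10": 0.6471, "map@100": 0.6137},
}

baseline = reported[768]

# Fraction of the 768-dimensional score retained at each size.
retention = {
    dim: {name: value / baseline[name] for name, value in scores.items()}
    for dim, scores in reported.items()
}

for dim, kept in retention.items():
    print(f"{dim:>3} dims: ndcg@10 {kept['ndcg@10']:.1%}, map@100 {kept['map@100']:.1%}")
```

On these figures, 256 dimensions retain roughly 98% of the full-size map@100 at one third of the storage, while 64 dimensions retain roughly 77%.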

Training Details

Training Dataset

Unnamed Dataset

Size: 7,872 training samples
Columns: positive and anchor
Approximate statistics based on the first 1000 samples:
	positive	anchor
type	string	string
details	min: 18 tokens, mean: 206.12 tokens, max: 414 tokens	min: 9 tokens, mean: 21.62 tokens, max: 102 tokens

Samples:

positive	anchor
`Automation PrivacyCenter.Cloud`	Data Mapping
on both in terms of material and territorial scope. ### 1.1 Material Scope The Spanish data protection law affords blanket protection for all data that may have been collected on a data subject. There are only a handful of exceptions that include: Information subject to a pending legal case Information collected concerning the investigation of terrorism or organised crime Information classified as "Confidential" for matters related to Spain's national security ### 1.2 Territorial Scope The Spanish data protection law applies to all data handlers that are: Carrying out data collection activities in Spain Not established in Spain but carrying out data collection activities on Spanish territory Not established within the European Union but carrying out data collection activities on Spanish residents unless for data transit purposes only ## 2. Obligations for Organizations Under Spanish Data Protection Law The Spanish data protection law and GDPR lay out specific obligations for all data handlers. These obligations ensure, . ### 2.3 Privacy Policy Requirements Spain's data protection law requires all data handlers to inform the data subject of the following in their privacy policy: The purpose of collecting the data and the recipients of the information The obligatory or voluntary nature of the reply to the questions put to them The consequences of obtaining the data or of refusing to provide them The possibility of exercising rights of access, rectification, erasure, portability, and objection The identity and address of the controller or their local Spanish representative ### 2.4 Security Requirements Article 9 of Spain's Data Protection Law is direct and explicit in stating the responsibility of the data handler is to take adequate measures to ensure the protection of any data collected. It mandates all data handlers to adopt technical and organisational measures necessary to ensure the security of the personal data and prevent their alteration, loss, and unauthorised processing or access. Additionally, collection of any	`What are the requirements for organizations under the Spanish data protection law regarding privacy policies and security measures?`
before the point of collection of their personal information. ## Right to Erasure The right to erasure gives consumers the right to request deleting all their data stored by the organization. Organizations are supposed to comply within 45 days and must deliver a report to the consumer confirming the deletion of their information. ## Right to Opt-in for Minors Personal information containing minors' personal information cannot be sold by a business unless the minor (age of 13 to 16 years) or the Parent/Guardian (if the minor is aged below 13 years) opt-ins to allow this sale. Businesses can be held liable for the sale of minors' personal information if they either knew or wilfully disregarded the consumer's status as a minor and the minor or Parent/Guardian had not willingly opted in. ## Right to Continued Protection Even when consumers choose to allow a business to collect and sell their personal information, businesses' must sign written	`What are the conditions under which businesses can sell minors' personal information?`

Loss: MatryoshkaLoss with these parameters:

{
    "loss": "MultipleNegativesRankingLoss",
    "matryoshka_dims": [
        768,
        512,
        256,
        128,
        64
    ],
    "matryoshka_weights": [
        1,
        1,
        1,
        1,
        1
    ],
    "n_dims_per_step": -1
}
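
The configuration above can be read as follows: at each training step the base loss (MultipleNegativesRankingLoss) is evaluated once per entry in matryoshka_dims on embeddings truncated to that size, and the results are added together using matryoshka_weights. An n_dims_per_step of -1 means all five sizes are used at every step. The sketch below illustrates the idea in plain Python on toy 4-dimensional vectors. It is not the library's implementation; the function names and data are made up, and the similarity scale of 20 is assumed to match the base loss's default.

```python
import math

def normalize(vector):
    """Rescale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def ranking_loss(anchors, positives, scale=20.0):
    """In-batch negatives loss: anchor i should score highest against positive i.

    Every other positive in the batch serves as a negative for anchor i.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [scale * sum(x * y for x, y in zip(anchor, pos)) for pos in p]
        log_partition = math.log(sum(math.exp(value) for value in logits))
        total += log_partition - logits[i]  # cross-entropy with target index i
    return total / len(a)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over embeddings truncated to each size."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [v[:dim] for v in anchors]
        truncated_positives = [v[:dim] for v in positives]
        loss += weight * ranking_loss(truncated_anchors, truncated_positives)
    return loss

# Hypothetical batch of two (anchor, positive) pairs with 4-dimensional embeddings.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.0, 0.2], [0.1, 0.8, 0.3, 0.0]]

loss = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
print(loss)
```

Because the truncated sizes contribute to the same objective, the leading components of each embedding are pushed to be useful on their own, which is what makes the shortened embeddings in the Evaluation section usable.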

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: epoch
per_device_train_batch_size: 32
per_device_eval_batch_size: 16
learning_rate: 2e-05
num_train_epochs: 2
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: True
tf32: True
load_best_model_at_end: True
optim: adamw_torch_fused
batch_sampler: no_duplicates
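
These settings determine the length of training and the shape of the learning-rate curve. The sketch below works through the arithmetic under stated assumptions: a single device, no gradient accumulation, warmup steps rounded up, and a half-cosine decay to zero after linear warmup. The function and variable names are illustrative, and the schedule is a plain-Python approximation rather than the trainer's own code.

```python
import math

# Values taken from this model card.
train_samples = 7872
batch_size = 32          # per_device_train_batch_size
epochs = 2               # num_train_epochs
base_lr = 2e-05          # learning_rate
warmup_ratio = 0.1

# Assumes one device and gradient_accumulation_steps = 1.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # rounding up is an assumption

def learning_rate_at(step):
    """Linear warmup to base_lr, then half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, 25, warmup_steps, steps_per_epoch, total_steps):
    print(step, learning_rate_at(step))
```

The derived total of 492 steps agrees with the last row of the Training Logs table, and 10 / 246 ≈ 0.0407 agrees with the epoch value logged at step 10.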

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: epoch
prediction_loss_only: True
per_device_train_batch_size: 32
per_device_eval_batch_size: 16
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 2
max_steps: -1
lr_scheduler_type: cosine
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: True
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: True
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: True
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch_fused
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: False
hub_always_push: False
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
dispatch_batches: None
split_batches: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional

Training Logs

Epoch	Step	Training Loss	dim_128_cosine_map@100	dim_256_cosine_map@100	dim_512_cosine_map@100	dim_64_cosine_map@100	dim_768_cosine_map@100
0.0407	10	7.3941	-	-	-	-	-
0.0813	20	6.0968	-	-	-	-	-
0.1220	30	4.9439	-	-	-	-	-
0.1626	40	3.8622	-	-	-	-	-
0.2033	50	3.0938	-	-	-	-	-
0.2439	60	1.8775	-	-	-	-	-
0.2846	70	2.3808	-	-	-	-	-
0.3252	80	4.0718	-	-	-	-	-
0.3659	90	2.2182	-	-	-	-	-
0.4065	100	1.914	-	-	-	-	-
0.4472	110	1.5123	-	-	-	-	-
0.4878	120	1.7047	-	-	-	-	-
0.5285	130	2.9509	-	-	-	-	-
0.5691	140	1.0605	-	-	-	-	-
0.6098	150	1.762	-	-	-	-	-
0.6504	160	1.6545	-	-	-	-	-
0.6911	170	3.0971	-	-	-	-	-
0.7317	180	1.3791	-	-	-	-	-
0.7724	190	1.9717	-	-	-	-	-
0.8130	200	5.1309	-	-	-	-	-
0.8537	210	1.4047	-	-	-	-	-
0.8943	220	1.4391	-	-	-	-	-
0.9350	230	3.6443	-	-	-	-	-
0.9756	240	3.721	-	-	-	-	-
1.0122	249	-	0.6625	0.7330	0.7497	0.5784	0.7568
1.0041	250	1.3171	-	-	-	-	-
1.0447	260	5.2603	-	-	-	-	-
1.0854	270	4.0513	-	-	-	-	-
1.1260	280	2.5508	-	-	-	-	-
1.1667	290	1.7385	-	-	-	-	-
1.2073	300	1.1692	-	-	-	-	-
1.2480	310	0.788	-	-	-	-	-
1.2886	320	1.2322	-	-	-	-	-
1.3293	330	3.3735	-	-	-	-	-
1.3699	340	1.2204	-	-	-	-	-
1.4106	350	0.8458	-	-	-	-	-
1.4512	360	0.7586	-	-	-	-	-
1.4919	370	0.8964	-	-	-	-	-
1.5325	380	1.9721	-	-	-	-	-
1.5732	390	0.5605	-	-	-	-	-
1.6138	400	0.9648	-	-	-	-	-
1.6545	410	1.0002	-	-	-	-	-
1.6951	420	2.138	-	-	-	-	-
1.7358	430	0.8221	-	-	-	-	-
1.7764	440	2.124	-	-	-	-	-
1.8171	450	2.7892	-	-	-	-	-
1.8577	460	0.9088	-	-	-	-	-
1.8984	470	0.9254	-	-	-	-	-
1.9390	480	3.1205	-	-	-	-	-
1.9797	490	3.014	-	-	-	-	-
1.9878	492	-	0.7047	0.7795	0.7896	0.6137	0.7937

The saved checkpoint is the final row (step 492); its map@100 values match those reported in the Evaluation section.

Framework Versions

Python: 3.10.14
Sentence Transformers: 3.0.1
Transformers: 4.41.2
PyTorch: 2.1.2+cu121
Accelerate: 0.31.0
Datasets: 2.19.1
Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
