## Acknowledgement
Google supported this work by providing Google Cloud credits. Thank you, Google, for supporting open source! 🎉
## What is this?
This model is a fine-tuned version of dbmdz/bert-base-turkish-cased for zero-shot tasks in Turkish. It was fine-tuned on an NLI task with sentence-transformers and uses the mean of the token embeddings as the aggregation function. I also converted it to TensorFlow, with the aggregation function rewritten in TF, to use it in my ai-aas repo on GitHub for production-grade deployment, but a simple usage example is as follows:
## Usage
```python
import time

import tensorflow as tf
from transformers import TFAutoModel, AutoTokenizer

texts = ["Galatasaray, bu akşamki maçın ardından şampiyonluğunu ilan etmeye hazırlanıyor."]
labels = ["spor", "siyaset", "kültür"]

model_name = 'mys/bert-base-turkish-cased-nli-mean'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModel.from_pretrained(model_name)

def label_text(model, tokenizer, texts, labels):
    texts_length = len(texts)
    # Encode the texts and the candidate labels together in a single batch
    tokens = tokenizer(texts + labels, padding=True, return_tensors='tf')
    embs = model(**tokens)[0]

    # Mean-pool the token embeddings, ignoring padding via the attention mask
    attention_masks = tf.cast(tokens['attention_mask'], tf.float32)
    sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True)
    masked_embs = embs * tf.expand_dims(attention_masks, axis=-1)
    masked_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32)

    # Inner product between text and label embeddings, softmaxed over the labels
    dists = tf.experimental.numpy.inner(masked_embs[:texts_length], masked_embs[texts_length:])
    scores = tf.nn.softmax(dists)

    results = list(zip(labels, scores.numpy().squeeze().tolist()))
    sorted_results = sorted(results, key=lambda x: x[1], reverse=True)
    sorted_results = [{"label": label, "score": f"{score:.4f}"} for label, score in sorted_results]
    return sorted_results

start = time.time()
sorted_results = label_text(model, tokenizer, texts, labels)
elapsed = time.time() - start
print(sorted_results)
print(f"Processed in {elapsed:.2f} secs")
```
Output:
```
[{'label': 'spor', 'score': '1.0000'}, {'label': 'siyaset', 'score': '0.0000'}, {'label': 'kültür', 'score': '0.0000'}]
Processed in 0.22 secs
```
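Since the model was fine-tuned with sentence-transformers and uses mean pooling, the embeddings can likely also be computed through that library. The snippet below is a minimal sketch under that assumption: it relies on sentence-transformers falling back to its default Transformer + mean-pooling setup if this repo does not ship a sentence-transformers config, so verify that its scores match the TF implementation above before relying on it.

```python
# Hedged sketch: assumes sentence-transformers can load this checkpoint directly,
# falling back to mean pooling when no sentence-transformers config is present.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('mys/bert-base-turkish-cased-nli-mean')

texts = ["Galatasaray, bu akşamki maçın ardından şampiyonluğunu ilan etmeye hazırlanıyor."]
labels = ["spor", "siyaset", "kültür"]

# Encode texts and candidate labels into mean-pooled sentence embeddings
text_embs = model.encode(texts)
label_embs = model.encode(labels)

# Dot-product similarity between each text and each label, as in label_text() above
scores = util.dot_score(text_embs, label_embs)
print(scores)
```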
## How it works
The `label_text()` function runs the BERT model on the concatenation of `texts` and `labels`, and aggregates the per-token hidden states output by BERT into a single vector per sequence using a masked mean. The inner product between text embeddings and label embeddings is then used as the similarity metric, and a `softmax` over the labels converts these similarity scores into probabilities.
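Concretely, this is the computation performed in `label_text()`: with $h_{i,t}$ the hidden state of token $t$ in sequence $i$ and $m_{i,t} \in \{0, 1\}$ its attention-mask value, the pooled embedding and the probability of label $\ell_j$ for text $x_i$ are

$$
e_i = \frac{\sum_t m_{i,t}\, h_{i,t}}{\sum_t m_{i,t}},
\qquad
p(\ell_j \mid x_i) = \operatorname{softmax}_j\!\left( e_{x_i} \cdot e_{\ell_j} \right).
$$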
## Dataset
Emrah Budur, Rıza Özçelik, Tunga Güngör and Christopher Potts. 2020. Data and Representation for Turkish Natural Language Inference. To appear in Proceedings of EMNLP. [pdf] [bib]
```bibtex
@inproceedings{budur-etal-2020-data,
    title = "Data and Representation for Turkish Natural Language Inference",
    author = "Budur, Emrah and
      \"{O}z\c{c}elik, R{\i}za and
      G\"{u}ng\"{o}r, Tunga and
      Potts, Christopher",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics"
}
```