## Acknowledgement
Google supported this work by providing Google Cloud credits. Thank you, Google, for supporting open source! 🎉
## What is this?
This model is a fine-tuned version of dbmdz/bert-base-turkish-cased for zero-shot tasks in Turkish. It was fine-tuned on an NLI task with sentence-transformers and uses the mean of the token embeddings as the aggregation function. I also converted it to TensorFlow, with the aggregation function rewritten in TF, to use it in my ai-aas repo on GitHub for production-grade deployment, but a simple usage example is as follows:
## Usage
```python
import time

import tensorflow as tf
from transformers import TFAutoModel, AutoTokenizer

texts = ["Galatasaray, bu akşamki maçın ardından şampiyonluğunu ilan etmeye hazırlanıyor."]
labels = ["spor", "siyaset", "kültür"]

model_name = 'mys/bert-base-turkish-cased-nli-mean'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModel.from_pretrained(model_name)

def label_text(model, tokenizer, texts, labels):
    texts_length = len(texts)
    # Encode the texts and the candidate labels together in a single batch
    tokens = tokenizer(texts + labels, padding=True, return_tensors='tf')
    embs = model(**tokens)[0]

    # Mean-pool the token embeddings, ignoring padding via the attention mask
    attention_masks = tf.cast(tokens['attention_mask'], tf.float32)
    sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True)
    masked_embs = embs * tf.expand_dims(attention_masks, axis=-1)
    masked_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32)

    # Inner product between text and label embeddings, softmaxed over the labels
    dists = tf.experimental.numpy.inner(masked_embs[:texts_length], masked_embs[texts_length:])
    scores = tf.nn.softmax(dists)

    results = list(zip(labels, scores.numpy().squeeze().tolist()))
    sorted_results = sorted(results, key=lambda x: x[1], reverse=True)
    sorted_results = [{"label": label, "score": f"{score:.4f}"} for label, score in sorted_results]
    return sorted_results

start = time.time()
sorted_results = label_text(model, tokenizer, texts, labels)
elapsed = time.time() - start
print(sorted_results)
print(f"Processed in {elapsed:.2f} secs")
```
Output:
```
[{'label': 'spor', 'score': '1.0000'}, {'label': 'siyaset', 'score': '0.0000'}, {'label': 'kültür', 'score': '0.0000'}]
Processed in 0.22 secs
```
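Since the model was fine-tuned with sentence-transformers and uses mean pooling, the embeddings can likely also be computed through that library. The snippet below is a minimal sketch under that assumption: it relies on sentence-transformers falling back to its default Transformer + mean-pooling setup if this repo does not ship a sentence-transformers config, so verify that its scores match the TF implementation above before relying on it.

```python
# Hedged sketch: assumes sentence-transformers can load this checkpoint directly,
# falling back to mean pooling when no sentence-transformers config is present.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('mys/bert-base-turkish-cased-nli-mean')

texts = ["Galatasaray, bu akşamki maçın ardından şampiyonluğunu ilan etmeye hazırlanıyor."]
labels = ["spor", "siyaset", "kültür"]

# Encode texts and candidate labels into mean-pooled sentence embeddings
text_embs = model.encode(texts)
label_embs = model.encode(labels)

# Dot-product similarity between each text and each label, as in label_text() above
scores = util.dot_score(text_embs, label_embs)
print(scores)
```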
## How it works
The `label_text()` function runs the BERT model on the concatenation of `texts` and `labels`, and aggregates the per-token hidden states output by BERT into a single vector per sequence using a masked mean. The inner product between text embeddings and label embeddings is then used as the similarity metric, and a `softmax` over the labels converts these similarity scores into probabilities.
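Concretely, this is the computation performed in `label_text()`: with $h_{i,t}$ the hidden state of token $t$ in sequence $i$ and $m_{i,t} \in \{0, 1\}$ its attention-mask value, the pooled embedding and the probability of label $\ell_j$ for text $x_i$ are

$$
e_i = \frac{\sum_t m_{i,t}\, h_{i,t}}{\sum_t m_{i,t}},
\qquad
p(\ell_j \mid x_i) = \operatorname{softmax}_j\!\left( e_{x_i} \cdot e_{\ell_j} \right).
$$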
## Dataset
Emrah Budur, Rıza Özçelik, Tunga Güngör and Christopher Potts. 2020. Data and Representation for Turkish Natural Language Inference. To appear in Proceedings of EMNLP. [pdf] [bib]
```bibtex
@inproceedings{budur-etal-2020-data,
    title = "Data and Representation for Turkish Natural Language Inference",
    author = "Budur, Emrah and
      \"{O}z\c{c}elik, R{\i}za and
      G\"{u}ng\"{o}r, Tunga and
      Potts, Christopher",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics"
}
```