
Small print

Warning: This demo is highly experimental and not ready for production use.

This demo is a proof of concept for visualizing the semantic differences between two text documents. The input documents may or may not be written in the same language.

In our paper, we evaluate three simple, unsupervised approaches based on BERT-like encoder models. This demo implements the approaches DiffAlign and DiffDel using the model ZurichNLP/unsup-simcse-xlm-roberta-base. See the model tags for a list of the ~100 supported languages.

DiffAlign aligns the words of the two documents using cosine similarity between the word embeddings (cf. SimAlign, BERTScore). Words with low similarity are highlighted.
DiffDel calculates sentence similarity between the two input documents (cf. SimCSE). The algorithm highlights words whose deletion has a positive effect on the similarity score.
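To make the two scoring ideas concrete, here is a minimal sketch in plain Python. It is not the demo's implementation. It assumes hand-made toy word vectors in place of the encoder's embeddings, and it assumes a sentence embedding is simply the mean of the word vectors, whereas the demo uses a SimCSE-style model. The function names `diff_align` and `diff_del` and the exact scoring formulas (one minus the best cosine match, and the change in similarity after deleting a word) are illustrative simplifications of the descriptions above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean; stands in for a real sentence encoder (assumption)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def diff_align(doc_a, doc_b, emb):
    """Score each word of doc_a as 1 minus its best cosine match in doc_b."""
    return [1.0 - max(cosine(emb[w], emb[o]) for o in doc_b) for w in doc_a]


def diff_del(doc_a, doc_b, emb):
    """Score each word of doc_a by how much deleting it raises similarity to doc_b."""
    target = mean_vector([emb[w] for w in doc_b])
    base = cosine(mean_vector([emb[w] for w in doc_a]), target)
    scores = []
    for i in range(len(doc_a)):
        rest = doc_a[:i] + doc_a[i + 1:]
        if not rest:  # a one-word document has nothing left after deletion
            scores.append(0.0)
            continue
        sim = cosine(mean_vector([emb[w] for w in rest]), target)
        scores.append(sim - base)
    return scores


# Toy embeddings (hypothetical): each word gets its own orthogonal axis.
emb = {
    "the": [1.0, 0.0, 0.0, 0.0],
    "cat": [0.0, 1.0, 0.0, 0.0],
    "sleeps": [0.0, 0.0, 1.0, 0.0],
    "today": [0.0, 0.0, 0.0, 1.0],
}
doc_a = ["the", "cat", "sleeps", "today"]
doc_b = ["the", "cat", "sleeps"]

align_scores = diff_align(doc_a, doc_b, emb)
del_scores = diff_del(doc_a, doc_b, emb)
for word, a, d in zip(doc_a, align_scores, del_scores):
    print(f"{word:8s} align={a:.3f} del={d:+.3f}")
```

In this toy setup, "today" is the only word without a counterpart in the second document, so it receives the highest score under both functions. Words shared by both documents get an alignment score of zero and a negative deletion score, because removing them makes the documents less similar.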


Citation

@inproceedings{vamvas-sennrich-2023-rsd,
      title = "Towards Unsupervised Recognition of Token-level Semantic Differences in Related Documents",
      author = "Jannis Vamvas and Rico Sennrich",
      month = dec,
      year = "2023",
      booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
      address = "Singapore",
      publisher = "Association for Computational Linguistics",
}