
Multiword expression recognition.

A multiword expression (MWE) is a combination of words which exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim, 2010). The objective of Multiword Expression Recognition (MWER) is to automate the identification of these MWEs.

Model description

camembert-mwer is a token classification model fine-tuned from CamemBERT on the Sequoia dataset for the MWER task.

How to use

You can use this model directly with a pipeline for token classification:

>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("bvantuan/camembert-mwer")
>>> model = AutoModelForTokenClassification.from_pretrained("bvantuan/camembert-mwer")
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer)
>>> sentence = "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion et présenter quelques oeuvres pour mettre en bouche les participants."
>>> mwes = mwe_classifier(sentence)

[{'entity': 'B-MWE',
  'score': 0.99492574,
  'index': 4,
  'word': '▁rendez',
  'start': 15,
  'end': 22},
 {'entity': 'I-MWE',
  'score': 0.9344883,
  'index': 5,
  'word': '-',
  'start': 22,
  'end': 23},
 {'entity': 'I-MWE',
  'score': 0.99398583,
  'index': 6,
  'word': 'vous',
  'start': 23,
  'end': 27},
 {'entity': 'B-VID',
  'score': 0.9827843,
  'index': 22,
  'word': '▁mettre',
  'start': 106,
  'end': 113},
 {'entity': 'I-VID',
  'score': 0.9835186,
  'index': 23,
  'word': '▁en',
  'start': 113,
  'end': 116},
 {'entity': 'I-VID',
  'score': 0.98324823,
  'index': 24,
  'word': '▁bouche',
  'start': 116,
  'end': 123}]

>>> mwe_classifier.group_entities(mwes)

[{'entity_group': 'MWE',
  'score': 0.9744666,
  'word': 'rendez-vous',
  'start': 15,
  'end': 27},
 {'entity_group': 'VID',
  'score': 0.9831837,
  'word': 'mettre en bouche',
  'start': 106,
  'end': 123}]
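
The same grouping can also be obtained directly by passing the standard `aggregation_strategy` argument of `transformers` token-classification pipelines; a minimal sketch, assuming it yields the same grouped spans as calling group_entities on the raw predictions:

>>> mwe_grouper = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy='simple')
>>> mwe_grouper(sentence)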

Training data

The Sequoia dataset is divided into train/dev/test sets:

|              | Sequoia | train | dev | test |
|--------------|---------|-------|-----|------|
| #sentences   | 3099    | 1955  | 273 | 871  |
| #MWEs        | 3450    | 2170  | 306 | 974  |
| #Unseen MWEs | –       | –     | 100 | 300  |

This dataset has 6 distinct categories:

  • MWE: Non-verbal MWEs (e.g. à peu près)
  • IRV: Inherently reflexive verb (e.g. s'occuper)
  • LVC.cause: Causative light-verb construction (e.g. causer le bouleversement)
  • LVC.full: Light-verb construction (e.g. avoir pour but de)
  • MVC: Multi-verb construction (e.g. faire remarquer)
  • VID: Verbal idiom (e.g. voir le jour)

Training procedure

Preprocessing

The sequence labeling scheme used for this task is Inside–Outside–Beginning (IOB2) tagging, illustrated below.
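
As a rough word-level illustration (constructed for this card, not taken from the corpus files), the first word of an expression receives a B- tag, the following words of the same expression receive I- tags, and every other word is tagged O:

# Hypothetical IOB2 tags for the example sentence used above;
# only the words belonging to an expression carry B-/I- tags.
iob2_example = [
    ("Pour", "O"), ("ce", "O"), ("premier", "O"),
    ("rendez-vous", "B-MWE"), (",", "O"),
    ("l'", "O"), ("animateur", "O"), ("a", "O"), ("pu", "O"),
    ("faire", "O"), ("partager", "O"), ("sa", "O"), ("passion", "O"),
    ("et", "O"), ("présenter", "O"), ("quelques", "O"), ("oeuvres", "O"),
    ("pour", "O"), ("mettre", "B-VID"), ("en", "I-VID"), ("bouche", "I-VID"),
    ("les", "O"), ("participants", "O"), (".", "O"),
]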

Fine-tuning

The model was fine-tuned on the combined train and dev sets with a learning rate of $3 \times 10^{-5}$, a batch size of 10, and 15 epochs.
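
A minimal fine-tuning sketch with the transformers Trainer using these hyperparameters; the starting checkpoint (shown here as camembert-base), the label list and the toy dataset are placeholders for illustration, not the exact released training setup:

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Assumed IOB2 label inventory built from the 6 categories listed above
label_list = ["O"] + [f"{p}-{c}"
                      for c in ["MWE", "IRV", "LVC.cause", "LVC.full", "MVC", "VID"]
                      for p in ["B", "I"]]

tokenizer = AutoTokenizer.from_pretrained("camembert-base")  # placeholder CamemBERT checkpoint
model = AutoModelForTokenClassification.from_pretrained(
    "camembert-base", num_labels=len(label_list)
)

# Toy stand-in for the IOB2-labelled train+dev split of Sequoia
raw = Dataset.from_dict({
    "tokens": [["mettre", "en", "bouche", "les", "participants"]],
    "tags": [["B-VID", "I-VID", "I-VID", "O", "O"]],
})

def encode(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    # Each sub-token inherits the tag of the word it comes from (one common convention);
    # special tokens get -100 so they are ignored by the loss.
    enc["labels"] = [
        -100 if word_id is None else label_list.index(example["tags"][word_id])
        for word_id in enc.word_ids()
    ]
    return enc

train_dataset = raw.map(encode, remove_columns=["tokens", "tags"])

args = TrainingArguments(
    output_dir="camembert-mwer",
    learning_rate=3e-5,              # reported learning rate
    per_device_train_batch_size=10,  # reported batch size
    num_train_epochs=15,             # reported number of epochs
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()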

Evaluation results

On the test set, this model achieves the following results:

|                  | Precision | Recall | F1    |
|------------------|-----------|--------|-------|
| Global MWE-based | 83.78     | 83.78  | 83.78 |
| Unseen MWE-based | 57.05     | 60.67  | 58.80 |

BibTeX entry and citation info

@article{martin2019camembert,
  title={CamemBERT: a tasty French language model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de La Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:1911.03894},
  year={2019}
}

@article{candito2020french,
  title={A French corpus annotated for multiword expressions and named entities},
  author={Candito, Marie and Constant, Mathieu and Ramisch, Carlos and Savary, Agata and Guillaume, Bruno and Parmentier, Yannick and Cordeiro, Silvio Ricardo},
  journal={Journal of Language Modelling},
  volume={8},
  number={2},
  year={2020},
  publisher={Polska Akademia Nauk. Instytut Podstaw Informatyki PAN}
}