|
--- |
|
library_name: transformers |
|
tags: |
|
- flattery |
|
- business calls |
|
- speech |
|
language: |
|
- en |
|
pipeline_tag: audio-classification |
|
inference: false |
|
--- |
|
|
|
# Flattery Prediction from Speech |
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
|
This Wav2Vec2 model was fine-tuned to predict **flattery from speech** in English **earnings calls**. It was introduced in [This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach](http://arxiv.org/abs/2406.17667), which was accepted at INTERSPEECH 2024.
|
If you are looking for the text-based classifier (based on RoBERTa) introduced in the paper, please see [here](https://huggingface.co/chrlukas/flattery_prediction_text). |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
This is a (further) fine-tuned variant of a [Wav2Vec2 model for Speech Emotion Recognition trained on MSP-Podcast](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim). It was trained on a dataset of single sentences uttered in business calls, each labeled for flattery in a binary manner. The training set comprised 7167 sentences; a further 1878 sentences were used as the development set. For more details, please refer to [the paper](http://arxiv.org/abs/2406.17667), in particular Section 2 for the dataset, Section 3.2.2 for the training procedure, and Section 4.2 for the results. The checkpoint provided here was trained without any further pruning of the model.

It achieves Unweighted Average Recall (UAR) values of .8001 and .8084 on the development and test partitions, respectively.
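UAR here is the macro-averaged recall over the two classes. As a reference point, a minimal sketch of how it can be computed with scikit-learn (the label arrays below are purely hypothetical):

```python
from sklearn.metrics import recall_score

# hypothetical binary ground-truth labels and model predictions (1 = flattery)
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0]

# UAR is the unweighted (macro) average of the per-class recall values
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.4f}")
```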
|
|
|
|
|
### Model Sources |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** [More Information Needed] |
|
- **Paper:** [Flattery Detection Utilising an Audio-Textual Transformer-Based Approach](http://arxiv.org/abs/2406.17667)
|
|
|
|
|
## Usage |
|
|
|
The following snippet illustrates the usage of the model. |
|
```python
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification
from torch import sigmoid
import torch
import librosa

# initialize feature extractor and model
checkpoint = "chrlukas/flattery_prediction_speech"
processor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForSequenceClassification.from_pretrained(checkpoint)
model.eval()

# predict flattery in a sentence
example_file = 'example.wav'
# audio must be resampled to 16 kHz
y, _ = librosa.load(example_file, sr=16000)
inp = processor(y, sampling_rate=16000, return_tensors='pt')
with torch.no_grad():
    logits = model(**inp).logits
prediction = sigmoid(logits).item()
flattery = prediction >= 0.5
print(f'Flattery detected? {flattery}')
```
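To score several recordings at once, the feature extractor can pad a batch of waveforms to a common length. Below is a minimal sketch that reuses `processor` and `model` from the snippet above; the file names are hypothetical:

```python
import librosa
import torch

files = ["sentence_1.wav", "sentence_2.wav"]  # hypothetical paths
waveforms = [librosa.load(f, sr=16000)[0] for f in files]

# pad the batch to a common length and run a single forward pass
batch = processor(waveforms, sampling_rate=16000, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits

# per-file flattery probabilities and binary decisions
probs = torch.sigmoid(logits).squeeze(-1)
for f, p in zip(files, probs.tolist()):
    print(f"{f}: p(flattery) = {p:.3f} -> {p >= 0.5}")
```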
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
<!-- This section is meant to convey both technical and sociotechnical limitations. --> |
|
|
|
The model is trained on a highly domain-specific dataset sourced from earnings calls, i.e., typically conversations between business analysts and CEOs of US-American companies. Hence, it cannot be expected to generalize well to other domains and contexts. Moreover, the majority of speakers (162/178) in the training dataset are male. However, we found this to have rather little impact on the model's performance for held-out female speakers (cf. Section 4.4 in the paper).
|
|
|
|
|
## Citation |
|
|
|
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
|
|
|
**BibTeX:** |
|
|
|
TODO |