---
library_name: transformers
tags:
- flattery
- business calls
- speech
language:
- en
pipeline_tag: audio-classification
inference: false
---

# Flattery Prediction from Speech

<!-- Provide a quick summary of what the model is/does. -->

This Wav2Vec2 model was fine-tuned to predict **flattery from speech** in English **earnings calls**. It was introduced in [This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach](http://arxiv.org/abs/2406.17667), which was accepted at INTERSPEECH 2024.
If you are looking for the text-based classifier (based on RoBERTa) introduced in the paper, please see [here](https://huggingface.co/chrlukas/flattery_prediction_text).

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This is a (further) fine-tuned variant of a [Wav2Vec2 model for Speech Emotion Recognition on MSP-Podcast](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim). It was trained on a dataset of single sentences uttered in business calls, each labeled for flattery in a binary manner. The training set comprised 7,167 sentences; 1,878 sentences were used as the development set. For more details, please refer to [the paper](http://arxiv.org/abs/2406.17667), especially Section 2 for the dataset, Section 3.2.2 for the training procedure, and Section 4.2 for the results. The checkpoint provided here was trained without further pruning the model.
It achieves Unweighted Average Recall (UAR) values of .8001 and .8084 on the development and test partitions, respectively.
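
For reference, Unweighted Average Recall is the macro-average of the per-class recalls, so both classes count equally regardless of class imbalance. A minimal sketch of how it can be computed with scikit-learn (the labels below are purely illustrative, not from the paper's data):

```python
from sklearn.metrics import recall_score

# purely illustrative binary labels: 1 = flattery, 0 = no flattery
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0]

# UAR = unweighted (macro) average of per-class recall
uar = recall_score(y_true, y_pred, average='macro')
print(f'UAR: {uar:.4f}')
```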


### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** [More Information Needed]
- **Paper:** [This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach](http://arxiv.org/abs/2406.17667)


## Usage

The following snippet illustrates the usage of the model. 
```python
import torch
import librosa
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

# initialize feature extractor and model
checkpoint = "chrlukas/flattery_prediction_speech"
processor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForSequenceClassification.from_pretrained(checkpoint)
model.eval()

# predict flattery in a sentence
example_file = 'example.wav'
# audio must be resampled to 16 kHz
y, _ = librosa.load(example_file, sr=16000)
inp = processor(y, sampling_rate=16000, return_tensors='pt')
with torch.no_grad():
    logits = model(**inp).logits
prediction = torch.sigmoid(logits).item()
flattery = prediction >= 0.5
print(f'Flattery detected? {flattery}')
```
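
To score several clips at once, the feature extractor can pad variable-length inputs into a single batch. A minimal sketch building on the snippet above (the file names are hypothetical; `model` and `processor` are assumed to be loaded as shown):

```python
# hypothetical list of audio files, each resampled to 16 kHz on load
clips = ['call_1.wav', 'call_2.wav', 'call_3.wav']
waveforms = [librosa.load(f, sr=16000)[0] for f in clips]

# pad the variable-length waveforms into a single batch
inp = processor(waveforms, sampling_rate=16000, padding=True, return_tensors='pt')
with torch.no_grad():
    logits = model(**inp).logits
scores = torch.sigmoid(logits).squeeze(-1)
for f, score in zip(clips, scores):
    print(f'{f}: flattery={score.item() >= 0.5} (score={score.item():.3f})')
```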


## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

The model was trained on a highly domain-specific dataset sourced from earnings calls, i.e., typically conversations between business analysts and CEOs of US companies. Hence, it cannot be expected to generalize well to other
domains and contexts. Moreover, the majority of speakers (162/178) in the training dataset are male. However, we found this to have rather little impact on the model's performance for
held-out female speakers (cf. Section 4.4 in the paper).


## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

TODO