---
library_name: transformers
tags:
- flattery
- business calls
- speech
language:
- en
pipeline_tag: audio-classification
inference: false
---

# Flattery Prediction from Speech

This Wav2Vec2 model was fine-tuned to predict **flattery from speech** in English **earnings calls**. It was introduced in [This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach](http://arxiv.org/abs/2406.17667), which was accepted at INTERSPEECH 2024.

If you are looking for the text-based classifier (based on RoBERTa) introduced in the paper, please see [here](https://huggingface.co/chrlukas/flattery_prediction_text).

## Model Details

### Model Description

This is a further fine-tuned variant of a [Wav2Vec2 model for Speech Emotion Recognition in MSP](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim). It was trained on a dataset of single sentences uttered in business calls, each labeled in a binary manner as flattery or not flattery. The training set comprises 7167 sentences; 1878 sentences were used as the development set.

For more details, please refer to [the paper](http://arxiv.org/abs/2406.17667), especially Section 2 for the dataset, Section 3.2.2 for the training procedure, and Section 4.2 for the results.

The checkpoint provided here was trained without further pruning the model. It achieves Unweighted Average Recall (UAR, the mean of the per-class recalls) values of .8001 and .8084 on the development and test partitions, respectively.

### Model Sources

- **Repository:** [More Information Needed]
- **Paper:** [arXiv:2406.17667](http://arxiv.org/abs/2406.17667)

## Usage

The following snippet illustrates the usage of the model.

```python
import librosa
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

# initialize the model and its feature extractor
checkpoint = "chrlukas/flattery_prediction_speech"
processor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForSequenceClassification.from_pretrained(checkpoint)
model.eval()

# predict flattery in a single spoken sentence
example_file = 'example.wav'
# audio must be resampled to 16 kHz
y, _ = librosa.load(example_file, sr=16000)
inp = processor(y, sampling_rate=16000, return_tensors='pt')
with torch.no_grad():
    logits = model(**inp).logits
    # the sigmoid maps the single logit to a flattery probability
    prediction = torch.sigmoid(logits).item()
    flattery = prediction >= 0.5
print(f'Flattery detected? {flattery}')
```

## Bias, Risks, and Limitations

The model is trained on a highly domain-specific dataset sourced from earnings calls, i.e., typically conversations between business analysts and CEOs of US companies. Hence, it cannot be expected to generalize well to other domains and contexts. Moreover, the majority of speakers (162/178) in the training dataset are male. However, we found this to have rather little impact on the model's performance for held-out female speakers (cf. Section 4.4 in the paper).

## Citation

**BibTeX:**

TODO