|
--- |
|
library_name: transformers |
|
tags: |
|
- flattery |
|
- business calls |
|
- speech |
|
language: |
|
- en |
|
pipeline_tag: audio-classification |
|
inference: false |
|
--- |
|
|
|
# Flattery Prediction from Speech |
|
|
|
<!-- Provide a quick summary of what the model is/does. --> |
|
|
|
This Wav2Vec2 model was fine-tuned to predict **flattery from speech** in English **earnings calls**. It was introduced in [This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach](http://arxiv.org/abs/2406.17667), which was accepted at INTERSPEECH 2024.
|
If you are looking for the text-based classifier (based on RoBERTa) introduced in the paper, please see [here](https://huggingface.co/chrlukas/flattery_prediction_text). |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
This is a (further) fine-tuned variant of a [Wav2Vec2 model for Speech Emotion Recognition trained on MSP-Podcast](https://huggingface.co/audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim). It was trained on a dataset of single sentences uttered in business calls, each labeled for flattery in a binary manner. The training set comprised 7167 sentences; a further 1878 sentences were used as the development set. For more details, please refer to [the paper](http://arxiv.org/abs/2406.17667), in particular Section 2 for the dataset, Section 3.2.2 for the training procedure, and Section 4.2 for the results. The checkpoint provided here was trained without any further pruning of the model.

It achieves Unweighted Average Recall (UAR) values of .8001 and .8084 on the development and test partitions, respectively.
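UAR here is the macro-averaged recall over the two classes. As a reference point, a minimal sketch of how it can be computed with scikit-learn (the label arrays below are purely hypothetical):

```python
from sklearn.metrics import recall_score

# hypothetical binary ground-truth labels and model predictions (1 = flattery)
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0]

# UAR is the unweighted (macro) average of the per-class recall values
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.4f}")
```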
|
|
|
|
|
### Model Sources |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** [More Information Needed] |
|
- **Paper:** [Flattery Detection Utilising an Audio-Textual Transformer-Based Approach](http://arxiv.org/abs/2406.17667)
|
|
|
|
|
## Usage |
|
|
|
The following snippet illustrates the usage of the model. |
|
```python
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification
from torch import sigmoid
import torch
import librosa

# initialize feature extractor and model
checkpoint = "chrlukas/flattery_prediction_speech"
processor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForSequenceClassification.from_pretrained(checkpoint)
model.eval()

# predict flattery in a sentence
example_file = 'example.wav'
# audio must be resampled to 16 kHz
y, _ = librosa.load(example_file, sr=16000)
inp = processor(y, sampling_rate=16000, return_tensors='pt')
with torch.no_grad():
    logits = model(**inp).logits
prediction = sigmoid(logits).item()
flattery = prediction >= 0.5
print(f'Flattery detected? {flattery}')
```
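To score several recordings at once, the feature extractor can pad a batch of waveforms to a common length. Below is a minimal sketch that reuses `processor` and `model` from the snippet above; the file names are hypothetical:

```python
import librosa
import torch

files = ["sentence_1.wav", "sentence_2.wav"]  # hypothetical paths
waveforms = [librosa.load(f, sr=16000)[0] for f in files]

# pad the batch to a common length and run a single forward pass
batch = processor(waveforms, sampling_rate=16000, padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits

# per-file flattery probabilities and binary decisions
probs = torch.sigmoid(logits).squeeze(-1)
for f, p in zip(files, probs.tolist()):
    print(f"{f}: p(flattery) = {p:.3f} -> {p >= 0.5}")
```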
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
<!-- This section is meant to convey both technical and sociotechnical limitations. --> |
|
|
|
The model is trained on a highly domain-specific dataset sourced from earnings calls, i.e., typically conversations between business analysts and CEOs of US-American companies. Hence, it cannot be expected to generalize well to other domains and contexts. Moreover, the majority of speakers (162/178) in the training dataset are male. However, we found this to have rather little impact on the model's performance for held-out female speakers (cf. Section 4.4 in the paper).
|
|
|
|
|
## Citation |
|
|
|
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
|
|
|
**BibTeX:** |
|
|
|
TODO |