|
--- |
|
license: cc-by-nc-sa-4.0 |
|
language: |
|
- en |
|
- de |
|
- zh |
|
- fr |
|
- nl |
|
- el |
|
- it |
|
- es |
|
- my |
|
- he |
|
- sv |
|
- fa |
|
- tr |
|
- ur |
|
library_name: transformers |
|
pipeline_tag: audio-classification |
|
tags: |
|
- Speech Emotion Recognition |
|
- SER |
|
- Transformer |
|
- HuBERT |
|
- PyTorch |
|
--- |
|
|
|
# **ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets** |
|
Authors: Shahin Amiriparian, Filip Packań, Maurice Gerczuk, Björn W. Schuller |
|
|
|
Fine-tuned [**HuBERT Large**](https://huggingface.co/facebook/hubert-large-ls960-ft) on EmoSet++, comprising 37 datasets, totaling 150,907 samples and spanning a cumulative duration of 119.5 hours. |
|
The model expects a 3-second raw waveform resampled to 16 kHz. The original six output classes are combinations of low/high arousal and negative/neutral/positive valence.

Further details are available in the corresponding [**paper**](https://arxiv.org/).
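
The six outputs form a 2 × 3 grid over arousal and valence. The sketch below spells this grid out; note that the concrete index-to-class ordering of the checkpoint is not documented in this card, so the mapping shown is an illustrative assumption only.

```python
# Illustrative 2 x 3 layout of the six classes (low/high arousal x negative/neutral/positive valence).
# The index order is a hypothetical assumption for illustration, not the checkpoint's documented ordering.
CLASS_GRID = {
    0: ("low arousal", "negative valence"),
    1: ("low arousal", "neutral valence"),
    2: ("low arousal", "positive valence"),
    3: ("high arousal", "negative valence"),
    4: ("high arousal", "neutral valence"),
    5: ("high arousal", "positive valence"),
}
```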
|
|
|
**Note**: This model is for research purposes only.
|
|
|
### EmoSet++ subsets used for fine-tuning the model: |
|
|
|
| | | | | |
| :---: | :---: | :---: | :---: | :---: |
| ABC [[1]](#1) | AD [[2]](#2) | BES [[3]](#3) | CASIA [[4]](#4) | CVE [[5]](#5) |
| Crema-D [[6]](#6) | DES [[7]](#7) | DEMoS [[8]](#8) | EA-ACT [[9]](#9) | EA-BMW [[9]](#9) |
| EA-WSJ [[9]](#9) | EMO-DB [[10]](#10) | EmoFilm [[11]](#11) | EmotiW-2014 [[12]](#12) | EMOVO [[13]](#13) |
| eNTERFACE [[14]](#14) | ESD [[15]](#15) | EU-EmoSS [[16]](#16) | EU-EV [[17]](#17) | FAU Aibo [[18]](#18) |
| GEMEP [[19]](#19) | GVESS [[20]](#20) | IEMOCAP [[21]](#21) | MES [[3]](#3) | MESD [[22]](#22) |
| MELD [[23]](#23) | PPMMK [[2]](#2) | RAVDESS [[24]](#24) | SAVEE [[25]](#25) | ShEMO [[26]](#26) |
| SmartKom [[27]](#27) | SIMIS [[28]](#28) | SUSAS [[29]](#29) | SUBESCO [[30]](#30) | TESS [[31]](#31) |
| TurkishEmo [[2]](#2) | Urdu [[32]](#32) | | | |
|
|
|
|
|
|
|
### Usage |
|
|
|
```python
import torch
import torch.nn as nn
from transformers import AutoModelForAudioClassification, Wav2Vec2FeatureExtractor

# CONFIG and MODEL SETUP
model_name = 'amiriparian/ExHuBERT'
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = AutoModelForAudioClassification.from_pretrained(model_name, trust_remote_code=True,
                                                        revision="b158d45ed8578432468f3ab8d46cbe5974380812")

# Freezing half of the encoder
model.freeze_og_encoder()

sampling_rate = 16000
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
```
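
Building on the setup above, a minimal inference sketch might look as follows. The random waveform is only a stand-in for a real 3-second, 16 kHz mono recording, and since the checkpoint is loaded via `trust_remote_code`, the exact return type of its forward pass is an assumption here (handled for both a raw tensor and a standard `transformers` output object).

```python
import numpy as np

# Stand-in for a real 3-second, 16 kHz mono recording (random noise for demonstration only).
waveform = np.random.randn(3 * sampling_rate).astype(np.float32)

# Convert the raw waveform into model inputs.
inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
input_values = inputs.input_values.to(device)

model.eval()
with torch.no_grad():
    output = model(input_values)
    # Remote-code models may return raw logits or a standard output object; handle both.
    logits = output.logits if hasattr(output, "logits") else output
    probs = torch.softmax(logits, dim=-1)  # expected shape: (1, 6)

predicted_class = int(probs.argmax(dim=-1))
print(f"Predicted class index: {predicted_class}, probabilities: {probs.squeeze().tolist()}")
```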
|
|
|
### Citation Info |
|
|
|
|
|
```
@inproceedings{Amiriparian24-EEH,
  author    = {Shahin Amiriparian and Filip Packan and Maurice Gerczuk and Bj\"orn W.\ Schuller},
  title     = {{ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets}},
  booktitle = {{Proc. INTERSPEECH}},
  year      = {2024},
  address   = {Kos Island, Greece},
  month     = {September},
  publisher = {ISCA},
}
```
|
|
|
### References |
|
|
|
<a id="1">[1]</a> |
|
B. Schuller, D. Arsic, G. Rigoll, M. Wimmer, and B. Radig. Audiovisual Behavior |
|
Modeling by Combined Feature Spaces. In 2007 IEEE International Conference on |
|
Acoustics, Speech and Signal Processing - ICASSP ’07, volume 2, pages II–733–II– |
|
736, Apr. 2007. |
|
|
|
|
|
<a id="2">[2]</a> |
|
M. Gerczuk, S. Amiriparian, S. Ottl, and B. W. Schuller. EmoNet: A Transfer |
|
Learning Framework for Multi-Corpus Speech Emotion Recognition. IEEE Trans- |
|
actions on Affective Computing, 14(2):1472–1487, Apr. 2023. |
|
|
|
|
|
<a id="3">[3]</a> |
|
T. L. Nwe, S. W. Foo, and L. C. De Silva. Speech emotion recognition using hidden |
|
Markov models. Speech Communication, 41(4):603–623, Nov. 2003. |
|
|
|
|
|
<a id="4">[4]</a> |
|
The selected speech emotion database of institute of automation chineseacademy of |
|
sciences (casia). http://www.chineseldc.org/resource_info.php?rid=76. accessed March 2024. |
|
|
|
|
|
<a id="5">[5]</a> |
|
P. Liu and M. D. Pell. Recognizing vocal emotions in Mandarin Chinese: A val- |
|
idated database of Chinese vocal emotional stimuli. Behavior Research Methods, |
|
44(4):1042–1051, Dec. 2012. |
|
|
|
|
|
<a id="6">[6]</a> |
|
H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma. |
|
CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset. IEEE transactions on affective computing, 5(4):377–390, 2014. |
|
|
|
|
|
|
|
<a id="7">[7]</a> |
|
I. S. Engberg, A. V. Hansen, O. K. Andersen, and P. Dalsgaard. Design Record- |
|
ing and Verification of a Danish Emotional Speech Database: Design Recording |
|
and Verification of a Danish Emotional Speech Database. EUROSPEECH’97 : 5th |
|
European Conference on Speech Communication and Technology, Patras, Rhodes, |
|
Greece, 22-25 September 1997, pages Vol. 4, pp. 1695–1698, 1997. |
|
|
|
|
|
|
|
<a id="8">[8]</a> |
|
E. Parada-Cabaleiro, G. Costantini, A. Batliner, M. Schmitt, and B. W. Schuller. |
|
DEMoS: An Italian emotional speech corpus. Language Resources and Evaluation, |
|
54(2):341–383, June 2020. |
|
|
|
|
|
<a id="9">[9]</a> |
|
B. Schuller. Automatische Emotionserkennung Aus Sprachlicher Und Manueller |
|
Interaktion. PhD thesis, Technische Universit¨at M¨unchen, 2006. |
|
|
|
|
|
<a id="10">[10]</a> |
|
F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss. A database |
|
of German emotional speech. In Interspeech 2005, pages 1517–1520. ISCA, Sept. |
|
2005. |
|
|
|
|
|
<a id="11">[11]</a> |
|
E. Parada-Cabaleiro, G. Costantini, A. Batliner, A. Baird, and B. Schuller. |
|
Categorical vs Dimensional Perception of Italian Emotional Speech. In Interspeech 2018, |
|
pages 3638–3642. ISCA, Sept. 2018. |
|
|
|
|
|
<a id="12">[12]</a> |
|
A. Dhall, R. Goecke, J. Joshi, K. Sikka, and T. Gedeon. Emotion Recognition In |
|
The Wild Challenge 2014: Baseline, Data and Protocol. In Proceedings of the 16th |
|
International Conference on Multimodal Interaction, ICMI ’14, pages 461–466, New |
|
York, NY, USA, Nov. 2014. Association for Computing Machinery. |
|
|
|
|
|
<a id="13">[13]</a> |
|
G. Costantini, I. Iaderola, A. Paoloni, and M. Todisco. EMOVO Corpus: An Italian |
|
Emotional Speech Database. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, |
|
B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceed- |
|
ings of the Ninth International Conference on Language Resources and Evaluation |
|
(LREC’14), pages 3501–3504, Reykjavik, Iceland, May 2014. European Language |
|
Resources Association (ELRA). |
|
|
|
|
|
|
|
<a id="14">[14]</a> |
|
O. Martin, I. Kotsia, B. Macq, and I. Pitas. The eNTERFACE’ 05 Audio-Visual |
|
Emotion Database. In 22nd International Conference on Data Engineering Work- |
|
shops (ICDEW’06), pages 8–8, Apr. 2006. |
|
|
|
|
|
|
|
|
|
<a id="15">[15]</a> |
|
K. Zhou, B. Sisman, R. Liu, and H. Li. Seen and Unseen emotional style transfer |
|
for voice conversion with a new emotional speech dataset, Feb. 2021. |
|
|
|
|
|
|
|
<a id="16">[16]</a> |
|
H. O’Reilly, D. Pigat, S. Fridenson, S. Berggren, S. Tal, O. Golan, S. B¨olte, S. Baron- |
|
Cohen, and D. Lundqvist. The EU-Emotion Stimulus Set: A validation study. |
|
Behavior Research Methods, 48(2):567–576, June 2016. |
|
|
|
|
|
|
|
<a id="17">[17]</a> |
|
A. Lassalle, D. Pigat, H. O’Reilly, S. Berggen, S. Fridenson-Hayo, S. Tal, S. Elfstr¨om, |
|
A. R˚ade, O. Golan, S. B¨olte, S. Baron-Cohen, and D. Lundqvist. The EU-Emotion |
|
Voice Database. Behavior Research Methods, 51(2):493–506, Apr. 2019. |
|
|
|
|
|
<a id="18">[18]</a> |
|
A. Batliner, S. Steidl, and E. Noth. Releasing a thoroughly annotated and processed |
|
spontaneous emotional database: The FAU Aibo Emotion Corpus. 2008. |
|
|
|
|
|
<a id="19">[19]</a> |
|
K. R. Scherer, T. B¨anziger, and E. Roesch. A Blueprint for Affective Computing: |
|
A Sourcebook and Manual. OUP Oxford, Sept. 2010. |
|
|
|
|
|
<a id="20">[20]</a> |
|
R. Banse and K. R. Scherer. Acoustic profiles in vocal emotion expression. Journal |
|
of Personality and Social Psychology, 70(3):614–636, 1996. |
|
|
|
|
|
<a id="21">[21]</a> |
|
C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, |
|
S. Lee, and S. S. Narayanan. IEMOCAP: Interactive emotional dyadic motion |
|
capture database. Language Resources and Evaluation, 42(4):335–359, Dec. 2008. |
|
|
|
<a id="22">[22]</a> |
|
M. M. Duville, L. M. Alonso-Valerdi, and D. I. Ibarra-Zarate. The Mexican Emo- |
|
tional Speech Database (MESD): Elaboration and assessment based on machine |
|
learning. Annual International Conference of the IEEE Engineering in Medicine |
|
and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual |
|
International Conference, 2021:1644–1647, Nov. 2021. |
|
|
|
<a id="23">[23]</a> |
|
S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea. MELD: |
|
A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations, June |
|
2019. |
|
|
|
<a id="24">[24]</a> |
|
S. R. Livingstone and F. A. Russo. The Ryerson Audio-Visual Database of Emo- |
|
tional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal |
|
expressions in North American English. PLOS ONE, 13(5):e0196391, May 2018. |
|
|
|
|
|
<a id="25">[25]</a> |
|
S. Haq and P. J. B. Jackson. Speaker-dependent audio-visual emotion recognition. |
|
In Proc. AVSP 2009, pages 53–58, 2009. |
|
|
|
|
|
<a id="26">[26]</a> |
|
O. Mohamad Nezami, P. Jamshid Lou, and M. Karami. ShEMO: A large-scale |
|
validated database for Persian speech emotion detection. Language Resources and |
|
Evaluation, 53(1):1–16, Mar. 2019. |
|
|
|
|
|
<a id="27">[27]</a> |
|
F. Schiel, S. Steininger, and U. T¨urk. The SmartKom Multimodal Corpus at BAS. In |
|
M. Gonz´alez Rodr´ıguez and C. P. Suarez Araujo, editors, Proceedings of the Third |
|
International Conference on Language Resources and Evaluation (LREC’02), Las |
|
Palmas, Canary Islands - Spain, May 2002. European Language Resources Association (ELRA). |
|
|
|
|
|
<a id="28">[28]</a> |
|
B. Schuller, F. Eyben, S. Can, and H. Feußner. Speech in Minimal Invasive Surgery - Towards an Affective Language Resource of Real-life Medical Operations. 2010. |
|
|
|
|
|
<a id="29">[29]</a> |
|
J. H. L. Hansen and S. E. Bou-Ghazale. Getting started with SUSAS: A speech under |
|
simulated and actual stress database. In Proc. Eurospeech 1997, pages 1743–1746, |
|
1997. |
|
|
|
|
|
|
|
<a id="30">[30]</a> |
|
S. Sultana, M. S. Rahman, M. R. Selim, and M. Z. Iqbal. SUST Bangla Emotional |
|
Speech Corpus (SUBESCO): An audio-only emotional speech corpus for Bangla. |
|
PLOS ONE, 16(4):e0250173, Apr. 2021. |
|
|
|
|
|
<a id="31">[31]</a> |
|
M. K. Pichora-Fuller and K. Dupuis. Toronto emotional speech set (TESS), Feb. |
|
2020. |
|
|
|
|
|
|
|
<a id="32">[32]</a> |
|
S. Latif, A. Qayyum, M. Usman, and J. Qadir. Cross Lingual Speech Emotion |
|
Recognition: Urdu vs. Western Languages. In 2018 International Conference on |
|
Frontiers of Information Technology (FIT), pages 88–93, Dec. 2018. |
|
|
|
|