metadata

license: cc-by-nc-sa-4.0
language:
  - en
  - de
  - zh
  - fr
  - nl
  - el
  - it
  - es
  - my
  - he
  - sv
  - fa
  - tr
  - ur
library_name: transformers
pipeline_tag: audio-classification
tags:
  - Speech Emotion Recognition
  - SER
  - Transformer
  - HuBERT
  - Affective Computing

ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets

Authors: Shahin Amiriparian, Filip Packań, Maurice Gerczuk, Björn W. Schuller

ExHuBERT is a fine-tuned, backbone-extended HuBERT Large model trained on EmoSet++, which comprises 37 datasets totaling 150,907 samples with a cumulative duration of 119.5 hours. The model expects a 3-second raw waveform resampled to 16 kHz. The original 6 output classes are combinations of low/high arousal and negative/neutral/positive valence. Further details are available in the corresponding paper.
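The input requirement above (3 seconds at 16 kHz, i.e. 48,000 samples) can be met with a small helper such as the sketch below. It assumes the audio is already mono and resampled to 16 kHz. The helper name `fit_to_clip_length`, the choice to keep the first 3 seconds, and the zero-padding of shorter clips are illustrative assumptions, not steps prescribed by the authors.

```python
import numpy as np

SAMPLING_RATE = 16_000                         # Hz, as stated in the model description
CLIP_SECONDS = 3                               # expected input duration in seconds
TARGET_LENGTH = SAMPLING_RATE * CLIP_SECONDS   # 48,000 samples


def fit_to_clip_length(waveform, target_length=TARGET_LENGTH):
    """Crop or zero-pad a mono waveform to exactly `target_length` samples.

    Hypothetical helper: cropping to the first 3 s and padding with zeros
    are assumptions made for illustration only.
    """
    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim != 1:
        raise ValueError("expected a mono (1-D) waveform")
    if len(waveform) >= target_length:
        # Long clip: keep only the first `target_length` samples.
        return waveform[:target_length]
    # Short clip: append zeros until the target length is reached.
    return np.pad(waveform, (0, target_length - len(waveform)))
```

Longer recordings could instead be split into consecutive 3-second windows and classified window by window; which strategy suits a given dataset is left open here.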

EmoSet++ subsets used for fine-tuning the model:


ABC [1]	AD [2]	BES [3]	CASIA [4]	CVE [5]
Crema-D [6]	DES [7]	DEMoS [8]	EA-ACT [9]	EA-BMW [9]
EA-WSJ [9]	EMO-DB [10]	EmoFilm [11]	EmotiW-2014 [12]	EMOVO [13]
eNTERFACE [14]	ESD [15]	EU-EmoSS [16]	EU-EV [17]	FAU Aibo [18]
GEMEP [19]	GVESS [20]	IEMOCAP [21]	MES [3]	MESD [22]
MELD [23]	PPMMK [2]	RAVDESS [24]	SAVEE [25]	ShEMO [26]
SmartKom [27]	SIMIS [28]	SUSAS [29]	SUBESCO [30]	TESS [31]
TurkishEmo [2]	Urdu [32]

Usage

import torch
import torch.nn as nn
from transformers import AutoModelForAudioClassification, Wav2Vec2FeatureExtractor



# CONFIG and MODEL SETUP
model_name = 'amiriparian/ExHuBERT'
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
model = AutoModelForAudioClassification.from_pretrained(
    model_name,
    trust_remote_code=True,
    revision="b158d45ed8578432468f3ab8d46cbe5974380812",
)

# Freezing half of the encoder for further transfer learning
model.freeze_og_encoder()

sampling_rate = 16000
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
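With the model and feature extractor set up as above, inference on a single clip might look like the following sketch. The function `predict_probabilities` is a hypothetical helper. It assumes the model's output exposes a `.logits` attribute, as standard `transformers` audio-classification models do; this card does not confirm that for the custom ExHuBERT code. It also makes no assumption about which output index corresponds to which arousal/valence class.

```python
import torch


def predict_probabilities(model, input_values):
    """Return softmax class probabilities for a batch of 16 kHz clips.

    Assumption: `model(input_values)` returns an object with a `.logits`
    tensor of shape (batch, num_classes).
    """
    model.eval()                      # disable dropout and similar layers
    with torch.no_grad():             # no gradients needed for inference
        output = model(input_values)
    return torch.softmax(output.logits, dim=-1)


# Hypothetical usage with the objects created in the setup code:
# inputs = feature_extractor(
#     waveform,                       # 1-D float array sampled at 16 kHz
#     sampling_rate=sampling_rate,
#     padding="max_length",
#     max_length=48000,               # 3 seconds at 16 kHz
#     truncation=True,
#     return_tensors="pt",
# )
# probs = predict_probabilities(model, inputs["input_values"].to(device))
```

Note that the helper switches the model to evaluation mode; call `model.train()` again before continuing any fine-tuning.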

Citation Info

@inproceedings{Amiriparian24-EEH,
  author = {Shahin Amiriparian and Filip Packan and Maurice Gerczuk and Bj\"orn W.\ Schuller},
  title = {{ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets}},
  booktitle = {{Proc. INTERSPEECH}}, 
  year = {2024},
  editor = {},
  volume = {},
  series = {},
  address = {Kos Island, Greece},
  month = {September},
  publisher = {ISCA},
}

References

[1] B. Schuller, D. Arsic, G. Rigoll, M. Wimmer, and B. Radig. Audiovisual Behavior Modeling by Combined Feature Spaces. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP ’07, volume 2, pages II-733–II-736, Apr. 2007.

[2] M. Gerczuk, S. Amiriparian, S. Ottl, and B. W. Schuller. EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion Recognition. IEEE Transactions on Affective Computing, 14(2):1472–1487, Apr. 2023.

[3] T. L. Nwe, S. W. Foo, and L. C. De Silva. Speech emotion recognition using hidden Markov models. Speech Communication, 41(4):603–623, Nov. 2003.

[4] The selected speech emotion database of Institute of Automation, Chinese Academy of Sciences (CASIA). http://www.chineseldc.org/resource_info.php?rid=76. Accessed March 2024.

[5] P. Liu and M. D. Pell. Recognizing vocal emotions in Mandarin Chinese: A validated database of Chinese vocal emotional stimuli. Behavior Research Methods, 44(4):1042–1051, Dec. 2012.

[6] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma. CREMA-D: Crowd-sourced Emotional Multimodal Actors Dataset. IEEE transactions on affective computing, 5(4):377–390, 2014.

[7] I. S. Engberg, A. V. Hansen, O. K. Andersen, and P. Dalsgaard. Design, Recording and Verification of a Danish Emotional Speech Database. In EUROSPEECH ’97: 5th European Conference on Speech Communication and Technology, Patras, Rhodes, Greece, 22–25 September 1997, volume 4, pages 1695–1698, 1997.

[8] E. Parada-Cabaleiro, G. Costantini, A. Batliner, M. Schmitt, and B. W. Schuller. DEMoS: An Italian emotional speech corpus. Language Resources and Evaluation, 54(2):341–383, June 2020.

[9] B. Schuller. Automatische Emotionserkennung Aus Sprachlicher Und Manueller Interaktion. PhD thesis, Technische Universität München, 2006.

[10] F. Burkhardt, A. Paeschke, M. Rolfes, W. F. Sendlmeier, and B. Weiss. A database of German emotional speech. In Interspeech 2005, pages 1517–1520. ISCA, Sept. 2005.

[11] E. Parada-Cabaleiro, G. Costantini, A. Batliner, A. Baird, and B. Schuller. Categorical vs Dimensional Perception of Italian Emotional Speech. In Interspeech 2018, pages 3638–3642. ISCA, Sept. 2018.

[12] A. Dhall, R. Goecke, J. Joshi, K. Sikka, and T. Gedeon. Emotion Recognition In The Wild Challenge 2014: Baseline, Data and Protocol. In Proceedings of the 16th International Conference on Multimodal Interaction, ICMI ’14, pages 461–466, New York, NY, USA, Nov. 2014. Association for Computing Machinery.

[13] G. Costantini, I. Iaderola, A. Paoloni, and M. Todisco. EMOVO Corpus: An Italian Emotional Speech Database. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 3501–3504, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA).

[14] O. Martin, I. Kotsia, B. Macq, and I. Pitas. The eNTERFACE’05 Audio-Visual Emotion Database. In 22nd International Conference on Data Engineering Workshops (ICDEW’06), pages 8–8, Apr. 2006.

[15] K. Zhou, B. Sisman, R. Liu, and H. Li. Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset, Feb. 2021.

[16] H. O’Reilly, D. Pigat, S. Fridenson, S. Berggren, S. Tal, O. Golan, S. Bölte, S. Baron-Cohen, and D. Lundqvist. The EU-Emotion Stimulus Set: A validation study. Behavior Research Methods, 48(2):567–576, June 2016.

[17] A. Lassalle, D. Pigat, H. O’Reilly, S. Berggren, S. Fridenson-Hayo, S. Tal, S. Elfström, A. Råde, O. Golan, S. Bölte, S. Baron-Cohen, and D. Lundqvist. The EU-Emotion Voice Database. Behavior Research Methods, 51(2):493–506, Apr. 2019.

[18] A. Batliner, S. Steidl, and E. Nöth. Releasing a thoroughly annotated and processed spontaneous emotional database: The FAU Aibo Emotion Corpus. 2008.

[19] K. R. Scherer, T. Bänziger, and E. Roesch. A Blueprint for Affective Computing: A Sourcebook and Manual. OUP Oxford, Sept. 2010.

[20] R. Banse and K. R. Scherer. Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70(3):614–636, 1996.

[21] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335–359, Dec. 2008.

[22] M. M. Duville, L. M. Alonso-Valerdi, and D. I. Ibarra-Zarate. The Mexican Emotional Speech Database (MESD): Elaboration and assessment based on machine learning. Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2021:1644–1647, Nov. 2021.

[23] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea. MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations, June 2019.

[24] S. R. Livingstone and F. A. Russo. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLOS ONE, 13(5):e0196391, May 2018.

[25] S. Haq and P. J. B. Jackson. Speaker-dependent audio-visual emotion recognition. In Proc. AVSP 2009, pages 53–58, 2009.

[26] O. Mohamad Nezami, P. Jamshid Lou, and M. Karami. ShEMO: A large-scale validated database for Persian speech emotion detection. Language Resources and Evaluation, 53(1):1–16, Mar. 2019.

[27] F. Schiel, S. Steininger, and U. Türk. The SmartKom Multimodal Corpus at BAS. In M. González Rodríguez and C. P. Suarez Araujo, editors, Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02), Las Palmas, Canary Islands - Spain, May 2002. European Language Resources Association (ELRA).

[28] B. Schuller, F. Eyben, S. Can, and H. Feußner. Speech in Minimal Invasive Surgery - Towards an Affective Language Resource of Real-life Medical Operations. 2010.

[29] J. H. L. Hansen and S. E. Bou-Ghazale. Getting started with SUSAS: A speech under simulated and actual stress database. In Proc. Eurospeech 1997, pages 1743–1746, 1997.

[30] S. Sultana, M. S. Rahman, M. R. Selim, and M. Z. Iqbal. SUST Bangla Emotional Speech Corpus (SUBESCO): An audio-only emotional speech corpus for Bangla. PLOS ONE, 16(4):e0250173, Apr. 2021.

[31] M. K. Pichora-Fuller and K. Dupuis. Toronto emotional speech set (TESS), Feb. 2020.

[32] S. Latif, A. Qayyum, M. Usman, and J. Qadir. Cross Lingual Speech Emotion Recognition: Urdu vs. Western Languages. In 2018 International Conference on Frontiers of Information Technology (FIT), pages 88–93, Dec. 2018.