
fusion_gttbsc_phi-3-best3

Multi-label dialogue act classification (DAC) model that fuses ground-truth text with prosody and ASR encodings through residual cross-attention.

Model description

ASR encoder: Whisper small encoder
Prosody encoder: 2-layer transformer encoder with an initial dense projection
Backbone: Phi-3 mini
Fusion: 3 residual cross-attention fusion layers (F_asr x F_text and F_prosody x F_text) with a dense layer on top
Pooling: self-attention
Multi-label classification head: 2 dense layers with two dropouts (0.3) and a Tanh activation in between
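
The fusion, pooling, and classification components listed above could be wired together roughly as in the following PyTorch sketch. This is not the released implementation: the class names, the hidden size of 64, the number of heads, the number of labels, the LayerNorm after each residual connection, the way the two branches are merged, and the simple learned attention pooling are all assumptions made for illustration. Only the layer counts, the dropout rate of 0.3, and the Tanh activation come from the description.

```python
import torch
import torch.nn as nn


class ResidualCrossAttention(nn.Module):
    """Text states query another modality; the result is added back to the text."""

    def __init__(self, dim: int, num_heads: int = 4):  # num_heads is an assumption
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)  # assumption: normalize after the residual

    def forward(self, text: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(query=text, key=other, value=other)
        return self.norm(text + attended)  # residual connection


class FusionClassifier(nn.Module):
    """Hypothetical stand-in for fusion + pooling + multi-label head."""

    def __init__(self, dim: int = 64, num_layers: int = 3,
                 num_labels: int = 5, dropout: float = 0.3):
        super().__init__()
        # One stack per pairing: F_asr x F_text and F_prosody x F_text.
        self.asr_layers = nn.ModuleList(
            [ResidualCrossAttention(dim) for _ in range(num_layers)])
        self.prosody_layers = nn.ModuleList(
            [ResidualCrossAttention(dim) for _ in range(num_layers)])
        self.merge = nn.Linear(2 * dim, dim)   # dense layer on top of the fusion
        self.pool_score = nn.Linear(dim, 1)    # assumed form of attention pooling
        self.head = nn.Sequential(             # 2 dense layers, 2 dropouts, Tanh
            nn.Dropout(dropout), nn.Linear(dim, dim), nn.Tanh(),
            nn.Dropout(dropout), nn.Linear(dim, num_labels))

    def forward(self, f_text, f_asr, f_prosody):
        h_asr, h_pro = f_text, f_text
        for asr_layer, pro_layer in zip(self.asr_layers, self.prosody_layers):
            h_asr = asr_layer(h_asr, f_asr)
            h_pro = pro_layer(h_pro, f_prosody)
        fused = self.merge(torch.cat([h_asr, h_pro], dim=-1))    # (B, T, dim)
        weights = torch.softmax(self.pool_score(fused), dim=1)   # (B, T, 1)
        pooled = (weights * fused).sum(dim=1)                    # (B, dim)
        return self.head(pooled)                                 # one logit per label


# Usage with random features already projected to the shared size (assumption).
model = FusionClassifier().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 7, 64), torch.randn(2, 11, 64), torch.randn(2, 9, 64))
probs = torch.sigmoid(logits)  # multi-label: independent probability per label
```

Because the task is multi-label, each logit is passed through a sigmoid and thresholded independently rather than through a softmax over all labels.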

Training and evaluation data

Trained on ground-truth transcripts.
Evaluated on ground-truth (GT) transcripts and on normalized Whisper small transcripts (E2E).

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0002
  • train_batch_size: 2
  • eval_batch_size: 2
  • seed: 42
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 20
  • mixed_precision_training: Native AMP
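
As a quick check on how these values relate, the sketch below recomputes the total train batch size and evaluates a linear learning-rate decay. The helper names are hypothetical, and the single device, the zero warmup steps, and the step count of 100 are assumptions, since the card does not list them.

```python
def effective_batch_size(per_device: int, accum_steps: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update (num_devices=1 is assumed)."""
    return per_device * accum_steps * num_devices


def linear_lr(step: int, total_steps: int, base_lr: float = 2e-4,
              warmup_steps: int = 0) -> float:
    """Linear warmup (assumed to be 0 steps here) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# train_batch_size 2 with gradient_accumulation_steps 4 gives the listed total.
total = effective_batch_size(per_device=2, accum_steps=4)

# Learning rate at the start, midpoint, and end of a hypothetical 100-step run.
schedule = [linear_lr(s, total_steps=100) for s in (0, 50, 100)]
```

With gradient accumulation, each optimizer update sees train_batch_size x gradient_accumulation_steps samples, which is where the total of 8 comes from.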

Framework versions

  • Transformers 4.41.2
  • PyTorch 2.3.0+cu121
  • Datasets 2.19.2
  • Tokenizers 0.19.1

Model size: 169M params (Safetensors, tensor type F32)
