---
license: apache-2.0
---

# Chinese-English ASR model using a k2/icefall streaming Zipformer

AIShell-1 and WenetSpeech test-set error rates (%) with `modified_beam_search` streaming decoding, using `epoch-12.pt`:

| decode_chunk_len | AIShell-1 test | WenetSpeech TEST_NET | WenetSpeech TEST_MEETING |
|------------------|----------------|----------------------|--------------------------|
| 64               | 4.79           | 11.6                 | 12.64                    |

## Training and decoding commands

Training:

```bash
nohup ./pruned_transducer_stateless7_streaming/train.py --world-size 8 --num-epochs 30 --start-epoch 1 \
  --num-encoder-layers 2,2,2,2,2 \
  --feedforward-dims 768,768,768,768,768 \
  --nhead 4,4,4,4,4 \
  --encoder-dims 256,256,256,256,256 \
  --attention-dims 192,192,192,192,192 \
  --encoder-unmasked-dims 192,192,192,192,192 \
  --exp-dir pruned_transducer_stateless7_streaming/exp --max-duration 360 \
  > pruned_transducer_stateless7_streaming/exp/nohup.zipformer &
```
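
Each comma-separated flag above sets one value per Zipformer encoder stack. The snippet below is a minimal sketch of how such flags map to per-stack lists; the `per_stack` helper is made up for illustration and is not part of the icefall recipe.

```python
# Minimal sketch (not the actual icefall code): turn the comma-separated
# per-stack flags from the training command into per-stack integer lists.
def per_stack(value):
    return [int(x) for x in value.split(",")]

num_encoder_layers = per_stack("2,2,2,2,2")          # layers in each Zipformer stack
feedforward_dims   = per_stack("768,768,768,768,768")
encoder_dims       = per_stack("256,256,256,256,256")
attention_dims     = per_stack("192,192,192,192,192")

# Every per-stack flag must describe the same number of stacks (5 here).
lengths = {len(v) for v in (num_encoder_layers, feedforward_dims, encoder_dims, attention_dims)}
assert len(lengths) == 1

print(f"{len(encoder_dims)} stacks, {sum(num_encoder_layers)} encoder layers in total")
```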

Decoding:

```bash
nohup ./pruned_transducer_stateless7_streaming/decode.py --epoch 12 --avg 1 \
  --num-encoder-layers 2,2,2,2,2 \
  --feedforward-dims 768,768,768,768,768 \
  --nhead 4,4,4,4,4 \
  --encoder-dims 256,256,256,256,256 \
  --attention-dims 192,192,192,192,192 \
  --encoder-unmasked-dims 192,192,192,192,192 \
  --exp-dir pruned_transducer_stateless7_streaming/exp \
  --max-duration 600 --decode-chunk-len 32 --decoding-method modified_beam_search --beam-size 4 \
  > nohup.zipformer.decode &
```
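
For a rough sense of the streaming granularity: `--decode-chunk-len 32` is given in feature frames. Assuming the usual 10 ms fbank frame shift and that the chunk length is counted before the 4x subsampling (both are assumptions, consistent with `feature_dim: 80` and `subsampling_factor: 4` in the parameter dump below), one chunk corresponds to about 320 ms of audio:

```python
# Back-of-the-envelope chunk size, assuming a 10 ms fbank frame shift and that
# --decode-chunk-len counts frames before the 4x subsampling (both assumptions).
frame_shift_ms = 10
decode_chunk_len = 32      # --decode-chunk-len from the command above
subsampling_factor = 4     # 'subsampling_factor' from the parameter dump below

chunk_ms = decode_chunk_len * frame_shift_ms
encoder_frames = decode_chunk_len // subsampling_factor
print(f"~{chunk_ms} ms of audio per chunk ({encoder_frames} encoder frames after subsampling)")
```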

The modeling unit is Chinese characters plus English BPE, as listed in `data/lang_char_bpe/tokens.txt`.
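
If you need this symbol table in Python, the sketch below reads it directly. It assumes the common icefall layout of one `<symbol> <id>` pair per line, which is an assumption about the file format, not something stated in this card:

```python
# Minimal sketch: load data/lang_char_bpe/tokens.txt into a symbol -> id map.
# Assumes the usual icefall format of one "<symbol> <id>" pair per line.
def read_tokens(path="data/lang_char_bpe/tokens.txt"):
    sym2id = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) != 2:
                continue  # skip empty or unexpected lines
            sym, idx = fields
            sym2id[sym] = int(idx)
    return sym2id

sym2id = read_tokens()
id2sym = {i: s for s, i in sym2id.items()}
print(f"{len(sym2id)} tokens loaded")  # should match vocab_size (6254) from the dump below
```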

## Tips

The k2, lhotse, and icefall versions and the training hyperparameters logged for this run were:

```python
{'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1,
 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000,
 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000,
 'env_info': {'k2-version': '1.23.2', 'k2-build-type': 'Release', 'k2-with-cuda': True,
              'k2-git-sha1': 'a74f59dba1863cd9386ba4d8815850421260eee7', 'k2-git-date': 'Fri Dec 2 08:32:22 2022',
              'lhotse-version': '1.5.0.dev+git.8ce38fc.dirty', 'torch-version': '1.11.0+cu113',
              'torch-cuda-available': True, 'torch-cuda-version': '11.3', 'python-version': '3.7',
              'icefall-git-branch': 'master', 'icefall-git-sha1': '600f387-dirty',
              'icefall-git-date': 'Thu Feb 9 15:16:04 2023',
              'icefall-path': '/opt/conda/lib/python3.7/site-packages',
              'k2-path': '/opt/conda/lib/python3.7/site-packages/k2/__init__.py',
              'lhotse-path': '/opt/conda/lib/python3.7/site-packages/lhotse/__init__.py',
              'hostname': 'worker-0', 'IP address': '127.0.0.1'},
 'world_size': 8, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 30, 'start_epoch': 11, 'start_batch': 0,
 'exp_dir': PosixPath('pruned_transducer_stateless7_streaming/exp_t'), 'lang_dir': 'data/lang_char_bpe',
 'base_lr': 0.01, 'lr_batches': 5000, 'lr_epochs': 3.5, 'context_size': 2, 'prune_range': 5,
 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42,
 'print_diagnostics': False, 'inf_check': False, 'save_every_n': 2000, 'keep_last_k': 30,
 'average_period': 200, 'use_fp16': False,
 'num_encoder_layers': '2,2,2,2,2', 'feedforward_dims': '768,768,768,768,768', 'nhead': '4,4,4,4,4',
 'encoder_dims': '256,256,256,256,256', 'attention_dims': '192,192,192,192,192',
 'encoder_unmasked_dims': '192,192,192,192,192', 'zipformer_downsampling_factors': '1,2,4,8,2',
 'cnn_module_kernels': '31,31,31,31,31', 'decoder_dim': 512, 'joiner_dim': 512,
 'short_chunk_size': 50, 'num_left_chunks': 4, 'decode_chunk_len': 32,
 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 360, 'bucketing_sampler': True, 'num_buckets': 300,
 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False,
 'shuffle': True, 'return_cuts': True, 'num_workers': 8, 'enable_spec_aug': True,
 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'mix',
 'blank_id': 0, 'vocab_size': 6254}
```
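
Note that `decode.py` above is given exactly the same architecture flags as `train.py`: they must describe the configuration the checkpoint was trained with. The snippet below is a purely illustrative sketch (not part of the recipe) that compares the decoding flags against the values logged above:

```python
# Illustrative sanity check: architecture values logged at training time
# (copied from the dump above) must match the flags passed to decode.py.
trained = {
    "num_encoder_layers": "2,2,2,2,2",
    "feedforward_dims": "768,768,768,768,768",
    "nhead": "4,4,4,4,4",
    "encoder_dims": "256,256,256,256,256",
    "attention_dims": "192,192,192,192,192",
    "encoder_unmasked_dims": "192,192,192,192,192",
}

decode_flags = {  # values from the decode.py command in this card
    "num_encoder_layers": "2,2,2,2,2",
    "feedforward_dims": "768,768,768,768,768",
    "nhead": "4,4,4,4,4",
    "encoder_dims": "256,256,256,256,256",
    "attention_dims": "192,192,192,192,192",
    "encoder_unmasked_dims": "192,192,192,192,192",
}

mismatched = {k for k in trained if trained[k] != decode_flags[k]}
assert not mismatched, f"decode flags disagree with training config: {mismatched}"
print("decode flags match the training configuration")
```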