26(1) / 2021 / 6 / pp. 17-32
NSYSU-MITLab 團隊於福爾摩沙語音辨識競賽 2020 之語音辨識系統
NSYSU-MITLab Speech Recognition System for Formosa Speech Recognition Challenge 2020
Authors
Hung-Pang Lin
(Department of Computer Science and Information Engineering, National Sun Yat-sen University)
Chia-Ping Chen *
(Department of Computer Science and Information Engineering, National Sun Yat-sen University)
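Chinese Abstract (English translation)
In this paper, we describe the system that team NSYSU-MITLab implemented for the Formosa Speech Recognition Challenge 2020 (FSR-2020). We build an end-to-end speech recognition system on the Transformer architecture, which is composed of multi-head attention, and combine it with Connectionist Temporal Classification (CTC) for joint end-to-end training and decoding. We also try replacing the encoder with the Conformer architecture, which combines a convolutional neural network (CNN) with multi-head attention. In addition, we build a deep neural network-hidden Markov model (DNN-HMM) system, in which the deep neural network part is built from Time-Restricted Self-Attention (TRSA) and a Factorized Time Delay Neural Network (TDNN-F). Our best results are a character error rate (CER) of 43.4% on the Taiwan Southern Min Recommended Characters (台文漢字) task and a syllable error rate (SER) of 25.4% on the Taiwan Minnanyu Luomazi Pinyin (台羅拼音) task.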
English Abstract
In this paper, we describe the system that team NSYSU-MITLab implemented for the Formosa Speech Recognition Challenge 2020. We use the Transformer architecture, which is built on multi-head attention, to construct an end-to-end speech recognition system, and we combine it with Connectionist Temporal Classification (CTC) for end-to-end training and decoding. We have also built a deep neural network-hidden Markov model (DNN-HMM) system, in which the deep neural network uses Time-Restricted Self-Attention and a Factorized Time Delay Neural Network (TDNN-F). The best performance we achieve with the proposed methods is a character error rate of 45.5% on the Taiwan Southern Min Recommended Characters (台文漢字) task and a syllable error rate of 25.4% on the Taiwan Minnanyu Luomazi Pinyin (台羅拼音) task.
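The character error rate and syllable error rate reported above are conventionally computed as the edit distance between the reference and the recognized token sequence, divided by the number of reference tokens. The following is a minimal sketch of that convention, not the challenge's official scoring tool. The function names `edit_distance` and `error_rate` are hypothetical, and it assumes the text has already been split into characters (for CER) or syllables (for SER).

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between a reference and a hypothesis token sequence."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]) -> float:
    """Total edit errors divided by total reference tokens over a test set."""
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_tokens = sum(len(r) for r in refs)
    if total_tokens == 0:
        raise ValueError("reference set contains no tokens")
    return total_errors / total_tokens


# Illustrative tokens only (not challenge data): one substitution and one
# deletion against a four-token reference.
example_rate = error_rate([["a", "b", "c", "d"]], [["a", "x", "c"]])
```

With character tokens the result is a CER, and with syllable tokens it is an SER. How the challenge tokenizes and normalizes text before scoring is not stated here, so that step is left to the caller.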
Keywords
Automatic Speech Recognition; Transformer; Conformer; Connectionist Temporal Classification; Acoustic Model