28(1) / 2023/6 / pp. 1-18
基於語音自監督模型的複雜環境長語句意圖偵測與辨識
Intent Detection and Recognition of Long Sentences in Complex Environment Based on Speech Self-supervised Model
Authors
張開 Kai Zhang
(國立台灣大學資訊網路與多媒體研究所 Graduate Institute of Networking and Multimedia, National Taiwan University)
葉子雋 Tzu-Chun Yeh
(迪威智能股份有限公司 DeepWave)
王崇喆 Chung-Che Wang *
(國立台灣大學資訊工程學系 Department of Information Engineering, National Taiwan University)
張秋霞 Qiuxia Zhang
(國立台灣大學資訊工程學系 Department of Information Engineering, National Taiwan University)
藍偉任 WeiRen Lan
(迪威智能股份有限公司 DeepWave)
張智星 Jyh-Shing Roger Jang
(國立台灣大學資訊工程學系 Department of Information Engineering, National Taiwan University)
Chinese Abstract

本論文針對消防指揮記錄中心無線電語料中,語句長、噪聲多的特點,提出了一種在長語句中辨識意圖的方法。此方法首先使用自監督學習 (self-supervised learning) 模型來進行語音的特徵提取,然後再使用兩個下游模型分別對語音中的意圖進行偵測和辨識。此方法在無線電語料長語音意圖辨識任務上與 Whisper+BERT 的方法相比,錯誤減少率 (error reduction rate, ERR) 為 33.2%。在關鍵詞發現 (keyword spotting) 任務上與區域提案網絡 (region proposal network, RPN) 方法相比,在每小時誤報次數 (false alarm per hour, FAH) 相近的情況下,錯誤拒絕率 (false rejection rate, FRR) 的 ERR 為 73.2%。在短語句分類任務上和 Whisper+BERT 方法相比,ERR 為 4.3%。同時與 Whisper+BERT 方法相比,推理算力需求下降了 91.4%。我們所提出的方法,可以廣泛地應用在從長語音(電話或無線電對話等)中提取關鍵資訊。

English Abstract

This paper addresses the long sentences and heavy noise that characterize the radio corpus of a fire command record center, and proposes a method for identifying intent in long sentences. The method first uses a self-supervised learning model to extract speech features, and then uses two downstream models to detect and to recognize intent in the speech, respectively. On the long-speech intent recognition task of the radio corpus, the method achieves an error reduction rate (ERR) of 33.2% relative to the Whisper+BERT method. On the keyword spotting task, compared with the region proposal network (RPN) method at a similar number of false alarms per hour (FAH), the ERR in false rejection rate (FRR) is 73.2%. On the short sentence classification task, the ERR relative to the Whisper+BERT method is 4.3%. The method also requires 91.4% less inference computing power than the Whisper+BERT method. It can be widely applied to extracting key information from long speech, such as telephone or radio conversations.
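
The abstract reports its results as ERR, FRR, and FAH without spelling out the formulas. The sketch below shows how these three quantities are commonly computed. It assumes the conventional definitions (relative reduction in error, missed detections over true occurrences, false alarms normalized by audio duration); the abstract does not confirm that the paper uses exactly these. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
def error_reduction_rate(baseline_error: float, proposed_error: float) -> float:
    """Relative reduction of an error metric with respect to a baseline.

    Assumed definition: ERR = (E_baseline - E_proposed) / E_baseline.
    """
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - proposed_error) / baseline_error


def false_rejection_rate(missed: int, total_positives: int) -> float:
    """Fraction of true keyword occurrences that the system failed to detect."""
    if total_positives <= 0:
        raise ValueError("total_positives must be positive")
    return missed / total_positives


def false_alarms_per_hour(false_alarms: int, audio_seconds: float) -> float:
    """Number of false detections normalized to one hour of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return false_alarms / (audio_seconds / 3600.0)


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the arithmetic.
    print(error_reduction_rate(baseline_error=0.5, proposed_error=0.25))
    print(false_rejection_rate(missed=5, total_positives=100))
    print(false_alarms_per_hour(false_alarms=3, audio_seconds=7200))
```

Under these definitions, an ERR applied to FRR means that the FRR itself is the error metric passed to `error_reduction_rate`, with both systems compared at a similar FAH.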

Chinese Keywords

意圖分類;長語句;自監督學習;關鍵詞發現;語句分類

English Keywords

Intent Classification; Long Speech Sentence; Self-supervised Learning; Keyword Spotting; Speech Sentence Classification