Research 


AI-based Assistive Technologies for Spoken Communication

In recent years, longer lifespans, frequent exposure to environmental noise, and excessive earphone use have caused the number of people with hearing loss to rise year by year, while the age of the affected population keeps trending downward. Internationally recognized studies indicate that hearing loss can isolate older adults from the world around them and indirectly contribute to dementia, and that hearing loss in school-age children reduces learning effectiveness; the problem has therefore drawn growing international attention. The applicant's laboratory has been devoted to developing hearing assistive technologies based on AI and advanced signal processing, including FM wireless systems [1, 2], hearing aids [3, 4], and cochlear implants [5-7], with the goal of improving existing hearing assistive devices with state-of-the-art AI algorithms so as to enhance the hearing performance and, in turn, the quality of life of people with hearing impairment. Beyond hearing assistive technologies, we are also developing novel speaking assistive technologies that aim to improve the speech intelligibility of people with articulation disorders and make their communication with others more efficient. We have developed a machine learning-based speech enhancement system that has been experimentally shown to effectively improve the speech recognition rate of patients after oral cancer surgery [8]. In addition, we implemented a voice disorder detection platform based on deep learning algorithms [9], which allows users to monitor possible voice disorders at home at any time. All of these projects have yielded very positive results. In this research direction, our collaborating teams include: Cheng Hsin General Hospital (topic: implementing deep learning-based speech processing algorithms in cochlear implants), Taipei Veterans General Hospital (topic: application software for a hearing screening platform), Mackay Medical College (topic: speech enhancement for dysarthric speech), and Far Eastern Memorial Hospital (topic: deep learning-based multimodal detection of articulation disorders).


Deep Learning-based Audio Signal Processing

Based on deep learning theory, we have proposed novel speech signal processing algorithms for denoising [10], dereverberation [11], and channel compensation [12]. The common goal of these tasks is to improve sound quality and thereby make human-to-human and human-machine communication more effective. In particular, we have developed algorithms that strengthen speech intelligibility [13] and speech quality [14], so as to achieve high speech recognition accuracy and good spoken communication quality. We have also proposed an end-to-end waveform enhancement approach to further improve intelligibility and quality. In addition, we proposed an integrated deep and ensemble learning algorithm [15] and an environment adaptation algorithm based on an adversarial training criterion [16] to mitigate the mismatch between training and testing conditions encountered in real-world applications and to further improve speech processing performance.
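As a concrete illustration of the denoising direction, the following is a minimal sketch, in the spirit of the deep denoising autoencoder of [10] but not the authors' released code: a small fully connected network maps noisy log-magnitude spectra to clean ones, trained with an MSE criterion. All layer sizes and tensor shapes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Maps noisy log-magnitude spectrogram frames to clean ones."""
    def __init__(self, n_freq_bins: int = 257, hidden: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_freq_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.decoder = nn.Linear(hidden, n_freq_bins)

    def forward(self, noisy_logmag: torch.Tensor) -> torch.Tensor:
        # noisy_logmag: (batch, frames, n_freq_bins)
        return self.decoder(self.encoder(noisy_logmag))

model = DenoisingAutoencoder()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random tensors standing in for
# paired noisy/clean log-magnitude spectrograms.
noisy = torch.randn(8, 100, 257)
clean = torch.randn(8, 100, 257)
loss = criterion(model(noisy), clean)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```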


Multimodal Speech Signal Processing

Human-to-human and human-machine communication involves both verbal and non-verbal components. When a speaker delivers a spoken message, the listener not only attends to the sound itself but also picks up useful visual cues that help in understanding the speech content. In general, the non-verbal part is carried by visual information, including the speaker's articulatory movements, facial expressions, and body language. In several speech technologies, combining visual and acoustic signals can effectively improve message transmission and enable efficient human-machine interfaces. Motivated by this observation, we have studied approaches that combine visual and acoustic signals to improve the performance of speech signal processing. So far, we have proposed novel algorithms for emotion recognition [17], oral presentation scoring [18], and speech enhancement [19], and experimental results confirm that the developed algorithms effectively improve performance on the target tasks. In the future, we will apply these algorithms to the development of assistive technologies for spoken communication, to further help people who need communication assistance.
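To make the fusion idea concrete, here is a minimal sketch of audio-visual speech enhancement; the architecture is an assumption for illustration and not the exact model of [19]: a small CNN encodes lip-region images, an MLP encodes noisy spectra, and the concatenated embeddings are decoded into enhanced spectra.

```python
import torch
import torch.nn as nn

class AudioVisualEnhancer(nn.Module):
    def __init__(self, n_freq_bins: int = 257, embed: int = 256):
        super().__init__()
        self.visual_encoder = nn.Sequential(        # grayscale lip images
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed),
        )
        self.audio_encoder = nn.Sequential(
            nn.Linear(n_freq_bins, embed), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(2 * embed, embed), nn.ReLU(),
            nn.Linear(embed, n_freq_bins),
        )

    def forward(self, noisy_frame: torch.Tensor, lip_image: torch.Tensor) -> torch.Tensor:
        # noisy_frame: (batch, n_freq_bins); lip_image: (batch, 1, 64, 64)
        fused = torch.cat([self.audio_encoder(noisy_frame),
                           self.visual_encoder(lip_image)], dim=-1)
        return self.decoder(fused)

model = AudioVisualEnhancer()
enhanced = model(torch.randn(4, 257), torch.randn(4, 1, 64, 64))
print(enhanced.shape)  # torch.Size([4, 257])
```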


Increasing Compactness of Deep Learning-based Speech Enhancement Models

Most recent studies on deep learning-based speech enhancement (SE) have focused on improving denoising performance. However, deploying SE successfully in real-world scenarios requires a balance between denoising performance and computational cost. We have investigated two approaches, model pruning and parameter quantization, for effectively compressing deep learning models so that SE can be performed on edge devices. For model pruning, a computation-performance optimization (CPO) algorithm was developed [20] to remove redundant channels from a neural network, as shown in Fig. 4. For parameter quantization, we proposed an exponent-only floating-point quantized neural network (EOFP-QNN) to compress the model and improve inference efficiency [21]. Both techniques significantly reduce model size and increase inference efficiency with an acceptable drop in performance.
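The following is a minimal sketch of the exponent-only idea behind EOFP-QNN [21], greatly simplified relative to the published method: each float32 weight keeps only its sign and exponent bits, i.e., it is rounded toward zero to a signed power of two, which shrinks storage and makes multiplications cheap.

```python
import numpy as np

def exponent_only_quantize(weights: np.ndarray) -> np.ndarray:
    """Zero the 23 mantissa bits of float32 values, keeping sign + exponent."""
    bits = weights.astype(np.float32).view(np.uint32)
    bits = bits & np.uint32(0xFF800000)   # 1 sign bit + 8 exponent bits
    return bits.view(np.float32)

w = np.random.randn(4, 4).astype(np.float32)
w_q = exponent_only_quantize(w)
print(w_q)                                   # every entry is +/- 2^k
print(np.max(np.abs(w - w_q) / np.abs(w)))   # relative error stays below 50%
```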


Speech Enhancement with Direct Evaluation Metric Optimization

During the training of an SE model, an objective function is used to optimize the model parameters. In the existing literature, there is often an inconsistency between the criterion used to optimize the model and the criterion used to evaluate the enhanced speech. For example, intelligibility is usually evaluated with the short-time objective intelligibility (STOI) measure, whereas a frame-based mean squared error (MSE) between the enhanced speech and the clean reference is widely used to optimize the model. Because of this inconsistency, there is no guarantee that the trained model will provide optimal performance in different applications [13]. We therefore investigated several algorithms that directly optimize model parameters with respect to the evaluation metrics, including STOI [13], the perceptual evaluation of speech quality (PESQ) [14], and automatic speech recognition (ASR) accuracy [22]. Reinforcement learning and GAN-based methods have also been exploited to facilitate optimization, since some evaluation metrics are complex and not differentiable. Experimental results show that when the evaluation metric itself is used as the objective function, the trained SE model yields superior performance on that metric compared with a model trained with the MSE criterion.
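One way to optimize a non-differentiable metric, sketched below under assumed architectures (not the exact models of [13, 14]), is to train a metric predictor beforehand to approximate a score such as PESQ or STOI, freeze it, and then maximize its predicted score while training the SE model.

```python
import torch
import torch.nn as nn

class MetricPredictor(nn.Module):
    """Maps an utterance's log-magnitude spectrogram to one quality score."""
    def __init__(self, n_freq_bins: int = 257, hidden: int = 128):
        super().__init__()
        self.rnn = nn.LSTM(n_freq_bins, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        frame_states, _ = self.rnn(spec)             # (batch, frames, 2*hidden)
        return self.head(frame_states.mean(dim=1))   # (batch, 1)

se_model = nn.Sequential(nn.Linear(257, 512), nn.ReLU(), nn.Linear(512, 257))
metric_net = MetricPredictor()
for p in metric_net.parameters():
    p.requires_grad_(False)                          # surrogate stays frozen

optimizer = torch.optim.Adam(se_model.parameters(), lr=1e-4)
noisy = torch.randn(8, 100, 257)                     # stand-in noisy spectrograms
loss = -metric_net(se_model(noisy)).mean()           # maximize predicted quality
optimizer.zero_grad()
loss.backward()
optimizer.step()
```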


Multimodal Pathological Voice Classification

We have conducted research on pathological voice classification based on medical records [23] and voice signals [24]. The results showed that voice disorders can be accurately identified from voice signals and medical records when advanced signal processing and machine learning methods are utilized. Based on the voice data, we organized a pathological voice detection challenge at IEEE Big Data 2018 [25], which attracted 109 participating teams from 27 countries. More recently, we investigated the combination of acoustic signals and medical records and derived a multimodal deep learning model. The proposed model consists of two stages: the first stage processes the acoustic features and the medical data individually, and the second stage integrates the outputs of the first stage to perform classification. The proposed multimodal deep learning framework was evaluated on 589 samples collected from Far Eastern Memorial Hospital, covering three categories of voice disorders: glottic neoplasm, phonotraumatic lesions, and vocal paralysis. We obtained promising experimental results compared with systems that use only acoustic signals or only medical records.
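A minimal sketch of this two-stage design is given below; the feature dimensions and layer sizes are assumptions for illustration, not those of the actual system. Stage one encodes acoustic features and medical-record features separately, and stage two fuses the two embeddings for three-way classification.

```python
import torch
import torch.nn as nn

class MultimodalVoiceClassifier(nn.Module):
    def __init__(self, n_acoustic: int = 40, n_record: int = 20,
                 hidden: int = 64, n_classes: int = 3):
        super().__init__()
        # Stage 1: modality-specific encoders
        self.acoustic_branch = nn.Sequential(nn.Linear(n_acoustic, hidden), nn.ReLU())
        self.record_branch = nn.Sequential(nn.Linear(n_record, hidden), nn.ReLU())
        # Stage 2: fusion and classification (glottic neoplasm,
        # phonotraumatic lesions, vocal paralysis)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, acoustic: torch.Tensor, record: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.acoustic_branch(acoustic),
                           self.record_branch(record)], dim=-1)
        return self.classifier(fused)                # unnormalized class logits

model = MultimodalVoiceClassifier()
logits = model(torch.randn(16, 40), torch.randn(16, 20))
print(logits.shape)  # torch.Size([16, 3])
```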

REFERENCES

[1] A. Chern, Y.-H. Lai, Y.-p. Chang, Y. Tsao, R. Y. Chang, and H.-W. Chang, “A Smartphone-Based Multi-Functional Hearing Assistive System to Facilitate Speech Recognition in the Classroom,” IEEE Access, vol. 5, pp. 10339-10351, 2017 (This paper has been selected as a Featured Article in IEEE Access).

[2] Y.-C. Lin, Y.-H. Lai, H.-W. Chang, Y. Tsao, Y.-p. Chang, and R. Y. Chang, “A Smartphone-Based Remote Microphone Hearing Assistive System Using Wireless Technologies,” IEEE Systems Journal, vol. 12(1), pp. 20-29, 2018.

[3] Y.-T. Liu, R. Y. Chang, Y. Tsao, and Y.-p. Chang, “A New Frequency Lowering Technique for Mandarin-speaking Hearing Aid Users,” in Proc. GlobalSIP 2015.

[4] Y.-T. Liu, Y. Tsao, and R. Y. Chang, “Nonnegative Matrix Factorization-based Frequency Lowering Technology for Mandarin-speaking Hearing Aid Users,” in Proc. ICASSP 2016.

[5] Y.-H. Lai, Y. Tsao, X. Lu, F. Chen, Y.-T. Su, K.-C. Chen, Y.-H. Chen, L.-C. Chen, P.-H. Li, and C.-H. Lee, “Deep Learning based Noise Reduction Approach to Improve Speech Intelligibility for Cochlear Implant Recipients,” Ear and Hearing, vol. 39(4), pp. 795-809, 2018.

[6] Y.-H. Lai, F. Chen, S.-S. Wang, X. Lu, Y. Tsao, and C.-H. Lee, “A Deep Denoising Autoencoder Approach to Improving the Intelligibility of Vocoded Speech in Cochlear Implant Simulation,” IEEE Transactions on Biomedical Engineering, vol. 64(7), pp. 1568-1578, 2017.

[7] Y.-H. Lai, Y. Tsao, and F. Chen, “Effects of Adaptation Rate and Noise Suppression on the Intelligibility of Compressed-Envelope Based Speech,” PLoS ONE, doi: 10.1371/journal.pone.0133519, 2015.

[8] S.-W. Fu, P.-C. Li, Y.-H. Lai, C.-C. Yang, L.-C. Hsieh, and Y. Tsao, “Joint Dictionary Learning-based Non-Negative Matrix Factorization for Voice Conversion to Improve Speech Intelligibility After Oral Surgery,” IEEE Transactions on Biomedical Engineering, vol. 64 (11), pp. 2584-2594, 2016.

[9] L.-W. Chen, H.-Y. Lee, and Y. Tsao, “Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech,” http://arxiv.org/abs/1810.12656

[10] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech Enhancement Based on Deep Denoising Autoencoder,” in Proc. Interspeech 2013.

[11] W.-J. Lee, S.-S. Wang, F. Chen, X. Lu, S.-Y. Chien, and Y. Tsao, “Speech Dereverberation Based on Integrated Deep and Ensemble Learning Algorithm,” in Proc. ICASSP 2018.

[12] H.-P. Liu, Y. Tsao, and C.-S. Fuh, “Bone-Conducted Speech Enhancement Using Deep Denoising Autoencoder,” Speech Communication, vol. 104, pp. 106-112, 2018.

[13] S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, and H. Kawai, “End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26(9), pp. 1570-1584, 2018.

[14] S.-W. Fu, Y. Tsao, H.-T. Hwang, and H.-M. Wang, “Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM,” in Proc. Interspeech 2018.

[15] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Ensemble Modeling of Denoising Autoencoder for Speech Spectrum Restoration,” in Proc. Interspeech 2014, pp. 885-889.

[16] C.-F. Liao, Y. Tsao, H.-y. Lee, and H.-M. Wang, “Noise Adaptive Speech Enhancement using Domain Adversarial Training,” in Proc. Interspeech 2019.

[17] W.-C. Chen, P.-T. Lai, Y. Tsao, and C.-C. Lee, “Multimodal Arousal Rating using Unsupervised Fusion Technique,” in Proc. ICASSP 2015.

[18] S.-W. Hsiao, H.-C. Sun, M.-C. Hsieh, M.-H. Tsai, Y. Tsao, and C.-C. Lee, “Toward Automating Oral Presentation Scoring during Principal Certification Program using Audio-Video Low-level Behavior Profiles,” IEEE Transactions on Affective Computing, in press.

[19] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, “Audio-visual Speech Enhancement using Multimodal Deep Convolutional Neural Networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2(2), pp. 117-128, 2018.

[20] C.-T. Liu, T.-W. Lin, Y.-H. Wu, Y.-S. Lin, H. Lee, Y. Tsao, and S.-Y. Chien, “Computation-Performance Optimization of Convolutional Neural Networks with Redundant Filter Removal,” IEEE Transactions on Circuits and Systems I, 2018.

[21] Y.-T. Hsu, Y.-C. Lin, S.-W. Fu, Y. Tsao, and T.-W. Kuo, “A Study on Speech Enhancement using Exponent-only Floating Point Quantized Neural Network (EOFP-QNN),” in Proc. SLT 2018.

[22] Y.-L. Shen, C.-Y. Huang, S.-S. Wang, Y. Tsao, H.-M. Wang, and T.-S. Chi, “Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition,” in Proc. ICASSP 2019.

[23] S.-Y. Tsui, Y. Tsao, C.-W. Lin, S.-H. Fang, and C.-T. Wang, “Demographic and Symptomatic Features of Voice Disorders and Their Potential Application in Classification using Machine Learning Algorithms,” Folia Phoniatrica et Logopaedica, 2018.

[24] C.-T. Wang, F.-C. Lin, J.-Y. Chen, M.-J. Hsiao, S.-H. Fang, Y.-H. Lai, and Y. Tsao, “Detection of Pathological Voice Using Cepstrum Vectors: A Deep Learning Approach,” Journal of Voice, 2018.

[25] https://femh-challenge2018.weebly.com/