AI for assistive speech communication technologies

The growing proportion of elderly people in the population and the inappropriate use of portable audio devices have led to a rapid increase in the incidence of hearing loss. Untreated hearing loss can cause feelings of loneliness and isolation in the elderly and may lead to learning difficulties in students. Over the past few years, our group has investigated the application of machine learning and signal processing algorithms in FM assistive hearing systems [1, 2], hearing aids [3, 4], and cochlear implants (CIs) [5-7] to improve speech communication for hearing-impaired patients and thereby enhance their quality of life. In addition to assistive listening devices, we have also investigated machine learning-based assistive speaking devices to enhance speech intelligibility for individuals with speech and language disorders [8]. Oral cancer ranks among the top five cancers in Taiwan. Treatment often requires surgery that removes parts of the patient’s articulators, after which the patient’s speech may be distorted and difficult to understand. To overcome this problem, we proposed two voice conversion (VC) approaches: the first is joint dictionary training non-negative matrix factorization (JD-NMF) [8], and the second is an end-to-end generative adversarial network (GAN)-based unsupervised VC model [9]. Experimental results show that both approaches can convert distorted speech signals into signals with improved intelligibility.
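The dictionary-based conversion idea can be illustrated with a minimal sketch. It assumes that a paired source dictionary and target dictionary have already been trained jointly, so that column k of each describes the same speech unit; the function names, the toy dictionaries, and the choice of the Euclidean multiplicative update are illustrative assumptions and are not taken from [8].

```python
def nmf_activations(v, W, n_iter=50, eps=1e-12):
    """Estimate non-negative activations h so that W*h approximates v.

    v: one spectral frame, a list of F non-negative magnitudes.
    W: fixed F x K dictionary, given as a list of F rows.
    Uses the standard multiplicative update for the Euclidean cost.
    """
    F, K = len(W), len(W[0])
    h = [1.0] * K
    for _ in range(n_iter):
        # Current reconstruction W*h, computed once per iteration.
        approx = [sum(W[f][k] * h[k] for k in range(K)) for f in range(F)]
        for k in range(K):
            num = sum(W[f][k] * v[f] for f in range(F))       # (W^T v)_k
            den = sum(W[f][k] * approx[f] for f in range(F))  # (W^T W h)_k
            h[k] *= num / (den + eps)
    return h


def convert_frame(v_src, W_src, W_tgt, n_iter=50):
    """Encode a source frame with W_src, then decode it with the paired W_tgt."""
    h = nmf_activations(v_src, W_src, n_iter)
    return [sum(row[k] * h[k] for k in range(len(h))) for row in W_tgt]


# Hypothetical toy dictionaries: 3 frequency bins, 2 atoms each.
W_src = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
W_tgt = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.5]]
frame = [2.0, 3.0, 3.0]  # equals 2 * (source atom 0) + 3 * (source atom 1)
converted = convert_frame(frame, W_src, W_tgt)
```

Because the activations are shared between the two dictionaries, the converted frame is built from the target atoms with the weights estimated on the distorted source frame.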

Deep learning-based speech signal processing

In this line of work, we focus on deriving novel deep learning-based algorithms for denoising [10], dereverberation [11], and channel compensation [12] of speech signals. The goal is to enhance speech signals to achieve more effective human-human and human-machine communication. We investigated approaches that enhance speech intelligibility [13] and quality [14] to facilitate higher speech recognition rates and improved communication. We also proposed end-to-end waveform enhancement to directly improve the intelligibility and quality of speech. In addition, we have developed a novel integrated deep and ensemble learning algorithm (IDEA) [15] and an environment-adaptive algorithm based on the domain adversarial training criterion [16] to address possible mismatch issues in real-world applications.

Multimodal speech signal processing

Communication can be verbal or nonverbal. Verbal communication includes speaking and listening. The speaker relays verbal information, while the listener attends to both the auditory signals and the visual cues for speech recognition. The visual signals may include articulatory movements, facial expressions, and co-speech gestures of the speaker, which constitute the nonverbal part of the communication process. In various applied speech technologies, it has been shown that audio-visual integration can assist in human information exchange and the development of human-computer interfaces. We have conducted studies on combining audio and video information to improve speech signal processing performance. To date, we have developed novel algorithms that fuse audio and visual information for emotion recognition [17], oral presentation scoring [18], and speech enhancement [19].

Increasing Compactness of Deep Learning based Speech Enhancement Models

Most recent studies on deep learning-based speech enhancement (SE) have focused on improving denoising performance. However, successful SE applications require a balance between denoising performance and computational cost in real scenarios. We have investigated two approaches to effectively compress deep learning models so that SE can be performed on edge devices: model pruning and parameter quantization. For model pruning, a computation-performance optimization (CPO) algorithm was developed [20] to remove redundant channels in a neural network, as shown in Fig. 4. For parameter quantization, we proposed an exponent-only floating point quantized neural network (EOFP-QNN) to compress the model and enhance inference efficiency [21]. Both the model pruning and parameter quantization techniques can significantly reduce model size and increase inference efficiency with an acceptable drop in performance.
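As a rough illustration of the parameter quantization idea, the sketch below drops mantissa bits from a 32-bit floating point value while keeping its sign and exponent. It is a generic bit-level example under stated assumptions (IEEE 754 single precision, round-half-up on the dropped bits, a hypothetical function name) and does not claim to reproduce the exact scheme in [21].

```python
import struct


def quantize_float32(x, mantissa_bits=0):
    """Keep the sign, the 8 exponent bits, and the top `mantissa_bits`
    mantissa bits of x viewed as an IEEE 754 float32.

    mantissa_bits=0 leaves only sign and exponent, so every nonzero
    value becomes a signed power of two.
    Assumption: the dropped bits are rounded half-up, which may carry
    into the exponent. Special values (inf, NaN) are not handled.
    """
    if not 0 <= mantissa_bits <= 23:
        raise ValueError("mantissa_bits must be between 0 and 23")
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - mantissa_bits
    if drop > 0:
        half = 1 << (drop - 1)       # half of the dropped range
        keep_mask = ~((1 << drop) - 1)
        bits = (bits + half) & keep_mask & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Hypothetical weights quantized to exponent-only values.
weights = [0.6, 0.75, -3.0, 0.0]
quantized = [quantize_float32(w) for w in weights]
```

With no mantissa bits, each stored parameter needs only the sign and exponent fields (9 bits in this layout instead of 32), at the cost of coarser weight values.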

Speech Enhancement with Direct Evaluation Metric Optimization

During the training of an SE model, an objective function is used to optimize the model parameters. In the existing literature, the model optimization criterion is often inconsistent with the criterion used to evaluate the enhanced speech. For example, intelligibility is most often evaluated with the short-time objective intelligibility (STOI) measure, whereas a frame-based mean square error (MSE) between the enhanced speech and the clean reference is widely used to optimize the model. Because of this inconsistency, there is no guarantee that the trained model will perform optimally in different applications [13]. We therefore investigated several algorithms that directly optimize model parameters based on evaluation metrics, including STOI [13], perceptual evaluation of speech quality (PESQ) [14], and automatic speech recognition (ASR) performance [22]. Because some evaluation metrics are complex and not differentiable, reinforcement learning and GAN-based methods have also been exploited for the optimization. Experimental results show that an SE model trained with a specific evaluation metric as its objective function yields better performance on that metric than a model trained with the MSE.
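The mismatch between the MSE and a correlation-driven metric can be seen in a small numerical sketch. The example below uses the plain Pearson correlation as a simplified stand-in for STOI (which is built on short-time correlations between clean and processed speech); it is not the STOI measure itself, and the toy signals are hypothetical.

```python
import math


def mse(ref, est):
    """Mean square error between a reference and an estimate."""
    return sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref)


def correlation(ref, est):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(ref)
    mean_r, mean_e = sum(ref) / n, sum(est) / n
    num = sum((r - mean_r) * (e - mean_e) for r, e in zip(ref, est))
    var_r = sum((r - mean_r) ** 2 for r in ref)
    var_e = sum((e - mean_e) ** 2 for e in est)
    den = math.sqrt(var_r * var_e)
    return num / den if den > 0 else 0.0


clean = [1.0, -1.0, 1.0, -1.0]
est_a = [0.5 * x for x in clean]   # attenuated copy of the clean signal
est_b = [1.5, -0.5, 0.5, -1.5]     # same error energy, different error shape

scores = {
    "a": (mse(clean, est_a), correlation(clean, est_a)),
    "b": (mse(clean, est_b), correlation(clean, est_b)),
}
```

Both estimates have the same MSE, yet their correlation with the clean signal differs, so a model trained only on the MSE has no reason to prefer the estimate that the correlation-based score favors.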

Multimodal Pathological Voice Classification

We have conducted research on pathological voice classification based on medical records [23] and voice signals [24]. The results showed that voice disorders can be accurately identified from voice signals and medical records when advanced signal processing and machine learning methods are used. Based on the voice data, we organized the Pathological Voice Detection Challenge at IEEE Big Data 2018 [25], which attracted 109 participating teams from 27 countries. More recently, we investigated the combination of acoustic signals and medical records and derived a multimodal deep learning model. The proposed model consists of two stages: the first stage processes acoustic features and medical data individually, and the second stage integrates the outputs of the first stage to perform classification. The proposed multimodal deep learning framework was evaluated using 589 samples collected from Far Eastern Memorial Hospital, covering three categories of vocal disease: glottic neoplasm, phonotraumatic lesions, and vocal paralysis. The multimodal model yielded promising results compared with systems that use only acoustic signals or only medical records.
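The two-stage structure can be sketched as a simple late-fusion step. The example assumes that each first-stage model outputs a probability for each of the three disease categories, and it replaces the learned second stage with a fixed weighted average; the function names, the weight, and the probability values are hypothetical and only illustrate the data flow.

```python
CLASSES = ("glottic neoplasm", "phonotraumatic lesions", "vocal paralysis")


def fuse_posteriors(p_acoustic, p_medical, w_acoustic=0.5):
    """Combine per-class probabilities from the two first-stage models.

    Assumption: a fixed weighted average stands in for the learned
    second stage described in the text.
    """
    if len(p_acoustic) != len(p_medical):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= w_acoustic <= 1.0:
        raise ValueError("w_acoustic must lie in [0, 1]")
    fused = [w_acoustic * a + (1.0 - w_acoustic) * m
             for a, m in zip(p_acoustic, p_medical)]
    total = sum(fused)
    return [f / total for f in fused]  # renormalize to sum to one


def classify(p_acoustic, p_medical, w_acoustic=0.5):
    """Return the class label with the highest fused probability."""
    fused = fuse_posteriors(p_acoustic, p_medical, w_acoustic)
    best = max(range(len(fused)), key=fused.__getitem__)
    return CLASSES[best]


# Hypothetical first-stage outputs for one patient.
p_acoustic = [0.5, 0.3, 0.2]
p_medical = [0.2, 0.2, 0.6]
label_equal = classify(p_acoustic, p_medical)          # equal weighting
label_acoustic = classify(p_acoustic, p_medical, 0.9)  # trust acoustics more
```

In this toy case the decision changes with the weighting, which shows why the integration stage matters when the two modalities disagree.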


[1] A. Chern, Y.-H. Lai, Y.-p. Chang, Y. Tsao, R. Y. Chang, and H.-W. Chang, “A Smartphone-Based Multi-Functional Hearing Assistive System to Facilitate Speech Recognition in the Classroom,” IEEE Access, vol. 5, pp. 10339-10351, 2017 (This paper has been selected as a Featured Article in IEEE Access).

[2] Y.-C. Lin, Y.-H. Lai, H.-W. Chang, Y. Tsao, Y.-p. Chang, and R. Y. Chang, “A Smartphone-Based Remote Microphone Hearing Assistive System Using Wireless Technologies,” IEEE Systems Journal, vol. 12(1), pp. 20-29, 2018.

[3] Y.-T. Liu, R. Y. Chang, Y. Tsao, and Y.-p. Chang, “A New Frequency Lowering Technique for Mandarin-speaking Hearing Aid Users,” in Proc. GlobalSIP 2015.

[4] Y.-T. Liu, Y. Tsao, and R. Y. Chang, “Nonnegative Matrix Factorization-based Frequency Lowering Technology for Mandarin-speaking Hearing Aid Users,” in Proc. ICASSP 2016.

[5] Y.-H. Lai, Y. Tsao, X. Lu, F. Chen, Y.-T. Su, K.-C. Chen, Y.-H. Chen, L.-C. Chen, P.-H. Li, and C.-H. Lee, “Deep Learning based Noise Reduction Approach to Improve Speech Intelligibility for Cochlear Implant Recipients,” Ear and Hearing, vol. 39(4), pp. 795-809, 2018.

[6] Y.-H. Lai, F. Chen, S.-S. Wang, X. Lu, Y. Tsao, and C.-H. Lee, “A Deep Denoising Autoencoder Approach to Improving the Intelligibility of Vocoded Speech in Cochlear Implant Simulation,” IEEE Transactions on Biomedical Engineering, vol. 64(7), pp. 1568-1578, 2017.

[7] Y.-H. Lai, Y. Tsao, and F. Chen, “Effects of Adaptation Rate and Noise Suppression on the Intelligibility of Compressed-Envelope Based Speech,” PLoS ONE, doi: 10.1371/journal.pone.0133519, 2015.

[8] S.-W. Fu, P.-C. Li, Y.-H. Lai, C.-C. Yang, L.-C. Hsieh, and Y. Tsao, “Joint Dictionary Learning-based Non-Negative Matrix Factorization for Voice Conversion to Improve Speech Intelligibility After Oral Surgery,” IEEE Transactions on Biomedical Engineering, vol. 64(11), pp. 2584-2594, 2017.

[9] L.-W. Chen, H.-Y. Lee, and Y. Tsao, “Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech,”

[10] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech Enhancement Based on Deep Denoising Autoencoder,” in Proc. Interspeech 2013.

[11] W.-J. Lee, S.-S. Wang, F. Chen, X. Lu, S.-Y. Chien, and Y. Tsao, “Speech Dereverberation Based on Integrated Deep and Ensemble Learning Algorithm,” in Proc. ICASSP 2018.

[12] H.-P. Liu, Y. Tsao, and C.-S. Fuh, “Bone-Conducted Speech Enhancement Using Deep Denoising Autoencoder,” Speech Communication, vol. 104, pp. 106-112, 2018.

[13] S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, and H. Kawai, “End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26(9), pp. 1570-1584, 2018.

[14] S.-W. Fu, Y. Tsao, H.-T. Hwang, and H.-M. Wang, “Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM,” in Proc. Interspeech 2018.

[15] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Ensemble Modeling of Denoising Autoencoder for Speech Spectrum Restoration,” in Proc. Interspeech 2014, pp. 885-889.

[16] C.-F. Liao, Y. Tsao, H.-y. Lee, and H.-M. Wang, “Noise Adaptive Speech Enhancement using Domain Adversarial Training,” in Proc. Interspeech 2019.

[17] W.-C. Chen, P.-T. Lai, Y. Tsao, and C.-C. Lee, “Multimodal Arousal Rating using Unsupervised Fusion Technique,” in Proc. ICASSP 2015.

[18] S.-W. Hsiao, H.-C. Sun, M.-C. Hsieh, M.-H. Tsai, Y. Tsao, and C.-C. Lee, “Toward Automating Oral Presentation Scoring during Principal Certification Program using Audio-Video Low-level Behavior Profiles,” IEEE Transactions on Affective Computing, in press.

[19] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, “Audio-visual Speech Enhancement using Multimodal Deep Convolutional Neural Networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2(2), pp. 117-128, 2018.

[20] C.-T. Liu, T.-W. Lin, Y.-H. Wu, Y.-S. Lin, H. Lee, Y. Tsao, and S.-Y. Chien, “Computation-Performance Optimization of Convolutional Neural Networks with Redundant Filter Removal,” IEEE Transactions on Circuits and Systems I, 2018.

[21] Y.-T. Hsu, Y.-C. Lin, S.-W. Fu, Y. Tsao, and T.-W. Kuo, “A Study on Speech Enhancement using Exponent-only Floating Point Quantized Neural Network (EOFP-QNN),” in Proc. SLT 2018.

[22] Y.-L. Shen, C.-Y. Huang, S.-S. Wang, Y. Tsao, H.-M. Wang, and T.-S. Chi, “Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition,” to appear in ICASSP 2019.

[23] S.-Y. Tsui, Y. Tsao, C.-W. Lin, S.-H. Fang, and C.-T. Wang, “Demographic and Symptomatic Features of Voice Disorders and Their Potential Application in Classification using Machine Learning Algorithms,” Folia Phoniatrica et Logopaedica, 2018.

[24] C.-T. Wang, F.-C. Lin, J.-Y. Chen, M.-J. Hsiao, S.-H. Fang, Y.-H. Lai, and Y. Tsao, “Detection of Pathological Voice Using Cepstrum Vectors: A Deep Learning Approach,” Journal of Voice, 2018.