The Bio-ASP Lab at CITI, Academia Sinica was founded in November 2011. We are dedicated to developing novel acoustic signal processing and artificial intelligence algorithms and applying them to biomedical and biology-related tasks.
Our main research areas include:
AI for assistive speech communication technologies
Deep learning-based speech signal processing
Multi-modal speech signal processing
Soundscape information retrieval
To date, the Bio-ASP Lab has published more than 52 journal papers and 120 international conference papers. Among them, the Bio-ASP Lab received the Best Poster Presentation Award at IEEE MIT URTC 2017, the Poster Presentation Award at APSIPA 2017, the Best Paper Award at ROCLING 2017, and the Excellent Paper Award at TAAI 2012. In addition, an Interspeech 2014 paper was nominated for the best paper award, four papers received ISCA travel grant awards, and one paper received an ICML travel grant. A co-advised PhD student received the ACLCLP PhD Thesis Award in 2018. The Bio-ASP Lab also received the Academia Sinica Career Development Award in 2017 and the National Innovation Award in 2018.
The growing proportion of elderly people in the population and the inappropriate use of portable audio devices have led to a rapid increase in the incidence of hearing loss. Untreated hearing loss can cause feelings of loneliness and isolation in the elderly and may lead to learning difficulties in students. Over the past few years, our group has investigated the application of machine learning and signal processing algorithms in FM assistive hearing systems [1, 2], hearing aids [3, 4], and cochlear implants (CIs) [5-7] to improve speech communication in hearing-impaired patients and thereby enhance their quality of life. In addition to assistive listening devices, we have also investigated machine learning-based assistive speaking devices to enhance intelligibility in individuals with speech and language disorders [8].
Oral cancer ranks among the top five cancers in Taiwan. Treating oral cancer often requires surgery that removes parts of the patient's articulators; as a result, the patient's speech may become distorted and difficult to understand. To overcome this problem, we proposed two voice conversion (VC) approaches: the first is joint dictionary learning-based non-negative matrix factorization (JD-NMF) [8], and the second is an end-to-end generative adversarial network (GAN)-based unsupervised VC model [9]. Experimental results show that both approaches can convert the distorted speech signals into ones with improved intelligibility.
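To make the dictionary-based approach more concrete, the following is a minimal sketch of the joint-dictionary NMF idea, assuming frame-aligned parallel magnitude spectrograms and illustrative dimensions; it is a conceptual illustration, not the published JD-NMF implementation [8].

```python
# Conceptual sketch of joint-dictionary NMF for voice conversion.
# All data below are random placeholders; dimensions are illustrative assumptions.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
T, F, K = 200, 257, 40                       # frames, frequency bins, dictionary atoms
X_src = np.abs(rng.standard_normal((T, F)))  # placeholder: distorted-speech spectrogram
X_tgt = np.abs(rng.standard_normal((T, F)))  # placeholder: intelligible reference spectrogram

# 1) Learn a joint dictionary: stack source/target bins so both share one activation matrix.
X_joint = np.hstack([X_src, X_tgt])                    # (T, 2F)
nmf = NMF(n_components=K, init="nndsvda", max_iter=400)
A = nmf.fit_transform(X_joint)                         # shared activations (T, K), training only
D_src, D_tgt = nmf.components_[:, :F], nmf.components_[:, F:]  # paired dictionaries (K, F)

# 2) Conversion: encode new distorted speech with D_src (multiplicative updates, D fixed),
#    then decode the same activations with D_tgt.
def encode(X, D, n_iter=200, eps=1e-9):
    W = np.abs(rng.standard_normal((X.shape[0], D.shape[0]))) + eps
    for _ in range(n_iter):
        W *= (X @ D.T) / (W @ D @ D.T + eps)
    return W

X_new = np.abs(rng.standard_normal((T, F)))            # unseen distorted utterance (placeholder)
W_new = encode(X_new, D_src)
X_converted = W_new @ D_tgt                            # spectrogram with target characteristics
```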
[1] A. Chern, Y.-H. Lai, Y.-p. Chang, Y. Tsao, R. Y. Chang, and H.-W. Chang, “A Smartphone-Based Multi-Functional Hearing Assistive System to Facilitate Speech Recognition in the Classroom,” IEEE Access, vol. 5, pp. 10339-10351, 2017 (This paper has been selected as a Featured Article in IEEE Access).
[2] Y.-C. Lin, Y.-H. Lai, H.-W. Chang, Y. Tsao, Y.-p. Chang, and R. Y. Chang, “A Smartphone-Based Remote Microphone Hearing Assistive System Using Wireless Technologies,” IEEE Systems Journal, vol. 12(1), pp. 20-29, 2018.
[3] Y.-T. Liu, R. Y. Chang, Y. Tsao, and Y.-p. Chang, “A New Frequency Lowering Technique for Mandarin-speaking Hearing Aid Users,” in Proc. GlobalSIP 2015.
[4] Y.-T. Liu, Y. Tsao, and R. Y. Chang, “Nonnegative Matrix Factorization-based Frequency Lowering Technology for Mandarin-speaking Hearing Aid Users,” in Proc. ICASSP 2016.
[5] Y.-H. Lai, Y. Tsao, X. Lu, F. Chen, Y.-T. Su, K.-C. Chen, Y.-H. Chen, L.-C. Chen, P.-H. Li, and C.-H. Lee, “Deep Learning based Noise Reduction Approach to Improve Speech Intelligibility for Cochlear Implant Recipients,” Ear and Hearing, vol. 39(4), pp. 795-809, 2018.
[6] Y.-H. Lai, F. Chen, S.-S. Wang, X. Lu, Y. Tsao, and C.-H. Lee, “A Deep Denoising Autoencoder Approach to Improving the Intelligibility of Vocoded Speech in Cochlear Implant Simulation,” IEEE Transactions on Biomedical Engineering, vol. 64(7), pp. 1568-1578, 2017.
[7] Y.-H. Lai, Y. Tsao, and F. Chen, “Effects of Adaptation Rate and Noise Suppression on the Intelligibility of Compressed-Envelope Based Speech,” PLoS ONE, vol. 10, e0133519, 2015.
[8] S.-W. Fu, P.-C. Li, Y.-H. Lai, C.-C. Yang, L.-C. Hsieh, and Y. Tsao, “Joint Dictionary Learning-based Non-Negative Matrix Factorization for Voice Conversion to Improve Speech Intelligibility After Oral Surgery,” IEEE Transactions on Biomedical Engineering, vol. 64 (11), pp. 2584-2594, 2016.
[9] L.-W. Chen, H.-Y. Lee, and Y. Tsao, “Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech,” arXiv preprint, http://arxiv.org/abs/1810.12656
In this research direction, we have focused on deriving novel deep learning-based algorithms for denoising [10], dereverberation [11], and channel compensation [12] of speech signals. The goal is to enhance speech signals in order to achieve improved human-human and human-machine communication efficacy. We investigated approaches to enhance speech intelligibility [13] and quality [14] to facilitate higher speech recognition rates and improved communication. We also proposed end-to-end waveform enhancement to directly improve the intelligibility and quality of speech. In addition, we have developed a novel integrated deep and ensemble learning algorithm (IDEA) [15] and an environment-adaptive algorithm based on the domain adversarial training criterion [16] for speech signal processing to address possible mismatch issues in real-world applications.
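As a concrete illustration of the spectral-mapping approach, the following is a minimal sketch of a deep denoising autoencoder in the spirit of [10]; the layer sizes, features, and training loop are illustrative assumptions rather than the published configuration.

```python
# Minimal sketch of a deep denoising autoencoder (DDAE) for spectral speech enhancement.
# Paired noisy/clean frames are random placeholders; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class DDAE(nn.Module):
    def __init__(self, n_bins=257, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_bins),          # predicts the clean log-magnitude spectrum
        )

    def forward(self, noisy_logmag):            # (batch, n_bins)
        return self.net(noisy_logmag)

model = DDAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
mse = nn.MSELoss()

noisy = torch.randn(32, 257)                    # dummy noisy log-magnitude frames
clean = torch.randn(32, 257)                    # dummy paired clean targets

for _ in range(10):                             # a few illustrative training steps
    optimizer.zero_grad()
    loss = mse(model(noisy), clean)             # frame-wise MSE objective
    loss.backward()
    optimizer.step()
```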
[10] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech Enhancement Based on Deep Denoising Autoencoder,” in Proc. Interspeech 2013.
[11] W.-J. Lee, S.-S. Wang, F. Chen, X. Lu, S.-Y. Chien, and Y. Tsao, “Speech Dereverberation Based on Integrated Deep and Ensemble Learning Algorithm,” in Proc. ICASSP 2018.
[12] H.-P. Liu, Y. Tsao, and C.-S. Fuh, “Bone-Conducted Speech Enhancement Using Deep Denoising Autoencoder,” Speech Communication, vol. 104, pp. 106-112, 2018.
[13] S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, and H. Kawai, “End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks,” IEEE Transactions on Audio, Speech and Language Processing, vol. 26(9), pp. 1570-1584, 2018.
[14] S.-W. Fu, Y. Tsao, H.-T. Hwang, and H.-M. Wang, “Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM,” in Proc. Interspeech 2018.
[15] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Ensemble Modeling of Denoising Autoencoder for Speech Spectrum Restoration,” in Proc. Interspeech 2014, pp. 885-889.
[16] C.-F. Liao, Y. Tsao, H.-y. Lee, and H.-M. Wang, “Noise Adaptive Speech Enhancement using Domain Adversarial Training,” in Proc. Interspeech 2019.
Communication can be verbal or nonverbal. Verbal communication includes speaking and listening: the speaker relays verbal information, while the listener focuses on auditory signals and visual cues for speech recognition. The visual signals may include the articulatory movements, facial expressions, and co-speech gestures of the speaker, which constitute the nonverbal part of the communication process. In various applied speech technologies, it has been shown that audio-visual integration can assist in human information exchange and the development of human-computer interfaces. We have conducted studies on incorporating audio and video information to improve speech signal processing performance.
To date, we have developed novel algorithms that fuse audio and visual information for emotion recognition [17], oral presentation scoring [18], and speech enhancement [19].
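As an illustration of how the two streams can be combined, below is a minimal two-branch fusion network sketch; the branch architectures and feature dimensions (e.g., spectral bins and facial-landmark coordinates) are assumptions for illustration, not the models used in [17]-[19].

```python
# Minimal sketch of early audio-visual fusion for a speech task.
# Inputs are random placeholders; dimensions and layers are illustrative assumptions.
import torch
import torch.nn as nn

class AudioVisualNet(nn.Module):
    def __init__(self, audio_dim=257, visual_dim=136, hidden=256, out_dim=257):
        super().__init__()
        self.audio_branch  = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.visual_branch = nn.Sequential(nn.Linear(visual_dim, hidden), nn.ReLU())
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),      # e.g., an enhanced spectrum or class logits
        )

    def forward(self, audio_feat, visual_feat):
        a = self.audio_branch(audio_feat)             # per-frame acoustic embedding
        v = self.visual_branch(visual_feat)           # per-frame lip/landmark embedding
        return self.fusion(torch.cat([a, v], dim=-1)) # concatenate, then jointly process

net = AudioVisualNet()
out = net(torch.randn(8, 257), torch.randn(8, 136))   # dummy paired audio/visual frames
```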
[17] W.-C. Chen, P.-T. Lai, Y. Tsao, and C.-C. Lee, “Multimodal Arousal Rating using Unsupervised Fusion Technique,” in Proc. ICASSP 2015.
[18] S.-W. Hsiao, H.-C. Sun, M.-C. Hsieh, M.-H. Tsai, Y. Tsao, and C.-C. Lee, “Toward Automating Oral Presentation Scoring during Principal Certification Program using Audio-Video Low-level Behavior Profiles,” IEEE Transactions on Affective Computing, in press.
[19] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, “Audio-visual Speech Enhancement using Multimodal Deep Convolutional Neural Networks,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2(2), pp. 117-128, 2018.
Most recent studies on deep learning-based speech enhancement (SE) have focused on improving denoising performance. However, successful SE applications require a balance between denoising performance and computational cost in real-world scenarios. We have investigated two approaches to effectively compress deep learning models so that SE can be performed on edge devices: model pruning and parameter quantization. For model pruning, a computation-performance optimization (CPO) algorithm was developed [20] to remove redundant channels in a neural network, as shown in Fig. 4. For parameter quantization, we proposed an exponent-only floating-point quantized neural network (EOFP-QNN) to compress the model and improve inference efficiency [21]. Both the model pruning and parameter quantization techniques can significantly reduce model size and increase inference efficiency with an acceptable drop in performance.
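The following conceptual sketches illustrate the two compression directions; the L1-based channel-ranking rule and the mantissa-dropping quantizer are simplified stand-ins, not the exact CPO [20] or EOFP-QNN [21] procedures.

```python
# Simplified illustrations of channel pruning and exponent-only quantization.
import numpy as np

def prune_channels(conv_weight, keep_ratio=0.5):
    """Rank output channels of a conv kernel (out_ch, in_ch, kh, kw) by L1 norm
    and keep only the strongest ones (a simplified pruning criterion)."""
    scores = np.abs(conv_weight).reshape(conv_weight.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(keep_ratio * conv_weight.shape[0]))
    keep = np.argsort(scores)[-n_keep:]
    return conv_weight[keep], keep             # pruned kernel and surviving channel indices

def exponent_only_quantize(w, eps=1e-12):
    """Drop the mantissa of each float32 weight, keeping only sign and exponent,
    i.e., snap every weight to a signed power of two (conceptual EOFP-style quantizer)."""
    w = np.asarray(w, dtype=np.float32)
    mag = np.maximum(np.abs(w), eps)
    return (np.sign(w) * np.exp2(np.floor(np.log2(mag)))).astype(np.float32)

conv = np.random.randn(64, 32, 3, 3).astype(np.float32)    # dummy conv layer weights
pruned, kept = prune_channels(conv, keep_ratio=0.25)
quantized = exponent_only_quantize(pruned)
```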
[20] C.-T. Liu, T.-W. Lin, Y.-H. Wu, Y.-S. Lin, H. Lee, Y. Tsao, and S.-Y. Chien, “Computation-Performance Optimization of Convolutional Neural Networks with Redundant Filter Removal,” IEEE Transactions on Circuits and Systems I 2018.
[21] Y.-T. Hsu, Y.-C. Lin, S.-W. Fu, Y. Tsao, and T.-W. Kuo, “A Study on Speech Enhancement using Exponent-only Floating Point Quantized Neural Network (EOFP-QNN),” in Proc. SLT 2018.
During the training of an SE model, an objective function is used to optimize the model parameters. In the existing literature, there is an inconsistency between the model optimization criterion and the evaluation criterion for the enhanced speech. For example, intelligibility is most often evaluated with the short-time objective intelligibility (STOI) measure, whereas a frame-based mean squared error (MSE) between the enhanced speech and the clean reference is widely used to optimize the model. Due to this inconsistency, there is no guarantee that the trained model will provide optimal performance in different applications [13]. We therefore investigated several algorithms that aim to directly optimize model parameters based on evaluation metrics, including STOI [13], perceptual evaluation of speech quality (PESQ) [14], and automatic speech recognition (ASR) accuracy [22].
Reinforcement learning and GAN-based methods have also been exploited to facilitate optimization, given that some evaluation metrics are complex and not differentiable. Experimental results show that using the target evaluation metric itself as the objective function yields SE models that outperform MSE-trained models on that metric.
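The sketch below illustrates the general pattern of metric-oriented training: the SE model is optimized to maximize a differentiable surrogate of the evaluation metric instead of minimizing frame-wise MSE. The correlation-based surrogate here is a toy stand-in for STOI/PESQ, not the objective used in [13], [14], or [22].

```python
# Minimal sketch of metric-oriented training with a differentiable surrogate objective.
# The SE model, data, and surrogate metric are all illustrative placeholders.
import torch
import torch.nn as nn

def correlation_surrogate(enhanced, clean, eps=1e-8):
    """Mean per-utterance correlation between enhanced and clean spectra
    (loosely mimicking STOI's correlation of short-time envelopes)."""
    e = enhanced - enhanced.mean(dim=-1, keepdim=True)
    c = clean - clean.mean(dim=-1, keepdim=True)
    corr = (e * c).sum(-1) / (e.norm(dim=-1) * c.norm(dim=-1) + eps)
    return corr.mean()

se_model = nn.Sequential(nn.Linear(257, 512), nn.ReLU(), nn.Linear(512, 257))
optimizer = torch.optim.Adam(se_model.parameters(), lr=1e-4)

noisy, clean = torch.randn(16, 257), torch.randn(16, 257)   # dummy paired frames
for _ in range(10):                                         # a few illustrative steps
    optimizer.zero_grad()
    score = correlation_surrogate(se_model(noisy), clean)
    (-score).backward()                  # maximize the metric surrogate instead of MSE
    optimizer.step()
```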
[13] S.-W. Fu, T.-W. Wang, Y. Tsao, X. Lu, and H. Kawai, “End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks,” IEEE Transactions on Audio, Speech and Language Processing, vol. 26(9), pp. 1570-1584, 2018.
[14] S.-W. Fu, Y. Tsao, H.-T. Hwang, and H.-M. Wang, “Quality-Net: An End-to-End Non-intrusive Speech Quality Assessment Model based on BLSTM,” in Proc. Interspeech 2018.
[22] Y.-L. Shen, C.-Y. Huang, S.-S. Wang, Y. Tsao, H.-M. Wang, and T.-S. Chi, “Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition,” to appear in ICASSP 2019.
We have conducted research on pathological voice classification based on medical records [23] and voice signals [24]. The results showed that voice disorders can be accurately identified from voice signals and medical records when advanced signal processing and machine learning methods are utilized. Based on the voice data, we organized a Pathological Voice Detection Challenge at IEEE Big Data 2018 [25], which attracted 109 participating teams from 27 different countries.
More recently, we investigated the combination of acoustic signals and medical records and derived a multimodal deep learning model. The proposed model consists of two stages: the first stage processes the acoustic features and medical data individually, and the second stage integrates the outputs of the first stage to perform classification. The proposed multimodal deep learning frameworks were evaluated on 589 samples collected from Far Eastern Memorial Hospital, covering three categories of voice disorders, i.e., glottic neoplasm, phonotraumatic lesions, and vocal paralysis. We obtained promising experimental results compared with systems that use only acoustic signals or medical records.
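A minimal sketch of the two-stage idea is given below: modality-specific encoders for acoustic features and medical-record fields, followed by a fusion classifier over the three disorder categories. The feature dimensions and layer sizes are illustrative assumptions, not the evaluated model.

```python
# Minimal sketch of a two-stage multimodal classifier for pathological voice detection.
# Inputs are random placeholders; dimensions and layers are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStageVoiceClassifier(nn.Module):
    def __init__(self, acoustic_dim=40, record_dim=20, hidden=64, n_classes=3):
        super().__init__()
        # Stage 1: process each modality individually.
        self.acoustic_enc = nn.Sequential(nn.Linear(acoustic_dim, hidden), nn.ReLU())
        self.record_enc   = nn.Sequential(nn.Linear(record_dim, hidden), nn.ReLU())
        # Stage 2: integrate the two embeddings and classify.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),   # glottic neoplasm / phonotrauma / vocal paralysis
        )

    def forward(self, acoustic_feat, record_feat):
        z = torch.cat([self.acoustic_enc(acoustic_feat),
                       self.record_enc(record_feat)], dim=-1)
        return self.classifier(z)

model = TwoStageVoiceClassifier()
logits = model(torch.randn(4, 40), torch.randn(4, 20))      # dummy batch of 4 patients
```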
[23] S.-Y. Tsui, Y. Tsao, C.-W. Lin, S.-H. Fang, and C.-T. Wang, “Demographic and Symptomatic Features of Voice Disorders and Their Potential Application in Classification using Machine Learning Algorithms,” Folia Phoniatrica et Logopaedica, 2018.
[24] C.-T. Wang, F.-C. Lin, J.-Y. Chen, M.-J. Hsiao, S.-H. Fang, Y.-H. Lai, and Y. Tsao, “Detection of Pathological Voice Using Cepstrum Vectors: A Deep Learning Approach,” Journal of Voice, 2018.
[25] https://femh-challenge2018.weebly.com/