|Title:||Speech enhancement using sparse representation methods|
|Authors:||Shen, Tak Wai|
|Degree:||Ph.D.|
|Issue Date:||2015|
|Abstract:||In this thesis, the problem of speech enhancement is investigated. Consistent with traditional frequency-domain speech enhancement algorithms, we investigate the estimation of several important parameters in speech enhancement, such as the speech periodogram, the a-priori Signal-to-Noise Ratio (SNR), and the Speech Presence Probability (SPP). In this study, we emphasize making use of the sparse representation of speech signals to improve these estimates. To this end, the wavelet denoising technique, cepstral analysis within an expectation-maximization (EM) framework, and a dictionary learning method based on sparse reconstruction of log-spectra are adopted, all achieving satisfactory results. The first part of this study concerns the estimation of the SPP. A reliable SPP estimator is important to many frequency-domain speech enhancement algorithms. A good SPP estimate can be obtained from a smooth a-posteriori SNR function, which in turn can be achieved by reducing the noise variance when estimating the speech power spectrum. Recently, the wavelet denoising with multitaper spectrum (MTS) estimation technique was suggested for this purpose. However, traditional approaches directly apply a wavelet shrinkage denoiser that has not been fully optimized for denoising the MTS of noisy speech signals. In this study, we propose a two-stage wavelet denoising algorithm for estimating the speech power spectrum. First, we apply the wavelet transform to the periodogram of the noisy speech signal. From the resulting wavelet coefficients, an oracle is developed to indicate the approximate locations of the noise floor in the periodogram. Second, we use the oracle developed in stage 1 to selectively remove the wavelet coefficients of the noise floor from the log MTS of the noisy speech.
The remaining wavelet coefficients are then used to reconstruct a denoised MTS, which in turn yields a smooth a-posteriori SNR function. To adapt to the enhanced a-posteriori SNR function, we further propose a new method to estimate the generalized likelihood ratio (GLR), an essential parameter for SPP estimation. Simulation results show that the new SPP estimator outperforms traditional approaches and improves both the quality and intelligibility of the enhanced speech.
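The two-stage scheme described above can be illustrated with a minimal numpy sketch. This is a simplification under stated assumptions, not the thesis's implementation: a single-level Haar transform stands in for the wavelet transform actually used, and the oracle threshold (a multiple of the median detail magnitude) is a hypothetical choice for marking noise-floor regions.

```python
import numpy as np

def haar_1d(x):
    # Single-level Haar analysis: approximation and detail coefficients.
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return a, d

def ihaar_1d(a, d):
    # Single-level Haar synthesis (exact inverse of haar_1d).
    x = np.empty(a.size * 2)
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def two_stage_denoise(periodogram, log_mts, floor_factor=3.0):
    """Sketch of the two-stage idea: stage 1 builds an oracle from the
    periodogram's wavelet detail coefficients; stage 2 removes the
    matching coefficients from the log multitaper spectrum.
    `floor_factor` is an assumed tuning constant."""
    # Stage 1: mark detail coefficients whose magnitude sits near the
    # noise floor of the (log) periodogram.
    _, d_pgram = haar_1d(np.log(periodogram + 1e-12))
    thresh = floor_factor * np.median(np.abs(d_pgram))
    oracle = np.abs(d_pgram) < thresh        # True where noise floor is likely
    # Stage 2: selectively zero those detail coefficients in the log MTS.
    a_mts, d_mts = haar_1d(log_mts)
    d_mts = np.where(oracle, 0.0, d_mts)
    return ihaar_1d(a_mts, d_mts)            # denoised log MTS
```

The denoised log MTS would then feed the a-posteriori SNR computation, whose smoothness is what the oracle-guided shrinkage is meant to preserve.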
While the wavelet transform can sparsely describe sudden changes in a speech power spectrum, it misses the periodic nature of speech signals, which is an important feature for speech enhancement. In the second part of this study, a new speech enhancement method based on the sparsity of speech in the cepstral domain is investigated. Voiced speech has a quasi-periodic nature that allows it to be compactly represented in the cepstral domain, a distinctive feature compared with noise. Recently, the temporal cepstrum smoothing (TCS) algorithm was proposed and shown to be effective for speech enhancement in non-stationary noise environments. However, the lack of an automatic parameter-updating mechanism limits its adaptability to noisy speech with abrupt changes in SNR across time frames or frequency components. In this part, an improved speech enhancement algorithm based on a novel EM framework is proposed. The new algorithm starts with the traditional TCS method, which gives an initial estimate of the periodogram of the clean speech. An L1-norm regularizer is then applied in the M-step of the EM framework to estimate the true power spectrum of the original speech. This in turn enables the estimation of the a-priori SNR, which is used in the E-step, essentially an MMSE-LSA gain function, to refine the estimate of the clean speech periodogram. The M-step and E-step alternate until convergence. A notable improvement of the proposed algorithm over the traditional TCS method is its adaptability to changes, even abrupt changes, in the SNR of the noisy speech. The performance of the proposed algorithm is evaluated using standard measures on a large set of speech and noise signals. Evaluation results show that a significant improvement is achieved compared with conventional approaches.
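The EM loop described above can be sketched as follows. This is an illustrative outline, not the thesis's algorithm: spectral subtraction stands in for the TCS initialization, the L1 regularizer is realized as the standard soft-thresholding operator on cepstral coefficients, and a Wiener gain stands in for the MMSE-LSA gain function; `lam` and `n_iter` are assumed parameters.

```python
import numpy as np

def soft_threshold(c, lam):
    # Proximal operator of the L1 norm: sparsifies the cepstrum.
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

def em_cepstral_enhance(noisy_psd, noise_psd, lam=0.1, n_iter=10):
    """Hypothetical EM-style iteration on one frame's half-spectrum PSD.
    M-step: L1-sparsify the log spectrum in the cepstral domain.
    E-step: gain function driven by the resulting a-priori SNR."""
    # Initial clean-speech PSD guess (spectral subtraction as a stand-in
    # for the TCS initialization used in the thesis).
    speech_psd = np.maximum(noisy_psd - noise_psd, 1e-10)
    for _ in range(n_iter):
        # M-step: cepstrum = inverse FFT of the log spectrum; threshold it.
        cep = np.fft.irfft(np.log(speech_psd))
        cep = soft_threshold(cep, lam)
        speech_psd = np.exp(np.fft.rfft(cep).real)
        # E-step: a-priori SNR drives a gain applied to the noisy PSD
        # (Wiener gain as a stand-in for MMSE-LSA).
        xi = speech_psd / noise_psd
        gain = xi / (1.0 + xi)
        speech_psd = np.maximum(gain ** 2 * noisy_psd, 1e-10)
    return speech_psd
```

Because the a-priori SNR `xi` is recomputed from the sparsified spectrum at every iteration, the loop adapts automatically as the SNR of the input changes, which is the property the thesis highlights over plain TCS.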
The above shows that obtaining a sparse representation of speech is one of the keys to designing an efficient speech enhancement algorithm. An obvious question then arises: is the cepstrum the best representation of speech as far as sparsity is concerned? To answer this question, we further investigate a new sparse-representation-based speech enhancement algorithm whose transform kernel is trained with a dictionary learning method. Dictionary learning allows the design of a transform kernel that emphasizes sparsity in the transform domain. When applied to speech enhancement, it allows a speech signal to be represented by very few significant transform coefficients. In practice, an overcomplete dictionary of clean speech is trained by an extended K-SVD algorithm in the log-power-spectra domain. The batch LARS with Coherence Criterion (LARC) method is used to reconstruct the log power spectra of the clean speech, and a new stopping criterion is proposed for the iterative speech enhancement process in order to adapt to various background noise environments. In addition, a modified two-step noise reduction with MMSE-LSA filtering is applied, which solves the bias problem of the estimated a-priori SNR. A notable improvement of the proposed algorithm over traditional speech enhancement methods is its adaptability to changes in the SNR of the noisy speech. The performance of the proposed algorithm is evaluated using standard measures on a large set of speech and noise signals. Evaluation results show that a significant improvement is achieved compared with traditional approaches, especially when the noise is not totally random but has certain structure in the frequency domain.
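The reconstruction step above, coding a log spectrum against a learned overcomplete dictionary with a coherence-based stopping rule, can be illustrated with a small sketch. Note the simplifications: a greedy matching-pursuit loop stands in for the actual batch LARS (LARC) solver, the dictionary `D` is assumed pre-trained (e.g. by K-SVD), and `mu_stop` is a hypothetical coherence threshold.

```python
import numpy as np

def sparse_code_coherence_stop(D, y, mu_stop=0.2, max_atoms=10):
    """Greedy sparse coding of signal y against dictionary D, stopping
    once no atom's correlation with the residual exceeds mu_stop
    (a matching-pursuit simplification of LARC, not LARC itself)."""
    D = D / np.linalg.norm(D, axis=0)          # unit-norm atoms
    residual = y.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(max_atoms):
        corr = D.T @ residual                  # correlation with every atom
        k = int(np.argmax(np.abs(corr)))
        # Coherence-style stop: best atom barely correlates with residual.
        if np.abs(corr[k]) / (np.linalg.norm(residual) + 1e-12) < mu_stop:
            break
        coeffs[k] += corr[k]                   # accumulate atom weight
        residual -= corr[k] * D[:, k]          # peel the atom off
    return coeffs, D @ coeffs                  # sparse code and reconstruction
```

In the enhancement pipeline, `y` would be a noisy log-power-spectrum frame and `D @ coeffs` the sparse reconstruction of the clean log spectrum; structured noise correlates poorly with speech-trained atoms, which is consistent with the reported gains on noise that has frequency-domain structure.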
|Subjects:||Speech processing systems; Hong Kong Polytechnic University -- Dissertations|
|Pages:||xiv, 162 pages : color illustrations|
|Appears in Collections:||Thesis|
View full-text via https://theses.lib.polyu.edu.hk/handle/200/8298