|Title:||Cantonese speech recognition|
|Subjects:||Automatic speech recognition|
Speech processing systems
Cantonese dialects -- Data processing
Hong Kong Polytechnic University -- Dissertations
|Publisher:||The Hong Kong Polytechnic University|
|Abstract:||Cantonese speech recognition consists of three parts: translating perceived Cantonese speech into its tone patterns, translating it into its syllables, and converting both into text based on the contextual information in the passage. Our research focuses on the first part, Cantonese tone recognition, and leaves syllable recognition as future work. Language modeling requires further linguistic knowledge and search algorithms, so it should be completed as separate research work.|
Starting from this goal, we develop our research framework based on pitch synchronization. Pitch synchronization means that information is extracted in phase with the movement of pitch in the speech signal. In our research, the information is the tonal patterns of Chinese speech. Thus, pitch synchronization for tone recognition means that the pitch contour, i.e. the change in the fundamental frequency of the speech signal over time, is extracted by first identifying the beginning and end of each pitch period and then measuring the interval between each pair of adjacent pitch marks. The advantage of this pitch-synchronous pitch extraction over conventional non-pitch-synchronous methods, such as the autocorrelation method and cepstrum pitch determination, is that it is independent of the analysis frame size across different speakers. Hence it can handle both low-pitched and high-pitched speakers.
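The relation between pitch marks and the pitch contour can be illustrated with a minimal Python sketch. It assumes the epoch times are already known and strictly increasing; the function name `pitch_contour_from_epochs` and the synthetic data are hypothetical and are not taken from the thesis.

```python
import numpy as np

def pitch_contour_from_epochs(epoch_times_s):
    """Return (midpoint times, F0 in Hz) from sorted epoch times in seconds."""
    t = np.asarray(epoch_times_s, dtype=float)
    # Interval between each pair of adjacent pitch marks = one pitch period.
    periods = np.diff(t)
    if np.any(periods <= 0):
        raise ValueError("epoch times must be strictly increasing")
    f0 = 1.0 / periods                   # fundamental frequency per period
    midpoints = t[:-1] + periods / 2.0   # time stamp assigned to each period
    return midpoints, f0

# Synthetic example (assumed data): ten epochs spaced 5 ms apart.
epochs = np.arange(0, 10) * 0.005
times, f0 = pitch_contour_from_epochs(epochs)
```

Because each F0 value comes from a single measured period, no fixed analysis frame length has to be chosen in advance for a given speaker.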
Formally, the identification of these pitch marks is called epoch detection, in which each glottal closure instant during voicing is located. Our survey of existing epoch detection methods shows that the major problems are degraded performance in noise-contaminated environments and difficulty in identifying epochs at the boundaries of utterances. Wavelets are known for their good singularity detection ability, but they leave much room for improvement under these conditions. The difficulties come from the wavelet being 'too good' at singularity detection: viewed from another perspective, it is sensitive to noise and ineffective for weaker excitations. Hence, a matching scheme to confirm the existence of the epochs is essential, and the detection correctness largely depends on this matching scheme.
Our proposed Combined Wavelet Epoch Detector (CWED) is based on two wavelets, the Spline wavelet and the Gaussian wavelet, to improve the deterministic matching scheme. The rationale is to retain the good singularity detection property of the Spline wavelet for epoch detection while utilizing the coarse but robust epoch occurrence identification property of the Gaussian wavelet, which was found experimentally. The proposed scheme was tested under different noise conditions and achieves a 26% improvement in recall while retaining a relative position consistency of 1.4 ms.
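The idea of letting a coarse but robust detector confirm the candidates of a precise but noise-sensitive one can be sketched as follows. This is not the CWED matching scheme, which the abstract does not specify: the nearest-neighbour rule, the tolerance value, the function name `confirm_epochs`, and the candidate lists are all assumptions made for illustration.

```python
import numpy as np

def confirm_epochs(fine_candidates_s, coarse_candidates_s, tolerance_s=0.002):
    """For each coarse candidate, keep the nearest fine candidate within tolerance."""
    fine = np.sort(np.asarray(fine_candidates_s, dtype=float))
    confirmed = []
    for c in np.sort(np.asarray(coarse_candidates_s, dtype=float)):
        if fine.size == 0:
            break
        i = np.argmin(np.abs(fine - c))          # closest precise candidate
        if abs(fine[i] - c) <= tolerance_s:      # confirmed by the coarse detector
            confirmed.append(fine[i])
    return np.unique(confirmed)

# Assumed data: the fine detector reports one spurious candidate at 12.3 ms.
fine = [0.0100, 0.0123, 0.0150, 0.0201]
coarse = [0.0105, 0.0152, 0.0199]
confirmed = confirm_epochs(fine, coarse, tolerance_s=0.001)
```

In this toy case the spurious candidate has no coarse counterpart and is dropped, while the confirmed epochs keep the positions reported by the fine detector.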
The detected epochs are applied to tone recognition with our proposed Smoothed Contour Tone Recognizer (SCTR). The pitch contour is not measured directly from the intervals between the epoch marks, owing to identification errors made during detection. Instead, a smoothing algorithm is proposed and applied before the pitch frequencies are extracted for feature extraction and tone recognition. This smoothing algorithm uses a pitch tracking algorithm to distinguish perceptually good pitch frequencies from the irregular pitch frequencies caused by mistaken epochs, and estimates the complete pitch contour by linear/quadratic interpolation of the former subset. The accuracy for recognizing the six non-entering tones, averaged over the different noise types and noise levels (down to 0 dB), is 72% (male) and 75% (female) for the single-speaker cases, and 59% (male) and 69% (female) for the multiple-speaker cases. The overall improvement in accuracy over all SNRs (from clean down to -18 dB) compared with the baseline tone recognizer is 23% (male) and 19% (female) for the single-speaker case, and 18% (male) and 14% (female) for the multiple-speaker case. A further performance comparison of the SCTR was conducted by replacing the combined wavelet epoch detector (CWED) with the K&B algorithm. However, the results provide no evidence that the CWED improves tone recognition accuracy over the K&B algorithm.
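The smoothing step described above can be sketched roughly in Python. Two substitutions are assumptions of this sketch and not the thesis's method: a simple deviation-from-median test stands in for the pitch tracking algorithm, and a least-squares polynomial fit of degree 1 or 2 stands in for the linear/quadratic interpolation. The name `smooth_contour`, the threshold `max_rel_dev`, and the data are hypothetical.

```python
import numpy as np

def smooth_contour(times_s, f0_hz, max_rel_dev=0.2, degree=2):
    """Fit a low-order polynomial to the reliable pitch values only."""
    t = np.asarray(times_s, dtype=float)
    f = np.asarray(f0_hz, dtype=float)
    median_f0 = np.median(f)
    # Assumed reliability test: reject values far from the median F0.
    reliable = np.abs(f - median_f0) <= max_rel_dev * median_f0
    if reliable.sum() <= degree:
        raise ValueError("too few reliable pitch values to fit the contour")
    coeffs = np.polyfit(t[reliable], f[reliable], deg=degree)
    # Evaluate the fit at every time stamp to get the complete contour.
    return np.polyval(coeffs, t), reliable

# Assumed data: a rising contour with one halved F0 value (a missed epoch).
t = np.linspace(0.0, 0.2, 11)
true_f0 = 180.0 + 100.0 * t
observed = true_f0.copy()
observed[4] *= 0.5
smoothed, reliable = smooth_contour(t, observed, degree=1)
```

The irregular value is excluded from the fit, and the contour at that position is filled in from the reliable neighbours.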
|Description:||128 leaves : ill. ; 30 cm.|
PolyU Library Call No.: [THS] LG51 .H577M COMP 2001 LamY
|Rights:||All rights reserved.|
|Appears in Collections:||COMP Theses|
PolyU Electronic Theses
Files in This Item:
|b15784988_link.htm||For PolyU Users||179 B||HTML||View/Open|
|b15784988_ir.pdf||For All Users (Non-printable)||3.87 MB||Adobe PDF||View/Open|
All items in the PolyU Institutional Repository are protected by copyright, with all rights reserved, unless otherwise indicated. No item in the PolyU IR may be reproduced for commercial or resale purposes.