Automatic Segmentation of Speech into Sentences Using Prosodic Features
12.10.2010 - 17.11.2011
Segmenting speech into sentences is an important first step in several areas of speech processing. Automatic speech recognition (ASR) systems mostly produce an unstructured stream of words and do not detect the hidden structure of spoken language. Downstream natural language processing components, however, often need sentence-like units to work properly, and labeling boundaries by hand is very time-consuming. It is therefore reasonable to develop an algorithm that marks sentence and phrase boundaries automatically using prosodic features. In this project thesis, the Aix-MARSEC database of BBC radio speech is used for analysis.
The algorithm can be described as follows: An adaptive, energy-based voice activity detector (VAD) finds all active regions and yields the pause lengths and intensity as first features. These regions then serve as input to a pitch estimation algorithm. To assess tendencies at the region boundaries, we compute an optimal (in the least-squares sense) piecewise polynomial approximation of the pitch contour and derive various prosodic features (initial/final intonation, pitch gradient, downdrift, ...). Finally, the extracted features are combined in a decision tree to determine the sentence boundaries.
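
The first stages of this pipeline can be illustrated with a minimal Python sketch. It is not the implementation used in the thesis: the function names, the frame length, and the threshold rule (noise floor plus a fixed fraction of the energy range) are assumptions chosen for illustration, and a single least-squares straight line stands in for the piecewise polynomial approximation. Pitch estimation and the decision tree are left out.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive, non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def active_regions(energies, ratio=0.1):
    """Return (start, end) frame indices (end exclusive) of active regions.

    Assumed adaptation rule: the threshold follows the signal itself,
    set to the lowest frame energy plus `ratio` times the energy range.
    """
    floor, peak = min(energies), max(energies)
    threshold = floor + ratio * (peak - floor)
    regions, start = [], None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i                      # region begins
        elif energy <= threshold and start is not None:
            regions.append((start, i))     # region ends
            start = None
    if start is not None:                  # signal ends while still active
        regions.append((start, len(energies)))
    return regions


def pause_lengths(regions, frame_len, sample_rate):
    """Pause durations in seconds between consecutive active regions."""
    return [
        (next_start - prev_end) * frame_len / sample_rate
        for (_, prev_end), (next_start, _) in zip(regions, regions[1:])
    ]


def ls_slope(times, values):
    """Slope of the least-squares straight line through (times, values).

    A first-order stand-in for one segment of a piecewise polynomial fit;
    applied to pitch values it gives a simple pitch gradient.
    """
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


# Synthetic test signal: 0.5 s tone, 0.3 s silence, 0.4 s tone at 8 kHz.
SAMPLE_RATE = 8000
FRAME_LEN = 80  # 10 ms frames


def tone(duration_s, freq_hz=200.0):
    n = int(duration_s * SAMPLE_RATE)
    return [math.sin(2 * math.pi * freq_hz * k / SAMPLE_RATE) for k in range(n)]


signal = tone(0.5) + [0.0] * int(0.3 * SAMPLE_RATE) + tone(0.4)
regions = active_regions(frame_energies(signal, FRAME_LEN))
pauses = pause_lengths(regions, FRAME_LEN, SAMPLE_RATE)
print("active regions (frames):", regions)
print("pause lengths (s):", pauses)

# Hypothetical falling pitch contour (Hz) at the end of a region.
print("pitch gradient (Hz/frame):", ls_slope([0, 1, 2, 3], [200, 190, 180, 170]))
```

On real recordings the threshold would have to track a changing noise floor over time, and the pitch contour would be fitted per segment rather than with one line; the sketch only shows how pause lengths and a gradient feature fall out of the region boundaries.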