Automatic Segmentation of Speech into Sentences Using Prosodic Features

Authors Pausch, F.
Year 2011
Thesis Type Audio Engineering project
Topic Audio Signal Processing
Keywords speech processing
Abstract Segmenting speech into sentences is an important first step in several speech processing fields. Automatic speech recognition (ASR) algorithms mostly produce an unstructured stream of words and do not detect the hidden structure of spoken language. Natural language processing systems, however, often need sentence-like units to work properly, and hand-labeling is very time-consuming. It is therefore reasonable to develop an algorithm that marks sentence and phrase boundaries using prosodic features. In this project thesis, the Aix-MARSEC database of BBC radio speech is used for analysis. The algorithm works as follows: an adaptive, energy-based voice activity detector (VAD) gathers all active regions and calculates pause lengths and intensity as the first features. These active regions are then used as input for a pitch estimation algorithm. To assess tendencies at the region boundaries, we calculate an optimal (in the least-squares sense) piecewise polynomial approximation and derive various prosodic features (initial/final intonation, pitch gradient, downdrift, ...). Finally, the extracted features are combined in a decision tree to determine the sentence boundaries.
Supervisors Jany-Luig, J.
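
The first stage described in the abstract (energy-based voice activity detection followed by pause-length measurement) can be illustrated with a minimal sketch. This is not the thesis implementation: the frame length, the threshold rule (a fixed fraction between the quietest and loudest frame, standing in for the unspecified adaptation scheme), and all function names are assumptions chosen for illustration, and the input is a synthetic tone rather than Aix-MARSEC data.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def active_regions(energies, ratio=0.1):
    """Return (start_frame, end_frame_exclusive) runs above the threshold.

    Assumed threshold rule: the quietest frame's energy plus `ratio`
    times the range between the quietest and loudest frame, so the
    threshold follows the level of the recording.
    """
    if not energies:
        return []
    floor, peak = min(energies), max(energies)
    threshold = floor + ratio * (peak - floor)
    regions, start = [], None
    for i, energy in enumerate(energies):
        if energy > threshold and start is None:
            start = i                      # an active region begins
        elif energy <= threshold and start is not None:
            regions.append((start, i))     # the active region ends
            start = None
    if start is not None:                  # signal ends while still active
        regions.append((start, len(energies)))
    return regions


def pause_lengths(regions, frame_len, sample_rate):
    """Duration in seconds of each gap between consecutive active regions."""
    return [
        (next_start - prev_end) * frame_len / sample_rate
        for (_, prev_end), (next_start, _) in zip(regions, regions[1:])
    ]


# Synthetic test signal (hypothetical): 0.5 s tone, 0.3 s silence, 0.5 s tone.
SAMPLE_RATE = 8000
FRAME_LEN = 80  # 10 ms frames


def tone(n_samples, freq=200.0, amp=0.5):
    return [amp * math.sin(2 * math.pi * freq * n / SAMPLE_RATE)
            for n in range(n_samples)]


signal = tone(4000) + [0.0] * 2400 + tone(4000)
energies = frame_energies(signal, FRAME_LEN)
regions = active_regions(energies)
pauses = pause_lengths(regions, FRAME_LEN, SAMPLE_RATE)
print(regions, pauses)
```

On this synthetic signal the sketch finds two active regions separated by a single pause of 0.3 s. Pause lengths obtained this way are the kind of feature that, together with intensity and the pitch-derived features, would feed the decision tree mentioned in the abstract.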