A Deep Learning Approach to Speech, Music and Environmental Noise Classification

Authors Rothmund, F.
Year 2017
Thesis Type Master's thesis
Topic Sonification
Abstract The goal of this thesis was to develop an algorithm that effectively classifies audio data into the categories Music, Speech and Environmental Noise. Traditionally, audio classification tasks have been approached with standard classification procedures applied to hand-crafted descriptive features derived from the audio waveform. In recent years, inspired by their success in image classification and object recognition, Deep Neural Nets (DNNs) have been applied to different audio classification tasks with promising results. This work assesses both traditional machine learning algorithms and state-of-the-art deep learning methods for real-time audio classification. The algorithms were evaluated on audio data recorded with a far-field microphone array in different domestic and business environments. More specifically, a non-linear Support Vector Machine (SVM) was evaluated with different descriptive audio features, and a Convolutional Neural Net (CNN) was applied to Mel-spectrograms. While classification accuracy was excellent for both algorithms on 'clean' audio data, the SVM performed poorly on 'real-world' far-field microphone array recordings. The CNN achieved an accuracy of over 94% on audio clips of 1 second and performed excellently in real-time tests.
Supervisors Bauer, G., Sontacchi, A.