Urban Sound Classification
EE 541: A Computational Introduction to Deep Learning — Final Project
Introduction
Urban sound classification addresses the problem of automatically identifying environmental sounds in city settings. This has applications in noise monitoring, surveillance systems, and assistive technologies for hearing-impaired individuals. The challenge lies in the high variability of real-world recordings—sounds overlap, occur at different distances, and are affected by background noise and reverberation.
Dataset
The UrbanSound8K dataset contains 8,732 labeled sound excerpts (≤4 seconds each) of urban sounds from 10 classes:
- Air conditioner
- Car horn
- Children playing
- Dog bark
- Drilling
- Engine idling
- Gun shot
- Jackhammer
- Siren
- Street music
The audio files are in WAV format with varying sample rates (typically 22050 Hz or 44100 Hz) and are organized into 10 folds for cross-validation. Each file is labeled, and an accompanying metadata file records its fold number, class ID, and source information.
Dataset Access: https://urbansounddataset.weebly.com/urbansound8k.html
The dataset totals approximately 8.75 hours of audio and includes pre-defined train/test splits via the fold structure. You should use the provided folds to ensure fair comparison and avoid data leakage.
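The folds are listed in a metadata CSV that ships with the download. Below is a minimal sketch of a fold-respecting split, assuming the standard `metadata/UrbanSound8K.csv` layout with `fold` and `slice_file_name` columns and using pandas; verify the paths and column names against your copy.

```python
import pandas as pd

# Metadata shipped with the dataset; the path and column names assume the
# standard UrbanSound8K layout -- check them against your download.
meta = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")

# Hold out one predefined fold for testing and train on the other nine,
# so clips from the same source recording never straddle the split.
test_fold = 10
train_meta = meta[meta["fold"] != test_fold]
test_meta = meta[meta["fold"] == test_fold]

# Audio files live under audio/fold<k>/<slice_file_name>.
train_paths = [
    f"UrbanSound8K/audio/fold{row.fold}/{row.slice_file_name}"
    for row in train_meta.itertuples()
]
```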
Problem Statement
Your task is to build a deep learning system that classifies audio clips into one of the 10 urban sound categories. This is a multi-class classification problem where the input is an audio waveform and the output is a class label.
Alternative Problem Formulations
Hierarchical Classification: Group the 10 classes into higher-level categories (e.g., mechanical sounds, human/animal sounds, vehicle sounds) and implement a two-stage classifier. This allows exploration of hierarchical architectures and analysis of which sound characteristics distinguish broad categories versus fine-grained classes.
Multi-Label Classification: Some audio clips contain multiple simultaneous sound events. You can relabel portions of the dataset to reflect this reality and train a multi-label classifier that predicts all present sound types. This requires modifying the loss function and evaluation metrics accordingly.
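If you pursue the multi-label variant, the main code change is replacing the single-label cross-entropy loss with an independent per-class binary cross-entropy. A minimal sketch, assuming PyTorch and hypothetical relabeled {0, 1} target vectors:

```python
import torch
import torch.nn as nn

num_classes = 10
logits = torch.randn(8, num_classes)           # raw model outputs for a batch of 8

# Single-label case: targets are class indices, scored with cross-entropy.
single_targets = torch.randint(0, num_classes, (8,))
ce_loss = nn.CrossEntropyLoss()(logits, single_targets)

# Multi-label case: targets are {0, 1} vectors marking every sound present,
# and each class gets an independent sigmoid via BCEWithLogitsLoss.
multi_targets = torch.zeros(8, num_classes)
multi_targets[0, [3, 8]] = 1.0                 # e.g., clip 0 contains two simultaneous classes
bce_loss = nn.BCEWithLogitsLoss()(logits, multi_targets)
```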
Few-Shot Learning: Simulate a scenario where only a small number of examples are available for some classes. Train on limited data for certain categories and evaluate how well the model generalizes. This tests the robustness of learned representations.
Audio Representation
Audio data can be represented in multiple ways for neural network input.
Time Domain
Raw audio waveforms are one-dimensional signals where each sample represents air pressure at a specific time. A 4-second clip at 22050 Hz contains 88,200 samples. You can apply 1D convolutions directly to waveforms, though this approach requires the network to learn frequency-sensitive filters from scratch.
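A minimal sketch of this idea, assuming PyTorch; the `WaveformCNN` name and layer sizes are illustrative choices, not a prescribed architecture. The wide first kernel gives the network room to learn filterbank-like responses directly from samples.

```python
import torch
import torch.nn as nn

class WaveformCNN(nn.Module):
    """Minimal 1D CNN operating directly on raw waveforms (one possible design)."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            # Wide first kernel with a large stride acts like a learned filterbank.
            nn.Conv1d(1, 16, kernel_size=80, stride=4), nn.BatchNorm1d(16), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=3), nn.BatchNorm1d(32), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=3), nn.BatchNorm1d(64), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # global pooling handles variable lengths
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                       # x: (batch, 1, num_samples)
        return self.classifier(self.features(x).squeeze(-1))

model = WaveformCNN()
logits = model(torch.randn(2, 1, 88200))        # two 4-second clips at 22050 Hz
```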
Frequency Domain
Converting audio to the frequency domain reveals which frequencies are present at what intensities. The Short-Time Fourier Transform (STFT) computes frequency content over short time windows, producing a spectrogram—a 2D representation with time on one axis and frequency on the other.
Mel Spectrogram: The mel scale approximates human auditory perception, spacing frequencies logarithmically rather than linearly. A mel spectrogram applies this perceptual scaling and is commonly used for audio classification:
\[ m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \]
where \(m\) is the mel frequency and \(f\) is the frequency in Hz.
Mel spectrograms can be treated as grayscale images and processed with 2D CNNs.
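A minimal sketch of computing a log-scaled mel spectrogram, assuming librosa; the file path is hypothetical and the parameters (`n_fft=2048`, `hop_length=512`, `n_mels=128`) are illustrative defaults.

```python
import librosa
import numpy as np

# Load one clip at a fixed sample rate (librosa resamples on load).
y, sr = librosa.load("fold1/example.wav", sr=22050)    # hypothetical path

# 128-band mel spectrogram; hop_length controls the time resolution.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)

# Convert power to decibels so the dynamic range resembles an image.
log_mel = librosa.power_to_db(mel, ref=np.max)
print(log_mel.shape)    # (128, ~173) for a 4-second clip with these settings
```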
MFCCs (Mel-Frequency Cepstral Coefficients): MFCCs capture the spectral envelope of audio signals and are compact feature representations. They are computed by taking the discrete cosine transform of log-scaled mel spectrogram values. The first 12-20 MFCC coefficients are typically used as features.
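MFCCs are available through the same library; a minimal sketch, again assuming librosa and a hypothetical file path, with a common mean/standard-deviation summary over time:

```python
import librosa
import numpy as np

y, sr = librosa.load("fold1/example.wav", sr=22050)     # hypothetical path

# 20 MFCCs per frame; the result has shape (n_mfcc, num_frames).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

# A compact clip-level summary: per-coefficient mean and standard deviation.
features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])   # shape (40,)
```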
Suggested Approach
Data Preprocessing: Audio files in the dataset have varying sample rates. Resampling to a consistent rate (e.g., 22050 Hz) ensures uniform input dimensions. Clips shorter than 4 seconds can be padded with silence, while longer clips can be truncated or split. Normalization by mean and standard deviation stabilizes training.
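A minimal preprocessing sketch along these lines, assuming librosa; the target rate, clip length, and normalization scheme are illustrative choices.

```python
import librosa
import numpy as np

TARGET_SR = 22050
TARGET_LEN = 4 * TARGET_SR          # 4 seconds -> 88,200 samples

def load_clip(path: str) -> np.ndarray:
    """Load a clip, resample to TARGET_SR, pad/truncate to 4 s, and normalize."""
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)   # resamples on load
    if len(y) < TARGET_LEN:
        y = np.pad(y, (0, TARGET_LEN - len(y)))          # pad the tail with silence
    else:
        y = y[:TARGET_LEN]                               # truncate longer clips
    return (y - y.mean()) / (y.std() + 1e-8)             # zero-mean, unit-variance
```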
Data Augmentation: Audio augmentation increases dataset diversity and improves generalization. Time shifting (circular shift of waveform), pitch shifting (frequency scaling), time stretching (duration scaling without pitch change), and adding background noise are common strategies. You can also apply augmentations in the spectrogram domain such as time masking or frequency masking (SpecAugment).
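A minimal waveform-augmentation sketch, assuming librosa and NumPy; the transform probabilities and parameter ranges are illustrative. For spectrogram-domain masking, `torchaudio.transforms.TimeMasking` and `torchaudio.transforms.FrequencyMasking` provide SpecAugment-style operations.

```python
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Apply one randomly chosen waveform augmentation (illustrative choices)."""
    choice = np.random.randint(4)
    if choice == 0:                                   # circular time shift
        y = np.roll(y, np.random.randint(len(y)))
    elif choice == 1:                                 # pitch shift by up to +/- 2 semitones
        y = librosa.effects.pitch_shift(y=y, sr=sr, n_steps=np.random.uniform(-2, 2))
    elif choice == 2:                                 # time stretch, then re-fit to length
        y = librosa.effects.time_stretch(y=y, rate=np.random.uniform(0.9, 1.1))
        y = librosa.util.fix_length(y, size=4 * sr)
    else:                                             # additive background noise
        y = y + 0.005 * np.random.randn(len(y))
    return y
```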
Representation and Architecture: Consider how you will represent audio for your neural network. Different representations (raw waveform, spectrograms, MFCCs) have different computational and modeling implications. The temporal and frequency dimensions of spectrograms also have different characteristics: patterns are largely translation-invariant in time (a sound can start anywhere in the clip) but less so in frequency, where shifting a pattern changes its pitch and often its identity.
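A minimal 2D CNN sketch for spectrogram input, assuming PyTorch; the `SpectrogramCNN` name and layer sizes are illustrative, and the global pooling at the end lets the network accept spectrograms of different widths.

```python
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    """Small 2D CNN treating a log-mel spectrogram as a one-channel image."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),             # global pooling over time and frequency
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                        # x: (batch, 1, n_mels, num_frames)
        return self.classifier(self.features(x).flatten(1))

logits = SpectrogramCNN()(torch.randn(2, 1, 128, 173))   # two log-mel spectrograms
```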
Evaluation: Classification accuracy provides an overall performance measure for this roughly balanced dataset. Per-class precision, recall, and F1-scores reveal which sound categories are most challenging. Confusion matrices show which classes are commonly confused, providing insight into acoustic similarities.
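A minimal evaluation sketch, assuming scikit-learn; the label arrays here are random stand-ins for predictions collected over a held-out fold.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Replace with true labels and predictions from the held-out fold.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=200)
y_pred = rng.integers(0, 10, size=200)

print(classification_report(y_true, y_pred, digits=3))  # per-class precision/recall/F1
print(confusion_matrix(y_true, y_pred))                 # rows: true class, columns: predicted
```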
Cross-Validation: The dataset provides 10 folds for cross-validation. Training on 9 folds and testing on 1, then rotating through all folds, gives a robust performance estimate. This approach is computationally expensive but reduces variance in your results.
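A skeleton of the fold-rotation loop, reusing the metadata DataFrame from the earlier snippet; `train_model` and `evaluate` are hypothetical placeholders for your own training and scoring routines, so this is an outline rather than runnable code.

```python
import numpy as np

fold_accuracies = []
for test_fold in range(1, 11):
    train_meta = meta[meta["fold"] != test_fold]         # nine folds for training
    test_meta = meta[meta["fold"] == test_fold]          # one held-out fold

    model = train_model(train_meta)                      # hypothetical training routine
    fold_accuracies.append(evaluate(model, test_meta))   # hypothetical scoring routine

print(f"mean accuracy: {np.mean(fold_accuracies):.3f} "
      f"(std {np.std(fold_accuracies):.3f} across folds)")
```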
Dataset Considerations
The UrbanSound8K dataset has some important characteristics to consider:
Class Imbalance: While relatively balanced, some classes have fewer examples than others. Monitor per-class performance to ensure your model doesn’t overfit to common classes while ignoring rare ones.
Fold Structure: The 10-fold structure groups recordings from the same source together. This prevents train/test leakage where the same environment or recording device appears in both sets. Respect the fold boundaries in your splits.
Audio Quality Variation: Recordings come from diverse sources with different background noise levels, microphone quality, and recording distances. This variability makes the problem realistic but challenging. Your model should be robust to these variations.
Overlapping Sounds: Many clips contain background sounds in addition to the target sound. A dog bark might have traffic noise, or a siren might include ambient city sounds. This reflects real-world conditions and makes pure classification difficult.
Technical Notes
Computational Requirements: Processing raw audio or generating spectrograms can be memory-intensive for large batch sizes. Pre-computing and caching spectrograms as images speeds up training significantly. A typical spectrogram might be 128×128 or 128×256 depending on your STFT parameters.
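A minimal caching sketch, assuming librosa and NumPy and storing arrays as `.npy` files rather than image files; the paths and parameters are illustrative.

```python
import os
import numpy as np
import librosa

def cache_spectrogram(wav_path: str, cache_dir: str = "mel_cache") -> np.ndarray:
    """Compute a log-mel spectrogram once and reuse the cached .npy on later epochs."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, os.path.basename(wav_path) + ".npy")
    if os.path.exists(cache_path):
        return np.load(cache_path)                       # fast path: already computed

    y, sr = librosa.load(wav_path, sr=22050)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=512)
    log_mel = librosa.power_to_db(mel, ref=np.max).astype(np.float32)
    np.save(cache_path, log_mel)
    return log_mel
```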
Sample Rate and Duration: The 4-second duration at 22050 Hz is reasonable for most neural network training. Higher sample rates increase temporal resolution but also computational cost and memory usage. Lower sample rates may lose high-frequency information relevant to distinguishing some sound classes.
Expected Outcomes
Your analysis should identify which sound classes are most difficult to classify and why. Examine confusion between acoustically similar classes (e.g., engine idling vs. air conditioner, or drilling vs. jackhammer). Investigate how data augmentation and architectural choices affect generalization. Visualize learned filters or activation maps to understand what acoustic features the network learns.
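One way to start on the filter-visualization point is to plot the frequency response of the first-layer 1D kernels. A minimal sketch, assuming NumPy, Matplotlib, and PyTorch; a freshly initialized convolution stands in here for the first layer of your trained model.

```python
import numpy as np
import matplotlib.pyplot as plt
import torch.nn as nn

# Stand-in first layer; in practice, take this layer from your trained model.
first_conv = nn.Conv1d(1, 16, kernel_size=80)
kernels = first_conv.weight.detach().numpy()[:, 0, :]    # (16, 80) learned kernels

fig, axes = plt.subplots(4, 4, figsize=(10, 6))
for kernel, ax in zip(kernels, axes.flat):
    response = np.abs(np.fft.rfft(kernel, n=512))        # magnitude frequency response
    ax.plot(np.fft.rfftfreq(512, d=1 / 22050), response)
    ax.set_xticks([]); ax.set_yticks([])
fig.suptitle("Frequency response of first-layer 1D filters")
plt.savefig("first_layer_filters.png", dpi=150)
```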