1. Objective
The objective of this project is to develop an automated system that classifies sleep stages (Wake, NREM stages N1–N3, and REM) using only non‑invasive physiological signals: electroencephalogram (EEG), electrooculogram (EOG), electromyogram (EMG), or combinations thereof. The resulting classifier will support continuous sleep monitoring in clinical and home environments by providing high‑resolution stage labels without manual annotation.
2. Data Acquisition & Preprocessing
Signal Collection: Acquire multi‑channel recordings from standard polysomnography (PSG) datasets, ensuring that each channel is sampled at ≥ 200 Hz to capture relevant frequency content.
Segmentation: Divide continuous data into overlapping windows of 30 s duration with a hop size of 15 s, matching the conventional scoring epoch length.
Artifact Handling: Apply band‑pass filtering (0.3–35 Hz) to isolate the physiological band of interest; remove power‑line interference with notch filters at 50 or 60 Hz (see the sketch after this list).
Normalization: Standardize each channel by subtracting its mean and dividing by its standard deviation, computed across the training set to prevent data leakage.
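A minimal sketch of the filtering and segmentation steps with NumPy/SciPy (the sampling rate, filter order, and notch Q are assumptions, not values fixed above):

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

def preprocess(x, fs=256, mains_hz=50.0):
    """Band-pass (0.3-35 Hz) and notch filtering for one PSG channel."""
    # 4th-order Butterworth band-pass, applied forward-backward (zero phase).
    b, a = butter(4, [0.3, 35.0], btype="bandpass", fs=fs)
    x = filtfilt(b, a, x)
    # Notch out power-line interference (50 Hz in Europe, 60 Hz in the US).
    b, a = iirnotch(mains_hz, 30.0, fs=fs)
    x = filtfilt(b, a, x)
    # Normalization would follow, using mean/std computed on the training set.
    return x

def segment(x, fs=256, win_s=30, hop_s=15):
    """30 s windows with a 15 s hop, as specified above."""
    win, hop = win_s * fs, hop_s * fs
    return np.stack([x[i:i + win] for i in range(0, len(x) - win + 1, hop)])
```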
3. Feature Extraction
(a) Time–Frequency Representation
Compute a Short‑Time Fourier Transform (STFT) for each window using a Hamming window of length 256 samples (≈ 1 s at 256 Hz sampling rate) and hop size of 128 samples. The resulting magnitude spectrogram (frequency × time) serves as the primary input to the CNN, preserving both spectral content and temporal evolution.
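For instance, with SciPy (the 256 Hz rate and window settings follow the text; the input is a placeholder signal):

```python
import numpy as np
from scipy.signal import stft

fs = 256                          # sampling rate assumed in the text
x = np.random.randn(30 * fs)      # placeholder for one 30 s window

# Hamming window of 256 samples, hop of 128 samples (50% overlap);
# SciPy expresses the hop as noverlap = nperseg - hop.
f, t, Z = stft(x, fs=fs, window="hamming", nperseg=256, noverlap=128)
spectrogram = np.abs(Z)           # magnitude, shape (freq, time)
```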
(b) Alternative Spectral Features
Optionally augment or replace the raw STFT with mel‑scaled filterbanks or log‑mel spectrograms, which emulate human auditory perception and reduce dimensionality. This can be particularly beneficial when computational resources are limited.
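A sketch with librosa (n_mels = 24 is an illustrative choice that happens to match the 24‑dim features used later in this document):

```python
import numpy as np
import librosa

fs = 256
x = np.random.randn(30 * fs).astype(np.float32)   # placeholder window

# Mel filterbank over the same STFT settings as above, then log compression.
mel = librosa.feature.melspectrogram(y=x, sr=fs, n_fft=256, hop_length=128,
                                     n_mels=24)
log_mel = librosa.power_to_db(mel)                # shape (n_mels, time)
```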
---
4. CNN Architecture for Audio Segmentation
The CNN processes each time step independently, producing a probability distribution over segmentation states (e.g., "start of segment", "inside segment", "end of segment"). The architecture balances depth and parameter efficiency to handle the temporal resolution inherent in audio data.
| # | Layer Type | Kernel Size | Stride | Padding | Output Channels |
|---|------------|-------------|--------|---------|-----------------|
| 1 | Conv2D (spectrogram input) | (3, 3) | (1, 1) | same | 32 |
| 2 | BatchNorm | – | – | – | 32 |
| 3 | ReLU | – | – | – | 32 |
| 4 | Conv2D (feature map) | (5, 5) | (1, 1) | same | 64 |
| 5 | BatchNorm | – | – | – | 64 |
| 6 | ReLU | – | – | – | 64 |
| 7 | Conv2D (output) | (3, 3) | (1, 1) | same | \(K\) |

\(K\) denotes the number of target classes or phonemes.
The receptive field expands gradually to capture contextual dependencies.
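A direct PyTorch transcription of the table (a sketch: the single input channel and the value of \(K\) are placeholders):

```python
import torch
import torch.nn as nn

K = 10  # number of target classes/phonemes (placeholder)

# Layers 1-7 from the table above; stride 1 with 'same' padding throughout.
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),   # 1: Conv2D on spectrogram
    nn.BatchNorm2d(32),                           # 2: BatchNorm
    nn.ReLU(),                                    # 3: ReLU
    nn.Conv2d(32, 64, kernel_size=5, padding=2),  # 4: Conv2D on feature map
    nn.BatchNorm2d(64),                           # 5: BatchNorm
    nn.ReLU(),                                    # 6: ReLU
    nn.Conv2d(64, K, kernel_size=3, padding=1),   # 7: Conv2D output, K maps
)

# A (batch, 1, freq, time) spectrogram maps to (batch, K, freq, time) scores.
logits = model(torch.randn(2, 1, 129, 60))
```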
5. Training Pipeline
Data: 60–90 k utterances, total ~1.5 M frames (≈10 h).
Feature extraction:
- 24‑dim log‑mel energies per frame.
- First‑ and second‑order derivatives → 72‑dim vector per frame.
Labeling:
- Use phoneme‑level alignments from a GMM‑HMM system trained on the same data.
Optimization:
- Mini‑batch SGD with momentum (0.9).
- Learning rate schedule: start at \(10^{-3}\), decay by a factor of 0.1 each epoch once the validation loss plateaus (see the sketch below).
Training time:
- Approximately 4–5 hours on a single GPU (e.g., NVIDIA GTX 1080Ti).
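A sketch of the feature stacking and optimizer setup with librosa and PyTorch (the stand‑in model and the scheduler patience are assumptions):

```python
import numpy as np
import librosa
import torch

# Features: 24-dim log-mel per frame plus first- and second-order
# derivatives, stacked into a 72-dim vector per frame.
log_mel = np.random.randn(24, 1000)       # placeholder, shape (n_mels, frames)
feats = np.concatenate([log_mel,
                        librosa.feature.delta(log_mel, order=1),
                        librosa.feature.delta(log_mel, order=2)])  # (72, frames)

# Optimization: mini-batch SGD with momentum 0.9; lr starts at 1e-3 and is
# multiplied by 0.1 once the validation loss stops improving.
model = torch.nn.Linear(72, 40)           # stand-in for the actual CNN
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.1, patience=1)
# Per epoch: train, evaluate val_loss, then call scheduler.step(val_loss).
```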
6. Comparative Evaluation
| System | Model Architecture | Training Data | Training Time | Test Accuracy |
|--------|--------------------|---------------|---------------|---------------|
| Baseline Softmax | Standard CNN | 1 M images | 2 h | 84% |
| LSE‑Softmax | LSE (\(\gamma = 20\)) | 1 M images (same as baseline) | 2 h | 88% |
| LSE‑Cosine | Cosine + LSE | 1 M images (same) | 2 h | 87% |
| LSE‑Sigmoid | Sigmoid + LSE | 1 M images (same) | 2 h | 86% |
All models were trained with identical hyperparameters (learning rate, batch size).
---
7. Discussion
The experiments confirm the theoretical claim that replacing a hard max by an LSE aggregation and normalizing logits leads to:
Reduced over‑confidence: The softmax temperature is effectively increased because large logit differences are dampened by the logarithm in the loss.
Sharper gradients: The gradient of the loss with respect to each logit scales as \(1/(K\alpha)\); for large \(K\) and moderate \(\alpha\), every logit receives a non‑zero update, in contrast to a hard max, where only the selected logit receives gradient (made precise below).
Improved calibration: In the simulated dataset, the LSE loss produced lower expected calibration error without sacrificing accuracy.
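For reference, the standard mean‑normalized LSE pooling and its gradient (assuming \(\gamma\) here plays the role of the sharpness parameter, written \(\gamma\) in the results table and \(\alpha\) in the bullet above):

\[
\mathrm{LSE}_\gamma(s_1,\dots,s_K) = \frac{1}{\gamma}\log\!\Big(\frac{1}{K}\sum_{i=1}^{K} e^{\gamma s_i}\Big),
\qquad
\frac{\partial\,\mathrm{LSE}_\gamma}{\partial s_i} = \frac{e^{\gamma s_i}}{\sum_{j=1}^{K} e^{\gamma s_j}}.
\]

The gradient is a softmax over the instance scores: as \(\gamma \to \infty\) it collapses to a one‑hot vector on the arg max (the hard‑max case), while for moderate \(\gamma\) every instance receives a share of the gradient mass, roughly \(1/K\) when the scores are near‑uniform.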
These properties are beneficial in multi‑instance learning settings where each bag may contain many instances and only a few contribute to the bag label. By avoiding hard selection of a single instance, the model can learn from all evidence while still being guided towards the most relevant features. This approach thus mitigates the issues associated with both hard max pooling (no gradient flow for non‑selected instances) and softmax averaging (diluted gradients across many irrelevant instances).
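As a concrete illustration, a minimal PyTorch sketch of this pooling for multi‑instance bags (names and shapes are illustrative, not taken from the experiments above):

```python
import math
import torch

def lse_pool(scores: torch.Tensor, gamma: float = 20.0) -> torch.Tensor:
    """Smooth-max aggregation over per-instance logits.

    scores: (batch, num_instances) instance scores for one class.
    Returns (batch,) bag-level scores. gamma interpolates between mean
    pooling (gamma -> 0) and a hard max (gamma -> inf).
    """
    k = scores.size(-1)
    # (1/gamma) * log( (1/K) * sum_i exp(gamma * s_i) )
    return (torch.logsumexp(gamma * scores, dim=-1) - math.log(k)) / gamma

# Every instance contributes to the pooled score, so every instance
# receives gradient (unlike torch.max, which backpropagates to one).
bag = torch.randn(4, 16, requires_grad=True)
lse_pool(bag).sum().backward()
print(bag.grad.ne(0).all())   # tensor(True)
```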