Facial Keypoints Detection

Deep Learning · Computer Vision · USF · Aug — Dec 2023

A dual-framework deep learning system for detecting 15 facial keypoints (30 x,y coordinates) on 96×96 grayscale images. Implements both a Keras CNN with two-phase training and a PyTorch ResNet with custom NaN-aware loss, achieving a Kaggle RMSE of 2.10 on the Facial Keypoints Detection competition.

The key innovation is a NaN-aware MSE loss function that trains on all 7,049 samples — including the ~5,000 with only partial keypoint labels — rather than discarding ~70% of the data. Combined with a ResNet built from a convolutional stem and five residual stages (10 residual blocks) and careful learning rate scheduling, this approach significantly outperforms the baseline CNN.

Figure: ResNet keypoint predictions (red dots) on unseen test faces — 15 keypoints per face
Best RMSE: 2.10
Training samples: 7,049
Keypoints: 15
Total parameters: ~5.7M

Model Comparison

Model    Framework          Parameters   Kaggle RMSE   Strategy
ResNet   PyTorch            ~4.2M        2.10          Adam + StepLR + NaN-aware MSE
CNN      TensorFlow/Keras   ~1.5M        2.55          Two-phase: Adam → SGD + Huber

ResNet Architecture

Input (1×96×96 grayscale)
  ├─ Stem: Conv(1→32, 7×7) → BatchNorm → ReLU → MaxPool
  ├─ Stage 1: ResBlock×2  (32 → 32)
  ├─ Stage 2: ResBlock×2  (32 → 64)     stride=2
  ├─ Stage 3: ResBlock×2  (64 → 128)    stride=2
  ├─ Stage 4: ResBlock×2  (128 → 256)   stride=2
  ├─ Stage 5: ResBlock×2  (256 → 512)   stride=2
  ├─ AdaptiveAvgPool2d(1×1)
  └─ Linear(512 → 30)   ──── 15 keypoints × 2 (x, y)

ResBlock: x → Conv3×3 → BN → ReLU → Conv3×3 → BN → (+shortcut) → ReLU
Shortcut: 1×1 conv when dimensions change

Training Pipeline

Input: 96×96 Grayscale Image (CSV pixel strings)
  ├─ Pixel normalization (0-255 → 0-1)
  ├─ NaN handling: forward-fill (CNN) / mask (ResNet)
  └─ Train/Val split (80/20)

CNN (Keras/TensorFlow)         ResNet (PyTorch)
  Conv2D(32→64→128)              Stem: Conv(1→32)+BN+Pool
  Dense(500→500→30)              5 Stages: ResBlock×2 each
  LeakyReLU + Dropout              (32→64→128→256→512)
  Huber loss                       AvgPool → Linear(512→30)
  Adam → SGD (two-phase)          NaN-aware MSE + Adam + StepLR
           └────────────┬────────────┘

Output: 30 values (x,y for 15 keypoints)
  └─ Kaggle submission CSV via IdLookupTable
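The final mapping step can be sketched as a minimal submission builder. This assumes the standard IdLookupTable layout for this competition (columns RowId, ImageId, FeatureName, with 1-based image IDs); the function and variable names are illustrative:

```python
import numpy as np
import pandas as pd

def make_submission(lookup_csv, preds, feature_names, out_csv="submission.csv"):
    """Map a (num_images, 30) prediction array onto IdLookupTable rows."""
    lookup = pd.read_csv(lookup_csv)
    col = {name: j for j, name in enumerate(feature_names)}
    # Each lookup row asks for one coordinate of one image
    loc = [preds[row.ImageId - 1, col[row.FeatureName]]   # ImageId is 1-based
           for row in lookup.itertuples()]
    out = pd.DataFrame({"RowId": lookup.RowId,
                        "Location": np.clip(loc, 0, 96)})  # keep inside image
    out.to_csv(out_csv, index=False)
    return out
```

Clipping predictions to the 96-pixel image bounds is a cheap safeguard against the occasional out-of-range regression output.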

Key Features

Data & Missing Labels

The Kaggle dataset contains 7,049 training images stored as space-separated pixel strings in CSV format. Each is a 96×96 grayscale face. The critical challenge: only ~2,140 samples (~30%) have all 15 keypoints labeled. The remaining ~70% have partial annotations ranging from 4 to 14 keypoints, creating a sparse label landscape that most approaches address by simply discarding incomplete samples.

Figure: Label completeness distribution across 7,049 training samples — only ~30% have all 15 keypoints
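The completeness statistic can be computed directly from the CSV. A quick sketch, assuming the standard training.csv layout where each keypoint contributes an adjacent _x/_y column pair plus a final Image column:

```python
import pandas as pd

def label_completeness(df):
    """Count how many keypoints are fully labeled (both x and y) per sample."""
    coords = df.drop(columns="Image", errors="ignore")
    # Reshape column mask to (samples, keypoints, 2); a keypoint counts
    # only if both of its coordinates are present
    present = coords.notna().to_numpy().reshape(len(df), -1, 2).all(axis=2)
    return present.sum(axis=1)      # 0..15 labeled keypoints per row
```

Histogramming the returned counts reproduces the distribution shown in the figure.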


Code Highlights

NaN-Aware MSE Loss
import torch
import torch.nn as nn

class MSELossIgnoreNan(nn.Module):
    """MSE that masks missing (NaN) keypoint labels."""
    def forward(self, pred, target):
        mask = torch.isfinite(target)   # True where a label is present
        count = mask.sum()
        if count == 0:
            # No labels in this batch: zero loss on the right device,
            # still attached to the autograd graph
            return (pred * 0.0).sum()
        return ((pred[mask] - target[mask]) ** 2).sum() / count
Residual Block with Skip Connection
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two-conv block with identity shortcut."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(out_ch), nn.BatchNorm2d(out_ch)
        # 1×1 conv shortcut only when dimensions change
        self.shortcut = (nn.Conv2d(in_ch, out_ch, 1, stride, bias=False)
                         if stride != 1 or in_ch != out_ch else None)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        if self.shortcut is not None:
            x = self.shortcut(x)    # 1×1 conv to match dims
        return F.relu(y + x)        # skip connection
Two-Phase Optimizer Switch (CNN)
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.callbacks import (EarlyStopping, ModelCheckpoint,
                                        ReduceLROnPlateau)

# Phase 1: Adam for fast convergence
early_stop = EarlyStopping(patience=10, restore_best_weights=True)  # patience illustrative
model.compile(optimizer=Adam(learning_rate=5e-4), loss="huber")
model.fit(X_train, y_train, epochs=200, callbacks=[early_stop])

# Phase 2: SGD for fine-tuning near the minimum
model.compile(optimizer=SGD(learning_rate=1e-3, momentum=0.9), loss="huber")
model.fit(X_train, y_train, epochs=500, callbacks=[
    ReduceLROnPlateau(patience=5, factor=0.5),
    ModelCheckpoint("sgd_best.h5", save_best_only=True),
])

How It Works

Data Preprocessing: Training images are stored as space-separated pixel strings in CSV format. Each is parsed into a 96×96 float32 array and normalized to [0, 1]. Of the 7,049 training samples, only ~2,140 have all 15 keypoints labeled — the rest have partial annotations ranging from 4 to 14 keypoints.
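The parsing step can be sketched as a small loader; the helper name is illustrative:

```python
import numpy as np
import pandas as pd

def load_training_data(csv_path):
    """Parse the Kaggle CSV: pixel strings → images (N,1,96,96), targets (N,·)."""
    df = pd.read_csv(csv_path)
    # Each Image cell holds 96*96 = 9216 space-separated intensities in [0, 255]
    X = np.stack([np.asarray(s.split(), dtype=np.float32).reshape(1, 96, 96)
                  for s in df["Image"]]) / 255.0          # normalize to [0, 1]
    y = df.drop(columns="Image").to_numpy(np.float32)     # NaN where unlabeled
    return X, y
```

Leaving NaNs in place in `y` is deliberate: the ResNet's masked loss consumes them directly, while the CNN pipeline fills them beforehand.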

CNN Training (Two-Phase): Three convolutional blocks (32→64→128 filters) with LeakyReLU and progressive dropout (0.05→0.01→0.15), followed by two 500-unit dense layers. Phase 1 uses Adam (lr=0.0005) with Huber loss for fast convergence. Phase 2 switches to SGD (lr=0.001, momentum=0.9) with ReduceLROnPlateau for stable fine-tuning. Keypoint targets are normalized to [0, 1] by dividing by 96.

ResNet Training: Custom ResNet with a convolutional stem and five residual stages (10 residual blocks) progressing from 32 to 512 channels. Each block uses batch normalization and ReLU, with 1×1 convolution shortcuts for dimension matching. Trained with Adam (lr=0.0001), StepLR scheduling (step=5, gamma=0.1), and early stopping (patience=5). Raw keypoint coordinates are predicted directly, without target normalization.
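The recipe (Adam, StepLR with step=5 and gamma=0.1, early stopping with patience 5) can be sketched as a plain PyTorch loop; this is illustrative, not the project's exact code:

```python
import torch

def train(model, loss_fn, train_dl, val_dl, epochs=50, patience=5):
    """Adam + StepLR + early stopping, per the recipe above (sketch)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.1)
    best_val, bad_epochs = float("inf"), 0
    for _ in range(epochs):
        model.train()
        for xb, yb in train_dl:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
        sched.step()                     # decay LR by 0.1 every 5 epochs
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(xb), yb).item()
                      for xb, yb in val_dl) / len(val_dl)
        if val < best_val:
            best_val, bad_epochs = val, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:   # early stopping
                break
    return best_val
```

Calling `sched.step()` once per epoch (after the optimizer steps) matches the StepLR semantics described here.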

NaN-Aware Loss: The custom MSELossIgnoreNan creates a finite-value mask on each target tensor. Only predicted values where the corresponding target is finite contribute to the loss. This means a sample with 10 of 15 keypoints labeled produces gradients for those 10 — the model still learns from partial data rather than discarding the entire sample.
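A tiny numeric check makes the masking concrete: with one of three targets missing, the loss averages over the two finite entries, and the gradient at the NaN position is exactly zero.

```python
import torch

pred   = torch.tensor([[1.0, 2.0, 3.0]], requires_grad=True)
target = torch.tensor([[1.5, float("nan"), 3.0]])

mask = torch.isfinite(target)        # [[True, False, True]]
loss = ((pred[mask] - target[mask]) ** 2).sum() / mask.sum()
loss.backward()

print(loss.item())   # 0.125 = ((1.0 - 1.5)**2 + (3.0 - 3.0)**2) / 2
print(pred.grad)     # gradient is zero at the NaN position
```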

Training Results

Figure: ResNet training curves (Adam + StepLR)
Figure: CNN training curves, two-phase (Adam → SGD)
Figure: Per-keypoint RMSE — eye and eyebrow keypoints are most accurate; mouth corners are hardest
Figure: Ground truth (green) vs. ResNet prediction (red) on a validation sample

Frameworks & Tools

CNN Framework: TensorFlow / Keras 2.x
ResNet Framework: PyTorch 2.x
Loss Functions: Huber (CNN), NaN-aware MSE (ResNet)
LR Scheduling: ReduceLROnPlateau (CNN), StepLR (ResNet)
Configuration: YAML + frozen dataclasses
Testing: pytest (models, loss, dataset, config)

Tags: Python, PyTorch, TensorFlow, Keras, ResNet, CNN, Computer Vision, Deep Learning, Kaggle, NumPy, pytest
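The "YAML + frozen dataclasses" setup can be sketched as follows; the field names and defaults are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass

import yaml  # PyYAML

@dataclass(frozen=True)  # immutable: config can't be mutated mid-run
class TrainConfig:
    lr: float = 1e-4
    epochs: int = 50
    step_size: int = 5
    gamma: float = 0.1

def load_config(path):
    """Read a YAML file and validate its keys against the frozen dataclass."""
    with open(path) as f:
        return TrainConfig(**(yaml.safe_load(f) or {}))
```

Freezing the dataclass means any misspelled YAML key fails loudly at load time, and no code can silently change hyperparameters during training.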
