A dual-framework deep learning system for detecting 15 facial keypoints (30 x,y coordinates) on 96×96 grayscale images. It implements both a Keras CNN with two-phase training and a PyTorch ResNet with a custom NaN-aware loss, achieving a Kaggle RMSE of 2.10 on the Facial Keypoints Detection competition.
The key innovation is the NaN-aware MSE loss function that trains on all 7,049 samples — including the ~5,000 with partial keypoint labels — rather than discarding ~70% of the data. Combined with a 6-stage ResNet architecture (12 residual blocks) and careful learning rate scheduling, this approach significantly outperforms the baseline CNN.
| Model | Framework | Parameters | Kaggle RMSE | Strategy |
|---|---|---|---|---|
| ResNet | PyTorch | ~4.2M | 2.10 | Adam + StepLR + NaN-aware MSE |
| CNN | TensorFlow/Keras | ~1.5M | 2.55 | Two-phase: Adam → SGD + Huber |
```
Input (1×96×96 grayscale)
├─ Stem: Conv(1→32, 7×7) → BatchNorm → ReLU → MaxPool
├─ Stage 1: ResBlock×2 (32 → 32)
├─ Stage 2: ResBlock×2 (32 → 64)   stride=2
├─ Stage 3: ResBlock×2 (64 → 128)  stride=2
├─ Stage 4: ResBlock×2 (128 → 256) stride=2
├─ Stage 5: ResBlock×2 (256 → 512) stride=2
├─ AdaptiveAvgPool2d(1×1)
└─ Linear(512 → 30)                # 15 keypoints × 2 (x, y)

ResBlock:  x → Conv3×3 → BN → ReLU → Conv3×3 → BN → (+shortcut) → ReLU
Shortcut:  1×1 conv when dimensions change
```
```
Input: 96×96 Grayscale Image (CSV pixel strings)
├─ Pixel normalization (0-255 → 0-1)
├─ NaN handling: forward-fill (CNN) / mask (ResNet)
└─ Train/Val split (80/20)

CNN (Keras/TensorFlow)          ResNet (PyTorch)
Conv2D(32→64→128)               Stem: Conv(1→32)+BN+Pool
Dense(500→500→30)               5 Stages: ResBlock×2 each
LeakyReLU + Dropout             (32→64→128→256→512)
Huber loss                      AvgPool → Linear(512→30)
Adam → SGD (two-phase)          NaN-aware MSE + Adam + StepLR

Output: 30 values (x,y for 15 keypoints)
└─ Kaggle submission CSV via IdLookupTable
```
MSELossIgnoreNan trains on all 7,049 samples, including the ~5,000 with partial keypoint labels, by masking missing values instead of dropping rows.

The Kaggle dataset contains 7,049 training images stored as space-separated pixel strings in CSV format. Each is a 96×96 grayscale face. The critical challenge: only ~2,140 samples (~30%) have all 15 keypoints labeled. The remaining ~70% have partial annotations ranging from 4 to 14 keypoints, creating a sparse label landscape that most approaches address by simply discarding incomplete samples.
```python
class MSELossIgnoreNan(nn.Module):
    """MSE that masks missing (NaN) keypoint labels."""
    def forward(self, pred, target):
        mask = torch.isfinite(target)
        count = mask.sum()
        if count == 0:
            return torch.tensor(0.0, requires_grad=True)
        return ((pred[mask] - target[mask]) ** 2).sum() / count
```
```python
class ResidualBlock(nn.Module):
    """Two-conv block with identity shortcut."""
    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        if self.shortcut is not None:
            x = self.shortcut(x)  # 1×1 conv to match dims
        return F.relu(y + x)      # skip connection
```
```python
# Phase 1: Adam for fast convergence
model.compile(optimizer=Adam(lr=5e-4), loss="huber")
model.fit(X_train, y_train, epochs=200, callbacks=[early_stop])

# Phase 2: SGD for fine-tuning near the minimum
model.compile(optimizer=SGD(lr=1e-3, momentum=0.9), loss="huber")
model.fit(X_train, y_train, epochs=500, callbacks=[
    ReduceLROnPlateau(patience=5, factor=0.5),
    ModelCheckpoint("sgd_best.h5", save_best_only=True),
])
```
Data Preprocessing: Training images are stored as space-separated pixel strings in CSV format. Each is parsed into a 96×96 float32 array and normalized to [0, 1]. Of the 7,049 training samples, only ~2,140 have all 15 keypoints labeled — the rest have partial annotations ranging from 4 to 14 keypoints.
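The parsing step above can be sketched as follows; the function name `parse_image` is illustrative, not the repo's actual API:

```python
import numpy as np

def parse_image(pixel_str: str, size: int = 96) -> np.ndarray:
    """Convert a space-separated pixel string (values 0-255) into a
    float32 image of shape (size, size), normalized to [0, 1]."""
    arr = np.asarray(pixel_str.split(), dtype=np.float32)
    return (arr / 255.0).reshape(size, size)
```

Applied row-by-row over the CSV's `Image` column, this yields the `(N, 96, 96)` array both models consume.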
CNN Training (Two-Phase): Three convolutional blocks (32→64→128 filters) with LeakyReLU and progressive dropout (0.05→0.01→0.15), followed by two 500-unit dense layers. Phase 1 uses Adam (lr=0.0005) with Huber loss for fast convergence. Phase 2 switches to SGD (lr=0.001, momentum=0.9) with ReduceLROnPlateau for stable fine-tuning. Keypoint targets are normalized to [0, 1] by dividing by 96.
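A minimal Keras sketch of the architecture described above; kernel sizes, pooling placement, and padding are assumptions not stated in the text, while the filter counts, dropout rates, and dense widths follow the description:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn() -> tf.keras.Model:
    """Three conv blocks (32→64→128) with LeakyReLU and progressive
    dropout, then two 500-unit dense layers and a 30-unit head."""
    model = models.Sequential([
        layers.Input(shape=(96, 96, 1)),
        # Block 1
        layers.Conv2D(32, 3, padding="same"),
        layers.LeakyReLU(),
        layers.MaxPooling2D(),
        layers.Dropout(0.05),
        # Block 2
        layers.Conv2D(64, 3, padding="same"),
        layers.LeakyReLU(),
        layers.MaxPooling2D(),
        layers.Dropout(0.01),
        # Block 3
        layers.Conv2D(128, 3, padding="same"),
        layers.LeakyReLU(),
        layers.MaxPooling2D(),
        layers.Dropout(0.15),
        # Head
        layers.Flatten(),
        layers.Dense(500),
        layers.LeakyReLU(),
        layers.Dense(500),
        layers.LeakyReLU(),
        layers.Dense(30),  # 15 keypoints × (x, y), normalized to [0, 1]
    ])
    return model
```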
ResNet Training: Custom 6-stage ResNet with 12 residual blocks progressing from 32 to 512 channels. Each block uses batch normalization and ReLU with 1×1 convolution shortcuts for dimension matching. Trained with Adam (lr=0.0001), StepLR scheduling (step=5, gamma=0.1), and early stopping (patience=5). The raw keypoint coordinates are predicted directly without normalization.
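The schedule above (Adam at lr=1e-4, StepLR with step=5 and gamma=0.1, early stopping with patience=5) can be sketched as a generic PyTorch loop; the `train` function and its checkpointing-in-memory are illustrative, not the repo's actual code:

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import StepLR

def train(model, train_loader, val_loader, criterion,
          epochs=50, patience=5):
    """Adam + StepLR training with early stopping on validation loss."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    sched = StepLR(opt, step_size=5, gamma=0.1)
    best, bad_epochs, best_state = float("inf"), 0, None
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            opt.step()
        sched.step()  # decay lr ×0.1 every 5 epochs
        model.eval()
        with torch.no_grad():
            val = sum(criterion(model(x), y).item()
                      for x, y in val_loader)
        if val < best:  # keep the best weights seen so far
            best, bad_epochs = val, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # early stopping
                break
    if best_state is not None:
        model.load_state_dict(best_state)
    return best
```

In the repo, `criterion` would be the `MSELossIgnoreNan` shown earlier, applied to raw (unnormalized) coordinate targets.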
NaN-Aware Loss: The custom MSELossIgnoreNan creates a finite-value mask on each target tensor. Only predicted values where the corresponding target is finite contribute to the loss. This means a sample with 10 of 15 keypoints labeled produces gradients for those 10 — the model still learns from partial data rather than discarding the entire sample.
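As a concrete check of the masking behavior, here is a self-contained run of the loss on a sample with one missing coordinate (the class definition is repeated so the snippet stands alone):

```python
import torch
from torch import nn

class MSELossIgnoreNan(nn.Module):
    """MSE over only the finite (labeled) target entries."""
    def forward(self, pred, target):
        mask = torch.isfinite(target)
        count = mask.sum()
        if count == 0:
            return torch.tensor(0.0, requires_grad=True)
        return ((pred[mask] - target[mask]) ** 2).sum() / count

# One of four target values is NaN; only the three finite
# entries contribute to the loss.
pred   = torch.tensor([[1.0, 2.0, 3.0, 4.0]])
target = torch.tensor([[1.0, 2.0, float("nan"), 6.0]])
loss = MSELossIgnoreNan()(pred, target)
# (0² + 0² + (-2)²) / 3 ≈ 1.3333
```

The NaN position produces no gradient at all, so partially labeled samples pull the network only toward their known keypoints.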