Image Colorization

EE 541: A Computational Introduction to Deep Learning — Final Project

Introduction

Image colorization is the task of predicting color information for grayscale images. It has applications in restoring historical photographs, enhancing medical imaging, and building artistic tools. The challenge is that colorization is inherently ambiguous: many plausible color assignments exist for a single grayscale image, and the model must learn realistic color distributions from data.

Dataset

The Oxford-IIIT Pet Dataset contains 7,349 images of cats and dogs across 37 breeds. Each image shows a pet in various poses, backgrounds, and lighting conditions. The dataset includes breed labels and segmentation masks, though for colorization you will primarily use the raw images.

Dataset Characteristics:

  • ~200 images per breed
  • Resolution varies (typically 200-500 pixels on the longer side)
  • Natural backgrounds with context (grass, furniture, indoor/outdoor scenes)
  • Diverse lighting conditions and color palettes
  • Breeds include: British Shorthair, Persian, Siamese cats; Beagle, Pug, Golden Retriever dogs; and 31 others

Dataset Access: https://www.robots.ox.ac.uk/~vgg/data/pets/

The dataset is split into training and test sets with approximately 3,680 training images and 3,669 test images. Images are provided in JPEG format with annotations in separate files.

Problem Statement

Build a deep learning system that predicts RGB color channels given a grayscale input image. This is a supervised generative task where the input is a single-channel grayscale image and the output is a three-channel color image of the same spatial resolution.

Alternative Problem Formulations

Image Inpainting: Instead of colorization, predict missing regions of color images. Randomly mask rectangular regions (or irregular shapes) in RGB images and train the model to fill in these gaps. This tests whether the model learns to generate plausible content conditioned on surrounding context.
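As a rough illustration, a random rectangular mask can be generated in a few lines of NumPy (the image size and mask size ranges below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
img = np.random.rand(128, 128, 3).astype(np.float32)  # stand-in RGB image in [0, 1]

# Mask a random rectangle; the model is trained to reconstruct the hidden pixels.
h, w = rng.integers(16, 48, size=2)
top, left = rng.integers(0, 128 - h), rng.integers(0, 128 - w)
masked = img.copy()
masked[top:top + h, left:left + w] = 0.0
```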

Conditional Colorization: Provide additional input beyond the grayscale image, such as breed labels or user-specified color hints (sparse color points). This explores how conditional information guides the generative process and reduces ambiguity.
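One common way to feed such hints is to stack them with the grayscale image as extra input channels. The sketch below shows the idea; the particular channel layout (hints plus a validity mask) is an assumption, modeled loosely on user-guided colorization systems:

```python
import torch

gray  = torch.randn(4, 1, 128, 128)   # L-channel input
hints = torch.zeros(4, 2, 128, 128)   # sparse ab hints, zero where unspecified
mask  = torch.zeros(4, 1, 128, 128)   # 1 at pixels where a hint is provided
x = torch.cat([gray, hints, mask], dim=1)  # 4-channel conditional input
```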

Perceptual Quality Optimization: Move beyond pixel-wise loss functions to perceptual losses that measure similarity in learned feature space. This typically produces more visually pleasing results even if pixel-level accuracy decreases.
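A minimal sketch of one such loss, assuming PyTorch and torchvision are available (the VGG16 layer cut and the omission of ImageNet input normalization are simplifying assumptions):

```python
import torch.nn.functional as F
from torchvision import models

# Compare feature maps from a frozen VGG16 instead of raw pixels.
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:9].eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # loss network stays fixed during training

def perceptual_loss(pred_rgb, true_rgb):
    """Inputs: (B, 3, H, W) tensors in [0, 1]."""
    return F.mse_loss(vgg(pred_rgb), vgg(true_rgb))
```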

Color Spaces and Image Representation

Images are typically represented in RGB space where each pixel has red, green, and blue intensity values in \([0, 255]\). However, colorization is often easier in alternative color spaces that separate luminance from chrominance.

LAB Color Space

The LAB color space decouples lightness (\(L\)) from color channels (\(a\) and \(b\)):

  • L channel: Lightness ranging from 0 (black) to 100 (white)
  • A channel: Green-red axis, ranging from -128 to 127
  • B channel: Blue-yellow axis, ranging from -128 to 127

For colorization, the grayscale input is the L channel, and the model predicts the A and B channels. This split is natural because converting an image to grayscale essentially extracts its L channel.

Converting between RGB and LAB involves nonlinear transformations through XYZ color space. Libraries like OpenCV and scikit-image provide these conversions.
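For example, with scikit-image (the scaling factors below are conventional normalization choices, not requirements):

```python
import numpy as np
from skimage import color

rgb = np.random.rand(128, 128, 3)   # stand-in image, float RGB in [0, 1]
lab = color.rgb2lab(rgb)            # L in [0, 100]; a, b roughly in [-128, 127]

L  = lab[..., :1] / 100.0           # network input, scaled to [0, 1]
ab = lab[..., 1:] / 128.0           # prediction targets, scaled to about [-1, 1]

# To display a prediction, undo the scaling and convert back to RGB.
lab_pred = np.concatenate([L * 100.0, ab * 128.0], axis=-1)
rgb_pred = color.lab2rgb(lab_pred)  # float RGB in [0, 1]
```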

RGB Color Space

Alternatively, you can work directly in RGB by converting grayscale to a three-channel image (copying the grayscale value to all three channels) and training a model to predict the true RGB values. This is conceptually simpler but may be harder for the network to learn since color and intensity are entangled.
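A sketch of this input construction, using the standard BT.601 luma weights (one common grayscale convention):

```python
import numpy as np

rgb = np.random.rand(128, 128, 3).astype(np.float32)            # stand-in target in [0, 1]
gray = rgb @ np.array([0.299, 0.587, 0.114], dtype=np.float32)  # BT.601 weighted average
x = np.repeat(gray[..., None], 3, axis=-1)                      # 3-channel grayscale input
y = rgb                                                          # true RGB target
```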

Suggested Approach

Data Preprocessing: Resize images to a consistent resolution (e.g., 128×128 or 256×256) for batching. Normalize pixel values to a consistent range—either [0, 1] or [-1, 1] depending on your activation functions. Convert RGB images to grayscale for the input by weighted averaging of channels or by extracting the L channel in LAB space.
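One possible preprocessing pipeline with torchvision (the resolution, normalization range, and file path are illustrative placeholders):

```python
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),      # shorter side to 256
    transforms.CenterCrop(256),  # consistent spatial size for batching
    transforms.ToTensor(),       # RGB tensor in [0, 1], shape (3, H, W)
])

img = preprocess(Image.open("pet.jpg").convert("RGB"))  # "pet.jpg" is a placeholder path
gray = (0.299 * img[0] + 0.587 * img[1] + 0.114 * img[2]).unsqueeze(0)  # (1, H, W) input
```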

Data Augmentation: Standard augmentations include random cropping, horizontal flipping, and brightness/contrast adjustments. For colorization specifically, ensure augmentations preserve the grayscale-to-color relationship. Geometric transforms (rotation, flipping) are safe, but color jittering should be applied before grayscale conversion.
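A sketch of an ordering that keeps the input-target pair consistent, assuming torchvision (parameter values are arbitrary choices): jitter the RGB image first, then derive the grayscale input from the jittered result.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(256, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # applied before grayscaling
])

def make_pair(pil_rgb):
    rgb = transforms.ToTensor()(augment(pil_rgb))  # augmented color target in [0, 1]
    gray = (0.299 * rgb[0] + 0.587 * rgb[1] + 0.114 * rgb[2]).unsqueeze(0)
    return gray, rgb  # input derived from the same augmented image as the target
```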

Architecture Considerations: Colorization requires preserving spatial structure—the output must have the same resolution and alignment as the input. Encoder-decoder architectures compress the input to a bottleneck representation then expand it back to full resolution. Skip connections between encoder and decoder help preserve fine spatial details.

U-Net-style architectures with skip connections at multiple scales work well for tasks requiring both semantic understanding (what object is this?) and spatial precision (where are the edges?).
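A minimal U-Net-style sketch in PyTorch, mapping a 1-channel L input to a 2-channel ab output (the depth, channel widths, and tanh head are illustrative choices, not recommendations):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))
        self.enc1 = block(1, 32)
        self.enc2 = block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.mid  = block(64, 128)
        self.up2  = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = block(128, 64)   # 64 skip channels + 64 upsampled
        self.up1  = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = block(64, 32)    # 32 skip channels + 32 upsampled
        self.head = nn.Conv2d(32, 2, 1)

    def forward(self, x):
        e1 = self.enc1(x)                # full resolution
        e2 = self.enc2(self.pool(e1))    # 1/2 resolution
        m  = self.mid(self.pool(e2))     # 1/4 resolution (bottleneck)
        d2 = self.dec2(torch.cat([self.up2(m), e2], dim=1))  # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return torch.tanh(self.head(d1))  # ab channels in [-1, 1]

# Shape check: (batch, 1, 128, 128) -> (batch, 2, 128, 128)
out = TinyUNet()(torch.randn(4, 1, 128, 128))
```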

Loss Functions: Mean squared error (MSE) between predicted and true color channels is a standard choice:

\[ \mathcal{L}_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{y}_{\text{pred},i} - \mathbf{y}_{\text{true},i}\|^2 \]

where \(\mathbf{y}_{\text{pred},i}\) and \(\mathbf{y}_{\text{true},i}\) are the predicted and true color channel values (either AB in LAB space or RGB) and \(N\) is the number of pixels.

Mean absolute error (MAE) is less sensitive to outliers:

\[ \mathcal{L}_{\text{MAE}} = \frac{1}{N}\sum_{i=1}^{N} \|\mathbf{y}_{\text{pred},i} - \mathbf{y}_{\text{true},i}\|_1 \]
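Both losses are one-liners in PyTorch, for example:

```python
import torch
import torch.nn.functional as F

pred = torch.randn(4, 2, 128, 128)  # stand-in predicted ab channels
true = torch.randn(4, 2, 128, 128)  # stand-in ground-truth ab channels

loss_mse = F.mse_loss(pred, true)   # averages the squared error over all elements
loss_mae = F.l1_loss(pred, true)    # averages the absolute error over all elements
```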

Evaluation Metrics: Peak signal-to-noise ratio (PSNR) measures reconstruction quality:

\[ \text{PSNR} = 10 \log_{10} \frac{\text{MAX}^2}{\text{MSE}} \]

where MAX is the maximum possible pixel value (e.g., 255 for 8-bit images).

Structural similarity index (SSIM) measures perceptual similarity considering luminance, contrast, and structure. It ranges from -1 to 1, with 1 indicating perfect similarity.
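Both metrics are available in scikit-image; note that the data_range argument must match your normalization ([0, 1] is assumed below):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

pred = np.random.rand(128, 128, 3)  # stand-in prediction in [0, 1]
true = np.random.rand(128, 128, 3)  # stand-in ground truth in [0, 1]

psnr = peak_signal_noise_ratio(true, pred, data_range=1.0)
ssim = structural_similarity(true, pred, data_range=1.0, channel_axis=-1)
```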

Dataset Considerations

Color Ambiguity: Some objects have canonical colors (grass is green, sky is blue) while others are ambiguous (pet fur can be many colors). The model must learn when color is constrained by context versus when multiple valid solutions exist.

Background Context: The pet dataset includes rich backgrounds—indoor furniture, outdoor grass, varied lighting. These context clues help the model infer plausible colors. A pet on grass likely has green surroundings even if the grayscale intensity alone is ambiguous.

Lighting Variation: Images have different lighting conditions affecting color appearance. Indoor warm lighting shifts colors toward yellow/orange while outdoor shade shifts toward blue. The model must learn to colorize appropriately for the inferred lighting.

Fine Detail: Pet fur has fine texture and color variation. Preserving these details while colorizing requires the network to maintain high-resolution information through the architecture.

Technical Notes

Computational Requirements: Image generation tasks are more computationally intensive than classification. Processing 256×256 RGB images requires substantial memory for batch processing. Reducing resolution to 128×128 or 64×64 during initial experiments speeds up iteration. Pre-processing images to a consistent size and caching them avoids repeated resizing.
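A one-time resize-and-cache pass might look like this (the directory names and 128×128 size are placeholders):

```python
from pathlib import Path
from PIL import Image

src, dst = Path("images"), Path("images_128")  # placeholder directories
dst.mkdir(exist_ok=True)
for p in src.glob("*.jpg"):
    # Resize once and save, so training never repeats this work.
    Image.open(p).convert("RGB").resize((128, 128)).save(dst / p.name)
```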

Output Activation: If working in LAB space, the AB channels have specific ranges (typically normalized to [-1, 1]). Use tanh activation for the output layer to produce values in this range. If working in RGB with [0, 1] normalization, use sigmoid activation.
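For example (the 32 input feature channels are an assumption matching the U-Net sketch above):

```python
import torch.nn as nn

head_ab  = nn.Sequential(nn.Conv2d(32, 2, 1), nn.Tanh())     # ab targets in [-1, 1]
head_rgb = nn.Sequential(nn.Conv2d(32, 3, 1), nn.Sigmoid())  # RGB targets in [0, 1]
```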

Color Space Conversion: Converting between RGB and LAB involves gamma correction and white point normalization. Ensure you use the same conversion functions for preprocessing and post-processing. Small numerical errors in conversion can accumulate and degrade results.
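A quick round-trip sanity check with scikit-image:

```python
import numpy as np
from skimage import color

rgb = np.random.rand(64, 64, 3)                # in-gamut values in [0, 1]
back = color.lab2rgb(color.rgb2lab(rgb))
print(np.abs(back - rgb).max())                # should be tiny (numerical noise only)
```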

Expected Outcomes

Your analysis should examine which types of images colorize well and which are difficult. In particular:

  • Investigate whether the model learns canonical colors for objects (green grass, blue sky) or produces varied plausible colorizations.
  • Compare results across different breeds and backgrounds to understand what visual cues the network relies on.
  • Analyze failure cases: desaturated outputs, incorrect object colors, or color bleeding across boundaries.
  • Visualize how predicted colors vary with different architectural choices or loss functions.
  • Examine whether the model captures fine texture details or produces smooth color regions that miss fur patterns.