16-726 Learning-Based Image Synthesis

Assignment 1: Colorizing the Prokudin-Gorskii Photo Collection
Jun Luo
Feb. 2021

Overview

In this assignment, we focus on producing a color image from digitized Prokudin-Gorskii glass plate images using image processing techniques. We extract three color channel images from a single input image, which is split into top, middle, and bottom patches corresponding to the three channels (B, G, R). Different techniques were then used to align the three patches so that they form a single RGB image.

Methods and Implementations

In the provided glass plate images, there is one low-resolution (< 500×500) image and several high-resolution (> 3000×3000) images. A naive way to align the channel patches is to search over a window of possible displacements, score each one using an image matching metric, and take the displacement with the best score. While this naive method might work, it is only feasible on small-scale glass plate images, since a proper search window for larger inputs would be too big and the repeated computation of the metric would be extremely time-consuming. We implement a single-scale alignment for low-resolution inputs and a multi-scale (image pyramid) alignment that is more efficient on high-resolution inputs. Both methods share the same search-and-score structure; the pyramid version first aligns recursively downsampled copies, doubles the estimated displacement at each level, and refines it with a small local search at the finer resolution.
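The single-scale search can be sketched as below. This is a minimal illustration, not the assignment's actual code: the function name is ours, shifts wrap around via np.roll (in practice the wrapped border pixels are usually cropped or ignored), and the metric is inlined SSD.

```python
import numpy as np

def align_single_scale(channel, reference, window=15):
    """Exhaustively search displacements in [-window, window]^2 and
    return the (dx, dy) shift that minimizes the SSD score."""
    best_score, best_shift = np.inf, (0, 0)
    for dy in range(-window, window + 1):
        for dx in range(-window, window + 1):
            shifted = np.roll(channel, (dy, dx), axis=(0, 1))
            # Sum of squared differences between the shifted channel
            # and the reference channel.
            score = np.sum((shifted.astype(np.float64)
                            - reference.astype(np.float64)) ** 2)
            if score < best_score:
                best_score, best_shift = score, (dx, dy)
    return best_shift
```

The cost grows quadratically with the window size, which is why this is only practical for the small input.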

Results

Below are the output color images from single-scale implementation of different loss functions and different settings for the border area. The corresponding displacement vectors are in the captions (first number indicates a right shift, second indicates a down shift).
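The two metrics and the border-ignoring variant compared above can be sketched as follows. This is an illustrative reimplementation (names and the default crop fraction are ours): a fraction of pixels on every side is dropped before scoring, so the damaged film borders do not dominate the match.

```python
import numpy as np

def crop_border(img, frac):
    """Drop `frac` of the pixels on every side of the patch."""
    h, w = img.shape[:2]
    dy, dx = int(h * frac), int(w * frac)
    return img[dy:h - dy, dx:w - dx]

def sad(a, b, border=0.04):
    """Sum of absolute differences on the border-cropped patches."""
    a, b = crop_border(a, border), crop_border(b, border)
    return np.sum(np.abs(a.astype(np.float64) - b.astype(np.float64)))

def ssd(a, b, border=0.04):
    """Sum of squared differences on the border-cropped patches."""
    a, b = crop_border(a, border), crop_border(b, border)
    return np.sum((a.astype(np.float64) - b.astype(np.float64)) ** 2)
```

As the captions show, cropping the border changes the estimated displacements noticeably, since the plate borders otherwise contribute large, meaningless differences.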

SAD. Disp.: R=[-1,8]; G=[0,2]
SAD + ignore 4% border. Disp.: R=[3,12]; G=[2,5]
SSD. Disp.: R=[-1,7]; G=[-1,1]
SSD + ignore 4% border. Disp.: R=[3,12]; G=[2,5]

Here are the results for the multi-scale (image pyramid) alignment implementation using SSD loss and ignoring 30% border area.
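A minimal recursive sketch of the pyramid alignment is given below. The function name, the refinement window of 2, and the base-case size are our illustrative choices; shifts again wrap around via np.roll and the metric is SSD.

```python
import numpy as np

def align_pyramid(channel, reference, window=15, min_size=400):
    """Coarse-to-fine alignment: recursively align 2x-downsampled
    copies, double the coarse displacement, then refine it with a
    small local search at the current resolution."""
    if max(channel.shape) > min_size:
        dx0, dy0 = align_pyramid(channel[::2, ::2], reference[::2, ::2],
                                 window, min_size)
        base_dx, base_dy = 2 * dx0, 2 * dy0
        window = 2  # only a small refinement is needed after upscaling
    else:
        base_dx = base_dy = 0
    best, best_shift = np.inf, (base_dx, base_dy)
    for dy in range(base_dy - window, base_dy + window + 1):
        for dx in range(base_dx - window, base_dx + window + 1):
            shifted = np.roll(channel, (dy, dx), axis=(0, 1))
            score = np.sum((shifted.astype(np.float64)
                            - reference.astype(np.float64)) ** 2)
            if score < best:
                best, best_shift = score, (dx, dy)
    return best_shift
```

The full search window is only used at the coarsest level, so the cost on a 3000×3000 plate stays close to that of a single small-image search.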

Disp.: R=[7,-217]; G=[24,48]
Disp.: R=[7,-217]; G=[24,48]
Disp.: R=[23,90]; G=[17,41]
Disp.: R=[11,115]; G=[8,53]
Disp.: R=[37,175]; G=[29,78]
Disp.: R=[13,110]; G=[15,51]
Disp.: R=[33,86]; G=[7,42]
Disp.: R=[28,116]; G=[21,55]
Disp.: R=[22,137]; G=[12,64]

Some more results from downloaded images in the same collection:

Disp.: R=[54,122]; G=[34,56]
Disp.: R=[-4,110]; G=[4,41]
Disp.: R=[7,125]; G=[7,60]
Disp.: R=[-43,174]; G=[-17,81]

From the results above we can see that the method works reasonably well for most input images. However, some inputs fail to align properly across channels, e.g. the first image, Emir of Bukhara ("emir.tif"), which shows obvious artifacts in the color image. One major reason could be that this image contains many color patches with low variance, which makes the SSD scores similar for small shifts. Using better features would mitigate this issue; see Bells & Whistles for more details.


Bells & Whistles: Better Features

Images with higher resolution often contain large patches with low color variance, which is not particularly helpful for alignment. More informative features exist, such as gradients and edge maps. For this part, we use the edge maps of the different channels for alignment, where the edges in each image are detected with the Sobel filter.
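The edge maps can be computed as sketched below (an illustrative pure-numpy version; the helper names are ours). The Sobel kernels respond to horizontal and vertical intensity changes, and their magnitude replaces the raw intensities as the alignment feature. Strictly speaking the sliding-window helper computes cross-correlation rather than convolution, but for the symmetric-magnitude Sobel response the distinction does not matter.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def cross_correlate2d(img, kernel):
    """Valid-mode 2-D cross-correlation via accumulated shifted slices."""
    kh, kw = kernel.shape
    h, w = img.shape
    oh, ow = h - kh + 1, w - kw + 1
    out = np.zeros((oh, ow))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * img[i:i + oh, j:j + ow]
    return out

def edge_map(channel):
    """Gradient magnitude from the two Sobel responses; alignment is
    then run on these maps instead of raw intensities."""
    img = channel.astype(np.float64)
    return np.hypot(cross_correlate2d(img, SOBEL_X),
                    cross_correlate2d(img, SOBEL_Y))
```

Because edge maps discard absolute color, the low-variance colored patches that confused the SSD on "emir.tif" no longer dominate the score.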

Below are the edge maps of Emir of Bukhara ("emir.tif") together with some other examples and their alignment results using this feature.

Emir edge map
Disp.: R=[41,107]; G=[24,49]
Cathedral edge map
Disp.: R=[3,12]; G=[2,5]
Harvesters edge map
Disp.: R=[14,123]; G=[18,60]
Icon edge map
Disp.: R=[23,90]; G=[17,41]
Lady edge map
Disp.: R=[13,114]; G=[8,53]
Self-portrait edge map
Disp.: R=[37,175]; G=[29,78]
Three generations edge map
Disp.: R=[11,112]; G=[14,52]
Turkmen edge map
Disp.: R=[29,117]; G=[22,56]
Train edge map
Disp.: R=[28,116]; G=[21,55]
Village edge map
Disp.: R=[28,116]; G=[21,55]

Bells & Whistles: Better transformations

The above methods and results assume that the alignment needs only translation in the X-Y plane. In reality, more transformations could apply. When taking the photos, the displacement can also be in the Z direction, which translates to scaling in the photo, and there can also be slight rotation between the exposures for different channels. In this part we add two more search dimensions: scaling (zooming while retaining the aspect ratio) and rotation. The implementation searches scales in [0.95, 1.05] with a 0.01 interval and rotations in [-5, 5] degrees with a 1-degree interval, both of which could be tuned further. For both dimensions, the extra space is filled with black pixels. Below are two examples of the added dimensions with exaggerated extent.
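The added dimensions can be sketched as a grid search wrapped around the translation alignment. This is an illustrative version, not the assignment code: `transform`, `align_4d`, and the `align_fn` hook (which should return a shift and its score) are hypothetical names, resampling is nearest-neighbor, and the rotation sign convention is arbitrary.

```python
import numpy as np

def transform(img, scale, degrees):
    """Rotate and scale about the image center via nearest-neighbor
    inverse mapping; out-of-range samples are filled with black (0)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    th = np.deg2rad(degrees)
    ys, xs = np.mgrid[0:h, 0:w]
    dy, dx = ys - cy, xs - cx
    # Inverse map: undo the rotation, then divide out the scale.
    src_y = (np.cos(th) * dy + np.sin(th) * dx) / scale + cy
    src_x = (-np.sin(th) * dy + np.cos(th) * dx) / scale + cx
    sy = np.rint(src_y).astype(int)
    sx = np.rint(src_x).astype(int)
    valid = (sy >= 0) & (sy < h) & (sx >= 0) & (sx < w)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out

def align_4d(channel, reference, align_fn):
    """Grid-search scale in [0.95, 1.05] (step 0.01) and rotation in
    [-5, 5] degrees (step 1); align_fn(candidate, reference) is a
    placeholder for the translation alignment, returning (shift, score)."""
    best = (np.inf, None)
    for scale in np.arange(0.95, 1.051, 0.01):
        for deg in range(-5, 6):
            cand = transform(channel, scale, deg)
            shift, score = align_fn(cand, reference)
            if score < best[0]:
                best = (score, (scale, deg, shift))
    return best[1]
```

Each of the 121 (scale, rotation) cells runs a full translation search, which is where the large increase in processing time comes from.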

Harvesters zooming (original, zoom 0.5, zoom 1.5)
Harvesters rotation (original, clock-wise 30 degrees, counter clock-wise 30 degrees)

Since these two additional dimensions alone add up to 11 × 11 = 121 possible combinations, the processing time is much longer than with translation only. We conduct an experiment on "three_generations.tif" with this method. In the caption for the 4D alignment, "Z." means zooming scale, "Ro." means rotation degree, and "S" means shift displacement.

Edge alignment
Better transformation (4D alignment)
R=[11, 112]; G=[14, 52]
R=[Z: 1.01, Ro.: -1, S:(11, 99)]; G=[Z: 1.01, Ro.: -1, S:(15, 40)]

Bells & Whistles: SIFT Keypoints Matching

Aligning different channels can rely on different features. Here we detect a number of keypoints in each channel using the Scale-Invariant Feature Transform (SIFT), a method that detects and describes local features around a pixel. Descriptors that are similar across channels can then be matched up as another way of alignment. The figures below show keypoints from the R, G, and B channels of the cathedral image. In practice, to find keypoints of better quality, we center-crop the image with a 10% margin.
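Keypoint detection itself is typically done with a library (e.g. OpenCV's `cv2.SIFT_create()`), which returns a 128-dimensional descriptor per keypoint. The matching step on those descriptors can be sketched in plain numpy as below; the function name and the top-k policy are our illustrative choices.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, top_k=20):
    """Match each descriptor in desc_a to its nearest neighbor in
    desc_b (Euclidean distance) and keep the top_k closest pairs."""
    # Pairwise squared distances via ||a - b||^2 = a^2 + b^2 - 2ab.
    d2 = (np.sum(desc_a ** 2, axis=1)[:, None]
          + np.sum(desc_b ** 2, axis=1)[None, :]
          - 2.0 * desc_a @ desc_b.T)
    nn = np.argmin(d2, axis=1)                   # best match per row
    best = d2[np.arange(len(desc_a)), nn]
    order = np.argsort(best)[:top_k]             # keep the closest pairs
    return [(i, nn[i]) for i in order]
```

The returned index pairs (into the two keypoint lists) are what the figures below visualize for the B-R and B-G channel pairs.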

Cathedral R channel keypoints
Cathedral G channel keypoints
Cathedral B channel keypoints

By computing the similarities between descriptors of keypoints from different channels, we match the top 20 keypoints between the B-R and B-G channels, obtaining the following.

B (left) and R (right) channel keypoints matching
B (left) and G (right) channel keypoints matching


Bells & Whistles: Least Square Estimate of Projection Matrix

If we view the alignment as a projective transformation rather than a mere translation, we have the following equation:$$\begin{bmatrix} H \end{bmatrix}_{3\times3} \cdot \begin{bmatrix} x \\ y \\ 1 \\ \end{bmatrix} = \lambda \cdot \begin{bmatrix} x' \\ y' \\ 1 \\ \end{bmatrix},$$where \(H\) is the projection matrix, \((x, y)\) is a keypoint coordinate in one image, and \((x', y')\) is its corresponding keypoint coordinate in another image. In our practice, \((x, y)\) is from channel R or G and \((x', y')\) is from B. \(\lambda\) is a scalar. The projection matrix has 8 degrees of freedom, so 4 pairs of keypoints are needed to solve for it. We take the top 20 keypoint pairs with the smallest distances between their descriptors, forming a least squares problem: $$ \begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1x_1' & -y_1x_1' \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -x_1y_1' & -y_1y_1' \\ x_2 & y_2 & 1 & 0 & 0 & 0 & -x_2x_2' & -y_2x_2' \\ 0 & 0 & 0 & x_2 & y_2 & 1 & -x_2y_2' & -y_2y_2' \\ & & & & ... & & & \\ & & & & ... & & & \\ x_n & y_n & 1 & 0 & 0 & 0 & -x_nx_n' & -y_nx_n' \\ 0 & 0 & 0 & x_n & y_n & 1 & -x_ny_n' & -y_ny_n' \\ \end{bmatrix} \begin{bmatrix} H_{1, 1} \\ H_{1, 2} \\ H_{1, 3} \\ H_{2, 1} \\ H_{2, 2} \\ H_{2, 3} \\ H_{3, 1} \\ H_{3, 2} \\ \end{bmatrix} = \begin{bmatrix} x_1' \\ y_1' \\ x_2' \\ y_2' \\ ... \\ ... \\ x_n' \\ y_n'\\ \end{bmatrix}. $$ With 20 SIFT keypoint pairs, we have 40 equations for the least squares estimate \(\tilde{h}=\arg \min_h ||Ah-b||^2\), where \(A, h, b\) correspond, in order, to the three matrices in the above equation. Setting \(\nabla_h ||Ah-b||^2 = 0\) gives \(\tilde{h}=(A^TA)^{-1}A^Tb\).
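The least-squares fit above can be sketched as below. The function name is ours; we use `np.linalg.lstsq`, which yields the same estimate as the normal-equations formula \((A^TA)^{-1}A^Tb\) but is numerically more stable.

```python
import numpy as np

def fit_homography(src, dst):
    """Least-squares estimate of the 3x3 projection matrix (with
    H[2, 2] fixed to 1) from n >= 4 correspondences src -> dst."""
    A, b = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        # Two rows of the stacked linear system per correspondence.
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.extend([xp, yp])
    h, *_ = np.linalg.lstsq(np.asarray(A, dtype=float),
                            np.asarray(b, dtype=float), rcond=None)
    return np.append(h, 1.0).reshape(3, 3)
```

With exact correspondences the recovered matrix matches the true one up to floating-point precision; with noisy SIFT matches it is the minimizer of the squared reprojection-equation residual.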

In our practice, we apply this method to the cathedral image ("cathedral.jpg"). The projection matrices for channel R and channel G are: $$ \tilde{H}_{R-B}=\begin{bmatrix} 1.00827106e+00 & -1.99447227e-03 & 2.47675004e+00 \\ 8.25956498e-03 & 1.00280905e+00 & 1.03396081e+01 \\ 3.23789581e-05 & -1.89239902e-05 & 1 \\ \end{bmatrix} $$ $$\tilde{H}_{G-B}=\begin{bmatrix} 9.76731053e-01 & -5.80008295e-02 & 6.42160211e+00 \\ 7.40887563e-03 & 9.09362629e-01 & 9.10895190e+00 \\ 3.40525560e-05 & -3.85026578e-04 & 1 \\ \end{bmatrix} $$ The final result of the alignment is computed through inverse warping: for each position in the B channel, we compute the corresponding position in the R or G channel via \(\tilde{H}\) and use bilinear interpolation to compute the actual pixel value. The final output is shown below.

Cathedral alignment with least square estimates of projection matrices using SIFT keypoints
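The inverse-warping step described above can be sketched as follows; `inverse_warp` is our illustrative name, the channel is assumed to be a 2-D float array, and out-of-range source coordinates are simply clamped to the image for brevity.

```python
import numpy as np

def inverse_warp(channel, H, out_shape):
    """For every output (B-channel) pixel, map through H to a source
    position in `channel` and sample it with bilinear interpolation."""
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    ones = np.ones_like(xs)
    # Map all output pixels through H in one matrix product.
    src = H @ np.stack([xs.ravel(), ys.ravel(), ones.ravel()]).astype(float)
    sx, sy = src[0] / src[2], src[1] / src[2]
    # Clamp the four neighboring sample indices to the image.
    x0 = np.clip(np.floor(sx).astype(int), 0, channel.shape[1] - 2)
    y0 = np.clip(np.floor(sy).astype(int), 0, channel.shape[0] - 2)
    fx = np.clip(sx - x0, 0.0, 1.0)
    fy = np.clip(sy - y0, 0.0, 1.0)
    c = channel.astype(float)
    top = (1 - fx) * c[y0, x0] + fx * c[y0, x0 + 1]
    bot = (1 - fx) * c[y0 + 1, x0] + fx * c[y0 + 1, x0 + 1]
    return ((1 - fy) * top + fy * bot).reshape(h, w)
```

Warping from the output grid backwards through \(\tilde{H}\) avoids the holes that a forward warp would leave, which is why the inverse direction is used.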

From the result, we can see that the alignment is not entirely satisfactory. One major reason could be that not all of the top 20 keypoint matches are proper matches: they may have similar surroundings yet not actually correspond across channels. Also, the keypoints should be detected after proper denoising. Median filtering was tried for this issue, but the result was worse than the one reported.