16-889 Assignment 2: Single View to 3D

Naveen Venkat (Andrew id: nvenkat)



Late Days: 2



A Note on implementation details. The provided main.py accepts the problem number as an argument (e.g., python main.py --problem="1.1" or python main.py --problem="2.2_eval"). Wherever applicable, the exact command is given in the corresponding Implementation Details subsection of this writeup, and the same command is invoked from main.py. Please refer to the help in main.py (python main.py --help), or to the corresponding section of the writeup, for the available options.


1. Exploring loss functions



1.1. Fitting a voxel grid (5 points)


1.1.1 Implementation Details.

Executing the following will optimize the voxel grid:

    python fit_data.py --type 'vox'

and the following will visualize the optimized voxel grid:

    python fit_data.py --type 'vox_visualize'

The function losses.voxel_loss implements the binary cross-entropy loss (with optional weighting) for learning occupancy. The learned voxel grid is written to a file named learned_voxel_src.pt and converted to a mesh using the PyMCubes marching-cubes implementation.
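
For reference, a minimal sketch of what losses.voxel_loss could look like is given below (the function and flag names follow this writeup; the exact signature in the submitted code may differ):

    import torch
    import torch.nn.functional as F

    def voxel_loss(voxel_pred, voxel_gt, use_weighted=False, pos_weight=2.0):
        # voxel_pred: (B, 32, 32, 32) raw logits; voxel_gt: (B, 32, 32, 32) occupancy in {0, 1}
        pw = torch.tensor(pos_weight, device=voxel_pred.device) if use_weighted else None
        # binary_cross_entropy_with_logits applies the sigmoid internally and, when
        # pos_weight is given, up-weights the occupied (positive) voxels
        return F.binary_cross_entropy_with_logits(voxel_pred, voxel_gt, pos_weight=pw)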


1.1.2 Results

The mesh obtained from the voxel (after smoothing) is as follows:

learned_voxel_src

Note that although the voxel grid resides in the first octant (each coordinate axis spanning the interval [0, 32]), the mesh (a smoothed representation of the voxel grid obtained by marching cubes) is centered during rendering.




1.2. Fitting a point cloud (10 points)


1.2.1 Implementation Details

Executing the following will optimize the point cloud:

    python fit_data.py --type 'point'

and the following will visualize the optimized point cloud:

    python fit_data.py --type 'point_visualize'

The function losses.chamfer_loss implements the Chamfer loss (with L2 distances) for optimizing the point cloud. The learned point cloud is written to a file named learned_pointclouds_src.pt.
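
A minimal sketch of a bidirectional Chamfer loss with squared L2 distances is shown below (using torch.cdist for clarity; the submitted losses.chamfer_loss may instead rely on pytorch3d's nearest-neighbour utilities):

    import torch

    def chamfer_loss(points_src, points_tgt, use_bidirectional=True):
        # points_src: (B, N, 3), points_tgt: (B, M, 3)
        dists = torch.cdist(points_src, points_tgt) ** 2    # (B, N, M) squared L2 distances
        loss = dists.min(dim=2).values.mean()               # src -> tgt nearest-neighbour term
        if use_bidirectional:
            loss = loss + dists.min(dim=1).values.mean()    # tgt -> src term
        return loss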


1.2.2 Result

The point cloud obtained after rendering is as follows:

learned_pointclouds_src

Note that the point cloud is centered about its mean during rendering.




1.3. Fitting a mesh (5 points)


1.3.1 Implementation Details

Executing the following will optimize the mesh:

    python fit_data.py --type 'mesh' --w_smooth=0.4

and the following will visualize the optimized mesh:

    python fit_data.py --type 'mesh_visualize'

The function losses.smoothness_loss implements the Laplacian smoothness loss for optimizing the mesh. Two implementations are available (one written manually, which is commented out, and one using the pytorch3d library function, which is in use). The learned mesh is written to a file named learned_mesh_src.pt.
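
A sketch of the library-based variant of losses.smoothness_loss is shown below (the hand-written variant mentioned above builds the uniform Laplacian explicitly; this one defers to pytorch3d):

    from pytorch3d.loss import mesh_laplacian_smoothing

    def smoothness_loss(meshes, method="uniform"):
        # Penalizes each vertex's offset from the (uniform) average of its neighbours
        return mesh_laplacian_smoothing(meshes, method=method)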


1.3.2 Result

The mesh obtained after rendering is as follows:

w_smooth=0.4
learned_mesh_src

Note that the mesh is centered about its mean during rendering.

A few other values of the hyperparameter w_smooth are shown below:

w_smooth=0.1 w_smooth=1.0
learned_mesh_src_wsmooth_0.1 learned_mesh_src_wsmooth_1.0




2. Reconstructing 3D from single view



2.1. Image to voxel grid (15 points)


2.1.1 Results

Note. The results reported here correspond to Architecture 4 described below. Other architectures have also been implemented and an ablation study is presented; however, the corresponding code blocks have been commented out. To run the other architectures, uncomment the corresponding code blocks in model.py.

Avg F1@0.05: 89.05

Here, three examples are shown:

Example id=460
Single-view Image Predicted Voxel Ground-truth Mesh
img_0_vox 0_vox 0_vox
Example id=70
Single-view Image Predicted Voxel Ground-truth Mesh
img_0_vox 30_vox 30_vox
Example id=560
Single-view Image Predicted Voxel Ground-truth Mesh
img_0_vox 30_vox 30_vox

2.1.2 Implementation Details

Train:

  python3 train_model.py --type 'vox' --save_freq=500 --batch_size=32 --weight_decay=0.00001 --pos_weight=2.0 --gpu_id=1 --num_workers=4 \ 
                         --pin_memory=False --max_iter=8000 --lr=4e-4

Evaluate:

  python3 eval_model.py --type 'vox' --load_checkpoint --batch_size=1 --vis_freq=10 --num_workers=1 --gpu_id=0

Training is performed with weight decay (1e-5) and a weighted binary cross-entropy loss on the outputs, where the positive class is given a weight of 2.0. The loss is implemented in losses.voxel_loss, where the use_weighted flag is set to False.

The training takes around 6 hours for 1 epoch on a GTX 1060 (6GB) GPU with a batch size of 32.


2.1.3 Architecture Details

Here, 4 baseline models have been trained (each described in comments in model.SingleViewto3D.__init__ and model.SingleViewto3D.forward). They are described below; refer to Section 2.5 for sample outputs from each baseline.

(The encoder (resnet18) outputs a 512-dimensional feature, which is reshaped appropriately before being passed to the various decoder layers.)


  1. [Architecture 1] Experiment with a simple FC network: Two types of decoders containing linear layers have been tried out:

Architecture 1a:

   Input(512) -> Linear(32*32*32, None, None)

Architecture 1b:

    Input(512) -> Linear(1024, ELU, BN) -> Linear(2048, ELU, BN) -> Linear(32*32*32, None, None) 

where Linear(N, Act, Norm) denotes a fully connected layer with N outputs, activation Act (ELU), and normalization Norm (BatchNorm). This network turns out to be too small to fit the entire dataset, leaving scope for improvement; in particular, the model tends to output an average voxel representation for most chairs (see visualizations in Section 2.5). A sketch of the Architecture 1b decoder is shown below.
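
As an illustration, Architecture 1b's decoder could be written as the following nn.Sequential (a sketch with illustrative names; the submitted model.py may differ):

    import torch.nn as nn

    # Sketch of Architecture 1b: Linear -> ELU -> BatchNorm blocks ending in a 32^3 logit grid
    decoder_1b = nn.Sequential(
        nn.Linear(512, 1024), nn.ELU(), nn.BatchNorm1d(1024),
        nn.Linear(1024, 2048), nn.ELU(), nn.BatchNorm1d(2048),
        nn.Linear(2048, 32 * 32 * 32),  # raw logits: no activation or normalization
    )
    # forward pass: voxels_pred = decoder_1b(encoded_feat).view(-1, 32, 32, 32)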


  2. [Architecture 2] Experiment with a separable deconv network: The decoder predicts 3 planes of grid size 32x32 corresponding to the XY, YZ, and ZX planes. The value of a voxel at location (x, y, z) is computed as the sum of its projections onto these planes, i.e., XY[x, y] + YZ[y, z] + XZ[x, z], as predicted by the decoder. This simplifies the network architecture by decomposing the prediction of the entire voxel volume into a summation of three learned planes. The intuition is similar to separable convolutions: one can approximate learning the entire 32 x 32 x 32 volume by learning projections onto the three planes in 3D space. This idea has been explored in recent works such as the architecture proposed in the EG3D paper (Chan et al.).

The decoder contains layers as follows

    Input(512) -> Reshape(512, 1, 1)
    ConvTranspose2d(512, 256, kernel=(5, 5), stride=(1, 1)) -> ELU -> BN2d
    ConvTranspose2d(256, 128, kernel=(3, 3), stride=(2, 2)) -> ELU -> BN2d
    ConvTranspose2d(128, 128, kernel=(5, 5), stride=(1, 1)) -> ELU -> BN2d
    ConvTranspose2d(128,   3, kernel=(3, 3), stride=(2, 2), output_padding=(1, 1)) -> ELU -> BN2d

Finally, the prediction is computed as summation of the tri-planes:

    decoded_feat = decoder(encoded_feat)
    plane_xy = decoded_feat[:, 0, :, :]  # (B, 32, 32), indexed by (x, y)
    plane_yz = decoded_feat[:, 1, :, :]  # (B, 32, 32), indexed by (y, z)
    plane_xz = decoded_feat[:, 2, :, :]  # (B, 32, 32), indexed by (x, z)
    # broadcast-sum the three planes into a (B, 32, 32, 32) volume indexed by (x, y, z)
    voxels_pred = plane_xy.unsqueeze(3) + plane_yz.unsqueeze(1) + plane_xz.unsqueeze(2)

This network is unable to learn well. This could be due to several reasons: (1) the ConvTranspose2d layers only operate along two spatial dimensions (each ConvTranspose2d filter is independent); (2) the last ConvTranspose2d layer maps the latent representation directly to the 32x32 spatial resolution without any non-linearity applied in that space (even after reaching the 32x32 tri-plane resolution, some non-linearity is needed for the network to learn sufficiently). Accordingly, when this network fails, it outputs columns of densities rather than coherent voxels (a consequence of the summation applied to the tri-planes). Further hyperparameter tuning was not carried out for this architecture because better architectures were found. See Section 2.5 for an interesting failure case.


  3. [Architecture 3] Experiment with an FC network with skip connections: As hypothesized for Architecture 2 above, the network's inability to learn non-linearities in the voxel space could be resolved with a deeper network. To this end, we first devise a systematic mechanism to increase the depth of the network using skip connections (similar to Residual Networks (He et al.)). The decoder consists of fully connected layers with skip connections and batch normalization. This baseline is implemented to understand the effect of fully connected layers.
   Input(512) -> Linear(256, ELU, BN) -> Linear(64, ELU, BN)
   Linear(64, ELU, BN) + (skip)
   Linear(64, ELU, BN) + (skip)
   Linear(64, ELU, BN) + (skip)
   Linear(64, ELU, BN) + (skip)
   Linear(64, ELU, BN) + (skip)
   Linear(32*32*32, None, None)

In the architecture above, + (skip) denotes a skip connection from the previous layer. Here, we find that the outputs are over-smooth and the network barely learns high-frequency information (such as small bumps, gaps, or slits in chairs), although an approximate voxel structure (blobs) does appear. This can be attributed to the fact that a fully connected network processes each output voxel independently of the others, so no spatial reasoning is built into the network.

A hand-wavy explanation is that the network learns to output rough (averaged) voxel volumes instead of modelling the high-frequency details, which would be better learned by incorporating spatial context. A sketch of one residual FC block is given below.
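
The following is an assumed form of one residual FC block of the kind used in this decoder, shown purely for illustration (the actual block in model.py may differ slightly):

    import torch.nn as nn

    class ResidualFCBlock(nn.Module):
        """Linear -> ELU -> BatchNorm followed by an identity skip connection (dims must match)."""
        def __init__(self, dim=64):
            super().__init__()
            self.fc = nn.Linear(dim, dim)
            self.act = nn.ELU()
            self.bn = nn.BatchNorm1d(dim)

        def forward(self, x):
            return x + self.bn(self.act(self.fc(x)))  # the "+ (skip)" in the listing above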


  4. [Architecture 4] Experiment with FC + Conv3D + skip connections: This is the final architecture that worked. Here, the decoder contains an initial set of fully connected layers that transform the encoding into a 32x32x32 grid, which is then operated on by Conv3d layers (with 'same' padding and ELU non-linearities) to obtain the final 32x32x32 voxel prediction. The intuition is to move quickly into the 32x32x32 voxel space and apply non-linearities there, which resolves the issues with the architectures above. Moreover, to enhance the non-linearity of the network, its depth is increased along with residual connections (again, following the limited success of Architecture 3).
   Input(512) -> Linear(128, ELU, BN) -> Linear(256, ELU, BN) -> Linear(32*32*32, None, None)
   Conv3d(32, k=3, s=1, p='same') -> ELU -> BN3d -> Conv3d(32, k=3, s=1, p='same') -> ELU -> BN3d + (skip; Conv3d(k=3))
   Conv3d(64, k=3, s=1, p='same') -> ELU -> BN3d -> Conv3d(64, k=3, s=1, p='same') -> ELU -> BN3d + (skip; Conv3d(k=3))
   Conv3d(128, k=3, s=1, p='same') -> ELU -> BN3d -> Conv3d(128, k=3, s=1, p='same') -> ELU -> BN3d + (skip; Conv3d(k=3))
   Conv3d(1, k=3, s=1, p='same')

Here, the skip connection is implemented with a Conv3d layer (1x1x1 kernel) in between, so that the input and output of the conv block have the same number of channels (making them congruent for addition through the skip connection). A sketch of one such block is given below.
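
For illustration, one such Conv3d block with the projection convolution on the skip path could look as follows (a sketch following the 1x1x1 projection described above; the submitted code may differ):

    import torch.nn as nn

    class ResidualConv3dBlock(nn.Module):
        """Two Conv3d -> ELU -> BatchNorm3d layers added to a channel-projected skip path."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv3d(in_ch, out_ch, kernel_size=3, padding='same'), nn.ELU(), nn.BatchNorm3d(out_ch),
                nn.Conv3d(out_ch, out_ch, kernel_size=3, padding='same'), nn.ELU(), nn.BatchNorm3d(out_ch),
            )
            # projection convolution so the skip path matches the block's output channels
            self.skip = nn.Conv3d(in_ch, out_ch, kernel_size=1)

        def forward(self, x):
            return self.body(x) + self.skip(x)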

Architecture 4 is the model on which the results in this subsection are reported.




2.2. Image to point cloud (15 points)


2.2.1 Results

Avg F1@0.05: 94.056

Here, three examples are shown:

Example id=460
Single-view Image Predicted Point Cloud Ground-truth Mesh
img_0_vox 0_point 0_point
Example id=70
Single-view Image Predicted Point Cloud Ground-truth Mesh
img_0_point 30_point 30_point
Example id=570
Single-view Image Predicted Point Cloud Ground-truth Mesh
img_0_point 30_point 30_point

2.2.2 Implementation Details

Train:

  python3 train_model.py --type 'point' --save_freq 500 --batch_size=32 --weight_decay=0.00001 --num_workers=2 --gpu_id=1 --pin_memory=False \ 
                         --max_iter=10000 --lr=5e-4 --scheduler_step=2000 --scheduler_gamma=0.5

Evaluate:

  python3 eval_model.py --type 'point' --load_checkpoint --batch_size=1 --vis_freq=10 --num_workers=1 --gpu_id=1 --ico_sphere_level=4

Training is performed with a learning rate of 5e-4 and a weight decay of 1e-5. A step learning-rate scheduler (StepLR) is applied with a step size of 2000 iterations and gamma 0.5. The loss is implemented in losses.chamfer_loss with the use_bidirectional flag set to True.
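
The optimizer and scheduler setup implied by these flags would look roughly as follows (a sketch assuming an Adam optimizer, which the writeup does not state explicitly; the model below is a placeholder):

    import torch
    import torch.nn as nn

    model = nn.Linear(512, 3 * 5000)  # placeholder for the image-to-point-cloud network
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-5)
    # --scheduler_step / --scheduler_gamma: halve the learning rate every 2000 iterations
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2000, gamma=0.5)

    # inside the training loop, per iteration:
    #   optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()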

The training takes around 7 hours for 1 epoch on a GTX 1060 (6GB) GPU with a batch size of 32.


2.2.3 Architecture Details

Here, we observe that training a simpler non-linear network suffices. Specifically, a fully connected network with activations and normalizations is able to perform reasonably well. The architecture is as follows

      Input(512) -> Linear(1024, ELU, BN) 
      Linear(1024, ELU, BN)
      Linear(2048, ELU, BN)
      Linear(3 * n_points, None, None)

where n_points is the number of points predicted in the point cloud (set to 5000).
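
A sketch of this decoder and the reshape into a (B, n_points, 3) point cloud is shown below (illustrative names; the submitted model.py may differ):

    import torch.nn as nn

    n_points = 5000
    decoder_point = nn.Sequential(
        nn.Linear(512, 1024), nn.ELU(), nn.BatchNorm1d(1024),
        nn.Linear(1024, 1024), nn.ELU(), nn.BatchNorm1d(1024),
        nn.Linear(1024, 2048), nn.ELU(), nn.BatchNorm1d(2048),
        nn.Linear(2048, 3 * n_points),  # raw (x, y, z) coordinates, no activation
    )
    # forward pass: pointclouds_pred = decoder_point(encoded_feat).view(-1, n_points, 3)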

A plausible hypothesis supporting the simplicity of the network could be as follows. Note that the image encoder (ResNet18) is also being trained to extract features which enable learning the point cloud. As it turns out, for a point cloud, there is no concept of connectivity (in some sense, continuity or neighborhood-ness). Therefore, each neuron in the output layer can independently learn to project the points in the 3D space (of course, following some non-linear operation which converts the latent features to the point cloud representation). Moreover, since the ResNet model itself is being trained, the overall network can sufficiently capture the distribution of the point cloud.

As for the error in the predicted points, note that there is no fixed ground-truth point cloud: points are sampled from the ground-truth mesh when computing the Chamfer loss, so there is no fixed target that the model can regress towards. Since we only train for a few epochs ((32 / 61000) * 10000 ≈ 5.24 epochs), the points sampled from the ground-truth mesh are insufficient to model the entire mesh accurately. This is also seen as fluctuating results when running eval_model.py (the evaluation depends on the sampled point clouds and can deviate by ~2% in the F1 measure). This stochasticity in evaluation is not ideal, but devising a standard evaluation protocol is beyond the scope of this assignment.




2.3. Image to mesh (15 points)


2.3.1 Results

Avg F1@0.05: 82.73

Example id=30
Single-view Image Predicted Mesh Ground-truth Mesh
img_0_mesh 0_mesh 0_mesh
Example id=130
Single-view Image Predicted Mesh Ground-truth Mesh
img_130_mesh 3130_mesh 3130_mesh
Example id=560
Single-view Image Predicted Mesh Ground-truth Mesh
img_0_mesh 30_mesh 30_mesh

2.3.2 Implementation Details

Train:

  python3 train_model.py --type 'mesh' --save_freq 500 --batch_size=32 --weight_decay=0.00001 --w_smooth=0.4 --ico_sphere_level=4 --gpu_id=1 \
                         --num_workers=2 --pin_memory=False --max_iter=5000 --lr=5e-4 --scheduler_step=2000 --scheduler_gamma=0.5

Evaluate:

  python3 eval_model.py --type 'mesh' --load_checkpoint --batch_size=1 --vis_freq=10

Training is performed with a learning rate of 5e-4 and a weight decay of 1e-5. A step learning-rate scheduler (StepLR) is applied with a step size of 2000 iterations and gamma 0.5. The losses are implemented in losses.chamfer_loss (with the use_bidirectional flag set to True) and losses.smoothness_loss (with method set to uniform).
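
Putting the two terms together, the mesh training objective would look roughly like the sketch below (pytorch3d's chamfer_distance and sample_points_from_meshes stand in for the corresponding calls in the submitted code):

    from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
    from pytorch3d.ops import sample_points_from_meshes

    def mesh_loss(mesh_pred, mesh_gt, w_smooth=0.4, n_points=5000):
        # Chamfer term: compare point sets sampled from the predicted and ground-truth meshes
        points_pred = sample_points_from_meshes(mesh_pred, num_samples=n_points)
        points_gt = sample_points_from_meshes(mesh_gt, num_samples=n_points)
        loss_chamfer, _ = chamfer_distance(points_pred, points_gt)
        # Smoothness term: uniform Laplacian regularization on the predicted mesh
        loss_smooth = mesh_laplacian_smoothing(mesh_pred, method="uniform")
        return loss_chamfer + w_smooth * loss_smooth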


2.3.3 Architecture Details

The best result was obtained using a wide model, as follows:

      Input (512) -> Linear(1024, ELU) 
      Linear(1024, ELU)
      Linear(2048, ELU)
      Linear(3 * n_points, None)

Here too, I tried various architectures; however, training a mesh model was notoriously difficult for the reasons stated in the following sections. Therefore, I chose an architecture similar to the point cloud model in Section 2.2.3. Removing the BatchNorm layers seemed to improve performance.


2.3.4 Discussion

When computing the Chamfer loss, only around n_points=5000 points are sampled from each of the two meshes (the predicted mesh and the ground-truth mesh). Given that the mesh is initialized from ico_sphere(4) (around 2562 vertices and 5120 faces), computing the loss on just 5000 points sampled from the faces is insufficient.

Clearly, at best, uniform sampling yields only around one point per face, which is insufficient to learn the face orientations (or even the vertex locations) accurately. Since we only train for a few epochs (~5 epochs; computed in Section 2.2.3), any training example sees no more than a handful of points sampled from each face, and therefore the model is unable to learn fine details along the surface of the chairs.

This also shows up as spurious points shooting out of the predicted mesh: these are noisy vertices produced by an insufficiently trained model. Ignoring these effects, the model is able to learn the general shape of the mesh. Note that an F1@0.05 of around 80 still indicates a significant overlap between the predicted and ground-truth meshes, and the sharp spurious vertices shooting off the mesh are due to an insufficient training signal. Training for more epochs should improve the predictions (because more points would be sampled from the faces across epochs).


2.3.5 What can we do to improve the results?

Here are two approaches that should improve the model's predictions; they have not been carried out due to computational constraints, so the following analytical discussion is intended to complement the (somewhat) poor visualizations above. Either of these approaches (or a combination of the two) should certainly help:

  1. Increasing n_points: As explained above, the Chamfer loss is computed on only (approximately) one sampled point per face of the mesh being learned. This is extremely insufficient, considering that an accurate mesh requires (1) accurate face orientations and (2) well-aligned vertices. Given that we only train for a few epochs, it is hard to expect a well-aligned mesh. The most direct fix would be to increase the number of points used for computing the Chamfer loss (to, say, n_points=50000). Ensuring more points per face per iteration would significantly improve the smoothness of the results.
  2. Tuning the smoothness loss: The smoothness loss is a major factor in ensuring that each vertex stays close to the mean of its neighbors. A higher weight w_smooth should reduce the burden on the model of orienting the faces using sampled points; consequently, the spurious vertices observed above should also reduce. In this experiment, w_smooth=0.4 was chosen by observing the trend in Section 1.3.2 (please refer to the GIFs for w_smooth=0.1, 0.4, 1.0), from which it appeared that w_smooth=0.4 would yield reasonable results. However, the setting here is quite different (a single model is optimized across 61k images), so the best value need not coincide with the w_smooth=0.4 that worked best in Section 1.3.2. A future direction would be to tune w_smooth.



2.4. Quantitative comparisons (10 points)

The average F1@0.05 for each model is as follows:

Model         Average F1@0.05
Mesh          82.73
Voxel         89.05
Point Cloud   94.05

Again, as described in Section 2.2.3, the numbers may vary slightly (around 2%) across evaluations because the F1 score is computed on points sampled from the ground-truth mesh (there is no fixed ground truth for evaluation). The results in the table above were obtained by running the evaluation 3 times per model and averaging the F1@0.05 score over the entire test set.

An extensive discussion of the trend is given in Sections 2.1-2.3. An (inexhaustive) summary is as follows (please refer to Sections 2.1-2.3 for a detailed explanation of the results):

  1. The point cloud model performs the best because it is the easiest to optimize. It requires a smaller network, and, as described before, there is less need to learn connectivity across points; a fully connected network therefore suffices, since it can learn to predict each point independently without modelling inter-point relationships.
  2. The voxel grid model employs an architecture better suited to its task. The next best model is the voxel-grid model, which uses the Conv3D architecture described in Section 2.1.3 (Architecture 4). The inductive bias of the 3D convolution operation suits the task of learning voxel volumes, because its output lies directly in the voxel space (a 32x32x32 grid). Moreover, the network has sufficient depth (with residual connections) to capture both the low-level and high-level characteristics needed to learn the voxel volume.
  3. The mesh model suffers from a weaker training procedure. As noted in Sections 2.3.4-2.3.5, the mesh model performs the worst because it does not learn sufficiently well: it is difficult to learn to orient the surfaces (with the chosen hyperparameters) using approximately one sampled point per face per iteration. Ways to improve this model are outlined in Section 2.3.5.



2.5. Analyse effects of hyperparameter variations (10 points)


I have thoroughly analyzed the effect of various architectures on learning to predict voxels in Section 2.1. Please refer to Section 2.1.3 for a detailed explanation. Here is a summary of the results:


  1. Employing shallow fully connected networks is insufficient. Clearly, an FC network is unable to learn spatial dependencies (continuity, neighborhood structure, and so on): each neuron independently predicts a voxel, which results in a loss of structural information (such as the variations across chairs). Here is a visualization of three sample chairs predicted by Architecture 1a (see Section 2.1.3).

Predicted Voxel (Left) and Ground-truth Mesh (Right)

0_point 0_point

0_point 400_point

0_point 500_point


  2. 2D convolutions are insufficient to capture 3D structure. As a recently popular approach (Chan et al.) showed, the tri-plane representation is one way to reduce the model's parameters while still predicting in 3D (the 32x32x32 voxel space). However, Chan et al. apply tri-planes in a latent space that feeds a deeper network (a differentiable rendering pipeline). While I tried to implement a similar idea here, the model did not train well (better strategies for combining the plane outputs could also be explored). Here is a failure case of this model that is interesting to observe:

Predicted Voxel (Left) and Ground-truth Mesh (Right)

0_point 0_point

Note how the predictions contain columns of occupancies. These columns arise from the nature of the summation operation (the values in a 2D plane are simply repeated across the third dimension by the broadcast addition). This issue could be resolved by incorporating more non-linear layers after the tri-plane representation; this was not investigated further due to time and computational constraints.


  3. Deeper (non-linear) architectures learn better structures. As described in Section 2.1.3, the third variant of the voxel model is a deeper FC network with skip connections. The skip connections allow features to be transferred at a granularity suitable for learning. This network expresses more variation than Architecture 1 above (which mostly predicts a mean chair representation for many types of chairs).

Predicted Voxel (Left) and Ground-truth Mesh (Right)

0_point 0_point

0_vox 300_vox

0_vox 500_vox

Compare these with the outputs of Architecture 1a above. Clearly, the deeper network of Architecture 3 with skip connections has better expressive power. However, compared to the final architecture (Architecture 4), this network learns only coarse volumes, since it cannot integrate spatial knowledge (each output neuron predicts its voxel independently). Its advantage over Architectures 1a/1b is attributed to the deeper, more non-linear network being able to fit the training data better.


  4. 3D convolutions with non-linearities in the 32x32x32 voxel space learn fine structures. The best results are obtained by quickly decoding into the 32x32x32 voxel space and refining the structure with 3D convolutions. As seen in Architecture 4 in Section 2.1.3, the FC layers decode into the voxel space, after which the Conv3D layers learn the spatial structure of the ground-truth voxels. The key ingredients of this architecture come from the observations on Architectures 1-3: going deeper (with skip connections) and reasoning spatially (with Conv3D). We see fine structures being captured by this model.

Predicted Voxel (Left) and Ground-truth Mesh (Right)

0_point 0_point

0_vox 0_vox

0_vox 0_vox




2.6. Interpret your model (15 points)


2.6.1 Problem Setup

Suppose we are given a point cloud prediction model (as trained in Section 2.2), and we need to understand what part of the image caused the model to predict incorrectly. Below is a way to learn an image residual that indicates which parts of the image lacked the information that would have allowed the network to predict correctly.


2.6.2 Residual Maps

A residual map R(X, Y, f) is a function of the input image X, the associated ground-truth 3D representation (point cloud) Y, and the learned predictor f. The map indicates a residual in the input (image) space which, when added to the image X, minimizes the training objective chamfer(f(X + R), Y). The optimal residual map is therefore obtained as:

  R* = argmin_R chamfer(f(X + R), Y)

Essentially, by inferring a residual map R, we can visualize the regions of the image that need to be modified in order to minimize the error. In other words, the residual map indicates "what in the image needs to change for the model to predict correctly".


2.6.3 Algorithm to obtain a residual map

Given an image X, the associated ground truth point cloud Y, and a trained model f, the residual map is obtained by performing gradient descent on the loss function (chamfer loss) while updating the map R. The algorithm is as follows:

  ALGORITHM: RESIDUAL MAP

  [  Requires  ] X (input image), Y (target point cloud), f (trained predictor), N (max iters)
  [   Yields   ] R (output residual map)

  1:  INIT: R <- zeros_like(X)
  2:  LOOP
  3:      X = (X + R).clip(0, 1)
  4:      y = f(X)
  5:      loss = chamfer(y, Y)
  6:      loss.backward()
  7:      R <- SGD_Momentum_Step(R)      (update R using SGD with momentum)
  8:  UNTIL convergence of F1@0.05 or N iterations

  END ALGORITHM

Using gradient descent, a residual image is learned which, when added to the image, minimizes the Chamfer loss. The steps are as follows:

  1. Start with a zeroed residual map R of shape HxWx3 (same as the image X).
  2. At each iteration
    1. Add the residual to the (previously updated) input X and forward pass through the network (L3-L4)
    2. Compute the loss and backpropagate the gradients all the way up to the residual image (L5-L6)
    3. Update the residual using a stochastic gradient descent step with momentum (L7)
  3. Repeat the above process until convergence or until a set maximum number of iterations is reached

Note: Convergence can be defined in several ways. Here, for simplicity, I define convergence as the point at which F1@0.05 stops improving for a certain number of iterations (200).
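
A PyTorch sketch of this optimization loop is given below. It follows the objective in Section 2.6.2 (perturbing the original image X by R), reuses the chamfer_loss sketch from Section 1.2, and assumes a hypothetical evaluate_f1 helper for the F1@0.05 score; the actual visualize_model.py may differ in these details:

    import torch

    def compute_residual_map(X, Y, f, n_iters=5000, patience=200, lr=0.01):
        # X: (1, H, W, 3) input image, Y: (1, M, 3) ground-truth point cloud, f: trained predictor
        R = torch.zeros_like(X, requires_grad=True)
        optimizer = torch.optim.SGD([R], lr=lr, momentum=0.9)
        best_f1, since_best = -1.0, 0
        for it in range(n_iters):
            X_pert = (X + R).clamp(0, 1)            # perturbed image, kept in the valid pixel range
            y_pred = f(X_pert)
            loss = chamfer_loss(y_pred, Y)          # chamfer_loss as sketched in Section 1.2
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            f1 = evaluate_f1(y_pred.detach(), Y, threshold=0.05)  # hypothetical F1@0.05 helper
            if f1 > best_f1:
                best_f1, since_best = f1, 0
            else:
                since_best += 1
            if since_best >= patience:              # convergence: F1@0.05 stopped improving
                break
        return R.detach()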


2.6.4 Results & Discussion

Here are some visualizations from the chairs test set. Note that the residual R has been normalized by scaling its non-zero values to [0, 1].

The tables below each show 4 images (zoom in for further clarity):

  1. The input image passed to a point-cloud prediction model (trained in Section 2.2)
  2. The residual map obtained by the algorithm discussed above
  3. The F1@0.05 score computed through each iteration of the gradient descent in the algorithm above
  4. The Chamfer loss computed through each iteration of the gradient descent in the algorithm above

[Four examples are shown. Each example is a panel of four images: the original image X, the learned residual R, and the F1@0.05 and Chamfer loss curves over the optimization iterations.]

The tables above summarize two key aspects of the residual map generation process. (Zoom in to the full extent for clarity.)

  1. Quantitative Results. Clearly, the F1@0.05 score increases (~95% -> ~96% in the first example above) and the Chamfer loss falls over the iterations. This is a sanity check of the proposed algorithm: the learned residual is an additive component that yields predictions better aligned with the ground truth.

  2. Qualitative Results & Discussion.
    1. One can interpret the residual as a heat map where brighter values indicate perturbations that can improve the predictions. The residual has higher values around the boundaries of the object and in the background, indicating the importance of pixels around the object where there is a sharp change (the background is all black, while the foreground object is colored).
    2. A plausible hypothesis for this trend is as follows. Most of the model's knowledge is attributed to the foreground pixels, which allow the model to predict the outputs correctly. Whatever could be learnt by the model (with the chosen hyperparameters, epochs, etc.) has already been captured from the foreground. Therefore, to further improve the predictions, the model needs additional context surrounding the object (e.g. a background in which the chair is placed, so that the model can infer the pose of the chair better). The residual map should therefore be seen as a potential background that could provide this context to the model.
    3. As can be seen from the Chamfer loss curves on the right, convergence of the F1@0.05 score does not really correlate with convergence of the Chamfer loss (which would take significantly longer). Nevertheless, I chose to decide convergence based on the F1@0.05 score. Optimizing for more iterations would yield residual maps with higher contrast (and possibly some implicit patterns), which could reveal more about the model's knowledge.

2.6.5 Implementation Details

Execution:

  python3 visualize_model.py --index=150 --max_iter=5000 --patience=200 --load_checkpoint \
                             --checkpoint_name="./checkpoint_point.pth"

where --index corresponds to the index of the test example to evaluate, and --patience indicates the number of iterations used to determine convergence.