[16-889] Assignment 2 Submission: Aarush Gupta (aarushg3)

Late days used: 4



1. Exploring loss functions


1.1. Fitting a voxel grid (5 points)


Optimized voxel grid: Alt Text


Ground truth voxel grid: Alt Text
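
The loss code is not reproduced in this report; for reference, here is a minimal sketch of the per-voxel binary cross-entropy objective typically used for this fitting step (function name and tensor shapes are my own assumptions):

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_logits: torch.Tensor, gt_voxels: torch.Tensor) -> torch.Tensor:
    # Per-voxel binary cross-entropy between predicted occupancy logits
    # of shape (B, D, H, W) and the {0, 1} ground-truth grid of the same shape.
    return F.binary_cross_entropy_with_logits(pred_logits, gt_voxels.float())
```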


1.2. Fitting a point cloud (10 points)


Optimized point cloud: Alt Text


Ground truth point cloud: Alt Text
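
Likewise, a minimal sketch of the squared symmetric chamfer loss used to fit a point cloud, written directly with `torch.cdist` rather than a library call (names are illustrative):

```python
import torch

def chamfer_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # pred: (B, N, 3) predicted points, gt: (B, M, 3) ground-truth points.
    d = torch.cdist(pred, gt) ** 2  # (B, N, M) squared pairwise distances
    # Nearest-GT term for every predicted point, plus the symmetric term.
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()
```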


1.3. Fitting a mesh (5 points)


Optimized mesh: Alt Text


Ground truth mesh: Alt Text
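
For meshes, the two surfaces are usually compared through point samples, since the predicted and ground-truth vertices are not in correspondence. A sketch assuming PyTorch3D's sampling and chamfer utilities:

```python
from pytorch3d.loss import chamfer_distance
from pytorch3d.ops import sample_points_from_meshes

def mesh_fit_loss(pred_mesh, gt_mesh, n_samples: int = 5000):
    # Sample points from both surfaces and compare them with chamfer distance.
    pred_pts = sample_points_from_meshes(pred_mesh, n_samples)
    gt_pts = sample_points_from_meshes(gt_mesh, n_samples)
    loss, _ = chamfer_distance(pred_pts, gt_pts)  # second output compares normals
    return loss
```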


2. Reconstructing 3D from single view

2.1. Image to voxel grid (15 points)


Example 1

Input RGB image:

RGB Image


Ground truth Voxel grid:

GT Voxel Grid


Predicted Voxel grid:

Predicted Voxel Grid


Example 2

Input RGB image:

RGB Image


Ground truth Voxel grid:

GT Voxel Grid


Predicted Voxel grid:

Predicted Voxel Grid


Example 3

Input RGB image:

RGB Image


Ground truth Voxel grid:

GT Voxel Grid


Predicted Voxel grid:

Predicted Voxel Grid


2.2. Image to point cloud (15 points)

Example 1

Input RGB image:

RGB Image


Ground truth point cloud:

GT point cloud


Predicted point cloud:

Predicted point cloud


Example 2

Input RGB image:

RGB Image


Ground truth point cloud:

GT point cloud


Predicted point cloud:

Predicted point cloud


Example 3

Input RGB image:

RGB Image


Ground truth point cloud:

GT point cloud


Predicted point cloud:

Predicted point cloud


2.3. Image to mesh (15 points)

Example 1

Input RGB image:

RGB Image


Ground truth mesh:

GT mesh


Predicted mesh:

Predicted mesh


Example 2

Input RGB image:

RGB Image


Ground truth mesh:

GT mesh


Predicted mesh:

Predicted mesh


Example 3

Input RGB image:

RGB Image


Ground truth mesh:

GT mesh


Predicted mesh:

Predicted mesh


2.4. Quantitative comparisons (10 points)

The point cloud representation performs best among the three representations in terms of F1 score, which is also apparent from the visualizations above. This is expected: the point cloud representation is essentially unconstrained, so it can fit the given ground truth relatively easily. In contrast, both the voxel and mesh representations are constrained, namely by grid resolution for the voxel grid, and by vertex connectivity and the mode of initialization for the mesh. Furthermore, predicting per-voxel occupancy is intuitively a harder task than predicting deformations of a pre-initialized mesh, which makes the lower F1 score for voxels reasonable. Increasing the resolution of the voxel grid would probably boost its F1 score.

| Representation | F1@0.05 |
|----------------|---------|
| Voxel          | 85.875  |
| Point Cloud    | 94.423  |
| Mesh           | 91.832  |
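
For reference, a minimal sketch of how an F1@threshold metric of this kind can be computed from point samples of the predicted and ground-truth shapes (function name and percentage scaling are my own; scores are on the 0-100 scale used in the table above):

```python
import torch

def f1_at_threshold(pred: torch.Tensor, gt: torch.Tensor, thresh: float = 0.05) -> float:
    # pred: (N, 3) points sampled from the prediction, gt: (M, 3) from the GT shape.
    d = torch.cdist(pred, gt)  # (N, M) pairwise distances
    precision = (d.min(dim=1).values < thresh).float().mean()  # pred points near GT
    recall = (d.min(dim=0).values < thresh).float().mean()     # GT points near pred
    return (200 * precision * recall / (precision + recall + 1e-8)).item()
```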


2.5. Analyse effects of hyperparameter variations (10 points)

Modifying the number of points (n_points) in the point cloud representation

| n_points | F1@0.05 |
|----------|---------|
| 1000     | 91.253  |
| 2000     | 93.399  |
| 5000     | 94.423  |
| 10000    | 95.119  |

The baseline here is the experiment with 5000 points in the point cloud. Increasing the number of points to 10000 leads to a small increase in the F1 score; similarly, decreasing the number of points leads to a slight drop. This is expected, since the model's capacity to represent the shape scales with the number of points it outputs (see the sketch below). Visually, 10000 points produce a slightly denser, richer representation than 5000 or 2000 points, as visible in the samples below. Further, example 2 of the 1000-point model does not match the ground truth well, indicative of that model's limited capacity.
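
As a hypothetical sketch (layer names and sizes are illustrative, not the assignment's actual architecture), n_points enters the decoder as the width of its output layer, so the number of learnable output coordinates grows linearly with it:

```python
import torch
import torch.nn as nn

class PointDecoder(nn.Module):
    """Maps an image feature vector to a cloud of n_points 3D points."""

    def __init__(self, feat_dim: int = 512, n_points: int = 5000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_points * 3),  # output width scales with n_points
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # (B, feat_dim) -> (B, n_points, 3); tanh keeps points in a unit cube.
        return torch.tanh(self.mlp(feats)).view(-1, self.n_points, 3)
```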


Example 1

Input RGB image:

RGB Image


Ground truth mesh:

GT mesh


Predicted PC with 1k points:

PC 1k


Predicted PC with 2k points:

PC 2k


Predicted PC with 5k points:

PC 5k


Predicted PC with 10k points:

PC 10k


Example 2

Input RGB image:

RGB Image


Ground truth mesh:

GT mesh


Predicted PC with 1k points:

PC 1k


Predicted PC with 2k points:

PC 2k


Predicted PC with 5k points:

PC 5k


Predicted PC with 10k points:

PC 10k


Example 3

Input RGB image:

RGB Image


Ground truth mesh:

GT mesh


Predicted PC with 1k points:

PC 1k


Predicted PC with 2k points:

PC 2k


Predicted PC with 5k points:

PC 5k


Predicted PC with 10k points:

PC 10k


Modifying the weight of the smoothness loss (w_smooth) in the mesh representation

| w_smooth | F1@0.05 |
|----------|---------|
| 0        | 92.901  |
| 0.5      | 92.196  |
| 1        | 91.587  |
| 5        | 91.446  |

Changing the w_smooth hyperparameter does not have a large effect on the F1 scores. From the visualizations, we can see that low w_smooth values such as 0 leave too many sharp edges and spikes, while higher values seem to remove the larger edges/peaks/abnormalities in the mesh (although this is not entirely consistent). Further, in example 3, w_smooth=5 generates a shape that deviates significantly from the ground truth mesh. Therefore, a moderate value of w_smooth, say 0.1 or 0.5, seems to work best.
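
For context, a sketch of how w_smooth typically enters the mesh training loss, assuming PyTorch3D's Laplacian regularizer (names are illustrative):

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def total_mesh_loss(pred_mesh, gt_points, w_smooth: float = 0.5):
    # Chamfer term pulls the predicted surface toward the ground truth; the
    # Laplacian term penalizes vertices that stray from the centroid of their
    # neighbors (i.e., sharp spikes). w_smooth trades the two objectives off.
    pred_points = sample_points_from_meshes(pred_mesh, 5000)
    chamfer, _ = chamfer_distance(pred_points, gt_points)
    return chamfer + w_smooth * mesh_laplacian_smoothing(pred_mesh, method="uniform")
```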

Example 1

Input RGB image:

RGB Image


Ground truth mesh:

GT mesh


Predicted Mesh with w_smooth 0:

Mesh w_smooth 0


Predicted Mesh with w_smooth 0.5:

Mesh w_smooth 0.5


Predicted Mesh with w_smooth 1:

Mesh w_smooth 1


Predicted Mesh with w_smooth 5:

Mesh w_smooth 5


Example 2

Input RGB image:

RGB Image


Ground truth mesh:

GT mesh


Predicted Mesh with w_smooth 0:

Mesh w_smooth 0


Predicted Mesh with w_smooth 0.5:

Mesh w_smooth 0.5


Predicted Mesh with w_smooth 1:

Mesh w_smooth 1


Predicted Mesh with w_smooth 5:

Mesh w_smooth 5


Example 3

Input RGB image:

RGB Image


Ground truth mesh:

GT mesh


Predicted Mesh with w_smooth 0:

Mesh w_smooth 0


Predicted Mesh with w_smooth 0.5:

Mesh w_smooth 0.5


Predicted Mesh with w_smooth 1:

Mesh w_smooth 1


Predicted Mesh with w_smooth 5:

Mesh w_smooth 5


2.6. Interpret your model (15 points)

For a unique visualization, I pass the trained models out-of-distribution inputs and observe what they predict: black images (tensors of zeros, i.e., no chair in the input), image tensors filled with ones (white images), and rotated/inverted versions of one of the input images. A minimal sketch of how these probes can be constructed is shown below.
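
The helper below is a hypothetical sketch (the name and exact transforms are my own), assuming (B, C, H, W) image tensors in [0, 1]:

```python
import torch

def make_probe_inputs(image: torch.Tensor) -> dict:
    # Build the out-of-distribution probes used in this section from a
    # normal input image of shape (B, C, H, W).
    return {
        "black": torch.zeros_like(image),                 # all-zero tensor: no chair
        "white": torch.ones_like(image),                  # all-one tensor
        "rotated": torch.rot90(image, k=1, dims=(2, 3)),  # 90-degree rotation
        "inverted": torch.flip(image, dims=(2,)),         # upside-down (vertical flip)
    }
```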

Black images

For the given input:

Black image

we get the following outputs:

White images

For the given input:

White image

we get the following outputs:

Rotated images

For rotations of the form shown in this figure:

Rotated image

The outputs are as follows:

Inverted images

For inversions of the form shown in this figure:

Inverted image

The outputs are as follows:

The models predicting point clouds and meshes seem to fall back on a learned average chair shape, which is what they predict when fed black images. The voxel model instead outputs a chair (or rather, part of one) for a black image. On feeding in a white image, the voxel model predicts gibberish (or a part of a chair that is hard to recognize), whereas the other two models predict chairs that differ from their black-image counterparts.

Ideally, the models should be sensitive to inversion, since they are trained only on upright chair images. In practice, however, they do not predict gibberish (except for the voxel model) and instead output reasonable chairs. A similar pattern is seen with rotated input images. Surprisingly, the voxel model predicts a reasonable chair when fed a rotated image, suggesting that its output in the black-image case was not mere noise either.

Another interesting observation is that for almost all images (in this section and the previous sections), the point cloud model's predictions are missing one of the chair's legs, indicating that the model is somewhat biased. Better training losses and hyperparameter tuning should help alleviate this problem.