Late days used: 4
Optimized voxel grid:
Ground truth voxel grid:
Optimized point cloud:
Ground truth point cloud:
Optimized mesh:
Ground truth mesh:
Input RGB image:
Ground truth voxel grid:
Predicted voxel grid:
Input RGB image:
Ground truth voxel grid:
Predicted voxel grid:
Input RGB image:
Ground truth voxel grid:
Predicted voxel grid:
Input RGB image:
Ground truth point cloud:
Predicted point cloud:
Input RGB image:
Ground truth point cloud:
Predicted point cloud:
Input RGB image:
Ground truth point cloud:
Predicted point cloud:
Input RGB image:
Ground truth mesh:
Predicted mesh:
Input RGB image:
Ground truth mesh:
Predicted mesh:
Input RGB image:
Ground truth mesh:
Predicted mesh:
The point cloud representation performs the best among the three representations in terms of F1 score (also apparent from the visualizations above). This is expected, as the point cloud representation is essentially unconstrained and can fit the given ground truth relatively easily. In contrast, both the voxel and mesh representations are constrained: the voxel grid by its resolution, and the mesh by the connectivity between points and the mode of initialization. Furthermore, predicting per-voxel occupancy intuitively seems to be a harder task than predicting deformations of a pre-initialized mesh, which makes the lower F1 score for voxels reasonable. Increasing the resolution of the voxel grid would likely boost its F1 score.
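The per-voxel occupancy prediction mentioned above is typically trained with binary cross-entropy on the occupancy logits. A minimal sketch (the grid resolution and tensor names here are my own assumptions, not the actual assignment code):

```python
import torch
import torch.nn.functional as F

# predicted occupancy logits and ground-truth binary grid, at an assumed 32^3 resolution
pred_logits = torch.randn(1, 32, 32, 32)
gt_occupancy = (torch.rand(1, 32, 32, 32) > 0.5).float()

# binary cross-entropy averaged over every voxel in the grid
loss = F.binary_cross_entropy_with_logits(pred_logits, gt_occupancy)
print(loss.item() > 0)  # True
```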
| Representation | F1@0.05 |
| --- | --- |
| Voxel | 85.875 |
| Point Cloud | 94.423 |
| Mesh | 91.832 |
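For reference, the F1@0.05 metric reported above can be computed roughly as follows (a minimal NumPy sketch; the function name is my own, and the actual evaluation code uses batched nearest-neighbour queries):

```python
import numpy as np

def f1_at_thresh(pred, gt, thresh=0.05):
    """F1 between two point sets pred (N, 3) and gt (M, 3).

    A predicted point counts as correct if its nearest ground-truth
    point lies within `thresh` (precision); symmetrically for recall.
    Scores are reported as percentages.
    """
    # pairwise Euclidean distances, shape (N, M)
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = (dists.min(axis=1) < thresh).mean()  # pred -> gt
    recall = (dists.min(axis=0) < thresh).mean()     # gt -> pred
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)

pts = np.random.rand(200, 3)
print(f1_at_thresh(pts, pts))  # identical clouds score 100.0
```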

| n_points | F1@0.05 |
| --- | --- |
| 1000 | 91.253 |
| 2000 | 93.399 |
| 5000 | 94.423 |
| 10000 | 95.119 |
The baseline here is the experiment with 5000 points in the point cloud. Increasing the number of points to 10000 yields a small increase in the F1 score; similarly, decreasing the number of points leads to a slight drop. This is expected, since the representational capacity of the output grows with the number of points. Visually, the 10000-point predictions are slightly richer than the 5000- or 2000-point ones, as visible in the samples below. Further, in example 2 the 1000-point prediction does not match the ground truth well, indicative of that model's limited capacity.
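The only architectural change across these runs is the size of the decoder's output layer. A minimal sketch of such a head follows; the feature dimension, layer widths, and class name are my own assumptions, not the actual assignment architecture:

```python
import torch
import torch.nn as nn

class PointDecoder(nn.Module):
    """Maps an image feature vector to an (n_points, 3) point cloud."""

    def __init__(self, feat_dim=512, n_points=5000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            # output layer grows linearly with n_points
            nn.Linear(1024, n_points * 3),
            nn.Tanh(),  # keep predicted coordinates in [-1, 1]
        )

    def forward(self, feat):
        return self.mlp(feat).view(-1, self.n_points, 3)

feat = torch.randn(2, 512)
for n in (1000, 2000, 5000, 10000):
    out = PointDecoder(n_points=n)(feat)
    print(out.shape)  # (2, n, 3)
```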
Input RGB image:
Ground truth mesh:
Predicted PC with 1k points:
Predicted PC with 2k points:
Predicted PC with 5k points:
Predicted PC with 10k points:
Input RGB image:
Ground truth mesh:
Predicted PC with 1k points:
Predicted PC with 2k points:
Predicted PC with 5k points:
Predicted PC with 10k points:
Input RGB image:
Ground truth mesh:
Predicted PC with 1k points:
Predicted PC with 2k points:
Predicted PC with 5k points:
Predicted PC with 10k points:
| w_smooth | F1@0.05 |
| --- | --- |
| 0 | 92.901 |
| 0.5 | 92.196 |
| 1 | 91.587 |
| 5 | 91.446 |
Changing the w_smooth hyperparameter does not have a large effect on the F1 score. From the visualizations, we can see that low w_smooth values such as 0 produce too many sharp edges/abnormalities, while higher values seem to remove the larger edges/peaks/abnormalities in the mesh (although this is not entirely consistent). Further, in example 3, w_smooth=5 generates a shape that deviates significantly from the ground truth mesh. Therefore, a moderate value of w_smooth, say 0.1 or 0.5, seems to work best.
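The objective weighted by w_smooth combines a data term with a smoothness regularizer. Below is a simplified, unbatched stand-in for PyTorch3D's `mesh_laplacian_smoothing` using a uniform Laplacian; the chamfer term is passed in as a precomputed scalar, and all names are my own:

```python
import torch

def uniform_laplacian_loss(verts, edges):
    """Mean distance of each vertex from the centroid of its neighbours."""
    nbr_sum = torch.zeros_like(verts)
    deg = torch.zeros(verts.shape[0])
    for i, j in edges:
        nbr_sum[i] += verts[j]; deg[i] += 1
        nbr_sum[j] += verts[i]; deg[j] += 1
    lap = verts - nbr_sum / deg.clamp(min=1).unsqueeze(1)
    return lap.norm(dim=1).mean()

def mesh_loss(chamfer_term, verts, edges, w_smooth):
    # total objective: data term + weighted smoothness regularizer
    return chamfer_term + w_smooth * uniform_laplacian_loss(verts, edges)

# a jagged strip is penalized more than a flat one
edges = [(0, 1), (1, 2)]
flat = torch.tensor([[0., 0., 0.], [1., 0., 0.], [2., 0., 0.]])
jagged = torch.tensor([[0., 0., 0.], [1., 1., 0.], [2., 0., 0.]])
print(uniform_laplacian_loss(flat, edges) < uniform_laplacian_loss(jagged, edges))  # tensor(True)
```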
Input RGB image:
Ground truth mesh:
Predicted Mesh with w_smooth 0:
Predicted Mesh with w_smooth 0.5:
Predicted Mesh with w_smooth 1:
Predicted Mesh with w_smooth 5:
Input RGB image:
Ground truth mesh:
Predicted Mesh with w_smooth 0:
Predicted Mesh with w_smooth 0.5:
Predicted Mesh with w_smooth 1:
Predicted Mesh with w_smooth 5:
Input RGB image:
Ground truth mesh:
Predicted Mesh with w_smooth 0:
Predicted Mesh with w_smooth 0.5:
Predicted Mesh with w_smooth 1:
Predicted Mesh with w_smooth 5:
For a unique visualization, I feed the models atypical inputs: black images (i.e., no chair in the input image), image tensors filled entirely with ones, and rotated/inverted versions of one of the input images, and observe what each model outputs.
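Concretely, the probe inputs can be constructed as follows (the input resolution and value range are my assumptions about the dataloader, not taken from the assignment code):

```python
import torch

H = W = 137  # assumed input resolution; RGB values in [0, 1]
img = torch.rand(1, 3, H, W)  # stand-in for one real dataset image

black = torch.zeros(1, 3, H, W)               # no chair at all
white = torch.ones(1, 3, H, W)                # tensor filled with ones
rotated = torch.rot90(img, k=1, dims=(2, 3))  # 90-degree in-plane rotation
inverted = torch.flip(img, dims=(2,))         # upside-down (vertical flip)

for probe in (black, white, rotated, inverted):
    print(probe.shape)  # each is a valid (1, 3, H, W) input to the models
```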
For the given input:
we get the following outputs:
For model predicting voxels:
For model predicting points:
For model predicting meshes:
For the given input:
we get the following outputs:
For model predicting voxels:
For model predicting points:
For model predicting meshes:
For rotations of the form shown in this figure:
The outputs are as follows:
For model predicting voxels:
For model predicting points:
For model predicting meshes:
For inversions of the form shown in this figure:
The outputs are as follows:
For model predicting voxels:
For model predicting points:
For model predicting meshes:
The models predicting point clouds and meshes seem to have learned an intermediate, prototype-like representation of chairs, which is what they predict when fed black images. The voxel model, given a black image, outputs what looks like a chair (or rather part of one). On a white image, the voxel model predicts gibberish (or a part of a chair that is hard to recognize), whereas the other two models predict chairs that differ from their black-image counterparts. Ideally, the models should be sensitive to inversion, since they are trained only on upright chair images; yet none of them, except the voxel model, degenerates into gibberish, and they instead output reasonable chairs. The same holds for rotated input images. Surprisingly, the voxel model predicts a reasonable chair when fed a rotated image (which suggests its output on the upright black image was not mere gibberish either). Another interesting observation is that, for almost all images in this and the previous sections, the point cloud model's predictions are missing one of the chair's legs, indicating a mild bias in the model. Better training losses and hyperparameter tuning should help alleviate this problem.