16-889 Assignment 2: Single View to 3D
Name: Neha Boloor Andrew ID: nboloor
Goals: In this assignment, you will explore loss functions and decoder functions for regressing to voxel, point cloud, and mesh representations from single-view RGB input.
1. Exploring loss functions
1.1. Fitting a voxel grid (5 points)
Run:
python -W ignore main.py --question 1 --mode train --type 'vox'
[Renders: optimized voxel grid (left) vs. ground truth voxel grid (right)]
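For reference, the voxel grid is fit with a per-cell binary cross-entropy between the predicted occupancy grid and the ground truth. Below is a minimal sketch, assuming the decoder outputs raw logits of shape (B, 32, 32, 32); the function name `voxel_loss` and this signature are illustrative, not necessarily the exact code in losses.py.

```python
import torch.nn.functional as F

def voxel_loss(voxel_pred, voxel_gt):
    # voxel_pred: (B, 32, 32, 32) raw logits; voxel_gt: (B, 32, 32, 32) occupancy in {0, 1}.
    # Binary cross-entropy averaged over every cell of the grid.
    return F.binary_cross_entropy_with_logits(voxel_pred, voxel_gt.float())
```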
1.2. Fitting a point cloud (10 points)
Run:
python -W ignore main.py --question 1 --mode train --type 'point'
[Renders: optimized point cloud (left) vs. ground truth point cloud (right)]
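The point cloud is fit with the chamfer distance. A minimal sketch using pytorch3d's `knn_points` is shown below (illustrative; the actual implementation in losses.py may differ):

```python
from pytorch3d.ops import knn_points

def chamfer_loss(points_src, points_tgt):
    # points_src: (B, N, 3), points_tgt: (B, M, 3).
    # Squared distance from each point to its nearest neighbour in the other cloud, in both directions.
    d_src = knn_points(points_src, points_tgt, K=1).dists  # (B, N, 1)
    d_tgt = knn_points(points_tgt, points_src, K=1).dists  # (B, M, 1)
    return d_src.mean() + d_tgt.mean()
```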
1.3. Fitting a mesh (5 points)
Run:
python -W ignore main.py --question 1 --mode train --type 'mesh'
[Renders: optimized mesh (left) vs. ground truth mesh (right)]
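The mesh is fit by sampling points from the predicted and ground-truth meshes, comparing them with the chamfer distance, and regularising with Laplacian smoothing weighted by w_smooth. A minimal sketch follows; the function name and defaults are illustrative assumptions.

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_fitting_loss(mesh_pred, mesh_gt, w_smooth=0.1, n_points=5000):
    # Sample point clouds from both meshes, then combine chamfer distance
    # with a uniform Laplacian smoothness regulariser on the predicted mesh.
    pts_pred = sample_points_from_meshes(mesh_pred, n_points)
    pts_gt = sample_points_from_meshes(mesh_gt, n_points)
    loss_chamfer, _ = chamfer_distance(pts_pred, pts_gt)
    loss_smooth = mesh_laplacian_smoothing(mesh_pred, method="uniform")
    return loss_chamfer + w_smooth * loss_smooth
```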
2. Reconstructing 3D from single view
2.1. Image to voxel grid (15 points)
For training
Run:
python -W ignore main.py --question 2 --mode train --type 'vox' --save_freq 50 --max_iter 10000 --batch_size 8 --lr 4e-5 --w_smooth 0.15
For evaluating and visualising:
python -W ignore main.py --question 2 --mode eval --type 'vox' --vis 100 --load_checkpoint
[Renders: input RGB image | predicted voxel grid | ground truth mesh]
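For context, the single-view-to-voxel model pairs an image encoder with a decoder that maps the image feature to a 32x32x32 occupancy grid. Below is a hypothetical, minimal fully connected decoder sketch; the layer sizes and names are my own assumptions, not necessarily the architecture in model.py.

```python
import torch.nn as nn

class VoxelDecoder(nn.Module):
    # Maps a (B, 512) image feature (e.g. from a ResNet-18 encoder) to (B, 32, 32, 32) occupancy logits.
    def __init__(self, feat_dim=512):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 32 * 32 * 32),
        )

    def forward(self, feat):
        return self.fc(feat).view(-1, 32, 32, 32)
```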
2.2. Image to point cloud (15 points)
For training
Run:
python -W ignore main.py --question 2 --mode train --type 'point' --save_freq 50 --max_iter 10000 --batch_size 8 --lr 4e-5 --w_smooth 0.15
For evaluating and visualising:
python -W ignore main.py --question 2 --mode eval --type 'point' --vis 100 --load_checkpoint
[Renders: input RGB image | predicted point cloud | ground truth mesh]
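A hypothetical sketch of the point cloud decoder: a small MLP that regresses n_points 3D coordinates from the image feature. Names, layer sizes, and the Tanh bound are illustrative assumptions; the actual model.py code may differ.

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    # Maps a (B, 512) image feature to a (B, n_points, 3) point cloud.
    def __init__(self, feat_dim=512, n_points=5000):
        super().__init__()
        self.n_points = n_points
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_points * 3),
            nn.Tanh(),  # keep predicted coordinates in a bounded range
        )

    def forward(self, feat):
        return self.fc(feat).view(-1, self.n_points, 3)
```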
2.3. Image to mesh (15 points)
For training
Run:
python -W ignore main.py --question 2 --mode train --type 'mesh' --save_freq 50 --max_iter 10000 --batch_size 8 --lr 4e-5 --w_smooth 0.15
For evaluating and visualising:
python -W ignore main.py --question 2 --mode eval --type 'mesh' --vis 100 --load_checkpoint
[Renders: input RGB image | predicted mesh | ground truth mesh]
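A hypothetical sketch of the mesh decoder: it predicts per-vertex offsets that deform a fixed ico-sphere template. The names, sphere level, and Tanh bound are illustrative assumptions rather than the exact model.py code.

```python
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoder(nn.Module):
    # Deforms an ico-sphere template by predicting per-vertex offsets from a (B, 512) image feature.
    def __init__(self, feat_dim=512, level=4, device="cpu"):
        super().__init__()
        self.template = ico_sphere(level, device)
        n_verts = self.template.verts_packed().shape[0]
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_verts * 3),
            nn.Tanh(),
        )

    def forward(self, feat):
        offsets = self.fc(feat).view(-1, 3)         # (B * n_verts, 3) packed offsets
        mesh = self.template.extend(feat.shape[0])  # one template per batch element
        return mesh.offset_verts(offsets)
```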
2.4. Quantitative comparisons (10 points)
| Metric | Voxel Grid | Point Cloud | Mesh |
| --- | --- | --- | --- |
| Avg F1@0.05 | 81.156 | 95.516 | 93.628 |
- The metric used here is F1@0.05: the harmonic mean of precision and recall, where a point counts as a match if it lies within a distance of 0.05 of its nearest neighbour in the other set (a sketch of the computation follows this list). Using this metric, the point cloud performs best, followed by the mesh and then the voxel grid.
- I think it is expected for the voxel grid to perform the worst of the three: at a resolution of 32x32x32 this representation has the least expressive power and cannot capture the intricate variations, holes, etc. in the chairs, because here we predict per-cell occupancies, whereas for the mesh and point cloud predictions we deform a starting shape.
- Meshes perform better than voxels, but they too struggle to model holes and fine details. The details can at least be refined to a certain extent by increasing the number of vertices and tuning the smoothness weight. Point clouds are the best at modelling these details and holes; their drawback, again, is the lack of connectivity information.
- Both the point cloud and the mesh perform better mainly because they deform a starting shape and learn offsets using the chamfer distance, which only cares about the closest point and not the actual correspondence of points; this could also be boosting the scores for meshes and point clouds.
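For reference, here is a minimal sketch of how an F1@0.05 of this kind can be computed from sampled point clouds (illustrative; the course's evaluation code may differ in details such as scaling or per-example averaging):

```python
from pytorch3d.ops import knn_points

def f1_at_threshold(points_pred, points_gt, threshold=0.05):
    # points_pred: (B, N, 3), points_gt: (B, M, 3).
    # knn_points returns squared distances, so take the square root before thresholding.
    d_pred = knn_points(points_pred, points_gt, K=1).dists.sqrt()
    d_gt = knn_points(points_gt, points_pred, K=1).dists.sqrt()
    precision = 100.0 * (d_pred < threshold).float().mean()  # predicted points near the ground truth
    recall = 100.0 * (d_gt < threshold).float().mean()       # ground-truth points covered by the prediction
    return 2.0 * precision * recall / (precision + recall + 1e-8)
```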
2.5. Analyse effects of hyperparameter variations (10 points)
I experimented with a wide range of hyperparameters, such as the learning rate, n_points, batch_size, w_smooth, and the maximum number of iterations. Two of them are discussed here:
- w_smooth: value = 0.1 (default), value = 0.15, value = 2
The higher the smoothness weight, the smoother the rendered mesh, which makes almost all predictions look more or less the same. A low value, on the other hand, produces more variation across predictions but rather pointy meshes.
Why is this? Mesh predictions use both the chamfer distance and Laplacian smoothing. A lower w_smooth gives a higher weight to the chamfer loss, so matching the closest vertices matters more than keeping the surface smooth, hence meshes with sharp, pointy protrusions. A value of 0.15 worked reasonably well for me when used with good values for the other hyperparameters.
[Renders: predicted mesh with smoothness value 0.1 vs. smoothness value 2]
- Batch size for training: value = 2 (default) and value = 8
The model trained with batch size = 8 trained better, giving a higher F1 score and, more importantly, better visualisations.
Why is this? I think a larger batch size makes each optimization step more effective, leading to faster convergence of the model parameters and hence better performance for a given number of iterations. A value of 8 worked best for me when used with good values for the other hyperparameters.
[Renders: predicted mesh with batch size 2 vs. batch size 8]
2.6. Interpret your model (15 points)
I think visualising the ground truth and the prediction together in a common frame, along with the input RGB image, gives a better idea of what the F1 score is really telling us, and of what the model is actually predicting, than visualising them individually. Here, I render the model's point cloud prediction in red and the ground truth in green (this can be extended to the other two representations as well), together with a combined overlap point cloud that shows how well the prediction matches.
I show two examples output by the model I trained: one with the best F1@0.05 score and one with the worst.
The superimposed overlap visualisation gives a better sense of how close the predictions actually are to the ground truth and of which parts of the chair the model predicted correctly, and thereby of which chair patterns are "easy" for the model to predict accurately and which are "hard".
[Renders: RGB image | predicted point cloud (red) | ground truth point cloud (green) | combined overlap point cloud, for the best example (F1@0.05 = 99.950) and the worst example (F1@0.05 = 21.684)]
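A minimal sketch of how the combined overlap cloud can be assembled with pytorch3d is shown below; the helper name and the fixed red/green colours are my own choices, and the result can then be passed to the usual point cloud renderer.

```python
import torch
from pytorch3d.structures import Pointclouds

def combined_overlap(points_pred, points_gt):
    # points_pred: (1, N, 3) prediction, points_gt: (1, M, 3) ground truth, on the same device.
    device = points_pred.device
    red = torch.tensor([1.0, 0.0, 0.0], device=device).expand(points_pred.shape[1], 3)
    green = torch.tensor([0.0, 1.0, 0.0], device=device).expand(points_gt.shape[1], 3)
    points = torch.cat([points_pred, points_gt], dim=1)   # (1, N + M, 3)
    colors = torch.cat([red, green], dim=0).unsqueeze(0)  # (1, N + M, 3) per-point RGB features
    return Pointclouds(points=points, features=colors)
```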
We could also use this visualisation to see how the points sampled from the initial mesh gradually deform into the final prediction. The early stages of training (red points) show the initial shape, which is then deformed to move closer to the ground truth (green). Plotting this at various steps of training (using different checkpoints) would let us visually appreciate the model's learning process in terms of its output.
Run (once you have a checkpoint and want to visualise):
python -W ignore main.py --question 2 --mode eval --max_iter 100 --type 'point' --visual True --load_checkpoint