16-889 Assignment 2: Single View to 3D

Author: Zhe Huang (zhehuang@andrew.cmu.edu)

1.1. Fitting a voxel grid

1.2. Fitting a point cloud

1.3. Fitting a mesh

2.1. Image to voxel grid

2.2. Image to point cloud

2.3. Image to mesh

2.4. Quantitative comparisons

Here are the quantitative results for all three methods, generated using the evaluation script. All models are trained for 32,000 steps with the batch size set to 16. The image-to-mesh model performs best, achieving the highest average F1 score. This may be because it also reaches the lowest final training loss (0.001) among the three methods. Since the image-to-voxel-grid and image-to-point-cloud models have similar final training losses, their F1 scores are likewise similar.

|                  | Voxel  | Point Cloud | Mesh   |
|------------------|--------|-------------|--------|
| Batch size       | 16     | 16          | 16     |
| # training steps | 32,000 | 32,000      | 32,000 |
| Final loss value | 0.002  | 0.002       | 0.001  |
| Avg. F1\@0.05    | 90.921 | 90.353      | 93.984 |
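As a rough sketch of how an F1 score at a distance threshold (such as F1\@0.05 in the table) can be computed between a predicted and a ground-truth point cloud. The function name and the percent scaling are assumptions here; the course's actual evaluation script may differ in details:

```python
import numpy as np

def f1_at_threshold(pred, gt, threshold=0.05):
    """F1 score between two point clouds at a distance threshold.

    pred: (N, 3) predicted points; gt: (M, 3) ground-truth points.
    Hypothetical stand-in for the assignment's evaluation script;
    reported in percent to match the table above.
    """
    # Pairwise Euclidean distances between predicted and GT points.
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    # Precision: fraction of predicted points near some GT point.
    precision = (dists.min(axis=1) < threshold).mean()
    # Recall: fraction of GT points near some predicted point.
    recall = (dists.min(axis=0) < threshold).mean()
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)
```

A perfect prediction (identical point clouds) scores 100, while clouds with no points within the threshold of each other score 0.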

2.5. Analyze effects of hyperparameter variations

Here we reduce the ico_sphere level from 4 to 3, which reduces the number of vertices in the initial mesh from 2,562 to 642. Keeping all other hyperparameters the same, we train this relatively "low-resolution" model for 32,000 steps with a batch size of 16.
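For reference, the vertex and face counts of an icosphere at a given subdivision level follow a closed form. A minimal sketch (the function name is hypothetical, but the counts match PyTorch3D's ico_sphere levels used here):

```python
def ico_sphere_counts(level):
    """Vertex/face counts of an icosphere after `level` subdivisions.

    Each subdivision splits every triangular face into four, so
    F(L) = 20 * 4**L. Euler's formula (V - E + F = 2, with E = 3F/2
    for a closed triangle mesh) then gives V = 10 * 4**L + 2.
    """
    faces = 20 * 4 ** level
    verts = 10 * 4 ** level + 2
    return verts, faces

# The two resolutions compared in this section:
print(ico_sphere_counts(4))  # (2562, 5120)
print(ico_sphere_counts(3))  # (642, 1280)
```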

The example results are shown below. Although pred_mesh_2562s and pred_mesh_642s generally look alike, they differ subtly. pred_mesh_2562s are more detailed, and their surfaces are smoother overall. The chair legs in pred_mesh_2562s are less sharp, resembling the ground truth better than those of pred_mesh_642s. The backs of the pred_mesh_2562 chairs also match the ground truth better in overall style and shape. Thus, we conclude that starting from an initial mesh with more subdivisions yields better results.

2.6. Interpret your model

One interesting discovery is that for the image-to-point-cloud model, most of the 5,000 predicted points appear to be "squeezed" onto the seat of each chair. This results in the appearance where the seat is densely populated while the other parts of the chair are only sparsely represented by the predicted point cloud. One way to interpret this artifact is that crowding points at the seat lowers the overall training loss, making it preferable to the model, whose objective is to minimize that loss.

To demonstrate this, we sample a point cloud of 5,000 points from the ground-truth mesh, generating the ground-truth point cloud the same way as during training. We then examine the mutual Chamfer distance between the ground-truth and the predicted point clouds, coloring points from both by their per-point loss value. Specifically, points with small losses are colored red, whereas points with large losses are colored blue; the smaller the loss an individual point contributes, the redder it appears in the figure.
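A minimal sketch of the coloring procedure described above, assuming a simple linear red-to-blue interpolation over normalized per-point nearest-neighbor distances (the function name and the normalization scheme are assumptions, not the writeup's exact code):

```python
import numpy as np

def per_point_loss_colors(points, other):
    """Color each point in `points` by its nearest-neighbor distance
    to `other` (one side of the mutual Chamfer distance).

    Returns (N, 3) RGB values in [0, 1]: low loss -> red,
    high loss -> blue.
    """
    # Per-point loss: distance to the nearest point in the other cloud.
    dists = np.linalg.norm(points[:, None, :] - other[None, :, :], axis=-1)
    loss = dists.min(axis=1)
    # Normalize losses to [0, 1] for coloring.
    t = (loss - loss.min()) / (loss.max() - loss.min() + 1e-8)
    # Interpolate red (low loss) -> blue (high loss).
    return np.stack([1.0 - t, np.zeros_like(t), t], axis=-1)
```

Applying this to both the predicted and the ground-truth clouds gives the red/blue renderings described here.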

As illustrated in the group of GIFs below, many points in the seating area generate low losses, appearing red. This could explain why, after training, the seating area is the densest part of the predicted point cloud: it carries the lowest energy. Hence, the behavior of the image-to-point-cloud model can be interpreted from the perspective of training-loss optimization.