CMU 2022 Spring 16-889

Assignment 2: Single View to 3D


1. Exploring loss functions

1.1. Fitting a voxel grid

Left is the optimized voxel grid, right is ground truth
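
The voxel fitting is typically driven by a binary cross-entropy loss between the predicted occupancy and the ground-truth grid; below is a minimal sketch, assuming the prediction holds raw logits and the target is a binary 0/1 grid (tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_logits, gt_occupancy):
    # Binary cross-entropy between predicted occupancy logits and the
    # ground-truth 0/1 voxel grid, averaged over all voxels.
    return F.binary_cross_entropy_with_logits(pred_logits, gt_occupancy.float())

# Example: fit a random 32^3 grid of logits to a ground-truth grid.
pred = torch.randn(1, 32, 32, 32, requires_grad=True)
gt = torch.rand(1, 32, 32, 32) > 0.5
loss = voxel_loss(pred, gt)
loss.backward()
```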

1.2 Fitting a point cloud

Left is the optimized point cloud, right is ground truth
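
The point cloud is fit by minimizing the Chamfer distance to the target points. Here is a minimal self-contained sketch (a library implementation such as PyTorch3D's chamfer_distance could be used instead):

```python
import torch

def chamfer_loss(src, tgt):
    # Symmetric Chamfer distance between point clouds of shape (B, N, 3) and
    # (B, M, 3): for each point, take the squared distance to its nearest
    # neighbour in the other cloud, then sum the averages of both directions.
    d = torch.cdist(src, tgt) ** 2               # (B, N, M) pairwise squared distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

# Example: optimize a random point cloud towards a target cloud.
src = torch.randn(1, 1000, 3, requires_grad=True)
tgt = torch.randn(1, 1000, 3)
loss = chamfer_loss(src, tgt)
loss.backward()
```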

1.3 Fitting a mesh

Left is the optimized mesh, right is ground truth
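
The mesh fitting combines a Chamfer loss on points sampled from the deformed mesh with a smoothness regularizer. Below is a sketch assuming PyTorch3D's utilities; the smoothness weight w_smooth is an illustrative choice:

```python
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_fitting_loss(pred_mesh, gt_points, w_smooth=0.1):
    # Chamfer distance between points sampled on the predicted mesh and the
    # target points, plus a Laplacian smoothness regularizer on the mesh.
    pred_points = sample_points_from_meshes(pred_mesh, num_samples=5000)
    chamfer, _ = chamfer_distance(pred_points, gt_points)
    smooth = mesh_laplacian_smoothing(pred_mesh)
    return chamfer + w_smooth * smooth

# Example: deform the vertices of an ico-sphere towards a target point cloud.
src_mesh = ico_sphere(level=4)
deform = torch.zeros(src_mesh.verts_packed().shape, requires_grad=True)
gt_points = torch.randn(1, 5000, 3)
loss = mesh_fitting_loss(src_mesh.offset_verts(deform), gt_points, w_smooth=0.1)
loss.backward()
```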

2. Reconstructing 3D from single view

2.1. Image to voxel grid

Left is the input RGB image, middle is the render of the predicted 3D voxel grid, right is the render of the ground truth mesh.

2.2. Image to point cloud

Left is the input RGB image, middle is the render of the predicted 3D point cloud, right is the render of the ground truth mesh.

2.3. Image to mesh

Left is the input RGB image, middle is the render of the predicted 3D mesh, right is the render of the ground truth mesh.

2.4 Quantitative comparisons

As the results show, the point cloud model has the best F1 score, the mesh model is second, and the voxel model is the worst. This is the opposite of the qualitative results, where the voxel predictions have the most compelling visual quality. The point cloud model probably has the best F1 score because optimizing the Chamfer loss is directly aligned with optimizing the point-to-point distances on which the F1 score is based. The mesh model also has a relatively high F1 score for the same reason, but it is slightly worse than the point cloud because it additionally optimizes a smoothness loss, which trades off against the Chamfer loss. The voxel model has the worst F1 score because it optimizes a volumetric representation instead of 3D points, which is only indirectly related to the way the F1 score is calculated.
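
For context, here is a minimal sketch of how the F1 score is computed from sampled point clouds, assuming a fixed distance threshold (the value below is illustrative). Since the Chamfer loss minimizes exactly these nearest-neighbour distances, the point cloud model is optimized almost directly for this metric:

```python
import torch

def f1_score(pred_points, gt_points, threshold=0.05):
    # Precision: fraction of predicted points within `threshold` of some GT point.
    # Recall: fraction of GT points within `threshold` of some predicted point.
    d = torch.cdist(pred_points, gt_points)                  # (N, M) pairwise distances
    precision = (d.min(dim=1).values < threshold).float().mean()
    recall = (d.min(dim=0).values < threshold).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```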

2.5 Analyze effects of hyperparameter variations

In this part, I focus on analyzing the effect of the initial mesh for the single view to 3D mesh model. More specifically, I vary the subdivision level of the initial unit sphere to find out which mesh detail level fits the data best. The intuition is that using more faces to represent an object allows a more faithful reconstruction, but also makes the network harder to train. For this experiment, I change only the subdivision level while fixing all other parameters, and I train the networks until convergence. Here is the comparison:

| Subdivision Level | # of vertices | # of faces | Average F1 score |
|---|---|---|---|
| 2 | 162 | 320 | 93.207 |
| 3 | 642 | 1280 | 94.425 |
| 4 | 2562 | 5120 | 95.113 |
| 5 | 10242 | 20480 | 95.771 |
| 6 | 40962 | 81920 | 95.292 |

We can see that fewer subdivision levels lead to a slightly worse F1 score, and more subdivision levels generally result in a better F1 score. However, once the subdivision level reaches 4 or above, the F1 score is essentially the same while taking more iterations to converge. I also provide the qualitative results down below. From left to right: the input image, the ground truth mesh, and the predicted meshes with subdivision levels from 2 to 6. The visual quality is not great when the subdivision level increases, even though the F1 score is high. This is probably because there is less emphasis on the smoothness of the faces during training.
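
For reference, the initial spheres at each subdivision level can be generated with PyTorch3D's ico_sphere utility, whose vertex and face counts match the table above; a minimal sketch:

```python
from pytorch3d.utils import ico_sphere

# Icospheres at increasing subdivision levels; each level roughly quadruples
# the face count (level 2: 162 verts / 320 faces, level 4: 2562 / 5120).
for level in range(2, 7):
    mesh = ico_sphere(level)
    print(level, mesh.verts_packed().shape[0], mesh.faces_packed().shape[0])
```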

Besides varying the subdivision level of the unit sphere, I also test using a torus as the initial mesh to deform. The quantitative and qualitative comparisons are shown below. Here, I compare against the unit sphere with subdivision level 4, because it has a similar number of vertices and faces. The difference is subtle: the F1 score is slightly worse when using the torus. Another difference is that using the torus as the initial mesh seems to reproduce chair legs with better quality, probably because a chair leg is easier to deform from a torus.

| Initial Mesh Type | # of vertices | # of faces | Average F1 score |
|---|---|---|---|
| Unit sphere | 2562 | 5120 | 95.113 |
| Torus | 2048 | 4096 | 94.425 |
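
A sketch of how such a torus initial mesh can be constructed, assuming PyTorch3D's torus utility (with sides=32 and rings=64 to match the vertex and face counts in the table):

```python
from pytorch3d.utils import torus

# A torus with 32 x 64 = 2048 vertices and 4096 faces, comparable in
# resolution to the level-4 ico-sphere (2562 vertices / 5120 faces).
src_mesh = torus(r=0.5, R=1.0, sides=32, rings=64)
print(src_mesh.verts_packed().shape[0], src_mesh.faces_packed().shape[0])
```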

2.6 Interpret your model

To interpret what my model has predicted, I use two methods to visualize my result.

The first one is visualizing the point clouds on which the F1 score is calculated. For the voxel prediction method, this means sampling points from the mesh inferred from the predicted voxels. In addition, I color the points to indicate which points are false positives and which are false negatives (a sketch of this computation appears after the visualizations below).

Here are the results. From right to left, the panels are:

  1. Input single view image
  2. Predicted points
  3. Predicted points with false positive
  4. Ground truth points
  5. Ground truth points with false negative

Single image to voxel

Single image to point cloud

Single image to mesh
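
The false positive / false negative coloring above is computed roughly as follows; a minimal sketch, assuming the same distance threshold used for the F1 score:

```python
import torch

def fp_fn_masks(pred_points, gt_points, threshold=0.05):
    # A predicted point is a false positive if no GT point lies within
    # `threshold`; a GT point is a false negative if no predicted point does.
    d = torch.cdist(pred_points, gt_points)               # (N, M) pairwise distances
    false_positive = d.min(dim=1).values >= threshold     # mask over predicted points
    false_negative = d.min(dim=0).values >= threshold     # mask over GT points
    return false_positive, false_negative

# The masks are then used to recolor points (e.g. one color for FP, another
# for FN) before rendering the point clouds.
```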

From the above visualization, we can clearly see why the voxel prediction method has the worst F1 score. It tends to produce more false negative points than the other two methods, in addition to the false positives that are visible in all methods. I feel this is a somewhat unfair evaluation for the voxel prediction method, since it does not directly optimize for point positions, which would generally lead to a better F1 score. A fairer way to evaluate the voxel method, I think, is to compare its prediction against the ground truth voxels, so I visualize the predicted and ground truth voxels down below.

Similar to what I do for the point clouds, I color the voxel cubes to indicate whether each cube is a false positive or a false negative.
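
A sketch of the analogous computation on voxel grids, assuming both grids are boolean occupancy tensors of the same resolution:

```python
import torch

def voxel_fp_fn(pred_occ, gt_occ):
    # False positive: predicted occupied but empty in the ground truth.
    # False negative: empty in the prediction but occupied in the ground truth.
    false_positive = pred_occ & ~gt_occ
    false_negative = ~pred_occ & gt_occ
    return false_positive, false_negative
```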

Here are the visualizations. From right to left, the panels are:

  1. Input single view image
  2. Predicted voxels
  3. Predicted voxels with false positive
  4. Ground truth voxels
  5. Ground truth voxels with false negative

Based on this visualization, we can see how the voxel method performs. One thing I notice is that there are some unoccupied voxels within the object boundaries. This might be because the points are sampled from the mesh enclosing the object surface, and the occupied voxels are generated from those points, so there can be hollow space inside the object, causing an inaccurate voxel model of the object for learning.

All in all, I think simply comparing the three methods with the F1 score is not enough. As seen from the results, the voxel method looks more visually pleasing than the other two, even though it may not be as accurate. This aspect should also be considered when evaluating 3D representation methods, and it points to a limitation of existing evaluation metrics for 3D.

Q3.1 Implicit network

In this part, I implement a simplified version of the work "Occupancy Networks: Learning 3D Reconstruction in Function Space". Rather than predicting voxels, point clouds, or meshes from a single 2D image, this method predicts the occupancy of query points. The architecture is as follows. The input image is first passed through a ResNet-18 to obtain an encoded image feature. Then, the coordinates of the query points are passed through a 512-d fully-connected layer. The encoded coordinate feature is concatenated with the image feature, then passed through two 512-d fully-connected layers and finally a 1-d fully-connected layer to predict occupancy. The network is trained with a binary cross-entropy loss.
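
A minimal sketch of this architecture; the layer sizes follow the description above, while the names and minor details such as activations are my own assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

class OccupancyNet(nn.Module):
    def __init__(self):
        super().__init__()
        # ResNet-18 image encoder with the classification head removed (512-d feature).
        resnet = models.resnet18(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        # Encode query point coordinates into a 512-d feature.
        self.point_fc = nn.Linear(3, 512)
        # Decoder: concatenated (image + point) feature -> occupancy logit.
        self.decoder = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, images, points):
        # images: (B, 3, H, W); points: (B, N, 3) query coordinates.
        feat = self.encoder(images).flatten(1)              # (B, 512) image feature
        point_feat = self.point_fc(points)                  # (B, N, 512) point feature
        feat = feat.unsqueeze(1).expand(-1, points.shape[1], -1)
        occ_logits = self.decoder(torch.cat([feat, point_feat], dim=-1))
        return occ_logits.squeeze(-1)                       # (B, N) occupancy logits

# Trained with binary cross-entropy against ground-truth occupancy labels, e.g.:
# loss = F.binary_cross_entropy_with_logits(model(images, points), gt_occupancy)
```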

Results are shown below. We can see that the F1 score is generally lower than that of all the methods in part 2. This might be attributed to the fact that I use a very simple decoder network here, compared to the multi-layer resblock-style network used in the paper. However, the ablation study on the voxel sampling resolution shows that denser sampling of query points leads to a more accurate reconstruction of the occupancy grid, which is as expected.

| Voxel sampling resolution | Average F1 score |
|---|---|
| 16 | 81.856 |
| 32 | 85.512 |
| 48 | 86.140 |
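
A sketch of how the query grids at different sampling resolutions can be built, assuming the object is normalized to a [-1, 1]^3 cube (the bounds are illustrative):

```python
import torch

def make_query_grid(resolution, low=-1.0, high=1.0):
    # Regular grid of query points covering the cube, shape (resolution^3, 3).
    coords = torch.linspace(low, high, resolution)
    zz, yy, xx = torch.meshgrid(coords, coords, coords, indexing="ij")
    return torch.stack([xx, yy, zz], dim=-1).reshape(-1, 3)

# Denser grids (16 -> 32 -> 48) give a finer reconstruction of the occupancy field.
points_16 = make_query_grid(16)
points_48 = make_query_grid(48)
```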

Down below I also show some visualization results. From left to right: the ground truth mesh, the ground truth voxels, and the predicted occupancy grids at sampling resolutions 16, 32, and 48.