16-889: Learning for 3D Vision (SP22) - Assignment 2

1. Exploring loss functions


1.1 Fitting a voxel grid (5 points)

Source voxel grid vs. ground truth voxel grid

1.1-src 1.1-gt
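
For reference, here is a minimal sketch of the fitting objective, assuming the source grid stores raw (pre-sigmoid) occupancy logits; the names and shapes are illustrative, not the starter code's:

```python
import torch
import torch.nn.functional as F

def voxel_loss(src_logits, gt_voxels):
    # Per-cell binary cross-entropy between predicted occupancy logits
    # and the {0, 1} ground-truth grid.
    return F.binary_cross_entropy_with_logits(src_logits, gt_voxels.float())

# Fitting loop: directly optimize a raw 32^3 grid toward the target.
src_logits = torch.randn(1, 32, 32, 32, requires_grad=True)
gt_voxels = torch.zeros(1, 32, 32, 32)  # placeholder ground truth
opt = torch.optim.Adam([src_logits], lr=1e-2)
for _ in range(1000):
    opt.zero_grad()
    loss = voxel_loss(src_logits, gt_voxels)
    loss.backward()
    opt.step()
```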

1.2. Fitting a point cloud (10 points)

Source point cloud vs. ground truth point cloud

1.2-src 1.2-gt
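
The point cloud is fit by minimizing a symmetric Chamfer distance. The starter code may use pytorch3d.ops.knn_points for the nearest-neighbor search; below is a brute-force sketch with illustrative shapes:

```python
import torch

def chamfer_loss(src, tgt):
    # src: (B, N, 3), tgt: (B, M, 3) batched point clouds.
    dists = torch.cdist(src, tgt) ** 2          # (B, N, M) squared distances
    loss_src = dists.min(dim=2).values.mean()   # each src point -> nearest tgt point
    loss_tgt = dists.min(dim=1).values.mean()   # each tgt point -> nearest src point
    return loss_src + loss_tgt
```

Fitting then amounts to optimizing the source points directly with this loss, exactly as in the voxel case above.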

1.3. Fitting a mesh (5 points)

Source mesh vs. ground truth mesh

1.3-src 1.3-gt
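
The mesh is fit by sampling points from the deforming source mesh and minimizing the Chamfer distance to the target, plus a smoothness regularizer. A sketch using PyTorch3D (the weight values are illustrative):

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_fit_loss(src_mesh, gt_points, w_chamfer=1.0, w_smooth=0.1):
    # Compare points sampled from the deforming mesh against target points.
    pred_points = sample_points_from_meshes(src_mesh, num_samples=5000)
    loss_cham, _ = chamfer_distance(pred_points, gt_points)
    # Laplacian regularizer penalizes non-smooth surfaces.
    loss_smooth = mesh_laplacian_smoothing(src_mesh)
    return w_chamfer * loss_cham + w_smooth * loss_smooth
```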

2. Reconstructing 3D from single view


2.1. Image to voxel grid (15 points)

Qualitative results of image2voxel on three examples.

2.1-1 2.1-2 2.1-3

2.2. Image to point cloud (15 points)

Qualitative results of image2point_cloud on three examples.

2.2-1 2.2-2 2.2-3

2.3. Image to mesh (15 points)

Qualitative results of image2mesh on three examples are shown below.

2.3-1 2.3-2 2.3-3

2.4. Quantitative comparisons (10 points)

Quantitative results for the three models are summarized in the table below:

              vox       point     mesh
Avg. F1@0.05  89.7964   96.6206   94.5228
The point cloud model performs best in terms of F1 score. This is probably due to how the F1 score is evaluated: precision and recall are computed by checking, for each point, whether its nearest neighbor in the other set lies within a distance threshold.
Because the image2point_cloud model directly optimizes the Chamfer distance during training, it performs better than image2mesh, whose objective includes an additional smoothing term for the mesh.
image2vox is optimized with a binary cross-entropy loss on each voxel, and its output must first be converted to a mesh before points can be sampled from it.
So its F1 score is lower than the other two models'.
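
For context, here is a minimal sketch of how F1@0.05 can be computed from points sampled on the predicted and ground-truth shapes (the course's evaluation code may differ in details):

```python
import torch

def f1_at_threshold(pred_points, gt_points, threshold=0.05):
    # pred_points: (N, 3), gt_points: (M, 3)
    dists = torch.cdist(pred_points, gt_points)  # (N, M) pairwise distances
    # Precision: fraction of predicted points within `threshold` of some GT point.
    precision = (dists.min(dim=1).values < threshold).float().mean() * 100
    # Recall: fraction of GT points within `threshold` of some predicted point.
    recall = (dists.min(dim=0).values < threshold).float().mean() * 100
    return 2 * precision * recall / (precision + recall + 1e-8)
```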

2.5. Analyze effects of hyperparameter variations (10 points)

I analyze how w_chamfer affects the performance of image2mesh. The table below shows the quantitative comparison for w_chamfer = 0.1, 10, and 1000:

w_chamfer     0.1       10        1000
Avg. F1@0.05  91.6955   94.5228   94.9293

The qualitative comparison is shown below (top to bottom: w_chamfer = 0.1, 10, 1000):

2.5-1
2.5-1
2.5-1

When w_chamfer is lower, the average F1 score is worse. Without properly penalizing the point predictions, the predicted mesh is inaccurate and has obvious artifacts.
For instance, in the top row, the predicted mesh has four odd triangular legs compared to the other predictions. Increasing w_chamfer to 1000 gives results similar to w_chamfer = 10, meaning that a small amount of smoothness regularization is enough in this case.
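
The sweep itself is straightforward; `train_image2mesh` and `evaluate_f1` below are hypothetical stand-ins for the training and evaluation entry points, and the fixed w_smooth value is illustrative:

```python
for w_chamfer in [0.1, 10, 1000]:
    # w_smooth is held fixed while the chamfer weight varies.
    model = train_image2mesh(w_chamfer=w_chamfer, w_smooth=0.1)
    print(f"w_chamfer={w_chamfer}: F1@0.05 = {evaluate_f1(model):.4f}")
```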

2.6. Interpret your model (15 points)

First, I probe the trained image2vox model by varying the isovalue threshold (0.1, 0.2, ..., 0.9) used when converting the predicted voxels into meshes:

isovalue      0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9
Avg. F1@0.05  68.875  80.779  86.250  89.050  90.089  88.151  84.506  76.218  60.773

2.6-plot

The plot shows that performance peaks at isovalue = 0.5, so the threshold we chose during training and evaluation is reasonable.
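
The sweep can be implemented by re-thresholding the same predicted grid before meshing; here is a sketch assuming PyTorch3D's cubify is used for the voxel-to-mesh conversion (`voxels` is a hypothetical grid of predicted occupancy probabilities):

```python
from pytorch3d.ops import cubify, sample_points_from_meshes

# voxels: (B, D, H, W) predicted occupancy probabilities in [0, 1]
for isovalue in [i / 10 for i in range(1, 10)]:
    mesh = cubify(voxels, isovalue)  # cells above the threshold become cubes
    points = sample_points_from_meshes(mesh, num_samples=5000)
    # ...compute F1@0.05 between `points` and ground-truth samples as in 2.4
```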

Second, I visualize qualitative results by feeding the model the same objects seen from different views. For instance, below are image2vox predictions given different input views:

2.6-1 2.6-2 2.6-3 2.6-4 2.6-5 2.6-6

When the view is informative (e.g., first and second rows in the first column), the input is less ambiguous and the predictions look more correct visually.
However, when the sofa is seen from the side or the back (e.g., second and third rows in the second column), the model predicts a solid base instead of a hollow one.
This makes sense, since the back or side view provides no information about what the sofa's base looks like. It also suggests that the model is not simply memorizing the 3D models.

3. (Extra Credit) Exploring some recent architectures.


3.1 Implicit network

I implement an implicit decoder based on occupancy networks, which takes a 3D location (x, y, z coordinates) and outputs the occupancy value at that location. In particular, I use the Fourier features of the positions proposed by Tancik et al.
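
A minimal sketch of the Fourier feature embedding (the frequency scale and feature count are illustrative, not the values I used):

```python
import math
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """Random Fourier feature mapping (Tancik et al., 2020)."""

    def __init__(self, in_dim=3, n_features=128, scale=10.0):
        super().__init__()
        # Fixed (non-trainable) random projection matrix B ~ N(0, scale^2).
        self.register_buffer("B", torch.randn(in_dim, n_features) * scale)

    def forward(self, x):
        # x: (..., 3) query locations -> (..., 2 * n_features) embedding,
        # which the occupancy decoder consumes alongside the image feature.
        proj = 2 * math.pi * x @ self.B
        return torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
```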

Here are some qualitative results:

3.1-1 3.1-2 3.1-3

The average F1 score is 84.7045. Unfortunately, my implementation did not obtain better performance. Two potential reasons: (1) I did not successfully incorporate conditional batch normalization into the network, and (2) the network layers are not properly tuned yet due to time constraints.