2.6. Interpret your model
[Concept]
Rather than looking only at the F1 score, I visualize the precision and recall components to get a better idea of how the model performs: it is important to know where false positives and false negatives occur.
[Visualization explanation]
In the visualization below, I show example voxel, mesh, and point cloud predictions. The left column
is the input single-view image. The middle column shows the points sampled from the predictions,
which are used to compute the F1 score. The right column shows the points sampled
from the ground truth mesh.
In the middle images, red and yellow points are predicted points whose distance to the closest
ground truth point is below and above the threshold of 0.05, respectively.
In the right images, red and yellow points are ground truth points whose distance to the closest
predicted point is below and above the threshold, respectively.
In short, more yellow points in the middle images mean more false positives and lower precision;
more yellow points in the right images mean more false negatives and lower recall.
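The precision/recall/F1 computation described above can be sketched as follows. This is a minimal brute-force version, not the exact implementation used here: the function name `f1_score_points` and the use of a full pairwise distance matrix are illustrative assumptions.

```python
import numpy as np

def f1_score_points(pred, gt, threshold=0.05):
    """F1 score between two point sets of shape (N, 3) and (M, 3).

    Precision: fraction of predicted points whose nearest ground truth
    point lies within the threshold. Recall: the symmetric quantity,
    computed from the ground truth side.
    """
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = (d.min(axis=1) < threshold).mean()  # pred -> nearest gt
    recall = (d.min(axis=0) < threshold).mean()     # gt -> nearest pred
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1

# Toy example: predictions are the ground truth plus small noise.
rng = np.random.default_rng(0)
gt = rng.random((100, 3))
pred = gt + rng.normal(scale=0.02, size=gt.shape)
p, r, f1 = f1_score_points(pred, gt)
```

A k-d tree (e.g. `scipy.spatial.cKDTree`) would replace the O(N*M) distance matrix for larger point clouds, but the brute-force form makes the precision/recall asymmetry easy to see.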
[Interpretation]
Overall, the voxel prediction has more false positives in the hollow part of the chair back
and around the back chair legs, and all three predictions have difficulty reconstructing the
joints between the chair legs. The visualization is consistent with the F1 scores (point cloud > mesh > voxel) and
gives a better idea of what the model learns to do well and poorly.
This visualization also raises a question: why are most of the points inside the hollow part
of the chair back red in the middle images, when we would expect them to be yellow?
There could be two reasons. First, the threshold may be too high. Second,
different predicted points can share the same closest ground truth point. Based on the visualization and
these two (or more) reasons, we can try more methods to find a metric that better reflects human perception.