Question 1

1.1 Fitting a Voxel Grid

Note: For this part, I used the cubify op in PyTorch3D to convert the voxel grid to a mesh for rendering, since it shows the voxel representation more clearly.
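The core idea behind cubify is to emit a small cube for every occupied voxel. A minimal NumPy sketch of that idea (unlike the real pytorch3d.ops.cubify, this does not merge shared vertices or remove faces interior to the solid; the function name is my own):

```python
import numpy as np

def naive_cubify(vox, thresh=0.5):
    """Emit one unit cube (8 vertices, 12 triangles) per voxel whose
    occupancy exceeds `thresh`. A simplified sketch of what
    pytorch3d.ops.cubify does; the real op also merges shared vertices
    and drops faces interior to the solid."""
    # the 8 corners of a unit cube, ordered by the bit pattern (x, y, z)
    corners = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                       dtype=float)
    # 12 triangles (two per cube face), indexing into `corners`
    faces = np.array([
        [0, 1, 3], [0, 3, 2],  # x = 0 face
        [4, 6, 7], [4, 7, 5],  # x = 1 face
        [0, 4, 5], [0, 5, 1],  # y = 0 face
        [2, 3, 7], [2, 7, 6],  # y = 1 face
        [0, 2, 6], [0, 6, 4],  # z = 0 face
        [1, 5, 7], [1, 7, 3],  # z = 1 face
    ])
    verts, tris = [], []
    for n, idx in enumerate(np.argwhere(vox > thresh)):
        verts.append(corners + idx)  # translate the cube to the voxel position
        tris.append(faces + 8 * n)   # offset indices into the global vertex list
    if not verts:
        return np.zeros((0, 3)), np.zeros((0, 3), dtype=int)
    return np.concatenate(verts), np.concatenate(tris)
```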

(figure: predicted voxels vs. ground truth voxels)

1.2 Fitting a Point Cloud

(figure: predicted point cloud vs. ground truth point cloud)

1.3 Fitting a Mesh

(figure: predicted mesh vs. ground truth mesh)

Question 2

2.1 Image to Voxel Grid

(three figures: input RGB, predicted voxels, ground truth mesh)

2.2 Image to Point Cloud

(three figures: input RGB, predicted point cloud, ground truth mesh)

2.3 Image to Mesh

(three figures: input RGB, predicted mesh, ground truth mesh)

2.4 Quantitative Comparisons

Reconstruction F1 @ 0.05
Voxel 84.427
Point 95.363
Mesh 68.938

Intuitive explanation for the trend: the F1 metric is computed on a point-wise basis. That is, we sample points from the predicted 3D representation (for voxels and meshes; predicted point clouds are used directly) and compute their distances to points sampled from the ground truth mesh.

As such, this metric is most easily optimized by point clouds, where the model predicts arbitrary points in 3D space with no constraint or connection between any two points. The model can freely place points close to the surface of the target shape, resulting in a high F1 score.

In contrast, in the case of meshes, points are sampled from the faces of the mesh. To obtain a high F1 score, the model must therefore predict not only the correct vertex locations but also respect the given (arbitrary) connectivity of the vertices and the orientations of the resulting faces. This is a much harder task than predicting points alone, since the model must reason about both the locations and the connectivity/local structure of the points, resulting in lower F1 scores.

Finally, the case of voxels lies somewhere in between points and meshes. When we sample points from a voxel grid, we do so by first converting it to a mesh by running marching cubes on it. However, the connectivity of the resulting mesh is far from arbitrary -- in fact, the faces in the generated mesh must be one of the candidate faces in the marching cubes algorithm. Thus, the model does not have to reason about connectivity and local structure as much as in the case of meshes. That said, it does have to reason about it a bit more than point clouds (which are more decoupled from each other). Thus, the F1 for voxels lies in between meshes and points.
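Concretely, the point-based F1 at a distance threshold can be sketched as follows (a NumPy version, assuming points have already been sampled from both shapes; the function name is my own):

```python
import numpy as np

def f1_at_thresh(pred_pts, gt_pts, thresh=0.05):
    """Point-based F1: precision is the fraction of predicted points
    within `thresh` of some ground-truth point; recall is the fraction
    of ground-truth points within `thresh` of some predicted point."""
    # pairwise distances, shape (n_pred, n_gt)
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    precision = (d.min(axis=1) < thresh).mean()
    recall = (d.min(axis=0) < thresh).mean()
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)
```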

2.5 Analyzing Effects of Hyperparameter Variations

I ran experiments by varying n_points in my point cloud model. Here are the results from that set of experiments:

n_points F1 @ 0.05
1 0.295
100 50.534
5000 95.363
10000 92.810

What was particularly interesting was that even the model which predicted just a single point learned something meaningful! (Even though its F1 metric is very low.)

Concretely, I looked at the prediction of the single-point model and compared it with the mean of the coordinates of the vertices of the ground truth mesh (on the test set). Here is a subset:

GT: [ 0.03920274 -0.03639643 -0.00027726]
Pred: [ 0.03705994 -0.0324362  -0.0002422051]

GT: [ 0.02971377 -0.03746274 -0.00127459]
Pred: [ 0.0295628  -0.03555077 -0.001299112]

GT: [ 0.03429012 -0.02339058 -0.00056951]
Pred: [ 0.02522823 -0.02798535 -0.0001399013]

GT: [ 0.04282395 -0.02701716 -0.00785733]
Pred: [ 0.03400975 -0.03048868 -0.00645506 ]

GT: [ 0.03119972 -0.0203436   0.00145872]
Pred: [ 0.03795141 -0.02103944 0.00153594]

It seems that the model learned to predict the mean point of the shape. This makes sense intuitively: with a single predicted point, the Chamfer loss reduces (up to the small prediction-to-ground-truth term) to a sum of squared distances from that point to the ground-truth points, and the minimizer of a sum of squared errors is simply their mean.
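A quick NumPy check of this argument on synthetic points (ignoring the negligible prediction-to-ground-truth term, the single-point loss is just the sum of squared distances to the ground-truth points):

```python
import numpy as np

rng = np.random.default_rng(0)
gt = rng.normal(size=(500, 3))  # stand-in for points sampled from a GT mesh

def single_point_loss(p, gt):
    # with one predicted point, the GT-to-prediction Chamfer term is the
    # sum of squared distances from every ground-truth point to p
    return ((gt - p) ** 2).sum()

mean = gt.mean(axis=0)
# nudging the prediction away from the mean in any axis direction
# strictly increases the loss, consistent with the mean being the minimizer
for delta in np.eye(3):
    assert single_point_loss(mean, gt) < single_point_loss(mean + 0.01 * delta, gt)
```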

100 Points

(two figures: input RGB, predicted point cloud, ground truth point cloud)

The visualizations for 5000 points are above in question 2.2.

10000 Points

(two figures: input RGB, predicted point cloud, ground truth point cloud)

Another point worth noting is that the fraction of outlier points remains roughly constant across settings of n_points (i.e., the number of outliers scales with the number of predicted points). This is probably because we use the same n_points both to sample from the ground truth mesh and to evaluate the prediction. If we sampled more points for evaluation, the fraction of outliers might go down.

2.6 Interpret your Model

I evaluated the voxel and point cloud models on the entire test set and created side-by-side visualizations of the ground truth (left) and the prediction (right). Then, I sorted the visualizations in descending order of the prediction's F1 metric. Finally, I compiled them into an interactive GUI that lets you sift (no computer vision pun intended) through the results in order of decreasing quality. This visualization gave me the insights I've pointed out below.

Unfortunately, it's not possible to render the interactive GUI in HTML, but I've provided the code for that in Visualization.ipynb. I've included a couple of GIFs of me sift-ing through the visualizations.
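The ranking step underlying the GUI is simple; a sketch, with hypothetical per-example records (the filenames are invented for illustration):

```python
# Hypothetical (f1, gt_render, pred_render) records for three test examples.
results = [
    (84.2, "gt_000.png", "pred_000.png"),
    (95.4, "gt_001.png", "pred_001.png"),
    (12.7, "gt_002.png", "pred_002.png"),
]

# Sort in descending order of F1 so that sliding through the GUI moves
# from the best predictions to the worst.
ranked = sorted(results, key=lambda r: r[0], reverse=True)
```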

Voxel Model

(GIF: sifting through voxel predictions from best to worst)

Insight gained:

As we go from good predictions to bad, we see that the model fails on objects with very thin structures. In the above GIF, I point out such mistakes with my cursor: the model fails to predict some thin structures, leading to a low F1 score for those predictions.

In particular, the very last few visualizations (the least accurate predictions) completely confuse the model (since it predicts random blobs for them), and they are composed almost entirely of very thin members and structures. This is visible near the end of the GIF.

Point Cloud Model

(GIF: sifting through point cloud predictions from best to worst)

Insight gained:

As we go from good predictions to bad, we see that the model fails on examples where it has aggregated a large number of points in one area, usually the base of the chair. This often leaves a lower density of points in other areas of the shape, where the model falls back to a sparser, more uniform distribution, leading to an inaccurate representation of those regions.

My hypothesis as to why this happens is that the Chamfer loss is essentially a trade-off between two terms:

  1. the prediction-to-ground-truth term (each predicted point's distance to its nearest ground-truth point) ensures that every predicted point lies close to the ground-truth surface.
  2. the ground-truth-to-prediction term (each ground-truth point's distance to its nearest predicted point) ensures that the predicted points cover all of the ground truth sufficiently.

If either term is used in isolation, it can lead to one of the following degeneracies:

  1. all predicted points are in some very localized region of the ground truth (this phenomenon is visible in predictions in the above GIF where many points are concentrated on the base of the chair).
  2. all predicted points are uniformly distributed in space so as to ensure that some predicted point is always close to every ground truth point (this phenomenon is visible in predictions in the above GIF where many points are concentrated near the base, and other points are distributed more uniformly below the base).
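This trade-off can be made concrete by computing the two Chamfer terms separately; a NumPy sketch with a toy demonstration of degeneracy 1 (the data and function name are illustrative):

```python
import numpy as np

def chamfer_terms(pred, gt):
    """Return the two Chamfer terms separately:
    term 1: mean squared distance from each predicted point to its
            nearest ground-truth point (pulls predictions onto the GT),
    term 2: mean squared distance from each ground-truth point to its
            nearest predicted point (forces coverage of the GT)."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    return (d.min(axis=1) ** 2).mean(), (d.min(axis=0) ** 2).mean()

# ground truth: points along a line segment from (0,0,0) to (1,0,0)
gt = np.zeros((101, 3))
gt[:, 0] = np.linspace(0.0, 1.0, 101)

# degenerate prediction: every point collapsed onto one GT location
collapsed = np.tile(gt[50], (101, 1))

pred_to_gt, gt_to_pred = chamfer_terms(collapsed, gt)
# term 1 is driven to zero even though the prediction is terrible;
# only the coverage term 2 penalizes the collapse
assert pred_to_gt < 1e-12
assert gt_to_pred > 0.05
```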

My hypothesis is that on poor predictions, the network is essentially "gaming" the Chamfer loss by predicting a high concentration of points where it knows with high certainty there exists a structure (the base of the chair, since most chairs presumably have bases). Thus, by point 1 in the above analysis, it achieves a low Chamfer error in that region of the prediction. In other areas where the model is not certain about the structure (e.g. thin or complicated legs), it predicts a more uniform density of points so that it achieves a low Chamfer error in that region of the prediction by point 2 of the above analysis.

We can see this effect increasing in intensity as we go from good to bad predictions. This effect is most pronounced in the last prediction of the GIF above, where the base of the chair has a very large density of points, while the rest of the region below the base has an almost uniform distribution of points.