Note: For this part, I used the `cubify` tool in PyTorch3D to render the voxels, as it showed the voxel representation more clearly.
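For reference, here is a minimal sketch of converting a voxel grid to a renderable mesh with `cubify`; the grid, threshold, and output path are illustrative assumptions, not the exact settings used for the renders below.

```python
import torch
from pytorch3d.ops import cubify
from pytorch3d.io import save_obj

# Dummy (N, D, H, W) occupancy grid with values in [0, 1]; a real one would come from the model.
voxels = torch.rand(1, 32, 32, 32)

# cubify turns every voxel above the threshold into a small cube and merges shared
# faces, returning a Meshes object that can be fed to the usual mesh renderer.
meshes = cubify(voxels, thresh=0.5)

# Save the first mesh in the batch for rendering/inspection.
save_obj("voxels_as_mesh.obj", meshes.verts_list()[0], meshes.faces_list()[0])
```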
Predicted Voxels | Ground Truth Voxels |
---|---|
![]() | ![]() |
Predicted Point Cloud | Ground Truth Point Cloud |
---|---|
![]() | ![]() |
Predicted Mesh | Ground Truth Mesh |
---|---|
![]() | ![]() |
Input RGB | Predicted Voxels | Ground Truth Mesh |
---|---|---|
![]() | ![]() | ![]() |
Input RGB | Predicted Voxels | Ground Truth Mesh |
---|---|---|
![]() | ![]() | ![]() |
Input RGB | Predicted Voxels | Ground Truth Mesh |
---|---|---|
![]() | ![]() | ![]() |
Input RGB | Predicted Point Cloud | Ground Truth Mesh |
---|---|---|
![]() | ![]() | ![]() |
Input RGB | Predicted Point Cloud | Ground Truth Mesh |
---|---|---|
![]() | ![]() | ![]() |
Input RGB | Predicted Point Cloud | Ground Truth Mesh |
---|---|---|
![]() | ![]() | ![]() |
Input RGB | Predicted Mesh | Ground Truth Mesh |
---|---|---|
![]() | ![]() | ![]() |
Input RGB | Predicted Mesh | Ground Truth Mesh |
---|---|---|
![]() | ![]() | ![]() |
Input RGB | Predicted Mesh | Ground Truth Mesh |
---|---|---|
![]() | ![]() | ![]() |
Reconstruction | F1 @ 0.05 |
---|---|
Voxel | 84.427 |
Point | 95.363 |
Mesh | 68.938 |
Intuitive explanation for the trend: we compute the F1 metric on a point-wise basis. That is, we sample points from the predicted 3D representation (for voxels and meshes; point clouds are already sets of points) and compute their distances to points sampled from the ground truth mesh.

As such, this metric is most easily optimized by point clouds, where the model predicts free points in 3D space with no constraint or connection between any two points. It is easy for the model to place such points close to the surface of the target shape, resulting in a high F1 score.

In contrast, for meshes, points are sampled from the faces of the mesh. To obtain a high F1 score, the model has to predict not only the correct locations of the points (vertices) but also respect the given (arbitrary) connectivity of the vertices and the orientations of the resulting faces. This is a much harder task than predicting points alone, as in the case of point clouds, since it requires the model to reason about both the locations and the connectivity/local structure of the points, resulting in lower F1 scores.

Finally, voxels lie somewhere between point clouds and meshes. When we sample points from a voxel grid, we first convert it to a mesh by running marching cubes on it. However, the connectivity of the resulting mesh is far from arbitrary: every face in the generated mesh must be one of the candidate face configurations of the marching cubes algorithm. Thus, the model does not have to reason about connectivity and local structure as much as for meshes, though it does have to reason about them a bit more than for point clouds, whose points are fully decoupled from each other. Hence the F1 score for voxels falls between those of meshes and point clouds.
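To make the metric concrete, here is a minimal sketch of a point-wise F1 computation at a 0.05 threshold, using PyTorch3D's `knn_points`. It assumes both shapes have already been converted to point sets; the helper name, shapes, and percentage scaling are illustrative assumptions, not the exact evaluation code used here.

```python
import torch
from pytorch3d.ops import knn_points

def f1_score(pred_points, gt_points, threshold=0.05):
    """Point-wise F1. pred_points: (1, N, 3), gt_points: (1, M, 3)."""
    # Distance from each predicted point to its nearest ground-truth point.
    pred_to_gt = knn_points(pred_points, gt_points, K=1).dists[..., 0].sqrt()
    # Distance from each ground-truth point to its nearest predicted point.
    gt_to_pred = knn_points(gt_points, pred_points, K=1).dists[..., 0].sqrt()

    precision = (pred_to_gt < threshold).float().mean()  # predicted points lying near the surface
    recall = (gt_to_pred < threshold).float().mean()     # surface covered by predicted points
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)  # reported as a percentage
```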
I ran experiments by varying `n_points` in my point cloud model. Here are the results from that set of experiments:
`n_points` | F1 @ 0.05 |
---|---|
1 | 0.295 |
100 | 50.534 |
5000 | 95.363 |
10000 | 92.810 |
What was particularly interesting was that even the model which predicted just a single point learned something meaningful! (Even though its F1 metric is very low.)
Concretely, I looked at the prediction of the single-point model and compared it with the mean of the coordinates of the vertices of the ground truth mesh (on the test set). Here is a subset:
```
GT:   [ 0.03920274 -0.03639643 -0.00027726]
Pred: [ 0.03705994 -0.0324362  -0.0002422051]
GT:   [ 0.02971377 -0.03746274 -0.00127459]
Pred: [ 0.0295628  -0.03555077 -0.001299112]
GT:   [ 0.03429012 -0.02339058 -0.00056951]
Pred: [ 0.02522823 -0.02798535 -0.0001399013]
GT:   [ 0.04282395 -0.02701716 -0.00785733]
Pred: [ 0.03400975 -0.03048868 -0.00645506]
GT:   [ 0.03119972 -0.0203436   0.00145872]
Pred: [ 0.03795141 -0.02103944  0.00153594]
```
It seems that the model learned to predict the mean point of the shape. This makes sense intuitively, because the Chamfer loss here is a sum of squared distances, and the minimizer of a sum of squared errors over a single variable is just the mean of the samples.
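To spell that out: with a single predicted point $p$ and ground truth samples $x_1, \dots, x_N$, the ground-truth-to-prediction term of the (squared) Chamfer loss reduces to an ordinary sum of squared errors, whose minimizer is the mean (the remaining prediction-to-ground-truth term, $\min_i \lVert p - x_i \rVert^2$, is comparatively small and does not change the picture much):

$$
p^{*} = \arg\min_{p} \sum_{i=1}^{N} \lVert x_i - p \rVert^2
\quad\Longrightarrow\quad
\sum_{i=1}^{N} 2\,(p^{*} - x_i) = 0
\quad\Longrightarrow\quad
p^{*} = \frac{1}{N} \sum_{i=1}^{N} x_i.
$$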
Input RGB | Predicted Point Cloud | Ground Truth Point Cloud |
---|---|---|
![]() | ![]() | ![]() |
Input RGB | Predicted Point Cloud | Ground Truth Point Cloud |
---|---|---|
![]() | ![]() | ![]() |
The visualizations for 5000 points are above in question 2.2.
Input RGB | Predicted Point Cloud | Ground Truth Point Cloud |
---|---|---|
![]() | ![]() | ![]() |
Input RGB | Predicted Point Cloud | Ground Truth Point Cloud |
---|---|---|
![]() | ![]() | ![]() |
Another point worth noting is that the fraction of outlier points remains roughly the same across the different settings (i.e., the number of outliers scales with the number of predicted points). This is probably because we sample the same number of points, `n_points`, from the ground truth mesh when evaluating the predicted `n_points` points. If we were to increase the number of points used in evaluation, perhaps the fraction of outliers would go down.
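One way to test this would be to decouple the evaluation sample count from `n_points` by sampling a much denser ground truth point set. A rough sketch, reusing the `f1_score` helper sketched earlier, with a placeholder mesh path and a stand-in prediction (both illustrative, not my actual evaluation code):

```python
import torch
from pytorch3d.io import load_objs_as_meshes
from pytorch3d.ops import sample_points_from_meshes

gt_mesh = load_objs_as_meshes(["chair_gt.obj"])  # hypothetical path to a ground truth mesh
pred_points = torch.rand(1, 5000, 3)             # stand-in for an n_points = 5000 prediction

# Evaluate against a much denser ground truth sampling than n_points.
gt_points = sample_points_from_meshes(gt_mesh, num_samples=50000)  # (1, 50000, 3)
score = f1_score(pred_points, gt_points, threshold=0.05)
```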
I evaluated the voxel and point cloud models on the entire test set and created side-by-side visualizations of the ground truth (left) and the prediction (right). Then I sorted the visualizations in descending order of the prediction's F1 score. Finally, I compiled them into an interactive GUI that lets you sift (no computer vision pun intended) through the results in decreasing order of quality. This visualization gave me the insights I've pointed out below.

Unfortunately, it's not possible to render the interactive GUI in HTML, but I've provided the code for it in `Visualization.ipynb` and included a couple of GIFs of me sifting through the visualizations.
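For reference, here is a minimal sketch of the kind of slider-based viewer I mean, using `ipywidgets`; the `results` list and image paths below are placeholders, and the actual code lives in `Visualization.ipynb`.

```python
import ipywidgets as widgets
from IPython.display import Image, display

# Placeholder: (F1 score, path to the rendered GT-vs-prediction image) for each test example.
results = [(96.1, "vis/0003.png"), (88.4, "vis/0042.png"), (35.2, "vis/0611.png")]
results.sort(key=lambda r: r[0], reverse=True)  # best predictions first

def show(idx):
    f1, path = results[idx]
    print(f"Rank {idx}: F1 @ 0.05 = {f1:.2f}")
    display(Image(filename=path))

# Slide from the best prediction (idx = 0) down to the worst.
widgets.interact(show, idx=widgets.IntSlider(min=0, max=len(results) - 1, step=1, value=0))
```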
Insight gained:
As we go from good predictions to bad, we see that the model fails on objects with very thin structures. In the above GIF, I've pointed out such mistakes with my cursor: the model fails to predict some of the thin structures, leading to a low F1 score for that prediction.

In particular, the very last few visualizations (the least accurate predictions) correspond to objects that completely confuse the model (it predicts random blobs for them), and these objects are composed almost entirely of very thin members and structures. This is visible near the end of the GIF.
Insight gained:
As we go from good predictions to bad, we see that the failures are predictions where the model has aggregated a large number of points in one area, usually the base of the chair. This leaves a lower, sparser, and more uniform density of points in other areas of the shape, which are then represented inaccurately.
My hypothesis as to why this happens is that the Chamfer loss is basically a trade-off between two terms:

1. A prediction-to-ground-truth term: every predicted point should lie close to some point on the ground truth surface.
2. A ground-truth-to-prediction term: every ground truth point should have some predicted point close to it.

If either of the above terms is used in isolation, it can lead to the following degeneracies:

1. With only the first term, the model can collapse a large number of points onto a small region of the surface it is certain about and still achieve a low loss there.
2. With only the second term, the model only needs some predicted point near every ground truth point, so a diffuse, roughly uniform spread of points is enough to achieve a low loss.
My hypothesis is that on poor predictions, the network is essentially "gaming" the Chamfer loss by predicting a high concentration of points where it knows with high certainty there exists a structure (the base of the chair, since most chairs presumably have bases). Thus, by point 1 in the above analysis, it achieves a low Chamfer error in that region of the prediction. In other areas where the model is not certain about the structure (e.g. thin or complicated legs), it predicts a more uniform density of points so that it achieves a low Chamfer error in that region of the prediction by point 2 of the above analysis.
We can see this effect increasing in intensity as we go from good to bad predictions. This effect is most pronounced in the last prediction of the GIF above, where the base of the chair has a very large density of points, while the rest of the region below the base has an almost uniform distribution of points.
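To make the two factors above concrete, here is a minimal sketch of the two Chamfer terms computed separately with PyTorch3D's `knn_points`; the shapes and the mean reduction are assumptions, and the training code may differ in such details.

```python
import torch
from pytorch3d.ops import knn_points

def chamfer_terms(pred_points, gt_points):
    """pred_points: (1, N, 3), gt_points: (1, M, 3). Returns the two terms separately."""
    # Term 1: every predicted point should lie near *some* ground truth point.
    # Clumping points on a region the model is sure about keeps this term low.
    pred_to_gt = knn_points(pred_points, gt_points, K=1).dists[..., 0].mean()

    # Term 2: every ground truth point should have *some* predicted point nearby.
    # Spreading points roughly uniformly over uncertain regions keeps this term low.
    gt_to_pred = knn_points(gt_points, pred_points, K=1).dists[..., 0].mean()

    return pred_to_gt, gt_to_pred  # the Chamfer loss is their sum
```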