I used 4 late days submitting this one. Stupid paper deadlines.
| Ground Truth | Optimized |
| --- | --- |
| ![]() | ![]() |

| Ground Truth | Optimized |
| --- | --- |
| ![]() | ![]() |

| Ground Truth | Optimized |
| --- | --- |
| ![]() | ![]() |
| Object # | Input Image | Ground Truth Render | Predicted Geometry |
| --- | --- | --- | --- |
| 0 | ![]() | ![]() | ![]() |
| 20 | ![]() | ![]() | ![]() |
| 30 | ![]() | ![]() | ![]() |
| Object # | Input Image | Ground Truth Render | Predicted Geometry |
| --- | --- | --- | --- |
| 0 | ![]() | ![]() | ![]() |
| 20 | ![]() | ![]() | ![]() |
| 30 | ![]() | ![]() | ![]() |
| Object # | Input Image | Ground Truth Render | Predicted Geometry |
| --- | --- | --- | --- |
| 0 | ![]() | ![]() | ![]() |
| 20 | ![]() | ![]() | ![]() |
| 30 | ![]() | ![]() | ![]() |
The following are the F1 scores:
| Method | F1 |
| --- | --- |
| Vox | 85.75 |
| Point | 96.34 |
| Mesh | 92.35 |
In terms of topline metrics, the point-cloud-based representation performs best. This is no surprise: if you look at how the F1 score is computed, the metric for all three methods is to sample a point cloud and compute the Chamfer distance between prediction and ground truth, thresholding at various distances. This is EXACTLY the loss function (minus the thresholding) that we use when training the point cloud network, so it stands to reason that the method that directly optimizes the evaluation metric will do better than methods that optimize a different one.
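To make the overlap concrete, here's a minimal sketch of the metric (my re-implementation for illustration, not the course's exact grading code; the threshold value below is an assumption). The two `min` terms are exactly the two directions of the symmetric Chamfer distance used as the training loss, just thresholded instead of averaged:

```python
import torch

def f1_at_threshold(pred_pts: torch.Tensor, gt_pts: torch.Tensor,
                    threshold: float = 0.05) -> torch.Tensor:
    """F1 between two sampled point clouds at a distance threshold.

    pred_pts: (N, 3) points sampled from the prediction.
    gt_pts:   (M, 3) points sampled from the ground truth.
    """
    dists = torch.cdist(pred_pts, gt_pts)  # (N, M) pairwise distances
    # Precision: fraction of predicted points within threshold of some GT point.
    precision = (dists.min(dim=1).values < threshold).float().mean()
    # Recall: fraction of GT points within threshold of some predicted point.
    recall = (dists.min(dim=0).values < threshold).float().mean()
    # Averaging the two min-distance terms instead of thresholding them
    # gives the symmetric Chamfer distance, i.e. the training loss.
    return 100.0 * 2.0 * precision * recall / (precision + recall + 1e-8)
```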
Aside from this, point cloud prediction is strictly the easier optimization task: there are fewer constraints on the output space than in mesh prediction (a mesh carries connectivity information that must be respected while minimizing the loss), and there is less to predict than in voxel prediction (a voxel grid covers an entire 3D volume, most of which is empty, rather than a fixed number of surface points). Between the two, voxel prediction performs worse, but that may be because I didn't do any class balancing to keep the overwhelmingly empty voxels from dominating the loss, and I might not have let it train long enough.
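For reference, the class balancing I'm referring to would look roughly like the sketch below (a hypothetical fix, explicitly *not* the loss behind the numbers above): up-weight the rare occupied voxels so that predicting everything empty stops being a good strategy.

```python
import torch
import torch.nn.functional as F

def balanced_voxel_loss(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """BCE over an occupancy grid with occupied voxels up-weighted.

    logits, target: (B, D, H, W); target holds {0, 1} occupancy.
    """
    # Most of the grid is empty, so weight positives by the empty/occupied
    # ratio to make both classes contribute comparably to the loss.
    n_occupied = target.sum().clamp(min=1.0)
    pos_weight = (target.numel() - n_occupied) / n_occupied
    return F.binary_cross_entropy_with_logits(logits, target, pos_weight=pos_weight)
```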
I'm choosing to vary the number of points in the point cloud prediction. The general idea is to see how things behave as we push the setting to extremes in both directions.
| Number of Points | n=1 | n=10 | n=100 | n=1000 | n=10000 | n=100000 |
| --- | --- | --- | --- | --- | --- | --- |
| F1 Score | 0.44 | 8.59 | 60.76 | 93.19 | 95.92 | 61.53 |
First, some obvious observations. Predicting only 1 point yields a very poor result, as does predicting 10 (though it beats 1 point). Quality shoots up dramatically between 10 and 100 points, though, which suggests that the bulk of an object's shape can be described by only a handful of points.
What's interesting is that performance doesn't change all that much between 1,000 and ~10,000 points. One would think that denser coverage of the surface would yield substantially higher F1 scores, but the returns are clearly diminishing. Finally, when predicting 100,000 points, the network doesn't converge well enough to predict any more precisely than it did with 100 points, which is expected. You could probably learn a more scalable model with a sampling-based approach rather than direct regression.
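One plausible reason for the collapse at n=100,000 is parameter blow-up: with a direct-regression head, the final layer grows linearly with n. A hypothetical decoder along these lines (assuming a 512-dim image feature; this is an illustrative sketch, not my exact architecture) makes the scaling obvious:

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    """Hypothetical direct-regression head: image feature -> (n_points, 3)."""

    def __init__(self, feat_dim: int = 512, n_points: int = 1000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            # This layer alone has 1024 * (n_points * 3) weights: ~3M at
            # n=1000, ~300M at n=100000, which is where optimization falls apart.
            nn.Linear(1024, n_points * 3),
            nn.Tanh(),  # keep coordinates in a normalized [-1, 1] cube
        )

    def forward(self, feat):
        # feat: (B, feat_dim) -> (B, n_points, 3)
        return self.mlp(feat).view(-1, self.n_points, 3)
```

A sampling-based head, decoding each point from a shared latent plus a per-point random code, would keep the parameter count independent of n.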
The takeaway here is: density past a certain point is somewhat superfluous. This is true not just of the learning task but of representing the object itself.
I wanted to answer the question: does the reconstruction depend on the geometry of the input image, or is the network just memorizing a proto-shape? So I generated 10 different perspectives of one object to see whether the predicted point cloud varied at all. If a part of the object is reconstructed differently when it is occluded versus when it is visible, I consider that some (albeit weak) evidence that the network is actually looking at the image to generate the geometry. If not, it's likely memorizing a proto-chair. Here is a grid of the different perspectives:
| View | Input Render | Predicted Point Cloud |
| --- | --- | --- |
| Ground truth | ![]() | |
| Perspective #0 | ![]() | ![]() |
| Perspective #1 | ![]() | ![]() |
| Perspective #2 | ![]() | ![]() |
| Perspective #3 | ![]() | ![]() |
| Perspective #4 | ![]() | ![]() |
| Perspective #5 | ![]() | ![]() |
| Perspective #6 | ![]() | ![]() |
| Perspective #7 | ![]() | ![]() |
| Perspective #8 | ![]() | ![]() |
| Perspective #9 | ![]() | ![]() |
We see that the network does seem to rely on the geometry in the input image to reconstruct the point cloud. For instance, when the front of the chair is visible, the network recognizes that there should be a gap underneath the seat; when the seat is not visible, it tends to predict a solid region there instead.
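For completeness, the perspective sweep can be scripted by rotating the camera azimuth around the object. Below is a minimal sketch assuming PyTorch3D, with `mesh`, `renderer`, and `model` as hypothetical stand-ins for the loaded object, a configured mesh renderer, and the trained single-view network; the rendering setup I actually used may differ:

```python
import torch
from pytorch3d.renderer import FoVPerspectiveCameras, look_at_view_transform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Ten evenly spaced azimuths around the object (drop the duplicate 360).
azimuths = torch.linspace(0, 360, steps=11)[:-1]

predictions = []
for azim in azimuths:
    # Camera on a ring: fixed distance and elevation, varying azimuth.
    R, T = look_at_view_transform(dist=3.0, elev=30.0, azim=float(azim))
    cameras = FoVPerspectiveCameras(R=R, T=T, device=device)
    image = renderer(mesh, cameras=cameras)       # (1, H, W, 4) RGBA render
    rgb = image[..., :3].permute(0, 3, 1, 2)      # (1, 3, H, W) for the model
    with torch.no_grad():
        predictions.append(model(rgb))            # predicted point cloud
```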
I don't have time for extra credit.