Average F1 score:
Voxel: 77.765
Point cloud: 90.271
Mesh: 87.941
According to the average F1 score, point clouds are the easiest to predict. I think this is because the metric only compares the location of each predicted point against points sampled from the ground-truth mesh. Even if the prediction does not reconstruct the object in the image perfectly, we can still get a high F1 score by capturing the rough shape of the object and placing points near the surface.
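To make this concrete, here is a minimal sketch of such a point-based F1 metric, assuming unbatched (P, 3) and (G, 3) float tensors; the function name and the 0.05 threshold are my own illustrative choices, not necessarily what the evaluation code uses.

```python
import torch

# Hypothetical sketch: a predicted point counts as a true positive if it
# lies within `thresh` of some ground-truth point, and symmetrically for
# recall over the ground-truth points.
def f1_score(pred_points, gt_points, thresh=0.05):
    dists = torch.cdist(pred_points, gt_points)            # (P, G) pairwise distances
    precision = (dists.min(dim=1).values < thresh).float().mean()
    recall = (dists.min(dim=0).values < thresh).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```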
Predicting meshes is a bit harder. Although we use a similar loss as for point clouds, we predict the parameters that deform an initial template mesh, and we also need to consider the smoothness of the predicted mesh.
Predicting voxel grids is the hardest. Most voxels in the 3D grid are unoccupied, so it is hard to find a proper weight to balance the loss between occupied and unoccupied voxels. Also, we do not model the context around each voxel, and it is hard to predict whether each voxel is occupied from just a single-view image.
pos_weight in BCEWithLogitsLoss:
pos_weight controls the weight of positive examples. If the weight of positive examples (occupied voxels in our case) is too small, the network will predict every voxel as empty to achieve a small loss. But if the weight of positive examples is too large, the network will pay too much attention to positive examples and predict too many occupied voxels. The settings I tried are listed below, followed by a sketch of how the weight can be computed.
pos_weight = 0: empty mesh (no voxel is predicted occupied)
pos_weight = 0.1 * number_of_non_occupied_grids / number_of_occupied_grids:
pos_weight = number_of_non_occupied_grids / number_of_occupied_grids:
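As a minimal sketch of the weighting above, assuming pred_logits and gt_voxels are tensors of matching shape with binary ground-truth occupancy (the names and the scale knob are mine):

```python
import torch
import torch.nn as nn

# Sketch of an occupancy loss with a data-dependent pos_weight.
# scale = 0.1 and 1.0 reproduce the last two settings listed above.
def voxel_loss(pred_logits, gt_voxels, scale=1.0):
    gt = gt_voxels.float()
    num_occupied = gt.sum().clamp(min=1)                   # avoid division by zero
    num_empty = gt.numel() - num_occupied
    pos_weight = scale * num_empty / num_occupied          # up-weight rare occupied voxels
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    return criterion(pred_logits, gt)
```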
w_smooth:
w_smooth controls the tradeoff between accuracy and smoothness. Increasing the weight of the smoothness loss makes the predicted meshes smoother. The settings I tried are below, followed by a sketch of the combined loss.
w_smooth = 10.0
w_smooth = 100.0
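Here is a minimal sketch of how such a combined loss could look with PyTorch3D, assuming pred_mesh is a Meshes object and gt_points is a (B, N, 3) tensor of points sampled from the ground-truth mesh; the function name and n_samples are illustrative, not necessarily what my training code uses.

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

# Sketch: chamfer accuracy term plus a Laplacian smoothness term,
# traded off by w_smooth.
def mesh_loss(pred_mesh, gt_points, w_smooth=10.0, n_samples=5000):
    pred_points = sample_points_from_meshes(pred_mesh, n_samples)
    loss_chamfer, _ = chamfer_distance(pred_points, gt_points)
    # The Laplacian term penalizes vertices that sit far from the
    # centroid of their neighbors, i.e. spiky geometry.
    loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    return loss_chamfer + w_smooth * loss_smooth
```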
I visualize point clouds sampled from the predictions (point cloud and mesh) and from the ground-truth mesh. As we can see, for both point cloud prediction and mesh reconstruction, the predicted points cluster in certain regions rather than spreading evenly over the whole object. I believe this is caused by the loss function: since the chamfer loss only pulls each predicted point toward its nearest ground-truth neighbor, the network concentrates on those selected ground-truth points. I think that is also one of the reasons why the reconstructed meshes have many spikes.
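A from-scratch sketch of the (squared-distance) chamfer loss makes this nearest-neighbor behavior explicit; the unbatched tensor shapes and names are my assumptions. Because each predicted point is matched only to its nearest ground-truth point, several predictions can share one target, which is consistent with the clustering described above.

```python
import torch

# Sketch of a two-sided chamfer loss for (P, 3) and (G, 3) point sets.
def chamfer_loss(pred_points, gt_points):
    dists = torch.cdist(pred_points, gt_points)            # (P, G)
    pred_to_gt = dists.min(dim=1).values.pow(2).mean()     # each prediction -> nearest gt
    gt_to_pred = dists.min(dim=0).values.pow(2).mean()     # each gt -> nearest prediction
    return pred_to_gt + gt_to_pred
```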
Here are some visualizations, where red points are sampled from the predictions and blue points are sampled from the ground-truth meshes.
Point cloud:
Mesh: