2.4 Quantitative Comparisons
Voxel Network (Avg F1@0.05): 86.531
Point Cloud Network (Avg F1@0.05): 96.409
Mesh Network (Avg F1@0.05): 94.012
Intuition
It can be seen that the F1 score of the voxel grid is the worst, followed by the mesh network and then the point cloud. This is primarily because the voxel representation has the least expressive power, at least at lower resolutions such as a 32x32x32 grid.
The mesh network can, in principle, perform well, but it cannot model holes and fine shape detail because it is limited by the vertices and connectivity of the initial mesh, which is a sphere in our case.
The point cloud is the most expressive representation and can, in principle, model holes as well, but it lacks connectivity information.
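The F1@0.05 metric behind the numbers above can be sketched as follows: precision is the fraction of predicted points within 0.05 of some ground-truth point, recall the converse, and F1 their harmonic mean. A minimal NumPy version, assuming `pred` and `gt` are (N, 3) and (M, 3) arrays of points sampled from the two shapes:

```python
import numpy as np

def f1_at_threshold(pred, gt, tau=0.05):
    # pairwise distances between predicted and ground-truth points: (N, M)
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = (d.min(axis=1) < tau).mean()  # pred points near some GT point
    recall = (d.min(axis=0) < tau).mean()     # GT points near some pred point
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The score lies in [0, 1]; the table above reports it scaled by 100.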
2.5 Hyperparameter Tuning
2.5.1 Custom Mesh (Chair)

2D Image

Predicted

Groundtruth

2D Image

Predicted

Groundtruth
Analysis
Since we need to predict chairs, I wanted to see the effect of taking a random chair as the initial mesh.
I observed that convergence became much faster and the qualitative results also generally look better. Quantitatively, though, the F1 score increased only marginally.
This may be because the model does extremely well on chairs similar to the initial mesh, while chairs quite different from it may require far larger deformations.
Another drawback of using a chair as the initial mesh is that it requires prior knowledge of the problem, whereas the sphere mesh is more general.
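The deformation setup this experiment changes can be sketched as follows: the network predicts per-vertex offsets for a fixed template mesh, so swapping the sphere for a chair template changes only the initial vertices and faces. A minimal sketch, with a toy tetrahedron standing in for the template and dummy offsets in place of network output:

```python
import numpy as np

# template mesh: vertices (V, 3) and faces (F, 3); a toy tetrahedron
# stands in here for the chair (or sphere) template
verts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]])
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])

# stand-in for network-predicted per-vertex offsets
offsets = 0.05 * np.ones_like(verts)

# only the vertices move; faces (connectivity) stay fixed, which is why
# holes absent from the template cannot be created by deformation
deformed_verts = verts + offsets
```

Since connectivity is frozen, a chair template starts much closer to the target topology than a sphere, which is consistent with the faster convergence observed.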
2.6 Interpret your model
2.6.1 Prediction Confidence of Different Parts
This shows visualizations of objects predicted by the voxel method at different isovalues, which reveal the model's confidence in different parts of the chair.
As can be seen, the model is least confident in the legs: a low isovalue is required to capture the leg vertices, and even then they are not very accurate.
The model is most confident in the seat and back-rest of the chair, as these are features common to most chairs.
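Thresholding the predicted occupancy grid at an isovalue can be sketched as below; random probabilities stand in for the network's 32x32x32 output. A lower isovalue keeps low-confidence regions (such as the legs) at the cost of spurious voxels:

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.random((32, 32, 32))  # stand-in for predicted occupancy probabilities

for iso in (0.1, 0.2, 0.6, 0.7):
    occupied = probs > iso  # voxels the model asserts at this confidence level
    print(f"isovalue={iso}: {occupied.sum()} occupied voxels")
```

A surface mesh would then be extracted from `occupied` (e.g. via marching cubes); raising the isovalue shrinks the prediction to only the most confident parts.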

Groundtruth Mesh

Voxel @ isovalue = 0.1

Voxel @ isovalue = 0.2

Voxel @ isovalue = 0.6

Voxel @ isovalue = 0.7
2.6.2 Effect of amount of input features at inference
This shows the effect of using only a random subset of the 512-dimensional encoded 2D image features for prediction. Specifically, I ran inference on a single image using the point cloud network,
assigning zeros to random indices of the 512-dimensional ResNet output for the 2D image. This modified encoding is then passed to the rest of the network and the prediction is visualized.
I zeroed out different percentages of indices, as noted in the captions.
It is observed that when the encoded input is all zeros, the network still predicts a valid chair, though one far from the ground truth. This shows the network has learned a mean chair representation even without the input 2D image.
Also, with just 25 percent of the features intact, the model predicts a chair close to the ground truth. This can mean two things: 1) the ResNet features are informative and the full 512-d vector may not be required;
2) the network is robust and can infer efficiently from a sparse feature set.
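The masking step of this experiment can be sketched as below, a minimal version assuming `feats` is the 512-d ResNet encoding and `frac` is the share of indices to zero out (the function name is mine, not from the actual code):

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_random_indices(feats, frac):
    """Zero out a random fraction `frac` of the encoded image features."""
    out = feats.copy()
    n_zero = int(frac * feats.shape[0])
    idx = rng.choice(feats.shape[0], size=n_zero, replace=False)
    out[idx] = 0.0
    return out

feats = rng.standard_normal(512)           # stand-in for the ResNet output
masked = zero_random_indices(feats, 0.75)  # the "random 75% zeros" setting
# `masked` would then be fed to the decoder in place of `feats`
```

Sweeping `frac` over 1.0, 0.75, 0.5, and 0.25 reproduces the settings shown in the captions below.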

Groundtruth

Image Feats (all zeros)

Image Feats (random 75% zeros)

Image Feats (random 50% zeros)

Image Feats (random 25% zeros)