1. Exploring loss functions
For the GIFs shown below, the left and right graphics represent the target structure and the optimized structure, respectively.
1.1. Fitting a voxel grid (5 points)
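Fitting a voxel grid amounts to minimizing a binary cross-entropy between the predicted occupancies and the binary target grid. A minimal NumPy sketch of this loss, assuming the prediction is given as logits (the assignment's actual loss code may differ in details):

```python
import numpy as np

def voxel_bce_loss(pred_logits, target):
    """Binary cross-entropy between predicted occupancy logits and a
    binary target voxel grid."""
    p = 1.0 / (1.0 + np.exp(-pred_logits))   # sigmoid -> occupancy probability
    p = np.clip(p, 1e-7, 1.0 - 1e-7)         # numerical safety for the logs
    return -np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

# Toy 2x2x2 grid with a single occupied cell
target = np.zeros((2, 2, 2))
target[0, 0, 0] = 1.0
confident_right = np.where(target > 0.5, 8.0, -8.0)   # logits agreeing with target
confident_wrong = -confident_right                    # logits disagreeing with target
```

Minimizing this loss drives the predicted occupancy probabilities toward the target's 0/1 pattern.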
1.2. Fitting a point cloud (10 points)
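Fitting a point cloud minimizes the Chamfer distance between the optimized and target point sets. A minimal NumPy sketch of the symmetric (squared) Chamfer distance:

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric squared Chamfer distance between point sets a (N, 3) and
    b (M, 3): mean nearest-neighbour squared distance in both directions."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)   # (N, M) pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = a + np.array([0.0, 0.5, 0.0])   # the same points shifted by 0.5 in y
```

The brute-force pairwise matrix is fine for small clouds; production code (e.g. pytorch3d's implementation) uses batched nearest-neighbour queries instead.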
1.3. Fitting a mesh (5 points)
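Fitting a mesh combines a Chamfer term on points sampled from the surface with a smoothness regularizer on the vertices. A sketch of a uniform Laplacian smoothing term (assuming uniform neighbour weights; the exact regularizer used may differ):

```python
import numpy as np

def uniform_laplacian_loss(verts, faces):
    """Uniform Laplacian smoothing term: for each vertex, the squared norm of
    (mean of its neighbours - vertex), averaged over all vertices."""
    nbr_sum = np.zeros_like(verts)
    nbr_cnt = np.zeros(len(verts))
    edges = set()
    for f in faces:                          # collect unique undirected edges
        for i in range(3):
            u, v = f[i], f[(i + 1) % 3]
            edges.add((min(u, v), max(u, v)))
    for u, v in edges:
        nbr_sum[u] += verts[v]; nbr_cnt[u] += 1
        nbr_sum[v] += verts[u]; nbr_cnt[v] += 1
    lap = nbr_sum / nbr_cnt[:, None] - verts  # uniform Laplacian per vertex
    return (lap ** 2).sum(axis=1).mean()

# Toy mesh: a regular tetrahedron
verts = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)
faces = [[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]]
```

Shrinking the mesh toward its centroid lowers this term, which is exactly why it must be balanced against the Chamfer loss.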
2. Reconstructing 3D from single view
For the GIFs shown below, the leftmost figure is the input image, the middle figure is a render of the ground-truth mesh, and the rightmost figure is the prediction (voxel grid, point cloud, or mesh).
2.1. Image to voxel grid (15 points)
2.2. Image to point cloud (15 points)
2.3. Image to mesh (15 points)
2.4. Quantitative comparisons (10 points)
F1 score comparison
F1 score @ 0.05 threshold of 3D reconstruction:
Voxel grid: 85.10
Point cloud: 97.32
Mesh: 89.88
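For reference, a sketch of how an F1 score at a distance threshold is typically computed from two point sets (the scores above are reported as percentages; the assignment's evaluation code may differ in details):

```python
import numpy as np

def f1_at_threshold(pred, gt, tau=0.05):
    """F1 score at distance threshold tau: precision is the fraction of
    predicted points within tau of the ground truth; recall is the fraction
    of ground-truth points within tau of the prediction."""
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)
    precision = (np.sqrt(d2.min(axis=1)) < tau).mean()
    recall = (np.sqrt(d2.min(axis=0)) < tau).mean()
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gt = np.random.RandomState(0).rand(100, 3)   # toy ground-truth point set
```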
Intuitive explanation
From the above, we observe that the point cloud representation has the best F1-score, followed closely by the mesh. The voxel grid has the lowest F1-score.
Why did voxel grids perform the worst?
Voxel grids are limited by the resolution of the grid: our models are constrained to a 32x32x32 grid. Moreover, since the vertices and faces are extracted from the voxel grid by the Marching Cubes algorithm, its approximation introduces additional error. As a result, voxel grids fail to capture thin or finely detailed structures. On the chairs dataset we used, the most common failures are missing chair legs, or any region of the chair that requires fine (thin) detail. For this reason, it is preferable either to increase the voxel-grid resolution (at the cost of memory and compute) or to use the point cloud or mesh representation instead.
Model: The proposed model consists of deconvolution stages (ConvTranspose3d), each followed by BatchNorm and a ReLU nonlinearity. The final layer applies a sigmoid so that the output occupancies can be trained with the binary cross-entropy loss.
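A sketch of such a decoder, assuming a 512-dimensional latent code and illustrative channel widths (not the exact architecture used):

```python
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    """ConvTranspose3d stages with BatchNorm + ReLU, ending in a sigmoid
    over a 32^3 occupancy grid. Layer widths are illustrative assumptions."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 256 * 2 * 2 * 2)
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1),  # 2 -> 4
            nn.BatchNorm3d(128), nn.ReLU(),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1),   # 4 -> 8
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1),    # 8 -> 16
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.ConvTranspose3d(32, 1, 4, stride=2, padding=1),     # 16 -> 32
            nn.Sigmoid(),                                          # occupancy probabilities
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 256, 2, 2, 2)
        return self.deconv(x)
```

With kernel 4, stride 2, and padding 1, each stage exactly doubles the spatial resolution, so four stages take the 2^3 seed to the 32^3 output grid.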
The mesh representation came a close second.
The mesh representation did not perform as poorly as voxels, but its performance was limited by the initialization procedure. We stuck with an ico-sphere initialization, which may again have limited the model's ability to capture high-frequency (finer) details on our dataset. Moreover, the mesh representation required tedious hyperparameter tuning, since its objective comprises two losses: the Chamfer loss and the Laplacian smoothing loss. Two loss terms mean two tuning knobs, which may explain the difficulty of training this representation.
Model: The proposed model was similar to the model designed for the point cloud (explained below). The major difference was that the number of points was dictated by the ico-sphere initialization.
The point cloud representation performed the best.
Since the point cloud representation faces none of these limitations (grid resolution, ico-sphere initialization, or tedious hyperparameter tuning), it achieved the highest F1-score. The proposed point cloud model was therefore able to capture high-frequency details and reported the best F1-score numbers.
Model: The proposed model consisted of deconvolution stages of dimensions [4096, 3000, 2048, 1000, 1024, 512], each followed by a ReLU nonlinearity. Finally, a fully connected layer transforms the output into shape (-1 x N x 3), where N is the number of points in the point cloud. This model is implemented in pointcloud_modules.py.
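A sketch of this decoder, interpreting the listed dimensions as fully-connected layer widths and assuming a 512-dimensional latent input (both are assumptions made for illustration; see pointcloud_modules.py for the actual implementation):

```python
import torch
import torch.nn as nn

class PointCloudDecoder(nn.Module):
    """MLP decoder whose hidden widths follow the dimensions listed above,
    ending in a fully connected layer reshaped to (-1, N, 3)."""
    def __init__(self, latent_dim=512, n_points=2500):
        super().__init__()
        self.n_points = n_points
        widths = [latent_dim, 4096, 3000, 2048, 1000, 1024, 512]
        layers = []
        for w_in, w_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Linear(w_in, w_out), nn.ReLU()]
        layers.append(nn.Linear(widths[-1], n_points * 3))  # final layer: N*3 coordinates
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z).view(-1, self.n_points, 3)
```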
2.5. Analyse effects of hyperparameter variations (10 points)
Results by varying n_points
@ n = 2500 | F1-score = 89.11
@ n = 5000 | F1-score = 91.89
@ n = 10000 | F1-score = 94.54
@ n = 20000 | F1-score = 97.32
As seen above, increasing the number of points in the point cloud representation increases the F1-score of the trained model. As the number of points grows, more points are sampled from the target mesh and used to compute the Chamfer distances in the loss. A model trained with more points therefore captures high-frequency features better than a model trained with fewer points.
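A toy 2D illustration of this effect: sampling the target shape more densely shrinks the Chamfer gap between the sample set and the underlying surface (here a unit circle stands in for a mesh surface):

```python
import numpy as np

def chamfer(a, b):
    """Symmetric squared Chamfer distance between 2D point sets."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def circle(n):
    """n points evenly spaced on the unit circle (stand-in for a sampled mesh)."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return np.stack([np.cos(t), np.sin(t)], axis=1)

dense_target = circle(2000)                     # proxy for the continuous surface
gap_coarse = chamfer(circle(50), dense_target)  # sparse sampling
gap_fine = chamfer(circle(500), dense_target)   # dense sampling
```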
Results by varying w_chamfer
Varying this parameter changes the relative weight the network gives to the Chamfer loss versus the smoothness loss. When the w_smooth parameter was increased substantially in proportion to w_chamfer, we observed that the F1-score started to decrease, even though the resulting meshes were visually pleasing, with smooth surfaces. This correlates with the visual observation that the chair surfaces became smoother, filled over pointy regions, and lost their finer details.
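The two tuning knobs combine into a single objective as a weighted sum, which can be sketched as follows (the default weights here are illustrative, not the tuned values):

```python
def mesh_loss(chamfer_term, laplacian_term, w_chamfer=1.0, w_smooth=0.1):
    """Total mesh-fitting objective: Chamfer (accuracy) plus Laplacian
    smoothing (regularity). Raising w_smooth relative to w_chamfer trades
    geometric accuracy for smoother surfaces. Defaults are illustrative."""
    return w_chamfer * chamfer_term + w_smooth * laplacian_term
```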
2.6. Interpret your model (15 points)
To interpret what our model is learning, we created visualizations of how an input voxel grid, mesh, or point cloud deforms to fit the given ground truth. In other words, to sanity-check the model and network, we overfit the input data; visualizing this overfitting indicates whether any tweaks to the proposed network are required. In these visualizations, the GIFs show a mesh or point cloud deforming to fit the input data as training progresses. Since our model is able to overfit all the given input types, this supports our hypothesis that the model has sufficient capacity to learn and to generalize to new categories.
In the following animations, the leftmost figure is the input image, and the GIF on the right visualizes how the input data is overfit by the proposed networks.
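The same sanity check can be sketched in a few lines: drive a model to fit one fixed target and verify that the loss collapses to (near) zero. Here the directly optimized point set and squared-error loss are tiny stand-ins for the real networks and losses:

```python
import torch

torch.manual_seed(0)
target = torch.randn(100, 3)                    # stands in for one training sample
pred = torch.zeros(100, 3, requires_grad=True)  # "model output" optimized directly
opt = torch.optim.SGD([pred], lr=0.4)

for _ in range(100):
    opt.zero_grad()
    loss = ((pred - target) ** 2).sum()         # stand-in for the real fitting loss
    loss.backward()
    opt.step()
```

If the loss does not collapse in this setting, something is wrong with the network or the optimization before generalization is even in question.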
Interpreting the PointCloud model
Interpreting the Mesh model
Interpreting the Voxel grid model