Assignment 2: Single View to 3D
1. Exploring loss functions
1.1 Fitting a voxel grid
To visualize a voxel grid in PyTorch3D, I used the "cubify" op to convert the voxel grid into a cube mesh and rendered that mesh. The left side is the predicted voxel grid and the right side is the target voxel grid.
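A minimal sketch of this visualization step is shown below. The tensors here are hypothetical placeholders; in the assignment the predicted grid comes from the fitting loop.

```python
import torch
from pytorch3d.ops import cubify

# Hypothetical occupancy grids of shape (1, 32, 32, 32) with values in [0, 1].
pred_voxels = torch.rand(1, 32, 32, 32)
gt_voxels = (torch.rand(1, 32, 32, 32) > 0.5).float()

# cubify turns every voxel whose occupancy exceeds the threshold into a small
# cube and merges the cubes into a single Meshes object, which can then be
# rendered with the usual PyTorch3D mesh renderer.
pred_mesh = cubify(pred_voxels, thresh=0.5)
gt_mesh = cubify(gt_voxels, thresh=0.5)
```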


1.2 Fitting a point cloud
The point cloud is fitted by minimizing the chamfer distance. The results are shown below (left: prediction, right: ground truth).
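A minimal sketch of a symmetric chamfer loss, assuming knn_points from PyTorch3D; the exact reduction (summing vs. averaging the two directions) is an implementation choice.

```python
import torch
from pytorch3d.ops import knn_points

def chamfer_loss(pred_points: torch.Tensor, gt_points: torch.Tensor) -> torch.Tensor:
    """Symmetric (squared-distance) chamfer loss between two batched point clouds.

    pred_points: (B, N, 3), gt_points: (B, M, 3).
    """
    # Squared distance from each predicted point to its nearest ground-truth point ...
    pred_to_gt = knn_points(pred_points, gt_points, K=1).dists.squeeze(-1)  # (B, N)
    # ... and from each ground-truth point to its nearest predicted point.
    gt_to_pred = knn_points(gt_points, pred_points, K=1).dists.squeeze(-1)  # (B, M)
    return pred_to_gt.mean() + gt_to_pred.mean()
```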


1.3 Fitting a mesh
The mesh is fitted with the chamfer distance plus a smoothness loss. The results are shown below (left: prediction, right: ground truth).
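A minimal sketch of the combined objective, assuming the smoothness term is PyTorch3D's uniform Laplacian smoothing and using the built-in chamfer_distance for brevity (the hand-rolled version from the point cloud sketch works equally well).

```python
import torch
from pytorch3d.structures import Meshes
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

def mesh_fitting_loss(pred_mesh: Meshes, gt_points: torch.Tensor,
                      w_smooth: float = 0.8) -> torch.Tensor:
    # Sample points on the predicted mesh surface so the chamfer term can be
    # computed against the ground-truth point samples of shape (B, M, 3).
    pred_points = sample_points_from_meshes(pred_mesh, num_samples=gt_points.shape[1])
    loss_chamfer, _ = chamfer_distance(pred_points, gt_points)
    # Uniform Laplacian smoothing penalizes jagged, spiky surfaces.
    loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    return loss_chamfer + w_smooth * loss_smooth
```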


2. Reconstructing 3D from single view
2.1 Image to voxel grid
The model was trained for 15k iterations. The results are shown below (left: RGB image input; middle: prediction; right: ground truth).









2.2 Image to point cloud
The model was trained for 15k iterations. The results are shown below (left: RGB image input; middle: prediction; right: ground truth).









2.3 Image to mesh
The model was trained for 30k iterations. The results are shown below (left: RGB image input; middle: prediction; right: ground truth).









2.4 Quantitative comparisons
The average F1 scores are 79.104 for the voxel model, 89.223 for the point cloud model, and 90.868 for the mesh model (with w_smooth = 0.8).
The difference between the mesh and point cloud models is not that large; only after increasing 'w_smooth' did the mesh F1 score become slightly better than the point cloud one. Both the point cloud and the mesh are trained with losses computed against points sampled from the ground truth (plus the smoothness term for the mesh). Their F1 scores are much better than the voxel model's, because the F1 evaluation also compares the prediction against sampled points. The point cloud is slightly less accurate than the mesh because it lacks connectivity information. The voxel model's F1 score is the lowest of the three, which is also visible in the predictions: the voxel outputs are coarser than the other two. Since the voxel resolution is only 32, the prediction accuracy is much lower.
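For reference, the F1 metric compares two sampled point sets at a fixed distance threshold: precision is the fraction of predicted points close to some ground-truth point, recall the converse. A minimal sketch (the 0.05 threshold and the x100 scaling are assumptions matching the reported percentage-style scores; voxel and mesh predictions are converted to point samples before scoring):

```python
import torch

def f1_score(pred_points: torch.Tensor, gt_points: torch.Tensor,
             threshold: float = 0.05) -> torch.Tensor:
    """F1 between two point sets of shape (N, 3) and (M, 3) at a distance threshold."""
    dists = torch.cdist(pred_points, gt_points)  # (N, M) pairwise distances
    # Precision: fraction of predicted points within `threshold` of some GT point.
    precision = (dists.min(dim=1).values < threshold).float().mean()
    # Recall: fraction of GT points within `threshold` of some predicted point.
    recall = (dists.min(dim=0).values < threshold).float().mean()
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)
```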
2.5 Analyse effects of hyperparameter variations
The hyperparameter I varied is "n_points", the number of points used to train the point cloud model. The values I tried are 1500, 3000, and 5000.
Overall, training with 3000 points produced the highest F1 score; those results are shown in the section above. I also trained with 1500 and 5000 points and evaluated the models at different iterations. The evaluation results for the different settings are summarized below.
- 1500 points:
  - Model at 14000 iter: 85.675
  - Model at 20000 iter: 88.154
  - Model at 30000 iter: 87.534
- 3000 points:
  - Model at 10000 iter: 88.924
  - Model at 14000 iter: 89.223
- 5000 points:
  - Model at 14000 iter: 88.516
  - Model at 20000 iter: 88.277
  - Model at 30000 iter: 87.713
In general, training with more points tends to give a higher F1 score, since generating more points allows the prediction to align with the ground-truth mesh better. From the experiments above, the model trained with 1500 points always has a lower F1 than the other two. With more points, fewer iterations are needed to reach a good result, whereas with fewer points many more iterations are needed to achieve comparable quality.
The renderings below show the results from 5000 points and 1500 points at 14000 and 30000 iterations. The first row shows the renderings at 14000 iterations (left: 5000 pts, right: 1500 pts); the second row shows the renderings at 30000 iterations (left: 5000 pts, right: 1500 pts). From the renderings, we can also see that with fewer iterations the model trained with more points already gives relatively good results, while the model trained with fewer points is still noticeably worse.




2.6 Interpret your model
I interpret the voxel model by looking at the weights it has learned. When interpreting a 2D convolutional network, a common approach is to plot the conv layers' weights and inspect what they have learned: the first several layers usually learn something general (e.g. edges), while the last few layers learn features specific to the dataset.
Inspired by this, I want to see what the 3D model has learned in its decoder. In my implementation, the voxel-grid decoder is defined as below (a PyTorch sketch of one possible reading follows the list):
- deconv(32, 64, 4, 4, 4)
- deconv(16, 32, 4, 4, 4)
- deconv(8, 16, 4, 4, 4)
- deconv(1, 8, 4, 4, 4)
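As a concrete reading of these tuples, here is a sketch of the decoder, assuming each entry lists (out_channels, in_channels, kernel size) so the channels chain 64 -> 32 -> 16 -> 8 -> 1; the strides, paddings, and activations are illustrative assumptions not specified above.

```python
import torch.nn as nn

# Sketch of the voxel decoder, assuming the tuples above are
# (out_channels, in_channels, kernel) read in forward order.
decoder = nn.Sequential(
    nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose3d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose3d(16, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose3d(8, 1, kernel_size=4, stride=2, padding=1),  # occupancy logits
)
```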
From the layer definitions above, each deconv layer's weight can be interpreted as a set of voxel grids of size 4, with the second channel dimension seen as different views of each voxel. I then plotted the 3D deconv layers' weights. Since I am interested in the trend across layers, I plotted the first decoder layer's weights (first row below) and the last decoder layer's weights (second row below). The first layer's weight can be interpreted as 32 voxels with 64 views each, and the last layer's weight as 1 voxel with 8 views. I picked only 8 example rendered voxels per layer, as shown below (top: first layer, bottom: last layer).
In the decoder, the last layer's weights focus on more general structure, while the first layer's weights are specific to particular structures. This makes sense, since the decoder can be treated as a reversed version of the convolutional encoder (a bottleneck structure): when reconstructing the voxel grid from the feature map, the order of abstraction is reversed.
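A sketch of how these kernels can be extracted and turned into renderable voxels, continuing from the decoder sketch above; the normalization and cubify threshold are my own choices, and only a few kernels are actually rendered in the figures below.

```python
import torch
from pytorch3d.ops import cubify

def kernels_to_meshes(weight: torch.Tensor, thresh: float = 0.5):
    """Turn a ConvTranspose3d weight of shape (C_in, C_out, k, k, k) into
    cubified meshes, one per (C_in, C_out) kernel."""
    grids = weight.detach().reshape(-1, *weight.shape[2:])  # (C_in * C_out, k, k, k)
    # Rescale to [0, 1] so cubify's occupancy threshold is meaningful.
    grids = (grids - grids.amin()) / (grids.amax() - grids.amin() + 1e-8)
    return cubify(grids, thresh=thresh)

# First and last deconv layers of the `decoder` defined in the previous sketch.
first_layer_meshes = kernels_to_meshes(decoder[0].weight)   # 64 * 32 kernel meshes
last_layer_meshes = kernels_to_meshes(decoder[-1].weight)   # 8 * 1 kernel meshes
```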















