Luyuan Wang (luyuanw@andrew.cmu.edu)
Zero late days used.
| | The original input | After optimization | Target (ground truth) |
|---|---|---|---|
Voxel Grid | ![]() | ![]() | ![]() |
Point Cloud | ![]() | ![]() | ![]() |
Mesh | ![]() | ![]() | ![]() |
id | Input image | Predicted voxel grid | Ground truth mesh |
---|---|---|---|
0 | ![]() | ![]() | ![]() |
1 | ![]() | ![]() | ![]() |
5 | ![]() | ![]() | ![]() |
id | Input image | Predicted point cloud | Ground truth mesh |
---|---|---|---|
0 | ![]() | ![]() | ![]() |
1 | ![]() | ![]() | ![]() |
5 | ![]() | ![]() | ![]() |
id | Input image | Predicted mesh | Ground truth mesh |
---|---|---|---|
0 | ![]() | ![]() | ![]() |
1 | ![]() | ![]() | ![]() |
5 | ![]() | ![]() | ![]() |
Voxel grid F1 score (avg) | Point cloud F1 score (avg) | Mesh F1 score (avg) |
---|---|---|
83.685 | 93.995 | 91.601 |
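For reference, the F1 scores reported throughout this report are point-based reconstruction metrics: points sampled from the prediction and from the ground truth are matched within a distance threshold, and precision/recall are combined into an F1 value. Below is a minimal sketch of how such a score is commonly computed; the 0.05 threshold and the percentage scaling are assumptions, not necessarily the exact settings used for the numbers above.

```python
import torch

def f1_score(pred_pts, gt_pts, threshold=0.05):
    """Point-based reconstruction F1 (in percent).

    pred_pts: (N, 3) predicted points, gt_pts: (M, 3) ground-truth points.
    The threshold value is an assumption, not necessarily the one used above.
    """
    dists = torch.cdist(pred_pts, gt_pts)  # (N, M) pairwise distances
    precision = (dists.min(dim=1).values < threshold).float().mean() * 100
    recall = (dists.min(dim=0).values < threshold).float().mean() * 100
    return 2 * precision * recall / (precision + recall + 1e-8)
```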
First of all, this comparison is not perfectly fair, since the decoder architecture differs across the voxel grid, point cloud, and mesh outputs. For the voxel grid I used deconvolutional layers in the decoder, while for the point cloud and mesh I used only linear layers. The hyperparameters also differ between the three settings. In general, I would expect the prediction difficulty to be: voxel < point cloud < mesh. A rough sketch of the two decoder styles is given below.
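The following is a minimal sketch of the two decoder styles described above, assuming a 512-dimensional latent vector, a 32³ voxel grid, and an N×3 point cloud output; the actual layer sizes and channel counts I used are not reproduced here.

```python
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    """Deconvolutional decoder: latent vector -> 32^3 occupancy grid (sizes assumed)."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 4 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1),  # 4 -> 8
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),   # 8 -> 16
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.ConvTranspose3d(32, 1, kernel_size=4, stride=2, padding=1),    # 16 -> 32
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 128, 4, 4, 4)
        return self.deconv(x)  # occupancy logits over a 32x32x32 grid

class PointDecoder(nn.Module):
    """Fully-connected decoder: latent vector -> N x 3 point cloud (sizes assumed).
    The same style of MLP head can regress mesh vertex offsets instead of points."""
    def __init__(self, latent_dim=512, n_points=5000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3),
        )

    def forward(self, z):
        return self.mlp(z).view(-1, self.n_points, 3)
```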
Predicting a point cloud | # points = 5000 | # points = 2000 |
---|---|---|
Avg F1 score | 93.995 | 92.178 |
Reducing the number of points produces a worse result. This may be because a smaller set of points has a harder time representing the complex shape of the object.
I visualized the first and last layer outputs of the voxel decoder. The input is an RGB chair image.
The first layer:
The last layer:
The decoder contains several 3D deconvolutional layers, with batch norm and ReLU layers in between. The visualization shows only the first channel of each 3D feature map. Since each feature map is three-dimensional, I reshaped it into 2D for easier visualization. From the images above, we can see that the output of the first decoder layer is very abstract. By the last layer, however, a chair-like structure appears in the center of the feature map, which suggests the network is gradually converting the high-level abstract latent vector into a concrete, detailed 3D model.
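A minimal sketch of how such a visualization can be produced is shown below; the tiling scheme (depth slices placed side by side) and the hook-based way of grabbing layer outputs are assumptions, not necessarily how my actual code works.

```python
import matplotlib.pyplot as plt
import torch

def visualize_feature_map(fmap, out_path):
    """Flatten one channel of a 3D feature map (C, D, H, W) into a 2D image.

    The depth slices are tiled along the width so the whole volume
    fits into a single figure.
    """
    vol = fmap[0].detach().cpu()                   # first channel, shape (D, H, W)
    d, h, w = vol.shape
    img = vol.permute(1, 0, 2).reshape(h, d * w)   # tile the D slices along the width
    plt.imshow(img, cmap="viridis")
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()

# Example: capture the first and last deconv layer outputs with forward hooks
# (first_layer and last_layer are hypothetical handles into the decoder).
# activations = {}
# first_layer.register_forward_hook(lambda m, i, o: activations.update(first=o[0]))
# last_layer.register_forward_hook(lambda m, i, o: activations.update(last=o[0]))
```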