16-889 Project 2

Luyuan Wang (luyuanw@andrew.cmu.edu)

Zero late days used.

1.1 - 1.3 Fitting a voxel grid / point cloud / mesh

Representation | The original input | After optimization | Target (ground truth)
Voxel Grid | vox_origin | vox_src | vox_target
Point Cloud | pc_origin | pc_origin | pc_tgt
Mesh | mesh_origin | mesh_src | mesh_tgt

2.1 Image to voxel grid

id | Input image | Predicted voxel grid | Ground truth mesh
0 | 0_mesh_img | 0_vox_pred | 0_vox_gt
1 | 1_mesh_img | 1_vox_pred | 1_vox_gt
5 | 5_mesh_img | 5_vox_pred | 5_vox_gt

2.2 Image to point cloud

id | Input image | Predicted point cloud | Ground truth mesh
0 | 0_mesh_img | 0_point_pred | 0_vox_gt
1 | 1_mesh_img | 1_point_pred | 1_vox_gt
5 | 5_mesh_img | 5_point_pred | 5_vox_gt

2.3 Image to mesh

id | Input image | Predicted mesh | Ground truth mesh
0 | 0_mesh_img | 0_mesh_pred | 0_vox_gt
1 | 1_mesh_img | 1_mesh_pred | 1_vox_gt
5 | 5_mesh_img | 5_mesh_pred | 5_vox_gt

2.4 Quantitative comparisons

Voxel grid F1 score (avg) | Point cloud F1 score (avg) | Mesh F1 score (avg)
83.685 | 93.995 | 91.601
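For reference, an F1 score between a predicted and a ground-truth shape is commonly computed from points sampled on each: precision is the share of predicted points that lie within a distance threshold of some ground-truth point, recall is the same measure in the other direction, and F1 is their harmonic mean. The sketch below illustrates that idea in plain Python. The function name `f1_score`, the threshold value 0.05, and the toy points are assumptions for illustration, not the evaluation code or settings used for the numbers above.

```python
import math


def f1_score(pred_points, gt_points, threshold=0.05):
    """F1 score (in percent) between two non-empty 3D point sets.

    The threshold is an assumed value, not the one used in this report.
    """

    def fraction_within(src, dst):
        # Share of points in src whose nearest neighbor in dst is closer
        # than the threshold (brute-force search, fine for small sets).
        hits = sum(
            1 for p in src
            if min(math.dist(p, q) for q in dst) < threshold
        )
        return hits / len(src)

    precision = fraction_within(pred_points, gt_points)
    recall = fraction_within(gt_points, pred_points)
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)


# Toy example: three of the four points match in each direction.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
pred = [(0.01, 0.0, 0.0), (1.0, 0.02, 0.0), (0.5, 0.5, 0.5), (0.0, 0.0, 1.0)]
score = f1_score(pred, gt)
print(score)
```

In this toy case precision and recall are both 3/4, so the score is 75. For voxel grids and meshes, the same function would be applied to points sampled from the predicted and ground-truth shapes.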

This comparison is not perfectly fair, because the decoder architecture differs across the three output types: the voxel grid decoder uses deconvolutional layers, whereas the point cloud and mesh decoders use only linear layers. The hyperparameters also differ. Generally, the prediction difficulty should be: voxel < point cloud < mesh.
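To make the architectural difference concrete, the sketch below compares how the two decoder styles arrive at their output size. A deconvolutional (transposed convolution) layer grows each spatial axis according to the standard size formula, while a linear decoder for a point cloud simply emits 3 coordinates per point. The layer settings (kernel 4, stride 2, padding 1, four layers, starting size 2) are hypothetical and are not the ones used in this project; only the point count of 5000 comes from this report.

```python
def deconv_out_size(size, kernel, stride, padding, output_padding=0):
    # Output length along one axis of a transposed convolution (dilation 1).
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Hypothetical voxel decoder: each layer doubles the grid resolution.
size = 2
sizes = [size]
for _ in range(4):
    size = deconv_out_size(size, kernel=4, stride=2, padding=1)
    sizes.append(size)
print(sizes)  # grid edge length after each layer

# Linear point cloud decoder: the last layer outputs 3 values per point,
# which are then reshaped to (n_points, 3).
n_points = 5000
linear_out_features = n_points * 3
print(linear_out_features)
```

With these assumed settings the grid edge grows 2, 4, 8, 16, 32, whereas the linear decoder produces all 15000 coordinates in a single flat vector with no spatial structure built in.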

2.5 Effects of hyperparameter variations

Predicting a point cloud | # points = 5000 | # points = 2000
Avg F1 score | 93.995 | 92.178

Reducing the number of points from 5000 to 2000 lowered the average F1 score from 93.995 to 92.178. This may be because fewer points cannot represent a complex object shape as well.

2.6 Interpret the model

I visualized the first and last layer outputs of the voxel decoder. The input is an RGB chair image.

The first layer:

vis_layer_first

The last layer:

vis_layer_last

The decoder contains several 3D deconvolutional layers, with batch norm and ReLU layers in between. The visualization shows only the first channel of each 3D feature map. Because the feature map has three dimensions, I reshaped it into 2D for easier viewing. From the images above, we can see that the output of the first decoder layer is very abstract. In the last layer, however, a chair-like structure appears in the center of the feature map, which suggests that the network is converting a high-level abstract latent vector into a detailed, concrete 3D model.
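One simple way to flatten a single-channel 3D feature map into a 2D image is to lay its depth slices side by side. The sketch below shows that layout on a tiny volume, using nested Python lists as a stand-in for the tensor. The helper name `tile_volume` and the side-by-side layout are assumptions; the exact reshape used for the figures above is not specified.

```python
def tile_volume(volume):
    """Turn a D x H x W volume into an H x (D * W) image.

    Row h of the image is row h of every depth slice, concatenated
    from the first slice to the last.
    """
    depth = len(volume)
    height = len(volume[0])
    return [
        [value for d in range(depth) for value in volume[d][h]]
        for h in range(height)
    ]


# Toy volume with D=2, H=2, W=3 (stand-in for one feature-map channel).
volume = [
    [[0, 1, 2], [3, 4, 5]],
    [[6, 7, 8], [9, 10, 11]],
]
image = tile_volume(volume)
for row in image:
    print(row)
```

Each slice keeps its own rows and columns intact, so any spatial structure inside a slice (such as a chair outline) remains visible in the tiled image.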