Q1.1
ground truth voxel:

fit voxel:

Q1.2
ground truth pointcloud:

fit pointcloud:

Q1.3
ground truth mesh:

fit mesh:

Q2.1
rgb image:

render of the predicted voxel:

ground truth mesh:

Q2.2
rgb image:

render of the predicted 3D point cloud:

ground truth mesh:

Q2.3
rgb image:

render of the predicted mesh:

ground truth mesh:

Q2.4
Avg F1 voxel: 61.984
Avg F1 pointcloud: 88.874
Avg F1 mesh: 87.061
test F1 score: pointcloud > mesh > voxel
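For reference, here is a minimal sketch of how an F1@threshold score between two sampled point sets can be computed; this helper is my own illustration (the function name and the threshold value are assumptions), not the assignment's exact evaluation code.

```python
import torch

def f1_score(pred_pts, gt_pts, threshold=0.05):
    """F1 between point sets pred_pts (N, 3) and gt_pts (M, 3) at a distance threshold.

    precision: fraction of predicted points within `threshold` of some GT point.
    recall:    fraction of GT points within `threshold` of some predicted point.
    """
    dists = torch.cdist(pred_pts, gt_pts)  # pairwise distances, shape (N, M)
    precision = (dists.min(dim=1).values < threshold).float().mean()
    recall = (dists.min(dim=0).values < threshold).float().mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```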
In my opinion, a point cloud is the easiest representation for the network to learn, since we can see it as a completion of the depth map.
That is to say, the network can first predict the depth of the part it can see in the image, unproject it to world coordinates, and then complete the occluded part,
which is not that hard for the network; the sketch below illustrates the unprojection step.
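To make the depth-unprojection intuition concrete, here is a minimal sketch of lifting a depth map into a world-space point cloud with a pinhole camera; the intrinsics (fx, fy, cx, cy) and the cam_to_world pose are illustrative assumptions, not values from my pipeline.

```python
import torch

def unproject_depth(depth, fx, fy, cx, cy, cam_to_world):
    """depth: (H, W) depth map; cam_to_world: (4, 4) pose. Returns (H*W, 3) points."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth.reshape(-1)
    x = (u.reshape(-1) - cx) / fx * z  # pinhole model: X = (u - cx) * Z / fx
    y = (v.reshape(-1) - cy) / fy * z
    pts_cam = torch.stack([x, y, z, torch.ones_like(z)], dim=1)  # homogeneous coords
    pts_world = (cam_to_world @ pts_cam.T).T[:, :3]              # camera -> world
    return pts_world
```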
For the mesh, the topology is fixed: the network only needs to predict per-vertex offsets, which is fairly straightforward to infer from the image. However, we add a smoothness loss to the training process, which causes the drop in F1 relative to the point cloud.
For voxels, however, it is quite hard for the network to relate occupancy in the 3D grid to the 2D image space.
Q2.5
I vary w_smooth. At the beginning of training for fitting meshes, the magnitude of chamfer_loss is around 1e2 while the magnitude of smoothness_loss is around 1e-1, which is very unbalanced.
So we definitely need a weight on smoothness_loss. Here are some results; a sketch of the weighted loss follows the renders.
We can see that as the smoothness weight increases, the learned mesh becomes smoother and more regular, but it loses accuracy in the overall shape.
rgb image:

ground truth mesh:

render of the predicted mesh with w_smooth=0.1:



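For reference, a minimal sketch of the weighted loss used when fitting meshes, built on PyTorch3D; the function name mesh_loss and the sampling count are my own illustration, while chamfer_distance and mesh_laplacian_smoothing are the PyTorch3D losses this experiment balances.

```python
import torch
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(pred_mesh, gt_pts, w_chamfer=1.0, w_smooth=0.1, n_samples=5000):
    """Weighted sum of chamfer distance and Laplacian smoothness.

    pred_mesh: a PyTorch3D Meshes object; gt_pts: (B, M, 3) ground-truth samples.
    """
    pred_pts = sample_points_from_meshes(pred_mesh, n_samples)  # (B, n_samples, 3)
    loss_chamfer, _ = chamfer_distance(pred_pts, gt_pts)        # ~1e2 early in training
    loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")  # ~1e-1
    return w_chamfer * loss_chamfer + w_smooth * loss_smooth
```

Raising w_smooth shifts this balance toward the regularizer, which explains the smoother but less accurate meshes above.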
Q2.6
I visualize the decoder features of the voxel prediction network.
My decoder structure is: ConvTranspose3d1->batchnorm3d1->relu1->ConvTranspose3d2->batchnorm3d2->relu2->ConvTranspose3d3->batchnorm3d3->relu3->ConvTranspose3d4->batchnorm3d4->relu4->ConvTranspose3d5->sigmoid
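A sketch of that decoder in PyTorch, reconstructed so that the intermediate shapes match the feature sizes listed below; the latent dimension (2048, reshaped to 256*2*2*2) and the stride-1 final layer are assumptions on my part.

```python
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    """Decoder sketch; only the relu2/relu3/relu4 shapes are taken from the report."""

    def __init__(self, latent_dim=2048):
        super().__init__()
        assert latent_dim == 256 * 2 * 2 * 2

        def up(cin, cout):  # doubles spatial size: (d-1)*2 - 2*1 + 4 = 2d
            return nn.Sequential(
                nn.ConvTranspose3d(cin, cout, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm3d(cout),
                nn.ReLU(),
            )

        self.block1 = up(256, 128)  # -> n*128*4*4*4
        self.block2 = up(128, 64)   # -> n*64*8*8*8    (relu2, feature1)
        self.block3 = up(64, 32)    # -> n*32*16*16*16 (relu3, feature2)
        self.block4 = up(32, 8)     # -> n*8*32*32*32  (relu4, feature3)
        self.final = nn.Sequential(  # keep 32^3 resolution, map to occupancy
            nn.ConvTranspose3d(8, 1, kernel_size=3, stride=1, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, z):
        x = z.view(-1, 256, 2, 2, 2)
        return self.final(self.block4(self.block3(self.block2(self.block1(x)))))
```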
feature1:
The feature size after relu2 is n*64*8*8*8.
First channel of feature1:

Mean over all channels of feature1:

feature2:
The feature size after relu3 is n*32*16*16*16.
First channel of feature2:

Mean over all channels of feature2:

feature3:
The feature size after relu4 is n*8*32*32*32.
First channel of feature3:

Mean over all channels of feature3:

We can see that the first-channel outputs look less organized, since the channels are not ordered in any meaningful way. However, looking at the mean over all channels, we can see that the decoder learns the voxel representation from coarse to fine; even after relu2, the rough shape of the chair is already visible. The visualizations were produced roughly as in the sketch below.
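The idea: register forward hooks on the decoder blocks, then plot a middle slice of the first channel and of the channel-wise mean. This is a minimal sketch; the decoder attribute names, the latent variable, and the slice-based plotting are my own illustrative choices.

```python
import matplotlib.pyplot as plt
import torch

features = {}

def save_hook(name):
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

# Assumed layer handles; adapt the attribute names to the actual decoder.
decoder.block2.register_forward_hook(save_hook("relu2"))
decoder.block3.register_forward_hook(save_hook("relu3"))
decoder.block4.register_forward_hook(save_hook("relu4"))

with torch.no_grad():
    decoder(latent)  # one forward pass fills `features`

for name, feat in features.items():
    first = feat[0, 0]      # first channel, e.g. (8, 8, 8) after relu2
    mean = feat[0].mean(0)  # mean over all channels, same spatial size
    mid = first.shape[0] // 2
    fig, axes = plt.subplots(1, 2)
    axes[0].imshow(first[mid].cpu())
    axes[0].set_title(f"{name}: channel 0")
    axes[1].imshow(mean[mid].cpu())
    axes[1].set_title(f"{name}: channel mean")
    fig.savefig(f"{name}.png")
```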