average test F1 score @ 0.05:
voxel: 89.468
point: 96.264
mesh: 79.046
I explored the decoder architecture a bit for the point cloud and mesh representations.
The first is a conv3d architecture very similar to the one proposed in Pix2Vox: Context-aware 3D Reconstruction from Single and Multi-view Images (Fig. 2).
The second is a vanilla MLP that maps the encoder latent vector directly to the output point cloud / mesh.
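The two decoder variants can be sketched roughly as below. This is an illustrative PyTorch sketch, not the exact configuration used in these experiments: the latent dimension, layer widths, grid resolution, and number of output points are all assumed values, and the conv3d branch only approximates the Pix2Vox-style upsampling path.

```python
# Illustrative sketch of the two decoders compared in the experiments.
# All sizes (latent_dim, widths, n_points) are assumptions, not the
# settings actually used.
import torch
import torch.nn as nn


class MLPDecoder(nn.Module):
    """Vanilla MLP: encoder latent vector -> N x 3 coordinates."""

    def __init__(self, latent_dim=128, n_points=256):
        super().__init__()
        self.n_points = n_points
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, n_points * 3),
        )

    def forward(self, z):
        # (B, latent_dim) -> (B, n_points, 3)
        return self.net(z).view(-1, self.n_points, 3)


class Conv3dDecoder(nn.Module):
    """Pix2Vox-style decoder: project the latent vector onto a small 3D
    grid, upsample it with transposed 3D convolutions, then map the grid
    features to output coordinates with a linear head."""

    def __init__(self, latent_dim=128, n_points=256):
        super().__init__()
        self.n_points = n_points
        self.fc = nn.Linear(latent_dim, 64 * 2 * 2 * 2)  # seed 2^3 grid
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 2^3 -> 4^3
            nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 4^3 -> 8^3
        )
        self.head = nn.Linear(16 * 8 ** 3, n_points * 3)

    def forward(self, z):
        x = self.fc(z).view(-1, 64, 2, 2, 2)
        x = self.deconv(x)
        # (B, 16, 8, 8, 8) -> (B, n_points, 3)
        return self.head(x.flatten(1)).view(-1, self.n_points, 3)
```

The intuition behind the comparison: the conv3d path forces the decoder to build an explicit spatial grid of features before predicting coordinates, whereas the MLP must encode all spatial structure in fully connected weights.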
Their corresponding test F1 scores (@ 0.05):
point with conv3d: 96.264
point with mlp: 89.733
mesh with conv3d: 81.844
mesh with mlp: 79.046
It seems that using conv3d brings a large benefit to point cloud prediction (+6.5 F1) and a smaller one to mesh prediction (+2.8 F1).