Zheren_Zhu
This section involves defining loss functions for fitting voxels, point clouds, and meshes.
In this subsection, we define a binary cross-entropy loss for fitting a 3D binary voxel grid. The loss functions are defined in the losses.py file.
vox_tgt: the ground-truth binary voxel grid
vox_src: the predicted voxel occupancy grid
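A minimal sketch of this loss, assuming vox_src holds sigmoid-activated occupancy probabilities (the function name and tensor shapes are my assumptions, not fixed by the handout):

```python
import torch.nn.functional as F

def voxel_loss(vox_src, vox_tgt):
    # vox_src: predicted occupancy probabilities in [0, 1], e.g. (B, 1, 32, 32, 32)
    # vox_tgt: ground-truth binary grid of the same shape
    return F.binary_cross_entropy(vox_src, vox_tgt.float())
```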
In this subsection, we define a chamfer loss for fitting a 3D point cloud. The loss functions are defined in the losses.py file.
pointcloud_tgt: the ground-truth point cloud
pointcloud_src: the predicted point cloud
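A minimal sketch of a bidirectional, squared-distance chamfer loss in plain PyTorch; the function name and the (B, N, 3) layout are assumptions:

```python
import torch

def chamfer_loss(pointcloud_src, pointcloud_tgt):
    # pointcloud_src: (B, N_src, 3) predicted points
    # pointcloud_tgt: (B, N_tgt, 3) ground-truth points
    d = torch.cdist(pointcloud_src, pointcloud_tgt) ** 2   # (B, N_src, N_tgt)
    # nearest-neighbor distance in each direction
    src_to_tgt = d.min(dim=2).values.mean(dim=1)
    tgt_to_src = d.min(dim=1).values.mean(dim=1)
    return (src_to_tgt + tgt_to_src).mean()
```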
In this subsection, we define an additional smoothness loss for fitting a mesh. The loss functions are defined in the losses.py file.
mesh_tgt: the ground-truth mesh
mesh_src: the predicted mesh (deformed from an initial sphere)
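One common choice for this regularizer is PyTorch3D's uniform Laplacian smoothing; note that it needs only the predicted mesh (mesh_tgt enters through the chamfer term instead). A sketch, assuming PyTorch3D is available:

```python
from pytorch3d.loss import mesh_laplacian_smoothing

def smoothness_loss(mesh_src):
    # Penalizes each vertex's deviation from the centroid of its neighbors,
    # discouraging spiky, self-intersecting geometry.
    return mesh_laplacian_smoothing(mesh_src, method="uniform")
```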
This section involves training a single-view-to-3D pipeline for voxels, point clouds, and meshes.
Decoder NN structure:

```python
nn.ConvTranspose3d(64, 64, kernel_size=2, dilation=6),
nn.ConvTranspose3d(64, 128, kernel_size=2, dilation=6),
nn.ConvTranspose3d(128, 128, kernel_size=4, dilation=3),
nn.ConvTranspose3d(128, 128, kernel_size=4),
nn.ConvTranspose3d(128, 64, kernel_size=4),
nn.ConvTranspose3d(64, 1, kernel_size=4),
```
There are six 3D transposed convolutional layers. Each transposed convolutional layer is followed by a ReLU activation, except for the last layer, which is followed by a sigmoid activation.
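For concreteness, a sketch of how these layers could be wired up. Reshaping the 512-d encoder feature to a (64, 2, 2, 2) volume is my assumption; it is consistent with the layer list, since with stride 1 and no padding the spatial size grows 2 → 8 → 14 → 23 → 26 → 29 → 32:

```python
import torch.nn as nn

class VoxelDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose3d(64, 64, kernel_size=2, dilation=6), nn.ReLU(),
            nn.ConvTranspose3d(64, 128, kernel_size=2, dilation=6), nn.ReLU(),
            nn.ConvTranspose3d(128, 128, kernel_size=4, dilation=3), nn.ReLU(),
            nn.ConvTranspose3d(128, 128, kernel_size=4), nn.ReLU(),
            nn.ConvTranspose3d(128, 64, kernel_size=4), nn.ReLU(),
            nn.ConvTranspose3d(64, 1, kernel_size=4), nn.Sigmoid(),
        )

    def forward(self, feat):
        x = feat.view(-1, 64, 2, 2, 2)   # (B, 512) -> (B, 64, 2, 2, 2)
        return self.net(x)               # (B, 1, 32, 32, 32) occupancies
```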
From left to right: input RGB image, ground truth, prediction
[figure: 3×3 grid of result images]
Decoder NN structure:

```python
nn.Linear(512, 2048),
nn.Linear(2048, 4096),
nn.Linear(4096, 4096),
nn.Linear(4096, 3000),
```
There are four linear layers. Each linear layer is followed by a batch-normalization layer, a ReLU activation, and a dropout layer, except for the last layer. The batch size is 32.
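A sketch of the decoder as described, assuming the 3000 outputs are 1000 points × 3 coordinates and a dropout probability of 0.2 (the dropout rate is not stated above):

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    def __init__(self, n_points=1000, p_drop=0.2):
        super().__init__()
        self.n_points = n_points
        self.net = nn.Sequential(
            nn.Linear(512, 2048), nn.BatchNorm1d(2048), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(2048, 4096), nn.BatchNorm1d(4096), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(4096, 4096), nn.BatchNorm1d(4096), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(4096, 3 * n_points),
        )

    def forward(self, feat):                              # feat: (B, 512)
        return self.net(feat).view(-1, self.n_points, 3)  # (B, 1000, 3)
```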
From left to right: input RGB image, ground truth, prediction
[figure: 3×3 grid of result images]
Decoder NN structure:

```python
nn.Linear(512, 2048),
nn.Linear(2048, 4096),
nn.Linear(4096, 4096),
nn.Linear(4096, 3000),
```
There are four linear layers. Each linear layer is followed by a batch-normalization layer, a ReLU activation, and a dropout layer, except for the last layer. The batch size is 32.
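Since this decoder mirrors the point-cloud one, a hedged sketch of how its 3000 outputs might be consumed: interpreting them as per-vertex offsets of a 1000-vertex template sphere (3000 = 1000 × 3). `template_mesh`, `decoder`, and `feat` are illustrative names, and the 1000-vertex template is an assumption:

```python
# template_mesh: pytorch3d.structures.Meshes batch of initial spheres whose
# total vertex count matches the decoder output; feat: (B, 512) image feature.
deform = decoder(feat).view(-1, 3)          # packed (B * 1000, 3) vertex offsets
mesh_pred = template_mesh.offset_verts(deform)
```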
From left to right: input RGB image, ground truth, prediction
[figure: 3×3 grid of result images]
Quantitatively compare the F1 score of 3D reconstruction for meshes vs. point clouds vs. voxel grids. Provide an intuitive explanation justifying the comparison.
Voxel_F-score@0.05: 55.068
Pointcloud_F-score@0.05: 78.213
Mesh_F-score@0.05: 90.732
Explanation:
For voxels, we optimize a binary cross-entropy loss and must predict the occupancy of each cell in the grid. Moreover, the voxel prediction resolution is 32×32×32, which is coarse relative to the sampling used for evaluation, so the F1 score suffers.
For point clouds, we only have to learn the spatial locations of the points, so even a simple network achieves a better result. In addition, the chamfer loss used for optimization directly minimizes point-to-point distances, an objective that closely matches the F1 metric.
For meshes, we similarly used a chamfer loss (on points sampled from the surfaces) to deform the mesh from a sphere. Because the predicted surface is continuous, sampled points cover it densely, which yields the highest F1 score.
w_smooth: A higher w_smooth value leads to a smoother and more continuous output, but it also lowers the F1 score. The ideal w_smooth value should be below 10.
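Putting the pieces together, a sketch of the full mesh-fitting objective implied above; `mesh_pred`/`mesh_gt` are illustrative names, and PyTorch3D's sample_points_from_meshes supplies point sets for the chamfer term:

```python
from pytorch3d.ops import sample_points_from_meshes

pts_src = sample_points_from_meshes(mesh_pred, num_samples=3000)
pts_tgt = sample_points_from_meshes(mesh_gt, num_samples=3000)
loss = chamfer_loss(pts_src, pts_tgt) + w_smooth * smoothness_loss(mesh_pred)
```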
The training process of a point-cloud prediction:
[figure: predictions at three stages of training]