Left and right are the prediction and ground-truth results respectively.
Left and right are the prediction and ground-truth results respectively.
Left and right are the prediction and ground-truth results respectively.
I use a stack of five deconvolution layers, followed by a 1×1 convolution, for the decoder as follows:
self.decoder = torch.nn.Sequential(*[
    View([8, 8, 8]),                     # (B, 512) -> (B, 8, 8, 8): 8 channels on an 8x8 grid
    nn.ConvTranspose2d(8, 128, 7, 1),    # spatial size 8 -> 14
    nn.ReLU(True),
    nn.ConvTranspose2d(128, 256, 7, 1),  # 14 -> 20
    nn.ReLU(True),
    nn.ConvTranspose2d(256, 512, 5, 1),  # 20 -> 24
    nn.ReLU(True),
    nn.ConvTranspose2d(512, 512, 5, 1),  # 24 -> 28
    nn.ReLU(True),
    nn.ConvTranspose2d(512, 256, 5, 1),  # 28 -> 32
    nn.ReLU(True),
    nn.Conv2d(256, 32, 1),               # 1x1 conv: 256 channels -> 32 depth slices
    View([1, 32, 32, 32]),               # (B, 1, 32, 32, 32) voxel grid
])
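Here, View is not a built-in PyTorch layer but a small reshaping helper. A minimal sketch of such a helper, assuming it keeps the batch dimension and reshapes the remaining elements to the given target shape:

import torch.nn as nn

class View(nn.Module):
    # Reshape the non-batch dimensions to a fixed target shape,
    # e.g. (B, 512) -> (B, 8, 8, 8) before the first deconvolution.
    def __init__(self, shape):
        super().__init__()
        self.shape = list(shape)

    def forward(self, x):
        return x.reshape(x.size(0), *self.shape)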
From top to bottom: the rendered image, the prediction, and the ground-truth mesh.
I use a stack of four fully connected layers for the decoder as follows:
self.decoder = torch.nn.Sequential(*[
    nn.Linear(512, 512),
    nn.LeakyReLU(0.1, True),
    nn.Linear(512, 512),
    nn.LeakyReLU(0.1, True),
    nn.Linear(512, 512),
    nn.LeakyReLU(0.1, True),
    nn.Linear(512, self.n_point * 3),    # 3 coordinates per point
    View([self.n_point, 3]),             # (B, n_point * 3) -> (B, n_point, 3)
])
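For context, a sketch of how a 512-dimensional image feature could be produced and fed to this decoder, assuming a torchvision ResNet-18 trunk as the encoder (the encoder choice and input resolution here are assumptions, not shown in the snippets above):

import torch
import torch.nn as nn
import torchvision

# Assumed encoder: a ResNet-18 trunk with the classification head removed,
# so a batch of images becomes a batch of 512-dimensional features.
resnet = torchvision.models.resnet18()
encoder = nn.Sequential(*list(resnet.children())[:-1], nn.Flatten(1))

images = torch.randn(4, 3, 224, 224)   # dummy image batch; the real resolution comes from the dataloader
latent = encoder(images)               # (4, 512), which is fed to self.decoder above
print(latent.shape)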
From top to bottom: the rendered image, the prediction, and the ground-truth mesh.
I use the same decoder network as for the mesh.
F1-Score@0.05: {Voxel: 76.539, Point Cloud: 87.665, Mesh: 82.569}
The F1 score is computed from points sampled from the predicted and ground-truth voxel grids, point clouds, and meshes. These sampled points capture the underlying 3D shape and can be compared consistently across the three representations, so the F1 score is a reasonable metric for this comparison.
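For reference, a minimal sketch of how F1@0.05 can be computed from two sets of sampled points, assuming precision and recall are defined by nearest-neighbour distances under the 0.05 threshold; the actual evaluation code may scale the distances or scores differently:

import torch

def f1_at_threshold(pred_points, gt_points, threshold=0.05):
    # pred_points: (N, 3), gt_points: (M, 3) -- points sampled from the two shapes
    dists = torch.cdist(pred_points, gt_points)                   # (N, M) pairwise distances
    precision = (dists.min(dim=1).values < threshold).float().mean()
    recall = (dists.min(dim=0).values < threshold).float().mean()
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)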
I replaced the voxel-grid decoder with the MLP used for the mesh and point cloud, setting the MLP's output size to (batch_size, 32 * 32 * 32). As a result, the F1 score dropped from 76.539 to 72.429. I believe this is because deconvolution (or convolution) layers capture spatial structure better than an MLP.
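A sketch of this MLP variant, assuming the same hidden widths as the point-cloud/mesh decoder and a flat occupancy output:

# Voxel decoder replaced with the same MLP trunk as the point-cloud/mesh decoder;
# the last layer predicts a flat (batch_size, 32 * 32 * 32) occupancy vector,
# which can be reshaped to (32, 32, 32) downstream.
self.decoder = torch.nn.Sequential(*[
    nn.Linear(512, 512),
    nn.LeakyReLU(0.1, True),
    nn.Linear(512, 512),
    nn.LeakyReLU(0.1, True),
    nn.Linear(512, 512),
    nn.LeakyReLU(0.1, True),
    nn.Linear(512, 32 * 32 * 32),
])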
I increased the icosphere subdivision level to 6 to relax the limit on how complicated a 3D model the mesh can fit. However, the F1 score slightly deteriorated to 81.272. The visualizations below show that many faces go unused and pile up around the central regions, implying that it is crucial to use a mesh with an appropriate number of faces for the 3D models at hand.
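The denser source mesh can be built with pytorch3d's ico_sphere utility; a sketch, assuming the source mesh is an icosphere as in the default setup:

import torch
from pytorch3d.utils import ico_sphere

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Each subdivision level splits every triangular face into four, so the
# face count grows by 4x per level; level 6 gives a very dense source mesh.
src_mesh = ico_sphere(6, device)
print(src_mesh.verts_packed().shape, src_mesh.faces_packed().shape)

Each additional level quadruples the number of faces, which is consistent with the many unused, stacked faces observed in the visualizations.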
After training the point-cloud model, I used t-SNE to visualize the latent features extracted from the encoder in 2D space. Mean shift was used to cluster the t-SNE features, and I sampled several images from different clusters (see the images below). I expected images drawn from the same cluster to share similar shapes; however, the t-SNE embedding does not seem to reflect the appearance or shape of the objects well. In future work, I would like to investigate how to learn disentangled features so that a 3D model can be generated with intended properties such as shape and appearance.
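A sketch of the embedding-and-clustering step, assuming the encoder features have been collected into a (num_samples, 512) array; the file name latents.npy and the default mean-shift bandwidth are placeholders:

import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import MeanShift

# latents.npy is a hypothetical file holding the (num_samples, 512) encoder
# features gathered over the evaluation set after point-cloud training.
latents = np.load("latents.npy")

# 2D t-SNE embedding, then mean-shift clustering of the embedded points.
embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(latents)
labels = MeanShift().fit_predict(embedded)
print("number of clusters:", labels.max() + 1)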
Visualization of the t-SNE-embedded features in 2D space.
Sample images from cluster 0
Sample images from cluster 34
Sample images from cluster 94