16-889 Assignment 2: Single View to 3D

1. Exploring loss functions

1.1. Fitting a voxel grid (5 points)

voxel_pred voxel_gt

Left and right are the prediction and ground-truth results respectively.
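A natural objective for fitting a voxel grid is a per-voxel binary cross-entropy between the predicted occupancies and the ground-truth grid; a minimal sketch, assuming the prediction holds raw (pre-sigmoid) logits:

      import torch
      import torch.nn.functional as F

      def voxel_loss(voxel_pred, voxel_gt):
          # voxel_pred: (B, 1, D, H, W) occupancy logits; voxel_gt: (B, 1, D, H, W) binary occupancy
          return F.binary_cross_entropy_with_logits(voxel_pred, voxel_gt.float())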

1.2. Fitting a point cloud (10 points)

point_pred point_gt

Left and right are the prediction and ground-truth results respectively.
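Point cloud fitting is driven by a symmetric Chamfer distance between the predicted and ground-truth point sets; a minimal sketch built on pytorch3d's knn_points (the chamfer_loss name and the mean reduction are my choices):

      import torch
      from pytorch3d.ops import knn_points

      def chamfer_loss(point_pred, point_gt):
          # point_pred: (B, N, 3), point_gt: (B, M, 3)
          d_pred_to_gt = knn_points(point_pred, point_gt, K=1).dists  # squared dist to nearest GT point
          d_gt_to_pred = knn_points(point_gt, point_pred, K=1).dists  # squared dist to nearest predicted point
          return d_pred_to_gt.mean() + d_gt_to_pred.mean()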

1.3. Fitting a mesh (5 points)

mesh_pred mesh_gt

Left and right are the prediction and ground-truth results respectively.
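Mesh fitting typically combines a Chamfer term on points sampled from the predicted and target meshes with a smoothness regularizer on the predicted mesh; a hedged sketch using pytorch3d utilities and the chamfer_loss above (the sample count and the weight w_smooth are illustrative):

      from pytorch3d.ops import sample_points_from_meshes
      from pytorch3d.loss import mesh_laplacian_smoothing

      def mesh_loss(mesh_pred, mesh_gt, w_smooth=0.1):
          # compare surface samples from both meshes with the Chamfer loss sketched above
          pts_pred = sample_points_from_meshes(mesh_pred, num_samples=5000)
          pts_gt = sample_points_from_meshes(mesh_gt, num_samples=5000)
          # Laplacian smoothing penalizes vertices that stray far from their neighbours' centroid
          return chamfer_loss(pts_pred, pts_gt) + w_smooth * mesh_laplacian_smoothing(mesh_pred)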

2. Reconstructing 3D from single view

2.1. Image to voxel grid (15 points)

I use a stack of transposed-convolution (deconvolution) layers for the decoder, as follows:

       
      # assumes `import torch` and `import torch.nn as nn`; View is a small reshape helper (sketched below)
      self.decoder = torch.nn.Sequential(*[
          View([8, 8, 8]),                     # 512-d feature -> 8-channel 8x8 map
          nn.ConvTranspose2d(8, 128, 7, 1),    # 8x8 -> 14x14
          nn.ReLU(True),
          nn.ConvTranspose2d(128, 256, 7, 1),  # 14x14 -> 20x20
          nn.ReLU(True),
          nn.ConvTranspose2d(256, 512, 5, 1),  # 20x20 -> 24x24
          nn.ReLU(True),
          nn.ConvTranspose2d(512, 512, 5, 1),  # 24x24 -> 28x28
          nn.ReLU(True),
          nn.ConvTranspose2d(512, 256, 5, 1),  # 28x28 -> 32x32
          nn.ReLU(True),
          nn.Conv2d(256, 32, 1),               # 1x1 conv to 32 channels, i.e. the depth dimension
          View([1, 32, 32, 32]),               # reshape to a 32^3 voxel grid
      ])
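View is not a built-in PyTorch layer but a small custom reshape module; a minimal sketch consistent with how it is used above (keep the batch dimension, reshape the rest):

      import torch.nn as nn

      class View(nn.Module):
          """Reshape the non-batch dimensions to a fixed target shape."""
          def __init__(self, shape):
              super().__init__()
              self.shape = shape

          def forward(self, x):
              # e.g. (B, 512) -> (B, 8, 8, 8), or (B, 32768) -> (B, 1, 32, 32, 32)
              return x.view(x.size(0), *self.shape)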
       
    


From top to bottom: the rendered input image, the prediction, and the ground-truth mesh.

2.2. Image to point cloud (15 points)

I use a stack of fully connected layers for the decoder, as follows:

       
      self.decoder = torch.nn.Sequential(*[
          nn.Linear(512, 512),
          nn.LeakyReLU(0.1, True),
          nn.Linear(512, 512),
          nn.LeakyReLU(0.1, True),
          nn.Linear(512, 512),
          nn.LeakyReLU(0.1, True),
          nn.Linear(512, self.n_point * 3),   # one (x, y, z) triple per predicted point
          View([self.n_point, 3]),            # reshape to (batch, n_point, 3)
      ])
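As a quick sanity check of the output shape, the final linear layer and reshape can be exercised in isolation (n_point = 1000 is a hypothetical value; in the model it comes from self.n_point, and View is the helper sketched in 2.1):

      import torch
      import torch.nn as nn

      n_point = 1000                      # hypothetical; the model reads this from self.n_point
      head = nn.Sequential(nn.Linear(512, n_point * 3), View([n_point, 3]))
      feat = torch.randn(4, 512)          # stand-in for a batch of encoder features
      print(head(feat).shape)             # torch.Size([4, 1000, 3])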
       
    


From top to bottom: the rendered input image, the prediction, and the ground-truth mesh.

2.3. Image to mesh (15 points)

I use the same decoder architecture as the point cloud model (Section 2.2), with the output size matching the number of mesh vertices × 3.
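A common way to wire this up (and what I assume in the sketch below) is to reshape the decoder output into one 3D offset per vertex of a template icosphere and deform the template into the predicted mesh; the batch size, subdivision level 4, and two-layer MLP are illustrative stand-ins:

      import torch
      import torch.nn as nn
      from pytorch3d.utils import ico_sphere

      B = 4                                        # hypothetical batch size
      feat = torch.randn(B, 512)                   # stand-in for encoder features
      template = ico_sphere(level=4).extend(B)     # one template icosphere per batch element
      n_vert = template.verts_padded().shape[1]    # 2562 vertices at level 4

      decoder = nn.Sequential(                     # same MLP style as 2.2, sized to the vertex count
          nn.Linear(512, 512), nn.LeakyReLU(0.1, True),
          nn.Linear(512, n_vert * 3),
      )
      offsets = decoder(feat).reshape(-1, 3)       # packed (B * n_vert, 3) vertex offsets
      mesh_pred = template.offset_verts(offsets)   # deformed icosphere = predicted mesh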



2.4. Quantitative comparisons (10 points)

F1-Score@0.05: {Voxel: 76.539, Point Cloud: 87.665, Mesh: 82.569}

The F1 score is computed from points sampled from the predicted and ground-truth shapes (voxel grids, point clouds, and meshes). The sampled 3D points capture the shape information and are directly comparable across the three representations, so the F1 score is a reasonable metric for this comparison.
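Concretely, with a distance threshold of 0.05, precision is the fraction of predicted points lying within the threshold of some ground-truth point, recall is the symmetric quantity, and F1 is their harmonic mean (the numbers above appear to be reported as percentages); a sketch of that computation, reusing pytorch3d's knn_points:

      import torch
      from pytorch3d.ops import knn_points

      def f1_at_threshold(pts_pred, pts_gt, threshold=0.05):
          # pts_pred: (1, N, 3), pts_gt: (1, M, 3) points sampled from each representation
          d_pred_to_gt = knn_points(pts_pred, pts_gt, K=1).dists.sqrt()
          d_gt_to_pred = knn_points(pts_gt, pts_pred, K=1).dists.sqrt()
          precision = (d_pred_to_gt < threshold).float().mean()
          recall = (d_gt_to_pred < threshold).float().mean()
          return 2 * precision * recall / (precision + recall + 1e-8)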

2.5. Analyze effects of hyperparameter variations (10 points)

Using an MLP instead of deconvolution for the voxel grid

I replaced the voxel-grid decoder with the MLP used for the mesh and point cloud models, setting the output size of the MLP to (batch_size, 32 * 32 * 32). As a result, the F1 score dropped from 76.539 to 72.429. I believe this is because deconvolution (and convolution) layers capture spatial structure better than an MLP.
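A sketch of that variant, mirroring the point-cloud MLP above with its output resized to a flattened 32^3 grid (the hidden sizes follow 2.2; only the output size is stated explicitly above):

      self.decoder = torch.nn.Sequential(*[
          nn.Linear(512, 512),
          nn.LeakyReLU(0.1, True),
          nn.Linear(512, 512),
          nn.LeakyReLU(0.1, True),
          nn.Linear(512, 512),
          nn.LeakyReLU(0.1, True),
          nn.Linear(512, 32 * 32 * 32),   # flattened occupancy grid
          View([1, 32, 32, 32]),          # reshape back to a 32^3 voxel grid
      ])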



Increasing the icosphere subdivision level

I increased the icosphere subdivision level to 6 to relax the constraint on fitting more complicated 3D shapes. However, the F1 score deteriorated slightly to 81.272. The visualizations below show that several faces are left unused and pile up around the central regions, implying that it is crucial to choose an appropriate number of faces for the target 3D models.
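For scale, a level-6 icosphere is a much heavier template than lower levels, which also inflates the decoder's per-vertex output; a quick check with pytorch3d:

      from pytorch3d.utils import ico_sphere

      template = ico_sphere(level=6)
      print(template.verts_padded().shape)   # torch.Size([1, 40962, 3])
      print(template.faces_padded().shape)   # torch.Size([1, 81920, 3])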



2.6. Interpret your model (15 points)

After training the point cloud model, I used t-SNE to visualize the latent features extracted from the encoder in 2D space. Mean shift is then used to cluster the t-SNE embeddings, and I sampled several images from different clusters (see the images below). I expected to obtain images of similarly shaped objects within each cluster. Nevertheless, the t-SNE embeddings do not seem to reflect the appearance or shape of the objects well. In future work, I therefore want to investigate how to learn disentangled features so that a 3D model can be generated with intended properties such as shape and appearance.
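A hedged sketch of the embedding-and-clustering step described above, assuming the encoder features have been collected into a (num_samples, 512) array; the file name and the t-SNE perplexity are illustrative:

      import numpy as np
      from sklearn.manifold import TSNE
      from sklearn.cluster import MeanShift

      feats = np.load("encoder_features.npy")     # hypothetical dump of latent vectors, shape (num_samples, 512)
      embedded = TSNE(n_components=2, perplexity=30).fit_transform(feats)   # (num_samples, 2)
      labels = MeanShift().fit_predict(embedded)  # cluster ids used to sample the images below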


Visualization of the t-SNE embeddings in 2D space.

Sample images from cluster 0

Sample images from cluster 34

Sample images from cluster 94