16-889 Assignment 2

Yutian Lei (Andrew ID: yutianle)

1. Exploring loss functions

1.1. Fitting a voxel grid (5 points)

Fitted Voxel using Binary Cross Entropy Loss
Ground Truth Voxel
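For reference, a minimal sketch of this fitting loop, assuming the source grid is a 32×32×32 tensor of logits optimized directly against binary ground-truth occupancies (all variable names here are illustrative, not from the starter code):

```python
import torch
import torch.nn.functional as F

# Stand-in ground truth; in the assignment this comes from the dataset.
voxel_gt = torch.rand(1, 32, 32, 32).round()
# The source grid is optimized directly as logits.
voxel_logits = torch.zeros(1, 32, 32, 32, requires_grad=True)
optimizer = torch.optim.Adam([voxel_logits], lr=1e-2)

for step in range(500):
    # BCE-with-logits is numerically stabler than sigmoid followed by BCE.
    loss = F.binary_cross_entropy_with_logits(voxel_logits, voxel_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```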

1.2. Fitting a point cloud (10 points)

Fitted Point Cloud using Chamfer Loss
Ground Truth Point Cloud
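A minimal sketch of the symmetric Chamfer loss, written with plain torch.cdist rather than pytorch3d's knn; batch shapes and the mean reduction are assumptions:

```python
import torch

def chamfer_loss(src: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
    """Symmetric Chamfer distance between clouds src (B, N, 3) and tgt (B, M, 3)."""
    d2 = torch.cdist(src, tgt, p=2) ** 2          # pairwise squared distances, (B, N, M)
    loss_src = d2.min(dim=2).values.mean(dim=1)   # nearest tgt point for each src point
    loss_tgt = d2.min(dim=1).values.mean(dim=1)   # nearest src point for each tgt point
    return (loss_src + loss_tgt).mean()

# Example: fit a random source cloud toward a target.
src = torch.rand(1, 5000, 3, requires_grad=True)
tgt = torch.rand(1, 5000, 3)
loss = chamfer_loss(src, tgt)
loss.backward()
```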

1.3. Fitting a mesh (5 points)

Fitted Mesh using Smoothness Loss and Chamfer Loss
Ground Truth Mesh
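A sketch of the combined objective, assuming pytorch3d's chamfer_distance and mesh_laplacian_smoothing with a deformable ico-sphere source mesh; the target mesh here is a scaled stand-in and the smoothness weight is an illustrative value:

```python
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

src_mesh = ico_sphere(level=4)
mesh_gt = ico_sphere(level=4).scale_verts(1.5)   # stand-in for the target mesh
deform = torch.zeros_like(src_mesh.verts_packed(), requires_grad=True)
optimizer = torch.optim.Adam([deform], lr=1e-3)
w_smooth = 0.1                                   # illustrative smoothness weight

for step in range(1000):
    new_mesh = src_mesh.offset_verts(deform)
    pts_pred = sample_points_from_meshes(new_mesh, num_samples=5000)
    pts_gt = sample_points_from_meshes(mesh_gt, num_samples=5000)
    loss_chamfer, _ = chamfer_distance(pts_pred, pts_gt)
    loss = loss_chamfer + w_smooth * mesh_laplacian_smoothing(new_mesh)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```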

2. Reconstructing 3D from a Single View

In sections 2.1, 2.2, and 2.3, visualizations of the 3D reconstructions as voxel grids, point clouds, and meshes are shown respectively, for three examples from the test dataset. The first two examples show successful cases while the last one shows a failure case.

2.1. Image to voxel grid (15 points)

Fitted Voxel 1
Ground Truth Voxel 1
Ground Truth Image 1

Fitted Voxel 2
Ground Truth Voxel 2
Ground Truth Image 2

Fitted Voxel 3
Ground Truth Voxel 3
Ground Truth Image 3

2.2. Image to point cloud (15 points)

Fitted Point Cloud 1
Ground Truth Point Cloud 1
Ground Truth Image 1
Fitted Point Cloud 2
Ground Truth Point Cloud 2
Ground Truth Image 2
Fitted Point Cloud 3
Ground Truth Point Cloud 3
Ground Truth Image 3

2.3. Image to mesh (15 points)

Fitted Mesh 1
Ground Truth Mesh 1
Ground Truth Image 1
Fitted Mesh 2
Ground Truth Mesh 2
Ground Truth Image 2
Fitted Mesh 3
Ground Truth Mesh 3
Ground Truth Image 3

2.4. Quantitative Comparisons (10 points)

        | Voxel  | Point Cloud | Mesh
F1@0.05 | 86.201 | 94.184      | 84.710
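For reference, F1@0.05 treats reconstruction as retrieval between sampled point sets: precision is the fraction of predicted points within 0.05 of some ground-truth point, and recall is the converse. A minimal sketch (the function name and the percentage scaling are my own conventions):

```python
import torch

def f1_score(pred: torch.Tensor, gt: torch.Tensor, tau: float = 0.05) -> float:
    """F1 at distance threshold tau for point sets pred (N, 3) and gt (M, 3)."""
    d = torch.cdist(pred, gt)                                # (N, M) pairwise distances
    precision = (d.min(dim=1).values < tau).float().mean()   # predicted points near GT
    recall = (d.min(dim=0).values < tau).float().mean()      # GT points near a prediction
    return (2 * precision * recall / (precision + recall + 1e-8) * 100).item()
```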

2.5. Analyse effects of hyperparameter variations (10 points)

2.5.1 Voxel: Whether to Normalize the Loss Value according to Occupancy Rate

For the image-to-voxel model, empty voxels generally occupy most of the volume around an object, which biases the reconstruction model toward predicting "0". Moreover, the occupancy rate differs from object to object. It is therefore helpful to normalize the loss value according to the occupancy rate during training. Here, the F1@0.05 scores for the image-to-voxel model with and without loss normalization are presented. Performance improves with loss normalization. Normalization also alleviates the model's discontinuity problem to some extent: for example, the chair legs predicted without normalization in the figure below are very thin, while with normalization the front two chair legs regain a normal thickness.

        | Without Normalization | With Normalization
F1@0.05 | 86.201                | 87.668
Fitted Voxel (Without Normalization)
Fitted Voxel (With Normalization)
Ground Truth Voxel
Ground Truth Image
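A sketch of one way to implement the normalization described above; the exact weighting scheme below is my summary of the idea, not a verified copy of the training code:

```python
import torch
import torch.nn.functional as F

def normalized_voxel_bce(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """BCE re-weighted by per-object occupancy rate.

    logits, target: (B, 32, 32, 32). Occupied and empty voxels contribute
    equally to the loss regardless of how sparse each object is.
    """
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    occ = target.flatten(1).mean(dim=1).clamp(1e-6, 1 - 1e-6)  # occupancy rate per object
    occ = occ.view(-1, 1, 1, 1)
    # Up-weight the sparse occupied voxels, down-weight the abundant empty ones.
    weight = target / occ + (1 - target) / (1 - occ)
    return (weight * bce).mean()
```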

2.5.2 Point Cloud: Number of Points

For the image-to-point-cloud model, the number of points used to represent the 3D model affects performance. Here, the F1@0.05 scores for the image-to-point-cloud model with different numbers of points are presented. The experiments demonstrate that using too many points hurts performance due to the model's limited capacity, while using too few points may be insufficient to express the 3D structure well.

Number of Points | 2500   | 5000   | 7500
F1@0.05          | 92.908 | 94.184 | 91.202
Fitted Point Cloud (2500)
Fitted Point Cloud (5000)
Fitted Point Cloud (7500)
Ground Truth Point Cloud
Ground Truth Image

2.5.3 Mesh: Mesh Initialization

For the image-to-mesh model, an important factor affecting performance is the mesh initialization. Here, the F1@0.05 scores for the image-to-mesh model are presented for two initializations: a predefined unit ico-sphere (level 4) and a "standard" chair picked from the r2n2 dataset. Note that although the F1@0.05 score with ico-sphere initialization is only slightly lower than that with chair initialization, the visualizations with chair initialization are significantly better, as shown below.

Mesh Initialization | Unit Ico-Sphere (Level 4) | Chair
F1@0.05             | 82.645                    | 84.710
💡 The selected chair is c2ad96f56ec726d270a43c2d978e502e
Fitted Mesh (Unit Ico-Sphere Init)
Fitted Mesh (Chair Init)
Ground Truth Mesh
Ground Truth Image
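A sketch of constructing the two initializations, assuming pytorch3d; the chair template path below is illustrative, not the actual dataset layout:

```python
from pytorch3d.io import load_obj
from pytorch3d.structures import Meshes
from pytorch3d.utils import ico_sphere

# Initialization 1: a predefined unit ico-sphere at subdivision level 4.
sphere_init = ico_sphere(level=4)

# Initialization 2: the "standard" chair template (path is illustrative).
verts, faces, _ = load_obj("path/to/c2ad96f56ec726d270a43c2d978e502e.obj")
chair_init = Meshes(verts=[verts], faces=[faces.verts_idx])
```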

2.6. Interpret your model (15 points)

Since the classic encoder-decoder structure is used for all three models in this assignment, interpolation is a good way to interpret the latent space and explore the smoothness of the learned models. Specifically, the encoded features of two selected images are linearly interpolated with a step of 0.1, and each interpolated feature is then decoded by the corresponding model to generate an output. As the figures below show, all three learned models generate a smooth transition from a chair structure (1) to a sofa structure (2), which demonstrates that the models are not merely memorizing the correspondence between input images and 3D structures, but rather building a robust mapping from the latent feature space to the 3D structure space.
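A sketch of this interpolation procedure; the encoder and decoder below are trivial stand-ins for the trained model's two halves, and all names and sizes here are assumptions rather than the assignment's actual modules:

```python
import torch
import torch.nn as nn

# Stand-ins for the trained encoder/decoder (e.g. the voxel head).
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 137 * 137, 512))
decoder = nn.Linear(512, 32 * 32 * 32)

image_chair = torch.rand(1, 3, 137, 137)  # stand-ins for the two test images
image_sofa = torch.rand(1, 3, 137, 137)

with torch.no_grad():
    z1, z2 = encoder(image_chair), encoder(image_sofa)
    for t in torch.arange(0.0, 1.01, 0.1):
        z = (1 - t) * z1 + t * z2   # linear interpolation with step 0.1
        out = decoder(z)            # decode the interpolated latent feature
```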

2.6.1 Voxel

Fitted Voxel 1
Fitted Voxel 2

2.6.2 Point Cloud

Fitted Point Cloud 1
Fitted Point Cloud 2

2.6.3 Mesh

Fitted Mesh 1
Fitted Mesh 2

3. (Extra Credit) Exploring some recent architectures

3.1. Implicit network (10 points)

For the implicit network, I reimplement it based on "Occupancy Networks: Learning 3D Reconstruction in Function Space", with simplifications.

Specifically,

  1. In the training stage, the training points are drawn from the discrete voxel space (32×32×32) rather than sampling points from a continuous 3D space and determining whether they lie inside or outside the corresponding meshes.
  2. In the inference stage, the output mesh is extracted with the simple marching cubes algorithm, similar to Section 2.1, rather than the Multiresolution IsoSurface Extraction (MISE) of the original paper (see the sketch after this list). However, I also borrowed the MISE code from the official implementation to conduct an ablation study on the two mesh generation methods; the experiments are shown below.
  3. For the model part, I basically reimplement the baseline model of the original paper, except that: 1) the five ResNet blocks with conditional batch normalization are implemented without the "residual" or "skip connection" structure; 2) the latent encoder is removed due to its complexity.
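A sketch of simplifications 1 and 2; the voxel grid and predicted occupancies are random stand-ins, the coordinate normalization is an assumed convention, and marching cubes comes from scikit-image:

```python
import torch
from skimage import measure

# Simplification 1: draw training points directly from the discrete 32^3 grid.
vox = torch.rand(32, 32, 32).round()            # stand-in ground-truth occupancies
idx = torch.randint(0, 32, (4096, 3))           # random grid coordinates
coords = idx.float() / 31.0                     # normalize to [0, 1]^3 (assumed convention)
labels = vox[idx[:, 0], idx[:, 1], idx[:, 2]]   # binary occupancy supervision

# Simplification 2: naive inference via marching cubes on the predicted grid.
occ_pred = torch.rand(32, 32, 32)               # stand-in network predictions
verts, faces, _, _ = measure.marching_cubes(occ_pred.numpy(), level=0.5)
```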

Visualizations of results from this naive implementation of the implicit network are shown below.

Fitted Mesh 1
Ground Truth Mesh 1
Ground Truth Image 1
Fitted Mesh 2
Ground Truth Mesh 2
Ground Truth Image 2
Fitted Mesh 3
Ground Truth Mesh 3
Ground Truth Image 3

The quantitative results of the naive and MISE mesh generation methods are shown below.

        | MISE   | Naive
F1@0.05 | 84.115 | 83.073

3.2. Parametric network (10 points)

For the parametric network, I reimplement it based on "AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation", with simplifications.

Specifically,

  1. In the training stage, the training points are sampled randomly instead of using the regular sampling technique mentioned in the original paper.
  2. In the inference stage, only the naive mesh generation method is implemented, that is, mapping the unit square to 3D while keeping its connectivity (see the sketch after this list). Poisson surface reconstruction (PSR) is not implemented due to its complexity.
  3. For the model part, I basically reimplement the baseline model of the original paper, except that the model only supports a single template (or primitive surface) instead of the multiple templates of the official implementation.
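A minimal sketch of a single-template AtlasNet-style decoder: an MLP maps a 2D point on the unit square, concatenated with the image feature, to a 3D surface point. Layer sizes and names are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AtlasNetDecoder(nn.Module):
    """Single-template decoder: (uv point on unit square, image feature) -> 3D point."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + feat_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, 3), nn.Tanh(),   # 3D coordinates in [-1, 1]
        )

    def forward(self, uv: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
        # uv: (B, N, 2) points on the unit square; feat: (B, feat_dim)
        feat = feat.unsqueeze(1).expand(-1, uv.shape[1], -1)
        return self.mlp(torch.cat([uv, feat], dim=-1))

# Training-time usage: sample the square randomly (simplification 1).
decoder = AtlasNetDecoder()
uv = torch.rand(1, 2048, 2)               # random, not regular, sampling
points = decoder(uv, torch.rand(1, 512))  # (1, 2048, 3) predicted surface points
```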

Visualizations of results from this naive implementation of the parametric network are shown below.

Fitted Mesh 1
Ground Truth Mesh 1
Ground Truth Image 1
Fitted Mesh 2
Ground Truth Mesh 2
Ground Truth Image 2
Fitted Mesh 3
Ground Truth Mesh 3
Ground Truth Image 3

The quantitative results of sampling 2048 and 4096 points during training are shown below.

Number of Points | 2048   | 4096
F1@0.05          | 63.408 | 70.987