16-889: Learning for 3D Vision (Spring 2022)


Assignment 2


Qichen Fu

1. Exploring loss functions


1.1. Fitting a voxel grid (5 points)



source
target
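
The voxel grid is fit by minimizing a binary cross-entropy loss between the predicted per-cell occupancies and the target grid. A minimal sketch of the fitting loop, with illustrative shapes and names (the actual implementation is in the submitted code):

    import torch
    import torch.nn.functional as F

    # Source grid of occupancy logits to optimize; target is a binary {0, 1} grid.
    voxels_src = torch.randn(1, 32, 32, 32, requires_grad=True)
    voxels_tgt = (torch.rand(1, 32, 32, 32) > 0.5).float()

    optimizer = torch.optim.Adam([voxels_src], lr=1e-2)
    for step in range(1000):
        # Per-cell binary cross-entropy between predicted and target occupancy.
        loss = F.binary_cross_entropy_with_logits(voxels_src, voxels_tgt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()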

1.2. Fitting a point cloud (10 points)


source
target
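
The point cloud is fit with a symmetric chamfer loss. A sketch of a hand-rolled chamfer distance and the fitting loop, assuming (B, N, 3) point tensors (names are illustrative):

    import torch

    def chamfer_loss(src, tgt):
        # Pairwise squared distances between the two clouds: (B, N, M).
        dists = torch.cdist(src, tgt) ** 2
        # Nearest-neighbor distance in both directions, averaged.
        return dists.min(dim=2)[0].mean() + dists.min(dim=1)[0].mean()

    points_src = torch.randn(1, 5000, 3, requires_grad=True)
    points_tgt = torch.randn(1, 5000, 3)

    optimizer = torch.optim.Adam([points_src], lr=1e-2)
    for step in range(1000):
        loss = chamfer_loss(points_src, points_tgt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()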

1.3. Fitting a mesh (5 points)


source
target
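
The mesh is fit with a chamfer loss on points sampled from the surface, plus a smoothness term on the predicted mesh. A sketch using PyTorch3D, with a Laplacian smoothness term as one common choice and ico-spheres standing in for both meshes (the real target comes from the dataset):

    import torch
    from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
    from pytorch3d.ops import sample_points_from_meshes
    from pytorch3d.utils import ico_sphere

    # Deform a source ico-sphere toward the target via per-vertex offsets.
    mesh_src = ico_sphere(4)
    mesh_tgt = ico_sphere(4).scale_verts(1.5)  # stand-in target mesh
    deform = torch.zeros_like(mesh_src.verts_packed(), requires_grad=True)

    optimizer = torch.optim.Adam([deform], lr=1e-2)
    for step in range(1000):
        new_mesh = mesh_src.offset_verts(deform)
        # Chamfer on points sampled from both surfaces, plus a smoothness prior.
        points_src = sample_points_from_meshes(new_mesh, 5000)
        points_tgt = sample_points_from_meshes(mesh_tgt, 5000)
        loss_chamfer, _ = chamfer_distance(points_src, points_tgt)
        loss = loss_chamfer + 0.1 * mesh_laplacian_smoothing(new_mesh)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()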

2. Reconstructing 3D from single view


2.1. Image to voxel grid (15 points)


input RGB
predicted 3D voxel
GT 3D voxel
input RGB
predicted 3D voxel
GT 3D voxel
input RGB
predicted 3D voxel
GT 3D voxel

2.2. Image to point cloud (15 points)


input RGB
predicted 3D point cloud
GT 3D point cloud
input RGB
predicted 3D point cloud
GT 3D point cloud
input RGB
predicted 3D point cloud
GT 3D point cloud

2.3. Image to mesh (15 points)


input RGB
predicted 3D mesh
GT 3D mesh
input RGB
predicted 3D mesh
GT 3D mesh
input RGB
predicted 3D mesh
GT 3D mesh

2.4. Quantitative comparisons (10 points)


Quantitative Results

Voxel: Avg F1@0.05: 66.987

Point Cloud: Avg F1@0.05: 92.013

Mesh: Avg F1@0.05: 81.851

Explanation

The point cloud prediction gives the best F1 score, the mesh ranks second, and the voxel grid has the worst F1 score.

The point cloud performs best because its learning objective (the chamfer loss) is directly aligned with the evaluation metric, which measures the distance between the predicted and ground-truth points.
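
For reference, a sketch of how F1@0.05 can be computed from nearest-neighbor distances between sampled point sets (my reading of the metric, not the exact evaluation code):

    import torch

    def f1_at_threshold(pred, gt, threshold=0.05):
        # pred: (N, 3) predicted points; gt: (M, 3) ground-truth points.
        dists = torch.cdist(pred, gt)  # (N, M) pairwise distances
        precision = (dists.min(dim=1)[0] < threshold).float().mean()  # pred -> gt
        recall = (dists.min(dim=0)[0] < threshold).float().mean()     # gt -> pred
        return 2 * precision * recall / (precision + recall + 1e-8)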

The mesh ranks second because, during training, the predicted mesh also needs to be smooth in addition to minimizing the chamfer loss, so the training objective is less aligned with the evaluation metric than for the point cloud. Moreover, since the number of vertices is limited, the triangle faces are relatively coarse compared to the point cloud, and points on those faces are not directly optimized to be close to the ground-truth points, so the mesh gets a worse F1 score.

The voxel prediction is trained on the occupancy of the 3D grid rather than on fine point locations. Since the voxel grid is fairly coarse, its surface lies farther from the ground truth than the point cloud or mesh, so it has the lowest F1 score.

2.5. Analyze effects of hyperparameter variations (10 points)


I analyze the effect of n_point for the point cloud representation. The comparisons are shown in the table below:

Type          n_point   w_chamfer   Avg F1@0.05
Point Cloud      1000         1.0        86.111
Point Cloud      5000         1.0        91.826
Point Cloud     10000         1.0        92.438

As we can observe, the F1 score goes up as the number of points increases. This is because sampling more points gives the model a finer supervision signal, so it can learn more detailed 3D structure from the ground truth during training.
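
The only change between these runs is how densely the ground-truth surface is sampled for the chamfer loss, e.g. (PyTorch3D, with an ico-sphere as a stand-in mesh):

    from pytorch3d.ops import sample_points_from_meshes
    from pytorch3d.utils import ico_sphere

    mesh_gt = ico_sphere(4)  # stand-in for a ground-truth mesh
    for n_points in (1000, 5000, 10000):
        # Denser sampling gives the chamfer loss a finer supervision signal.
        points_gt = sample_points_from_meshes(mesh_gt, num_samples=n_points)
        print(n_points, points_gt.shape)  # (1, n_points, 3)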

2.6. Interpret your model (15 points)


I interpret the model for the point cloud representation by visualizing the weights of the 64 first-layer convolutional filters. The visualization is shown below:

As we can observe, some filters look like edge filters that capture lines in the input images at different orientations, while others look like blob detectors that encode blob and corner features of the input images.
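
The visualization itself takes only a few lines of matplotlib; a sketch assuming a ResNet-style encoder whose first conv layer has 64 filters (here a torchvision resnet18 stands in for the trained encoder):

    import matplotlib.pyplot as plt
    from torchvision.models import resnet18

    # Stand-in encoder; in practice, load the trained model's checkpoint and
    # take its first convolutional layer instead.
    w = resnet18(weights="IMAGENET1K_V1").conv1.weight.detach()  # (64, 3, 7, 7)

    # Normalize each filter to [0, 1] so it renders as an RGB patch.
    w = w - w.amin(dim=(1, 2, 3), keepdim=True)
    w = w / w.amax(dim=(1, 2, 3), keepdim=True)

    fig, axes = plt.subplots(8, 8, figsize=(8, 8))
    for i, ax in enumerate(axes.flat):
        ax.imshow(w[i].permute(1, 2, 0))  # CHW -> HWC
        ax.axis("off")
    fig.savefig("filters.png")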

3. (Extra Credit) Exploring some recent architectures.


3.1. Implicit network (10 points)


Check the implementation and train/eval scripts in the code.

I first use one linear layer to embed each 3D coordinate, then concatenate the embedding with the image feature. For each 3D coordinate, the concatenated feature passes through an MLP that outputs a single scalar representing the probability that the location is occupied. As before, I use a cross-entropy loss to supervise the prediction at each location; a sketch of this architecture is shown below.
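
A minimal sketch of this decoder, with illustrative layer sizes (the write-up above fixes only the overall structure):

    import torch
    import torch.nn as nn

    class ImplicitDecoder(nn.Module):
        def __init__(self, feat_dim=512, embed_dim=128, hidden=256):
            super().__init__()
            self.coord_embed = nn.Linear(3, embed_dim)  # one linear layer on (x, y, z)
            self.mlp = nn.Sequential(
                nn.Linear(feat_dim + embed_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),  # one occupancy logit per query point
            )

        def forward(self, image_feat, coords):
            # image_feat: (B, feat_dim); coords: (B, N, 3) query locations.
            emb = self.coord_embed(coords)  # (B, N, embed_dim)
            feat = image_feat.unsqueeze(1).expand(-1, emb.shape[1], -1)
            return self.mlp(torch.cat([feat, emb], dim=-1)).squeeze(-1)  # (B, N)

    # Supervision: binary cross-entropy against ground-truth occupancy per point.
    decoder = ImplicitDecoder()
    logits = decoder(torch.randn(2, 512), torch.rand(2, 1024, 3))
    targets = torch.randint(0, 2, (2, 1024)).float()
    loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)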

The quantitative performance of the implicit model: Avg F1@0.05: 66.414. Though it performs similarly to the regular voxel model, the trained model can be queried at arbitrary resolutions at test time and produces smoother voxels.
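
Variable test-time resolution follows from querying the decoder on a dense grid of any size; reusing the ImplicitDecoder sketch above (the random feature stands in for a real encoder output):

    import torch

    R = 64  # any resolution, chosen at test time
    axis = torch.linspace(-1, 1, R)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    coords = grid.reshape(1, -1, 3)  # flatten to (1, R^3, 3) query points
    logits = decoder(torch.randn(1, 512), coords)
    voxels = (logits.sigmoid() > 0.5).reshape(R, R, R)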

Below are some qualitative results of the implicit model:

input RGB
predicted 3D voxel
GT 3D voxel
input RGB
predicted 3D voxel
GT 3D voxel
input RGB
predicted 3D voxel
GT 3D voxel

3.2. Parametric network (10 points)