The point cloud prediction gives the best F1 score, the mesh ranks second, and the voxel prediction has the worst F1 score.
The point cloud performs best because its learning objective (the loss function) is directly related to the evaluation metric,
which measures the distance between the predicted and ground truth points.
The mesh ranks second because, during training, the predicted mesh must also be smooth in addition to minimizing the chamfer loss,
so its training objective is less directly related to the evaluation metric than that of the point cloud. Meanwhile, since the number of
vertices is limited, the triangle faces of the mesh are relatively coarse compared to the point cloud, and the points on them are not
directly optimized to be close to ground truth points, so the mesh gets a lower F1 score.
The training of the voxel prediction focuses on the occupancy of the 3D grid rather than on fine point positions. Because the 3D voxel
grid is fairly coarse, its surface lies farther from the ground truth than that of the point cloud or the mesh, so it has the lowest F1 score.
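To make the link between the training loss and the metric concrete, here is a minimal NumPy sketch of a chamfer loss and an F1 score at a distance threshold. It is an illustration under assumptions, not the assignment's evaluation code: the function names, the use of squared distances in the loss, and the 0 to 100 scaling of F1 are all choices made for this example.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between all points of a (N, 3) and b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def chamfer_loss(pred, gt):
    """Symmetric chamfer loss: mean nearest-neighbor squared distance in both directions."""
    d2 = pairwise_sq_dists(pred, gt)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def f1_at_threshold(pred, gt, threshold=0.05):
    """F1 on a 0-100 scale (assumed), from precision/recall of nearest-neighbor distances."""
    d = np.sqrt(pairwise_sq_dists(pred, gt))
    precision = np.mean(d.min(axis=1) < threshold)  # predicted points near some GT point
    recall = np.mean(d.min(axis=0) < threshold)     # GT points near some predicted point
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)
```

Both quantities are built from the same nearest-neighbor distances, which is why lowering the chamfer loss tends to raise the F1 score directly, whereas a smoothness term or an occupancy loss only affects it indirectly.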
2.5. Analyse effects of hyperparameter variations (10 points)
I analyze the effect of n_point for the point cloud representation. The comparisons are shown in the table below:
| Type        | n_point | w_chamfer | Avg F1@0.05 |
|-------------|---------|-----------|-------------|
| Point Cloud | 1000    | 1.0       | 86.111      |
| Point Cloud | 5000    | 1.0       | 91.826      |
| Point Cloud | 10000   | 1.0       | 92.438      |
As we can observe, the performance (F1 score) improves as the number of points increases, although the gain from 5000 to 10000 points
(about 0.6) is much smaller than the gain from 1000 to 5000 points (about 5.7). This is because sampling more points gives the model
a finer supervision signal, so it can learn more detailed 3D structure from the ground truth during training.
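As a rough illustration of what "sampling more points" means, the sketch below draws points uniformly from a triangle mesh surface using area-weighted face selection and random barycentric coordinates. It assumes that n_point controls how many surface points are drawn. The function name and signature are hypothetical and do not come from the assignment code.

```python
import numpy as np

def sample_points_from_mesh(vertices, faces, n_points, rng=None):
    """Uniformly sample n_points from the surface of a triangle mesh.

    vertices: (V, 3) float array, faces: (F, 3) int array of vertex indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    tri = vertices[faces]  # (F, 3, 3): the three corners of every face
    # Triangle area = half the norm of the cross product of two edges.
    cross = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    # Pick faces with probability proportional to their area.
    face_idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Random barycentric coordinates; reflect samples that fall outside the triangle.
    u = rng.random(n_points)
    v = rng.random(n_points)
    flip = u + v > 1.0
    u[flip] = 1.0 - u[flip]
    v[flip] = 1.0 - v[flip]
    w = 1.0 - u - v
    t = tri[face_idx]
    return w[:, None] * t[:, 0] + u[:, None] * t[:, 1] + v[:, None] * t[:, 2]
```

With a larger n_points, the samples cover the surface more densely, so thin or small structures are more likely to be represented among the points the model is supervised on.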
2.6. Interpret your model (15 points)
I interpret the model for the point cloud representation by visualizing the weights of the 64 first-layer convolutional filters. The
visualization is shown below:
As we can observe, some filters look like edge filters that capture lines in the input images at different orientations. Meanwhile,
other filters look like blob detectors that encode blob and corner features in the input images.
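A visualization of this kind can be produced by normalizing each filter to [0, 1] and tiling the filters into one image. The sketch below assumes the first-layer weights are available as a NumPy array of shape (64, 3, k, k). The kernel size, the grid layout, and the helper name are assumptions made for illustration.

```python
import numpy as np

def filters_to_grid(weights, n_cols=8, pad=1):
    """Tile conv filters of shape (n_filters, 3, kh, kw) into one RGB image in [0, 1]."""
    n, c, kh, kw = weights.shape
    # Min-max normalize each filter independently so its structure is visible.
    flat = weights.reshape(n, -1)
    lo = flat.min(axis=1)[:, None, None, None]
    hi = flat.max(axis=1)[:, None, None, None]
    norm = (weights - lo) / np.maximum(hi - lo, 1e-8)
    n_rows = int(np.ceil(n / n_cols))
    # White background with pad pixels between neighboring filters.
    grid = np.ones((n_rows * (kh + pad) + pad, n_cols * (kw + pad) + pad, c))
    for i in range(n):
        row, col = divmod(i, n_cols)
        y = pad + row * (kh + pad)
        x = pad + col * (kw + pad)
        grid[y:y + kh, x:x + kw] = norm[i].transpose(1, 2, 0)  # (kh, kw, c)
    return grid
```

The returned array can be passed to any image viewer that accepts float RGB data in [0, 1]. Per-filter normalization makes oriented edge patterns and blob patterns easier to compare, at the cost of hiding differences in filter magnitude.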
3. (Extra Credit) Exploring some recent architectures.
3.1 Implicit network (10 points)
Check the implementation and train/eval scripts in the code.
I first use a single linear layer to embed the 3D coordinates, then concatenate the embedding with the image feature. For each 3D coordinate, the
concatenated feature passes through an MLP that outputs one scalar representing the probability that this location is occupied.
Similarly, I use the cross-entropy loss to supervise the prediction for each location.
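The forward pass described above can be sketched in plain NumPy as follows. This is a simplified illustration and not the submitted implementation: the layer widths, the single hidden layer, the ReLU activation, and all function names are assumptions. The actual model is in the train/eval scripts.

```python
import numpy as np

def init_implicit_decoder(feat_dim, embed_dim=64, hidden_dim=128, rng=None):
    """Random parameters for a toy implicit decoder (sizes are illustrative)."""
    rng = np.random.default_rng(0) if rng is None else rng

    def linear(n_in, n_out):
        return rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out)), np.zeros(n_out)

    return {
        "embed": linear(3, embed_dim),                 # embeds the 3D coordinates
        "fc1": linear(embed_dim + feat_dim, hidden_dim),
        "fc2": linear(hidden_dim, 1),                  # one occupancy logit per point
    }

def occupancy_logits(params, coords, image_feat):
    """coords: (P, 3), image_feat: (feat_dim,) -> logits of shape (P,)."""
    W, b = params["embed"]
    emb = coords @ W + b
    # The same image feature is shared by every query location.
    feat = np.broadcast_to(image_feat, (coords.shape[0], image_feat.shape[0]))
    h = np.concatenate([emb, feat], axis=1)
    W, b = params["fc1"]
    h = np.maximum(h @ W + b, 0.0)  # ReLU
    W, b = params["fc2"]
    return (h @ W + b)[:, 0]

def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy averaged over locations."""
    return np.mean(np.maximum(logits, 0.0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))
```

Applying a sigmoid to the logits gives the per-location occupancy probability. Because the decoder takes continuous coordinates as input, it is not tied to any particular grid size.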
The quantitative performance of the implicit model is Avg F1@0.05: 66.414. Though its performance is similar to that of the regular voxel model, the trained
model can support varying resolutions at test time and produce smoother voxels.
Below are some qualitative results of the implicit model:
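The resolution flexibility comes from the fact that an implicit model can be queried at any set of coordinates. The sketch below builds a grid of cell centers at a chosen resolution and thresholds the predicted occupancy. The bounding box of [-0.5, 0.5], the 0.5 threshold, and the helper names are assumptions, and occupancy_fn stands in for the trained model.

```python
import numpy as np

def make_query_grid(resolution, lo=-0.5, hi=0.5):
    """Cell-center coordinates of a resolution^3 grid, shape (resolution^3, 3)."""
    step = (hi - lo) / resolution
    centers = lo + step * (np.arange(resolution) + 0.5)
    xs, ys, zs = np.meshgrid(centers, centers, centers, indexing="ij")
    return np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)

def voxelize(occupancy_fn, resolution, threshold=0.5):
    """Query occupancy_fn (coords -> probabilities) on a grid and threshold it."""
    coords = make_query_grid(resolution)
    probs = occupancy_fn(coords)
    return (probs > threshold).reshape(resolution, resolution, resolution)

def sphere_occupancy(coords, radius=0.4):
    """Analytic stand-in for a trained model: 1 inside a sphere, 0 outside."""
    return (np.linalg.norm(coords, axis=1) < radius).astype(float)

coarse = voxelize(sphere_occupancy, 32)
fine = voxelize(sphere_occupancy, 64)
```

The same occupancy function yields both the 32^3 and the 64^3 grid without retraining. The finer grid follows the surface more closely, which is what makes the extracted voxels look smoother.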