16-889 Assignment 2: Single View to 3D

-1. Number of Late Days

[Image: number of late days used]

0. Setup

1. Exploring loss functions

1.1. Fitting a voxel grid

Ground Truth | Learned

1.2. Fitting a point cloud

Ground Truth | Learned

1.3. Fitting a mesh

Ground Truth | Learned

2. Reconstructing 3D from single view

2.1. Image to voxel grid

image - ground truth mesh - predicted voxel grid (isovalue = 0.0)

Step 0

Step 256

Step 384

2.2. Image to point cloud

image - ground truth mesh - ground truth point cloud - predicted point cloud

Step 0

Step 256

Step 384

2.3. Image to mesh

image - ground truth mesh - predicted mesh

Step 0

Step 256

Step 384

2.4. Quantitative comparisons

|         | Voxel  | Mesh   | Point  |
|---------|--------|--------|--------|
| F1@0.05 | 79.497 | 88.395 | 94.422 |

Analysis

Comparing the voxel grid prediction with the other two, its F1@0.05 score is the lowest. The reason could be the O(n^3) cost of both computing and storing a voxel grid, which limits the discretization granularity. The default grid size is 32x32x32, and most of the voxel values do not contribute meaningful geometry, whereas for the point and mesh predictions every decoded value plays a significant role in the final structure, leading to higher sample efficiency. One interesting phenomenon with the voxel grid decoder is that the predicted level-set mesh does not change significantly in appearance from 14000 to 18000 training iterations (roughly 280 and 360 epochs), and the training loss almost plateaus, yet the F1@0.05 score increases from ~65 to ~80. So if the voxel grid network were trained further, the F1@0.05 might increase further. Overall, if computation power were infinite and we could discretize 3D space arbitrarily finely, the voxel grid decoder should have similar potential to the point cloud decoder, as it imposes no constraints on topology or geometry.
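For reference, a minimal sketch of how an F1@0.05 of this kind can be computed from points sampled on the predicted and ground-truth shapes (the function and tensor names are illustrative, not the actual evaluation code from the starter repo):

```python
import torch

def f1_score_at_threshold(pred_points, gt_points, threshold=0.05):
    """F1 between two point sets: a predicted point counts as correct if it lies
    within `threshold` of some ground-truth point, and vice versa.

    pred_points: (N, 3) tensor sampled from the predicted shape
    gt_points:   (M, 3) tensor sampled from the ground-truth shape
    """
    # Pairwise Euclidean distances between predicted and ground-truth points.
    dists = torch.cdist(pred_points, gt_points)                      # (N, M)

    # Precision: fraction of predicted points close to some ground-truth point.
    precision = (dists.min(dim=1).values < threshold).float().mean()
    # Recall: fraction of ground-truth points close to some predicted point.
    recall = (dists.min(dim=0).values < threshold).float().mean()

    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return 100.0 * f1  # reported as a percentage, e.g. ~94 for the point decoder
```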

Compared with the mesh predictions, the point cloud predictions are also learned mainly from chamfer loss, yet they reach a much higher F1 score. The fundamental problem with the mesh predictions is that simply predicting offsets to a sphere mesh makes it hard to fit complex topologies with holes. What's worse, many chairs have sharp components or thin connections, e.g. the legs and the pillars on the back, and the smoothness loss term works against fitting them. Point clouds, on the contrary, are amorphous and adapt more easily to different chair topologies. In fact, my point cloud decoder is only 1 fully connected layer, mapping 512 latent variables to 5000 3D points, while for the mesh decoder I had to use 7 fully connected layers to reach an F1@0.05 close to 90. These trials are discussed further in the next section.
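As a concrete illustration of that single-layer point decoder, a minimal sketch assuming a 512-D latent and 5000 output points as stated above (module and attribute names are placeholders, not my exact code):

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    """Single fully connected layer: 512-D latent -> 5000 x 3 point cloud."""
    def __init__(self, latent_dim=512, n_points=5000):
        super().__init__()
        self.n_points = n_points
        self.fc = nn.Linear(latent_dim, n_points * 3)

    def forward(self, latent):
        # latent: (B, 512) -> points: (B, 5000, 3)
        return self.fc(latent).view(-1, self.n_points, 3)
```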

2.5. Analyse effects of hyperparameter variations

Here, the hyperparameters discussed are the ico_sphere level and the number of decoder layers for the mesh prediction. The ico_sphere level can be 4 or 5, and the decoder can have 1 or 7 layers. If the decoder has only 1 layer, its input and output dimensions are 512 and 3*ico_sphere_vertex_num; for the 7-layer decoder the layer dimensions are 512, 1024, 1024, 1024, 1024, 1024, 1024, and 3*ico_sphere_vertex_num in order.
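In code, the 7-layer mesh decoder described above would look roughly like the following sketch (the activation choice and module names are assumptions; only the layer dimensions follow the text):

```python
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshOffsetDecoder(nn.Module):
    """Maps a 512-D latent to per-vertex offsets of an ico_sphere template mesh."""
    def __init__(self, latent_dim=512, sphere_level=4, device="cpu"):
        super().__init__()
        self.template = ico_sphere(sphere_level, device)
        n_verts = self.template.verts_packed().shape[0]
        # 7-layer MLP: 512 -> 1024 x 6 -> 3 * n_verts (per the text above).
        dims = [latent_dim, 1024, 1024, 1024, 1024, 1024, 1024, n_verts * 3]
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                layers.append(nn.ReLU())
        self.mlp = nn.Sequential(*layers)

    def forward(self, latent):
        # latent: (B, 512) -> per-vertex offsets to add to the template: (B, n_verts, 3)
        offsets = self.mlp(latent)
        return offsets.view(latent.shape[0], -1, 3)
```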

The 4 combinations' F1@0.05 scores are listed as follows. In all cases, w_smooth was 0.1.

|                 | ico_sphere level = 4 | ico_sphere level = 5             |
|-----------------|----------------------|----------------------------------|
| 1-layer decoder | 68.705               | 73.382                           |
| 7-layer decoder | 82.863               | 78.943 (88.395 after fine-tuning) |

Increasing the number of decoder layers significantly improved F1@0.05 for both ico_sphere levels. This trend is probably because the decoder has to learn both the offsets and the sphere geometry: the output offsets are applied to their corresponding vertices, whose connectivity is predefined in the mesh. The network has to adapt to that geometry, but neurons in fully connected layers are implicitly assumed to play similar, if not identical, roles. This coincides with the observation that point predictions require fewer decoder layers and training iterations. The results might therefore be better if a graph convolutional layer were used in the decoder, as sketched below.
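As a rough sketch of that graph-convolution idea, using PyTorch3D's GraphConv over the ico-sphere vertex graph (this is untested speculation rather than something in my submission; all names are illustrative):

```python
import torch
import torch.nn as nn
from pytorch3d.ops import GraphConv
from pytorch3d.utils import ico_sphere

class GraphConvDecoderSketch(nn.Module):
    """Refine per-vertex features with graph convolutions before predicting offsets."""
    def __init__(self, latent_dim=512, sphere_level=4, hidden_dim=128, device="cpu"):
        super().__init__()
        self.template = ico_sphere(sphere_level, device)
        self.lift = nn.Linear(latent_dim + 3, hidden_dim)  # global latent + vertex position
        self.gconv1 = GraphConv(hidden_dim, hidden_dim)
        self.gconv2 = GraphConv(hidden_dim, 3)              # 3-D offset per vertex

    def forward(self, latent):
        # latent: (latent_dim,) for a single example in this sketch.
        verts = self.template.verts_packed()                # (V, 3)
        edges = self.template.edges_packed()                # (E, 2)
        # Attach the global latent code to every vertex position.
        feats = torch.cat([verts, latent.unsqueeze(0).expand(verts.shape[0], -1)], dim=1)
        feats = torch.relu(self.lift(feats))
        feats = torch.relu(self.gconv1(feats, edges))
        return self.gconv2(feats, edges)                    # per-vertex offsets, (V, 3)
```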

The role of the other factor, the ico_sphere level, is less clear than that of the decoder depth. Theoretically, a finer mesh should give better predictions since the resolution is higher. However, for the network with a 7-layer decoder, raising the ico_sphere level from 4 to 5 lowers the F1@0.05 score. I observed that the loss curve of the network with ico_sphere level 5 and 7 fully connected layers stagnates quite early. My final setting, which achieves 88.395 F1@0.05, is inspired by that observation: both the learning rate and w_smooth were reduced by a factor of 10. A lower w_smooth should make it easier for the network to reach a smaller chamfer loss, but it could also make the learning less stable, so a lower learning rate was used as well.
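For context, the mesh objective being tuned here is essentially a weighted sum of a chamfer term and a smoothness term; a minimal sketch using PyTorch3D's built-in losses (my actual training code may differ in details such as the smoothing method):

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(pred_mesh, gt_points, w_smooth=0.1, n_samples=5000):
    # Chamfer distance between points sampled from the predicted mesh
    # and points sampled from the ground-truth shape (gt_points: (B, M, 3)).
    pred_points = sample_points_from_meshes(pred_mesh, n_samples)
    loss_chamfer, _ = chamfer_distance(pred_points, gt_points)
    # Laplacian smoothness regularizer, weighted by w_smooth.
    loss_smooth = mesh_laplacian_smoothing(pred_mesh)
    return loss_chamfer + w_smooth * loss_smooth
```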

2.6. Interpret your model

One interesting visualization is the relationship between the latent feature and the output point cloud. I collected all latent features produced by my point cloud model's encoder and used PCA to compress them into a 2D space. The two axes of the following figure represent the first and second principal components, and the green dots are the compressed latent features.
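The dimensionality reduction itself is straightforward; a sketch of this step, assuming the encoder outputs are stacked into a (num_chairs, 512) array (file and variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

# latents: (num_chairs, 512) array of encoder outputs collected over the dataset.
latents = np.load("latent_features.npy")   # illustrative path

pca = PCA(n_components=2)
latents_2d = pca.fit_transform(latents)    # (num_chairs, 2): the green dots in the figure
```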

I was expecting different chair shapes to fall into different clusters, but surprisingly there was only one major cluster, and even when three principal components were considered there were still no two or more distinct clusters (otherwise I could have run K-Means to separate the clusters and visualized the point clouds decoded from their centers). Therefore, 20 2D points, shown in red and blue in the figure above, were uniformly sampled from two line segments roughly covering the four extremes of the cluster, and were then decoded by the second part of the point cloud model. Note that the red dots are sampled from left to right, and the blue ones from bottom to top.
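The red and blue probes were generated roughly as follows (a sketch: the segment endpoints are read off the PCA plot by eye and are illustrative, `pca` is the fitted object from the previous sketch, and `decoder` stands in for the second part of the point cloud model):

```python
import numpy as np
import torch

def sample_segment(start, end, n=10):
    """n points uniformly spaced on the 2D segment from start to end."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    return (1 - t) * np.asarray(start) + t * np.asarray(end)

red_2d = sample_segment((-4.0, 0.0), (4.0, 0.0))    # left -> right (illustrative endpoints)
blue_2d = sample_segment((0.0, -4.0), (0.0, 4.0))   # bottom -> top (illustrative endpoints)

# Map the 2D probes back to the 512-D latent space and decode them to point clouds.
probe_latents = pca.inverse_transform(np.concatenate([red_2d, blue_2d], axis=0))
with torch.no_grad():
    point_clouds = decoder(torch.from_numpy(probe_latents).float())  # (20, 5000, 3)
```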

red-0 | red-1 | red-2 | red-3 | red-4
red-5 | red-6 | red-7 | red-8 | red-9

From the PCA figure we can see that red dots 2 through 6 lie inside the cluster. The point clouds from red-2 and red-3 look similar to the chair in the 384-th image shown in sections 2.1~2.3. Red-5 and red-6 are at the center of the cluster, and they correspond to the most canonical chair shape. From red-4 to red-9 the shape of the point cloud stays stable while the chair size increases, with some minor variation in aspect ratio. These two trends indicate that the 512-D latent variables encode both shape and size information.

blue-0 | blue-1 | blue-2 | blue-3 | blue-4
blue-5 | blue-6 | blue-7 | blue-8 | blue-9

The results from the blue dots also reveal gradual shape morphing as the latent variable shifts. From blue-0 to blue-4 and from blue-5 to blue-9, the shape of the chair moves towards and then away from the canonical one. Specifically, from blue-0 to blue-4 the overall height decreases, and so does the height ratio of the chair legs; from blue-5 to blue-9 this ratio keeps decreasing to zero.