
1.1. Fitting a voxel grid (5 points)

(Figures: fitted voxel grid and ground-truth voxel grid.)
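The fit shown here is obtained by directly optimizing a voxel occupancy tensor against the target. A minimal sketch of such a fitting loop, assuming a binary cross-entropy objective on occupancy logits (the exact loss, grid size, and optimizer settings used for this result are assumptions):

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_logits, gt_occupancy):
    # Binary cross-entropy between predicted occupancy logits and the
    # 0/1 ground-truth grid (both assumed to be of shape (B, D, H, W)).
    return F.binary_cross_entropy_with_logits(pred_logits, gt_occupancy)

# Hypothetical fitting loop: optimize a free logit tensor toward gt_voxels.
gt_voxels = (torch.rand(1, 32, 32, 32) > 0.5).float()   # placeholder ground truth
pred_logits = torch.randn(1, 32, 32, 32, requires_grad=True)
optimizer = torch.optim.Adam([pred_logits], lr=1e-2)
for _ in range(1000):
    optimizer.zero_grad()
    voxel_loss(pred_logits, gt_voxels).backward()
    optimizer.step()
```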

1.2. Fitting a point cloud (10 points)

(Figures: fitted point cloud and ground-truth point cloud.)
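The fitted point cloud is likewise obtained by optimizing free point coordinates against the target, presumably with a Chamfer objective. A minimal self-contained sketch (the brute-force nearest-neighbour search via torch.cdist and the point count are illustrative choices, not necessarily what produced the result above):

```python
import torch

def chamfer_loss(src, tgt):
    # Symmetric Chamfer distance between point sets of shape (B, N, 3) and (B, M, 3):
    # mean squared distance from each point to its nearest neighbour in the other set.
    d = torch.cdist(src, tgt)                                    # (B, N, M) pairwise distances
    return (d.min(dim=2).values ** 2).mean() + (d.min(dim=1).values ** 2).mean()

# Hypothetical fitting loop: optimize free point coordinates toward gt_points.
gt_points = torch.rand(1, 5000, 3)                               # placeholder ground truth
pred_points = torch.randn(1, 5000, 3, requires_grad=True)
optimizer = torch.optim.Adam([pred_points], lr=1e-2)
for _ in range(2000):
    optimizer.zero_grad()
    chamfer_loss(pred_points, gt_points).backward()
    optimizer.step()
```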

1.3. Fitting a mesh (5 points)

(Figures: fitted mesh and ground-truth mesh.)
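The fitted mesh is obtained by deforming a template toward the target. A rough sketch using PyTorch3D, assuming a Chamfer term on points sampled from both meshes plus a Laplacian smoothness term (the template resolution, sample counts, and loss weight are assumptions, and the placeholder target stands in for the ground-truth mesh):

```python
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

src_mesh = ico_sphere(4)                      # template mesh to deform
target_mesh = ico_sphere(5)                   # placeholder target (in practice, the GT mesh)
offsets = torch.zeros_like(src_mesh.verts_packed(), requires_grad=True)
optimizer = torch.optim.Adam([offsets], lr=1e-2)
w_smooth = 0.1                                # assumed smoothness weight

for _ in range(2000):
    optimizer.zero_grad()
    new_mesh = src_mesh.offset_verts(offsets)
    pred_pts = sample_points_from_meshes(new_mesh, 5000)
    tgt_pts = sample_points_from_meshes(target_mesh, 5000)
    loss_chamfer, _ = chamfer_distance(pred_pts, tgt_pts)
    loss = loss_chamfer + w_smooth * mesh_laplacian_smoothing(new_mesh)
    loss.backward()
    optimizer.step()
```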

2.1. Image to voxel grid (15 points)

Avg F1@0.05 = 73.386

(Figures: qualitative results for the 7 evaluation inputs, three renders each.)

2.2. Image to point cloud (15 points)

Avg F1@0.05 = 92.637

(Figures: qualitative results for the 7 evaluation inputs, three renders each.)

2.3 Image to mesh (15 points)

Avg F1@0.05 = 85.821

(Figures: qualitative results for the 7 evaluation inputs, three renders each.)

2.4 Quantitative comparisons (10 points)

Mesh: Avg F1@0.05 = 85.823

Voxel grid: Avg F1@0.05 = 73.386

Pointcloud: Avg F1@0.05 = 92.637

Intuitively, having the decoder predict point clouds results in the highest (average) F1 score because it is the most flexible of the three representations: each predicted point can lie anywhere in 3D space. This means, for example, that point clouds can readily represent surfaces with holes and slim structures at arbitrary resolution (as long as enough output points are predicted). On the other hand, while the mesh decoder can also predict arbitrary vertex offsets relative to a template mesh initialized from a unit icosphere, it cannot change the template mesh's connectivity. In other words, since the mesh decoder always outputs a watertight mesh with the fixed topology of the icosphere, it cannot model chairs with discontinuities such as holes in the chair surface (cf. examples 3, 4, and 7 in Q2.3). Finally, the voxel grid exhibits the lowest F1 score most likely because, instead of predicting features (3D locations, for point clouds and meshes) at arbitrary locations in space, it can only predict features (binary occupancy) for discretized cells, preventing it from modeling the fine geometric details of the chairs.
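For reference, the F1@0.05 numbers above have the standard form: precision is the fraction of predicted points within 0.05 of some ground-truth point, recall is the fraction of ground-truth points within 0.05 of some predicted point, and F1 is their harmonic mean. A minimal sketch (the actual evaluation code may sample and batch points differently):

```python
import torch

def f1_score(pred_pts, gt_pts, threshold=0.05):
    # pred_pts: (N, 3) points predicted by / sampled from the model output,
    # gt_pts:   (M, 3) points sampled from the ground-truth shape.
    d = torch.cdist(pred_pts, gt_pts)                                # (N, M) pairwise distances
    precision = (d.min(dim=1).values < threshold).float().mean() * 100.0
    recall = (d.min(dim=0).values < threshold).float().mean() * 100.0
    return 2.0 * precision * recall / (precision + recall + 1e-8)
```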

2.5 Analyse effects of hyperparameter variations (10 points)

a) Voxel grid: decoder_layer_type = (deconv, deconv512, fc) Avg F1@0.05 = (73.386, 77.273, 79.949)

For decoders that predict voxel grids, we examine the effect of varying the decoder architecture. We experiment with the following architectures (a rough sketch of these variants is given after the list):

  1. "deconv": a decoder comprised of a stack of deconvolutional layers with a kernel size of 4 and a stride of 2, thereby increasing the output volume's resolution by a factor 2 with every successive deconvolution; the number of output channels for this architecture are (64, 64, 64, 64, 8, 1)

  2. "deconv512": a higher-capcity decoder comprised of the same number of deconvolutional layers with a kernel size 4 and a stride of 2, with a sequence of larger number of channels (512, 512, 512, 512, 32, 1)

  3. "fc": a fully-connected network with 2 hidden layers with output dimensions of (4096, 32^3) such that it has a similar number of parameters as "deconv512" (136 million parameters vs 131 million parameters).

Despite the accepted wisdom that problem-specific inductive biases generally improve performance at a fixed parameter count, the 'fc' variant of the decoder outperforms the 'deconv512' variant in terms of average F1@0.05. This suggests that the representational capacity of the decoder matters more here than its inductive bias, which is further corroborated by the fact that the higher-capacity deconvolutional decoder ('deconv512') significantly outperforms the baseline 'deconv' variant. The effect of capacity on the predicted voxel grids is evident in the qualitative comparisons below: the higher the decoder capacity, the better the model reproduces the thin structure of the chair legs as a contiguous whole.

(Figures: qualitative comparison of predicted voxel grids across the three decoder variants.)

b) Pointcloud: n_points = (2500, 5000, 10000, 20000) Avg F1@0.05 = (90.432, 92.637, 93.063, 94.234)

For decoders that predict point clouds, we examine the effect of varying the number of predicted points, adjusting the size of the final fully-connected layer of the decoder accordingly (a sketch of such a decoder follows the visualizations below). Every time we double the number of predicted points, the average F1@0.05 consistently increases, possibly suggesting that more predicted points allow the model to capture finer details of the ground-truth object. However, the following visualizations suggest that this quantitative trend is somewhat deceptive. When n_points = 2500, the predicted point cloud roughly captures the slim structure of the chair legs. As n_points increases, however, the structure of the legs disappears and the bottom half of the chair turns into a contiguous block. On the other hand, the decoder predicts an increasingly dense set of points along the line where the seat plane meets the back support, which likely offsets the drop in F1@0.05 that would otherwise be induced by the denser point clouds' failure to model the finer structures.

(Figures: qualitative comparison of predicted point clouds for n_points = 2500, 5000, 10000, and 20000.)
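A sketch of the kind of decoder this experiment corresponds to, where only the width of the final fully-connected layer depends on n_points (the hidden sizes, Tanh output range, and 512-d latent are assumptions):

```python
import torch
import torch.nn as nn

LATENT_DIM = 512  # assumed size of the encoder's global feature

class PointCloudDecoder(nn.Module):
    """MLP decoder whose final layer emits n_points * 3 coordinates."""

    def __init__(self, n_points=5000, hidden=1024):
        super().__init__()
        self.n_points = n_points
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, n_points * 3), nn.Tanh(),   # coordinates in [-1, 1]
        )

    def forward(self, feat):
        return self.net(feat).view(-1, self.n_points, 3)

for n in (2500, 5000, 10000, 20000):
    print(n, PointCloudDecoder(n_points=n)(torch.randn(2, LATENT_DIM)).shape)   # (2, n, 3)
```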

c) Mesh: num_GCN_layers = (2, 3, 4, 5)

Avg F1@0.05 = (85.593, 85.821, 85.117, 85.091)

Avg self-similarity = (0.9870, 0.9881, 0.9901, 0.9917)

For decoders that predict meshes, we examine the effect of changing the number of graph-convolutional layers in the decoder. We use graph-convolutional layers so that the mesh connectivity can be exploited when predicting the vertex offsets, which could help the decoder model the local geometry of the object; this may be one reason why the F1@0.05 score increases as we go from 2 to 3 GCN layers. However, the performance peaks at 3 layers, and every additional layer decreases the F1@0.05, eventually dropping below that of the decoder with only 2 GCN layers.

This is most likely because repeated application of the message-passing step of graph-convolutional layers induces a phenomenon known as "over-smoothing" [1], whereby the output features of vertices across the entire graph become virtually identical. We verify the existence of this over-smoothing phenomenon in our experiments by computing the cosine similarity between features of neighbouring vertices at the final GCN layer, averaged over the number of vertex pairs [1]; as the number of GCN layers increases, so does the average self-similarity score. This phenomenon can be detrimental to fine-grained prediction tasks such as mesh prediction, because the final linear layer of the decoder, which follows the last GCN layer, must predict the target vertex offsets from a set of nearly identical input features.

[1] Chen et al., "Measuring and Relieving the Over-smoothing Problem for Graph Neural Networks from the Topological View," AAAI 2020.
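The average self-similarity reported above is, roughly, the mean cosine similarity between final-layer features of connected vertex pairs. A minimal sketch (the edge list could come from, e.g., PyTorch3D's Meshes.edges_packed(); the toy tensors below are placeholders):

```python
import torch
import torch.nn.functional as F

def avg_neighbor_similarity(vertex_feats, edges):
    # vertex_feats: (V, C) per-vertex features after the final GCN layer.
    # edges:        (E, 2) long tensor of connected vertex-index pairs.
    src, dst = vertex_feats[edges[:, 0]], vertex_feats[edges[:, 1]]
    return F.cosine_similarity(src, dst, dim=-1).mean()

# Toy usage with random features and a tiny edge list.
feats = torch.randn(12, 128)
edges = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 0], [4, 5]])
print(avg_neighbor_similarity(feats, edges))
```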

2.6 Interpret your model (15 points)

One interesting observation from the visualizations in Q2.1 to Q2.3 is that the voxel grids, point clouds, and meshes predicted for the 4th and 7th input images are extremely similar to each other, even though the chairs depicted in these two images are different. This raises the question of whether the network has simply memorized a handful of chairs, each representing a distinct category (e.g. armchair vs. four-legged wooden chair), rather than modeling the subtle intra-category differences among the diverse chairs within a category. To check this, we plot the t-SNE embeddings of the trained encoder's outputs and compare them to the t-SNE embeddings of the outputs of the original ResNet18 encoder (before training on the R2N2 dataset), to gauge the extent of any mode collapse. There is not much evidence of mode collapse as a result of training on the R2N2 dataset: the distribution of embedded points after training (right figure), each of which corresponds to a single test-set sample, remains just as diverse as the distribution before training (left figure):

(Figure: t-SNE embeddings of encoder outputs before training (left) and after training (right).)
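A rough sketch of how the two embeddings can be produced (the helper names, the assumption that the encoder maps a batch of images to one 512-d global feature each, and the t-SNE settings are all illustrative; t-SNE here is scikit-learn's implementation):

```python
import torch
from sklearn.manifold import TSNE

@torch.no_grad()
def encoder_features(encoder, images):
    # images: (N, 3, H, W) tensor; the encoder is assumed to return (N, 512) features.
    encoder.eval()
    return encoder(images).cpu().numpy()

def tsne_embed(features, seed=0):
    # 2D t-SNE embedding of the (N, 512) feature matrix.
    return TSNE(n_components=2, perplexity=30, random_state=seed).fit_transform(features)

# Hypothetical usage (pretrained_encoder / trained_encoder / test_images are placeholders):
# emb_before = tsne_embed(encoder_features(pretrained_encoder, test_images))
# emb_after  = tsne_embed(encoder_features(trained_encoder, test_images))
# ...then scatter-plot emb_before (left) and emb_after (right).
```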