16-889: Learning for 3D Vision

Assignment 2

Adnan Ahmad (adnana)

 


1.1 Fitting a Voxel Grid

[Figure: fitted voxel grid]

 

 

1.2 Fitting a Point Cloud

 

 

1.3 Fitting a Mesh

 

 

[Figure: fitted mesh]

 

 

2.1 Image to Voxel Grid

 

[Figures: input RGB, predicted voxel grid, and ground truth for several example chairs]

2.2 Image to Point Cloud

 

 

[Figures: input RGB, predicted point cloud, and ground truth for example chairs]

 

 

2.3 Image to Mesh

 

 

[Figures: input RGB, predicted mesh, and ground truth for example chairs]

 

 

2.4 Quantitative Comparison

Voxel (Avg F1@0.05)    Points (Avg F1@0.05)    Mesh (Avg F1@0.05)
72.671                 85.330                  87.228

 

The results indicate that the mesh representation yields the best reconstructions, followed by the point cloud; the voxel representation performs the worst. This makes intuitive sense for multiple reasons. Voxels discretize 3D space into a grid and are therefore limited by the specified resolution. For our use case the resolution is 32×32×32, which is incapable of capturing the finer details of chairs such as holes and intricate design patterns. Meshes and point clouds are better suited to representing these fine-grained patterns and hence perform better.
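For reference, the F1@0.05 metric compares predicted and ground-truth point sets at a 0.05 distance threshold. Below is a minimal NumPy sketch of this computation; the function name and the brute-force distance matrix are my own illustrative choices, not the course's actual evaluation code (which evaluates on points sampled from each representation's surface).

```python
import numpy as np

def f_score(pred, gt, threshold=0.05):
    """F1 score (in percent) between two point clouds at a distance threshold.

    pred: (N, 3) predicted points, gt: (M, 3) ground-truth points.
    A predicted point counts toward precision if its nearest GT point
    is within `threshold`; recall is the symmetric GT-to-pred check.
    """
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = (d.min(axis=1) < threshold).mean()  # pred -> gt
    recall = (d.min(axis=0) < threshold).mean()     # gt -> pred
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)
```

The brute-force (N, M) matrix is fine at assignment scale; a KD-tree or pytorch3d's knn_points would be used for larger clouds.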

2.5 Hyperparameter Variations

 

n_points    Avg F1@0.05
500         77.538
5000        85.330
15000       89.202

 

I varied the n_points hyperparameter over the set {500, 5000, 15000}. The results of this experiment are shown in the table above and the visualization below. The model performs worst with the fewest points and gradually improves as the number of points increases. This makes sense: as we increase the number of points, we also increase the representative power of our 3D model. More points can capture finer details in chair designs and model the wide range of patterns found across the test set. This is visible in the visualizations below, where the lower half of the chair is captured in much more detail when we use 15,000 points.
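The effect of n_points comes from how densely the target surface is sampled. As a sketch of the usual area-weighted surface sampling (the idea behind pytorch3d's sample_points_from_meshes; the function below is my own illustrative NumPy version, not the assignment's code):

```python
import numpy as np

def sample_points(verts, faces, n_points, rng=None):
    """Uniformly sample n_points from the surface of a triangle mesh.

    verts: (V, 3) float array of vertex positions.
    faces: (F, 3) int array of vertex indices per triangle.
    """
    rng = np.random.default_rng() if rng is None else rng
    tris = verts[faces]  # (F, 3, 3): three corners per face
    # Triangle areas via the cross product of two edge vectors.
    cross = np.cross(tris[:, 1] - tris[:, 0], tris[:, 2] - tris[:, 0])
    areas = 0.5 * np.linalg.norm(cross, axis=1)
    # Pick faces with probability proportional to area.
    idx = rng.choice(len(faces), size=n_points, p=areas / areas.sum())
    # Uniform barycentric coordinates: reflect (u, v) into the triangle.
    u, v = rng.random(n_points), rng.random(n_points)
    flip = u + v > 1
    u[flip], v[flip] = 1 - u[flip], 1 - v[flip]
    t = tris[idx]
    return t[:, 0] + u[:, None] * (t[:, 1] - t[:, 0]) + v[:, None] * (t[:, 2] - t[:, 0])
```

Area weighting matters: sampling faces uniformly instead would over-represent regions with many small triangles.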

 

                               

GT Image                N = 500 points                N = 5000 points                N = 15000 points

[Figure: GT image and predicted point clouds for N = 500, 5000, and 15000 points]

 

 

 

 

2.6 Interpreting the Model – Analyzing the Autoencoder Latent Space

 

In this section, I analyze the intermediate latent vector representation between the encoder and decoder. Several papers have shown that intermediate vectors in autoencoder-style models capture semantic information about the input sample. Based on this, my intention was to check whether this vector learns interesting properties about the various chairs in the dataset. My hypothesis was that, given a query chair's latent vector, finding its K nearest neighbors in the latent space should retrieve chairs that look similar to the query chair. This is validated in the results below, where I show the 4 chairs found closest to the query chair (arranged in descending order of similarity) in the intermediate latent space. This indicates that our autoencoder is learning useful semantic information about chairs as it learns to predict their 3D structure.
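The retrieval described above reduces to a nearest-neighbor search over latent vectors. A minimal sketch, assuming the dataset's latents are stacked into an (N, D) array; the Euclidean metric and function name are my choices here (cosine distance would work similarly):

```python
import numpy as np

def nearest_chairs(query, latents, k=4):
    """Indices of the k latent vectors closest to `query` (Euclidean).

    query: (D,) latent vector of the query chair.
    latents: (N, D) latent vectors for the dataset.
    Returns indices ordered from most to least similar.
    """
    d = np.linalg.norm(latents - query[None, :], axis=1)
    return np.argsort(d)[:k]
```

In practice the query chair's own entry should be excluded from the returned neighbors, since it is trivially at distance zero.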

 

Query Chair                             Closest neighbors in the intermediate autoencoder latent space

[Figures: query chair and its four closest neighbors in the intermediate autoencoder latent space]