Assignment 2

Paritosh Mittal (paritosm)

1.1 Fitting a Voxel Grid

[Figure: Ground Truth Target (left) | Fit Voxel Grid (right)]

1.2 Fitting Point Cloud

[Figure: Ground Truth Target (left) | Fit Point Cloud (right)]

1.3 Fitting Mesh

[Figure: Ground Truth Target (left) | Fit Mesh (right)]

2.1 Image to Voxel Grid

[Figure: three examples — Single View Image | GT | Reconstructed Voxel]

2.2 Image to Point Cloud

[Figure: three examples — Single View Image | GT Point Cloud | Reconstructed Point Cloud]

2.3 Image to Mesh

[Figure: three examples — Single View Image | GT Mesh | Reconstructed Mesh]

2.4 Quantitative Comparisons

| 3D Representation | Voxels (32**3) | Point Cloud | Mesh |
| --- | --- | --- | --- |
| Avg F1@0.05 Score | 83.4 | 96.3 | 96.0 |

The above numbers are computed using 5K points. Intuitively, these numbers make sense.

The score is lowest for voxels, mainly because voxel prediction is an expensive task and most generations do not capture shape features very well. For example, multiple reconstructions had missing legs (thin structures), and the designs of chair backs were also not completely captured. Voxel prediction networks need to predict empty regions as well as occupied (inside-object) regions.
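For reference, a minimal sketch of this per-voxel supervision (shapes and names here are assumptions, not the exact assignment code):

```python
import torch
import torch.nn.functional as F

# Assumed shapes: a batch of 32**3 occupancy logits from the decoder.
pred_logits = torch.randn(4, 32, 32, 32)                   # one logit per voxel
gt_occupancy = (torch.rand(4, 32, 32, 32) > 0.5).float()   # 1 = inside object, 0 = empty

# BCE over every voxel: empty space is supervised just as much as the
# occupied interior, which is part of what makes voxel prediction expensive.
loss = F.binary_cross_entropy_with_logits(pred_logits, gt_occupancy)
```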

For meshes, the number is high as mesh deformations are dense and hence can represent shapes faithfully. Still, the deformations cannot incorporate holes and are observed to be pointy. Since meshes deal with relatively fewer input dimensions and model volumes better, their F1 score is higher.

For point clouds, the F1 score at the 0.05 threshold is highest (with a small margin over meshes). This makes sense as point clouds allow significantly more freedom to represent objects, since the output space is sparse. For example, we need ~32K outputs to represent voxels but only 5K points for point clouds. It is observed that point clouds capture global shape features better (with at least some points near thin structures), thereby improving the overall score.

2.5 Analyse Effects of Hyperparameter Variations

In these experiments, I used the following hyperparameters and design choices:

| Learning Rate | Batch Size | Max Iters | Scheduler Step Size | n_points | w_smooth | Optimizer (design choice) | Scheduler (design choice) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0008 | 32 | 7200 | 200 iters | 5000 | 0.2 | AdamW | StepLR |
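For reference, a minimal sketch of this optimizer/scheduler configuration (the StepLR decay factor gamma is not recorded above, so PyTorch's default is shown; the model below is a stand-in):

```python
import torch

model = torch.nn.Linear(3, 3)  # stand-in for the reconstruction network

optimizer = torch.optim.AdamW(model.parameters(), lr=0.0008)
# gamma not recorded in the table above; PyTorch's default (0.1) is used here
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200)

for step in range(7200):  # max iters
    optimizer.zero_grad()
    # ... forward pass and loss.backward() go here ...
    optimizer.step()
    scheduler.step()      # decays the learning rate every 200 iterations
```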

While most of these choices were the result of numerous debugging and performance-based optimizations (Grad Student Descent), in this section I explore the impact of 'n_points', the number of points in the point cloud, for Single Image to Point Cloud. I consider [50, 500, 5000, 10000] as possible values for 'n_points'.

[Figure: training loss curves with 50, 500, 5000, and 10000 points]

The above curves show the training loss. It is interesting that while all runs converge well, the magnitude of the loss is approximately proportional to the number of points. This is because my implementation of Chamfer distance takes a sum over all points. Next, I compute the F1 score at the 0.05 threshold.
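A minimal sketch of such a sum-reduced Chamfer distance (illustrative; names and shapes are assumptions, not the exact assignment code):

```python
import torch

def chamfer_sum(pred, gt):
    # pred: (B, N, 3), gt: (B, M, 3)
    d = torch.cdist(pred, gt)                              # (B, N, M) pairwise distances
    loss = d.min(dim=2).values.pow(2).sum(dim=1)           # each pred point -> nearest gt
    loss = loss + d.min(dim=1).values.pow(2).sum(dim=1)    # each gt point -> nearest pred
    return loss.mean()                                     # averaged over the batch only
```

With this reduction, increasing n_points raises the loss magnitude even at the same per-point error, which matches the curves above.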

| Number of Points | Avg F1@0.05 |
| --- | --- |
| 50 | 41.91 |
| 500 | 89.93 |
| 5000 | 96.29 |
| 10000 | 97.29 |

This clearly indicates that the F1 score improves with a larger number of points, since more points enable better capture of local shape features. Note that the same number of points was used for training and evaluation. This contrasts with the raw loss values, as expected, since the sum-reduced loss grows with the number of points. However, it is also important to note that training is slower for larger numbers of points. Given the minor improvement from 5K to 10K points, I picked 5K points for reporting my results.

2.6 Interpret your model

Single-view 2D-to-3D reconstruction is an ill-posed problem, as more information (especially local features) needs to be created than what a 2D image can offer. Hence, to better interpret what the model does, I qualitatively visualize the performance of the model given different views of the same object. I modified the 'r2n2_custom.py' file to return different views and developed a 'qual_eval.py' script to perform this task (run with the '--eval_views' flag).

[Figure: voxel reconstructions from informative vs. uninformative views — details about backrest vs. most common back given no info; thin armrest vs. possibly high armrest; visible armrest captured vs. invisible armrest]

For single image to voxel, we visualize three objects from views with more and less occlusion. This shows that using a 32x32x32 voxel grid results in an approximately correct global shape but misses most local details.

[Figure: mesh reconstructions from informative vs. uninformative views — slanted back vs. flat back; armrest vs. missing armrest; curved leg based on info vs. four legs based on prior+image]

The figure above shows single-image-based mesh predictions.

These visualizations signify the high impact of the data-driven prior learnt by the neural model: when the view is uninformative, the model falls back on this prior. Hence, the performance on this task is limited.

Next, I wish to measure the drift between predicted and true 3D representations. Hence, I overlay the predicted voxel/point cloud on the ground-truth representation. The script is included in 'qual_eval.py' and can be run using '--eval_overlap'. Note that purple denotes the prediction and yellow denotes the ground truth.
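A hypothetical sketch of such an overlay for point clouds (the actual plotting code lives in 'qual_eval.py'):

```python
import matplotlib.pyplot as plt
import numpy as np

def overlay(pred_pts, gt_pts):
    # pred_pts, gt_pts: (N, 3) arrays; prediction in purple, ground truth in yellow
    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(*pred_pts.T, c="purple", s=1, label="predicted")
    ax.scatter(*gt_pts.T, c="yellow", s=1, label="ground truth")
    ax.legend()
    plt.show()

overlay(np.random.rand(5000, 3), np.random.rand(5000, 3))  # dummy example
```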

[Figure: predicted (purple) vs. ground truth (yellow) overlays — height offset; height offset; no height offset; slant offset; leg slant offset; back leg offset; leg offset]

This visualization highlights that while most shapes align well with the GT representations, in certain scenarios a shape can have a slight offset. It also highlights the importance of multi-view consistency.

I also observe that voxel predictions often miss thin legs. This 'case of missing legs' indicates that the predicted occupancy probability of these voxels is likely low (as the features were thin, e.g. legs). Hence, if we decrease the occupancy threshold (which makes the shape thicker), thin(ner) features might appear without re-training the model.
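A minimal sketch of this threshold sweep (using scikit-image's marching cubes; 'probs' below is a dummy stand-in for the predicted occupancy grid):

```python
import numpy as np
from skimage import measure

probs = np.random.rand(32, 32, 32).astype(np.float32)  # stand-in for predicted occupancy

# Lower thresholds mark more voxels as occupied, thickening the shape and
# potentially recovering thin structures such as chair legs.
for tau in (0.5, 0.3, 0.1):
    verts, faces, _, _ = measure.marching_cubes(probs, level=tau)
```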

3.1 Implicit Network (Extra Credit)

The benefit of an implicit decoder is that we can generate shapes at any arbitrary resolution without re-training. Hence, I visualize the results at 8**3, 16**3, 32**3, and 64**3. Note that the model is trained only on 32**3-resolution GT. I could not go above 64**3 due to hardware limitations.

During inference, I arrange voxel_dimension x voxel_dimension x voxel_dimension 3D points on a regular grid (spanning [-1, 1] along each axis) and compute the occupancy probability for each 3D location. For visualization, I reshape the 3D probabilities into an occupancy grid and use marching cubes.
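A sketch of this query procedure (the decoder below is a stand-in; the real one is conditioned on the image feature):

```python
import torch

def implicit_decoder(points):
    # stand-in for the trained, image-conditioned decoder:
    # occupancy of a sphere of radius 0.7
    return (points.norm(dim=-1) < 0.7).float()

D = 64                                           # any resolution, no re-training needed
axis = torch.linspace(-1.0, 1.0, D)
xs, ys, zs = torch.meshgrid(axis, axis, axis, indexing="ij")
points = torch.stack([xs, ys, zs], dim=-1).reshape(-1, 3)   # (D**3, 3) query locations

occupancy = implicit_decoder(points).reshape(D, D, D)       # grid for marching cubes
```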

[Figure: reconstructions of five objects at 8**3, 16**3, 32**3, and 64**3 resolutions]

The shapes are observed to be relatively smooth and better aligned with the information from the single-view 2D image.

3.2 Parametric Network (Extra Credit)

Parametric networks project 2D points into 3D. This implementation uses a combination of 5 decoders, such that each decoder projects total_points // 5 points onto its own 3D surface patch.
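A hedged sketch of this 5-decoder design (the feature and hidden dimensions are assumptions, not the exact assignment code):

```python
import torch
import torch.nn as nn

class ParametricDecoder(nn.Module):
    def __init__(self, feat_dim=512, num_decoders=5):
        super().__init__()
        # one small MLP per surface patch, each mapping (feature, 2D sample) -> 3D point
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim + 2, 256), nn.ReLU(), nn.Linear(256, 3))
            for _ in range(num_decoders)
        )

    def forward(self, feat, total_points=5000):
        # feat: (B, feat_dim) image feature; total_points // 5 points per decoder
        n = total_points // len(self.decoders)
        patches = []
        for dec in self.decoders:
            uv = torch.rand(feat.shape[0], n, 2, device=feat.device)     # 2D samples
            x = torch.cat([feat.unsqueeze(1).expand(-1, n, -1), uv], dim=-1)
            patches.append(dec(x))                                       # (B, n, 3)
        return torch.cat(patches, dim=1)                                 # (B, total_points, 3)

decoder = ParametricDecoder()
points = decoder(torch.randn(2, 512), total_points=10000)                # (2, 10000, 3)
```

Because total_points is only an argument to forward, the same trained weights can produce 50 or 10000 points, which is what the experiment below exercises.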

The initial visualization looks like:

[Figure: initial parametric network prediction]

A key advantage of a parametric network is that we can sample an arbitrary number of 2D points and predict the locations of the corresponding 3D points. This enables the model to generate point clouds of arbitrary resolution without any re-training. In this experiment, I visualize the generated point clouds with 50, 500, 5000, and 10000 points. Note that the parametric model was trained using 5000 randomly sampled points.

[Figure: point clouds of five objects generated with 50, 500, 5000, and 10000 points]

Using a parametric network enables dense predictions even when the model is trained on sparser point sets.