| Ground Truth Target | Fit Voxel Grid |
| --- | --- |
| ![]() | ![]() |
| Ground Truth Target | Fit Point Cloud |
| --- | --- |
| ![]() | ![]() |
| Ground Truth Target | Fit Mesh |
| --- | --- |
| ![]() | ![]() |
| Single View Image | GT | Reconstructed Voxel |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| Single View Image | GT Point Cloud | Reconstructed Point Cloud |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| Single View Image | GT Mesh | Reconstructed Mesh |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| 3D Representation | Voxels (32**3) | Point Cloud | Mesh |
| --- | --- | --- | --- |
| Avg F1@0.05 Score | 83.4 | 96.3 | 96.0 |
The above numbers are computed using 5K points and match intuition.

The score is lowest for voxels, mainly because voxel prediction is an expensive task and most generated shapes do not capture fine shape features well. For example, several reconstructions had missing legs (thin structures), and the design of chair backs was also not fully captured. Voxel prediction networks must classify both empty and occupied (inside-object) regions of the grid.

For meshes, the score is high because mesh deformations are dense and can therefore represent shapes faithfully. Still, the deformations cannot introduce holes, and the surfaces tend to look pointy. Since meshes have relatively fewer output dimensions and model volumes better, their F1 score is higher than that of voxels.

For point clouds, the F1 score at the 0.05 threshold is highest (by a small margin over meshes). This makes sense because point clouds allow significantly more freedom to represent objects, as the output space is sparse: we need ~32K outputs to represent a 32**3 voxel grid but only 5K points for a point cloud. Point clouds are observed to capture global shape features better (with at least some points near thin structures), which improves the overall score.
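For reference, the sketch below shows one way an F1-at-a-threshold metric can be computed from two sampled point sets. The function name and normalization are my own choices here and may differ in detail from the course's evaluation code.

```python
import torch

def f1_score(pred_points, gt_points, threshold=0.05):
    """F1 at a distance threshold between two sampled point sets (sketch).

    pred_points: (N, 3) points sampled from the predicted surface.
    gt_points:   (M, 3) points sampled from the ground-truth surface (here, 5K each).
    """
    dists = torch.cdist(pred_points, gt_points)                       # (N, M) pairwise distances
    precision = (dists.min(dim=1).values < threshold).float().mean()  # pred points near some GT point
    recall = (dists.min(dim=0).values < threshold).float().mean()     # GT points near some pred point
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)  # reported as a percentage
```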
In these experiments, I used the following hyper-parameters and design choices:
| Learning Rate | Batch Size | Max Iters | Scheduler Step Size | n_points | w_smooth | Optimizer (design choice) | Scheduler (design choice) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0008 | 32 | 7200 | 200 iters | 5000 | 0.2 | AdamW | StepLR |
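As a minimal sketch, these settings map onto PyTorch roughly as follows; the model here is a stand-in for the actual decoder, and the StepLR decay factor (gamma) is an assumption since it is not listed in the table.

```python
import torch

model = torch.nn.Linear(512, 3)  # placeholder for the actual decoder network

# AdamW optimizer (design choice) with the learning rate from the table above.
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-4)

# StepLR scheduler (design choice), stepped every 200 iterations.
# gamma=0.5 is illustrative; the report does not state the decay factor.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)

for step in range(7200):  # max iters
    optimizer.zero_grad()
    # loss = ...           # Chamfer / BCE / Chamfer + w_smooth * Laplacian, per representation
    # loss.backward()
    optimizer.step()
    scheduler.step()       # stepped per iteration, matching the 200-iter step size
```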
While most of these choices were the result of extensive debugging and performance-based tuning (Grad Student Descent), in this section I explore the impact of 'n_points', the number of points in the point cloud, for Single Image to Point Cloud. I consider [50, 500, 5000, 10000] as possible values for 'n_points'.
| Training loss with 50 points | Training loss with 500 points | Training loss with 5000 points | Training loss with 10000 points |
| --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() |
Above are the training-loss curves. It is interesting that while all settings converge well, the magnitude of the loss is approximately proportional to the number of points: my implementation of the Chamfer distance takes a sum over all points rather than an average (a minimal sketch is shown below). The table that follows reports the F1 score at the 0.05 threshold for each setting.
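The sketch below illustrates the summed (rather than averaged) Chamfer formulation. It uses a dense torch.cdist for clarity, whereas a real implementation would typically use a nearest-neighbour query (e.g. pytorch3d's knn_points), and the exact reduction in my training code may differ slightly.

```python
import torch

def chamfer_sum(pred, gt):
    """Chamfer distance summed over points (not averaged).

    pred: (B, N, 3) predicted point cloud; gt: (B, M, 3) ground truth.
    Because distances are summed, the loss magnitude grows roughly
    linearly with the number of points, matching the curves above.
    """
    dists = torch.cdist(pred, gt)                           # (B, N, M) pairwise distances
    pred_to_gt = dists.min(dim=2).values.pow(2).sum(dim=1)  # nearest GT point per predicted point
    gt_to_pred = dists.min(dim=1).values.pow(2).sum(dim=1)  # nearest predicted point per GT point
    return (pred_to_gt + gt_to_pred).mean()                 # average only over the batch
```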
| Number of Points | Avg F1@0.05 |
| --- | --- |
| 50 | 41.91 |
| 500 | 89.93 |
| 5000 | 96.29 |
| 10000 | 97.29 |
This clearly indicates that the F1 score improves as the number of points grows (the same number of points was used for training and evaluation), since more points capture local shape features better. As expected, this trend runs opposite to the raw loss values, which are not comparable across settings because of the summed Chamfer loss. It is also important to note that training is slower for larger numbers of points. Because the improvement from 5K to 10K points is minor, I picked 5K points for reporting my results.
Single-view 2D-to-3D reconstruction is an ill-posed problem: more information (especially about local features) must be inferred than a single 2D image offers. Hence, to better interpret what the model does, I qualitatively visualize its performance on different views of the same object. I modified 'r2n2_custom.py' to return different views and developed a 'qual_eval.py' script for this task; use the flag '--eval_views'.
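A rough sketch of how the flags in 'qual_eval.py' might be wired; the dataset loading and rendering calls are placeholders for the assignment's own code.

```python
# qual_eval.py (sketch): flag handling only.
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--eval_views", action="store_true",
                        help="Visualize predictions from informative vs. uninformative views")
    parser.add_argument("--eval_overlap", action="store_true",
                        help="Overlay the predicted shape on the ground truth (used later in this section)")
    args = parser.parse_args()

    if args.eval_views:
        pass  # load views from the modified r2n2_custom dataset and render per-view predictions
    if args.eval_overlap:
        pass  # render prediction (purple) on top of ground truth (yellow)

if __name__ == "__main__":
    main()
```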
| Informative View | Uninformative View | Informative View | Uninformative View | Informative View | Uninformative View |
| --- | --- | --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Details about backrest | Most common back given no info | Thin arm-rest | Possibly high armrest | Visible armrest captured | Invisible armrest |
For single image to voxel, we visualize three objects with less and more occlusion. This shows that 32x32x32 voxels yield an approximately correct global shape but miss most local detail.
| Informative View | Uninformative View | Informative View | Uninformative View | Informative View | Uninformative View |
| --- | --- | --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Slanted back | Flat back | Arm-rest | Missing armrest | Curved leg based on info | Four legs based on prior+image |
Single-image-based mesh prediction is shown above. These visualizations signify the high impact of the data-driven prior learnt by the neural model; hence, the performance achievable on this task is limited.
Next, I wish to measure the drift between the predicted and true 3D representations. Hence, I overlay the predicted voxel grid / point cloud on the ground-truth representation. The script is included in 'qual_eval.py' and can be run using '--eval_overlap'. Note that purple denotes the prediction and yellow denotes the ground truth.
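A simple matplotlib stand-in for this overlay; the report's figures use the course renderer, so the function below only illustrates the idea of colouring prediction and ground truth differently.

```python
import matplotlib.pyplot as plt
import numpy as np

def overlay_point_clouds(pred_points, gt_points, out_path="overlay.png"):
    """Overlay prediction (purple) on ground truth (yellow).

    pred_points, gt_points: (N, 3) numpy arrays of sampled surface points.
    """
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(*gt_points.T, s=1, c="yellow", label="ground truth")
    ax.scatter(*pred_points.T, s=1, c="purple", label="prediction")
    ax.legend()
    fig.savefig(out_path, dpi=200)
    plt.close(fig)

# Example usage with random stand-in data:
# overlay_point_clouds(np.random.rand(5000, 3), np.random.rand(5000, 3))
```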
| Height Offset | Height Offset | No Height Offset | Slant Offset | Leg Slant Offset | Back Leg Offset | Leg Offset |
| --- | --- | --- | --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
This visualization shows that while most shapes align well with the GT representations, in certain scenarios a shape can have a slight offset. It also highlights the importance of multi-view consistency.
I also observe that voxel predictions often miss thin legs. This 'case of the missing legs' suggests that the predicted occupancy probability of these voxels is low (since the corresponding features, e.g. legs, are thin). Hence, if we lower the occupancy threshold (which makes the shape thicker), the thin(ner) features might appear without re-training the model.
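A sketch of what that could look like, assuming the prediction is stored as a (D, D, D) probability grid and using scikit-image's marching cubes; the assignment's own mesh-extraction utilities may differ.

```python
from skimage import measure

def voxels_to_mesh(occupancy_probs, threshold=0.5):
    """Extract a surface mesh from predicted per-voxel occupancy probabilities.

    occupancy_probs: (D, D, D) array of probabilities in [0, 1].
    Lowering `threshold` below 0.5 treats more low-confidence voxels as
    occupied, which can recover thin structures such as chair legs
    without any re-training.
    """
    verts, faces, normals, _ = measure.marching_cubes(occupancy_probs, level=threshold)
    return verts, faces

# e.g. voxels_to_mesh(probs, threshold=0.3) to make the reconstruction thicker
```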
The benefit of an implicit decoder is that we can generate shapes at arbitrary resolution without re-training. Hence, here I visualize the results at 8**3, 16**3, 32**3 and 64**3. Note that the model is trained only on 32**3-resolution ground truth; I could not go above 64**3 due to hardware limitations.
During inference, I arrange voxel_dimension x voxel_dimension x voxel_dimension 3D points (spanning [-1, 1] along each axis) and compute the occupancy probability at each 3D location. For visualization, I reshape the 3D probabilities into an occupancy grid and apply marching cubes.
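A minimal sketch of this inference procedure, assuming the decoder takes an image feature and a batch of (x, y, z) points and returns occupancy logits; the actual model's signature may differ.

```python
import torch

@torch.no_grad()
def query_occupancy_grid(decoder, image_feature, resolution=64):
    """Evaluate an implicit occupancy decoder on a dense [-1, 1]^3 grid.

    Returns a (resolution, resolution, resolution) probability grid that can
    be passed to marching cubes, as in the previous sketch.
    """
    axis = torch.linspace(-1.0, 1.0, resolution)
    xx, yy, zz = torch.meshgrid(axis, axis, axis, indexing="ij")
    points = torch.stack([xx, yy, zz], dim=-1).reshape(-1, 3)  # (resolution**3, 3)
    logits = decoder(image_feature, points)                    # assumed signature: (feat, points) -> (P,)
    return torch.sigmoid(logits).reshape(resolution, resolution, resolution)
```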
| 8**3 | 16**3 | 32**3 | 64**3 |  |
| --- | --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
The shapes are observed to be relatively smooth and better aligned with the information from the single-view 2D image.
The parametric network projects 2D points into 3D. For this implementation, I use a combination of 5 decoders, such that total_points // 5 points are projected by each decoder[i] onto a 3D surface patch.
The initial visualization looks like:
A key advantage of the parametric network is that we can sample an arbitrary number of 2D points and predict the locations of the corresponding 3D points. This enables the model to generate point clouds of arbitrary resolution without any re-training. In this experiment, I visualize the generated point clouds with 50, 500, 5000 and 10000 points. Note that the parametric model was trained using 5000 randomly sampled points. A rough sketch of the multi-decoder forward pass is shown below.
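This is only an illustrative sketch: the layer widths, feature dimension, and the way 2D points are sampled are assumptions, not the exact architecture used in this report.

```python
import torch
from torch import nn

class ParametricDecoder(nn.Module):
    """Sketch of a 5-patch parametric point-cloud decoder.

    Each of the 5 MLPs maps (image feature, random 2D point) -> 3D point,
    so total_points // 5 points come from each decoder.
    """
    def __init__(self, feat_dim=512, n_decoders=5):
        super().__init__()
        self.decoders = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim + 2, 512), nn.ReLU(),
                          nn.Linear(512, 512), nn.ReLU(),
                          nn.Linear(512, 3))
            for _ in range(n_decoders)
        ])

    def forward(self, feat, total_points):
        # feat: (B, feat_dim) image feature; total_points can differ from training.
        B = feat.shape[0]
        n = total_points // len(self.decoders)
        patches = []
        for dec in self.decoders:
            uv = torch.rand(B, n, 2, device=feat.device)                     # random 2D samples in [0, 1)^2
            inp = torch.cat([feat.unsqueeze(1).expand(-1, n, -1), uv], dim=-1)
            patches.append(dec(inp))                                         # (B, n, 3) surface patch
        return torch.cat(patches, dim=1)                                     # (B, total_points, 3)
```

At inference time, calling the same trained model with total_points = 50, 500, 5000 or 10000 produces the point clouds visualized in the table below.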
| 50 | 500 | 5000 | 10000 |  |
| --- | --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
Using a parametric network enables dense predictions even when the model was trained with sparse(r) point sets.