| Ground Truth Target | Fit Voxel Grid |
| --- | --- |
| ![]() | ![]() |
| Ground Truth Target | Fit Point Cloud |
| --- | --- |
| ![]() | ![]() |
| Ground Truth Target | Fit Mesh |
| --- | --- |
| ![]() | ![]() |
| Single View Image | GT | Reconstructed Voxel |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| Single View Image | GT Point Cloud | Reconstructed Point Cloud |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| Single View Image | GT Mesh | Reconstructed Mesh |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() |
| 3D Representation | Voxels (32**3) | Point Cloud | Mesh |
| --- | --- | --- | --- |
| Avg F1@0.05 Score | 83.4 | 96.3 | 96.0 |
The above numbers are computed using 5K points and match intuition.

The score is lowest for voxels, mainly because voxel prediction is an expensive task and most generated shapes do not capture fine shape features well. For example, several reconstructions had missing legs (thin structures), and the design of chair backs was also not fully captured. Voxel prediction networks must classify both empty and occupied (inside-object) regions of the grid.

For meshes, the score is high because mesh deformations are dense and can therefore represent shapes faithfully. Still, the deformations cannot introduce holes, and the surfaces tend to look pointy. Since meshes have relatively fewer output dimensions and model volumes better, their F1 score is higher than that of voxels.

For point clouds, the F1 score at the 0.05 threshold is highest (by a small margin over meshes). This makes sense because point clouds allow significantly more freedom to represent objects, as the output space is sparse: we need ~32K outputs to represent a 32**3 voxel grid but only 5K points for a point cloud. Point clouds are observed to capture global shape features better (with at least some points near thin structures), which improves the overall score.
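For reference, the sketch below shows one way an F1-at-a-threshold metric can be computed from two sampled point sets. The function name and normalization are my own choices here and may differ in detail from the course's evaluation code.

```python
import torch

def f1_score(pred_points, gt_points, threshold=0.05):
    """F1 at a distance threshold between two sampled point sets (sketch).

    pred_points: (N, 3) points sampled from the predicted surface.
    gt_points:   (M, 3) points sampled from the ground-truth surface (here, 5K each).
    """
    dists = torch.cdist(pred_points, gt_points)                       # (N, M) pairwise distances
    precision = (dists.min(dim=1).values < threshold).float().mean()  # pred points near some GT point
    recall = (dists.min(dim=0).values < threshold).float().mean()     # GT points near some pred point
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)  # reported as a percentage
```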
In these experiments, I used the following hyper-parameters and design choices:
| Learning Rate | Batch Size | Max Iters | Scheduler Step Size | n_points | w_smooth | Optimizer (design choice) | Scheduler (design choice) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.0008 | 32 | 7200 | 200 iters | 5000 | 0.2 | AdamW | StepLR |
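As a minimal sketch, these settings map onto PyTorch roughly as follows; the model here is a stand-in for the actual decoder, and the StepLR decay factor (gamma) is an assumption since it is not listed in the table.

```python
import torch

model = torch.nn.Linear(512, 3)  # placeholder for the actual decoder network

# AdamW optimizer (design choice) with the learning rate from the table above.
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-4)

# StepLR scheduler (design choice), stepped every 200 iterations.
# gamma=0.5 is illustrative; the report does not state the decay factor.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)

for step in range(7200):  # max iters
    optimizer.zero_grad()
    # loss = ...           # Chamfer / BCE / Chamfer + w_smooth * Laplacian, per representation
    # loss.backward()
    optimizer.step()
    scheduler.step()       # stepped per iteration, matching the 200-iter step size
```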
While most of these choices were the result of extensive debugging and performance-based tuning (Grad Student Descent), in this section I explore the impact of 'n_points', the number of points in the point cloud, for Single Image to Point Cloud. I consider [50, 500, 5000, 10000] as possible values for 'n_points'.
| Training loss with 50 points | Training loss with 500 points | Training loss with 5000 points | Training loss with 10000 points |
| --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() |
Above are the training-loss curves. It is interesting that while all settings converge well, the magnitude of the loss is approximately proportional to the number of points: my implementation of the Chamfer distance takes a sum over all points rather than an average (a minimal sketch is shown below). The table that follows reports the F1 score at the 0.05 threshold for each setting.
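The sketch below illustrates the summed (rather than averaged) Chamfer formulation. It uses a dense torch.cdist for clarity, whereas a real implementation would typically use a nearest-neighbour query (e.g. pytorch3d's knn_points), and the exact reduction in my training code may differ slightly.

```python
import torch

def chamfer_sum(pred, gt):
    """Chamfer distance summed over points (not averaged).

    pred: (B, N, 3) predicted point cloud; gt: (B, M, 3) ground truth.
    Because distances are summed, the loss magnitude grows roughly
    linearly with the number of points, matching the curves above.
    """
    dists = torch.cdist(pred, gt)                           # (B, N, M) pairwise distances
    pred_to_gt = dists.min(dim=2).values.pow(2).sum(dim=1)  # nearest GT point per predicted point
    gt_to_pred = dists.min(dim=1).values.pow(2).sum(dim=1)  # nearest predicted point per GT point
    return (pred_to_gt + gt_to_pred).mean()                 # average only over the batch
```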
| Number of Points | Avg F1@0.05 |
| --- | --- |
| 50 | 41.91 |
| 500 | 89.93 |
| 5000 | 96.29 |
| 10000 | 97.29 |
This clearly indicates that the F1 score improves as the number of points grows (the same number of points was used for training and evaluation), since more points capture local shape features better. As expected, this trend runs opposite to the raw loss values, which are not comparable across settings because of the summed Chamfer loss. It is also important to note that training is slower for larger numbers of points. Because the improvement from 5K to 10K points is minor, I picked 5K points for reporting my results.
Single-view 2D-to-3D reconstruction is an ill-posed problem: more information (especially about local features) must be inferred than a single 2D image offers. Hence, to better interpret what the model does, I qualitatively visualize its performance on different views of the same object. I modified 'r2n2_custom.py' to return different views and developed a 'qual_eval.py' script for this task; use the flag '--eval_views'.
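A rough sketch of how the flags in 'qual_eval.py' might be wired; the dataset loading and rendering calls are placeholders for the assignment's own code.

```python
# qual_eval.py (sketch): flag handling only.
import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--eval_views", action="store_true",
                        help="Visualize predictions from informative vs. uninformative views")
    parser.add_argument("--eval_overlap", action="store_true",
                        help="Overlay the predicted shape on the ground truth (used later in this section)")
    args = parser.parse_args()

    if args.eval_views:
        pass  # load views from the modified r2n2_custom dataset and render per-view predictions
    if args.eval_overlap:
        pass  # render prediction (purple) on top of ground truth (yellow)

if __name__ == "__main__":
    main()
```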
| Informative View | Uninformative View | Informative View | Uninformative View | Informative View | Uninformative View |
| --- | --- | --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Details about backrest | Most common back given no info | Thin arm-rest | Possibly high armrest | Visible armrest captured | Invisible armrest |
For single image to voxel, we visualize three objects with less and more occlusion. This shows that 32x32x32 voxels yield an approximately correct global shape but miss most local detail.
| Informative View | Uninformative View | Informative View | Uninformative View | Informative View | Uninformative View |
| --- | --- | --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| Slanted back | Flat back | Arm-rest | Missing armrest | Curved leg based on info | Four legs based on prior+image |
Single-image-based mesh prediction is shown above. These visualizations signify the high impact of the data-driven prior learnt by the neural model; hence, the performance achievable on this task is limited.
Next, I wish to measure the drift between the predicted and true 3D representations. Hence, I overlay the predicted voxel grid / point cloud on the ground-truth representation. The script is included in 'qual_eval.py' and can be run using '--eval_overlap'. Note that purple denotes the prediction and yellow denotes the ground truth.
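A simple matplotlib stand-in for this overlay; the report's figures use the course renderer, so the function below only illustrates the idea of colouring prediction and ground truth differently.

```python
import matplotlib.pyplot as plt
import numpy as np

def overlay_point_clouds(pred_points, gt_points, out_path="overlay.png"):
    """Overlay prediction (purple) on ground truth (yellow).

    pred_points, gt_points: (N, 3) numpy arrays of sampled surface points.
    """
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(*gt_points.T, s=1, c="yellow", label="ground truth")
    ax.scatter(*pred_points.T, s=1, c="purple", label="prediction")
    ax.legend()
    fig.savefig(out_path, dpi=200)
    plt.close(fig)

# Example usage with random stand-in data:
# overlay_point_clouds(np.random.rand(5000, 3), np.random.rand(5000, 3))
```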
| Height Offset | Height Offset | No Height Offset | Slant Offset | Leg Slant Offset | Back Leg Offset | Leg Offset |
| --- | --- | --- | --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
This visualization shows that while most shapes align well with the GT representations, in certain scenarios a shape can have a slight offset. It also highlights the importance of multi-view consistency.
I also observe that voxel predictions often miss thin legs. This 'case of the missing legs' suggests that the predicted occupancy probability of these voxels is low (since the corresponding features, e.g. legs, are thin). Hence, if we lower the occupancy threshold (which makes the shape thicker), the thin(ner) features might appear without re-training the model.
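A sketch of what that could look like, assuming the prediction is stored as a (D, D, D) probability grid and using scikit-image's marching cubes; the assignment's own mesh-extraction utilities may differ.

```python
from skimage import measure

def voxels_to_mesh(occupancy_probs, threshold=0.5):
    """Extract a surface mesh from predicted per-voxel occupancy probabilities.

    occupancy_probs: (D, D, D) array of probabilities in [0, 1].
    Lowering `threshold` below 0.5 treats more low-confidence voxels as
    occupied, which can recover thin structures such as chair legs
    without any re-training.
    """
    verts, faces, normals, _ = measure.marching_cubes(occupancy_probs, level=threshold)
    return verts, faces

# e.g. voxels_to_mesh(probs, threshold=0.3) to make the reconstruction thicker
```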
The benefit of an implicit decoder is that we can generate shapes at arbitrary resolution without re-training. Hence, here I visualize the results at 8**3, 16**3, 32**3 and 64**3. Note that the model is trained only on 32**3-resolution ground truth; I could not go above 64**3 due to hardware limitations.
During inference, I arrange voxel_dimension x voxel_dimension x voxel_dimension 3D points (spanning [-1, 1] along each axis) and compute the occupancy probability at each 3D location. For visualization, I reshape the 3D probabilities into an occupancy grid and apply marching cubes.
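A minimal sketch of this inference procedure, assuming the decoder takes an image feature and a batch of (x, y, z) points and returns occupancy logits; the actual model's signature may differ.

```python
import torch

@torch.no_grad()
def query_occupancy_grid(decoder, image_feature, resolution=64):
    """Evaluate an implicit occupancy decoder on a dense [-1, 1]^3 grid.

    Returns a (resolution, resolution, resolution) probability grid that can
    be passed to marching cubes, as in the previous sketch.
    """
    axis = torch.linspace(-1.0, 1.0, resolution)
    xx, yy, zz = torch.meshgrid(axis, axis, axis, indexing="ij")
    points = torch.stack([xx, yy, zz], dim=-1).reshape(-1, 3)  # (resolution**3, 3)
    logits = decoder(image_feature, points)                    # assumed signature: (feat, points) -> (P,)
    return torch.sigmoid(logits).reshape(resolution, resolution, resolution)
```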
| 8**3 | 16**3 | 32**3 | 64**3 |  |
| --- | --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
The shapes are observed to be relatively smooth and better aligned with the information from the single-view 2D image.
The parametric network projects 2D points into 3D. For this implementation, I use a combination of 5 decoders, such that total_points // 5 points are projected by each decoder[i] onto a 3D surface patch.
The initial visualization looks like:
A key advantage of the parametric network is that we can sample an arbitrary number of 2D points and predict the locations of the corresponding 3D points. This enables the model to generate point clouds of arbitrary resolution without any re-training. In this experiment, I visualize the generated point clouds with 50, 500, 5000 and 10000 points. Note that the parametric model was trained using 5000 randomly sampled points. A rough sketch of the multi-decoder forward pass is shown below.
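This is only an illustrative sketch: the layer widths, feature dimension, and the way 2D points are sampled are assumptions, not the exact architecture used in this report.

```python
import torch
from torch import nn

class ParametricDecoder(nn.Module):
    """Sketch of a 5-patch parametric point-cloud decoder.

    Each of the 5 MLPs maps (image feature, random 2D point) -> 3D point,
    so total_points // 5 points come from each decoder.
    """
    def __init__(self, feat_dim=512, n_decoders=5):
        super().__init__()
        self.decoders = nn.ModuleList([
            nn.Sequential(nn.Linear(feat_dim + 2, 512), nn.ReLU(),
                          nn.Linear(512, 512), nn.ReLU(),
                          nn.Linear(512, 3))
            for _ in range(n_decoders)
        ])

    def forward(self, feat, total_points):
        # feat: (B, feat_dim) image feature; total_points can differ from training.
        B = feat.shape[0]
        n = total_points // len(self.decoders)
        patches = []
        for dec in self.decoders:
            uv = torch.rand(B, n, 2, device=feat.device)                     # random 2D samples in [0, 1)^2
            inp = torch.cat([feat.unsqueeze(1).expand(-1, n, -1), uv], dim=-1)
            patches.append(dec(inp))                                         # (B, n, 3) surface patch
        return torch.cat(patches, dim=1)                                     # (B, total_points, 3)
```

At inference time, calling the same trained model with total_points = 50, 500, 5000 or 10000 produces the point clouds visualized in the table below.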
| 50 | 500 | 5000 | 10000 |  |
| --- | --- | --- | --- | --- |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
| ![]() | ![]() | ![]() | ![]() | ![]() |
Using a parametric network enables dense predictions even when the model was trained with sparse(r) point sets.