Goals: In this assignment, you will explore loss functions and decoder architectures for regressing to voxel, point-cloud, and mesh representations from single-view RGB input.
python fit_data.py --type 'vox'
OR
python main.py -q 1.1
Visualization
Optimized Voxel | Ground Truth |
---|---|
![]() | ![]() |
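Fitting a voxel grid amounts to minimizing a binary cross-entropy between the predicted occupancy and the ground-truth grid. Below is a minimal NumPy sketch of that loss; the assignment code itself uses PyTorch, and the variable names here are illustrative:

```python
import numpy as np

def voxel_bce_loss(pred_logits, target):
    """Binary cross-entropy between predicted occupancy logits and a
    binary ground-truth voxel grid, averaged over all voxels."""
    p = 1.0 / (1.0 + np.exp(-pred_logits))      # logits -> probabilities
    p = np.clip(p, 1e-7, 1.0 - 1e-7)            # avoid log(0)
    return float(-np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)))

# Toy 2x2x2 grid: confident correct logits give a near-zero loss,
# confident wrong logits a large one.
target = np.zeros((2, 2, 2))
target[0, 0, 0] = 1.0
good_logits = np.where(target > 0, 10.0, -10.0)
bad_logits = -good_logits
```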
python fit_data.py --type 'point'
OR
python main.py -q 1.2
Visualization
Optimized Point Cloud | Ground Truth |
---|---|
![]() | ![]() |
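Fitting a point cloud minimizes the chamfer distance between the optimized and ground-truth point sets. A minimal NumPy sketch of the symmetric chamfer distance (the assignment presumably uses a PyTorch/pytorch3d implementation):

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric chamfer distance between point sets a (N, 3) and b (M, 3):
    for each point, the squared distance to its nearest neighbour in the
    other set, averaged per direction and summed over both directions."""
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)  # (N, M) pairwise
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
```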
python fit_data.py --type 'mesh'
OR
python main.py -q 1.3
Visualization
Optimized Mesh | Ground Truth |
---|---|
![]() | ![]() |
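Fitting a mesh combines a chamfer term on sampled surface points with a smoothness regularizer on the mesh itself. A minimal NumPy sketch using a uniform Laplacian as the smoothness term (the assignment likely uses pytorch3d's `mesh_laplacian_smoothing`; all names and weights here are illustrative):

```python
import numpy as np

def chamfer_distance(a, b):
    # Symmetric chamfer distance between two (N, 3) point sets.
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def laplacian_smoothing(verts, faces):
    """Uniform Laplacian smoothness: mean squared offset of each vertex
    from the centroid of its neighbours."""
    neighbor_sum = np.zeros_like(verts)
    degree = np.zeros(len(verts))
    for f in faces:                     # accumulate neighbours over face edges
        for i in range(3):
            u, v = f[i], f[(i + 1) % 3]
            neighbor_sum[u] += verts[v]
            neighbor_sum[v] += verts[u]
            degree[u] += 1.0
            degree[v] += 1.0
    centroid = neighbor_sum / degree[:, None]
    return float(np.mean(np.sum((verts - centroid) ** 2, axis=1)))

# Regular tetrahedron as a toy mesh.
verts = np.array([[1.0, 1.0, 1.0], [1.0, -1.0, -1.0],
                  [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]])
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])

# Combined mesh loss: w_chamfer * chamfer + w_smooth * smoothness.
w_chamfer, w_smooth = 1.0, 0.1
sampled = verts                          # stand-in for sampled surface points
total = w_chamfer * chamfer_distance(sampled, verts) \
        + w_smooth * laplacian_smoothing(verts, faces)
```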
# For training
python train_model.py --type 'vox' --max_iter 10001 --save_freq 2000
# For evaluation
python eval_model.py --type 'vox' --load_checkpoint --load_step 10000 --vis_freq 20
OR
python main.py -q 2.1
Visualizing 3 examples
Ground Truth Image | Ground Truth Voxel | Predicted Voxel |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
# For training
python train_model.py --type 'point' --max_iter 10001 --save_freq 2000
# For evaluation
python eval_model.py --type 'point' --load_checkpoint --load_step 10000 --vis_freq 20
OR
python main.py -q 2.2
Visualizing 3 examples
Ground Truth Image | Ground Truth Point Cloud | Predicted Point Cloud |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
# For training
python train_model.py --type 'mesh' --max_iter 10001 --save_freq 2000
# For evaluation
python eval_model.py --type 'mesh' --load_checkpoint --load_step 10000 --vis_freq 20
OR
python main.py -q 2.3
Visualizing 3 examples
Ground Truth Image | Ground Truth Mesh | Predicted Mesh |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
Avg F1@0.05 Vox | Avg F1@0.05 Point | Avg F1@0.05 Mesh |
---|---|---|
74.439 | 90.849 | 87.206 |
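The F1@0.05 metric above compares predicted points against ground-truth points at a distance threshold of 0.05. A minimal NumPy sketch of this metric (the assignment's own implementation may differ in sampling and details):

```python
import numpy as np

def f1_at_threshold(pred, gt, tau=0.05):
    """F1 between point clouds: a predicted point counts as correct if its
    nearest ground-truth point lies within tau (precision), and a ground-truth
    point is covered if its nearest prediction lies within tau (recall)."""
    d = np.sqrt(np.sum((pred[:, None, :] - gt[None, :, :]) ** 2, axis=-1))
    precision = float((d.min(axis=1) < tau).mean())
    recall = float((d.min(axis=0) < tau).mean())
    if precision + recall == 0.0:
        return 0.0
    return 100.0 * 2.0 * precision * recall / (precision + recall)

cloud = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
```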
I experimented with the different tunable hyperparameters and observed the following.
n_points
- For meshes, the loss is given by
loss = args.w_chamfer * loss_reg + args.w_smooth * loss_smooth
The chamfer loss is computed as a sum over point distances, so increasing the n_points parameter directly increases the magnitude of the chamfer term. With the chamfer loss proportional to the number of points, it contributes far more to the total loss than the smoothness term. The model therefore focuses on lowering the chamfer loss, and the resulting meshes end up spiky because smoothness is barely enforced.
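This scaling is easy to check numerically: with the chamfer loss computed as a sum, the same per-point error yields a roughly n_points-times larger loss, while a smoothness term is unaffected by the sample count. A small synthetic NumPy demonstration (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

def chamfer_sum(a, b):
    # Chamfer computed as a SUM over points rather than a mean.
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return float(d2.min(axis=1).sum() + d2.min(axis=0).sum())

# Same per-point error, ten times the points: the summed loss grows
# roughly tenfold.
small = rng.normal(size=(100, 3))
large = rng.normal(size=(1000, 3))
loss_small = chamfer_sum(small, small + 0.01)
loss_large = chamfer_sum(large, large + 0.01)
```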
w_smooth
- The obvious next step was to increase the smoothness weight. As w_smooth was raised to large values, the model's focus shifted from accurately representing the chair to enforcing smoothness. The resulting chairs were smoother but showed hardly any variation (all chairs looked like the one below).
w_smooth 100 |
---|
![]() |
When the weight was pushed to a very high value, the model insisted on keeping everything planar, so there were hardly any deformations of the initial sphere.
w_smooth 700 |
---|
![]() |
ico_sphere level
- My initial mesh experiments always produced chair meshes with spiky legs, which I attributed to the limited number of vertices and connectivity of the initial sphere. Increasing the ico_sphere level increases the number of vertices and faces; I also had to increase the model capacity to handle the larger number of values to be predicted. The resulting meshes showed much more rectangular structure in the legs.
level 4 | level 6 |
---|---|
![]() | ![]() |
python interpret_model.py --load_step 10000 --index1 100 --index2 340
For this question, all my experiments and observations are based on the point cloud encoder-decoder model.
What has the decoder learned
- One of my first thoughts on seeing this question was to examine what information the decoder itself contains. To do so, I ran the trained decoder on an encoded feature vector of all zeros. The output was the following:
As can be seen, the decoder contains the basic structure of a chair.
Latent Space Interpolation
- I combined the encoded feature vectors of 2 randomly chosen images at different weights. The idea was that a weighted combination of two encodings of the same object category (different instances) would yield a new valid encoding of the object, which the decoder would be able to interpret and correctly predict.
encoded2 | 0.25 * encoded1 + 0.75 * encoded2 | 0.5 * encoded1 + 0.5 * encoded2 | 0.75 * encoded1 + 0.25 * encoded2 | encoded1 |
---|---|---|---|---|
![]() | ![]() | ![]() | ![]() | ![]() |
From the above outputs, it is clear that the encoder captures information about different aspects of the chair such as height, width, concavity, leg length, etc. Combining encoded vectors therefore produces a new object with these properties modified.
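The interpolation itself is just a convex combination of the two latent vectors before decoding. A minimal sketch (decoder omitted; names illustrative):

```python
import numpy as np

def interpolate_latents(z1, z2, weights=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Convex combinations w * z1 + (1 - w) * z2 of two encoded feature
    vectors; each blend would then be passed through the trained decoder."""
    return [w * z1 + (1.0 - w) * z2 for w in weights]

z1 = np.array([1.0, 0.0])
z2 = np.array([0.0, 1.0])
blends = interpolate_latents(z1, z2)
```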
python train_implicit.py --save_freq 2000 --max_iter 10001
OR
python main.py -q 3.1
Visualizing 3 examples
Ground Truth Image | Ground Truth Voxel | Predicted Voxel using Implicit Decoder |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
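An implicit decoder predicts occupancy at arbitrary query coordinates conditioned on the image feature, and a voxel grid is recovered by evaluating it on a regular grid of points. A toy NumPy sketch with random placeholder weights (the real model is a trained PyTorch MLP; all names and sizes here are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def implicit_decode(feat, coords, W1, b1, W2, b2):
    """Toy implicit decoder: concatenate each 3D query coordinate with the
    image feature and run a tiny MLP to an occupancy probability."""
    x = np.concatenate(
        [np.broadcast_to(feat, (len(coords), len(feat))), coords], axis=1)
    h = np.maximum(x @ W1 + b1, 0.0)                 # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))      # sigmoid occupancy

# Query a 32^3 grid of normalized coordinates in [-1, 1]^3.
lin = np.linspace(-1.0, 1.0, 32)
grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1).reshape(-1, 3)

feat = rng.normal(size=64)                           # placeholder image feature
W1 = 0.1 * rng.normal(size=(67, 128)); b1 = np.zeros(128)
W2 = 0.1 * rng.normal(size=(128, 1)); b2 = np.zeros(1)
occ = implicit_decode(feat, grid, W1, b1, W2, b2).reshape(32, 32, 32)
```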
Implement a parametric function that takes sampled 2D points as input and outputs their corresponding 3D points. Some papers for inspiration: [1, 2].
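One such parametric function, in the spirit of AtlasNet, conditions an MLP on the image feature and maps (u, v) samples from the unit square to 3D surface points. A toy NumPy sketch with random placeholder weights (all names and sizes are assumptions, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def parametric_decode(uv, feat, W1, b1, W2, b2):
    """Toy parametric decoder: map sampled (u, v) points from the unit
    square, conditioned on an image feature, to 3D surface points."""
    x = np.concatenate(
        [uv, np.broadcast_to(feat, (len(uv), len(feat)))], axis=1)
    h = np.tanh(x @ W1 + b1)        # hidden layer
    return h @ W2 + b2              # (N, 3) predicted 3D points

uv = rng.uniform(size=(1000, 2))                 # 2D samples from [0, 1]^2
feat = rng.normal(size=64)                       # placeholder image feature
W1 = 0.1 * rng.normal(size=(66, 128)); b1 = np.zeros(128)
W2 = 0.1 * rng.normal(size=(128, 3)); b2 = np.zeros(3)
pts = parametric_decode(uv, feat, W1, b1, W2, b2)
```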