Assignment 2 - Learning for 3D Vision [16889]
NAME: Hiresh Gupta
ANDREW ID: hireshg
1. Exploring Loss Functions
1.1. Fitting a voxel grid (5 points)
Visualize the optimized voxel grid alongside the ground-truth voxel grid using the visualization tools from the previous section.
python fit_data.py --type 'vox' --save_dir "./output" --save_predictions --save_visualizations
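Under the hood, the voxel fit minimizes a per-voxel binary cross-entropy between the predicted occupancy logits and the ground-truth grid. A minimal sketch of that loss (the function and tensor names are illustrative, not the exact ones in fit_data.py):

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_logits: torch.Tensor, gt_voxels: torch.Tensor) -> torch.Tensor:
    # pred_logits: (B, D, H, W) raw occupancy logits; gt_voxels: (B, D, H, W) binary occupancy.
    return F.binary_cross_entropy_with_logits(pred_logits, gt_voxels.float())
```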
1.2. Fitting a point cloud (10 points)
Visualize the optimized point cloud alongside the ground-truth point cloud using the visualization tools from the previous section.
python fit_data.py --type 'point' --save_dir "./output" --save_predictions --save_visualizations
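The point cloud fit minimizes a symmetric Chamfer distance between the predicted and ground-truth point sets. A minimal sketch using a plain PyTorch nearest-neighbour search (my implementation may instead use PyTorch3D's knn_points; names here are illustrative):

```python
import torch

def chamfer_loss(pred_points: torch.Tensor, gt_points: torch.Tensor) -> torch.Tensor:
    # pred_points: (B, N, 3), gt_points: (B, M, 3)
    dists = torch.cdist(pred_points, gt_points)          # (B, N, M) pairwise Euclidean distances
    pred_to_gt = dists.min(dim=2).values.pow(2).mean()   # each predicted point to its nearest GT point
    gt_to_pred = dists.min(dim=1).values.pow(2).mean()   # each GT point to its nearest predicted point
    return pred_to_gt + gt_to_pred
```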
1.3. Fitting a mesh (5 points)
Visualize the optimized mesh alongside the ground-truth mesh using the visualization tools from the previous section.
python fit_data.py --type 'mesh' --save_dir "./output" --save_predictions --save_visualizations
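Mesh fitting deforms the template's vertices with a Chamfer loss on points sampled from both surfaces, plus a smoothness regularizer. A hedged sketch using PyTorch3D utilities (the exact weights and sample counts in fit_data.py may differ):

```python
import torch
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_fitting_loss(pred_mesh, gt_mesh, w_smooth: float = 0.1, n_samples: int = 5000) -> torch.Tensor:
    # Compare the two surfaces through points sampled on each mesh.
    pred_pts = sample_points_from_meshes(pred_mesh, n_samples)
    gt_pts = sample_points_from_meshes(gt_mesh, n_samples)
    loss_chamfer, _ = chamfer_distance(pred_pts, gt_pts)
    # Laplacian smoothing discourages the deformed template from folding onto itself.
    loss_smooth = mesh_laplacian_smoothing(pred_mesh)
    return loss_chamfer + w_smooth * loss_smooth
```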
2. Reconstructing 3D from single view
2.1. Image to voxel grid (15 points)
Training
# Train model:
python train_model.py --type 'vox' --id 'q2' --batch_size 64 --save_freq 500 --output_dir=./outputs --max_iter 10000 --vis_freq 250 --num_workers 4
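The decoder architecture isn't spelled out in this report; as a rough sketch of the kind of image-to-voxel decoder this could be (a fully-connected layer reshaped into a small 3D feature volume, followed by 3D transposed convolutions; the layer sizes and the 32^3 output resolution are assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

class VoxelDecoderSketch(nn.Module):
    # Hypothetical decoder: maps a 512-d image feature to a 32^3 grid of occupancy logits.
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 128 * 4 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 4^3 -> 8^3
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),   # 8^3 -> 16^3
            nn.ConvTranspose3d(32, 1, kernel_size=4, stride=2, padding=1),               # 16^3 -> 32^3
        )

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        x = self.fc(image_feat).view(-1, 128, 4, 4, 4)
        return self.deconv(x)  # (B, 1, 32, 32, 32) occupancy logits, trained with BCE
```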
Hyperparameters
- lr=4e-4
- batch_size=64
- max_iters=10000
Evaluation
python eval_model.py --type 'vox' --id 'q2' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --vis_freq 100
Visualizations
2.2. Image to point cloud (15 points)
Training
python train_model.py --type 'point' --id 'q2' --batch_size 8 --save_freq 500 --output_dir=./outputs --max_iter 10000 --vis_freq 250 --num_workers 4
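As in 2.1, the decoder itself isn't described here; a minimal sketch of the kind of point cloud decoder this could be (an MLP regressing n_points x 3 coordinates from the image feature; layer sizes are assumptions, not the actual implementation):

```python
import torch
import torch.nn as nn

class PointDecoderSketch(nn.Module):
    # Hypothetical decoder: maps a 512-d image feature to n_points 3D coordinates.
    def __init__(self, n_points: int = 5000, feat_dim: int = 512):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_points * 3), nn.Tanh(),  # keep coordinates in a bounded range
        )

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        return self.mlp(image_feat).view(-1, self.n_points, 3)  # (B, n_points, 3)
```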
Hyperparameters
- lr=4e-4
- batch_size=8
- max_iters=10000
Evaluation
python eval_model.py --type 'point' --id 'q2' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --vis_freq 100
Visualizations
2.3. Image to mesh (15 points)
Training
python train_model.py --type 'mesh' --id 'q2' --batch_size 8 --save_freq 500 --output_dir=./outputs --max_iter 10000 --vis_freq 250 --num_workers 4
Hyperparameters
- lr=4e-4
- batch_size=8
- max_iters=10000
- w_smooth=0.1
- w_chamfer=1.0
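Here w_chamfer weights the Chamfer term and w_smooth the Laplacian smoothness regularizer in the mesh loss. The decoder itself isn't spelled out in this report; a hypothetical sketch of the kind of mesh decoder this could be (predicting per-vertex offsets of an ico_sphere template from the image feature; layer sizes are assumptions):

```python
import torch
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoderSketch(nn.Module):
    # Hypothetical decoder: deforms an ico_sphere by per-vertex offsets predicted from a 512-d image feature.
    def __init__(self, feat_dim: int = 512, level: int = 4):
        super().__init__()
        self.template = ico_sphere(level)  # fixed initial topology
        n_verts = self.template.verts_packed().shape[0]
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, n_verts * 3),
        )

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # Returns predicted vertex positions of shape (B, n_verts, 3); faces stay those of the template.
        offsets = self.mlp(image_feat).view(image_feat.shape[0], -1, 3)
        return self.template.verts_packed().unsqueeze(0).to(offsets.device) + offsets
```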
Evaluation
python eval_model.py --type 'mesh' --id 'q2' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --vis_freq 100
Visualizations
2.4. Quantitative comparisons (10 points)
Quantitatively compare the F1 score of 3D reconstruction for meshes vs. point clouds vs. voxel grids. Provide an intuitive explanation justifying the comparison.
The quantitative comparison of the average F1 score at a distance threshold of 0.05 is shown in the following table:
| Representation | Avg. F1 Score @ 0.05 (%) | Hyperparameters |
|---|---|---|
| Voxels | 90.406 | lr=4e-4, batch_size=8, max_iters=10000 |
| Point cloud | 94.938 | lr=4e-4, batch_size=8, max_iters=10000 |
| Mesh | 90.006 | lr=4e-4, batch_size=8, max_iters=10000, w_smooth=0.1, w_chamfer=1.0 |
Explanation:
Based on the table above, the average F1 score for point clouds is the highest (around 95%), while the other two representations (voxels and meshes) converge around 90%. This is intuitive: a point cloud is free to place points anywhere in space and can expand or contract freely, whereas the voxel representation must predict occupancy over a fixed cube volume and is constrained by its resolution. The mesh representation is limited by the initial topology of the template mesh, which can be restrictive when capturing 3D objects whose surfaces and holes differ from that template.
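For reference, the F1@0.05 numbers above can be computed roughly as follows, treating both prediction and ground truth as sampled point sets (a sketch; the course's evaluation script may differ in details such as the number of sampled points):

```python
import torch

def f1_score(pred_points: torch.Tensor, gt_points: torch.Tensor, thresh: float = 0.05) -> torch.Tensor:
    # pred_points: (N, 3), gt_points: (M, 3); returns F1 in percent.
    dists = torch.cdist(pred_points.unsqueeze(0), gt_points.unsqueeze(0)).squeeze(0)  # (N, M)
    precision = (dists.min(dim=1).values < thresh).float().mean() * 100  # predicted points near some GT point
    recall = (dists.min(dim=0).values < thresh).float().mean() * 100     # GT points near some predicted point
    return 2 * precision * recall / (precision + recall + 1e-8)
```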
2.5. Analyse effects of hyperparameter variations (10 points)
Analyse the results by varying a hyperparameter of your choice, for example n_points, vox_size, w_chamfer, or the initial mesh (ico_sphere).
Solution: I measured the effect of varying the n_points parameter of the point cloud decoder on model performance. The corresponding training and evaluation commands are listed below:
Training Commands (varying the n_points parameter)
# 500 points
python train_model.py --type 'point' --id 'q2_500' --batch_size 8 --output_dir=./outputs --max_iter 5000 --num_workers 4 --n_points 500
# 1000 points
python train_model.py --type 'point' --id 'q2_1000' --batch_size 8 --output_dir=./outputs --max_iter 5000 --num_workers 4 --n_points 1000
# 2500 points
python train_model.py --type 'point' --id 'q2_2500' --batch_size 8 --output_dir=./outputs --max_iter 5000 --num_workers 4 --n_points 2500
# 5000 points
python train_model.py --type 'point' --id 'q2_5000' --batch_size 8 --output_dir=./outputs --max_iter 5000 --num_workers 4 --n_points 5000
Evaluation (varying the n_points parameter)
# 500 points
python eval_model.py --type 'point' --id 'q2_500' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --n_points 500
# 1000 points
python eval_model.py --type 'point' --id 'q2_1000' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --n_points 1000
# 2500 points
python eval_model.py --type 'point' --id 'q2_2500' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --n_points 2500
# 5000 points
python eval_model.py --type 'point' --id 'q2_5000' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --n_points 5000
Quantitative Comparison
I observed the following results:
| Number of Sampled Points | Avg. F1 Score (%) | Hyperparameters |
|---|---|---|
| n_points=500 | 82.991 | lr=4e-4, batch_size=8, max_iters=5000 |
| n_points=1000 | 88.563 | lr=4e-4, batch_size=8, max_iters=5000 |
| n_points=2500 | 92.098 | lr=4e-4, batch_size=8, max_iters=5000 |
| n_points=5000 | 94.377 | lr=4e-4, batch_size=8, max_iters=5000 |
Analysis
The network trains somewhat faster with fewer points. Based on the table above, the F1 score declines steadily as the number of points decreases from 5000 to 500. This is intuitive: more points allow the predicted point cloud to capture finer structure, which increases the F1 score.
Visualizations
Sample point cloud predictions with n_points=500:
Sample point cloud predictions with n_points=1000:
Sample point cloud predictions with n_points=2500:
Sample point cloud predictions with n_points=5000:
2.6. Interpret your model (15 points)
Simply seeing final predictions and numerical evaluations is not always insightful. Can you create some visualizations that help highlight what your learned model does? Be creative and think of what visualizations would help you gain insights. There is no 'right' answer - although reading some papers to get inspiration might give you ideas.
Solution: To understand what the network is actually learning, I ran GradCAM/GradCAM++ and Guided Backpropagation to see which image features drive the gradients. You can run the following commands to visualize the gradient maps on our test set.
Usage:
python gradcam.py --type vox --checkpoint_path <vox_trained_checkpoint_path> --max_iter 10 --visualizer gradcam++ --output_dir ./output/gradcam++_visualizations/vox
python gradcam.py --type point --checkpoint_path <point_trained_checkpoint_path> --max_iter 10 --visualizer gradcam++ --output_dir ./output/gradcam++_visualizations/point
python gradcam.py --type mesh --checkpoint_path <mesh_trained_checkpoint_path> --max_iter 10 --visualizer gradcam++ --output_dir ./output/gradcam++_visualizations/mesh
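gradcam.py is my own script; roughly, for each model it hooks the last convolutional block of the ResNet encoder, backpropagates a scalar summary of the 3D prediction, and weights the activations by their pooled gradients. A plain Grad-CAM sketch of that idea (the script itself uses Grad-CAM++; the sketch assumes the model's forward takes just the image, and the helper names are illustrative):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image):
    # image: (1, H, W, 3); model.encoder is the truncated ResNet, whose index 7 is the last conv block.
    activations, gradients = [], []
    target_layer = model.encoder[7]
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    pred = model(image)      # e.g. voxel occupancy logits
    pred.sum().backward()    # scalar objective: total predicted occupancy
    h1.remove(); h2.remove()

    acts, grads = activations[0], gradients[0]               # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)           # channel-wise pooled gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))  # (1, 1, h, w)
    cam = F.interpolate(cam, size=image.shape[1:3], mode="bilinear", align_corners=False)
    return cam / (cam.max() + 1e-8)                          # heat map normalized to [0, 1]
```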
GradCAM++ Visualizations:
1. GradCAM++ on Voxel predictions:
2. GradCAM++ on Point predictions:
3. GradCAM++ on Mesh predictions:
My interpretation of the gradient visualizations:
It was quite interesting to see that almost all of the networks focus on the chair edges, picking up corners and end points in order to project the image into 3D space. Here is a deeper interpretation for each architecture:
- Voxel: Since the voxel decoder predicts a probability value over the whole 3D volume, we see fairly high gradient values throughout the chair region. It is also able to pick up gradient signals in harder cases with bad lighting or occlusion.
- Point: The point decoder also focuses strongly on the chair edges. However, we see weaker gradients around occluded parts such as the top of the chair or the chair legs, which helps explain why the point cloud hallucinates in those regions and fails to capture their fine structure.
- Mesh: The mesh gradients, like those of the other two decoders, are concentrated on the edges. On closer inspection, however, they are weaker than those of the point or voxel decoder. In the example above there are clear gradient values on the front of the chair and weaker gradients on the occluded part, possibly because the mesh allocates more faces to the frontal region and attends less to the occluded one.
3. (Extra Credit) Exploring some recent architectures.
3.1 Implicit network (10 points)
Implement an implicit decoder that takes 3D locations as input and outputs the corresponding occupancy values.
Architecture:
Inspired by Occupancy Networks, I implemented the following version of an implicit network. Although the paper suggests using Conditional Batch Normalization (CBN) to condition on the image features, for the purpose of this assignment I simply add the image and point encodings; using CBN might improve performance further.
import torch
import torch.nn as nn
import torchvision.models as torchvision_models
from torchvision import transforms


class OccupancyNetwork(nn.Module):
    def __init__(self, args):
        super(OccupancyNetwork, self).__init__()
        self.device = "cuda"
        # Pretrained image encoder (e.g. ResNet-18) with the classification head removed.
        vision_model = torchvision_models.__dict__[args.arch](pretrained=True)
        self.encoder = torch.nn.Sequential(*(list(vision_model.children())[:-1]))
        self.normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        # Projects the 512-d image feature down to a 128-d conditioning code.
        self.image_branch = nn.Sequential(
            nn.Linear(512, 256),
            nn.PReLU(),
            nn.Linear(256, 128),
            nn.PReLU(),
        )
        # Embeds each query point (x, y, z) into the same 128-d space.
        self.point_branch = nn.Sequential(
            nn.Linear(3, 64),
            nn.PReLU(),
            nn.Linear(64, 128),
            nn.PReLU(),
            nn.Linear(128, 128),
            nn.PReLU(),
        )
        # Predicts a per-point occupancy logit from the combined features.
        self.decoder = nn.Sequential(
            nn.Linear(128, 128),
            nn.PReLU(),
            nn.Linear(128, 128),
            nn.PReLU(),
            nn.Linear(128, 64),
            nn.PReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, images, points):
        B = images.shape[0]
        images_normalize = self.normalize(images.permute(0, 3, 1, 2))            # (B, 3, 137, 137)
        encoded_feat = self.encoder(images_normalize).squeeze(-1).squeeze(-1)     # (B, 512)
        image_feats = encoded_feat.view(B, 1, -1)                                 # (B, 1, 512)
        img_feats = self.image_branch(image_feats)                                # (B, 1, 128)
        point_feats = self.point_branch(points)                                   # (B, P, 128)
        combined_feats = img_feats + point_feats                                  # broadcast over the P query points
        final_probs = self.decoder(combined_feats)                                # (B, P, 1) occupancy logits
        return final_probs
Training & Evaluation commands:
python q3_train_implicit.py --id 'q3_implicit' --batch_size 32 --output_dir ./output --max_iter 10000 --num_workers 4 --num_points_to_sample 10000
python eval_q3.py --id 'q3_implicit' --model_type "occupancy" --type 'vox' --output_dir "./test_outputs" --checkpoint_path <checkpoint_path>
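For visualization and evaluation, the per-point occupancy predictions are converted back into a voxel grid (and then a mesh) by querying the network on a dense grid of 3D locations and thresholding. A hedged sketch of that step (the grid bounds, resolution, and 0.5 threshold are assumptions about my eval script):

```python
import torch
from pytorch3d.ops import cubify

@torch.no_grad()
def predict_voxels(model, image, resolution=32, threshold=0.5):
    # Build a dense (resolution^3, 3) grid of query points in the normalized cube [-0.5, 0.5]^3.
    coords = torch.linspace(-0.5, 0.5, resolution, device=image.device)
    grid = torch.stack(torch.meshgrid(coords, coords, coords, indexing="ij"), dim=-1).reshape(1, -1, 3)
    logits = model(image, grid)                                        # (1, resolution^3, 1)
    occupancy = torch.sigmoid(logits).reshape(1, resolution, resolution, resolution)
    return cubify(occupancy, threshold)                                # Meshes object, ready for rendering
```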
Visualizations:
3.2 Parametric network (10 points)
Implement a parametric function that takes sampled 2D points as input and outputs their corresponding 3D points.
Architecture:
Inspired by AtlasNet, I implemented a fairly naive version of a parametric network: random points sampled from a 2D surface are mapped to 3D point locations conditioned on the image. I use only a single decoder, i.e. a one-fold (single-patch) output; adding multiple parallel decoders might improve performance further. The architecture used for my approach is shown below:
# Uses the same imports as OccupancyNetwork above.
class ParametricNetwork(nn.Module):
    def __init__(self, args):
        super(ParametricNetwork, self).__init__()
        self.device = "cuda"
        # Pretrained image encoder with the classification head removed.
        vision_model = torchvision_models.__dict__[args.arch](pretrained=True)
        self.n_points = args.n_points
        self.encoder = torch.nn.Sequential(*(list(vision_model.children())[:-1]))
        self.normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        # Lifts the 512-d image feature to a 1024-d conditioning code.
        self.image_branch = nn.Sequential(
            nn.Linear(512, 1024),
            nn.PReLU(),
            nn.Linear(1024, 1024),
            nn.PReLU(),
        )
        # Embeds the flattened 2D samples (n_points * 2) into the same 1024-d space.
        self.parametric_branch = nn.Sequential(
            nn.Linear(self.n_points * 2, 2048),
            nn.PReLU(),
            nn.Linear(2048, 1024),
            nn.PReLU(),
        )
        # Regresses all 3D point coordinates (n_points * 3) in one shot.
        self.decoder = nn.Sequential(
            nn.Linear(1024, 2048),
            nn.PReLU(),
            nn.Linear(2048, 2048),
            nn.PReLU(),
            nn.Linear(2048, self.n_points * 3),
        )

    def forward(self, images, points):
        B = images.shape[0]
        images_normalize = self.normalize(images.permute(0, 3, 1, 2))            # (B, 3, 137, 137)
        encoded_feat = self.encoder(images_normalize).squeeze(-1).squeeze(-1)     # (B, 512)
        img_feats = self.image_branch(encoded_feat)                               # (B, 1024)
        points = points.view(B, -1)                                               # flatten 2D samples to (B, n_points * 2)
        point_feats = self.parametric_branch(points)                              # (B, 1024)
        combined_feats = img_feats + point_feats
        final_preds = self.decoder(combined_feats)                                # (B, n_points * 3)
        return final_preds.reshape(-1, self.n_points, 3)
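At inference time the point cloud is generated by sampling fresh 2D points and decoding them through the network; a minimal usage sketch (the uniform [0, 1]^2 sampling is an assumption):

```python
import torch

@torch.no_grad()
def sample_point_cloud(model, image, n_points=5000):
    # n_points must equal model.n_points, since the parametric branch's input size is fixed at construction.
    uv = torch.rand(1, n_points, 2, device=image.device)  # random 2D parameters in the unit square
    return model(image, uv)                               # (1, n_points, 3) predicted point cloud
```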
Training & Evaluation commands:
python q3_train_parametric.py --type 'point' --model_type parametric --id 'q3_parametric' --batch_size 32 --save_freq 500 --output_dir ./output --max_iter 10000 --num_workers 4
python eval_q3.py --type 'point' --model_type parametric --id 'q3_parametric' --output_dir "./test_outputs" --checkpoint_path <checkpoint_path>