Assignment 2 - Learning for 3D Vision [16889]

NAME: Hiresh Gupta

ANDREW ID: hireshg

Late days used: 1

1. Exploring Loss Functions

1.1. Fitting a voxel grid (5 points)

Visualize the optimized voxel grid alongside the ground truth voxel grid using the tools learnt in the previous section.

python fit_data.py --type 'vox' --save_dir "./output" --save_predictions --save_visualizations

Q.1.1 Source Q.1.1 Target
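For reference, a minimal sketch of the kind of objective this step optimizes: a binary cross-entropy between the (logit-valued) voxel grid being optimized and the binary ground-truth occupancies. The 32^3 resolution and the optimizer settings below are illustrative, not the exact values used by fit_data.py.

import torch
import torch.nn.functional as F

def voxel_fitting_loss(pred_logits, gt_voxels):
    # pred_logits: (B, D, H, W) unconstrained values being optimized
    # gt_voxels:   (B, D, H, W) binary occupancies in {0, 1}
    return F.binary_cross_entropy_with_logits(pred_logits, gt_voxels)

# Toy usage: directly optimize a voxel grid towards a random binary target.
gt = (torch.rand(1, 32, 32, 32) > 0.5).float()
pred = torch.zeros(1, 32, 32, 32, requires_grad=True)
opt = torch.optim.Adam([pred], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = voxel_fitting_loss(pred, gt)
    loss.backward()
    opt.step()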

1.2. Fitting a point cloud (10 points)

Visualize the optimized point cloud alongside the ground truth point cloud using the tools learnt in the previous section.

python fit_data.py --type 'point' --save_dir "./output" --save_predictions --save_visualizations

Q.1.2 Source Q.1.2 Target
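A minimal sketch of the symmetric chamfer objective such a fitting step typically minimizes, written here with pytorch3d.ops.knn_points (which returns squared nearest-neighbour distances); pytorch3d.loss.chamfer_distance is an off-the-shelf alternative. Point counts and optimizer settings are illustrative.

import torch
from pytorch3d.ops import knn_points

def chamfer_loss(pred_points, gt_points):
    # pred_points: (B, N1, 3), gt_points: (B, N2, 3)
    d_pred_to_gt = knn_points(pred_points, gt_points, K=1).dists[..., 0]  # (B, N1) squared distances
    d_gt_to_pred = knn_points(gt_points, pred_points, K=1).dists[..., 0]  # (B, N2)
    return d_pred_to_gt.mean() + d_gt_to_pred.mean()

# Toy usage: optimize a random point set towards a target point cloud.
gt = torch.rand(1, 5000, 3)
pred = torch.randn(1, 5000, 3, requires_grad=True)
opt = torch.optim.Adam([pred], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = chamfer_loss(pred, gt)
    loss.backward()
    opt.step()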

1.3. Fitting a mesh (5 points)

Visualize the optimized mesh alongside the ground truth mesh using the tools learnt in the previous section.

python fit_data.py --type 'mesh' --save_dir "./output" --save_predictions --save_visualizations

Q.1.3 Source Q.1.3 Target
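A minimal sketch of mesh fitting in the same spirit, assuming vertex offsets of an ico_sphere are optimized with a chamfer term on points sampled from both surfaces plus a Laplacian smoothness term; the target mesh, sample counts, and loss weight below are placeholders.

import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

src_mesh = ico_sphere(level=4)   # source mesh being deformed
tgt_mesh = ico_sphere(level=4)   # stand-in for the ground-truth mesh
deform = torch.zeros_like(src_mesh.verts_packed(), requires_grad=True)
opt = torch.optim.Adam([deform], lr=1e-2)

for _ in range(200):
    opt.zero_grad()
    new_mesh = src_mesh.offset_verts(deform)
    pred_pts = sample_points_from_meshes(new_mesh, num_samples=5000)
    gt_pts = sample_points_from_meshes(tgt_mesh, num_samples=5000)
    chamfer, _ = chamfer_distance(pred_pts, gt_pts)
    smooth = mesh_laplacian_smoothing(new_mesh, method="uniform")
    loss = chamfer + 0.1 * smooth
    loss.backward()
    opt.step()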

2. Reconstructing 3D from single view

2.1. Image to voxel grid (15 points)

Training

# Train model: 
python train_model.py --type 'vox' --id 'q2' --batch_size 64 --save_freq 500 --output_dir=./outputs --max_iter 10000 --vis_freq 250 --num_workers 4
Hyperparameters
lr=4e-4
batch_size=64
max_iters=10000
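A rough sketch of how the hyperparameters above plug into the training loop; the model and data below are random placeholders standing in for the image encoder/voxel decoder and the dataloader, so only the optimizer settings and iteration structure are the point.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder network: flattened image -> 32^3 occupancy logits (sizes illustrative).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 137 * 137, 512),
                      nn.ReLU(), nn.Linear(512, 32 ** 3))
opt = torch.optim.Adam(model.parameters(), lr=4e-4)

batch_size, max_iter = 64, 10000
for step in range(max_iter):
    images = torch.rand(batch_size, 3, 137, 137)               # stand-in for a dataloader batch
    gt_vox = (torch.rand(batch_size, 32 ** 3) > 0.5).float()   # stand-in ground-truth voxels
    opt.zero_grad()
    loss = F.binary_cross_entropy_with_logits(model(images), gt_vox)
    loss.backward()
    opt.step()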

Evaluation

python eval_model.py --type 'vox' --id 'q2' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --vis_freq 100 
Visualizations

Q.2.1 Input Q.2.1 Input Q.2.1 Target

Q.2.1 Input Q.2.1 Input Q.2.1 Target

Q.2.1 Input Q.2.1 Input Q.2.1 Target

2.2. Image to point cloud (15 points)

Training

python train_model.py --type 'point' --id 'q2' --batch_size 8 --save_freq 500 --output_dir=./outputs --max_iter 10000 --vis_freq 250 --num_workers 4
Hyperparameters
lr=4e-4
batch_size=8
max_iters=10000

Evaluation

python eval_model.py --type 'point' --id 'q2' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --vis_freq 100 
Visualizations

Q.2.2 Input Q.2.2 Input Q.2.2 Target

Q.2.2 Input Q.2.2 Input Q.2.2 Target

Q.2.2 Input Q.2.2 Input Q.2.2 Target

2.3. Image to mesh (15 points)

Training

python train_model.py --type 'mesh' --id 'q2' --batch_size 8 --save_freq 500 --output_dir=./outputs --max_iter 10000 --vis_freq 250 --num_workers 4
Hyperparameters
lr=4e-4
batch_size=8
max_iters=10000
w_smooth=0.1
w_chamfer=1.0

Evaluation

python eval_model.py --type 'mesh' --id 'q2' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --vis_freq 100 
Visualizations

Q.2.3 Input Q.2.3 Input Q.2.3 Target

Q.2.3 Input Q.2.3 Input Q.2.3 Target

Q.2.3 Input Q.2.3 Input Q.2.3 Target

2.4. Quantitative comparisons (10 points)

Quantitatively compare the F1 score of 3D reconstruction for meshes vs. point clouds vs. voxel grids. Provide an intuitive explanation justifying the comparison.

The quantitative comparison of the F1 score @ 0.05 is shown in the following table:

| Representation | Avg. F1 Score @ 0.05 | Hyperparameters |
| --- | --- | --- |
| Voxels | 90.406 | lr=4e-4, batch_size=8, max_iters=10000 |
| Point Cloud | 94.938 | lr=4e-4, batch_size=8, max_iters=10000 |
| Mesh | 90.006 | lr=4e-4, batch_size=8, max_iters=10000, w_smooth=0.1, w_chamfer=1.0 |

Explanation:

Based on the above table, the average F1 score for point clouds is the highest (around 95%), while the other two representations (voxels and meshes) converge around 90%. This is intuitive: a point cloud is free to place points anywhere in space (it can expand and contract freely), whereas a voxel grid must predict occupancy within a fixed cube volume and is constrained by its resolution. Meshes are limited by the initial topology of the template mesh, which makes it harder to capture 3D objects with varying topology, i.e., different surfaces and holes.
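For context, this is roughly how an F1 score at a distance threshold (0.05 here) is computed from sampled predicted and ground-truth points; the exact eval_model.py implementation may differ in details such as normalization.

import torch
from pytorch3d.ops import knn_points

def f1_score(pred_points, gt_points, threshold=0.05):
    # pred_points: (B, N1, 3), gt_points: (B, N2, 3); knn_points returns squared distances
    d_pred = knn_points(pred_points, gt_points, K=1).dists[..., 0].sqrt()  # (B, N1)
    d_gt = knn_points(gt_points, pred_points, K=1).dists[..., 0].sqrt()    # (B, N2)
    precision = 100.0 * (d_pred < threshold).float().mean()  # % of predicted points near the GT surface
    recall = 100.0 * (d_gt < threshold).float().mean()       # % of GT points covered by a prediction
    return 2.0 * precision * recall / (precision + recall + 1e-8)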

2.5. Analyse effects of hyperparameter variations (10 points)

Analyse the results by varying a hyperparameter of your choice, for example n_points, vox_size, w_chamfer, or the initial mesh (ico_sphere).

Solution: I measured the effect of varying the n_points parameter of the point cloud decoder on model performance. The corresponding training and evaluation commands are listed below:

Training Commands (varying the n_points parameter)

# 500 points
python train_model.py --type 'point' --id 'q2_500' --batch_size 8 --output_dir=./outputs --max_iter 5000 --num_workers 4 --n_points 500

# 1000 points
python train_model.py --type 'point' --id 'q2_1000' --batch_size 8 --output_dir=./outputs --max_iter 5000 --num_workers 4 --n_points 1000

# 2500 points
python train_model.py --type 'point' --id 'q2_2500' --batch_size 8 --output_dir=./outputs --max_iter 5000 --num_workers 4 --n_points 2500

# 5000 points
python train_model.py --type 'point' --id 'q2_5000' --batch_size 8 --output_dir=./outputs --max_iter 5000 --num_workers 4 --n_points 5000

Evaluation (varying the n_points parameter)

# 500 points
python eval_model.py --type 'point' --id 'q2_500' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --n_points 500 

# 1000 points
python eval_model.py --type 'point' --id 'q2_1000' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --n_points 1000 

# 2500 points
python eval_model.py --type 'point' --id 'q2_2500' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --n_points 2500 

# 5000 points
python eval_model.py --type 'point' --id 'q2_5000' --output_dir "./test_outputs" --checkpoint_path "<ckpt_path>" --n_points 5000 

Quantitative Comparison

I observed the following results:

| Number of Sampled Points | Avg. F1 Score @ 0.05 | Hyperparameters |
| --- | --- | --- |
| n_points=500 | 82.991 | lr=4e-4, batch_size=8, max_iters=5000 |
| n_points=1000 | 88.563 | lr=4e-4, batch_size=8, max_iters=5000 |
| n_points=2500 | 92.098 | lr=4e-4, batch_size=8, max_iters=5000 |
| n_points=5000 | 94.377 | lr=4e-4, batch_size=8, max_iters=5000 |

Analysis

The network trains a bit faster with fewer points. Based on the above table, the F1 score declines as the number of points is decreased from 5000 to 500. This is intuitive: additional points let the point cloud capture finer structure, which increases the F1 score.
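To make the knob concrete: in a fully-connected point decoder (like the head of the parametric network in section 3.2 below), n_points directly sets the size of the output layer, and it also controls how many ground-truth points are sampled for the chamfer loss. A tiny sketch with an illustrative hidden size:

import torch
import torch.nn as nn

n_points = 2500                      # the hyperparameter being varied
head = nn.Linear(512, n_points * 3)  # output layer grows linearly with n_points
feats = torch.rand(8, 512)           # stand-in for image features (batch of 8)
pred_points = head(feats).reshape(8, n_points, 3)
print(pred_points.shape)             # torch.Size([8, 2500, 3])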

Visualizations

Sample point cloud predictions with n_points=500:

Q.2.5 Input Q.2.5 Input Q.2.5 Target

Q.2.5 Input Q.2.5 Input Q.2.5 Target

Sample point cloud predictions with n_points=1000:

Q.2.5 Input Q.2.5 Input Q.2.5 Target

Q.2.5 Input Q.2.5 Input Q.2.5 Target

Sample point cloud predictions with n_points=2500:

Q.2.5 Input Q.2.5 Input Q.2.5 Target

Q.2.5 Input Q.2.5 Input Q.2.5 Target

Sample point cloud predictions with n_points=5000:

Q.2.5 Input Q.2.5 Input Q.2.5 Target

Q.2.5 Input Q.2.5 Input Q.2.5 Target

2.6. Interpret your model (15 points)

Simply seeing final predictions and numerical evaluations is not always insightful. Can you create some visualizations that help highlight what your learned model does? Be creative and think of what visualizations would help you gain insights. There is no 'right' answer - although reading some papers to get inspiration might give you ideas.

Solution: To understand what the network is actually learning, I ran Grad-CAM++ and Guided Backpropagation to see which image features produce the strongest gradient responses. You can run the following commands to visualize the gradient maps on our test set.

Usage:

python gradcam.py --type vox --checkpoint_path <vox_trained_checkpoint_path> --max_iter 10  --visualizer gradcam++ --output_dir ./output/gradcam++_visualizations/vox

python gradcam.py --type point --checkpoint_path <point_trained_checkpoint_path> --max_iter 10  --visualizer gradcam++ --output_dir ./output/gradcam++_visualizations/point

python gradcam.py --type mesh --checkpoint_path <mesh_trained_checkpoint_path> --max_iter 10  --visualizer gradcam++ --output_dir ./output/gradcam++_visualizations/mesh
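For reference, a minimal sketch of the hook-based idea behind these maps. This is plain Grad-CAM rather than Grad-CAM++, the layer4 choice assumes a ResNet-18 encoder, and the scalar target (summed logits) is only a placeholder for the model-specific targets used in gradcam.py.

import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()
acts, grads = {}, {}
layer = model.layer4  # last convolutional block of the encoder
layer.register_forward_hook(lambda m, i, o: acts.update(feat=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(feat=go[0]))

image = torch.rand(1, 3, 137, 137, requires_grad=True)
scalar_target = model(image).sum()  # placeholder scalar; e.g. summed occupancy for the voxel head
scalar_target.backward()

weights = grads["feat"].mean(dim=(2, 3), keepdim=True)           # GAP of gradients per channel
cam = F.relu((weights * acts["feat"]).sum(dim=1, keepdim=True))  # weighted activation map
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)         # normalized heatmap to overlay on the input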

GradCAM++ Visualizations:

1. GradCAM++ on Voxel predictions:

In Out

In Out

In Out

2. GradCAM++ on Point predictions:

In Out

In Out

In Out

3. GradCAM++ on Mesh predictions:

In Out

In Out

In Out

My interpretation of the gradient visualizations:

Interestingly, almost all the networks focus on the chair edges, detecting corners and end points in order to lift the image into 3D. Here is a deeper interpretation for each architecture:

  1. Voxel: Since the voxel decoder predicts an occupancy probability over the whole 3D volume, we see fairly high gradient values throughout the chair region. It also picks up gradient signal in harder cases with poor lighting or occlusion.

  2. Point: The point decoder also focuses strongly on the chair edges. However, gradients are weak around some occluded parts, such as the top of the chair or the chair legs. This helps explain why the point cloud hallucinates in certain regions and fails to capture fine structure there.

  3. Mesh: The mesh gradients show that, like the other two decoders, it also focuses on extracting edges. Looking closely, however, the gradients are not as sharp as those of the point or voxel decoders. In the example above, gradient values are clearly visible on the frontal part and weaker on the occluded part, which may be because the mesh allocates more faces to the frontal region and attends less to the occluded one.

3. (Extra Credit) Exploring some recent architectures.

3.1 Implicit network (10 points)

Implement an implicit decoder that takes 3D locations as input and outputs occupancy values.

Architecture:

Inspired by Occupancy Networks, I implemented the following version of an implicit network. Although the paper suggests using Conditional Batch Normalization (CBN) to condition on the image features, for this assignment I simply add the image and point encodings; using CBN instead might improve performance further.

import torch
import torch.nn as nn
import torchvision.models as torchvision_models
from torchvision import transforms


class OccupancyNetwork(nn.Module):
    def __init__(self, args):
        super(OccupancyNetwork, self).__init__()
        self.device = "cuda"
        # Pretrained ResNet-style image encoder with the classification head removed.
        vision_model = torchvision_models.__dict__[args.arch](pretrained=True)
        self.encoder = torch.nn.Sequential(*(list(vision_model.children())[:-1]))
        self.normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        # Projects the 512-d image feature down to a 128-d conditioning code.
        self.image_branch = nn.Sequential(*[
            nn.Linear(512, 256),
            nn.PReLU(),
            nn.Linear(256, 128),
            nn.PReLU()
        ])
        # Embeds each 3D query location into the same 128-d space.
        self.point_branch = nn.Sequential(*[
            nn.Linear(3, 64),
            nn.PReLU(),
            nn.Linear(64, 128),
            nn.PReLU(),
            nn.Linear(128, 128),
            nn.PReLU(),
        ])
        # Predicts a per-point occupancy score from the combined feature.
        self.decoder = nn.Sequential(*[
            nn.Linear(128, 128),
            nn.PReLU(),
            nn.Linear(128, 128),
            nn.PReLU(),
            nn.Linear(128, 64),
            nn.PReLU(),
            nn.Linear(64, 1)
        ])

    def forward(self, images, points):
        B = images.shape[0]
        images_normalize = self.normalize(images.permute(0, 3, 1, 2))  # (N, 3, 137, 137)
        encoded_feat = self.encoder(images_normalize).squeeze(-1).squeeze(-1)  # (N, 512)
        image_feats = encoded_feat.view(B, 1, -1)
        img_feats = self.image_branch(image_feats)   # (B, 1, 128)
        point_feats = self.point_branch(points)      # (B, P, 128)
        combined_feats = img_feats + point_feats     # broadcast image code over the P query points
        final_probs = self.decoder(combined_feats)   # (B, P, 1) occupancy scores (no sigmoid applied here)
        return final_probs
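A sketch of how this module can be driven at train time, assuming (B, 137, 137, 3) images, 3D query locations sampled inside the object volume (the [-1, 1]^3 range and the lr below are assumptions), and a BCE-with-logits loss against the ground-truth occupancy at those locations:

import torch
import torch.nn.functional as F
from types import SimpleNamespace

args = SimpleNamespace(arch="resnet18")           # stand-in for the argparse namespace
net = OccupancyNetwork(args).cuda()
opt = torch.optim.Adam(net.parameters(), lr=4e-4)

images = torch.rand(32, 137, 137, 3).cuda()               # (B, H, W, 3), as in forward()
points = torch.rand(32, 10000, 3).cuda() * 2.0 - 1.0      # query locations, assumed in [-1, 1]^3
gt_occ = (torch.rand(32, 10000, 1) > 0.5).float().cuda()  # stand-in ground-truth occupancies

logits = net(images, points)                               # (B, 10000, 1) occupancy scores
loss = F.binary_cross_entropy_with_logits(logits, gt_occ)
loss.backward()
opt.step()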

Training & Evaluation commands:

python q3_train_implicit.py --id 'q3_implicit' --batch_size 32  --output_dir ./output --max_iter 10000 --num_workers 4 --num_points_to_sample 10000

python eval_q3.py --id 'q3_implicit' --model_type "occupancy" --type 'vox' --output_dir "./test_outputs" --checkpoint_path <checkpoint_path> 

Visualizations:

Q.3.1 Input Q.3.1 Input Q.3.1 Target

Q.3.1 Input Q.3.1 Input Q.3.1 Target

Q.3.1 Input Q.3.1 Input Q.3.1 Target

3.2 Parametric network (10 points)

Implement a parametric function that takes sampled 2D points as input and outputs their respective 3D points.

Architecture:

Inspired by AtlasNet, I implemented a fairly naive parametric network: I sample random points on a 2D surface and map them to 3D point locations. I use only a single decoder, so the network produces a one-fold output; adding multiple parallel decoders (patches) might improve performance further. The architecture used for my approach is shown below:

import torch
import torch.nn as nn
import torchvision.models as torchvision_models
from torchvision import transforms


class ParametricNetwork(nn.Module):
    def __init__(self, args):
        super(ParametricNetwork, self).__init__()
        self.device = "cuda"
        # Pretrained ResNet-style image encoder with the classification head removed.
        vision_model = torchvision_models.__dict__[args.arch](pretrained=True)
        self.n_points = args.n_points
        self.encoder = torch.nn.Sequential(*(list(vision_model.children())[:-1]))
        self.normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        # Expands the 512-d image feature to a 1024-d conditioning code.
        self.image_branch = nn.Sequential(*[
            nn.Linear(512, 1024),
            nn.PReLU(),
            nn.Linear(1024, 1024),
            nn.PReLU()
        ])
        # Embeds the flattened 2D samples (n_points x 2) into the same 1024-d space.
        self.parametric_branch = nn.Sequential(*[
            nn.Linear(self.n_points * 2, 2048),
            nn.PReLU(),
            nn.Linear(2048, 1024),
            nn.PReLU(),
        ])
        # Single one-fold decoder mapping the combined feature to n_points 3D locations.
        self.decoder = nn.Sequential(*[
            nn.Linear(1024, 2048),
            nn.PReLU(),
            nn.Linear(2048, 2048),
            nn.PReLU(),
            nn.Linear(2048, self.n_points * 3)
        ])

    def forward(self, images, points):
        B = images.shape[0]
        images_normalize = self.normalize(images.permute(0, 3, 1, 2))  # (N, 3, 137, 137)
        encoded_feat = self.encoder(images_normalize).squeeze(-1).squeeze(-1)  # (N, 512)
        img_feats = self.image_branch(encoded_feat)   # (B, 1024)
        points = points.view(B, -1)                   # flatten 2D samples to (B, n_points * 2)
        point_feats = self.parametric_branch(points)  # (B, 1024)
        combined_feats = img_feats + point_feats
        final_preds = self.decoder(combined_feats)
        return final_preds.reshape(-1, self.n_points, 3)  # (B, n_points, 3) predicted point cloud
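And a corresponding sketch for training this decoder, assuming the 2D samples are drawn uniformly from the unit square and the predictions are supervised with a chamfer loss against points sampled from the ground-truth surface (the arch, n_points, and lr values are illustrative):

import torch
from types import SimpleNamespace
from pytorch3d.loss import chamfer_distance

args = SimpleNamespace(arch="resnet18", n_points=2500)  # stand-in for the argparse namespace
net = ParametricNetwork(args).cuda()
opt = torch.optim.Adam(net.parameters(), lr=4e-4)

images = torch.rand(32, 137, 137, 3).cuda()           # (B, H, W, 3)
uv = torch.rand(32, args.n_points, 2).cuda()          # random 2D samples on the unit square
gt_points = torch.rand(32, args.n_points, 3).cuda()   # stand-in for GT surface samples

pred_points = net(images, uv)                         # (B, n_points, 3)
loss, _ = chamfer_distance(pred_points, gt_points)
loss.backward()
opt.step()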

Training & Evaluation commands:

python q3_train_parametric.py --type 'point' --model_type parametric --id 'q3_parametric' --batch_size 32 --save_freq 500 --output_dir ./output --max_iter 10000 --num_workers 4

python eval_q3.py  --type 'point' --model_type parametric --id 'q3_parametric' --output_dir "./test_outputs" --checkpoint_path <checkpoint_path> 

Q.3.2 Input Q.3.2 Input Q.3.2 Target

Q.3.2 Input Q.3.2 Input Q.3.2 Target

Q.3.2 Input Q.3.2 Input Q.3.2 Target