16-889 Assignment 2

Name: Anirudh Chakravarthy

Andrew ID: achakrav

Late day: 1

Question 1

For visualizations in Q1.1 to Q1.3, left is the prediction, and right is the ground truth.

Question 1.1

Question 1.1 Question 1.1

Usage:

python fit_data.py --type vox
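
For reference, a minimal sketch of the voxel fitting objective: binary cross-entropy between predicted occupancy logits and the target grid (tensor shapes assumed):

import torch
import torch.nn.functional as F

def voxel_loss(pred_logits, gt_voxels):
    # pred_logits, gt_voxels: (B, D, H, W); gt_voxels holds binary occupancies.
    return F.binary_cross_entropy_with_logits(pred_logits, gt_voxels.float())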

Question 1.2

Question 1.2 Question 1.2

Usage:

python fit_data.py --type point
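
One plausible chamfer-distance implementation, using pytorch3d's knn_points (point clouds of shape (B, N, 3) assumed):

from pytorch3d.ops import knn_points

def chamfer_loss(src, tgt):
    # Symmetric chamfer: mean squared nearest-neighbour distance in both directions.
    d_src = knn_points(src, tgt, K=1).dists.squeeze(-1)  # (B, N)
    d_tgt = knn_points(tgt, src, K=1).dists.squeeze(-1)  # (B, M)
    return d_src.mean() + d_tgt.mean()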

Question 1.3

Question 1.3 Question 1.3

Usage:

python fit_data.py --type mesh
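
A hypothetical sketch of the mesh-fitting loop: learn per-vertex offsets on a source mesh so that sampled surface points match the target, with a Laplacian smoothness regularizer (names and hyperparameters assumed):

import torch
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def fit_mesh(src_mesh, tgt_mesh, steps=2000, w_smooth=0.1, lr=1e-2):
    # Learnable per-vertex offsets applied to the source mesh.
    offsets = torch.zeros(src_mesh.verts_packed().shape, requires_grad=True)
    opt = torch.optim.Adam([offsets], lr=lr)
    for _ in range(steps):
        new_mesh = src_mesh.offset_verts(offsets)
        pred_pts = sample_points_from_meshes(new_mesh, 5000)
        tgt_pts = sample_points_from_meshes(tgt_mesh, 5000)
        loss, _ = chamfer_distance(pred_pts, tgt_pts)
        loss = loss + w_smooth * mesh_laplacian_smoothing(new_mesh)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return src_mesh.offset_verts(offsets.detach())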

Question 2

For visualizations in Q2.1 to Q2.3, left is the image, center is the prediction, and right is the ground truth.

Question 2.1: Voxel

Question 2.1.a Question 2.1.a Question 2.1.a

Question 2.1.b Question 2.1.b Question 2.1.b

Question 2.1.c Question 2.1.c Question 2.1.c

Usage:

python train_model.py --type vox --batch_size 32
python eval_model.py --type vox --load_checkpoint

Question 2.2: Point Cloud

Question 2.2.a Question 2.2.a Question 2.2.a

Question 2.2.b Question 2.2.b Question 2.2.b

Question 2.2.c Question 2.2.c Question 2.2.c

Usage:

python train_model.py --type point --batch_size 8
python eval_model.py --type point --load_checkpoint

Question 2.3: Mesh

Question 2.3.a Question 2.3.a Question 2.3.a

Question 2.3.b Question 2.3.b Question 2.3.b

Question 2.3.c Question 2.3.c Question 2.3.c

Usage:

python train_model.py --type mesh --w_smooth 500 --batch_size 32
python eval_model.py --type mesh --load_checkpoint

Question 2.4

Type        | Avg F1 Score
----------- | ------------
Voxel       | 85.613
Point Cloud | 95.505
Mesh        | 94.848

Analysis:

Question 2.5

To train the mesh decoder, I used a smoothness weight of 500 while performing a sum reduction over the chamfer distance. On reducing this weight to 200, the rendered chairs became pointier and populated with abrupt triangles. On increasing the weight, the chairs became smoother and more continuous.
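
A minimal sketch of the loss described above (names assumed): chamfer with a sum reduction plus a weighted Laplacian smoothness term.

from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

def mesh_loss(pred_points, gt_points, pred_mesh, w_smooth=500.0):
    loss_chamfer, _ = chamfer_distance(pred_points, gt_points, point_reduction="sum")
    # Larger w_smooth (e.g. 500) gives smoother surfaces; smaller (e.g. 200)
    # leaves pointier, more abrupt triangles, as discussed above.
    return loss_chamfer + w_smooth * mesh_laplacian_smoothing(pred_mesh)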

Rendered GIFs using a low smoothness weight (left) vs. a high smoothness weight (right):

Question 2.5.1 Question 2.5.2

P.S: Sorry for the differing frame rates and different colours :)

Question 2.6

In this assignment, since our network is trained on single views to generate 3D reconstructions, a natural question is whether it generalizes across slight changes in viewpoint. Concretely, if we perform image-level data augmentation with an affine transformation, how does the predicted 3D structure look? Intuitively, the structure should be the same, since we predict objects in a canonical frame of reference.

In the figures below, left-most: original image, center-left: transformed image, center-right: predicted 3D, right: GT 3D.

First, I applied a random rotation to the input RGB image and observed the corresponding reconstruction.

Question 2.6.a.1 Question 2.6.a.2 Question 2.6.a.3 Question 2.6.a.4

The results still look good, which suggests the network can reconstruct well if we rotate the camera slightly about its Z-axis.

Next, what if we use an affine transformation combining rotation, translation, and scaling?

Question 2.6.b.1 Question 2.6.b.2 Question 2.6.b.3 Question 2.6.b.4

We observe that the results take a significant hit. My hypothesis is that the dataset consists of images with the object centered, so changes in translation and scale expose the network's fragility in this regard.

Finally, as a fun experiment, what if we change the appearance of the chair using colour jitter?

Question 2.6.c.1 Question 2.6.c.2 Question 2.6.c.3 Question 2.6.c.4

As anticipated, the network still performs reasonably well on reconstruction.

And as an even more fun experiment, what if we colour the chair red?

Question 2.6.d.1 Question 2.6.d.2 Question 2.6.d.3 Question 2.6.d.4

Still works pretty well!
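
The augmentations above can be reproduced with torchvision; a hypothetical sketch (exact parameters assumed):

import torchvision.transforms as T

rotate = T.RandomRotation(degrees=30)  # in-plane rotation (camera Z-axis)
affine = T.RandomAffine(degrees=30, translate=(0.2, 0.2), scale=(0.7, 1.3))  # rotation + translation + scale
jitter = T.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.3)  # appearance change

augmented = affine(image)  # image: a PIL image or (3, H, W) tensor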

Usage:

python interpret.py --type vox

Question 3

Question 3.1

I implemented the following decoder, which concatenates the encoded feature vector with a spatial location and predicts the occupancy of the corresponding location.

self.decoder = nn.Sequential(*[
    nn.Linear(512 + 3, 1024),  # 512-d image feature concatenated with an (x, y, z) query point
    nn.ReLU(),
    nn.Linear(1024, 2048),
    nn.ReLU(),
    nn.Linear(2048, 1),  # occupancy logit for the query point
])
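
At evaluation time, one way to visualize the implicit decoder is to query a dense grid of points and run marching cubes over the predicted occupancies; a hypothetical sketch using PyMCubes (grid range, resolution, and threshold assumed):

import mcubes
import numpy as np
import torch

def extract_mesh(decoder, feat, res=32, thresh=0.5):
    # feat: (512,) encoded image feature; query a res^3 grid over [-1, 1]^3.
    axis = torch.linspace(-1, 1, res)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1)
    pts = grid.reshape(-1, 3)  # (res^3, 3)
    inp = torch.cat([feat.expand(pts.shape[0], -1), pts], dim=-1)  # (res^3, 515)
    with torch.no_grad():
        occ = torch.sigmoid(decoder(inp)).reshape(res, res, res)
    verts, faces = mcubes.marching_cubes(occ.numpy().astype(np.float64), thresh)
    return verts, faces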

For visualizations, left is the image, center is the prediction, and right is the ground truth.

Question 3.1.a.1 Question 3.1.a.2 Question 3.1.a.3

Question 3.1.b.1 Question 3.1.b.2 Question 3.1.b.3

Question 3.1.c.1 Question 3.1.c.2 Question 3.1.c.3

Usage:

python train_occupancy.py --batch_size 32
python eval_occupancy.py --load_checkpoint