16-889 Assignment 2
Name: Anirudh Chakravarthy
Andrew ID: achakrav
Late day: 1
Question 1
For visualizations in Q1.1 to Q1.3, left is the prediction, and right is the ground truth.
Question 1.1
Usage:
python fit_data.py --type vox
Question 1.2
Usage:
python fit_data.py --type point
Question 1.3
Usage:
python fit_data.py --type mesh
Question 2
For visualizations in Q2.1 to Q2.3, left is the input image, center is the prediction, and right is the ground truth.
Question 2.1: Voxel
Usage:
python train_model.py --type vox --batch_size 32
python eval_model.py --type vox --load_checkpoint
Question 2.2: Point Cloud
Usage:
python train_model.py --type point --batch_size 8
python eval_model.py --type point --load_checkpoint
Question 2.3: Mesh
Usage:
python train_model.py --type mesh --w_smooth 500 --batch_size 32
python eval_model.py --type mesh --load_checkpoint
Question 2.4
| Type | Avg F1 Score |
|---|---|
| Voxel | 85.613 |
| Point Cloud | 95.505 |
| Mesh | 94.848 |
Analysis:
- Point clouds are easy to optimize since we only have to reason about their spatial locations. A simple network is therefore enough to reach a high F1 score, since inter-connectivity is not enforced.
- Meshes with strong smoothness supervision also perform well: there are enough vertices to form the chair, and the smoothness term provides sufficient supervision for local smoothness.
- Voxels are a bit more challenging since there is an implicit class imbalance between occupied and unoccupied grid cells, so the F1 score is somewhat lower. (A sketch of the F1 metric itself follows this list.)
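For reference, a minimal sketch of the F1 metric as I understand it: precision and recall are computed between point sets sampled from the prediction and the ground truth, using a fixed distance threshold (the threshold value and function name below are illustrative):

import torch

def f1_score(pred_points, gt_points, threshold=0.05):
    # pred_points: (N, 3), gt_points: (M, 3).
    # Precision: fraction of predicted points within threshold of some GT point.
    # Recall: fraction of GT points within threshold of some predicted point.
    dists = torch.cdist(pred_points, gt_points)  # (N, M) pairwise distances
    precision = (dists.min(dim=1).values < threshold).float().mean()
    recall = (dists.min(dim=0).values < threshold).float().mean()
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)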
Question 2.5
To train the mesh decoder, I used a smoothness weight of 500 while performing a sum reduction over the chamfer distance. On reducing this weight to 200, the rendered chairs became more pointy and were populated with abrupt triangles. On increasing the weight, the chairs became smoother and more continuous.
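A rough sketch of how this loss combination looks with pytorch3d (the sampling count and exact reductions are simplifications; train_model.py may differ in the details):

from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(pred_meshes, gt_points, w_smooth=500.0, n_samples=5000):
    # Sample a point cloud from the predicted mesh surface.
    pred_points = sample_points_from_meshes(pred_meshes, num_samples=n_samples)
    # Chamfer distance with a sum reduction over points.
    loss_chamfer, _ = chamfer_distance(pred_points, gt_points, point_reduction="sum")
    # Uniform Laplacian smoothness regularizer, weighted by w_smooth.
    loss_smooth = mesh_laplacian_smoothing(pred_meshes, method="uniform")
    return loss_chamfer + w_smooth * loss_smooth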
The rendered GIFs with a low smoothness factor (left) vs. a high smoothness factor (right):
P.S: Sorry for the differing frame rates and different colours :)
Question 2.6
In this assignment, since our network is trained on single views to generate 3D reconstructions, a natural question is whether it can generalize across slight changes in viewpoint. Concretely, if we perform image-level data augmentation with an affine transformation, how does the predicted 3D structure look? Intuitively, the structure should be the same since we predict objects in a canonical frame of reference.
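A minimal sketch of the kind of image-level augmentations used in these experiments, built with torchvision (the parameter ranges below are illustrative, not necessarily the values in interpret.py):

import torchvision.transforms as T

# Rotation only: random rotation of the image about its Z-axis.
rotate = T.RandomRotation(degrees=30)

# Full affine: rotation + translation + scale.
affine = T.RandomAffine(degrees=30, translate=(0.2, 0.2), scale=(0.7, 1.3))

# Appearance change: colour jitter on brightness / contrast / saturation / hue.
jitter = T.ColorJitter(brightness=0.5, contrast=0.5, saturation=0.5, hue=0.2)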
In the figures below, left-most: original image, center-left: transformed image, center-right: predicted 3D, right: GT 3D.
First, I attempted a random rotation on the input RGB Image and observed the corresponding reconstruction.
The results still look good, which means the network can reconstruct well if we rotate the camera a bit about its Z-axis.
Next, what if we use an affine transformation with rotation, translation, and scale transformations?
We observe that the results take a real hit. My hypothesis is that the dataset consists of images with the object roughly centered, so changes in translation and scale expose the network's fragility in this regard.
Finally, as a fun experiment, what if we change the appearance of the chair using a colour jitter?
As anticipated, the network still performs reasonably well on reconstruction.
And as an even more fun experiment, what if we colour the chair red?
Still works pretty well!
Usage:
python interpret.py --type vox
Question 3
Question 3.1
I implemented the following decoder, which concatenates the encoded feature vector with a 3D spatial location and predicts the occupancy of that location.
# Input: the 512-d encoded image feature concatenated with a 3D point (x, y, z).
# Output: a single occupancy logit for that point.
self.decoder = nn.Sequential(
    nn.Linear(512 + 3, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2048),
    nn.ReLU(),
    nn.Linear(2048, 1),
)
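At inference, the decoder can be queried at a dense grid of 3D points by broadcasting the image feature to every point and applying a sigmoid to the logits. A rough sketch (the grid resolution and function name below are illustrative, not the exact code in eval_occupancy.py):

import torch

def predict_occupancy_grid(decoder, feat, res=32):
    # feat: (512,) encoded image feature; returns a (res, res, res) occupancy grid.
    coords = torch.linspace(-1.0, 1.0, res)
    grid = torch.stack(torch.meshgrid(coords, coords, coords, indexing="ij"), dim=-1)
    points = grid.reshape(-1, 3)                                        # (res^3, 3)
    inp = torch.cat([feat.expand(points.shape[0], -1), points], dim=1)  # (res^3, 515)
    logits = decoder(inp)                                               # (res^3, 1)
    return torch.sigmoid(logits).reshape(res, res, res)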
For visualizations, left is the input image, center is the prediction, and right is the ground truth.
Usage:
python train_occupancy.py --batch_size 32
python eval_occupancy.py --load_checkpoint