Learning for 3D Vision - 16889 [Assignment 2]

NAME: Mayank Agarwal

ANDREW ID: mayankag

LATE DAYS USED:


1. Exploring Loss Functions

1.1. Fitting a voxel grid (5 points)

Visualize the optimized voxel grid alongside the ground truth voxel grid using the tools learnt in the previous section.

python fit_data.py --type 'vox'

Question 1.1 Source | Target
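Below is a minimal PyTorch sketch of the kind of objective fit_data.py optimizes here, assuming a binary cross-entropy loss between a learnable occupancy grid and the ground truth grid (the 32x32x32 resolution, learning rate, and iteration count are illustrative, not the exact script settings):

```python
import torch
import torch.nn.functional as F

def voxel_loss(voxel_src, voxel_tgt):
    # voxel_src: raw logits, voxel_tgt: binary occupancies, both (B, D, H, W).
    return F.binary_cross_entropy_with_logits(voxel_src, voxel_tgt)

# Directly optimize a raw voxel grid to match a stand-in target grid.
voxel_tgt = torch.rand(1, 32, 32, 32).round()
voxel_src = torch.randn(1, 32, 32, 32, requires_grad=True)
optimizer = torch.optim.Adam([voxel_src], lr=1e-2)
for _ in range(1000):
    optimizer.zero_grad()
    loss = voxel_loss(voxel_src, voxel_tgt)
    loss.backward()
    optimizer.step()
```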

1.2. Fitting a point cloud (10 points)

Visualize the optimized point cloud alongside the ground truth point cloud using the tools learnt in the previous section.

python fit_data.py --type 'point'

Question 1.2 Source | Target
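As a rough illustration of the point cloud objective (not necessarily the exact implementation in fit_data.py), a symmetric Chamfer loss can be written in plain PyTorch and optimized directly over the source points:

```python
import torch

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # point_cloud_src: (B, N, 3), point_cloud_tgt: (B, M, 3).
    # Symmetric Chamfer: nearest-neighbour squared distances in both directions.
    dists = torch.cdist(point_cloud_src, point_cloud_tgt)   # (B, N, M)
    loss_src = dists.min(dim=2).values.pow(2).mean()        # src -> tgt
    loss_tgt = dists.min(dim=1).values.pow(2).mean()        # tgt -> src
    return loss_src + loss_tgt

# Directly optimize the source points to match a stand-in target cloud.
point_cloud_tgt = torch.rand(1, 1000, 3)
point_cloud_src = torch.randn(1, 1000, 3, requires_grad=True)
optimizer = torch.optim.Adam([point_cloud_src], lr=1e-2)
for _ in range(1000):
    optimizer.zero_grad()
    loss = chamfer_loss(point_cloud_src, point_cloud_tgt)
    loss.backward()
    optimizer.step()
```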

1.3. Fitting a mesh (5 points)

Visualize the optimized mesh alongside the ground truth mesh using the tools learnt in the previous section.

python fit_data.py --type 'mesh'

Question 1.3 Source | Target
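A minimal sketch of the mesh fitting step, assuming fit_data.py deforms an icosphere by minimizing a Chamfer loss on sampled surface points plus a Laplacian smoothing term (the target mesh, smoothing weight, and iteration count below are stand-ins):

```python
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing

src_mesh = ico_sphere(4)                      # initial mesh to deform
mesh_tgt = ico_sphere(4).scale_verts(0.75)    # stand-in for the ground truth mesh
deform = torch.zeros(src_mesh.verts_packed().shape, requires_grad=True)
optimizer = torch.optim.Adam([deform], lr=1e-2)
for _ in range(1000):
    optimizer.zero_grad()
    new_mesh = src_mesh.offset_verts(deform)
    sample_src = sample_points_from_meshes(new_mesh, 5000)
    sample_tgt = sample_points_from_meshes(mesh_tgt, 5000)
    loss, _ = chamfer_distance(sample_src, sample_tgt)
    loss = loss + 0.1 * mesh_laplacian_smoothing(new_mesh)
    loss.backward()
    optimizer.step()
```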

2. Reconstructing 3D from single view

2.1. Image to voxel grid (15 points)

Training

python train_model.py --type 'vox' --batch_size 32 --num_workers 4 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20
# Checkpoints stored at "checkpoints/vox.pth"
Hyperparameters
lr=1e-5
batch_size=32
max_iters=10000
Tensorboard Visualizations

Q2.1 Losses

Question 2.1 Reconstructions

Evaluation

python eval_model.py --type 'vox' --load_checkpoint --vis_freq 15
# Outputs stored at "visualizations/vox"
Visualizations

F1 @ 0.05 = 82.246 Question 2.1 Input | Prediction | GT

F1 @ 0.05 = 63.143 Question 2.1 Input | Prediction | GT

F1 @ 0.05 = 56.259 Question 2.1 Input | Prediction | GT

2.2. Image to point cloud (15 points)

Training

python train_model.py --type 'point' --batch_size 32 --num_workers 4 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20
Hyperparameters
lr=1e-5
batch_size=32
n_points=5000
max_iters=10000
Training Tensorboards

Q2.2 Losses

Question 2.2 Reconstructions

Evaluation

python eval_model.py --type 'point' --load_checkpoint --pred_gif_name 'q22_source' --pred_path 'outputs' --gt_gif_name 'q22_target' --gt_path 'outputs' --vis_freq 100
# Outputs stored at "visualizations/point"
Visualizations

F1 @ 0.05 = 99.9 Question 2.2 Input | Prediction | GT

F1 @ 0.05 = 99.169 Question 2.2 Input | Prediction | GT

F1 @ 0.05 = 96.449 Question 2.2 Input | Prediction | GT

2.3. Image to mesh (15 points)

Training

python train_model.py --type 'mesh' --batch_size 32 --num_workers 4 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20
Hyperparameters
lr=1e-5
batch_size=32
w_chamfer=1
w_smooth=0.1
max_iters=10000
Training Tensorboards

Q2.3 Losses

Question 2.3 Reconstructions

Evaluation

python eval_model.py --type 'mesh' --load_checkpoint --pred_gif_name 'q23_source' --pred_path 'outputs' --gt_gif_name 'q23_target' --gt_path 'outputs' --vis_freq 100
# Outputs stored at "visualizations/mesh"
Visualizations

F1 @ 0.05 = 99.449 Question 2.3 Input | Prediction | GT

F1 @ 0.05 = 98.515 Question 2.3 Input | Prediction | GT

F1 @ 0.05 = 97.952 Question 2.3 Input | Prediction | GT

2.4. Quantitative comparisons (10 points)

Voxel reconstructions are the least accurate of the three representations. This is primarily because voxels are limited by the resolution of the 3D grid, resulting in coarse reconstructions. Point clouds and meshes, on the other hand, have no such limitation and can represent surfaces at arbitrary resolution. They cannot, however, capture an arbitrary number of points on the object surface: these explicit representations are trained with a fixed number of surface points, which limits their performance to some extent. Mesh reconstructions are poorer than point cloud reconstructions because they are restricted by the topology of the initial mesh (an icosphere in our case), which may not be sufficient to capture the variety of topologies across 3D objects (such as slits in chairs). Thus, the quantitative F1 scores make sense intuitively: point cloud reconstructions are better than mesh reconstructions, which in turn are better than voxel reconstructions. (A sketch of how the F1 metric is computed follows the table below.)

| Representation | Average F1 Score | Hyper-Parameters |
| --- | --- | --- |
| Voxels | 63.786 | lr=1e-5, batch_size=32, max_iters=10000 |
| Point Cloud | 96.406 | lr=1e-5, batch_size=32, n_points=5000, max_iters=10000 |
| Mesh | 90.191 | lr=1e-5, batch_size=32, w_smooth=0.1, w_chamfer=1, max_iters=10000 |
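For reference, here is a rough sketch of how the F1 @ 0.05 metric reported above can be computed from point samples of the predicted and ground truth shapes (an illustration only; eval_model.py may differ in details):

```python
import torch

def f1_at_threshold(pred_points, gt_points, threshold=0.05):
    # pred_points: (N, 3), gt_points: (M, 3) samples from the two surfaces.
    dists = torch.cdist(pred_points, gt_points)                             # (N, M)
    precision = (dists.min(dim=1).values < threshold).float().mean() * 100  # pred hits GT
    recall = (dists.min(dim=0).values < threshold).float().mean() * 100     # GT hits pred
    return 2 * precision * recall / (precision + recall + 1e-8)
```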

2.5. Analyse effects of hyperparameter variations (10 points)

2.5.1 Image to Point Cloud

Analysis

In Section 2.2, we observed clusters in the point cloud predictions. This seemed like a waste of computation: many points pile up at a few locations, and only a subset of them is sufficient to capture the shape of the 3D object. So, I trained a point cloud reconstruction network that predicts 2048 points (instead of the 5000 points predicted earlier). Although the network still predicts some clusters in the visualizations below, the shape reconstructions are comparable to the earlier model, and the quantitative scores are similar with only a slight degradation in the metrics. This matches what I had expected.

Training
python train_model.py --type 'point' --batch_size 32 --num_workers 4 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20 --n_points 2048
Hyperparameters
lr=1e-5
batch_size=32
n_points=2048 # (Was 5000)
max_iters=10000
Evaluation
python eval_model.py --type point --load_checkpoint --vis_freq 100 --n_points 2048
# Outputs stored at "visualizations/point"

Question 2.5.1 Input | Prediction | GT

Question 2.5.1 Input | Prediction | GT

Question 2.5.1 Input | Prediction | GT

| Representation | Average F1 Score | Hyper-Parameters |
| --- | --- | --- |
| Point Cloud | 94.964 | lr=1e-5, batch_size=32, n_points=2048, max_iters=10000 |
| Point Cloud | 96.406 | lr=1e-5, batch_size=32, n_points=5000, max_iters=10000 |

2.5.2 Image to Mesh

Analysis

In Section 2.3, we observed that the mesh reconstructions were very sharp rather than smooth. To address this, I increased the weight of the Laplacian smoothing loss to 2 (from 0.1 earlier). Although the reconstructed meshes still have some pointed regions, they are smoother than those from the previous section. However, this comes at the cost of reconstruction accuracy: as seen in the results below (both qualitative and quantitative), the reconstruction accuracy takes a considerable hit.
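A minimal sketch of how the two mesh loss terms are balanced by the w_chamfer and w_smooth flags (the function and the sampling count here are illustrative):

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(pred_mesh, gt_points, w_chamfer=1.0, w_smooth=2.0):
    # gt_points: (B, M, 3) points sampled from the ground truth mesh.
    pred_points = sample_points_from_meshes(pred_mesh, 5000)
    loss_chamfer, _ = chamfer_distance(pred_points, gt_points)
    loss_smooth = mesh_laplacian_smoothing(pred_mesh)
    return w_chamfer * loss_chamfer + w_smooth * loss_smooth
```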

Training
python train_model.py --type 'mesh' --batch_size 32 --num_workers 4 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20 --w_smooth 2
Hyperparameters
lr=1e-5
batch_size=32
w_chamfer=1
w_smooth= 2 # (Was 0.1)
max_iters=10000
Evaluation
python eval_model.py --type 'mesh' --load_checkpoint --vis_freq 100
# Outputs stored at "visualizations/mesh"

Question 2.5.2 Input | Prediction | GT

Question 2.5.2 Input | Prediction | GT

Question 2.5.2 Input | Prediction | GT

| Representation | Average F1 Score | Hyper-Parameters |
| --- | --- | --- |
| Mesh | 74.069 | lr=1e-5, batch_size=32, w_smooth=2, w_chamfer=1, max_iters=10000 |
| Mesh | 90.191 | lr=1e-5, batch_size=32, w_smooth=0.1, w_chamfer=1, max_iters=10000 |

2.6. Interpret your model (15 points)

Visualization

To understand the network better, I generated interpolations of reconstructions between two input images. My hypothesis is that if the decoder has learned to effectively generate reconstructions from 2D image encodings, then it should also produce good reconstructions from interpolations of those encodings.

I take two random input images (from different object instances) and pass them through the ResNet encoder to obtain two image encodings. I then interpolate between the two encodings and generate a 3D reconstruction for each interpolated latent code. As the visualizations below show, the shape changes gradually as we traverse from one object to the other, which supports the hypothesis.
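A minimal sketch of this interpolation procedure; the encoder/decoder attribute names are assumptions for illustration, not the actual interpolate_model.py code:

```python
import torch

@torch.no_grad()
def interpolate_reconstructions(model, image_1, image_2, n_steps=10):
    # Encode both images, linearly interpolate the latent codes, and decode
    # each interpolated code into a 3D reconstruction.
    feat_1 = model.encoder(image_1)          # (1, D) image encoding
    feat_2 = model.encoder(image_2)
    reconstructions = []
    for alpha in torch.linspace(0.0, 1.0, n_steps):
        feat = (1 - alpha) * feat_1 + alpha * feat_2
        reconstructions.append(model.decoder(feat))
    return reconstructions
```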

2.6.1 Image to Voxel Grid Interpolations

python interpolate_model.py --type 'vox' --load_checkpoint

Question 2.6.1 Image 1 | Interpolations | Image 2

2.6.2 Image to Point Cloud Interpolations

python interpolate_model.py --type 'point' --load_checkpoint

Question 2.6.2 Image 1 | Interpolations | Image 2

2.6.3 Image to Mesh Interpolations

python interpolate_model.py --type 'mesh' --load_checkpoint

Question 2.6.3 Image 1 | Interpolations | Image 2

3. Exploring some recent architectures

3.1 Implicit Network

Architecture

I implemented a vanilla version of an implicit network (inspired by Occupancy Networks). The decoder is a simple stack of fully connected layers that predicts the occupancy of a 3D location in space, conditioned on the input image. The image encoding and the query points are each mapped to a common dimension of 128 by fully connected layers and then added together to form a unified embedding, which is passed through a fully connected decoder to predict the final occupancy. Although the original paper used conditional batch norm to condition on the image features, for the purpose of this assignment I chose to simply add the image and point encodings. The reconstructed voxels are faithful to the shape; however, I observe that they are somewhat misaligned with the ground truth canonical meshes. It would be interesting to understand the root cause of this issue; it might be due to the simplification I made in the network.
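A hypothetical sketch of the decoder described above, assuming a 512-d ResNet image encoding (layer widths and depth are illustrative):

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    def __init__(self, image_feat_dim=512, hidden=128):
        super().__init__()
        self.image_fc = nn.Linear(image_feat_dim, hidden)    # image encoding -> 128-d
        self.point_fc = nn.Linear(3, hidden)                 # 3D query point -> 128-d
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                            # occupancy logit per point
        )

    def forward(self, image_feat, points):
        # image_feat: (B, image_feat_dim), points: (B, N, 3)
        img = self.image_fc(image_feat).unsqueeze(1)         # (B, 1, 128)
        pts = self.point_fc(points)                          # (B, N, 128)
        return self.decoder(img + pts).squeeze(-1)           # (B, N) occupancy logits
```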

Training
python train_implicit.py --batch_size 64 --num_workers 12 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20 --n_points 10000
Hyperparameters
lr=1e-5
batch_size=64
n_points=10000 # (Number of 3D points sampled per iteration for the cross-entropy loss)
max_iters=10000
Tensorboard Logs

3.1 Tensorboard Logs

Evaluation
python eval_implicit.py --load_checkpoint --vis_freq 100
# Outputs stored at "visualizations/implicit"

Question 3.1 Input | Prediction | GT

Question 3.1 Input | Prediction | GT

Question 3.1 Input | Prediction | GT

| Representation | Average F1 Score |
| --- | --- |
| Implicit Voxels | 52.049 |

3.2 Parametric Network

Architecture

I implemented a vanilla version of a parametric network (inspired by AtlasNet). The architecture is very simple and similar to the implicit network of Sec 3.1. The only difference is that instead of taking points in 3D space and predicting their occupancy, we now sample random points from a 2D surface and the network predicts their corresponding 3D locations, resulting in a point cloud representation. Note that for this assignment I implemented a simplified version of AtlasNet with only a single decoder; hence the output looks somewhat like a folded sheet of paper. Still, the qualitative results look impressive to me. With this simple change, we can predict point clouds of arbitrary resolution, and the points are also sampled more uniformly. I feel this is a great improvement over the explicit point cloud representation trained in Section 2.2. It could be improved further by training multiple independent decoders that capture different aspects of the topology. I particularly like that point clouds of arbitrary resolution can be predicted: for the same trained network, sampling more points yields a jump in both the quantitative metrics and the qualitative reconstructions.
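A hypothetical sketch of the single-decoder parametric head described above (layer widths are illustrative). Because the 2D samples are drawn at inference time, n_points can be increased freely, which is how the 8192-point results below are produced:

```python
import torch
import torch.nn as nn

class ParametricDecoder(nn.Module):
    def __init__(self, image_feat_dim=512, hidden=128):
        super().__init__()
        self.image_fc = nn.Linear(image_feat_dim, hidden)    # image encoding -> 128-d
        self.uv_fc = nn.Linear(2, hidden)                    # 2D sample -> 128-d
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),                            # 3D location per 2D sample
        )

    def forward(self, image_feat, n_points=2048):
        # image_feat: (B, image_feat_dim); returns a (B, n_points, 3) point cloud.
        uv = torch.rand(image_feat.shape[0], n_points, 2, device=image_feat.device)
        img = self.image_fc(image_feat).unsqueeze(1)         # (B, 1, 128)
        return self.decoder(img + self.uv_fc(uv))
```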

Training
python train_parametric.py --batch_size 64 --num_workers 12 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20 --n_points 10000
Hyperparameters
lr=1e-5
batch_size=64
n_points=2048 # (Number of points to sample from 2d surface)
max_iters=10000
Tensorboard Logs

3.2 Tensorboard Logs

Evaluation
python eval_parametric.py --load_checkpoint --vis_freq 100
# Outputs stored at "visualizations/parametric"
2048 Points Visualizations

Question 3.2 Input | Prediction | GT

Question 3.2 Input | Prediction | GT

Question 3.2 Input | Prediction | GT

8192 Points Visualizations

Question 3.2 Input | Prediction | GT

Question 3.2 Input | Prediction | GT

Question 3.2 Input | Prediction | GT

| Representation | Average F1 Score | Hyper-Parameters |
| --- | --- | --- |
| Parametric Point Clouds | 83.495 | lr=1e-5, batch_size=64, n_points=2048, max_iters=10000 |
| Parametric Point Clouds | 86.024 | lr=1e-5, batch_size=64, n_points=8192, max_iters=10000 |