Learning for 3D Vision - 16889 [Assignment 2]
NAME: Mayank Agarwal
ANDREW ID: mayankag
LATE DAYS USED:
1. Exploring Loss Functions
1.1. Fitting a voxel grid (5 points)
Visualize the optimized voxel grid alongside the ground truth voxel grid using the tools learned in the previous section.
python fit_data.py --type 'vox'
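For reference, a minimal sketch of the voxel fitting objective optimized here, assuming the prediction is a raw-logit tensor with the same shape as the ground-truth occupancy grid (names are illustrative, not the exact `fit_data.py` code):

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_logits, gt_voxels):
    # Binary cross-entropy between predicted occupancy logits and the
    # ground-truth {0, 1} occupancy grid, averaged over all voxels.
    return F.binary_cross_entropy_with_logits(pred_logits, gt_voxels.float())
```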
1.2. Fitting a point cloud (10 points)
Visualize the optimized point cloud alongside the ground truth point cloud using the tools learned in the previous section.
python fit_data.py --type 'point'
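The point cloud is fit by minimizing a symmetric Chamfer distance. A minimal sketch using PyTorch3D's nearest-neighbor utility (names are illustrative, not the exact `fit_data.py` code):

```python
import torch
from pytorch3d.ops import knn_points

def chamfer_loss(pred, gt):
    # pred: (B, N, 3), gt: (B, M, 3) point clouds.
    # Squared distance from each predicted point to its nearest GT point, and vice versa.
    d_pred_to_gt = knn_points(pred, gt, K=1).dists[..., 0]  # (B, N)
    d_gt_to_pred = knn_points(gt, pred, K=1).dists[..., 0]  # (B, M)
    return d_pred_to_gt.mean() + d_gt_to_pred.mean()
```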
1.3. Fitting a mesh (5 points)
Visualize the optimized mesh alongside the ground truth mesh using the tools learned in the previous section.
python fit_data.py --type 'mesh'
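The mesh is fit by sampling points from the predicted and target meshes and comparing them with the same Chamfer distance, optionally with a Laplacian smoothness term. A rough sketch assuming PyTorch3D `Meshes` objects (names and weights are illustrative):

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_fit_loss(pred_mesh, gt_mesh, w_smooth=0.1, n_samples=5000):
    # Sample points on both surfaces and compare them with the Chamfer distance;
    # the Laplacian term regularizes the deformed vertices to stay smooth.
    pred_pts = sample_points_from_meshes(pred_mesh, n_samples)
    gt_pts = sample_points_from_meshes(gt_mesh, n_samples)
    loss_chamfer, _ = chamfer_distance(pred_pts, gt_pts)
    return loss_chamfer + w_smooth * mesh_laplacian_smoothing(pred_mesh)
```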
2. Reconstructing 3D from single view
2.1. Image to voxel grid (15 points)
Training
python train_model.py --type 'vox' --batch_size 32 --num_workers 4 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20
# Checkpoints stored at "checkpoints/vox.pth"
Hyperparameters
lr=1e-5
batch_size=32
max_iters=10000
TensorBoard Visualizations
Evaluation
python eval_model.py --type 'vox' --load_checkpoint --vis_freq 15
# Outputs stored at "visualizations/vox"
Visualizations
2.2. Image to point cloud (15 points)
Training
python train_model.py --type 'point' --batch_size 32 --num_workers 4 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20
Hyperparameters
lr=1e-5
batch_size=32
n_points=5000
max_iters=10000
Training TensorBoard Logs
Evaluation
python eval_model.py --type 'point' --load_checkpoint --pred_gif_name 'q22_source' --pred_path 'outputs' --gt_gif_name 'q22_target' --gt_path 'outputs' --vis_freq 100
# Outputs stored at "visualizations/point"
Visualizations
2.3. Image to mesh (15 points)
Training
python train_model.py --type 'mesh' --batch_size 32 --num_workers 4 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20
Hyperparameters
lr=1e-5
batch_size=32
w_chamfer=1
w_smooth=0.1
max_iters=10000
Training TensorBoard Logs
Evaluation
python eval_model.py --type 'mesh' --load_checkpoint --pred_gif_name 'q23_source' --pred_path 'outputs' --gt_gif_name 'q23_target' --gt_path 'outputs' --vis_freq 100
# Outputs stored at "visualizations/mesh"
Visualizations
2.4. Quantitative comparisons (10 points)
Voxel reconstructions are the least accurate of the three representations, primarily because they are limited by the resolution of the 3D grid, which results in coarse reconstructions. Point clouds and meshes have no such grid limitation and can represent surfaces at arbitrary resolution. However, as explicit representations they are trained to predict a fixed number of points on the object surface, which limits their performance to some extent. Mesh reconstructions are poorer than point cloud reconstructions because they are restricted by the topology of the initial mesh (an icosphere in our case), which may not capture the variety of topologies across unique 3D objects (such as slits in chairs). The quantitative F1 scores therefore match intuition: point cloud reconstructions are better than mesh reconstructions, which in turn are better than voxel reconstructions.
Representation | Average F1 Score | Hyper-Parameters |
---|---|---|
Voxels | 63.786 | lr=1e-5, batch_size=32, max_iters=10000 |
Point Cloud | 96.406 | lr=1e-5, batch_size=32, n_points=5000, max_iters=10000 |
Mesh | 90.191 | lr=1e-5, batch_size=32, w_smooth=0.1, w_chamfer=1, max_iters=10000 |
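For reference, the F1 scores above are computed at a fixed distance threshold between sampled predicted and ground-truth surface points. A minimal sketch of the metric, assumed to be reported as a percentage with a hypothetical threshold argument (not the exact eval script code):

```python
import torch
from pytorch3d.ops import knn_points

def f1_score(pred_pts, gt_pts, thresh=0.05):
    # pred_pts: (1, N, 3), gt_pts: (1, M, 3) sampled surface points.
    d_pred = knn_points(pred_pts, gt_pts, K=1).dists[..., 0].sqrt()  # dist to nearest GT point
    d_gt = knn_points(gt_pts, pred_pts, K=1).dists[..., 0].sqrt()    # dist to nearest prediction
    precision = 100.0 * (d_pred < thresh).float().mean()  # predicted points near the GT surface
    recall = 100.0 * (d_gt < thresh).float().mean()       # GT points covered by the prediction
    return 2 * precision * recall / (precision + recall + 1e-8)
```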
2.5. Analyze effects of hyperparameter variations (10 points)
2.5.1 Image to Point Cloud
Analysis
In the point cloud results earlier (Section 2.2), we observed clusters in the point cloud predictions. This seemed like a waste of computation: many points were concentrated at a few locations, and only a subset was needed to capture the shape of the 3D object. So I trained a point cloud reconstruction network that predicts 2048 points instead of the 5000 predicted earlier. Although the network still predicts some clusters in the visualizations below, the shape reconstructions are comparable to those of the earlier model, and the quantitative scores are very similar, with only a slight degradation in metrics. This matches what I had expected.
Training
python train_model.py --type 'point' --batch_size 32 --num_workers 4 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20 --n_points 2048
Hyperparameters
lr=1e-5
batch_size=32
n_points=2048 # (Was 5000)
max_iters=10000
Evaluation
python eval_model.py --type point --load_checkpoint --vis_freq 100 --n_points 2048
# Outputs stored at "visualizations/point"
Representation | Average F1 Score | Hyper-Parameters |
---|---|---|
Point Cloud | 94.964 | lr=1e-5, batch_size=32, n_points=2048, max_iters=10000 |
Point Cloud | 96.406 | lr=1e-5, batch_size=32, n_points=5000, max_iters=10000 |
2.5.2 Image to Mesh
Analysis
In the mesh results earlier (Section 2.3), we observed that the mesh reconstructions had many sharp, spiky regions rather than smooth surfaces. To address this, I increased the weight of the Laplacian smoothing loss to 2 (from 0.1 earlier). Although the reconstructed meshes still have some pointed regions, they are smoother than those from the previous section. However, this comes at the cost of reconstruction accuracy: as the qualitative and quantitative results below show, accuracy takes a considerable hit.
Training
python train_model.py --type 'mesh' --batch_size 32 --num_workers 4 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20 --w_smooth 2
Hyperparameters
lr=1e-5
batch_size=32
w_chamfer=1
w_smooth=2 # (Was 0.1)
max_iters=10000
Evaluation
python eval_model.py --type 'mesh' --load_checkpoint --vis_freq 100
# Outputs stored at "visualizations/mesh"
Representation | Average F1 Score | Hyper-Parameters |
---|---|---|
Mesh | 74.069 | lr=1e-5, batch_size=32, w_smooth=2, w_chamfer=1, max_iters=10000 |
Mesh | 90.191 | lr=1e-5, batch_size=32, w_smooth=0.1, w_chamfer=1, max_iters=10000 |
2.6. Interpret your model (15 points)
Visualization
To better understand the network, I generated interpolations of reconstructions between two input images. My hypothesis is that if the decoder has learned to effectively generate reconstructions from 2D image encodings, it should also generate good reconstructions from interpolations of those encodings.
I take two random input images (from different object instances) and pass them through the ResNet encoder to obtain two image encodings. I then interpolate between the two encodings and generate a 3D reconstruction for each interpolated latent code. As the visualizations below show, the shape changes gradually as we traverse from one object to the other, which supports the hypothesis.
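A minimal sketch of the interpolation procedure; attribute names such as `model.encoder` and `model.decoder` are illustrative, not the exact `interpolate_model.py` code:

```python
import torch

@torch.no_grad()
def interpolate_reconstructions(model, img_a, img_b, n_steps=8):
    # Encode both images with the ResNet encoder, linearly interpolate
    # the latent codes, and decode each interpolated code into a 3D shape.
    z_a = model.encoder(img_a)
    z_b = model.encoder(img_b)
    shapes = []
    for alpha in torch.linspace(0.0, 1.0, n_steps):
        z = (1 - alpha) * z_a + alpha * z_b
        shapes.append(model.decoder(z))
    return shapes
```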
2.6.1 Image to Voxel Grid Interpolations
python interpolate_model.py --type 'vox' --load_checkpoint
2.6.2 Image to Point Cloud Interpolations
python interpolate_model.py --type 'point' --load_checkpoint
2.6.3 Image to Mesh Interpolations
python interpolate_model.py --type 'mesh' --load_checkpoint
3. Exploring some recent architectures
3.1 Implicit Network
Architecture
I implemented a vanilla version of an implicit network (inspired by Occupancy Networks). The decoder is a simple stack of fully connected layers that predicts the occupancy of a 3D location, conditioned on the input image. The image encoding and the query points are each mapped to a common dimension of 128 by fully connected layers, added together to form a unified embedding, and passed through a fully connected decoder to predict the final occupancy. Although the original paper uses conditional batch norm to condition on the image features, for this assignment I simply add the image and point encodings. The reconstructed voxels are faithful to the shape, but I observe that they are somewhat misaligned with the ground truth canonical meshes. It would be interesting to understand the root cause of this issue; it might be due to the simplification I made in the network.
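A rough sketch of the decoder described above; layer sizes other than the shared 128-dim embedding are illustrative assumptions, not the exact network used:

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    def __init__(self, img_feat_dim=512, hidden=128):
        super().__init__()
        self.img_fc = nn.Linear(img_feat_dim, hidden)   # image encoding -> 128-d
        self.point_fc = nn.Linear(3, hidden)            # (x, y, z) query -> 128-d
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                       # occupancy logit per point
        )

    def forward(self, img_feat, points):
        # img_feat: (B, img_feat_dim), points: (B, N, 3)
        z = self.img_fc(img_feat).unsqueeze(1)          # (B, 1, 128)
        p = self.point_fc(points)                       # (B, N, 128)
        # Add the two embeddings and decode to per-point occupancy logits,
        # trained with binary cross-entropy at the sampled 3D locations.
        return self.decoder(z + p).squeeze(-1)          # (B, N)
```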
Training
python train_implicit.py --batch_size 64 --num_workers 12 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20 --n_points 10000
Hyperparameters
lr=1e-5
batch_size=64
n_points=10000 # (Number of 3D query points used for the cross-entropy loss per iteration)
max_iters=10000
TensorBoard Logs
Evaluation
python eval_implicit.py --load_checkpoint --vis_freq 100
# Outputs stored at "visualizations/implicit"
Representation | Average F1 Score |
---|---|
Implicit Voxels | 52.049 |
3.2 Parametric Network
Architecture
I implemented a vanilla version of a parametric network (inspired by AtlasNet). The architecture is very simple and similar to the implicit network in Sec. 3.1. The only difference is that instead of taking points in 3D space and predicting their occupancy, the network samples random points on a 2D surface and predicts their corresponding 3D locations, yielding a point cloud representation. Note that for this assignment I implemented a simplified version of AtlasNet with only a single decoder, so the output looks somewhat like a folded sheet of paper. Still, the qualitative results looked impressive to me. With this simple change we can predict point clouds of arbitrary resolution, and the points are also more uniformly sampled, which I feel is a clear improvement over the explicit point cloud representation trained in Section 2.2. Results could be improved further by training multiple independent decoders that capture different aspects of the topology. I particularly like that the same trained network can predict point clouds of arbitrary resolution: by sampling more points at evaluation time, both the quantitative metrics and the qualitative reconstructions improve, as shown below.
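A rough sketch of the single-decoder variant described above; layer sizes other than the shared 128-dim embedding are illustrative assumptions, not the exact network used:

```python
import torch
import torch.nn as nn

class ParametricDecoder(nn.Module):
    def __init__(self, img_feat_dim=512, hidden=128):
        super().__init__()
        self.img_fc = nn.Linear(img_feat_dim, hidden)   # image encoding -> 128-d
        self.uv_fc = nn.Linear(2, hidden)               # 2D surface sample -> 128-d
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),                       # 3D location per sample
        )

    def forward(self, img_feat, n_points):
        # Sample random 2D points on the unit square and map each to a 3D point,
        # conditioned on the image; n_points can be changed freely at test time.
        uv = torch.rand(img_feat.shape[0], n_points, 2, device=img_feat.device)
        z = self.img_fc(img_feat).unsqueeze(1)          # (B, 1, 128)
        return self.decoder(z + self.uv_fc(uv))         # (B, n_points, 3) point cloud
```

Because the 2D samples are drawn at inference time, the same trained weights can produce 2048 or 8192 points per shape, which is how the two rows in the results table below were obtained.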
Training
python train_parametric.py --batch_size 64 --num_workers 12 --save_freq 100 --lr 1e-5 --vis_freq 20 --log_freq 20 --n_points 10000
Hyperparameters
lr=1e-5
batch_size=64
n_points=2048 # (Number of points sampled from the 2D surface)
max_iters=10000
TensorBoard Logs
Evaluation
python eval_parametric.py --load_checkpoint --vis_freq 100
# Outputs stored at "visualizations/parametric"
2048 Points Visualizations
8192 Points Visualizations
Representation | Average F1 Score | Hyper-Parameters |
---|---|---|
Parametric Point Clouds | 83.495 | lr=1e-5, batch_size=64, n_points=2048, max_iters=10000 |
Parametric Point Clouds | 86.024 | lr=1e-5, batch_size=64, n_points=8192, max_iters=10000 |