16-889 Assignment 2: Single View to 3D

Name: Ayush Pandey
Andrew ID: ayushp

1. Exploring loss functions

This section involves defining loss functions for fitting voxels, point clouds, and meshes.

1.1. Fitting a voxel grid (5 points)

import torch

def voxel_loss(voxel_src, voxel_tgt):
	# voxel_src holds raw logits and voxel_tgt holds binary occupancies,
	# so binary cross-entropy with logits is the natural choice.
	criterion = torch.nn.BCEWithLogitsLoss()
	prob_loss = criterion(voxel_src, voxel_tgt)

	return prob_loss
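
As a sanity check, a minimal fitting loop might look like the sketch below (the shapes, learning rate, and iteration count are hypothetical; fit_data.py supplies the real source and target grids):

import torch

# Toy setup: learnable logits fitted to a random binary target grid.
voxel_src = torch.randn(1, 32, 32, 32, requires_grad=True)
voxel_tgt = (torch.rand(1, 32, 32, 32) > 0.5).float()
optimizer = torch.optim.Adam([voxel_src], lr=1e-2)
for _ in range(1000):
    optimizer.zero_grad()
    loss = voxel_loss(voxel_src, voxel_tgt)
    loss.backward()
    optimizer.step()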

You can run

python main.py --question_number q1_1

or

python fit_data.py --type vox
(Figure: optimized voxel grid vs. ground truth)

1.2. Fitting a point cloud (10 points)

import torch
from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
	# Squared distance from each point to its nearest neighbor in the other cloud.
	src_dists, _, _ = knn_points(point_cloud_src, point_cloud_tgt, K=1)
	tgt_dists, _, _ = knn_points(point_cloud_tgt, point_cloud_src, K=1)

	# Sum each direction separately: the two clouds may have different point
	# counts, so elementwise addition of the distance tensors would not line up.
	loss_chamfer = torch.sum(src_dists) + torch.sum(tgt_dists)

	return loss_chamfer
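
Since knn_points returns squared Euclidean distances to the nearest neighbor, this implements the two-sided (squared-distance) Chamfer loss:

    d(S, T) = sum_{x in S} min_{y in T} ||x - y||^2  +  sum_{y in T} min_{x in S} ||x - y||^2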

You can run

python main.py --question_number q1_2

or

python fit_data.py --type point
(Figure: optimized point cloud vs. ground truth)

1.3. Fitting a mesh (5 points)

The chamfer loss is the same as in 1.2 above. The smoothness term penalizes the mesh Laplacian, encouraging each vertex to stay close to the centroid of its neighbors:

from pytorch3d.loss import mesh_laplacian_smoothing

def smoothness_loss(mesh_src):
	# Uniform Laplacian smoothing regularizer over the mesh vertices.
	loss_laplacian = mesh_laplacian_smoothing(mesh_src)
	return loss_laplacian
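
Putting the two terms together, the mesh is fitted by sampling points from the source and target meshes and minimizing a weighted objective. A sketch, assuming fit_data.py provides mesh_src, mesh_tgt, and a weight w_smooth (the sample count is illustrative):

from pytorch3d.ops import sample_points_from_meshes

# Sample point clouds from both surfaces, then compare them with the Chamfer
# loss while regularizing the source mesh toward smoothness.
points_src = sample_points_from_meshes(mesh_src, num_samples=5000)
points_tgt = sample_points_from_meshes(mesh_tgt, num_samples=5000)
loss = chamfer_loss(points_src, points_tgt) + w_smooth * smoothness_loss(mesh_src)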

You can run

python main.py --question_number q1_3

or

python fit_data.py --type mesh
(Figure: optimized mesh vs. ground truth)

2. Reconstructing 3D from single view

This section involves training a single-view-to-3D pipeline for voxels, point clouds, and meshes. Refer to the save_freq argument in train_model.py to control how often model checkpoints are saved.

2.1. Image to voxel grid (15 points)

Decoder Definition

if args.type == "vox":
    # MLP decoder: 512-d image feature -> 32*32*32 = 32768 occupancy logits.
    self.decoder = nn.Sequential(nn.Flatten(),
                                 nn.Linear(512, 2048),
                                 nn.ReLU(),
                                 nn.Linear(2048, 8192),
                                 nn.ReLU(),
                                 nn.Linear(8192, 32768))

Decoder call

if args.type == "vox":
    voxels_pred = self.decoder(encoded_feat)
    # Reshape the flat logits into a (B, 1, 32, 32, 32) occupancy grid.
    voxels_pred = torch.reshape(voxels_pred, (voxels_pred.shape[0], 1, 32, 32, 32))
    return voxels_pred
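
During training these logits are supervised with the same binary cross-entropy from section 1.1 against the ground-truth occupancy grid. A one-line sketch, where voxels_gt names the batch's ground-truth grid:

# Same objective as voxel_loss in 1.1, applied to the decoder output.
loss = torch.nn.functional.binary_cross_entropy_with_logits(voxels_pred, voxels_gt)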

You can run

python main.py --question_number q2_1

or

python train_model.py --type vox --batch_size 64 --num_workers 4 --save_freq 1000 --max_iter 10000
python eval_model.py --type vox --batch_size 64 --num_workers 4

F1 Score: 90.78

(Figures: predicted voxel grid, ground-truth voxel grid, and ground-truth input image, for three test examples.)

2.2. Image to point cloud (15 points)

Decoder Definition

elif args.type == "point":
    self.n_point = args.n_points
    # MLP decoder: 512-d image feature -> n_points * 3 coordinates.
    self.decoder = nn.Sequential(nn.Flatten(),
                                 nn.Linear(512, 1024),
                                 nn.ReLU(),
                                 nn.Linear(1024, 4096),
                                 nn.ReLU(),
                                 nn.Linear(4096, self.n_point * 3))

Decoder call

elif args.type == "point":
    pointclouds_pred = self.decoder(encoded_feat)
    # Reshape the flat output into (B, n_points, 3) xyz coordinates.
    pointclouds_pred = torch.reshape(pointclouds_pred, (pointclouds_pred.shape[0], self.n_point, 3))

    return pointclouds_pred
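
The point branch is trained with the Chamfer loss from section 1.2 against points sampled from the ground-truth mesh. A sketch, where mesh_gt names the batch's ground-truth meshes:

from pytorch3d.ops import sample_points_from_meshes

# Sample the same number of points from the ground-truth surface, then
# compare the two clouds with the bidirectional Chamfer loss.
pointclouds_gt = sample_points_from_meshes(mesh_gt, num_samples=args.n_points)
loss = chamfer_loss(pointclouds_pred, pointclouds_gt)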

You can run

python main.py --question_number q2_2

or

python train_model.py --type point --batch_size 64 --num_workers 4 --save_freq 1000 --max_iter 10000
python eval_model.py --type point --batch_size 64 --num_workers 4

F1 Score: 96.352

(Figures: predicted point cloud, ground-truth point cloud, and ground-truth input image, for three test examples.)

2.3. Image to mesh (15 points)

Decoder Definition

elif args.type == "mesh":
    # Initialize with an ico-sphere template (level 4 -> 2562 vertices);
    # other mesh initializations are worth trying.
    mesh_pred = ico_sphere(4, 'cuda')
    # Replicate the template once per batch element.
    self.mesh_pred = pytorch3d.structures.Meshes(mesh_pred.verts_list()*args.batch_size, mesh_pred.faces_list()*args.batch_size)

    # MLP decoder: 512-d image feature -> one xyz offset per template vertex.
    self.decoder = nn.Sequential(nn.Flatten(),
                                 nn.Linear(512, 1024),
                                 nn.ReLU(),
                                 nn.Linear(1024, self.mesh_pred.verts_list()[0].shape[0] * 3))

Decoder call

elif args.type == "mesh":
    deform_vertices_pred = self.decoder(encoded_feat)
    # offset_verts expects packed (total_verts, 3) offsets, so flatten the batch.
    mesh_pred = self.mesh_pred.offset_verts(deform_vertices_pred.reshape([-1, 3]))
    return mesh_pred
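
Training the mesh branch combines both section 1.3 terms on the deformed template. A sketch, where mesh_gt and the w_chamfer/w_smooth weights are assumptions about how train_model.py wires the loss:

from pytorch3d.ops import sample_points_from_meshes

# Compare points sampled from the predicted and ground-truth surfaces,
# while regularizing the predicted mesh toward smoothness.
sample_pred = sample_points_from_meshes(mesh_pred, num_samples=5000)
sample_gt = sample_points_from_meshes(mesh_gt, num_samples=5000)
loss = args.w_chamfer * chamfer_loss(sample_pred, sample_gt) + args.w_smooth * smoothness_loss(mesh_pred)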

You can run

python main.py --question_number q2_3

or

python train_model.py --type mesh --batch_size 64 --num_workers 4 --save_freq 1000 --max_iter 10000
python eval_model.py --type mesh --batch_size 64 --num_workers 4

F1 Score: 96.327

(Figures: predicted mesh, ground-truth mesh, and ground-truth input image, for three test examples.)

2.4. Quantitative comparisons (10 points)

3D Representation | F1 Score | Training Iteration
------------------|----------|-------------------
Mesh              | 96.327   | 10000
Points            | 96.352   | 2500
Voxels            | 90.78    | 6000

We can see that the mesh and point representations achieve higher F1 scores than voxels. The point and mesh pipelines directly optimize output points to match the sampled ground-truth points as closely as possible. Voxels, by contrast, are limited to representing the object on a 32 x 32 x 32 grid and only predict whether each cell is inside or outside the object, yielding a coarse approximation of the shape. Since marching cubes then extracts a mesh from this coarse grid, the points sampled from the resulting surface can end up far from the sampled ground-truth points, depending on the voxel resolution.
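
For reference, the F1 score reported above can be computed per example from bidirectional nearest-neighbor distances. A hedged sketch (the 0.05 threshold is an assumption; eval_model.py holds the graded version):

import torch
from pytorch3d.ops import knn_points

def f1_score(points_pred, points_gt, threshold=0.05):
    # Euclidean nearest-neighbor distances in both directions.
    dist_pred = knn_points(points_pred, points_gt, K=1).dists.squeeze(-1).sqrt()
    dist_gt = knn_points(points_gt, points_pred, K=1).dists.squeeze(-1).sqrt()
    precision = 100.0 * (dist_pred < threshold).float().mean()  # pred points near GT
    recall = 100.0 * (dist_gt < threshold).float().mean()       # GT points near pred
    return 2.0 * precision * recall / (precision + recall + 1e-8)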

2.5. Analyse effects of hyperparameter variations (10 points)

The hyperparameter I chose to tune was w_smooth. I set it to 1000 because loss_chamfer was roughly 1000 times larger than loss_smooth, so the chamfer term dominated the optimization. With w_smooth = 1000 the network places comparable emphasis on loss_smooth, which leads to much smoother mesh outputs.
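
Concretely, with illustrative magnitudes (not measured values):

# Hypothetical loss magnitudes, for illustration only.
loss_chamfer, loss_smooth = 1.0, 0.001
w_smooth = 1000.0
loss = loss_chamfer + w_smooth * loss_smooth  # both terms now contribute ~1.0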

You can run

python main.py --question_number q2_5

or

python train_model.py --type mesh --w_smooth 1000 --batch_size 64 --num_workers 4 --save_freq 1000 --max_iter 10000
python eval_model.py --type mesh --batch_size 64 --num_workers 4
(Figures: predicted mesh, ground-truth mesh, and ground-truth input image, for three test examples.)

2.6. Interpret your model (15 points)

All the models follow an encoder-decoder architecture: the encoder compresses the image into a latent vector, and the decoder uses that latent information to generate the 3D output. Since the decoder has no access to multiple viewpoints, it reconstructs the 3D representation from the prior it has learned over the training data. For point clouds, the decoder does not simply memorize the data; it learns a continuous latent space in which you can move smoothly from one kind of chair to another by interpolating between their latent vectors. Here are some examples showing the smooth transition from the chair at the top left to the chair at the bottom right (2nd and 4th examples of the test set):

python main.py --question_number q2_6

And here are some more examples of the transition from the 4th example to the 649th example.
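
The interpolation itself can be sketched as follows, assuming the point model of section 2.2 exposes an encoder producing encoded_feat alongside the decoder shown above (model, image_a, and image_b are hypothetical names for the trained model and two test images):

import torch

with torch.no_grad():
    feat_a = model.encoder(image_a)  # latent vector of the first chair
    feat_b = model.encoder(image_b)  # latent vector of the second chair
    for alpha in torch.linspace(0.0, 1.0, steps=10):
        # Linearly interpolate in latent space, then decode to a point cloud.
        feat = (1 - alpha) * feat_a + alpha * feat_b
        points = model.decoder(feat).reshape(-1, model.n_point, 3)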

3. (Extra Credit) Exploring some recent architectures.

3.1. Implicit network (10 points)

Implement an implicit decoder that takes 3D locations as input and outputs their occupancy values. Some papers for inspiration: [1, 2].

3.2. Parametric network (10 points)

Implement a parametric function that takes sampled 2D points as input and outputs their respective 3D points. Some papers for inspiration: [1, 2].