
16-889 Assignment 2: Single View to 3D

Note: Used 5 late days for this assignment.

Goals: This assignment explores loss functions and decoder architectures for regressing to voxel, point cloud, and mesh representations from single-view RGB input.

1. Exploring loss functions

This section involves defining the loss functions used for fitting voxels, point clouds, and meshes.

1.1. Fitting a voxel grid (5 points)

Run python main.py fit_data --type 'vox' to fit the source voxel grid to the target voxel grid.

Voxel Loss
import torch

def voxel_loss(voxel_src, voxel_tgt):
    # voxel_src contains raw logits; BCEWithLogitsLoss applies the sigmoid internally.
    loss_fn = torch.nn.BCEWithLogitsLoss()
    prob_loss = loss_fn(voxel_src, voxel_tgt)
    return prob_loss
| Result Voxel Grid | GT Voxel Grid |
| --- | --- |
| q1.1 result | q1.1 GT |

1.2. Fitting a point cloud (10 points)

Run python main.py fit_data --type 'point' to fit the source point cloud to the target point cloud.

Chamfer Loss
import torch
from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # Nearest-neighbour (K=1) squared distances in both directions.
    dist1 = knn_points(point_cloud_src, point_cloud_tgt)
    dist2 = knn_points(point_cloud_tgt, point_cloud_src)

    loss_chamfer = torch.sum(dist1.dists) + torch.sum(dist2.dists)

    return loss_chamfer
| Result Pointcloud | GT Pointcloud |
| --- | --- |
| q1.2 result | q1.2 GT |

1.3. Fitting a mesh (5 points)

Run python main.py fit_data --type 'mesh' to fit the source mesh to the target mesh.

Smoothness Loss
import torch

def smoothness_loss(mesh_src):
    # Laplacian smoothness: squared norm of L @ V, where L is the packed mesh Laplacian.
    V = mesh_src.verts_packed()
    L = mesh_src.laplacian_packed()

    loss_laplacian = torch.square(torch.norm(torch.matmul(L, V)))

    return loss_laplacian
| Result Mesh | GT Mesh |
| --- | --- |
| q1.3 result | q1.3 GT |

2. Reconstructing 3D from single view

This section involves training a single-view-to-3D pipeline for voxels, point clouds and meshes.

2.1. Image to voxel grid (15 points)

Voxel Decoder
# ...
self.decoder = torch.nn.Sequential(
    nn.Linear(512, 1024),
    nn.PReLU(),
    nn.Linear(1024, 32 * 32 * 32)  # raw occupancy logits for the 32^3 grid
)
# ...
def forward(self, images, args):
    images_normalize = self.normalize(images.permute(0, 3, 1, 2))
    encoded_feat = self.encoder(images_normalize).squeeze(-1).squeeze(-1)

    voxels_pred = self.decoder(encoded_feat)
    voxels_pred = torch.reshape(voxels_pred, (-1, 1, 32, 32, 32))

    return voxels_pred
Run Command
# to train
python3 main.py train_model --type 'vox'
# to eval
python3 main.py eval_model --type 'vox' --load_checkpoint
| # | Input Image | GT Mesh | Predicted Voxel Grid |
| --- | --- | --- | --- |
| 1 | q2.1.1 image | q2.1.1 GT | q2.1.1 result |
| 2 | q2.1.2 image | q2.1.1 GT | q2.1.1 result |
| 3 | q2.1.3 image | q2.1.1 GT | q2.1.1 result |

2.2. Image to point cloud (15 points)

Pointcloud Decoder
# ...
self.decoder = torch.nn.Sequential(
    nn.Linear(512, args.n_points),
    nn.PReLU(),
    nn.Linear(args.n_points, args.n_points),
    nn.PReLU(),
    nn.Linear(args.n_points, 3 * args.n_points),
    nn.PReLU()
)
# ...
def forward(self, images, args):
    images_normalize = self.normalize(images.permute(0, 3, 1, 2))
    encoded_feat = self.encoder(images_normalize).squeeze(-1).squeeze(-1)

    decoded_features = self.decoder(encoded_feat)
    # (B, 3 * n_points) -> (B, n_points, 3)
    pointclouds_pred = torch.reshape(decoded_features, (-1, args.n_points, 3))

    return pointclouds_pred
Run Command
# to train
python3 main.py train_model --type 'point'
# to eval
python3 main.py eval_model --type 'point' --load_checkpoint
| # | Input Image | GT Mesh | Predicted Pointclouds |
| --- | --- | --- | --- |
| 1 | q2.1.1 image | q2.1.1 GT | q2.1.1 result |
| 2 | q2.1.2 image | q2.1.1 GT | q2.1.1 result |
| 3 | q2.1.3 image | q2.1.1 GT | q2.1.1 result |

2.3. Image to mesh (15 points)

Mesh Decoder
# ...
# Ico-sphere template, replicated once per batch element; the decoder predicts per-vertex offsets.
mesh_pred = ico_sphere(4, 'cuda')
self.mesh_pred = pytorch3d.structures.Meshes(
    mesh_pred.verts_list() * args.batch_size,
    mesh_pred.faces_list() * args.batch_size)
verts = self.mesh_pred.verts_list()[0]

self.n_points = verts.shape[0]

self.decoder = torch.nn.Sequential(
    nn.ConvTranspose2d(2048, 2048, 5, stride=1, padding=0),
    nn.Flatten(),
    nn.Linear(51200, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 1024),
    nn.ReLU(),
    nn.Linear(1024, 3 * self.n_points),
)
# ...
def forward(self, images, args):
    images_normalize = self.normalize(images.permute(0, 3, 1, 2))
    # Keep the (B, 2048, 1, 1) feature map; the transposed convolution needs spatial dims.
    encoded_feat = self.encoder(images_normalize)

    decoded_features = self.decoder(encoded_feat)
    deform_vertices_pred = decoded_features
    # Offset the template vertices by the predicted per-vertex deformation.
    mesh_pred = self.mesh_pred.offset_verts(
        deform_vertices_pred.reshape([-1, 3]))

    return mesh_pred
Run Command
# to train
python3 main.py train_model --type 'mesh'
# to eval
python3 main.py eval_model --type 'mesh' --load_checkpoint
| # | Input Image | GT Mesh | Predicted Mesh |
| --- | --- | --- | --- |
| 1 | q2.1.1 image | q2.1.1 GT | q2.1.1 result |
| 2 | q2.1.2 image | q2.1.1 GT | q2.1.1 result |
| 3 | q2.1.3 image | q2.1.1 GT | q2.1.1 result |

2.4. Quantitative comparisons (10 points)

| # | Type | F1 Score |
| --- | --- | --- |
| 1 | Voxel Grid | 81.348 |
| 2 | Pointcloud | 96.654 |
| 3 | Mesh | 85.459 |

The F1 score is computed on point clouds. For voxels and meshes, points are first sampled from the predictions. The point cloud model has the highest F1 score because it predicts the points directly, without any intermediate conversion.

Meshes are constrained by the connectivity of the vertices and faces of the initial template. In this case we deform a sphere, which is a watertight structure; it cannot be deformed into a chair that has holes in it. As a result, the F1 score is lower.

The F1 score for voxels is the lowest because the voxel grid first has to be converted to a mesh, and points then have to be sampled from that mesh. If the voxel grid itself has errors or is at a low resolution, the F1 score drops further. Here we use a resolution of 32x32x32, which is too coarse to capture some of the thin structures in the chair shapes, and hence results in the lowest F1 score.
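To make the extra conversion step concrete, here is a rough sketch (assuming pytorch3d's cubify and surface sampling; illustrative only, not the exact evaluation code) of how a predicted voxel grid is turned into points before computing F1:

import torch
from pytorch3d.ops import cubify, sample_points_from_meshes

def voxels_to_points(voxels_pred, n_points=5000, thresh=0.5):
    # voxels_pred: (B, 1, 32, 32, 32) logits from the voxel decoder
    occupancy = torch.sigmoid(voxels_pred).squeeze(1)    # (B, 32, 32, 32) probabilities
    meshes = cubify(occupancy, thresh)                   # cells above thresh become cube faces
    return sample_points_from_meshes(meshes, n_points)   # (B, n_points, 3) for the F1 metric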

2.5. Analyse effects of hyperparameter variations (10 points)

I experimented with the --w_chamfer and --w_smooth hyperparameters; the results are shown below. The mesh has sharper edges and faces at low smoothness weights. This is expected: increasing the smoothness weight penalizes the Laplacian term more strongly, pushing neighbouring vertices to be as close to co-planar as possible. A sketch of the weighted objective follows.
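For reference, a minimal sketch of how the two weighted terms combine during mesh training (the mesh_loss wrapper and sampling counts are illustrative, not the exact training code):

from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(mesh_pred, mesh_gt, args):
    # Chamfer on points sampled from both meshes, plus Laplacian smoothness on the prediction.
    points_pred = sample_points_from_meshes(mesh_pred, args.n_points)
    points_gt = sample_points_from_meshes(mesh_gt, args.n_points)
    return args.w_chamfer * chamfer_loss(points_pred, points_gt) \
        + args.w_smooth * smoothness_loss(mesh_pred)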

Run Command
# to train
python3 main.py train_model --type 'mesh' --w_smooth 0.4
python3 main.py train_model --type 'mesh' --w_smooth 1
python3 main.py train_model --type 'mesh' --w_smooth 1.5
| --w_smooth=0.4 | --w_smooth=1 | --w_smooth=1.5 |
| --- | --- | --- |
| q2.5 image | q2.5 image | q2.5 image |
| q2.5 image | q2.5 image | q2.5 image |
| q2.5 image | q2.5 image | q2.5 image |

Another experiment I tried was to use a mesh model of a chair as the initial template instead of the ico-sphere. My hypothesis was that, since a sphere mesh inherently cannot represent chair models with holes, starting from a generic chair model should help the network learn to represent such complex chairs. However, the model didn't learn any better: the loss saturated after a few epochs and the generated outputs weren't improved. A sketch of how the template was swapped, the loss plot, and sample results are shown below.
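Roughly how the template was swapped (a sketch; chair_template.obj is a placeholder path, and the rest of the mesh model is unchanged):

import pytorch3d.io
import pytorch3d.structures

# Load a generic chair mesh and replicate it across the batch, in place of ico_sphere(4, 'cuda').
verts, faces, _ = pytorch3d.io.load_obj("chair_template.obj", load_textures=False)
template = pytorch3d.structures.Meshes(verts=[verts.cuda()], faces=[faces.verts_idx.cuda()])
self.mesh_pred = pytorch3d.structures.Meshes(
    template.verts_list() * args.batch_size,
    template.faces_list() * args.batch_size)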

q2.5 plot

| Sample result 1 | Sample result 2 | Sample result 3 |
| --- | --- | --- |
| q2.5 image | q2.5 image | q2.5 image |

2.6. Interpret your model (15 points)

To understand my model, I wanted to check whether it learns the distinguishing features of chairs, i.e., some chairs have thin legs, some have holes, some are flat and wide, etc. If the model learns these features correctly, it should be able to identify similar kinds of objects. For this purpose, I checked whether the model can take a given type of chair as a query and return similar ones. I used 3 types of query chairs, and the corresponding results are shown below.

To perform this experiment, I chose the point cloud model and extracted features from the second-to-last layer of the decoder. These features are indexed for all samples, and a KNN search is run to identify the models closest to the given query model (a sketch follows).
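A rough sketch of the retrieval step (retrieve_similar is an illustrative helper; the feature-extraction hook and interpret_model plumbing are omitted):

import torch

@torch.no_grad()
def retrieve_similar(query_feat, feat_bank, k=3):
    # query_feat: (D,) second-to-last-layer feature of the query chair
    # feat_bank:  (N, D) features of all indexed chairs
    dists = torch.cdist(query_feat[None], feat_bank)[0]   # (N,) Euclidean distances
    return torch.topk(dists, k, largest=False).indices    # indices of the k nearest chairs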

From these results, it can be seen that my model learns the individual features of the chairs.

Run Command
# to interpret
python3 main.py interpret_model --type 'point' --load_checkpoint
| Query Object | Result 1 | Result 2 | Result 3 |
| --- | --- | --- | --- |
| q2.6 image | q2.6 image | q2.6 image | q2.6 image |
| q2.6 image | q2.6 image | q2.6 image | q2.6 image |
| q2.6 image | q2.6 image | q2.6 image | q2.6 image |

3. (Extra Credit) Exploring some recent architectures.

3.1 Implicit network (10 points)

Implement an implicit decoder that takes 3D locations as input and outputs the occupancy value. Some papers for inspiration [1,2]
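A minimal sketch of what such a decoder could look like (an OccNet-style conditional MLP; the class name and layer sizes are illustrative, not a graded implementation):

import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    # Predicts an occupancy logit for each query 3D point, conditioned on the image feature.
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))  # occupancy logit; train with BCEWithLogitsLoss

    def forward(self, encoded_feat, points):
        # encoded_feat: (B, feat_dim), points: (B, P, 3) query locations
        feat = encoded_feat[:, None, :].expand(-1, points.shape[1], -1)
        return self.mlp(torch.cat([feat, points], dim=-1)).squeeze(-1)  # (B, P)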

3.2 Parametric network (10 points)

Implement a parametric function that takes sampled 2D points as input and outputs their respective 3D points. Some papers for inspiration [1,2]
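A minimal sketch of such a decoder (an AtlasNet/FoldingNet-style MLP that folds 2D samples onto the 3D surface; the class name and sizes are illustrative):

import torch
import torch.nn as nn

class ParametricDecoder(nn.Module):
    # Maps 2D points sampled in the unit square to 3D surface points, conditioned on the image feature.
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Tanh())  # 3D coordinates in [-1, 1]

    def forward(self, encoded_feat, points_2d):
        # encoded_feat: (B, feat_dim), points_2d: (B, P, 2) uniform samples in [0, 1]^2
        feat = encoded_feat[:, None, :].expand(-1, points_2d.shape[1], -1)
        return self.mlp(torch.cat([feat, points_2d], dim=-1))  # (B, P, 3)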


Last update: March 2, 2022