16-889 Assignment 2: Single View to 3D

Name: Adithya Sampath

Andrew ID: adithyas

Late days used:


1. Exploring loss functions

This section involves defining loss functions for fitting voxels, point clouds, and meshes.

1.1. Fitting a voxel grid (5 points)

I defined the loss function as follows:

def voxel_loss(voxel_src, voxel_tgt):
    # Binary cross-entropy between predicted occupancy logits and target occupancies.
    loss = torch.nn.BCEWithLogitsLoss()
    prob_loss = loss(voxel_src, voxel_tgt)
    return prob_loss
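For context, here is a minimal sketch of how this loss could be used to fit a voxel grid by direct optimization; the tensor shapes and optimizer settings are illustrative, not taken from fit_data.py:

import torch

# Optimize a 32x32x32 grid of occupancy logits towards a binary target grid.
voxel_src = torch.randn(1, 32, 32, 32, requires_grad=True)  # logits, optimized directly
voxel_tgt = torch.zeros(1, 32, 32, 32)                       # target occupancies in {0, 1}
optimizer = torch.optim.Adam([voxel_src], lr=1e-2)

for step in range(1000):
    optimizer.zero_grad()
    loss = voxel_loss(voxel_src, voxel_tgt)  # BCEWithLogitsLoss defined above
    loss.backward()
    optimizer.step()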

To reproduce the results below:

python main.py --question 1.1

or

python fit_data.py --type "vox"

(Result images: three pairs of optimized voxel grid vs. ground-truth voxel grid.)

1.2. Fitting a point cloud (10 points)

I defined the loss function as follows:

from pytorch3d.ops import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # Sum of squared nearest-neighbour distances in both directions (src -> tgt and tgt -> src).
    p1_dists, p1_idx, p1_nn = knn_points(point_cloud_src, point_cloud_tgt, K=1, return_nn=True)
    p2_dists, p2_idx, p2_nn = knn_points(point_cloud_tgt, point_cloud_src, K=1, return_nn=True)
    loss_chamfer = torch.sum(p1_dists) + torch.sum(p2_dists)
    return loss_chamfer
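As a sanity check (not part of the submission), the same quantity can be compared against PyTorch3D's built-in Chamfer distance; note that chamfer_distance averages over points by default, whereas the implementation above sums the squared nearest-neighbour distances:

import torch
from pytorch3d.loss import chamfer_distance

# Random point clouds of shape (batch, num_points, 3), purely for illustration.
src = torch.rand(2, 1000, 3)
tgt = torch.rand(2, 1200, 3)

loss_sum = chamfer_loss(src, tgt)          # summed squared NN distances, both directions
loss_mean, _ = chamfer_distance(src, tgt)  # mean-reduced version from PyTorch3D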

To reproduce the results below:

python main.py --question 1.2

or

python fit_data.py --type "point"

(Result images: three pairs of optimized point cloud vs. ground-truth point cloud.)

1.3. Fitting a mesh (5 points)

I defined the loss function as follows:

def smoothness_loss(mesh_src):
    loss_laplacian = pytorch3d.loss.mesh_laplacian_smoothing(mesh_src, method='uniform')    
    return loss_laplacian
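For mesh fitting, this smoothness term is combined with the Chamfer loss on points sampled from the source and target meshes. A rough sketch of such a combined objective is below; the weights w_chamfer and w_smooth and the point count are placeholders, not the exact values used in fit_data.py:

from pytorch3d.ops import sample_points_from_meshes

def mesh_fitting_loss(mesh_src, mesh_tgt, n_points=5000, w_chamfer=1.0, w_smooth=0.1):
    # Chamfer distance between point clouds sampled from both meshes,
    # plus a Laplacian term that keeps the deformed mesh smooth.
    points_src = sample_points_from_meshes(mesh_src, n_points)
    points_tgt = sample_points_from_meshes(mesh_tgt, n_points)
    return w_chamfer * chamfer_loss(points_src, points_tgt) + w_smooth * smoothness_loss(mesh_src)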

To reproduce the results below:

python main.py --question 1.3

or

python fit_data.py --type "mesh"

(Result images: three pairs of optimized mesh vs. ground-truth mesh.)

2. Reconstructing 3D from single view

This section involves training a single-view-to-3D pipeline for voxels, point clouds, and meshes. Refer to the save_freq argument in train_model.py to control how frequently model checkpoints are saved.

2.1. Image to voxel grid (15 points)

The decoder architecture I used:

class VoxelDecoder(nn.Module):
    def __init__(self):
        super(VoxelDecoder, self).__init__()
        self.device = torch.device("cuda")
        self.decoder = nn.Sequential(
                      nn.Linear(512, 1024),
                      nn.ReLU(),
                      nn.Linear(1024, 32*32*32),
                      Reshape_2()).to(self.device)

    def forward(self, x):
        x = x.to(self.device)
        decoded_features = self.decoder(x)
        return decoded_features

class Reshape_2(nn.Module):
    def __init__(self):
        super(Reshape_2, self).__init__()

    def forward(self, x):
        return x.view(x.shape[0], 1, 32, 32, 32)
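The decoder outputs raw logits of shape (B, 1, 32, 32, 32), which pair naturally with the BCEWithLogitsLoss from Section 1.1; at evaluation time the logits are passed through a sigmoid and thresholded. A minimal usage sketch (the 0.5 threshold is an assumption):

import torch

decoder = VoxelDecoder()
image_features = torch.randn(8, 512).cuda()  # e.g. pooled ResNet18 features
logits = decoder(image_features)             # (8, 1, 32, 32, 32) occupancy logits

# At evaluation time, convert logits to a binary occupancy grid.
occupancy = (torch.sigmoid(logits) > 0.5).float()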

To reproduce the results below:

python main.py --question 2.1

or

python train_model.py --type "vox" --batch_size 8
python eval_model.py --type "vox" --batch_size 8

(Result images: source image, ground-truth mesh, and predicted voxel grid for three examples.)

2.2. Image to point cloud (15 points)

The decoder architecture I used:

class PointCloudDecoder(nn.Module):
    def __init__(self, in_features, out_verts):
        super(PointCloudDecoder, self).__init__()
        self.device = torch.device("cuda")
        self.decoder = nn.Sequential(
                    nn.Linear(in_features, 1024),
                    nn.ReLU(),
                    nn.Linear(1024, 2048),
                    nn.ReLU(),
                    nn.Linear(2048, out_verts*3), 
                    Reshape_3(out_verts)
                    ).to(self.device)

    def forward(self, x):
        x = x.to(self.device)
        decoded_features = self.decoder(x)
        return decoded_features

class Reshape_3(nn.Module):
    def __init__(self, out_verts):
        super(Reshape_3, self).__init__()
        self.out_verts = out_verts

    def forward(self, x):
        return x.view(x.shape[0], self.out_verts, 3)
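A quick shape check for this decoder; n_points and the 512-dim feature size follow the ResNet18 encoder used throughout Question 2, but the numbers here are illustrative:

import torch

n_points = 5000
decoder = PointCloudDecoder(in_features=512, out_verts=n_points)

image_features = torch.randn(8, 512).cuda()  # encoder output, one row per image
pred_points = decoder(image_features)        # (8, n_points, 3) predicted point cloud

# Training compares pred_points against the ground-truth point clouds
# using the chamfer_loss from Section 1.2.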

To reproduce the results below:

python main.py --question 2.2

or

python train_model.py --type "point" --batch_size 8
python eval_model.py --type "point" --batch_size 8

(Result images: source image, ground-truth point cloud, and predicted point cloud for three examples.)

2.3. Image to mesh (15 points)

The decoder architecture I used:

class MeshDecoder(nn.Module):
    def __init__(self, in_features, out_verts):
        super(MeshDecoder, self).__init__()
        self.device = torch.device("cuda")
        self.decoder = nn.Sequential(
                      nn.Linear(in_features, 1024),
                      nn.ReLU(),
                      nn.Linear(1024, 2048),
                      nn.ReLU(),
                      nn.Linear(2048, 4096),
                      nn.ReLU(),
                      nn.Linear(4096, 8192),
                      nn.ReLU(),
                      nn.Linear(8192, out_verts*3),
                      Reshape_3(out_verts)
                    ).to(self.device)

    def forward(self, x):
        x = x.to(self.device)
        decoded_features = self.decoder(x)
        return decoded_features

class Reshape_3(nn.Module):
    def __init__(self, out_verts):
        super(Reshape_3, self).__init__()
        self.out_verts = out_verts

    def forward(self, x):
        return x.view(x.shape[0], self.out_verts, 3)
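Here the decoder predicts one 3D offset per vertex of a template icosphere, which is then deformed into the predicted chair. A rough sketch of how such offsets can be applied, assuming offset_verts is used (the starter code may apply them differently):

import torch
import pytorch3d.utils

# Template icosphere with the same number of vertices as the decoder output.
mesh_template = pytorch3d.utils.ico_sphere(4, torch.device("cuda"))
n_verts = mesh_template.verts_packed().shape[0]

decoder = MeshDecoder(in_features=512, out_verts=n_verts)
image_features = torch.randn(1, 512).cuda()
deform_verts = decoder(image_features)  # (1, n_verts, 3) per-vertex offsets

# Deform the template into the predicted mesh.
mesh_pred = mesh_template.offset_verts(deform_verts.reshape(-1, 3))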

To reproduce the results below:

python main.py --question 2.3

or

python train_model.py --type "mesh" --batch_size 8
python eval_model.py --type "mesh" --batch_size 8

(Result images: source image, ground-truth mesh, and predicted mesh for five examples.)

2.4. Quantitative comparisons (10 points)

Type Avg. F1 Score
Vox 78.781
Point 91.764
Mesh 89.072

Point Clouds have the highest F1 score, followed by Mesh, and finally Voxel Grid with the lowest F1 score.

Analysis

  1. Voxels predict occupancy on a constrained 32x32x32 grid, and the voxel ground truths are themselves only an approximation of the ground-truth mesh discretized into that 32x32x32 space. This is not the best representation for capturing the features we want the model to learn; in many of my outputs the legs of the chairs were missing. Hence, we would prefer point clouds and meshes over this representation.

  2. Although the mesh representation performs better than the voxel grid, the predicted meshes are restricted by the topology of the template icosphere: the model is essentially trying to deform this icosphere into a chair. Also, since the loss is a weighted sum of the Chamfer loss and the Laplacian smoothness loss, the weight hyperparameters need to be chosen carefully, which makes training slightly more challenging. The model is unable to capture finer details such as the gaps in the back of the chair and struggles to capture thin legs.

  3. Point clouds have the highest F1 score since they have none of the above restrictions (a fixed 32x32x32 grid or deforming an icosphere). The model directly predicts the 3D coordinates of each point, so it is easier to train and is able to capture some of the intricate structure that the other two approaches miss.
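For reference, the F1 scores above are typically computed from precision and recall at a fixed distance threshold between sampled predicted and ground-truth points. A minimal sketch of such a metric; the 0.05 threshold is an assumption and may differ from what eval_model.py uses:

import torch
from pytorch3d.ops import knn_points

def f1_score(points_pred, points_gt, threshold=0.05):
    # Distance from each predicted point to its nearest ground-truth point, and vice versa.
    d_pred_to_gt = knn_points(points_pred, points_gt, K=1).dists.sqrt()
    d_gt_to_pred = knn_points(points_gt, points_pred, K=1).dists.sqrt()

    precision = (d_pred_to_gt < threshold).float().mean() * 100.0
    recall = (d_gt_to_pred < threshold).float().mean() * 100.0
    return 2.0 * precision * recall / (precision + recall + 1e-8)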

2.5. Analyse effects of hyperparameter variations (10 points)

(1) w_smooth and n_points Hyperparameter tuning on Mesh

To reproduce the results below:

python main.py --question 2.5.1

or

python train_model.py --type "mesh" --batch_size 8 --w_smooth 500
python train_model.py --type "mesh" --batch_size 8 --w_smooth 1000
python train_model.py --type "mesh" --batch_size 8 --w_smooth 800 --n_points 10000

Analysis:

  1. As w_smooth was increased, the predicted meshes were smoothed more. As observed in the 4th column, the back of the chair becomes one uniform surface (i.e. there is no longer a gap between the back legs of the chair). This parameter weights the Laplacian smoothness loss, so increasing it increases the amount of smoothing.

  2. As n_points was increased, more points were sampled from the meshes to compute the Chamfer component of the final loss. As observed in the 3rd column, this model captures more features (such as a better height and width of the chair) compared to the other outputs.

(Result images, five examples: source image, ground-truth mesh, and predicted meshes with w_smooth 800 + n_points 10000, with w_smooth 1000, and with w_smooth 500.)

(2) n_points Hyperparameter tuning on Point Cloud

To reproduce the results below:

python main.py --question 2.5.2

or

python train_model.py --type "point" --batch_size 8 --n_points 2500
python train_model.py --type "point" --batch_size 8 --n_points 5000
python train_model.py --type "point" --batch_size 8 --n_points 10000
python train_model.py --type "point" --batch_size 8 --n_points 20000

(Result images, three examples: source image and predicted point clouds trained with n_points 2500, 5000, 10000, and 20000.)

Analysis:

  1. As n_points was increased, more points were sampled from the ground-truth mesh to form the ground-truth point cloud. It also changes the size of the final dense layer, since the output of the final layer has n_points * 3 values. As observed in the outputs above, the predictions become denser as n_points increases, and the model trained with more points captures the shape of the chair better (such as its height and width) than the models trained with a smaller n_points. As shown in the table below, the model trained with n_points 20000 has the highest F1 score of 95.297.

Comparison of Avg. F1 scores on changing n_points:

n_points Avg. F1 Score
2500 90.893
5000 91.764
10000 93.574
20000 95.297

2.6. Interpret your model (15 points)

To reproduce the results below:

python main.py --question 2.6

or

python collect_test_features.py --type "point" 
python knn.py --type "point" 
python knn_pca.py --type "point" 

To interpret my model, I first chose the model with the highest F1 score, i.e. the point cloud model. I then used the pre-final layer output to collect a 2048-dimensional feature vector for each test image.

To get this, I load the model and its weights from the checkpoint, set the last layer to nn.Identity(), and run python collect_test_features.py --type "point" to collect the feature vectors.
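A sketch of that feature-extraction step; the indices of the layers being replaced are illustrative, and checkpoint loading is omitted, but the idea is simply to truncate the decoder after the 2048-dim activation:

import torch
import torch.nn as nn

# Assume `decoder` is the trained PointCloudDecoder restored from its checkpoint.
decoder = PointCloudDecoder(in_features=512, out_verts=5000)
# ... load the trained weights from the checkpoint here ...

# Replace the final Linear and Reshape_3 (indices 4 and 5 of the Sequential) with
# identities so the forward pass returns the 2048-dim pre-final activation.
decoder.decoder[4] = nn.Identity()
decoder.decoder[5] = nn.Identity()

with torch.no_grad():
    feats = decoder(torch.randn(8, 512).cuda())  # (8, 2048) per-image feature vectors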

This feature vector should essentially describe the characteristics of the chair. Given any chair's feature vector, I wanted to check whether I could retrieve similar chairs from the test set. So I set up an image-retrieval experiment: I pick random indices from the test data and find their 4 nearest neighbours, using the K-Nearest Neighbors (KNN) algorithm on the feature vectors to get the closest/most similar images.

The results are shown below. The first image is the test image, and the other 4 are its nearest neighbours:

(Retrieval result images: two examples, each showing the query image followed by its 4 nearest neighbours.)

Although the results are not bad, I wanted to improve them.

So I experimented with PCA (Principal Component Analysis) to reduce the high-dimensional feature vectors (of dimension 2048) to a lower dimension (64). I then applied KNN on the reduced vectors to retrieve the 4 most similar images for each test image.
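A minimal sketch of this PCA + KNN retrieval, using scikit-learn purely for illustration (knn_pca.py may be implemented differently; the feature file name is hypothetical):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# features: (num_test_images, 2048) array collected from the truncated decoder.
features = np.load("test_features.npy")  # hypothetical file name

reduced = PCA(n_components=64).fit_transform(features)

# For a random query, retrieve the 4 most similar test images (plus the query itself).
nn_index = NearestNeighbors(n_neighbors=5).fit(reduced)
query = np.random.randint(len(reduced))
_, neighbors = nn_index.kneighbors(reduced[query:query + 1])
print(neighbors)  # the first index is the query itself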

The results are shown below. The first image is the test image, and the other 4 are its nearest neighbours. These results are a significant improvement:

1) This output shows all the white couches with wide seats.

(Retrieval result image: query and its 4 nearest neighbours.)

2) This output shows all the office chairs with multiple wheels as a base.

(Retrieval result image: query and its 4 nearest neighbours.)

3) This output shows all the couches with wide seats.

(Retrieval result image: query and its 4 nearest neighbours.)

4) This output shows all the tall, white chairs with a small seat.

(Retrieval result image: query and its 4 nearest neighbours.)

Thus, we can observe that the feature vector from the pre-final layer captures features that help distinguish or group similar chairs. The output can be improved further by using a dimensionality reduction technique like PCA.

3. (Extra Credit) Exploring some recent architectures.

3.1 Implicit network (10 points)

Implement an implicit decoder that takes 3D locations as input and outputs the occupancy value. Some papers for inspiration [1,2]

3.2 Parametric network (10 points)

Implement a parametric function that takes sampled 2D points as input and outputs their respective 3D points. Some papers for inspiration [1,2]

Architecture:

I implemented a simplified version of the AtlasNet paper using only dense (fully connected) layers. The model first encodes the output of the ResNet18 encoder into a feature vector of size 128. The sampled 2D points are also encoded into feature vectors of size 128. These two feature vectors are then added element-wise, and the result is passed to the decoder to predict the 3D coordinates of each point.

I added the class below as a decoder in model.py and added a new --type option, parametric_point, to the argparse arguments. I use the same ResNet18 encoder (used for Question 2) for this setup as well. The results could be improved by training deeper networks for more epochs.

The model architecture I used:

class ParametricDecoder(torch.nn.Module):
    def __init__(self):
        super(ParametricDecoder, self).__init__()
        self.device = torch.device("cuda")
        self.linear_encoder = nn.Sequential(
            nn.Linear(512, 128),
            nn.ReLU()
        ).to(self.device)
        self.points_encoder = nn.Sequential(
            nn.Linear(2, 32),
            nn.ReLU(),
            nn.Linear(32, 128),
            nn.ReLU(),
        ).to(self.device)
        self.decoder = nn.Sequential(
            nn.Linear(128, 32),
            nn.ReLU(),
            nn.Linear(32, 3)
        ).to(self.device)              

    def forward(self, encoder_features, points):
        # encoder_features: (B, 512) image features; points: (B, N, 2) sampled 2D points.
        encoded_features = self.linear_encoder(encoder_features).unsqueeze(1)  # (B, 1, 128)
        point_features = self.points_encoder(points)                           # (B, N, 128)
        final_features = torch.add(encoded_features, point_features)           # broadcast to (B, N, 128)
        output = self.decoder(final_features)                                  # (B, N, 3)
        return output
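At inference time, a point cloud is produced by sampling 2D parameter points and decoding each of them; a sketch, where uniform sampling from the unit square is an assumption:

import torch

decoder = ParametricDecoder()
image_features = torch.randn(1, 512).cuda()  # ResNet18 encoder output for one image

# Sample n_points 2D parameter points uniformly from the unit square.
n_points = 5000
points_2d = torch.rand(1, n_points, 2).cuda()

pred_points = decoder(image_features, points_2d)  # (1, n_points, 3) predicted point cloud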

To reproduce the results below:

python main.py --question 3.2

or

python train_model.py --type "parametric_point" --batch_size 8
python eval_model.py --type "parametric_point" --batch_size 8

(Result images: source image, ground-truth point cloud, and predicted point cloud for three examples.)