16-889 Assignment 2: Single View to 3D

Name: Sri Nitchith Akula
Andrew ID: srinitca

Grace days used: 2

1. Exploring loss functions

1.1. Fitting a voxel grid (5 points)

Implementation

def voxel_loss(voxel_src, voxel_tgt):
    # voxel_src: predicted occupancy logits; voxel_tgt: binary ground-truth grid
    loss = nn.BCEWithLogitsLoss()
    prob_loss = loss(voxel_src, voxel_tgt)
    return prob_loss
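A minimal sketch of how this loss can drive the fitting loop (all names other than voxel_loss are assumptions; the actual loop is driven by main.py):

import torch

def fit_voxel_grid(voxel_tgt, n_iter=1000, lr=1e-2):
    # Optimize raw logits so that sigmoid(logits) matches the target occupancy.
    voxel_src = torch.randn(voxel_tgt.shape, requires_grad=True)
    optimizer = torch.optim.Adam([voxel_src], lr=lr)
    for _ in range(n_iter):
        optimizer.zero_grad()
        loss = voxel_loss(voxel_src, voxel_tgt)  # BCE-with-logits loss from above
        loss.backward()
        optimizer.step()
    return torch.sigmoid(voxel_src)  # occupancy probabilities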

Run Command

python main.py -q q1.1
(Figure: Optimized Voxel Grid | Ground Truth Voxel Grid)

1.2. Fitting a point cloud (10 points)

Implementation

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # knn_points (K=1 by default) returns squared distances to the nearest neighbor
    knn_1 = pytorch3d.ops.knn.knn_points(point_cloud_src, point_cloud_tgt)[0]
    knn_2 = pytorch3d.ops.knn.knn_points(point_cloud_tgt, point_cloud_src)[0]
    loss_chamfer = torch.sum(knn_1) + torch.sum(knn_2)
    return loss_chamfer
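In effect this is the summed, two-sided Chamfer distance (with the default K=1, knn_points returns squared nearest-neighbor distances):

d_{\text{Chamfer}}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2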

Run Command

python main.py -q q1.2
(Figure: Optimized Point Cloud | Ground Truth Point Cloud)

1.3. Fitting a mesh (5 points)

Implementation

def smoothness_loss(mesh_src):
    # Uniform Laplacian smoothing regularizer from PyTorch3D
    loss_laplacian = pytorch3d.loss.mesh_laplacian_smoothing(mesh_src)
    return loss_laplacian
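With PyTorch3D's default "uniform" method, this regularizer roughly penalizes the distance between each vertex and the mean of its 1-ring neighbors:

L_{\text{smooth}} = \frac{1}{|V|} \sum_{i \in V} \Big\lVert \frac{1}{|\mathcal{N}(i)|} \sum_{j \in \mathcal{N}(i)} (v_j - v_i) \Big\rVert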

Run Command

python main.py -q q1.3
(Figure: Optimized Mesh | Ground Truth Mesh)

2. Reconstructing 3D from single view

2.1. Image to voxel grid (15 points)

Architecture

class VoxNet(nn.Module):
    def __init__(self, feature_size):
        super(VoxNet, self).__init__()
        self.fc1 = nn.Linear(feature_size, 1024)
        self.relu = nn.PReLU()
        self.fc2 = nn.Linear(1024, 32 * 32 * 32)

    def forward(self, x):
        b = x.shape[0]
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = x.reshape((b, 1, 32, 32, 32))
        return x
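A minimal usage sketch, assuming a 512-dimensional global image feature from the encoder (the feature size and variable names are assumptions, not part of the architecture above):

import torch

decoder = VoxNet(feature_size=512)        # e.g. a ResNet-18 global feature
feats = torch.randn(4, 512)               # placeholder (batch, feature_size) features
logits = decoder(feats)                   # (4, 1, 32, 32, 32) occupancy logits
occupancy = torch.sigmoid(logits) > 0.5   # thresholded binary voxel grid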

Run Command

python main.py -q q2.1
(Figures 1-3: Input RGB Image | Predicted 3D Voxel Grid | Ground Truth Mesh)

2.2. Image to point cloud (15 points)

Architecture

class PointNet(nn.Module):
    def __init__(self, features, num_verts):
        super(PointNet, self).__init__()
        self.num_verts = num_verts
        self.fc1 = nn.Linear(features, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, num_verts * 3)
        self.relu = nn.PReLU()

    def forward(self, x):
        n = x.shape[0]
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        x = x.reshape((n, self.num_verts, 3))
        return x
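A minimal sketch of one training step for this decoder, assuming image features and ground-truth points sampled from the target mesh are already available (function and variable names here are illustrative):

import torch

def train_step(decoder, feats, gt_points, optimizer):
    # feats: (B, features) image features; gt_points: (B, M, 3) points from the GT mesh
    pred_points = decoder(feats)                 # (B, num_verts, 3) predicted point cloud
    loss = chamfer_loss(pred_points, gt_points)  # Chamfer loss from section 1.2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()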

Run Command

python main.py -q q2.2
(Figures 1-3: Input RGB Image | Predicted 3D Point Cloud | Ground Truth Mesh)

2.3. Image to mesh (15 points)

Architecture

class MeshNet(nn.Module):
    def __init__(self, features, num_verts):
        super(MeshNet, self).__init__()
        self.num_verts = num_verts
        self.fc1 = nn.Linear(features, 1024)
        self.fc2 = nn.Linear(1024, 2048)
        self.fc3 = nn.Linear(2048, 4096)
        self.fc4 = nn.Linear(4096, 4096)
        self.fc5 = nn.Linear(4096, num_verts * 3)
        self.relu = nn.LeakyReLU()
        self.tanh = nn.Tanh()

    def forward(self, x):
        n = x.shape[0]
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.relu(x)
        x = self.fc3(x)
        x = self.relu(x)
        x = self.fc4(x)
        x = self.relu(x)
        x = self.fc5(x)
        x = self.tanh(x)
        x = x.reshape((n, self.num_verts, 3))
        return x
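Because the final tanh bounds the outputs to [-1, 1], the decoder predicts per-vertex offsets. A minimal sketch of how such offsets would deform a template ico-sphere, assuming PyTorch3D's ico_sphere initialization (the level, sample count, and variable names are assumptions):

import torch
import pytorch3d.ops
import pytorch3d.utils

decoder = MeshNet(features=512, num_verts=2562)   # ico_sphere(level=4) has 2562 vertices
mesh_src = pytorch3d.utils.ico_sphere(level=4)    # template mesh to deform
feats = torch.randn(1, 512)                       # placeholder image features
offsets = decoder(feats).reshape(-1, 3)           # (num_verts, 3) vertex offsets
mesh_pred = mesh_src.offset_verts(offsets)        # deformed mesh prediction
pred_pts = pytorch3d.ops.sample_points_from_meshes(mesh_pred, 5000)  # points for the Chamfer term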

Run Command

python main.py -q q2.3
(Figures 1-3: Input RGB Image | Predicted 3D Mesh | Ground Truth Mesh)

2.4. Quantitative comparisons (10 points)

Run Command

python main.py -q q2.4
Type          F1 Score
Voxel grid    80.226
Point Cloud   93.021
Mesh          85.142

Point Cloud: We notice that the point cloud has the highest F1 score. This is understandable because the network predicts points directly and independently of each other, so it is easier to place at least a few points near every part of the chair.

Mesh: Unlike point clouds, the predicted vertices are constrained by neighboring vertices and face connectivity. If the original chair has thin structures, it is harder for the mesh to deform the initial sphere to represent them well, and if the chair has holes, they cannot be represented at all by deforming a sphere (the topology is fixed). Hence the lower F1 score.

Voxel: This has the lowest F1 score of all. To compute the F1 score we first have to extract a mesh from the voxel grid and then sample points from it, so any errors in the voxel grid carry over to the mesh and decrease the F1 score further. In Q2.1, the second example is a thin chair and the network output is missing its legs. Also, we use a fixed 32x32x32 resolution to represent every chair; for chairs with thin parts, increasing the prediction resolution would help.
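For reference, a minimal sketch of the F1 computation described above (the distance threshold and names are assumptions; the graded evaluation code may differ):

# Precision: fraction of predicted points within a threshold of some ground-truth
# point; recall: the same with roles swapped; F1 is their harmonic mean.
import torch
import pytorch3d.ops

def f1_score(pred_points, gt_points, threshold=0.05):
    # pred_points: (1, N, 3), gt_points: (1, M, 3); knn_points gives squared distances
    d_pred_to_gt = pytorch3d.ops.knn_points(pred_points, gt_points).dists.sqrt()
    d_gt_to_pred = pytorch3d.ops.knn_points(gt_points, pred_points).dists.sqrt()
    precision = (d_pred_to_gt < threshold).float().mean()
    recall = (d_gt_to_pred < threshold).float().mean()
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)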

2.5. Analyse effects of hyperparameter variations (10 points)

(Figures 1-3: Predicted 3D mesh with w_smooth = 0.1 | w_smooth = 1.5 | w_smooth = 4)
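The weight trades off the Chamfer data term against the Laplacian regularizer. A minimal sketch of the combined objective, assuming the training loss is a weighted sum of the two terms from section 1 (the weight and variable names are assumptions):

# A larger w_smooth favors smoother, blobbier surfaces at the cost of geometric
# detail; a smaller one preserves detail but allows spiky, irregular triangles.
import pytorch3d.ops

def mesh_loss(mesh_pred, gt_points, w_chamfer=1.0, w_smooth=0.1):
    pred_points = pytorch3d.ops.sample_points_from_meshes(mesh_pred, 5000)
    loss_data = chamfer_loss(pred_points, gt_points)  # from section 1.2
    loss_reg = smoothness_loss(mesh_pred)             # from section 1.3
    return w_chamfer * loss_data + w_smooth * loss_reg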

Observations:

2.6. Interpret your model (15 points)

Run Command

python main.py -q q2.6

I interpreted the Point Cloud decoder model using feature embeddings. I wanted to test the following hypothesis:

To test this hypothesis, I followed the steps below:

Type                             Image idx in test set
Arm Chair                        0
Dining Chair                     1
Club Chair (large seat space)    120
Chair with base support          140
Chair with slant legs            165

K-closest Images
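A minimal sketch of how the K closest images could be retrieved in encoder-feature space, assuming embeddings from the (frozen) image encoder are precomputed (the similarity measure and names are assumptions; the actual procedure follows the steps above):

# Hypothetical retrieval: rank all test images by cosine similarity of their
# encoder embeddings to the query image's embedding.
import torch
import torch.nn.functional as F

def k_closest_images(query_feat, all_feats, k=5):
    # query_feat: (D,) embedding of the query image; all_feats: (N, D) embeddings
    sims = F.cosine_similarity(query_feat.unsqueeze(0), all_feats, dim=1)  # (N,)
    return torch.topk(sims, k).indices  # indices of the K most similar images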

Observations: