Assignment 2. Single View to 3D

Course: 16-889 Learning for 3D Vision

Name: Soyong Shin

Due by: Feb. 24 (Thu)

I used 1 late day for this assignment

Contents:

0. Introduction

In this assignment, we practiced building loss functions and neural network models to reconstruct various 3D representations from single-view images.

1. Exploring loss functions

1.1. Fitting a voxel grid (5 points)

1.1.1. Loss function

Here I built a binary cross-entropy loss to optimize voxels. To do this, I use the torch.nn.functional.binary_cross_entropy_with_logits function, which does not require pre-normalization or pre-clamping of voxel_src.

I also define pos_weight, a scalar weight applied to the positive class (occupied voxels). Although this is not necessary for this voxel-wise optimization task, I implemented it for the neural network training described later in Section 2.1.

import torch
from torch.nn import functional as F

def voxel_loss(voxel_src, voxel_tgt):
    # voxel_src: predicted occupancy logits, voxel_tgt: binary ground-truth occupancies.
    # Weight the positive (occupied) class inversely proportional to its frequency in the batch.
    pos_weight = 0.5 / voxel_tgt.mean()
    prob_loss = F.binary_cross_entropy_with_logits(
        voxel_src, voxel_tgt, reduction='mean', pos_weight=pos_weight)
    return prob_loss
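
As a usage sketch, this loss can drive a direct voxel-grid fitting loop. The optimizer choice, learning rate, and iteration count below are assumptions, not the exact settings of the fitting script:

# voxels_tgt: (1, 32, 32, 32) binary float tensor of ground-truth occupancies
voxels_src = torch.randn(1, 32, 32, 32, requires_grad=True)  # occupancy logits to optimize
optimizer = torch.optim.Adam([voxels_src], lr=1e-2)
for step in range(2000):
    optimizer.zero_grad()
    loss = voxel_loss(voxels_src, voxels_tgt)
    loss.backward()
    optimizer.step()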

1.1.2. Results

Here I visualize the results of the optimization-based fitting.

Ground truth voxels
Fitted voxels
Optimization progress

1.2. Fitting a point cloud (10 points)

1.2.1. Loss function

In this task, I implemented the Chamfer loss to fit a point cloud of 5,000 points to the given target points. The Chamfer loss is defined as:

$$d_{CD}(S_1, S_2) = \sum_{x\in S_1} \min_{y\in S_2} ||x - y||^2_2 + \sum_{y\in S_2} \min_{x \in S_1} ||x-y||^2_2$$

To do this, I use the pytorch3d.ops.knn.knn_points function to get the minimum distance from prediction to target and vice versa. The code implementation is below.

from pytorch3d.ops.knn import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # Nearest neighbor (K=1) from each source point to the target cloud, and vice versa.
    knn_src = knn_points(point_cloud_src, point_cloud_tgt, K=1)
    knn_tgt = knn_points(point_cloud_tgt, point_cloud_src, K=1)
    # .dists holds the squared distance to the nearest neighbor for each point.
    dist_src = knn_src.dists
    dist_tgt = knn_tgt.dists
    loss_chamfer = dist_src.mean(1).sum() + dist_tgt.mean(1).sum()
    return loss_chamfer

1.2.2. Results

The optimization results are shown below.

Groundtruth point cloud
Fitted point cloud
Optimization progress

1.3. Fitting a mesh (5 points)

1.3.1. Loss function

Here, I implemented a Laplacian smoothing loss. This smoothing term encourages each vertex to stay close to its neighbors. To do this, I use the laplacian_packed function to compute the graph Laplacian of the mesh vertices:

$$\delta_i = \frac{1}{d_i} \sum_{j \in N(i)} (v_i - v_j)$$
def smoothness_loss(mesh_src):
    # verts: (V, 3) packed vertex locations, laplacian: (V, V) sparse graph Laplacian.
    verts = mesh_src.verts_packed()
    laplacian = mesh_src.laplacian_packed()
    # L @ verts gives the per-vertex delta; penalize its magnitude, averaged over vertices.
    loss_laplacian = (laplacian @ verts).norm(dim=1)
    loss_laplacian = loss_laplacian.mean()
    return loss_laplacian
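
For the mesh fitting itself, this smoothness term is used alongside a Chamfer term on points sampled from the source and target meshes. Below is a minimal sketch of such a combined objective, reusing the chamfer_loss and smoothness_loss defined above; the sample count and the weight w_smooth are assumptions, not the exact values used in the fitting script:

from pytorch3d.ops import sample_points_from_meshes

def mesh_fitting_loss(mesh_src, mesh_tgt, n_points=5000, w_smooth=0.1):
    # Chamfer term computed on point clouds sampled from both meshes.
    points_src = sample_points_from_meshes(mesh_src, n_points)
    points_tgt = sample_points_from_meshes(mesh_tgt, n_points)
    loss = chamfer_loss(points_src, points_tgt)
    # Laplacian smoothing term regularizes the deformed surface.
    return loss + w_smooth * smoothness_loss(mesh_src)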
		

1.3.2. Results

The gif files below show the results of mesh fitting.

2. Reconstructing 3D from single view

2.0. Dataset analysis

Before building the model, I first analyzed the structure of the given dataset.

# args.batch_size = 2

# Size of dataset (available pairs)
>> args.batch_size * len(train_loader)
>> 6098

# Size of image
>> images.size()
>> torch.Size([2, 137, 137, 3])

# Size of encoded features
>> encoded_feat.size()
>> torch.Size([2, 512])

# Size of voxel 
# args.arch = 'vox'
>> ground_truth_3d.size()
>> torch.Size([2, 1, 32, 32, 32])

# Size of point cloud
# args.arch = 'point'
>> ground_truth_3d.size()
>> torch.Size([2, ])

# Size of mesh
# args.arch = 'mesh'
>> ground_truth_3d.size()
>> torch.Size([2, ])

2.1. Image to voxel grid (15 points)

2.1.1. Decoder architecture

Decoder architecture for image to voxel model

The above figure illustrates my decoder architecture for reconstructing 3D volumes; a code sketch of this pipeline follows the list below.

  1. From the encoded feature $z \in \mathbb{R}^{512}$, I remapped it into a small volume ($V_{in} \in \mathbb{R}^{2\times 2 \times 2 \times 64}$).
  2. Using multiple 3D Deconv layers, I upsampled the volume ($V_{out} \in \mathbb{R}^{32\times 32 \times 32 \times 8}$).
  3. By passing a single 3D Conv layer, I estimate voxel-wise heatmaps ($V_{hm} \in \mathbb{R}^{32\times 32 \times 32 \times 1}$).
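
A minimal sketch of this decoder in PyTorch: only the input/output shapes above are fixed; the number of deconvolution layers, channel widths, kernel sizes, and activations below are assumptions.

from torch import nn

class VoxelDecoder(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # Step 1: remap the 512-d encoded feature into a small 2x2x2 volume with 64 channels.
        self.fc = nn.Linear(feat_dim, 64 * 2 * 2 * 2)
        # Step 2: upsample 2 -> 4 -> 8 -> 16 -> 32 with 3D deconvolution (transposed conv) layers.
        self.deconv = nn.Sequential(
            nn.ConvTranspose3d(64, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(16, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        # Step 3: a single 3D conv layer produces one occupancy logit per voxel.
        self.head = nn.Conv3d(8, 1, kernel_size=3, padding=1)

    def forward(self, feat):                       # feat: (B, 512)
        vol = self.fc(feat).view(-1, 64, 2, 2, 2)  # (B, 64, 2, 2, 2)
        vol = self.deconv(vol)                     # (B, 8, 32, 32, 32)
        return self.head(vol)                      # (B, 1, 32, 32, 32)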

Loss balancing

Since the ground-truth data contains far more empty voxels (i.e., unoccupied volume filled with 0), the network may learn to output small values everywhere and settle into that local optimum. Therefore, I weighted the positive class more to balance it against the negatives.

To do this, I calculate the fraction of positive voxels in every mini-batch of ground truths and apply a weight inversely proportional to it. The code implementation was introduced above.

2.1.2. Sample results

Sample results for voxel prediction. 1st and 3rd rows: input images, 2nd and 4th rows: predictions

2.2. Image to point cloud (15 points)

2.2.1. Decoder architecture

Decoder architecture for point cloud estimation model.

The decoder architecture for this task is relatively simpler, since a point cloud can be represented as a set of 3-dimensional locations $\vec{p}=(x, y, z)\in\mathbb{R}^{N \times 3}$, where $N$ is the number of points. Thus, I simply created a 3-layer MLP with ReLU and Dropout modules.
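
A minimal sketch of such an MLP decoder: the hidden width and dropout probability are assumptions, since only the 512-d input feature and the $N \times 3$ output are fixed above.

from torch import nn

class PointDecoder(nn.Module):
    def __init__(self, feat_dim=512, n_points=5000, hidden=1024, p_drop=0.2):
        super().__init__()
        self.n_points = n_points
        # 3-layer MLP with ReLU and Dropout, ending in N*3 coordinates.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, n_points * 3),
        )

    def forward(self, feat):                    # feat: (B, 512)
        out = self.mlp(feat)                    # (B, n_points * 3)
        return out.view(-1, self.n_points, 3)   # (B, n_points, 3)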

2.2.2. Sample results

2.3. Image to mesh (15 points)

2.3.1. Decoder architecture

Here, I used the same decoder architecture as the image-to-point-cloud model.

2.3.2. Sample results

2.4. Quantitative comparisons (10 points)

Here I ran quantitative comparisons among the 3 different models: voxel, point cloud, and mesh estimation.

| 3D representation | Voxel | Point cloud | Mesh |
|---|---|---|---|
| F1@0.05 score | 68.0 | 92.4 | 94.4 |
Voxel representation
Point cloud representation
Mesh representation

Although a relatively low F1 score was recorded for the volumetric representation, its reconstructed results are comparable to or even better than the others. The above visualization shows the reconstructions for the same target.

2.5. Analyze effects of hyperparameter variations (10 points)

2.5.1. Varying the number of points

Point cloud with 2500 points
Point cloud with 5000 points

The first experiment used two different numbers of points for the point cloud. The default setting was 5000 points, and I also trained with sparser points (2500). Both models were trained for 10 epochs with the resnet18 encoder architecture.

The F1 scores were 92.4 for 5000 points and 89.8 for 2500 points. Beyond the score, the visualized results show that 5000 points produce a denser 3D representation.

2.5.2. Using vs. not using the positive weight in the binary cross-entropy loss for voxel prediction

In this section, I trained the voxel prediction network without the positive weight (pos_weight for the binary_cross_entropy_with_logits function). Unfortunately, with this setting, the mesh rendering failed since none of the voxels was predicted above the threshold. This indicates that the ground-truth 3D voxels are mostly filled with 0s, and the network was biased toward predicting low values. Therefore, I confirmed that a balancing weight is essential for training when the labels are imbalanced across classes.
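
For reference, the thresholding step mentioned here can be expressed with pytorch3d's cubify. This is only an illustrative sketch: the variable name voxel_logits and the threshold of 0.5 are assumptions, not necessarily the values used in the actual rendering code.

import torch
from pytorch3d.ops import cubify

# voxel_logits: (B, 1, 32, 32, 32) raw network output
probs = torch.sigmoid(voxel_logits).squeeze(1)   # (B, 32, 32, 32) occupancy probabilities
meshes = cubify(probs, thresh=0.5)               # empty mesh if no voxel exceeds the threshold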

2.6. Interpret your model (15 points)

In this section, I analyze the robustness of my models to variation in camera view. To do this, I input 10 different views of an identical object, each captured from a different camera. As shown below, all models are fairly robust to the change of camera view. Some results (e.g., 1st column, 4th row of the voxel prediction) show limitations in predicting the occluded side, which is an inherent problem of single-view approaches.

2.6.1. View invariant prediction for voxel representation

Voxel prediction from 10 different views

2.6.2. View invariant prediction for point cloud representation

Points prediction from 10 different views

2.6.3. View invariant prediction for mesh representation