Assignment 2. Single View to 3D
Course: 16-889 Learning for 3D Vision
Name: Soyong Shin
Due by: Feb. 24 (Thu)

I used 1 late day for this assignment
Contents:
0. Introduction
In this assignment, we practiced building loss functions and neural network models to reconstruct various 3D representations from single-view images.
1. Exploring loss functions
1.1. Fitting a voxel grid (5 points)
1.1.1. Loss function
Here I built a binary cross-entropy loss to optimize the voxels. To do this, I use the torch.nn.functional.binary_cross_entropy_with_logits
function, which does not require pre-normalization or pre-clamping of voxel_src.
I also define pos_weight
, a scalar weight applied to the positive values (occupied voxels). Although this is not necessary for this voxel-wise optimization task, I implemented it for the later neural network training described below.
import torch
from torch import nn
from torch.nn import functional as F

def voxel_loss(voxel_src, voxel_tgt):
    # Weight occupied voxels more heavily so the sparse positives are not ignored
    pos_weight = 0.5 / voxel_tgt.mean()
    prob_loss = F.binary_cross_entropy_with_logits(
        voxel_src, voxel_tgt, reduction='mean', pos_weight=pos_weight)
    return prob_loss
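Below is a minimal sketch of the fitting loop that uses this loss. The optimizer choice, learning rate, iteration count, and the random stand-in target are illustrative assumptions, not the exact settings used.

# Illustrative fitting loop using voxel_loss defined above; all settings here are assumptions.
voxel_tgt = (torch.rand(1, 32, 32, 32) > 0.9).float()        # stand-in target occupancy grid
voxel_src = torch.randn(1, 32, 32, 32, requires_grad=True)   # logits to be optimized
optimizer = torch.optim.Adam([voxel_src], lr=1e-2)

for step in range(1000):
    optimizer.zero_grad()
    loss = voxel_loss(voxel_src, voxel_tgt)
    loss.backward()
    optimizer.step()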
1.1.2. Results
Here I visualize the results of the voxel-fitting optimization.
1.2. Fitting a point cloud (10 points)
1.2.1. Loss function
In this task, I implemented the chamfer loss to fit a 5,000-point cloud to the given points. The chamfer loss between point sets $S_1$ and $S_2$ is defined as:

$$\mathcal{L}_{\text{chamfer}}(S_1, S_2) = \frac{1}{|S_1|} \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2 + \frac{1}{|S_2|} \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2$$

To compute it, I use the pytorch3d.ops.knn.knn_points
function to get the minimum squared distance from each predicted point to the target set and vice versa. The code implementation is below.
from pytorch3d.ops.knn import knn_points

def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # knn_points with K=1 returns, for each point, the squared distance to its
    # nearest neighbor in the other cloud
    knn_src = knn_points(point_cloud_src, point_cloud_tgt, K=1)
    knn_tgt = knn_points(point_cloud_tgt, point_cloud_src, K=1)
    dist_src, idx_src = knn_src[:2]
    dist_tgt, idx_tgt = knn_tgt[:2]
    # Average over points, then sum the two directions (and the batch)
    loss_chamfer = dist_src.mean(1).sum() + dist_tgt.mean(1).sum()
    return loss_chamfer
1.2.2. Results
Optimization results are shown below.
1.3. Fitting a mesh (5 points)
1.3.1. Loss function
Here, I implemented the Laplacian smoothing loss. This smoothing term encourages each vertex to stay close to its neighbors. To do this, I use the laplacian_packed
function to compute the Laplacian of the mesh vertices.
def smoothness_loss(mesh_src):
    verts = mesh_src.verts_packed()
    # Uniform mesh Laplacian: rows map each vertex to its offset from its neighbors
    laplacian = mesh_src.laplacian_packed()
    loss_laplacian = (laplacian @ verts).norm(dim=1)
    loss_laplacian = loss_laplacian.mean()
    return loss_laplacian
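In mesh fitting, this smoothness term is combined with the chamfer loss on points sampled from the mesh surface. The sketch below assumes a weight w_smooth and a sample count of 5,000; both are illustrative choices, not the exact settings used.

from pytorch3d.ops import sample_points_from_meshes

def mesh_fitting_loss(mesh_src, point_cloud_tgt, w_smooth=0.1):
    # Sample points from the deforming mesh, compare them to the target cloud,
    # and add the Laplacian smoothness term (w_smooth is an assumed weight)
    sample_src = sample_points_from_meshes(mesh_src, num_samples=5000)
    return chamfer_loss(sample_src, point_cloud_tgt) + w_smooth * smoothness_loss(mesh_src)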
1.3.2. Results
The GIF files below show the results of mesh fitting.
2. Reconstructing 3D from single view
2.0. Dataset analysis
Before building the model, I first analyzed the structure of the given dataset.
# args.batch_size = 2
# Size of dataset (available pairs)
>> args.batch_size * len(train_loader)
>> 6098
# Size of image
>> images.size()
>> torch.Size([2, 137, 137, 3])
# Size of encoded features
>> encoded_feat.size()
>> torch.Size([2, 512])
# Size of voxel
# args.arch = 'vox'
>> ground_truth_3d.size()
>> torch.Size([2, 1, 32, 32, 32])
# Size of point cloud
# args.arch = 'point'
>> ground_truth_3d.size()
>> torch.Size([2, ])
# Size of mesh
# args.arch = 'mesh'
>> ground_truth_3d.size()
>> torch.Size([2, ])
2.1. Image to voxel grid (15 points)
2.1.1. Decoder architecture

The above figure illustrates my decoder architecture for reconstructing 3D volumes; a code sketch of this decoder follows the list.
- From the 512-dimensional encoded feature, I remapped it into a small 3D volume.
- Using multiple 3D deconvolution layers, I upsampled the volume.
- By passing it through a single 3D convolution layer, I estimated voxel-wise occupancy heatmaps (32 × 32 × 32).
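A minimal sketch of such a decoder is given below. The channel widths, the 4 × 4 × 4 seed volume, and the class name VoxelDecoder are assumptions for illustration, not the exact configuration used.

import torch
from torch import nn

class VoxelDecoder(nn.Module):
    # Maps a 512-d image feature to a 32 x 32 x 32 grid of voxel occupancy logits
    def __init__(self, feat_dim=512):
        super().__init__()
        # Remap the feature vector into a small seed volume (assumed 64 channels at 4^3 resolution)
        self.fc = nn.Linear(feat_dim, 64 * 4 * 4 * 4)
        self.layers = nn.Sequential(
            nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1),  # 4^3 -> 8^3
            nn.ReLU(),
            nn.ConvTranspose3d(32, 16, kernel_size=4, stride=2, padding=1),  # 8^3 -> 16^3
            nn.ReLU(),
            nn.ConvTranspose3d(16, 8, kernel_size=4, stride=2, padding=1),   # 16^3 -> 32^3
            nn.ReLU(),
            nn.Conv3d(8, 1, kernel_size=3, padding=1),                       # voxel-wise logits
        )

    def forward(self, feat):
        x = self.fc(feat).view(-1, 64, 4, 4, 4)
        return self.layers(x)  # (B, 1, 32, 32, 32)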
Loss balancing
Since the ground-truth data contains many more empty voxels (i.e., unoccupied volume filled with 0
) than occupied ones, the network may learn to output uniformly small values and settle in that local optimum. Therefore, I weighted the positive values more to balance them against the negatives.
To do this, I calculate the fraction of occupied voxels in every mini-batch of ground truths and apply a positive weight inversely proportional to it; for example, if 5% of the voxels are occupied, the positive weight becomes 0.5 / 0.05 = 10. The code implementation was introduced above.
2.1.2. Sample results

2.2. Image to point cloud (15 points)
2.2.1. Decoder architecture

The decoder architecture for this task is relatively simple. Since a point cloud can be represented as an N × 3 array of locations, where N is the number of points in the cloud, I simply created a 3-layer MLP with ReLU
and Dropout
modules.
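A minimal sketch of such an MLP decoder is shown below. The hidden width, dropout rate, and the class name PointDecoder are illustrative assumptions rather than the exact configuration used.

import torch
from torch import nn

class PointDecoder(nn.Module):
    # 3-layer MLP mapping a 512-d image feature to an (N, 3) point cloud
    def __init__(self, feat_dim=512, n_points=5000, hidden_dim=1024, p_drop=0.2):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden_dim, n_points * 3),
        )

    def forward(self, feat):
        # feat: (B, 512) -> points: (B, n_points, 3)
        return self.mlp(feat).view(-1, self.n_points, 3)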
2.2.2. Sample results

2.3. Image to mesh (15 points)
2.3.1. Decoder architecture
Here, I used the same decoder architecture as in the image-to-point-cloud model.
2.3.2. Sample results
2.4. Quantitative comparisons (10 points)
Here I ran quantitative comparisons among the 3 different models: voxel, point cloud, and mesh estimation.
| 3D representation | Voxel | Point cloud | Mesh |
|---|---|---|---|
| F1@0.05 score | 68.0 | 92.4 | 94.4 |
Although a relatively low F1 score was recorded for the volumetric representation, its reconstructed results are comparable to, or even better than, the others. The above visualization shows the reconstructions of the same target.
2.5. Analyze effects of hyperparameter variations (10 points)
2.5.1. Varying the number of points
The first experiment used two different numbers of points for the point cloud. The default setting was 5000 points, and I also trained with sparser points (2500). Both models were trained for 10 epochs with the resnet18
encoder architecture.
The F1 scores were 92.4 for 5000 points and 89.8 for 2500 points. Beyond the score, visualizing the results shows that 5000 points tend to produce a denser 3D representation.
2.5.2. Using / Not using weight on binary cross-entropy loss for voxel prediction
In this section, I trained the voxel prediction network without the positive weight (pos_weight
for the binary_cross_entropy_with_logits
function). Unfortunately, with this setting, the mesh rendering failed since none of the voxels were predicted above the threshold. This indicates that the ground-truth 3D voxels are mostly filled with 0s
, so the network was biased toward predicting low values. Therefore, I confirmed that a balancing weight is essential for training when the labels are imbalanced across classes.
2.6. Interpret your model (15 points)
In this section, I analyze the robustness of my models to variation in camera view. To do this, I input 10 different images containing an identical object but captured from different camera views. As shown below, all models are fairly robust to the different camera views. Some results (e.g., 1st column, 4th row of the voxel prediction) show limitations in predicting the occluded side, which is an inherent problem of single-view approaches.
2.6.1. View invariant prediction for voxel representation

2.6.2. View invariant prediction for point cloud representation

2.6.3. View invariant prediction for mesh representation
