CMU 2022 Spring 16-889

Assignment 3: Single View to 3D


1. Differentiable Volume Rendering

1.3 Ray sampling

1.4 Point sampling

1.5 Volume rendering

2. Reconstructing 3D from single view

2.2 Loss and training

2.3 Visualization

On the left is my predicted image; on the right is the TA's reference image. The two look nearly identical.

3. Optimizing a Neural Radiance Field (NeRF)

Below is a visualization of my trained NeRF.

Regarding the network, I use an architecture similar to the original NeRF paper, with the default config provided in the config file. That is, the network consists of 6 fully-connected layers, each with 128 hidden units and ReLU activations. The xyz coordinates of the ray sample points are passed through a harmonic embedding layer. The xyz embedding is fed into the network at the input of the 1st layer and again at the 4th layer (a skip connection). The output feature passes through a 1-d linear layer to generate density, and also through a 64-unit fully-connected layer followed by a 3-d linear layer to generate the rgb value.
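For concreteness, here is a minimal PyTorch sketch of this architecture. The names (`harmonic_embedding`, `NeRFMLP`) are my own placeholders rather than the starter code, and the ReLU on the density output is an assumption to keep density non-negative:

```python
import torch
import torch.nn as nn

def harmonic_embedding(x, n_harmonic=6):
    """Map each coordinate to sin/cos features at octave frequencies 2^0..2^(L-1)."""
    freqs = 2.0 ** torch.arange(n_harmonic, device=x.device, dtype=x.dtype)
    angles = x[..., None] * freqs                          # (..., 3, L)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., 3, 2L)
    return emb.flatten(start_dim=-2)                       # (..., 3 * 2L)

class NeRFMLP(nn.Module):
    """6 FC layers of 128 units; the xyz embedding is re-injected before the 4th layer."""
    def __init__(self, n_harmonic_xyz=6, hidden=128, n_layers=6, skip_at=3):
        super().__init__()
        self.n_harmonic = n_harmonic_xyz
        self.skip_at = skip_at
        in_dim = 3 * 2 * n_harmonic_xyz
        self.layers = nn.ModuleList([
            nn.Linear((in_dim if i == 0 else hidden) + (in_dim if i == skip_at else 0), hidden)
            for i in range(n_layers)
        ])
        self.density_head = nn.Linear(hidden, 1)  # 1-d linear layer -> density
        self.color_head = nn.Sequential(          # 64-unit FC + 3-d linear -> rgb
            nn.Linear(hidden, 64), nn.ReLU(),
            nn.Linear(64, 3), nn.Sigmoid(),
        )

    def forward(self, xyz):
        emb = harmonic_embedding(xyz, self.n_harmonic)
        h = emb
        for i, layer in enumerate(self.layers):
            if i == self.skip_at:
                h = torch.cat([h, emb], dim=-1)   # skip connection at the 4th layer
            h = torch.relu(layer(h))
        density = torch.relu(self.density_head(h))  # assumed ReLU for non-negativity
        rgb = self.color_head(h)
        return density, rgb
```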

Regarding training parameters, I use the ones provided in the config file: a learning rate of 0.0005, training for 250 epochs.
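For completeness, a hedged sketch of what that training setup looks like; Adam is my assumption (only the learning rate and epoch count come from the config), and `render_rays` / `train_loader` are hypothetical stand-ins for the starter-code equivalents:

```python
model = NeRFMLP()                         # the sketch class from above
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

for epoch in range(250):
    for rays, target_rgb in train_loader:                # hypothetical data loader
        pred_rgb = render_rays(model, rays)              # hypothetical volume-rendering step
        loss = torch.mean((pred_rgb - target_rgb) ** 2)  # L2 photometric loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```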

4. NeRF Extras

4.1 View Dependence

To add view dependence, I pass the ray directions through a harmonic embedding layer and concatenate the result with the feature output by the fully-connected backbone to predict the rgb value. Results at 128x128 are shown below; the left is without view dependence, the right is with it. Pay attention to the shovel of the bulldozer in the sampled frame: the shadow inside the shovel is correctly modeled when training with view dependence.

Another result at 400x400 is shown below; the left is without view dependence, the right is with it. Pay attention to the tracks of the bulldozer in the sampled frame: the specularity is correctly modeled when training with view dependence.

However, there is a trade-off between view dependence and generalization. If we give the view-dependent branch more model capacity, it may overfit to the training views: the model learns to reproduce the views presented during training but cannot represent novel ones. In that case it renders seen views well but novel views poorly. To prevent this, I follow the original NeRF paper and inject the view feature only after the backbone, right before the rgb prediction head, so that the view direction influences far fewer parameters.
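Concretely, the change amounts to widening the input of the color head; a sketch is below, reusing `harmonic_embedding` from section 3 (the names are placeholders, and 2 direction harmonics is the default noted in section 4.3):

```python
class ViewDependentColorHead(nn.Module):
    """Color head that also sees the embedded ray direction; density is unchanged."""
    def __init__(self, hidden=128, n_harmonic_dir=2):
        super().__init__()
        self.n_harmonic_dir = n_harmonic_dir
        dir_dim = 3 * 2 * n_harmonic_dir
        self.mlp = nn.Sequential(
            nn.Linear(hidden + dir_dim, 64), nn.ReLU(),
            nn.Linear(64, 3), nn.Sigmoid(),
        )

    def forward(self, backbone_feature, directions):
        # Concatenate the per-point backbone feature with the embedded view direction.
        dir_emb = harmonic_embedding(directions, self.n_harmonic_dir)
        return self.mlp(torch.cat([backbone_feature, dir_emb], dim=-1))
```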

4.2 Hierarchical Sampling

In this part, I follow the details in the original NeRF paper to implement the hierarchical sampling scheme. Below, the left is the result without it and the right is with it. The difference is minor, though the L2 loss is a touch better when training with it; the effect might be more obvious on high-res images. As for implementation details, the coarse network is the same as the original network, while the fine network evaluates the original sample points plus 64 additional points drawn according to the coarse weights, as sketched below.
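The re-sampling step is inverse-transform sampling along each ray; below is a minimal sketch under the assumption that `weights` are the per-bin compositing weights from the coarse pass (the function name and arguments are mine):

```python
import torch

def sample_pdf(bin_centers, weights, n_fine=64, eps=1e-5):
    """Draw n_fine extra depths per ray, proportional to the coarse weights.

    bin_centers: (n_rays, n_coarse) depths of the coarse samples
    weights:     (n_rays, n_coarse) compositing weights from the coarse pass
    """
    pdf = (weights + eps) / (weights + eps).sum(-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)  # (n_rays, n_coarse + 1)

    u = torch.rand(weights.shape[0], n_fine, device=weights.device)  # uniform draws
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)

    # Linearly interpolate a depth inside the selected bin.
    cdf_lo = torch.gather(cdf, -1, idx - 1)
    cdf_hi = torch.gather(cdf, -1, idx)
    bins_lo = torch.gather(bin_centers, -1, (idx - 1).clamp(max=bin_centers.shape[-1] - 1))
    bins_hi = torch.gather(bin_centers, -1, idx.clamp(max=bin_centers.shape[-1] - 1))
    t = (u - cdf_lo) / (cdf_hi - cdf_lo + eps)
    return bins_lo + t * (bins_hi - bins_lo)
```

The union of the original and the newly drawn depths is then sorted along each ray and evaluated by the fine network.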

Regarding the trade-off between quality and speed, using a coarse and a fine sampler and network clearly improves the scene representation, as more sample points are used. The cost is that two networks are involved in each forward pass, so rendering is much slower: in my implementation I observe a 30-50% reduction in speed.

4.3 High Resolution Imagery

In order to render high-res images of better quality, I change several settings. The final result is presented here. This is done by increasing the model depth 6 -> 8 layers, increasing the hidden dimension 128 -> 256, increasing the number of positional encoding functions for xyz 6 -> 10 and for direction 2 -> 4, using Xavier initialization for all weights, and enabling view dependence during training.

Below, I detail the aspects I changed to improve high-res imagery quality.

Number of layers

I increase the number of layers 6 -> 8 to increase model capacity. Below, the left shows results with 6 layers and the right with 8 layers. The result with 8 layers is a touch crisper.

Points sampled per ray

I try increasing the number of sample points per ray 128 -> 256. Below, the left shows results with 128 and the right with 256. However, this does not noticeably improve image quality.

Number of positional encoding functions

I increase the number of positional encoding functions 6 -> 10. Below, the left shows results with 6 and the right with 10. It is clear that more positional encoding functions contribute to much crisper rendering; the lego bricks are now clearly visible.
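As a quick sanity check on dimensions: with L harmonic functions, each coordinate expands into 2L sin/cos features, so going 6 -> 10 grows the xyz embedding from 36 to 60 dims and raises the highest frequency from 2^5 to 2^9. Using the `harmonic_embedding` sketch from section 3:

```python
x = torch.rand(1024, 3)                            # a batch of xyz sample points
print(harmonic_embedding(x, n_harmonic=6).shape)   # torch.Size([1024, 36])
print(harmonic_embedding(x, n_harmonic=10).shape)  # torch.Size([1024, 60])
```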

Hidden dimension and Xavier weight initialization

I initially try increasing the hidden dimension 128 -> 256, but the network fails to converge. To help with convergence, I employ Xavier weight initialization for all weights. Below, the left shows results with 128 and the right with 256. With the larger hidden dimension, the network capacity increases, resulting in a clearer image.
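One way to do this in PyTorch is shown below; it is a sketch reusing the placeholder `NeRFMLP` class from section 3, instantiated with the enlarged settings:

```python
def xavier_init(module):
    # Xavier/Glorot uniform for every linear layer's weights; zero the biases.
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

model = NeRFMLP(n_harmonic_xyz=10, hidden=256, n_layers=8)
model.apply(xavier_init)  # applies recursively to all submodules
```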