16-889 Assignment 3
Yutian Lei (Andrew ID: yutianle)

In this assignment, the NeRF model proposed in NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis is implemented step by step. In the first section, a differentiable renderer for emission-absorption (EA) volumes is implemented, which is used to optimize scene parameters in later sections. In the second section, the parameters of a box volume are optimized using the differentiable volume renderer and a given Signed Distance Field (SDF) as the implicit function. In the third section, a naive Neural Radiance Field (NeRF) is implemented and used as an implicit volume to optimize a scene from a set of RGB images. Finally, in the last section, view dependence, hierarchical sampling, and high-resolution imagery are implemented to improve the performance and accuracy of the naive NeRF.
1. Differentiable Volume Rendering
1.3. Ray sampling
In this section, the `get_pixels_from_image` and `get_rays_from_pixels` methods are implemented: the former generates pixel coordinates in the range [-1, 1] for each pixel in an image, and the latter generates a ray for each pixel. The visualizations of the `xy_grid` and `rays` produced with the `vis_grid` and `vis_rays` functions are shown below.
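A minimal sketch of how these two helpers could look, assuming PyTorch3D-style cameras (`unproject_points` and `get_camera_center` are PyTorch3D camera methods; the exact signatures of the starter-code helpers may differ):

```python
import torch

def get_pixels_from_image(image_size, camera=None):
    # Build a grid of pixel coordinates in NDC space, covering
    # [-1, 1] in both x and y.
    W, H = image_size
    x = torch.linspace(-1, 1, W)
    y = torch.linspace(-1, 1, H)
    # indexing="xy" yields (H, W) grids for both coordinates.
    grid_x, grid_y = torch.meshgrid(x, y, indexing="xy")
    return torch.stack([grid_x, grid_y], dim=-1).reshape(-1, 2)

def get_rays_from_pixels(xy_grid, camera):
    # Unproject each pixel at depth 1 to a world-space point, then
    # form normalized rays from the camera center through the points.
    n = xy_grid.shape[0]
    xy_depth = torch.cat([xy_grid, torch.ones(n, 1)], dim=-1)
    points = camera.unproject_points(xy_depth, world_coordinates=True)
    origins = camera.get_camera_center().expand(n, -1)
    directions = torch.nn.functional.normalize(points - origins, dim=-1)
    return origins, directions
```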
1.4. Point sampling
In this section, the `StratifiedSampler` is implemented to sample the required points uniformly along each ray. The visualization of the point samples from the first camera is shown below.
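A minimal sketch of the sampler, assuming a ray-bundle structure with `origins`, `directions`, `sample_points`, and `sample_lengths` fields (field names follow the starter code's conventions and are assumptions here):

```python
import torch

class StratifiedSampler(torch.nn.Module):
    def __init__(self, n_pts_per_ray, min_depth, max_depth):
        super().__init__()
        self.n_pts_per_ray = n_pts_per_ray
        self.min_depth = min_depth
        self.max_depth = max_depth

    def forward(self, ray_bundle):
        # Evenly spaced depth values in [min_depth, max_depth];
        # a fully stratified variant would also jitter within each bin.
        z_vals = torch.linspace(
            self.min_depth, self.max_depth, self.n_pts_per_ray,
            device=ray_bundle.origins.device,
        )
        n_rays = ray_bundle.origins.shape[0]
        z_vals = z_vals.view(1, -1, 1).expand(n_rays, -1, 1)
        # Point = origin + t * direction for each sampled depth t.
        points = (ray_bundle.origins.unsqueeze(1)
                  + z_vals * ray_bundle.directions.unsqueeze(1))
        return ray_bundle._replace(sample_points=points, sample_lengths=z_vals)
```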

1.5. Volume rendering
In this section, the function `VolumeRenderer._compute_weights` is implemented to compute the per-sample weights used for rendering from transmittance and density according to

$$w_i = T_i \left(1 - e^{-\sigma_i \Delta t_i}\right), \qquad T_i = \exp\Big(-\sum_{j<i} \sigma_j \Delta t_j\Big),$$

and `VolumeRenderer._aggregate` is implemented to aggregate (take the weighted sum of) per-point features using these weights according to

$$\hat{f} = \sum_i w_i f_i.$$

Finally, the depth map and color of a given volume can be rendered.
The rendering results of the provided box and its depth map are shown below.
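A sketch of the two functions, assuming shapes `(n_rays, n_pts, 1)` for the deltas and densities and `(n_rays, n_pts, C)` for the features:

```python
import torch

def _compute_weights(deltas, densities, eps=1e-10):
    # Per-interval opacity: alpha_i = 1 - exp(-sigma_i * delta_i).
    alpha = 1.0 - torch.exp(-densities * deltas)
    # Exclusive cumulative product gives the transmittance
    # T_i = prod_{j < i} exp(-sigma_j * delta_j).
    trans = torch.cumprod(1.0 - alpha + eps, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    return alpha * trans  # w_i = T_i * (1 - exp(-sigma_i * delta_i))

def _aggregate(weights, features):
    # Weighted sum of per-point features (e.g. RGB or depth) per ray:
    # (n_rays, n_pts, 1) x (n_rays, n_pts, C) -> (n_rays, C).
    return torch.sum(weights * features, dim=1)
```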
I also implemented several `SDFVolume` classes from the recommended website. The rendering results are given below (the sphere is from the provided code).
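As an illustration, a hypothetical sphere variant; the sigmoid-based SDF-to-density conversion is one common soft choice, not necessarily the one used by the provided code:

```python
import torch

class SphereSDFVolume(torch.nn.Module):
    # Hypothetical example: a sphere SDF turned into a density field.
    def __init__(self, center=(0.0, 0.0, 0.0), radius=1.0, beta=0.05):
        super().__init__()
        self.center = torch.nn.Parameter(torch.tensor(center))
        self.radius = torch.nn.Parameter(torch.tensor(radius))
        self.beta = beta  # softness of the surface transition

    def sdf(self, points):
        # Signed distance to a sphere: |p - c| - r.
        return points.sub(self.center).norm(dim=-1, keepdim=True) - self.radius

    def forward(self, points):
        # Inside (sdf < 0) maps to density near 1, outside to near 0.
        return torch.sigmoid(-self.sdf(points) / self.beta)
```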
2. Optimizing a basic implicit volume
2.1. Random ray sampling
In this section, the `get_random_pixels_from_image` method is implemented to sample a random subset of rays from the full image for each training iteration. Refer to the code for details.
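A minimal sketch, reusing the hypothetical full-grid helper from section 1.3:

```python
import torch

def get_random_pixels_from_image(n_pixels, image_size, camera=None):
    # Randomly pick n_pixels NDC coordinates out of the full grid.
    W, H = image_size
    xy_grid = get_pixels_from_image(image_size, camera)  # (H*W, 2) in [-1, 1]
    idx = torch.randperm(H * W)[:n_pixels]
    return xy_grid[idx]
```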
2.2. Loss and training
By defining the loss as the mean squared error between the rendered colors and the ground-truth RGB values,

```python
loss = torch.nn.functional.mse_loss(out['feature'], rgb_gt)
```

the optimized box converges to:

- Center of the box: (0.25, 0.25, 0)
- Side lengths of the box: (2.00, 1.50, 1.50)
2.3. Visualization
The visualizations of my optimized result and the given TA's result are shown below.
3. Optimizing a Neural Radiance Field (NeRF)
In this part, a naive low-resolution NeRF is implemented without view dependence and hierarchical sampling. Compared with the original NeRF paper, this naive NeRF differs in the following ways:
- only 6 fully-connected layers, with the skip connection after the third layer;
- half the latent feature channel size;
- 36 channels of positional encoding for points rather than 60 (see the sketch after this list);
- LeakyReLU as the activation instead of ReLU.
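A minimal sketch of such a positional encoding (the class name is an assumption; the paper's π factor is omitted here, as in some implementations):

```python
import torch

class HarmonicEmbedding(torch.nn.Module):
    # gamma(x) = (sin(2^k x), cos(2^k x)) for k = 0..L-1; with L = 6,
    # a 3D point maps to 3 * 2 * 6 = 36 channels (the paper's L = 10
    # gives 60 channels).
    def __init__(self, n_harmonic_functions=6):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_harmonic_functions))

    def forward(self, x):
        # x: (..., 3) -> (..., 6 * n_harmonic_functions)
        embed = (x[..., None] * self.freqs).flatten(-2)  # (..., 3L)
        return torch.cat([embed.sin(), embed.cos()], dim=-1)
```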
The model is trained using exactly the given settings, but for only 150 epochs, which I found more than enough for the model to converge. The visualizations of the rendering results every 30 epochs are given below.
4. NeRF Extras
4.1 View Dependence
In this part, view dependence is added to my naive NeRF model by concatenating the positional encoding of the input viewing direction to the final output feature of the fully-connected layers.
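A minimal sketch of such a view-dependent color head (names and hidden size are assumptions):

```python
import torch

class ViewDependentHead(torch.nn.Module):
    # Predict RGB from the last fc feature concatenated with the
    # encoded viewing direction.
    def __init__(self, feature_dim, dir_embed_dim, hidden_dim=128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(feature_dim + dir_embed_dim, hidden_dim),
            torch.nn.LeakyReLU(),
            torch.nn.Linear(hidden_dim, 3),
            torch.nn.Sigmoid(),  # colors constrained to [0, 1]
        )

    def forward(self, feature, dir_embedding):
        return self.net(torch.cat([feature, dir_embedding], dim=-1))
```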
The model is trained with view dependence added, using exactly the same settings as the naive model, for 150 epochs. The visualizations of the rendering results every 30 epochs are given below.
As the NeRF authors note, a model trained without view dependence (only x as input) has difficulty representing specularities. For example, the left result from the naive model doesn't capture the two small red lights well, while the model with view dependence recreates them successfully. In this lego example I did not observe the overfitting to the training images mentioned in the GitHub repository; a possible reason is that the model is low-resolution and the training images are omnidirectional.
4.2 Hierarchical Sampling
In this part, hierarchical sampling is added to my naive NeRF model by simultaneously optimizing a "coarse" and a "fine" network. At both training and inference time, the output of the "coarse" network is used to sample more informative points along each ray. The newly sampled points, together with the stratified samples, are then fed into the "fine" network to generate better results.
During training, 64 points are sampled uniformly along each ray for the "coarse" network, and 128 additional points are drawn using inverse transform sampling (sketched after this paragraph). All 192 points are fed into the "fine" network. The other settings are exactly the same as in the naive case. The visualizations of the rendering results every 30 epochs are given below.
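A sketch of the inverse transform sampling step, essentially the standard `sample_pdf` helper found in NeRF implementations (variable names are assumptions):

```python
import torch

def sample_pdf(bins, weights, n_samples, eps=1e-5):
    # bins: (n_rays, n_bins + 1) depth bin edges from the coarse pass;
    # weights: (n_rays, n_bins) coarse weights defining a
    # piecewise-constant PDF along each ray.
    weights = weights.detach()  # don't backprop through the sampling
    pdf = (weights + eps) / torch.sum(weights + eps, dim=-1, keepdim=True)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)

    # Uniform samples mapped through the inverse CDF.
    u = torch.rand(cdf.shape[0], n_samples, device=cdf.device)
    idx = torch.searchsorted(cdf, u, right=True)
    below = torch.clamp(idx - 1, min=0)
    above = torch.clamp(idx, max=cdf.shape[-1] - 1)

    cdf_below = torch.gather(cdf, -1, below)
    cdf_above = torch.gather(cdf, -1, above)
    bin_below = torch.gather(bins, -1, below)
    bin_above = torch.gather(bins, -1, above)

    # Linear interpolation inside the selected bin.
    t = (u - cdf_below) / torch.clamp(cdf_above - cdf_below, min=eps)
    return bin_below + t * (bin_above - bin_below)
```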
Hierarchical sampling more than doubles the training time, since it effectively trains two networks compared with the naive case. The comparison of the hierarchical sampling result and the naive result is shown below. Hierarchical sampling does improve the rendering quality, but in this low-resolution case the improvement is not significant enough to justify doubling the training time.
4.3 High Resolution Imagery
Finally, I put everything together to implement a "full" high-resolution NeRF with view dependence and hierarchical sampling. The "full" NeRF is almost the same as in the original paper, except that it uses LeakyReLU instead of ReLU as the activation. The network structure is shown below (figure borrowed from the NeRF paper).

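For concreteness, a sketch of this architecture, assuming 60-channel point encodings and 24-channel direction encodings as in the paper (class and head names are assumptions):

```python
import torch

class NeRFMLP(torch.nn.Module):
    # 8 fc layers of width 256 with a skip connection re-injecting the
    # point encoding at the fifth layer; LeakyReLU replaces the paper's
    # ReLU. xyz_dim=60 and dir_dim=24 match the paper's encodings.
    def __init__(self, xyz_dim=60, dir_dim=24, hidden=256):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(xyz_dim, hidden)]
            + [torch.nn.Linear(hidden + (xyz_dim if i == 4 else 0), hidden)
               for i in range(1, 8)]
        )
        self.density_head = torch.nn.Linear(hidden, 1)
        self.feature = torch.nn.Linear(hidden, hidden)
        self.color_head = torch.nn.Sequential(
            torch.nn.Linear(hidden + dir_dim, hidden // 2),
            torch.nn.LeakyReLU(),
            torch.nn.Linear(hidden // 2, 3),
            torch.nn.Sigmoid(),
        )

    def forward(self, xyz_embed, dir_embed):
        h = xyz_embed
        for i, layer in enumerate(self.layers):
            if i == 4:
                h = torch.cat([h, xyz_embed], dim=-1)  # skip connection
            h = torch.nn.functional.leaky_relu(layer(h))
        sigma = torch.nn.functional.relu(self.density_head(h))  # sigma >= 0
        rgb = self.color_head(torch.cat([self.feature(h), dir_embed], dim=-1))
        return sigma, rgb
```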
The model is trained for 150 epochs, and the final results of the "full" high-resolution NeRF on the lego example are shown below.

4.4 Experiments with Fern and Pytorch3dLogo
I further ran the model on the Fern and Pytorch3dLogo datasets by changing `cfg.data.dataset_name` to `fern` and `pt3logo` respectively. The Fern model is trained for 2000 epochs with a [252, 189] input size, and the Pytorch3dLogo model is trained for 2000 epochs with a [256, 128] input size. (I don't use the full-size input due to limited computing resources.) Both models are trained with the hierarchical sampling and view dependence techniques. The results are shown below. Note that in the testing stage, a figure-of-8 trajectory around the center of the central camera of the training dataset should be used to build the test cameras, instead of the circular trajectory used in the lego case. I borrowed the code that generates such cameras from the official PyTorch implementation of NeRF.