16-889 Assignment 3: Volume Rendering and Neural Radiance Fields

In this assignment, I learned how to implement differentiable volume rendering for emission-absorption (EA) volumes. I also learned about NeRF and how to optimize an implicit volume representation of a scene from RGB images in order to render realistic images from new viewpoints.


Zero late days were used.

1. Differentiable Volume Rendering

1.3. Ray Sampling

Ray sampling required two steps. The first was building a grid of pixel coordinates from the image and converting the x and y coordinates to the range [-1, 1]. The second was generating rays from pixels: I mapped each pixel to a point on the image plane, converted the points from the normalized coordinate system to the world coordinate system, and then computed the normalized direction from the camera origin to each world-space point.
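A minimal sketch of these two steps, assuming a PyTorch3D-style camera that provides unproject_points and get_camera_center (exact NDC sign conventions depend on the camera convention, so treat this as illustrative rather than my exact implementation):

```python
import torch
import torch.nn.functional as F

def get_pixels_from_image(image_size, device="cpu"):
    # Grid of (x, y) pixel coordinates, each normalized to [-1, 1].
    W, H = image_size
    x = torch.linspace(-1.0, 1.0, W, device=device)
    y = torch.linspace(-1.0, 1.0, H, device=device)
    grid_y, grid_x = torch.meshgrid(y, x, indexing="ij")
    return torch.stack([grid_x, grid_y], dim=-1).reshape(-1, 2)  # (H*W, 2)

def get_rays_from_pixels(xy_grid, camera):
    # Place each pixel on the image plane at depth 1, unproject to world space,
    # then take the normalized direction from the camera center to that point.
    depth = torch.ones_like(xy_grid[..., :1])
    world_points = camera.unproject_points(
        torch.cat([xy_grid, depth], dim=-1), world_coordinates=True
    )
    origins = camera.get_camera_center().expand_as(world_points)
    directions = F.normalize(world_points - origins, dim=-1)
    return origins, directions
```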


To confirm my implementation was correct, my visualized grid and rays match the provided examples.



1.4. Stratified Point Sampling

Next, I had to implement a stratified point sampler, following the process described in the NeRF paper. First, the range between the minimum and maximum depth is partitioned into evenly spaced bins; the number of bins equals the number of points to be sampled. Then, one point is sampled uniformly from each bin. The advantage of stratified sampling over sampling at fixed, evenly spaced depths is that it later allows the MLP from Question 3 to learn a more continuous scene representation, since it is trained on continuously varying positions.
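A minimal sketch of such a sampler (function and argument names, including the near/far depth bounds, are my own):

```python
import torch

def stratified_sample_depths(near, far, n_pts_per_ray, n_rays, device="cpu"):
    # Partition [near, far] into n_pts_per_ray evenly spaced bins.
    edges = torch.linspace(near, far, n_pts_per_ray + 1, device=device)
    lower, upper = edges[:-1], edges[1:]
    # Draw one uniform sample inside each bin, independently for every ray.
    u = torch.rand(n_rays, n_pts_per_ray, device=device)
    return lower + u * (upper - lower)  # (n_rays, n_pts_per_ray)
```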


Again, to confirm my implementation was correct, my visualized rendered points match the provided examples.


1.5. Volume Rendering

The last step is volume rendering. This requires a density and a feature for each sampled point along each ray. The density represents a differential opacity controlling how much radiance is accumulated by a ray passing through the point (as in NeRF), and the feature is the colour at that point, which is later weighted according to the density.


To perform volume rendering, the first step is to calculate the weight of each sampled point. This can be done using the equations below, where σ is the density and δ is the difference in depth between consecutive ray points.
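These are the standard emission-absorption weights used in NeRF:

T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right), \qquad w_i = T_i \left(1 - \exp(-\sigma_i \delta_i)\right)

where T_i is the transmittance accumulated before point i.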


The final colour of each pixel is then calculated by aggregating the weights multiplied by the features, i.e. a weighted sum of the per-point colours along the ray.


We also calculate an aggregated depth for each ray by using the same equation but replacing the feature with the depth value of each sampled point.
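A minimal sketch of this weighting and aggregation, assuming densities, features, and depth deltas shaped (n_rays, n_pts, channels):

```python
import torch

def compute_weights(densities, deltas, eps=1e-10):
    # Per-point opacity: alpha_i = 1 - exp(-sigma_i * delta_i).
    alpha = 1.0 - torch.exp(-densities * deltas)
    # Transmittance T_i: cumulative product of (1 - alpha) over earlier points,
    # shifted so that the first point has transmittance 1.
    trans = torch.cumprod(1.0 - alpha + eps, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    return alpha * trans  # (n_rays, n_pts, 1)

def aggregate(weights, features):
    # Weighted sum along each ray; works for colours (3 channels) or depths (1).
    return torch.sum(weights * features, dim=1)
```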


To verify my implementation was correct, I rendered a visualization of my features and depth to confirm they matched the provided examples.



2. Optimizing a Basic Implicit Volume

2.1. Random Ray Sampling

Using sampled points along rays from every pixel in the image can consume a lot of memory. To get around this, I sampled rays from a random subset of pixels in each image.


I selected pixels uniformly at random. If I were to do this again or had more time, I would experiment with different sampling schemes. For example, images usually have more content near the middle than near the borders, so I could try pixel sampling methods that give higher weight to central pixels, such as weighting by a Gaussian distribution.
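A minimal sketch of the uniform sampler I used, together with the Gaussian-weighted variant described above (hypothetical names; the variant is an idea, not something I implemented):

```python
import torch

def sample_pixels_uniform(n_pixels, xy_grid):
    # xy_grid: (H*W, 2) normalized pixel coordinates; pick n_pixels uniformly.
    idx = torch.randperm(xy_grid.shape[0])[:n_pixels]
    return xy_grid[idx]

def sample_pixels_center_biased(n_pixels, xy_grid, sigma=0.5):
    # Weight each pixel by a Gaussian on its distance from the image centre,
    # so central pixels are drawn more often than border pixels.
    r2 = (xy_grid ** 2).sum(dim=-1)
    probs = torch.exp(-r2 / (2.0 * sigma ** 2))
    idx = torch.multinomial(probs, n_pixels, replacement=False)
    return xy_grid[idx]
```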


2.2 and 2.3. Loss and Training

I implemented the loss function as the mean-squared error between the predicted features (colours) and the ground-truth image. My final visualization for the rendered box matched the provided example. The final loss was 0.000016, the box centre was (0.2498, 0.2502, -0.0007), and the box side lengths were (2.0027, 1.5004, 1.5039).
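The loss itself is essentially a one-liner (hypothetical tensor names):

```python
import torch.nn.functional as F

def photometric_loss(rendered_colors, gt_colors):
    # Mean-squared error between rendered RGB values and the ground-truth
    # pixel colours at the sampled ray locations.
    return F.mse_loss(rendered_colors, gt_colors)
```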


3. Optimizing a Neural Radiance Field (NeRF)

As described in the instructions, my MLP takes in a RayBundle in its forward method and outputs a colour and density for each sampled point in each ray. The loss function is the same mean-squared error loss between the aggregated features and the RGB ground-truth image as in 2.2/2.3.

To implement my MLP, I used almost exactly the same architecture as NeRF. The only difference is that the first and sixth fc layers (and the 10th in 4.1) have different input sizes, because the harmonic positional and directional embeddings provided to us have different lengths than those in NeRF: theirs are of length 60 and 24 respectively, whereas ours are 39 and 15. Otherwise the architecture is identical, as seen below.

MLP Architecture
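A minimal sketch of this architecture without view dependence (layer widths 256/128 as in NeRF, input sizes adapted to our 39-dim positional embedding; class and attribute names are my own):

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    def __init__(self, pos_dim=39, hidden=256):
        super().__init__()
        # fc1-fc5: positional embedding in, 256 hidden units.
        self.block1 = nn.ModuleList(
            [nn.Linear(pos_dim, hidden)] + [nn.Linear(hidden, hidden) for _ in range(4)]
        )
        # fc6 takes the skip connection: hidden features + positional embedding.
        self.block2 = nn.ModuleList(
            [nn.Linear(hidden + pos_dim, hidden)] + [nn.Linear(hidden, hidden) for _ in range(2)]
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden, hidden // 2), nn.ReLU(), nn.Linear(hidden // 2, 3)
        )

    def forward(self, pos_embed):
        x = pos_embed
        for fc in self.block1:
            x = torch.relu(fc(x))
        x = torch.cat([x, pos_embed], dim=-1)
        for fc in self.block2:
            x = torch.relu(fc(x))
        density = torch.relu(self.density_head(x))  # non-negative density
        color = torch.sigmoid(self.color_head(x))   # RGB in [0, 1]
        return density, color
```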

I then trained NeRF on the lego bulldozer dataset for 250 epochs on 128x128 images. One thing I had to do was change my chunk_size to 8192 or else I ran out of GPU memory. Below is the training loss over epochs as well as the final result, which more or less matches the provided example.

Training Loss Over Epoch

4. NeRF Extras

4.1. View Dependence

I implemented view dependence as described in NeRF. Specifically, I concatenated the output of the 9th fc layer with the harmonic embedding of the ray directions as the input to the 10th fc layer. The advantage of view dependence is that the network can better represent specularities. In the images below, the specular reflection off the floor at the front of the bulldozer is represented much better when view dependence is included. However, this comes at the cost of some overfitting to the training data: the fine details of the bulldozer, such as the blade and the back wheels, are slightly worse when view dependence is included, and the result is also a little blurrier.
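Relative to the sketch in Section 3, only the colour head changes; a minimal, self-contained version of that head (hypothetical names):

```python
import torch
import torch.nn as nn

class ViewDependentColorHead(nn.Module):
    # The 256-dim feature from the 9th fc layer is concatenated with the
    # 15-dim harmonic direction embedding before the 10th fc layer.
    def __init__(self, hidden=256, dir_dim=15):
        super().__init__()
        self.fc10 = nn.Linear(hidden + dir_dim, hidden // 2)
        self.out = nn.Linear(hidden // 2, 3)

    def forward(self, feature, dir_embed):
        x = torch.relu(self.fc10(torch.cat([feature, dir_embed], dim=-1)))
        return torch.sigmoid(self.out(x))  # RGB in [0, 1]
```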

MLP Architecture
Training Loss Over Epoch




4.2. Hierarchical Sampling

NeRF actually uses two networks. The authors claim that the stratified sampling method is inefficient because "free space and occluded regions that do not contribute to the rendered image are still sampled repeatedly". To remedy this, NeRF first uses a coarse network with stratified sampling as described above. It then uses the output of this coarse network to perform a more informed point sampling, in which samples are biased towards the relevant parts of the rendered image. These points are then concatenated with the original stratified samples and passed to a fine network, which outputs the final features and depths. The coarse and fine networks share the same architecture.


To better sample the points, we use the coarse network's weights that were used to compute its features, as discussed in 1.5. We normalize the weights so they sum to 1 and use them to build a piecewise-constant PDF along the ray over the previously sampled depths. From this PDF we construct a CDF and perform inverse-transform sampling to draw the new points.
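A minimal sketch of this PDF/CDF inversion (function and argument names are my own; `bins` are the depth bin edges from the coarse pass):

```python
import torch

def sample_pdf(bins, weights, n_samples, eps=1e-10):
    # bins: (n_rays, n_bins + 1) depth bin edges; weights: (n_rays, n_bins)
    # coarse-network weights treated as an unnormalized piecewise-constant PDF.
    pdf = weights / (weights.sum(dim=-1, keepdim=True) + eps)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)

    # Inverse-transform sampling: draw u ~ U[0, 1) and invert the CDF.
    u = torch.rand(cdf.shape[0], n_samples, device=cdf.device)
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)
    cdf_lo = torch.gather(cdf, -1, idx - 1)
    cdf_hi = torch.gather(cdf, -1, idx)
    bin_lo = torch.gather(bins, -1, idx - 1)
    bin_hi = torch.gather(bins, -1, idx)
    t = (u - cdf_lo) / (cdf_hi - cdf_lo + eps)
    return bin_lo + t * (bin_hi - bin_lo)  # (n_rays, n_samples) new depths
```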


In my implementation, I used an equal number of stratified samples and inverse-transform samples. For a fair comparison, I kept the total number of sampled points the same as in my implementation without hierarchical sampling. In other words, in my base implementation, 128 points are sampled with stratified sampling; in my hierarchical sampling implementation, 64 points are sampled with stratified sampling and an additional 64 with inverse-transform sampling, for a total of 128 points passed as input to the fine network.


The results show that hierarchical sampling performs better. The coarse network's loss is slightly larger on average and its rendered images are worse than those of the network from Question 3, but this is likely because fewer points were sampled per ray. The fine network's loss is lower, and its rendered images look crisper and more detailed. However, this comes at a cost: training time increased significantly. It took approximately three times as long to train the coarse and fine networks. While some of that was because more images were rendered, it would still take at least twice as long, since two models with the same architecture are being optimized. So while there is an increase in quality, there is a significant increase in training time.


Averaging over the last 10 epochs of training, the average loss for the baseline network, coarse network, and fine network are 0.001338, 0.001546, and 0.001063 respectively.

Training Loss Over Epoch




4.3. High Resolution Imagery

For high resolution imagery, the images are 400x400 instead of 128x128. My baseline used view dependence. I evaluated the effect of changing the number of sample points per ray, comparing 64, 128, and 180 points per ray (I ran out of GPU memory with more than 180). The results are what I expected: using fewer points produced worse renderings, and using more points increased training time. The results can be seen below. Note that the 180-point run took comparatively even longer than presented: in the 64- and 128-point cases I rendered every 10 epochs, but for the 180-point case rendering was taking far too long, so I only rendered every 80 epochs. Had I kept it at every 10 epochs, it would have taken significantly longer.

128 Sample Points Per Ray - Last 10 Epochs Avg Loss: 0.001624
64 Sample Points Per Ray - Last 10 Epochs Avg Loss: 0.002883
180 Sample Points Per Ray - Last 10 Epochs Avg Loss: 0.001330




I also compared using view dependence vs. not using view dependence, both with 128 sample points per ray.

No View Dependence




Lastly, I ran the high resolution imagery with hierarchical sampling, using 128 sample points per ray. I was not able to render with a chunk_size of 8192, and using a chunk_size of 4096 was unbearably slow. Instead, I trained the networks with a chunk_size of 8192 without rendering any images during training; once training was complete, I rendered using a chunk_size of 4096. Even without rendering while training, it took a comparatively long time.


Once again, the coarse network's loss is larger and its rendered images look worse compared to no hierarchical sampling, while the fine network's loss is smaller and its images look crisper and more detailed. Averaging over the last 10 epochs of training, the average loss for the baseline network, coarse network, and fine network was 0.001624, 0.002607, and 0.001364 respectively.

Training Loss Over Epoch




Training Visualization

The final thing I wanted to do was visualize the progress of each network as it trained. Below are renderings from the same viewpoint every 10 epochs over all 250 epochs of training. It is interesting how the renderings seem to get better, then worse, then better again, while gradually improving on average.



Conclusion

Through this assignment, I learned a lot about volumetric representations and neural radiance fields. I implemented a differentiable volume renderer and trained a Neural Radiance Field for novel-view rendering. I also examined the trade-offs among view dependence, hierarchical sampling, the number of sample points per ray, and working with higher resolution images. I would be keen to take this one step further and learn more about implicit surface representations.