[16-889] Assignment 3 Submission: Aarush Gupta (aarushg3)

Late days used: 1

Late days screenshot (1 late day)


1. Differentiable Volume Rendering

1.3. Ray sampling (10 points)


vis_grid output for xy_grid: Alt Text


vis_rays output for ray_bundle: Alt Text
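For reference, a minimal sketch of the two steps behind these visualizations: building a normalized xy pixel grid and converting pixels to world-space rays. The function names and the plain pinhole convention below are illustrative assumptions; the submitted code uses the starter code's camera utilities.

```python
import torch

def make_xy_grid(image_width, image_height):
    # Normalized pixel coordinates in [-1, 1] for every pixel centre.
    xs = torch.linspace(-1.0, 1.0, image_width)
    ys = torch.linspace(-1.0, 1.0, image_height)
    y, x = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([x, y], dim=-1).reshape(-1, 2)   # (H*W, 2)

def pixels_to_rays(xy_grid, focal, cam_to_world):
    # Back-project each pixel onto the z = 1 plane in camera space, rotate
    # the directions into world space, and use the camera centre as origin.
    n = xy_grid.shape[0]
    dirs_cam = torch.cat([xy_grid / focal, torch.ones(n, 1)], dim=-1)
    dirs_world = torch.nn.functional.normalize(
        dirs_cam @ cam_to_world[:3, :3].T, dim=-1)
    origins = cam_to_world[:3, 3].expand(n, 3)
    return origins, dirs_world
```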


1.4. Point sampling (10 points)


render_points output for point_samples: Alt Text
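The point sampler is a stratified sampler along each ray. A minimal sketch of the idea (names are illustrative assumptions, and for brevity the jittered depths are shared across rays; see the submitted code for the actual implementation):

```python
import torch

def stratified_sample_points(origins, directions, n_pts_per_ray, near, far):
    # Split [near, far] into equal bins and draw one uniform sample per bin.
    bins = torch.linspace(near, far, n_pts_per_ray + 1)
    lower, upper = bins[:-1], bins[1:]
    depths = lower + (upper - lower) * torch.rand(n_pts_per_ray)
    # (n_rays, n_pts_per_ray, 3) sample locations along each ray.
    points = origins[:, None, :] + depths[None, :, None] * directions[:, None, :]
    return points, depths
```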


1.5. Volume rendering (30 points)


Depth output: Alt Text


Colour output: Alt Text
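The depth and colour maps above come from the standard emission-absorption compositing of per-sample densities and colours. A minimal sketch of that aggregation step (tensor names and shapes are illustrative; the renderer in the submitted code is equivalent up to bookkeeping):

```python
import torch

def composite_along_rays(densities, colours, deltas, depths, eps=1e-10):
    # densities: (n_rays, n_pts, 1), colours: (n_rays, n_pts, 3),
    # deltas:    (n_rays, n_pts, 1) distances between consecutive samples,
    # depths:    (n_rays, n_pts, 1) sample depths along each ray.
    alphas = 1.0 - torch.exp(-densities * deltas)           # per-sample opacity
    # Transmittance T_i = prod_{j < i} (1 - alpha_j), via a shifted cumprod.
    trans = torch.cumprod(1.0 - alphas + eps, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = alphas * trans                                 # (n_rays, n_pts, 1)
    colour = (weights * colours).sum(dim=1)                  # rendered colour
    depth = (weights * depths).sum(dim=1)                    # expected depth
    return colour, depth, weights
```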


2. Optimizing a basic implicit volume

2.1. Random ray sampling (5 points)


Please see the `get_random_pixels_from_image` method in `ray_utils.py` for this part (no output figure/visualization was asked for in this part).
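A minimal sketch of the idea (the actual signature in `ray_utils.py` may differ; the names below are illustrative assumptions):

```python
import torch

def sample_random_pixels(n_pixels, image_height, image_width):
    # Pick n_pixels pixel indices uniformly at random (without replacement)
    # and convert them to normalized [-1, 1] coordinates, matching the
    # convention used by the full xy grid.
    idx = torch.randperm(image_height * image_width)[:n_pixels]
    ys, xs = idx // image_width, idx % image_width
    x_norm = 2.0 * xs.float() / (image_width - 1) - 1.0
    y_norm = 2.0 * ys.float() / (image_height - 1) - 1.0
    return torch.stack([x_norm, y_norm], dim=-1)  # (n_pixels, 2)
```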

2.2. Loss and training (5 points)

Centre of box: (0.25, 0.25, -0.00)

Side lengths of box: (2.00, 1.50, 1.50)

(rounded to two decimal places)
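These parameters were recovered by rendering random rays through the current box estimate and minimizing an MSE loss against the ground-truth pixel colours. A rough sketch of one training step (the `renderer` call and the `"feature"` output key are assumptions about the starter code, not exact signatures):

```python
import torch

def training_step(box_model, renderer, sampler, ray_bundle, gt_colours, optimizer):
    # Render the sampled rays through the current implicit box volume and
    # regress the rendered colours against the ground-truth pixel colours.
    optimizer.zero_grad()
    pred_colours = renderer(sampler, box_model, ray_bundle)["feature"]
    loss = torch.nn.functional.mse_loss(pred_colours, gt_colours)
    loss.backward()
    optimizer.step()
    return loss.item()
```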

2.3. Visualization

Generated gif:

Learned GIF

Original gif provided in the assignment:

OG GIF

3. Optimizing a Neural Radiance Field (NeRF) (30 points)

Generated gif:

Learned GIF

Original gif provided in the assignment:

OG GIF

For the above, I have used the same network architecture as in the NeRF paper with the following changes:

  1. Reduced the number of layers in the initial part of the network from 8 to 5.
  2. Removed the view dependence by not concatenating the viewing-direction information (which, in the paper, is injected after the 8 fully connected layers), as mentioned in the question instructions.

Please refer to the code for full details of the architecture.
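As a rough sketch of the modified architecture (the hidden width, number of harmonic frequencies, and class/attribute names below are illustrative assumptions; the submitted code is authoritative):

```python
import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    # Harmonic (positional) embedding of the 3D point, a 5-layer trunk
    # instead of 8, and separate heads for density and view-independent colour.
    def __init__(self, n_freqs=6, hidden=128):
        super().__init__()
        self.n_freqs = n_freqs
        layers, d = [], 3 * 2 * n_freqs
        for _ in range(5):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        self.trunk = nn.Sequential(*layers)
        self.density_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())
        self.colour_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())

    def embed(self, x):
        # Sin/cos features at exponentially spaced frequencies.
        freqs = (2.0 ** torch.arange(self.n_freqs, device=x.device,
                                     dtype=x.dtype)) * torch.pi
        ang = x[..., None] * freqs                   # (..., 3, n_freqs)
        return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)

    def forward(self, points):
        h = self.trunk(self.embed(points))
        return self.density_head(h), self.colour_head(h)
```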

4. NeRF Extras (Choose at least one! More than one is extra credit)

4.1 View Dependence (10 pts)

For this part, I've used a 3-layer initial network (compared to the 5-layer initial network in part 3 of the question).

Learned View dependent GIF 3 layer

Using a 5-layer initial network, I obtain the following results:

Learned View dependent GIF 5 layer

There doesn't seem to be a significant difference between the view-dependent and view-independent predictions with my architecture. There are some subtle differences: the view-dependent predictions seem to capture specularity better, while the view-independent predictions seem slightly sharper, but I would not call these observations conclusive. In my opinion, a more detailed analysis with more sample viewpoints and other objects is required before commenting conclusively. (The authors of the paper report a difference in quantitative metrics between the view-dependent and view-independent variants, which seems like a good direction.)
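For concreteness, a minimal sketch of how the view dependence is added: the density depends only on the trunk features, while the colour head also sees an embedding of the viewing direction concatenated after the trunk. Dimensions and names below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ViewDependentHead(nn.Module):
    def __init__(self, hidden=128, dir_dim=24):
        super().__init__()
        self.density_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())
        # Colour head sees trunk features plus the embedded view direction.
        self.colour_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, trunk_features, dir_embedding):
        density = self.density_head(trunk_features)
        colour = self.colour_head(
            torch.cat([trunk_features, dir_embedding], dim=-1))
        return density, colour
```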

4.3 High Resolution Imagery (10 pts)

Running the NeRF model with the default nerf_lego_highres.yaml configuration and the same architecture as in part 3 yields the following visualization:

Learned High-res GIF

For hyperparameter tuning, I tried changing the n_pts_per_ray config parameter from 128 to 64, 256, and 512. With 512 points per ray, the model uses a lot of GPU memory and trains very slowly, so I could train it for only 25 epochs. It yields the following under-trained result:

Learned High-res GIF 512 pts_per_ray

For 64 n_pts_per_ray, I get the following result:

Learned High-res GIF 64 pts_per_ray

For 256 n_pts_per_ray, I get the following result:

Learned High-res GIF 256 pts_per_ray

Note that the visualization for 256 points per ray is considerably sharper than for 64 points per ray, and marginally sharper than for the model with 128 points per ray.
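As a rough sanity check on why the sample count matters, the spacing between adjacent samples along a ray scales as (far − near) / n_pts_per_ray. The near/far bounds below are the ones used in the original NeRF lego scene and are only an assumption about this config:

```python
# Assumed near/far bounds for illustration; the highres config may differ.
near, far = 2.0, 6.0
for n in (64, 128, 256, 512):
    print(f"n_pts_per_ray={n:4d}: sample spacing ~ {(far - near) / n:.4f} units")
# Finer spacing makes it less likely that thin structures fall between samples,
# at the cost of memory and compute that grow linearly with n.
```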

I also tried changing the number of layers in the initial part of the network, from 5 to 3 and to 7.

For 3 layers, I obtain the following result:

Learned High-res GIF 3 layer

For 7 layers, I obtain the following result:

Learned High-res GIF 7 layer

Here too, the results are sharper for the 7-layer network than for the 3-layer network, suggesting that the 7-layer network has more expressive power.
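A back-of-the-envelope parameter count supports this (the hidden width of 128 and the 36-dimensional embedded input are assumptions for illustration; see the code for the actual values):

```python
def trunk_params(n_layers, in_dim=36, hidden=128):
    # Weights + biases for a stack of equal-width fully connected layers.
    dims = [in_dim] + [hidden] * n_layers
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))

for n in (3, 5, 7):
    print(f"{n} layers: ~{trunk_params(n):,} trunk parameters")
# Each extra 128-wide layer adds roughly 128*128 + 128 ≈ 16.5K parameters,
# which is where the additional expressive power comes from.
```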