16-889 HW3, NeRF

0. Late Days used - 2



1.3



1.4


1.5



I had some issues here that I had to come back and fix later. Namely, I was pre-allocating a transmittance array and filling it in with dynamic programming. PyTorch does not like this, since it cannot propagate gradients through the in-place writes, so I instead built up a list of transmittance "rows", where each row is a torch.tensor holding the transmittance at a given sample depth for every ray. My final code was still quite simple:


    # Transmittance is 1 at the first sample along every ray
    trans_rows = [torch.ones_like(deltas[:, 0])]

    # DP for the next transmittance: T_i = T_{i-1} * exp(-sigma_{i-1} * delta_{i-1})
    for idx in range(1, deltas.shape[1]):
        row = trans_rows[idx - 1] * torch.exp(
            -rays_density[:, idx - 1] * deltas[:, idx - 1]
        )
        trans_rows.append(row)
    trans = torch.stack(trans_rows, dim=1)
    

Now, trans is an array of transmittances with the same shape as the distances (deltas) and densities, so it can be multiplied point-wise in a straightforward manner while retaining gradients.
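
To show how this plugs into compositing, here is a minimal sketch of the remaining step, assuming a rays_color tensor of per-sample colors with shape (n_rays, n_samples, 3) (the name is illustrative, not necessarily the one in my code): each sample's weight is its transmittance times its local opacity, and the ray's color is the weighted sum of the sample colors.


    # Sketch only: compositing with the transmittance computed above.
    # rays_color is an assumed (n_rays, n_samples, 3) tensor of per-sample colors.
    weights = trans * (1.0 - torch.exp(-rays_density * deltas))   # (n_rays, n_samples)
    color = torch.sum(weights.unsqueeze(-1) * rays_color, dim=1)  # (n_rays, 3)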

2.1 and 2.2

The code is one line for each task, so I'll just include it here. This is for 2.1, which randomly subsamples n_pixels rows of the (n × 2) input xy_grid.


    xy_grid_sub = xy_grid[torch.randperm(xy_grid.shape[0])[:n_pixels]]
    

The loss is MSE, which is straightforward to write with or without PyTorch's built-in helper.


    loss = functional.mse_loss(rgb_gt, out["feature"])
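
Written out by hand (assuming functional above is torch.nn.functional imported under that alias), the same loss is just the mean of the squared differences:


    # Equivalent hand-written MSE (mse_loss defaults to a mean reduction)
    loss = ((rgb_gt - out["feature"]) ** 2).mean()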
    

2.3

TA gif (from ta_images/).


My gif

I found that the first time you load the webpage, the gifs can be out of sync; if you refresh the page after the first load, they should be in sync.

3

I found that the original network from the paper collapses quite easily, outputting zero density everywhere. However, using the original values from the nerf_lego.yaml config for the hidden layers helps with this (i.e., 64 hidden units instead of 128, and 6 layers in total instead of 8).
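
For reference, here is a minimal sketch of the smaller network I'm describing, assuming a plain ReLU MLP over the harmonic embedding of the sample points (embedding_dim and the two heads are illustrative names, not my exact code):


    # Sketch only: 6 hidden layers of width 64, matching the config values above.
    # embedding_dim (the harmonic embedding width) is assumed to be defined.
    layers, in_dim = [], embedding_dim
    for _ in range(6):
        layers += [torch.nn.Linear(in_dim, 64), torch.nn.ReLU()]
        in_dim = 64
    mlp = torch.nn.Sequential(*layers)
    density_head = torch.nn.Sequential(torch.nn.Linear(64, 1), torch.nn.ReLU())
    color_head = torch.nn.Sequential(torch.nn.Linear(64, 3), torch.nn.Sigmoid())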

TA gif (from ta_images/).


My gif

Here are some still images from the same pose to help with evaluation. In each pair, the first image is the TA's and the second is mine. I have resized all images to 400 × 400 on the website so it's easier to spot any issues or differences.

4.1, standard

First, I'll show my gif, along with a few static shots from the same pose to help with evaluation. In each pair, the first shot is from my part 3 model and the second is from my part 4 model with view dependence.

To do this, I added an extra configuration flag, "with_view: True", to the config file; if the flag isn't present, it defaults to false. Setting the flag to true does two things: it changes the size of the linear layer to account for the concatenated features, and it does the following in the forward method:


        if self.with_view:
            # Repeat each ray's direction for every sample point along that ray
            directions = ray_bundle.directions.tile(
                (1, ray_bundle.sample_points.shape[1])
            ).view(-1, 3)
            directions = self.harmonic_embedding_dir(directions)
            point_features = torch.cat((point_features, directions), dim=1)
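
The other half of the change, resizing the linear layer, happens in the constructor. Roughly, it looks like the sketch below, where hidden_dim, dir_embedding_dim, and color_layer are illustrative names rather than my exact code: the input width of the color head grows by the width of the direction embedding whenever with_view is set.


    # Sketch only: constructor-side half of the change. hidden_dim,
    # dir_embedding_dim, and color_layer are illustrative names.
    in_dim = hidden_dim + (dir_embedding_dim if self.with_view else 0)
    self.color_layer = torch.nn.Sequential(
        torch.nn.Linear(in_dim, 3),
        torch.nn.Sigmoid(),
    )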
    

In the forward method, we duplicate the directions for all the points that correspond to that ray, pass them through our harmonic embedding, and concatenate the result to the point features. I think that rather than a trade-off, the real question is how realistic our images can look given the data we have. If we have few images, adding view dependence will over-fit to those few views and result in poorer image quality. However, if we have enough images, view dependence greatly improves quality by making the renders look more realistic, with specular highlights and more convincing shadows.

For example, look at the second pair of images. The truck's scoop has highlights on its interior, and its shadow is more defined, which makes it look better. However, if we didn't have enough images, the entire image's quality would be degraded, and it would be better to skip view dependence and accept a flatter picture.