16-889 Assignment 1: Rendering Basics with PyTorch3D

Naveen Venkat | Andrew ID: nvenkat


Table of contents

0. Setup
1. Practicing with Cameras
2. Practicing with Meshes
3. Re-texturing a mesh
4. Camera Transformations
5. Rendering Generic 3D Representations
6. Do Something Fun
7. (Extra Credit) Sampling Points on Meshes

0. Setup

The project has been tested on an MSI GE62VR 7RF Apache Pro laptop with the following specifications:

OS: Ubuntu 21.10
GPU: NVIDIA GTX 1060
CPU: Intel i7 7700HQ
RAM: 16 GB DDR4 3200MHz
Versions: NVIDIA Driver=470.86, CUDA=10.2, Python=3.9, PyTorch=1.9.0, torchvision=0.10.0
All packages installed in a conda environment

The setup process is as follows (most alternative installation routes ran into dependency issues).

1) NVIDIA Driver: 470.86

Installed from the Additional Drivers panel in Ubuntu's Software & Updates GUI tool.

2) CUDA and cuDNN installation from the anaconda channel

conda create -n torch3d python=3.9
conda activate torch3d
conda install -c anaconda cudatoolkit=10.2 cudnn

(Installing the CUDA toolkit from the pytorch channel runs into dependency issues, whereas installing it from the anaconda channel works immediately.)

3) PyTorch 1.9.0

conda install pytorch==1.9.0 torchvision==0.10.0 -c pytorch

4) Other dependencies, as listed in INSTALL.md

conda install -c fvcore -c iopath -c conda-forge fvcore iopath
conda install -c bottler nvidiacub

5) Installing pytorch3d

conda install pytorch3d -c pytorch3d

1. Practicing with Cameras

1.1. 360-degree Renders (5 points)

Solution. The mesh is rendered from an elevation of 0 degrees at a distance of 3 units. Note that texture is not rendered here, to stay consistent with the requirements of the problem statement (i.e., using render_mesh). Texture is, however, rendered in the following parts wherever applicable.

Rendered Output. By default a 256x256 image is rendered from a distance of 3 units and an elevation of 0 degrees.

p1_1_gif

Implementation Details. The function solution_1_1() in main.py acts as an entrypoint to render the GIF. The solution is implemented in the module solution.p1_1 and executing python3 -m solution.p1_1 --help will show more info. The default output image path is solution/p1_1.gif.
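
For reference, here is a minimal sketch of how such a turntable GIF can be produced with PyTorch3D and imageio (the mesh path, light position, and frame rate are illustrative; the actual defaults live in solution.p1_1):

    import imageio
    import torch
    from pytorch3d.io import load_objs_as_meshes
    from pytorch3d.renderer import (
        FoVPerspectiveCameras, HardPhongShader, MeshRasterizer, MeshRenderer,
        PointLights, RasterizationSettings, look_at_view_transform,
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    mesh = load_objs_as_meshes(["data/cow.obj"], device=device)  # illustrative path
    renderer = MeshRenderer(
        rasterizer=MeshRasterizer(raster_settings=RasterizationSettings(image_size=256)),
        shader=HardPhongShader(device=device, lights=PointLights(device=device, location=[[0.0, 0.0, -3.0]])),
    )

    frames = []
    for azim in range(0, 360, 10):
        # Orbit the object: fixed distance of 3 units and elevation of 0 degrees.
        R, T = look_at_view_transform(dist=3.0, elev=0.0, azim=azim)
        cameras = FoVPerspectiveCameras(R=R, T=T, device=device)
        image = renderer(mesh, cameras=cameras)[0, ..., :3].clamp(0, 1).cpu()
        frames.append((image.numpy() * 255).astype("uint8"))
    imageio.mimsave("solution/p1_1.gif", frames, fps=15)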

1.2 Re-creating the Dolly Zoom (10 points)

Solution. The dolly zoom effect is created by varying the field of view (FoV) and, accordingly, the distance to the object (foreground), such that the object stays centered at a similar size while the background shrinks (due to the increasing FoV, with the rendered image kept at the same size). The relation between the field of view and the camera distance is:

distance = half_width / tan(half_fov)

where half_width is half the width of the image, half_fov is half the field of view, and distance is the distance from the camera to the object. To begin with, half_width must be determined; it is kept constant throughout the transformation. It is taken as the half-width required to render the cow from the closest distance at FoV=120 degrees (as is apparent from the demo shown in the problem statement), which comes out to approximately 5 units.
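
A small sketch of the resulting schedule, assuming the estimated half_width of 5 units (the FoV sweep range and number of frames are illustrative):

    import torch

    half_width = 5.0                               # estimated as described above
    fovs = torch.linspace(5.0, 120.0, steps=30)    # FoV sweep, in degrees
    # distance = half_width / tan(half_fov), per the relation above
    distances = half_width / torch.tan(torch.deg2rad(fovs) / 2.0)

Each frame is then rendered with the corresponding (fov, distance) pair, which keeps the cow at a constant size while the background appears to change.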

Rendered Output.

dolly_solution_gif

Implementation Details. The function solution_1_2() in main.py acts as an entrypoint to render the GIF. The solution is implemented in the module starter.dolly_zoom and executing python3 -m starter.dolly_zoom --help will show more info. The default output image path is images/dolly_solution.gif.

2. Practicing with Meshes

2.1 Constructing a Tetrahedron (5 points)

Solution. A tetrahedron is constructed using 4 vertices and 4 corresponding triangular faces. The vertices and faces are defined as follows:

    vertices = [
                 [0.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0]
               ]
    faces =    [
                 [0, 1, 2],
                 [0, 2, 3],
                 [0, 3, 1],
                 [1, 3, 2]
               ]

In the implementation, the lengths of the sides are scaled, and the mesh is shifted by an appropriate distance according to the provided camera position (see default values in the implementation).
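
As a minimal sketch, the lists above can be turned into a renderable PyTorch3D mesh as follows (the uniform single-color vertex texture is illustrative, not the one used in the render below):

    import torch
    from pytorch3d.renderer import TexturesVertex
    from pytorch3d.structures import Meshes

    verts = torch.tensor(vertices, dtype=torch.float32)  # (4, 3), from the list above
    face_idx = torch.tensor(faces, dtype=torch.int64)    # (4, 3), from the list above
    colors = torch.ones_like(verts) * torch.tensor([0.7, 0.7, 1.0])  # one RGB color per vertex
    mesh = Meshes(verts=[verts], faces=[face_idx], textures=TexturesVertex(verts_features=[colors]))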

Rendered Output. The mesh is rendered from 6 different elevations in [-30, 30) and azimuth values in [0, 360).

p2_1_gif

Implementation Details. The function solution_2_1() in main.py acts as an entrypoint to render the GIF. The solution is implemented in the module solution.p2_1 and executing python3 -m solution.p2_1 --help will show more info. The default output image path is images/p2_1.gif.

2.2 Constructing a Cube (5 points)

Solution. A cube is constructed using 8 vertices and 6*2=12 corresponding triangular faces. The vertices and faces are defined as follows:

    vertices = [
                 [1.0, 0.0, 0.0],
                 [1.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 0.0],
                 [1.0, 1.0, 0.0],
                 [1.0, 1.0, 1.0],
                 [0.0, 1.0, 1.0],
                 [0.0, 1.0, 0.0],
               ]
    faces =    [
                 [2, 1, 0],
                 [0, 3, 2],
                 [5, 0, 1],
                 [5, 4, 0],
                 [2, 5, 1],
                 [2, 6, 5],
                 [6, 7, 5],
                 [7, 4, 5],
                 [7, 3, 0],
                 [4, 7, 0],
                 [3, 7, 2],
                 [7, 6, 2]
               ]

In the implementation, the lengths of the sides are scaled, and the mesh is shifted by an appropriate distance according to the provided camera position (see default values in the implementation).

Rendered Output. The mesh is rendered from 6 different elevations in [-30, 30) and azimuth values in [0, 360).

p2_2_gif

Implementation Details. The function solution_2_2() in main.py acts as an entrypoint to render the GIF. The solution is implemented in the module solution.p2_2 and executing python3 -m solution.p2_2 --help will show more info. The default output image path is images/p2_2.gif.

3. Re-texturing a mesh (10 points)

Solution. The cow mesh is retextured with colors ranging from color1=[0, 0, 1] (blue) to color2=[1, 0, 0] (red).

Suppose z_min and z_max denote the smallest and the largest z-coordinate of the mesh. Each vertex is assigned a color as:

    alpha = (z - z_min) / (z_max - z_min)
    color = alpha * color2 + (1 - alpha) * color1
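
A vectorized version of this recipe, assuming verts is the (V, 3) tensor of mesh vertices (e.g. from mesh.verts_packed()):

    import torch
    from pytorch3d.renderer import TexturesVertex

    color1 = torch.tensor([0.0, 0.0, 1.0])  # blue, assigned at z_min
    color2 = torch.tensor([1.0, 0.0, 0.0])  # red, assigned at z_max
    z = verts[:, 2]
    alpha = ((z - z.min()) / (z.max() - z.min())).unsqueeze(1)  # (V, 1) in [0, 1]
    colors = alpha * color2 + (1 - alpha) * color1              # (V, 3) blend
    textures = TexturesVertex(verts_features=[colors])          # per-vertex texture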

Rendered Output. The mesh is rendered from 6 different elevations in [-30, 30) and azimuth values in [0, 360).

cow_render_colored_gif

Implementation Details. The function solution_3() in main.py acts as an entrypoint to render the GIF. The solution is implemented in the module solution.p3 and executing python3 -m solution.p3 --help will show more info. The default output image path is images/cow_render_colored.gif.

4. Camera Transformations (20 points)

Solution. There are a few subtleties in this question; finding the relative rotation and translation matrices was challenging, primarily because the formulation T2 = R_rel @ T1 + T_rel is inconsistent with pytorch3d's right-multiply convention - it must be T2 = T1 @ R_rel + T_rel. The R_rel and T_rel used to obtain each of the given transformations with the provided starter code are listed below.
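
As a concrete illustration of composing transforms under this convention, here is a sketch for case 2 below (the identity base extrinsics with the camera 3 units away are an assumption of this example):

    import torch
    from pytorch3d.renderer import FoVPerspectiveCameras

    R_0 = torch.eye(3)                     # base rotation
    T_0 = torch.tensor([0.0, 0.0, 3.0])    # base translation (camera 3 units away)
    R_rel = torch.eye(3)                   # relative rotation (case 2 below)
    T_rel = torch.tensor([0.0, 0.0, 3.0])  # relative translation (case 2 below)

    # Right-multiply convention: R2 = R_0 @ R_rel and T2 = T_0 @ R_rel + T_rel.
    R = R_0 @ R_rel
    T = T_0 @ R_rel + T_rel
    cameras = FoVPerspectiveCameras(R=R[None], T=T[None])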

Rendered Output.

1)

R = [[0,  1,  0],
     [-1,  0,  0],
     [0,  0,  1]]
T = [0, 0, 0]

cow_transform_1

2)

R = [[1,  0,  0],
     [0,  1,  0],
     [0,  0,  1]]
T = [0, 0, 3]

cow_transform_2

3)

R = [[1,  0,  0],
     [0,  1,  0],
     [0,  0,  1]]
T = [0.5, -0.5, 0]

cow_transform_3

4)

R = [[0,   0,  1],
     [0,   1,  0],
     [-1,  0,  0]]
T = [-3, 0, 3]

cow_transform_4

Implementation Details. The function solution_4() in main.py acts as an entrypoint to render the images. The solution is implemented in the module starter.camera_transforms and executing python3 -m starter.camera_transforms --help will show more info. The default output image path is images/textured_cow.jpg_transform{idx}_rendered.jpg, where {idx} is the index above (1-4).

5. Rendering Generic 3D Representations

5.1 Rendering Point Clouds from RGB-D Images (10 points)

Solution. The point clouds are rendered below. The union of the two point clouds is obtained by concatenating the point coordinates recovered from the two images (see the sketch below).
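
The union itself is a plain concatenation; a sketch (points1/points2 and colors1/colors2 are hypothetical names for the per-image coordinates and RGB values recovered by unprojection):

    import torch
    from pytorch3d.structures import Pointclouds

    # points1, points2: (N1, 3) / (N2, 3) world coordinates from the two images;
    # colors1, colors2: matching (Ni, 3) RGB values (hypothetical variables).
    points = torch.cat([points1, points2], dim=0)
    colors = torch.cat([colors1, colors2], dim=0)
    union_cloud = Pointclouds(points=[points], features=[colors])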

Rendered Output.

pc1 (First image point cloud)

pc2 (Second image point cloud)

pc3 (Union of the first two point clouds)

Implementation Details. The function solution_5_1() in main.py acts as an entrypoint to render the images. The solution is implemented in the module starter.render_generic and executing python3 -m starter.render_generic --help will show more info. The default output image path is images/pc{idx}.gif, where {idx} is the index above (1-3).

5.2 Parametric Functions (10 points)

Solution. The sampled coordinates of points on the surface of a torus are given by:

  x = (R + r * cos(theta)) * cos(phi)
  y = (R + r * cos(theta)) * sin(phi)
  z = r * sin(theta)

where R and r are the major and minor radii, and theta, phi are sampled in [0, 360) degrees.
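
A sketch of the sampling (200 points matches the default mentioned below; the radii are illustrative):

    import math
    import torch

    N, R, r = 200, 1.0, 0.5              # illustrative major/minor radii
    theta = torch.rand(N) * 2 * math.pi  # sampled in [0, 360) degrees
    phi = torch.rand(N) * 2 * math.pi
    x = (R + r * torch.cos(theta)) * torch.cos(phi)
    y = (R + r * torch.cos(theta)) * torch.sin(phi)
    z = r * torch.sin(theta)
    points = torch.stack([x, y, z], dim=1)
    colors = (points - points.min()) / (points.max() - points.min())  # normalized coordinates as colors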

Rendered Output. The result is as follows (point colors are obtained as the normalized 3D coordinates).

torus_gif

Implementation Details. The function solution_5_2() in main.py acts as an entrypoint to render the images. The solution is implemented in the module starter.render_generic and executing python3 -m starter.render_generic --help will show more info. The default output image path is images/torus.gif. By default, 200 points are sampled and the GIF is rendered from elevations in [-30, 30] interval.

5.3 Implicit Surfaces (15 points)

Solution. The implicit equation (iso-surface) of a torus is given by:

  F(x, y, z) = (sqrt(x**2 + y**2) - R)**2 + z**2 - r**2

where R and r are the major and minor radii, and (x, y, z) is a 3D spatial coordinate. The surface of the torus is the iso-surface F(x, y, z) = 0.
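
A sketch of extracting this iso-surface with marching cubes, assuming the PyMCubes package (the radii and grid extent are illustrative):

    import mcubes  # PyMCubes, assumed available
    import torch

    R, r = 1.0, 0.5
    grid = torch.linspace(-1.6, 1.6, 64)
    X, Y, Z = torch.meshgrid(grid, grid, grid)
    F = (torch.sqrt(X ** 2 + Y ** 2) - R) ** 2 + Z ** 2 - r ** 2
    verts, faces = mcubes.marching_cubes(F.numpy(), 0.0)  # triangulate the F = 0 surface
    verts = verts * (3.2 / 63) - 1.6                      # voxel indices -> world coordinates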

Rendered Output. The torus mesh is rendered as follows (vertex colors are obtained as the normalized 3D coordinates)

torus_implicit_gif

Implementation Details. The function solution_5_3() in main.py acts as an entrypoint to render the images. The solution is implemented in the module starter.render_generic and executing python3 -m starter.render_generic --help will show more info. The default output image path is images/torus_implicit.gif. By default, the GIF is rendered from elevations in [-30, 30] interval.

Discussion. Trade-offs between rendering as a mesh vs. a point cloud: a mesh defines a continuous surface, so it renders without holes at any viewing distance and supports shading from surface normals, but it must first be extracted (e.g., via marching cubes over a voxel grid), which costs memory and compute that grow with the grid resolution. A point cloud is cheaper to construct (points can be sampled directly from the parametric or implicit function) and faster to render, but it is sparse - gaps appear when viewed up close unless many points are sampled.

6. Do Something Fun (10 points)

It would be fair to argue that we live in a structured 3D world, whereas typical cameras capture only a flat 2D projection of it. While clicking a picture, most of the 3D information is lost - which makes us want all the more to capture the perfect view. We're somehow mentally wired to already know what constitutes a perfect, informative picture.

Scenario. Suppose we're given a 3D representation of the world - can we build a systematic algorithm to determine the most informative views? Here's an interesting experiment along these lines.

Approach. Let's assume a simple representation of the world - a 3D mesh of a cow at the origin. Suppose we view the mesh using the following transform:

  R, T = look_at_view_transform(dist, elev, azim)

i.e., viewing the origin from a distance dist, elevation elev and azimuth azim.

Now consider perturbing our view by a small magnitude along each of the three degrees of freedom (d_dist, d_elev, d_azim). The idea is to measure how much our "rendered view" changes under this perturbation. The more complex the current view (dist, elev, azim), the larger the change caused by the perturbation - and in this sense, the more informative the view.

Suppose we compute the change in the rendered view due to each of the three perturbations (d_dist, d_elev, d_azim):

  d_img_d_dist = (render(dist-d_dist, elev, azim) - render(dist+d_dist, elev, azim)).norm()
  d_img_d_elev = (render(dist, elev-d_elev, azim) - render(dist, elev+d_elev, azim)).norm()
  d_img_d_azim = (render(dist, elev, azim-d_azim) - render(dist, elev, azim+d_azim)).norm()

We now have three values G = (d_img_d_dist, d_img_d_elev, d_img_d_azim) that indicate a change in the rendered view caused by small changes in each of the parameters. A norm of the vector G would indicate the total change in the rendered view due to this perturbation.

(Note: this is similar to computing the norm of the gradient of the image with respect to the viewing parameters dist, elev, azim.)
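
Putting the three central differences together (render(dist, elev, azim) is a hypothetical helper that returns the rendered image as a tensor):

    import torch

    def view_score(render, dist, elev, azim, d_dist=0.1, d_elev=5.0, d_azim=5.0):
        # Central differences of the rendered image w.r.t. each view parameter.
        g = torch.stack([
            (render(dist - d_dist, elev, azim) - render(dist + d_dist, elev, azim)).norm(),
            (render(dist, elev - d_elev, azim) - render(dist, elev + d_elev, azim)).norm(),
            (render(dist, elev, azim - d_azim) - render(dist, elev, azim + d_azim)).norm(),
        ])
        return g.norm()  # total change in the rendered view, ||G||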

If our current view from (dist, elev, azim) contains many visible complexities, we end up with a higher value of G.

This is a very crude measure of information - taking a difference in image space can only capture changes in information up to a 2D projection. Nevertheless, it is one starting point. Let's look at how a heatmap of G looks from different views.

Visualizing information. We compute G over a grid of dist=[1.5, 3), elevation=[-89, 89), azim=[0, 360), quantized with steps d_dist=0.1, d_elev=5.0, d_azim=5.0. At each point in this grid, G is calculated as above, giving a 3D matrix that stores the G-value at each combination of viewing parameters. Searching the entire grid takes about 1.5 hours on a single GTX 1060.

For the first position dist=1.5, the heatmap of G as seen across elevation=[-89, 89), azim=[0, 360) is as follows:

drawing

(x-axis corresponds to 72 azimuth values in [0, 360) degrees, and y-axis corresponds to 36 elevation values in [-89, 89) degrees)

Notice how the heatmap has high activations towards azim=0 (equivalently, azim=360) and around elev=-30. At these settings we can see the legs of the cow, along with some textural information that changes when this view is perturbed.

hmpng

Here's a plot of the heatmap of G over distances. The following GIF shows the 2D heatmap above over different values of dist=[1.5, 3), quantized in steps of d_dist=0.1.

drawing

Note that the 2D grid has been rolled to get azim=0 at the center (for better visualization).

As the distance dist increases, the heatmap becomes smoother and less peaked around particular views. This is because the rendered size of the mesh decreases with distance, so the changes in the rendered image become small (we take the magnitude of the difference between rendered images). This is expected - a better approach would be to normalize the difference by a bounding box around the object.

Finally, we see that the heatmap for dist=1.5 shown above looks interesting - what if we plot it on a sphere? Will we see anything special?

cos1

cos2

cos3

Interestingly enough, it seems to look like a view of a cow facing the camera - or so would I like to believe!

Implementation Details. The function solution_6() in main.py acts as an entrypoint. The solution is implemented in the module solution.p6 and executing python3 -m solution.p6 --help will show more info.

(Extra Credit) 7. Sampling Points on Meshes (10 points)

Solution. The algorithm to generate a point cloud from a mesh is area-weighted face sampling (a sketch follows the list):

1) Compute the area of each triangular face of the mesh.
2) Sample faces with probability proportional to their area.
3) For each sampled face, draw uniform random barycentric coordinates and take the corresponding point inside the triangle.
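
A minimal sketch of these steps (verts and faces are assumed to be the (V, 3) and (F, 3) tensors of the mesh, as in problems 2.1/2.2):

    import torch

    def sample_points_on_mesh(verts, faces, n_points):
        v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
        # 1) Face areas from the cross product of two edge vectors.
        areas = 0.5 * torch.cross(v1 - v0, v2 - v0, dim=1).norm(dim=1)
        # 2) Choose faces with probability proportional to their area.
        idx = torch.multinomial(areas, n_points, replacement=True)
        # 3) Uniform barycentric coordinates via the square-root trick.
        u = torch.sqrt(torch.rand(n_points, 1))
        v = torch.rand(n_points, 1)
        return (1 - u) * v0[idx] + u * (1 - v) * v1[idx] + u * v * v2[idx]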

Rendered Output. The point clouds are rendered in a 360-degree view with elevations in the [-30, 30) degree interval.

10 points:

point_cloud_sampled_10 cow_mesh

100 points:

point_cloud_sampled_100 cow_mesh

1000 points:

point_cloud_sampled_1000 cow_mesh

10000 points:

point_cloud_sampled_10000 cow_mesh

Implementation Details. The function solution_7() in main.py acts as an entrypoint to render the GIFs. The solution is implemented in the module solution.p7 and executing python3 -m solution.p7 --help will show more info. The default output image paths are images/point_cloud_sampled_{n}.gif, where {n} is the number of sampled points. By default, the GIFs are rendered from elevations in the [-30, 30) interval.