16-889 Assignment 1: Rendering Basics with PyTorch3D
Naveen Venkat | Andrew ID: nvenkat
Table of contents
- 0. Setup
- 1. Practicing with Cameras
- 2. Practicing with Meshes
- 3. Re-texturing a mesh (10 points)
- 4. Camera Transformations (20 points)
- 5. Rendering Generic 3D Representations
- 6. Do Something Fun (10 points)
- (Extra Credit) 7. Sampling Points on Meshes (10 points)
0. Setup
The project has been tested on an MSI GE62VR 7RF Apache Pro laptop with the following specifications:
OS: Ubuntu 21.10
GPU: NVIDIA GTX 1060
CPU: Intel i7 7700HQ
RAM: 16 GB DDR4 3200MHz
Versions: NVIDIA Driver=470.86, CUDA=10.2, Python=3.9, PyTorch=1.9.0, torchvision=0.10.0
Using a conda environment
The setup process is as follows (most alternative approaches led to dependency issues).
1) NVIDIA Driver: 470.86
Installed from the Additional Drivers panel in Ubuntu's Software & Updates GUI tool.
2) CUDA and CUDNN installation from anaconda channel
conda create -n torch3d python=3.9
conda activate torch3d
conda install -c anaconda cudatoolkit=10.2 cudnn
(The CUDA installation from the pytorch channel leads to dependency issues, but the one from the anaconda channel works right away.)
3) PyTorch 1.9.0
conda install pytorch==1.9.0 torchvision==0.10.0 -c pytorch
4) Other dependencies, as listed in INSTALL.md
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
conda install -c bottler nvidiacub
5) Installing pytorch3d
conda install pytorch3d -c pytorch3d
1. Practicing with Cameras
1.1. 360-degree Renders (5 points)
Solution. The mesh is rendered at an elevation of 0 degrees and a distance of 3 units. Note that texture is not rendered here, to maintain consistency with the requirements of the problem statement (i.e., using render_mesh). However, texture is rendered in the following parts wherever applicable.
Rendered Output. By default a 256x256 image is rendered from a distance of 3 units and an elevation of 0 degrees.
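The render loop itself is straightforward. Below is a minimal sketch (not the exact implementation), assuming a pytorch3d MeshRenderer renderer and the loaded cow mesh already placed on device:
import imageio
import numpy as np
import torch
from pytorch3d.renderer import FoVPerspectiveCameras, PointLights, look_at_view_transform

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lights = PointLights(location=[[0.0, 0.0, -3.0]], device=device)

frames = []
for azim in range(0, 360, 10):
    # Turntable camera: fixed distance and elevation, sweeping azimuth
    R, T = look_at_view_transform(dist=3.0, elev=0.0, azim=azim)
    cameras = FoVPerspectiveCameras(R=R, T=T, device=device)
    image = renderer(mesh, cameras=cameras, lights=lights)  # (1, H, W, 4)
    frames.append((image[0, ..., :3].cpu().numpy() * 255).astype(np.uint8))

imageio.mimsave("solution/p1_1.gif", frames, fps=15)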
Implementation Details. The function solution_1_1() in main.py acts as an entrypoint to render the GIF. The solution is implemented in the module solution.p1_1, and executing python3 -m solution.p1_1 --help will show more info. The default output image path is solution/p1_1.gif.
1.2 Re-creating the Dolly Zoom (10 points)
Solution. The dolly zoom effect is created by varying the field of view (FoV) and, accordingly, the distance to the foreground object, such that the object always stays centered and of similar size while the background appears to shrink (due to the increasing FoV at a fixed rendered image size). The relation between the field of view and the camera distance is:
distance = half_width / tan(half_fov)
where half_width is half the width of the image, half_fov is half the field of view, and distance is the distance from the camera to the object. To begin with, half_width must be determined; it is kept constant throughout the transformation. It is identified as the half-width required to render the cow from the closest distance given FoV=120 degrees (as is apparent from the demo in the problem statement), which comes out to approximately 5 units.
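As a minimal sketch of this schedule (the sweep range and frame count are illustrative, not necessarily the implementation's defaults), each frame's distance follows directly from the relation above:
import numpy as np

half_width = 5.0                           # estimated as described above
fovs = np.linspace(5, 120, 30)             # field of view per frame, in degrees
# distance = half_width / tan(half_fov), so the object keeps roughly the same size
distances = half_width / np.tan(np.radians(fovs) / 2.0)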
Rendered Output.
Implementation Details. The function solution_1_2() in main.py acts as an entrypoint to render the GIF. The solution is implemented in the module starter.dolly_zoom, and executing python3 -m starter.dolly_zoom --help will show more info. The default output image path is images/dolly_solution.gif.
2. Practicing with Meshes
2.1 Constructing a Tetrahedron (5 points)
Solution. A tetrahedron is constructed using 4 vertices and 4 corresponding triangular faces. The vertices and faces are defined as follows:
vertices = [
[0.0, 0.0, 0.0],
[0.0, 0.0, 1.0],
[1.0, 0.0, 0.0],
[0.0, 1.0, 0.0]
]
faces = [
[0, 1, 2],
[0, 2, 3],
[0, 3, 1],
[1, 3, 2]
]
In the implementation, the lengths of the sides are scaled, and the mesh is shifted by an appropriate distance according to the provided camera position (see default values in the implementation).
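As a minimal sketch (assuming the pytorch3d API and an arbitrary uniform per-vertex color), the mesh can be assembled from the lists above as follows:
import torch
from pytorch3d.structures import Meshes
from pytorch3d.renderer import TexturesVertex

verts = torch.tensor(vertices, dtype=torch.float32)   # (4, 3) vertex positions
faces_t = torch.tensor(faces, dtype=torch.int64)      # (4, 3) triangular faces
# Uniform gray per-vertex color, purely illustrative
textures = TexturesVertex(verts_features=0.7 * torch.ones_like(verts)[None])
tetrahedron = Meshes(verts=[verts], faces=[faces_t], textures=textures)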
Rendered Output. The mesh is rendered from 6 different elevations in [-30, 30) degrees and azimuth values in [0, 360) degrees.
Implementation Details. The function solution_2_1() in main.py acts as an entrypoint to render the GIF. The solution is implemented in the module solution.p2_1, and executing python3 -m solution.p2_1 --help will show more info. The default output image path is images/p2_1.gif.
2.2 Constructing a Cube (5 points)
Solution. A cube is constructed using 8 vertices and 6*2=12 corresponding triangular faces. The vertices and faces are defined as follows:
vertices = [
[1.0, 0.0, 0.0],
[1.0, 0.0, 1.0],
[0.0, 0.0, 1.0],
[0.0, 0.0, 0.0],
[1.0, 1.0, 0.0],
[1.0, 1.0, 1.0],
[0.0, 1.0, 1.0],
[0.0, 1.0, 0.0],
]
faces = [
[2, 1, 0],
[0, 3, 2],
[5, 0, 1],
[5, 4, 0],
[2, 5, 1],
[2, 6, 5],
[6, 7, 5],
[7, 4, 5],
[7, 3, 0],
[4, 7, 0],
[3, 7, 2],
[7, 6, 2]
]
In the implementation, the lengths of the sides are scaled, and the mesh is shifted by an appropriate distance according to the provided camera position (see default values in the implementation).
Rendered Output. The mesh is rendered from 6 different elevations in [-30, 30) degrees and azimuth values in [0, 360) degrees.
Implementation Details. The function solution_2_2() in main.py acts as an entrypoint to render the GIF. The solution is implemented in the module solution.p2_2, and executing python3 -m solution.p2_2 --help will show more info. The default output image path is images/p2_2.gif.
3. Re-texturing a mesh (10 points)
Solution. The cow mesh is re-textured with colors ranging from color1=[0, 0, 1] (blue) to color2=[1, 0, 0] (red). Suppose z_min and z_max denote the smallest and largest z-coordinates of the mesh. Each vertex is assigned a color as:
alpha = (z - z_min) / (z_max - z_min)
color = alpha * color2 + (1 - alpha) * color1
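A minimal sketch of this coloring, assuming the cow mesh is loaded from data/cow.obj with pytorch3d:
import torch
from pytorch3d.io import load_obj
from pytorch3d.structures import Meshes
from pytorch3d.renderer import TexturesVertex

verts, faces, _ = load_obj("data/cow.obj")
faces_idx = faces.verts_idx

z = verts[:, 2]
alpha = ((z - z.min()) / (z.max() - z.min()))[:, None]   # (V, 1) in [0, 1]
color1 = torch.tensor([0.0, 0.0, 1.0])                   # blue at z_min
color2 = torch.tensor([1.0, 0.0, 0.0])                   # red at z_max
colors = alpha * color2 + (1 - alpha) * color1            # (V, 3) per-vertex colors

mesh = Meshes(verts=[verts], faces=[faces_idx],
              textures=TexturesVertex(verts_features=colors[None]))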
Rendered Output. The mesh is rendered from 6 different elevations in [-30, 30) degrees and azimuth values in [0, 360) degrees.
Implementation Details. The function solution_3() in main.py acts as an entrypoint to render the GIF. The solution is implemented in the module solution.p3, and executing python3 -m solution.p3 --help will show more info. The default output image path is images/cow_render_colored.gif.
4. Camera Transformations (20 points)
Solution. There are a few subtleties in this question, and it was challenging to find the relative rotation and translation matrices, primarily because the formulation T2 = R_rel @ T1 + T_rel is inconsistent with pytorch3d's (right-multiply) convention; it must instead be T2 = T1 @ R_rel + T_rel. Nevertheless, the R_rel and T_rel used to obtain the given transformations with the provided starter code are listed below.
Rendered Output.
1)
R = [[0, 1, 0],
[-1, 0, 0],
[0, 0, 1]]
T = [0, 0, 0]
2)
R = [[1, 0, 0],
[0, 1, 0],
[0, 0, 1]]
T = [0, 0, 3]
3)
R = [[1, 0, 0],
[0, 1, 0],
[0, 0, 1]]
T = [0.5, -0.5, 0]
4)
R = [[0, 0, 1],
[0, 1, 0],
[-1, 0, 0]]
T = [-3, 0, 3]
Implementation Details. The function solution_4() in main.py acts as an entrypoint to render the images. The solution is implemented in the module starter.camera_transforms, and executing python3 -m starter.camera_transforms --help will show more info. The default output image path is images/textured_cow.jpg_transform{idx}_rendered.jpg, where {idx} is the index above (1-4).
5. Rendering Generic 3D Representations
5.1 Rendering Point Clouds from RGB-D Images (10 points)
Solution. The point clouds are rendered below. The union point cloud is formed by taking the union of the point coordinates (and colors) obtained from the two images.
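A minimal sketch of forming the union, assuming points1/rgb1 and points2/rgb2 are the unprojected (N, 3) coordinates and colors from the two RGB-D images (names are illustrative):
import torch
from pytorch3d.structures import Pointclouds

# Concatenate the per-image points and colors, then wrap them in a single point cloud
points_union = torch.cat([points1, points2], dim=0)
rgb_union = torch.cat([rgb1, rgb2], dim=0)
pc_union = Pointclouds(points=[points_union], features=[rgb_union])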
Rendered Output.
(First image point cloud)
(Second image point cloud)
(Union of the first 2 point clouds)
Implementation Details. The function solution_5_1() in main.py acts as an entrypoint to render the images. The solution is implemented in the module starter.render_generic, and executing python3 -m starter.render_generic --help will show more info. The default output image path is images/pc{idx}.gif, where {idx} is the index above (1-3).
5.2 Parametric Functions (10 points)
Solution. The sampled coordinates of points on the surface of a torus are given by:
x = (R + r * cos(theta)) * cos(phi)
y = (R + r * cos(theta)) * sin(phi)
z = r * sin(theta)
where R and r correspond to the major and minor radii, and theta, phi are sampled in [0, 360) degrees.
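A minimal sketch of the sampling, with illustrative radii (R=1, r=0.5) and random uniform samples of theta and phi (the implementation may instead use a regular grid):
import math
import torch

R, r, n = 1.0, 0.5, 200
theta = torch.rand(n) * 2 * math.pi
phi = torch.rand(n) * 2 * math.pi

x = (R + r * torch.cos(theta)) * torch.cos(phi)
y = (R + r * torch.cos(theta)) * torch.sin(phi)
z = r * torch.sin(theta)
points = torch.stack([x, y, z], dim=1)                              # (n, 3)
# Normalized coordinates reused as point colors, as in the rendered output
colors = (points - points.min()) / (points.max() - points.min())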
Rendered Output. The rendered output is as follows (point colors are obtained as the normalized 3D coordinates)
Implementation Details. The function solution_5_2() in main.py acts as an entrypoint to render the images. The solution is implemented in the module starter.render_generic, and executing python3 -m starter.render_generic --help will show more info. The default output image path is images/torus.gif. By default, 200 points are sampled and the GIF is rendered from elevations in the [-30, 30] interval.
5.3 Implicit Surfaces (15 points)
Solution. The implicit equation (iso-surface) of a torus is given by:
F(x, y, z) = (sqrt(x**2 + y**2) - R)**2 + z**2 - r**2
where R, r are the major and minor radii, and (x, y, z) is a spatial 3D coordinate. The surface of the torus is represented by the iso-surface F(x, y, z) = 0.
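A minimal sketch of extracting this iso-surface with marching cubes, assuming the PyMCubes package and illustrative radii and grid extent:
import numpy as np
import mcubes

R, r = 1.0, 0.5
grid = np.linspace(-1.6, 1.6, 64)
X, Y, Z = np.meshgrid(grid, grid, grid, indexing="ij")
F = (np.sqrt(X**2 + Y**2) - R) ** 2 + Z**2 - r**2

# Extract the F = 0 iso-surface; vertices are returned in voxel-index units
verts, faces = mcubes.marching_cubes(F, 0)
# Map voxel indices back to world coordinates
verts = verts / (len(grid) - 1) * (grid.max() - grid.min()) + grid.min()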
Rendered Output. The torus mesh is rendered as follows (vertex colors are obtained as the normalized 3D coordinates)
Implementation Details. The function solution_5_3() in main.py acts as an entrypoint to render the images. The solution is implemented in the module starter.render_generic, and executing python3 -m starter.render_generic --help will show more info. The default output image path is images/torus_implicit.gif. By default, the GIF is rendered from elevations in the [-30, 30] interval.
Discussion. Trade-offs between rendering as a mesh vs. a point cloud.
- Rendering Speed: It is faster to render a point cloud, since a simple perspective projection of the points onto the image plane yields the image (ignoring, of course, that we plot tiny spheres instead of ideal points). Meshes, on the other hand, are typically rendered by rasterizing the faces - identifying the nearest face along each pixel's ray and shading its texture onto the image plane - which is more compute-intensive.
- Rendering Quality: Meshes provide the best quality as they store the connectivity information between vertices (informally, meshes can fill the gaps that point clouds can't). We can also incorporate environment information such as lighting while rendering meshes. Point clouds, on the other hand, are a very crude approximation of 3D structure. They must be rendered as tiny spheres, which not only introduces gaps in the rendered image but also makes the image less aesthetically pleasing (blurred edges, non-smooth textures, etc.).
- Ease of use: Point clouds are easier to use in many contexts - learning to predict shapes with neural networks, learning arbitrary manifolds (even those that change their connectivity), etc. Meshes, on the other hand, introduce rigidity (e.g., connectivity must be fixed throughout optimization), but are useful when we want to learn accurate structures (e.g., when a parametric function maps a spherical mesh onto a target mesh).
- Memory usage: Structurally, point clouds take less memory because they only represent vertices (and, additionally, texture information), whereas meshes store the connectivity information in addition to the vertices (and textures).
6. Do Something Fun (10 points)
It would be fair to argue that we live in a structured 3D world, whereas a typical camera captures only a 2D projection of it. When taking a picture, most of the 3D information is lost, which makes us want even more to capture the perfect view. We are somehow mentally wired to already know what constitutes a perfectly informative picture.
Scenario. Suppose we are given a 3D representation of the world: can we build a systematic algorithm to determine the most informative views? Here is an interesting experiment along these lines.
Approach. Let's assume a simple representation of the world - a 3D mesh of a cow at the origin. Suppose we view the mesh using the following view:
R, T = look_at_view_transform(dist, elev, azim)
i.e., viewing the origin from a distance dist, elevation elev and azimuth azim.
Now consider a perturbation of our view with some small magnitude along each of the three degrees of freedom (d_dist, d_elev, d_azim) (i.e., a small change in each of the parameters). The idea is to measure the amount of change in our "rendered view" caused by this perturbation. The more complexity visible in our current view (dist, elev, azim), the larger the change caused by this perturbation, and in this sense the view is more informative.
Suppose we compute this change in the rendered view due to each of the three changes (d_dist, d_elev, d_azim):
d_img_d_dist = (render(dist-d_dist, elev, azim) - render(dist+d_dist, elev, azim)).norm()
d_img_d_elev = (render(dist, elev-d_elev, azim) - render(dist, elev+d_elev, azim)).norm()
d_img_d_azim = (render(dist, elev, azim-d_azim) - render(dist, elev, azim+d_azim)).norm()
We now have three values G = (d_img_d_dist, d_img_d_elev, d_img_d_azim) that indicate the change in the rendered view caused by small changes in each of the parameters. The norm of the vector G indicates the total change in the rendered view due to this perturbation. (Note that this is similar to computing the norm of the gradient of the image with respect to the viewing parameters dist, elev, azim.) If our current view from (dist, elev, azim) contains a lot of visible complexity, we end up with a higher value of G.
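A minimal sketch of this computation, assuming a pytorch3d mesh renderer renderer and the cow mesh (the helper names here are illustrative, not the implementation):
import torch
from pytorch3d.renderer import FoVPerspectiveCameras, look_at_view_transform

def render_view(dist, elev, azim):
    # Render the mesh from the given viewpoint and return an (H, W, 3) image tensor
    R, T = look_at_view_transform(dist=dist, elev=elev, azim=azim)
    cameras = FoVPerspectiveCameras(R=R, T=T, device=mesh.device)
    return renderer(mesh, cameras=cameras)[0, ..., :3]

def info_score(dist, elev, azim, d_dist=0.1, d_elev=5.0, d_azim=5.0):
    # Central differences of the rendered image along each viewing parameter
    d_img_d_dist = (render_view(dist - d_dist, elev, azim) - render_view(dist + d_dist, elev, azim)).norm()
    d_img_d_elev = (render_view(dist, elev - d_elev, azim) - render_view(dist, elev + d_elev, azim)).norm()
    d_img_d_azim = (render_view(dist, elev, azim - d_azim) - render_view(dist, elev, azim + d_azim)).norm()
    # Norm of G = (d_img_d_dist, d_img_d_elev, d_img_d_azim): total change due to the perturbation
    return torch.stack([d_img_d_dist, d_img_d_elev, d_img_d_azim]).norm()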
This is a very crude notion of information - taking a difference in image space can only measure change in information up to a 2D projection. Nevertheless, it is one starting point. Let's look at how a heatmap of G looks from different views.
Visualizing information. We compute G over a grid of dist=[1.5, 3), elevation=[-89, 89), azim=[0, 360), quantized with d_dist=0.1, d_elev=5.0, d_azim=5.0. At each point in this grid, G is calculated as above, giving a 3D matrix that stores a heatmap of G-values for each combination of viewing parameters. It takes about 1.5 hours on a single GTX 1060 to search the entire grid.
For the first position dist=1.5, the heatmap of G as seen across elevation=[-89, 89) and azim=[0, 360) is as follows:

(The x-axis corresponds to 72 azimuth values in [0, 360) degrees, and the y-axis corresponds to 36 elevation values in [-89, 89) degrees.)
Notice how the heatmap has high activations towards azim=0 (equivalently, azim=360) and around elev=-30. At these settings, we are able to see the legs of the cow, along with some textural information that changes when this view is perturbed.
Here's a plot of the heatmap of G over distances. The following GIF shows the 2D heatmap as above, over different values of dist=[1.5, 3) quantized in steps of d_dist=0.1.

Note that the 2D grid has been rolled to place azim=0 at the center (for better visualization).
As the distance dist increases, the heatmap becomes smoother - less peaked around certain views. This is because the size of the mesh in the rendered view keeps decreasing as the distance increases, so the changes in the rendered image are small (since we take the magnitude of the difference between rendered images). This is expected - a better approach would be to normalize this difference using a bounding box around the object.
Finally, the heatmap for dist=1.5 shown above looks interesting - what if we plot it on a sphere? Will we see anything special?
Interestingly enough, it seems to look like a view of a cow facing the camera - or so I would like to believe!
Implementation Details. The function solution_6() in main.py acts as an entrypoint. The solution is implemented in the module solution.p6, and executing python3 -m solution.p6 --help will show more info.
(Extra Credit) 7. Sampling Points on Meshes (10 points)
Solution. The algorithm to generate a point cloud given a mesh is as follows:
- Select a face with probability proportional to the area of the face. The sampling probability is given as:
p[i] = area[i] / area.sum()
- Sample a random barycentric coordinate uniformly and compute the point. Uniform sampling is done as in the lecture (see the sketch below).
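A minimal sketch of these two steps, assuming a pytorch3d Meshes object mesh:
import torch

def sample_points_on_mesh(mesh, n_points):
    verts = mesh.verts_packed()                      # (V, 3)
    faces = mesh.faces_packed()                      # (F, 3)
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]

    # Step 1: sample faces with probability p[i] = area[i] / area.sum()
    areas = 0.5 * torch.cross(v1 - v0, v2 - v0, dim=1).norm(dim=1)
    face_idx = torch.multinomial(areas / areas.sum(), n_points, replacement=True)

    # Step 2: uniform barycentric coordinates via the square-root reparameterization
    u, v = torch.rand(n_points), torch.rand(n_points)
    su = u.sqrt()
    w0, w1, w2 = 1.0 - su, su * (1.0 - v), su * v
    return (w0[:, None] * v0[face_idx]
            + w1[:, None] * v1[face_idx]
            + w2[:, None] * v2[face_idx])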
Rendered Output. The point clouds are rendered in a 360-degree view with elevations in the [-30, 30) degree interval.
10 points:
100 points:
1000 points:
10000 points:
Implementation Details. The function solution_5_3() in main.py acts as an entrypoint to render the images. The solution is implemented in the module starter.camera_transforms, and executing python3 -m starter.camera_transforms --help will show more info. The default output image path is images/torus_implicit.gif. By default, the GIF is rendered from elevations in the [-30, 30] interval.