image

Name: Chonghyuk Song

Collaborators: I discussed Q4 with MSR student Nikhil Bakshi.

1.1 360-degree Renders (5 points)

image

1.2 Re-creating the Dolly Zoom (10 points)

image

2.1 Constructing a Tetrahedron (5 points)

image

There are 4 vertices and 4 triangular faces in this mesh.
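For reference, a minimal sketch of how such a tetrahedron mesh could be assembled in PyTorch3D (the vertex coordinates and single-color texture here are illustrative, not necessarily the ones used for the render above):

import torch
import pytorch3d.renderer
import pytorch3d.structures

# Hypothetical tetrahedron: 4 vertices and 4 triangular faces.
vertices = torch.tensor([
    [ 1.0,  1.0,  1.0],
    [-1.0, -1.0,  1.0],
    [-1.0,  1.0, -1.0],
    [ 1.0, -1.0, -1.0],
])
faces = torch.tensor([
    [0, 1, 2],
    [0, 3, 1],
    [0, 2, 3],
    [1, 3, 2],
], dtype=torch.int64)

# Uniform gray vertex colors so the mesh can be rendered directly.
textures = pytorch3d.renderer.TexturesVertex(
    verts_features=0.7 * torch.ones_like(vertices).unsqueeze(0)
)
mesh = pytorch3d.structures.Meshes(
    verts=vertices.unsqueeze(0), faces=faces.unsqueeze(0), textures=textures
)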

2.2 Constructing a Cube (5 points)

image

There are 8 vertices and 12 triangular faces in this mesh.
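For concreteness, a sketch of one possible vertex/face layout (an axis-aligned cube centered at the origin, chosen purely for illustration); the 12 comes from splitting each of the 6 square faces into 2 triangles:

import torch

# Hypothetical cube; the submitted mesh may use different coordinates,
# but the vertex and face counts are the same.
verts = torch.tensor([
    [-1, -1, -1], [1, -1, -1], [1, 1, -1], [-1, 1, -1],   # back face  (z = -1)
    [-1, -1,  1], [1, -1,  1], [1, 1,  1], [-1, 1,  1],   # front face (z = +1)
], dtype=torch.float32)

# 6 square faces x 2 triangles each = 12 triangular faces.
faces = torch.tensor([
    [0, 1, 2], [0, 2, 3],   # back
    [4, 6, 5], [4, 7, 6],   # front
    [0, 4, 5], [0, 5, 1],   # bottom
    [3, 2, 6], [3, 6, 7],   # top
    [0, 3, 7], [0, 7, 4],   # left
    [1, 5, 6], [1, 6, 2],   # right
], dtype=torch.int64)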

3. Re-texturing a mesh (10 points)

color1=torch.tensor([1., 0., 0.]), color2=torch.tensor([0., 0., 1.])

image
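A sketch of one way to produce such a two-color texture, assuming (as is common for this kind of re-texturing) that each vertex color is a linear interpolation between color1 and color2 based on the vertex's z-coordinate:

import torch
import pytorch3d.renderer

def retexture(vertices, color1, color2):
    # vertices: (N, 3) mesh vertex positions.
    # Blend from color1 at the smallest z to color2 at the largest z
    # (assumed interpolation scheme, not necessarily the exact one used).
    z = vertices[:, 2]
    alpha = (z - z.min()) / (z.max() - z.min())            # in [0, 1]
    colors = (1 - alpha)[:, None] * color1 + alpha[:, None] * color2
    return pytorch3d.renderer.TexturesVertex(verts_features=colors.unsqueeze(0))

# e.g. textures = retexture(verts, torch.tensor([1., 0., 0.]), torch.tensor([0., 0., 1.]))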

4. Camera Transformation (20 points)

The camera described by $R_0$ and $T_0$ is located 3 units away from the world origin (at which the cow is centered) in the negative z-direction, with the same orientation as the world coordinate frame. $R_{relative}$ and $T_{relative}$ describe the rotation and subsequent translation of this "base" camera frame ("rotate-then-translate") needed to align it with the target camera frame.

Specifically, $R_{relative}$ is the matrix that describes the right-hand rotation of the "base" camera frame (relative to its own coordinate frame) that results in the same orientation as the target camera frame. $T_{relative}$ is the vector that describes how much the target camera has to translate (relative to its own coordinate frame) to reach the base camera frame's origin.

We note that because PyTorch3D right-multiplies points by the transformation matrix (row-vector convention), $R_{relative}$ must be transposed. To reflect this, we modify the starter code in camera_transforms.py as follows:

image
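Since the modified code is shown above only as a screenshot, here is a minimal sketch of the kind of change described. It assumes the starter code composes the extrinsics as "rotate-then-translate" on top of the base camera ($R_0 = I$, $T_0 = [0, 0, 3]$) and that the rotation argument is written in the usual column-vector convention (i.e., the matrices before the trailing transpose in the equations below), so the transpose is applied just before handing $R$ to PyTorch3D:

import torch
import pytorch3d.renderer

def build_relative_camera(R_relative, T_relative, device="cpu"):
    # R_relative: 3x3 rotation of the base frame (column-vector convention).
    # T_relative: translation of the target camera, in its own frame, to the
    # base camera frame's origin.
    R_relative = torch.as_tensor(R_relative, dtype=torch.float32)
    T_relative = torch.as_tensor(T_relative, dtype=torch.float32)
    R_0 = torch.eye(3)
    T_0 = torch.tensor([0.0, 0.0, 3.0])

    R = (R_relative @ R_0).T             # transpose: PyTorch3D right-multiplies points by R
    T = R_relative @ T_0 + T_relative    # rotate the base translation, then translate

    return pytorch3d.renderer.FoVPerspectiveCameras(
        R=R.unsqueeze(0), T=T.unsqueeze(0), device=device
    )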

In order to produce this image, the base camera frame would have to be rotated about its z-axis by 90 degrees (right-hand rotation). Therefore, the relative camera transformation is described by:

$ R_{\text {relative }}=\left[\begin{array}{ccc} cos(\pi / 2) & -sin(\pi / 2) & 0 \\ sin(\pi / 2) & cos(\pi / 2) & 0 \\ 0 & 0 & 1 \end{array}\right]^{\top} = \left[\begin{array}{ccc} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{array}\right]^{\top}, \ \ \ T_{\text {relative }}=\left[\begin{array}{l} 0 \\ 0 \\ 0 \end{array}\right] $

image

The target camera that produced this image has the same orientation as the base camera frame, but would have to move 2 units along its z-axis to align with the base camera frame. Therefore, the relative camera transformation is described by:

$ R_{\text {relative }}=\left[\begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array}\right]^{\top}, \ \ \ T_{\text {relative }}=\left[\begin{array}{l} 0 \\ 0 \\ 2 \end{array}\right] $

image

Once again, the target camera that produced this image has the same orientation as the base camera frame, but is situated 0.5 units towards the right and 0.5 units higher than the base camera. In other words, the target camera would have to move 0.5 units along its x-axis and -0.5 units along the y-axis to align with the base camera frame. Therefore, the relative camera transformation is described by:

$ R_{\text {relative }}=\left[\begin{array}{ccc} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{array}\right]^{\top}, \ \ \ T_{\text {relative }}=\left[\begin{array}{l} 0.5 \\ -0.5 \\ 0 \end{array}\right] $

image

In this case, the target camera is viewing the cow from its left side at the same distance of 3 units. To align the base camera with the target camera, we first rotate the base camera 90 degrees about its y-axis; the rotated frame then has the same orientation as the target camera and differs from it only by a translation. After that, the target camera has to move 3 units along both its x-axis and its z-axis to align with the base camera frame. Therefore, the relative camera transformation is described by:

$ R_{\text {relative }}=\left[\begin{array}{ccc} cos(\pi / 2) & 0 & -sin(\pi / 2) \\ 0 & 1 & 0 \\ sin(\pi / 2) & 0 & cos(\pi / 2) \end{array}\right]^{\top} = \left[\begin{array}{ccc} 0 & 0 & -1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{array}\right]^{\top}, \ \ \ T_{\text {relative }}=\left[\begin{array}{l} 3 \\ 0 \\ 3 \end{array}\right] $

5.1 Rendering Point Clouds from RGB-D Images (10 points)

image image image

5.2 Parametric Functions

image image

The first is a rendering from a pointcloud with 82 x 82 = 6724 points. The second is a rendering from a pointcloud with 500 x 500 = 250,000 points.
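A sketch of how such point clouds can be sampled from a parametric function (a torus is used here purely as an example; num_samples was 82 and 500 for the two renders above):

import torch
import pytorch3d.structures

def sample_parametric_torus(num_samples, R=1.0, r=0.4):
    # Sample a num_samples x num_samples grid over the two angular parameters
    # and map it through the torus parametric equations.
    theta = torch.linspace(0, 2 * torch.pi, num_samples)
    phi = torch.linspace(0, 2 * torch.pi, num_samples)
    theta, phi = torch.meshgrid(theta, phi, indexing="ij")

    x = (R + r * torch.cos(theta)) * torch.cos(phi)
    y = (R + r * torch.cos(theta)) * torch.sin(phi)
    z = r * torch.sin(theta)
    points = torch.stack([x, y, z], dim=-1).reshape(-1, 3)   # (num_samples^2, 3)

    # Color each point by its normalized position, just for visualization.
    colors = (points - points.min()) / (points.max() - points.min())
    return pytorch3d.structures.Pointclouds(
        points=points.unsqueeze(0), features=colors.unsqueeze(0)
    )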

5.3 Implicit Surfaces

image

The obvious advantage of rendering a mesh as opposed to a pointcloud is the superior rendering quality under the same computational budget. To achieve a similar rendering quality for the same scene or object, we have to store significantly more points in the pointcloud than the mesh has vertices. For instance, the mesh in 5.3, which has 6616 vertices, produces significantly denser and therefore higher-quality renderings than the pointcloud in 5.2 that has 6,724 points. This naturally leads to higher memory costs for pointclouds of comparable rendering quality: we had to use a pointcloud with 250,000 points to match the rendering quality of the mesh.

However, this does not necessarily mean that rendering time differs as drastically between the two representations as the memory footprint does. On an NVIDIA RTX A5000 GPU, the pointcloud with 250,000 points takes 1.68 seconds to render, whereas the mesh with 6616 vertices takes 1.22 seconds (the pointcloud with 6,724 points takes 1.06 seconds).

Furthermore, pointclouds are much easier to use and manipulate in a rendering pipeline because no connectivity information is stored between the points. This becomes valuable when using pointclouds as a representation in a learning framework: we can, for example, learn the optimal set of points that represents a scene by backpropagating into the raw 3D coordinates of each point, as sketched below. On the other hand, it is difficult to learn an optimal mesh that represents a scene (more specifically, the connectivity between the vertices), as the connectivities are discrete and therefore cannot be updated in a differentiable learning pipeline.
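To make the last point concrete, a toy sketch (not code from the submission) in which the point positions themselves are the learnable parameters and are fit to a placeholder target cloud with a Chamfer loss:

import torch
import pytorch3d.loss

target_points = torch.rand(1, 1000, 3)                 # placeholder target cloud
points = torch.nn.Parameter(torch.rand(1, 1000, 3))    # learnable raw 3D coordinates
optimizer = torch.optim.Adam([points], lr=1e-2)

for _ in range(500):
    optimizer.zero_grad()
    loss, _ = pytorch3d.loss.chamfer_distance(points, target_points)
    loss.backward()    # gradients flow directly into the point coordinates
    optimizer.step()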

6. Do Something Fun

image

Instead of rendering a 360-degree free-viewpoint video, I decided to render a video where the camera path follows a spiral, a setting commonly used for rendering free-viewpoint video of front-facing scenes in NeRF-related works [1]. This was implemented as part of the function render_spiral in main.py; a sketch of the idea is shown below.
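A minimal sketch of the kind of spiral camera path used (the actual render_spiral in main.py may compute its poses differently):

import numpy as np
import torch
import pytorch3d.renderer

def spiral_cameras(num_frames=120, radius=0.5, depths=(2.5, 3.5), device="cpu"):
    # Camera positions trace a spiral in front of the object while always
    # looking at the origin, similar to NeRF-style spiral renders.
    cameras = []
    for t in np.linspace(0, 4 * np.pi, num_frames):
        x = radius * np.cos(t)
        y = radius * np.sin(t)
        z = depths[0] + 0.5 * (depths[1] - depths[0]) * (1 + np.cos(t / 2))
        eye = torch.tensor([[x, y, z]], dtype=torch.float32)
        R, T = pytorch3d.renderer.look_at_view_transform(eye=eye, at=((0, 0, 0),))
        cameras.append(
            pytorch3d.renderer.FoVPerspectiveCameras(R=R, T=T, device=device)
        )
    return cameras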

[1] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis, ECCV 2020.