Homework Number Two

16-889 Learning for 3D Vision
Ben Kolligs

Andrew ID: bkolligs

Zero late days used.

Question 1

Q1.1

Fitting with the voxel loss.

(Figure: chair is formed)
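Concretely, the voxel fitting objective can be sketched as a binary cross-entropy between predicted occupancy logits and the ground-truth grid (a minimal sketch; the exact reduction and any class weighting are assumptions not stated above):

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_logits, target):
    """Binary cross-entropy between predicted occupancy logits and a
    {0, 1} ground-truth voxel grid, averaged over all cells."""
    return F.binary_cross_entropy_with_logits(pred_logits, target)

# Smoke test on a random 4^3 grid.
pred = torch.randn(1, 4, 4, 4)
gt = (torch.rand(1, 4, 4, 4) > 0.5).float()
loss = voxel_loss(pred, gt)
```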

Q1.2

Fitting with the pointcloud loss.

(Figure: chair is formed)
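The pointcloud objective is a Chamfer distance between the predicted and ground-truth point sets; a from-scratch sketch (the real pipeline would typically use pytorch3d's `chamfer_distance`, and the squared-distance form here is an assumption):

```python
import torch

def chamfer_loss(pred, gt):
    """Symmetric Chamfer distance between point sets of shape (N, 3)
    and (M, 3): mean squared distance from each point to its nearest
    neighbour in the other set, summed over both directions."""
    d = torch.cdist(pred, gt) ** 2          # (N, M) pairwise squared distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

a = torch.rand(128, 3)
b = torch.rand(256, 3)
loss = chamfer_loss(a, b)
```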

Q1.3

Fitting with the mesh loss.

(Figure: chair is formed)
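The mesh objective combines a Chamfer term on points sampled from the mesh with a smoothness regularizer. The regularizer can be sketched as a uniform Laplacian term (a minimal sketch; the exact regularizer, e.g. pytorch3d's `mesh_laplacian_smoothing`, is an assumption):

```python
import torch

def laplacian_smoothing(verts, edges):
    """Uniform Laplacian smoothness: mean distance from each vertex to
    the centroid of its neighbours. verts: (V, 3), edges: (E, 2) long."""
    i, j = edges[:, 0], edges[:, 1]
    nbr_sum = torch.zeros_like(verts)
    nbr_sum.index_add_(0, i, verts[j])      # accumulate neighbour positions
    nbr_sum.index_add_(0, j, verts[i])
    deg = torch.zeros(verts.shape[0], device=verts.device)
    ones = torch.ones(edges.shape[0], device=verts.device)
    deg.index_add_(0, i, ones)              # accumulate vertex degrees
    deg.index_add_(0, j, ones)
    centroid = nbr_sum / deg.clamp(min=1).unsqueeze(1)
    return (verts - centroid).norm(dim=1).mean()

# Total mesh loss = Chamfer on sampled surface points
#                 + w_smooth * laplacian_smoothing(verts, edges)
# (w_smooth and the sampling step are assumptions about the pipeline.)
```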

Question 2

Q2.1

The architecture for the Voxel Grid network was as follows:
  1. ConvTranspose3D $64 \rightarrow 128$, stride=2, kernel=4
  2. ConvTranspose3D $128 \rightarrow 256$, stride=2, kernel=4
  3. ConvTranspose3D $256 \rightarrow 128$, stride=2, kernel=4
  4. ConvTranspose3D $128 \rightarrow 64$, stride=2, kernel=4
  5. ConvTranspose3D $64 \rightarrow 1$, stride=1, kernel=3
Each transposed convolution is followed by BatchNorm3d and ReLU, except the last.
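The decoder above can be sketched in PyTorch as follows. The padding of 1 and the 2³ starting resolution are assumptions (neither is stated above), chosen so that four stride-2 layers produce a 32³ occupancy grid:

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    # A stride-2 transposed conv with kernel=4, padding=1 doubles each
    # spatial dimension (padding=1 is an assumption).
    return nn.Sequential(
        nn.ConvTranspose3d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm3d(c_out),
        nn.ReLU(),
    )

decoder = nn.Sequential(
    block(64, 128),
    block(128, 256),
    block(256, 128),
    block(128, 64),
    # Final layer: stride 1, no BatchNorm/ReLU; outputs occupancy logits.
    nn.ConvTranspose3d(64, 1, kernel_size=3, stride=1, padding=1),
)

x = torch.randn(1, 64, 2, 2, 2)   # assumed 2^3 starting feature volume
out = decoder(x)                  # (1, 1, 32, 32, 32) occupancy logits
```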
(Figures: chair is formed; three examples)
These results are after training the network for 600 iterations at batch size 32.

Q2.2

The architecture for the Point Cloud fitting network was as follows:
  1. Linear $512 \rightarrow 1024$
  2. Linear $1024 \rightarrow 2048$
  3. Linear $2048 \rightarrow 3n$
Where $n$ is the number of points. After each linear layer there is a parametric ReLU (PReLU).
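A sketch of this head in PyTorch, with a PReLU after every linear layer as described above (the point count `n_points` is an assumption; the actual value of $n$ was not stated):

```python
import torch
import torch.nn as nn

n_points = 1000  # assumed point count

decoder = nn.Sequential(
    nn.Linear(512, 1024), nn.PReLU(),
    nn.Linear(1024, 2048), nn.PReLU(),
    nn.Linear(2048, 3 * n_points), nn.PReLU(),
)

feat = torch.randn(4, 512)                      # batch of encoder features
points = decoder(feat).reshape(4, n_points, 3)  # per-point xyz predictions
```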
(Figures: chair is formed; three examples)
The following results were obtained from training the network at a batch size of 128 for 500 iterations.

Q2.3

The architecture for the mesh fitting network was as follows:
  1. Linear $512 \rightarrow 1024$
  2. Linear $1024 \rightarrow 1024$
  3. Linear $1024 \rightarrow 3v$
Where $v$ is the number of vertices we are deforming on the starter mesh. After each linear layer there is a parametric ReLU (PReLU). The "starting" ico-sphere mesh was set to a subdivision depth of 3 in an attempt to save memory and prevent overflow during training.
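A sketch of this head in PyTorch; the network predicts per-vertex offsets that are added to the starter mesh. The vertex count of 642 matches a subdivision-3 ico-sphere ($10 \cdot 4^3 + 2$); everything else follows the list above:

```python
import torch
import torch.nn as nn

n_verts = 642  # vertices of a subdivision-3 ico-sphere

decoder = nn.Sequential(
    nn.Linear(512, 1024), nn.PReLU(),
    nn.Linear(1024, 1024), nn.PReLU(),
    nn.Linear(1024, 3 * n_verts), nn.PReLU(),
)

feat = torch.randn(2, 512)                       # batch of encoder features
offsets = decoder(feat).reshape(2, n_verts, 3)   # per-vertex displacements
# deformed_verts = ico_sphere_verts + offsets  (applied to the starter mesh)
```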
(Figures: chair is formed; three examples)

Unfortunately, I couldn't train this network for very long because of a memory leak that caused my GPU to run out of memory. I wasn't able to fix it even after following instructions posted by another student on Piazza.

Q2.4

The quantitative comparison of the F1@0.05 score for each object prediction is shown below:

| Data Type  | F1@0.05 Score |
|------------|---------------|
| Pointcloud | 91.3%         |
| Mesh       | 84.1%         |
| Voxel Grid | 38.2%         |

I believe the pointcloud performed best since it doesn't need to satisfy connectivity constraints, unlike the mesh. Even with the smoothing constraint, the mesh fit struggles to deform properly. The F1@0.05 score is also somewhat misleading, because its precision term only measures how many points of the prediction lie within 0.05m of a ground-truth point. In theory I could produce a mesh whose vertices are all concentrated within 0.05m of a single ground-truth vertex, and it would register perfect precision despite capturing none of the shape. That said, the resolution of the voxel grid is likely not high enough in this example to produce a competitive score. From visual inspection we can see that the voxel grid tends to overestimate the shapes of the chairs, "puffing" them up.

Q2.5

Increasing the number of vertices on the ico-sphere used as the initial mesh for mesh fitting seems to improve the results. I wasn't able to get the mesh to stop looking jagged just by varying this parameter, which may mean the jaggedness has more to do with the architecture. I also tried increasing the amount of smoothing via the w_smooth parameter. If I weight smoothness too heavily, we end up with a blob-like chair shape that doesn't take on the form of the target chair very well.
(Figures: chair is formed; two examples)
These results persisted whenever the smoothing weight exceeded $0.2\,w_{chamfer}$.

Varying the number of points in the pointcloud also changed how long I needed to train. Since increasing the number of points in the cloud effectively increases the number of parameters in the model, the network needed to be trained for longer to reach a good minimum of the loss surface.

Q2.6

For my point cloud model, I was curious how general the features learned were. In order to test this I passed multiple different shapes to the model to see what it outputs.

Sphere

I wanted to first try an easy mesh for the network. I created an ico-sphere and rendered an image of it, producing this output after passing it to the network:
(Figure: chair is formed)
We can see that the network is sort of "inflating the chair" outwards to try to match the sphere, but is heavily biased towards the "mean shape" of the chairs it learned from.

Chair Images from the Internet

Next, I wanted to see how well the network would approximate chairs from the internet, without having exposure to the ground truth. First I passed it an image of multiple chairs to see what would happen. It output a single chair that fits the shape of the pictured chairs quite well. This leads me to believe the network has learned a "mean shape" that it deforms as needed.
(Figure: chair is formed)
The next image further supports my theory: I passed it a wild rocking chair, and it output a variation of its "mean shape", with a little bit of a radius near the legs.
(Figure: chair is formed)

Nothing

Lastly, I passed the network a blank image, which resulted in a clear picture of how the network sees the world: versions of chairs.
(Figure: chair is formed)
Everything it sees is interpreted as a combination of the features the mean chair possesses.