Assignment 2: Single View to 3D
Name: Edward Li
Andrew ID: edwardli
Late Days Used:
1. Exploring loss functions
1.1. Fitting a voxel grid (5 points)
A voxel grid fitted using binary cross-entropy loss. We visualize the fitting process over time (on a square-root timescale):
Source | Target |
---|---|
![]() | ![]() |
Run `python fit_data.py --type 'vox'` to get the output. A folder named `renders` should be made before running.
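For reference, a minimal sketch of the binary cross-entropy voxel loss used for fitting, assuming the grid is predicted as raw logits (the function and argument names here are illustrative, not the starter code's exact signatures):

```python
import torch.nn.functional as F

def voxel_loss(pred_logits, gt_voxels):
    # pred_logits: (B, 32, 32, 32) raw, unnormalized occupancy predictions
    # gt_voxels:   (B, 32, 32, 32) binary ground-truth occupancies in {0, 1}
    # binary_cross_entropy_with_logits folds the sigmoid into the loss for numerical stability.
    return F.binary_cross_entropy_with_logits(pred_logits, gt_voxels.float())
```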
1.2. Fitting a point cloud (10 points)
I implement chamfer loss to fit the point cloud:
Source | Target |
---|---|
![]() | ![]() |
Run `python fit_data.py --type 'point'` to get the output.
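A minimal sketch of the Chamfer loss, written here with brute-force pairwise distances (the actual implementation may instead use PyTorch3D's `knn_points`; names are illustrative):

```python
import torch

def chamfer_loss(pred_points, gt_points):
    # pred_points: (B, N, 3), gt_points: (B, M, 3)
    dists = torch.cdist(pred_points, gt_points) ** 2   # (B, N, M) squared pairwise distances
    pred_to_gt = dists.min(dim=2).values.mean()        # each predicted point -> nearest GT point
    gt_to_pred = dists.min(dim=1).values.mean()        # each GT point -> nearest predicted point
    return pred_to_gt + gt_to_pred
```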
1.3. Fitting a mesh (5 points)
We implement Laplacian smoothing to help regularize the fitted mesh:
Source | Target |
---|---|
![]() | ![]() |
Run `python fit_data.py --type 'mesh'` to get the output.
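One possible formulation of the regularized mesh-fitting objective, sketched with PyTorch3D's built-in losses (the sampling count and weight here are assumptions rather than the exact values used for fitting):

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_fit_loss(pred_mesh, gt_points, w_smooth=0.1):
    # pred_mesh: a pytorch3d Meshes object being optimized; gt_points: (B, M, 3) target samples
    pred_points = sample_points_from_meshes(pred_mesh, num_samples=5000)
    chamfer, _ = chamfer_distance(pred_points, gt_points)             # data term on surface samples
    smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")    # penalizes jagged vertices
    return chamfer + w_smooth * smooth
```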
2. Reconstructing 3D from single view
2.1. Image to voxel grid (15 points)
Run `python train_model.py --type 'vox'` to train this subsection, and `python eval_model.py --type 'vox' --load_checkpoint` to generate visualizations.
For this section, I use a model based on 3D-R2N2. More precisely, I use the same residual decoder as 3D-R2N2 without the recurrent layer used for multi-view reconstruction.
The resulting model uses 12 3D convolution layers, as well as BatchNorm and trilinear upscaling to reconstruct the voxel grid. Residual connections are used to aid training.
We train for 25000 iterations with batch size 8. The learning rate starts at 4e-4 and decays by a factor of 0.3 every 10000 iterations. We reweight the binary cross-entropy so that errors on occupied voxels (missed voxels) are penalized twice as heavily as errors on empty voxels.
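Concretely, the reweighting can be realized as below (a sketch, assuming it is implemented via `pos_weight`; the function name is a placeholder):

```python
import torch
import torch.nn.functional as F

def weighted_voxel_loss(pred_logits, gt_voxels):
    # Occupied ground-truth voxels contribute 2x to the loss, discouraging
    # the network from leaving true voxels empty.
    pos_weight = torch.tensor(2.0, device=pred_logits.device)
    return F.binary_cross_entropy_with_logits(
        pred_logits, gt_voxels.float(), pos_weight=pos_weight
    )
```

The step decay described above maps directly onto `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.3)`.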
Input RGB | Prediction | Ground Truth |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
Qualitatively, the model reconstructs voxel grids well, although it carries strong priors about chair legs and fails on chairs whose legs deviate from the common shapes.
2.2. Image to point cloud (15 points)
Run `python train_model.py --type 'point'` to train this subsection, and `python eval_model.py --type 'point' --load_checkpoint` to generate visualizations.
This section's model is much simpler: a 4-layer MLP decoder with ReLU activations and layer widths $512\to 1024\to 2048 \to 4096 \to 3n$, where $n$ is the number of predicted points.
Training is done with the same procedure and for the same duration as the voxel grid section.
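A sketch of the decoder described above (class and argument names are illustrative; the encoder producing the 512-d image feature is omitted):

```python
import torch.nn as nn

class PointDecoder(nn.Module):
    """512 -> 1024 -> 2048 -> 4096 -> 3n MLP that regresses n_points 3D coordinates."""
    def __init__(self, n_points=5000, latent_dim=512):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, 4096), nn.ReLU(),
            nn.Linear(4096, 3 * n_points),
        )

    def forward(self, feat):
        # feat: (B, 512) image feature; returns a (B, n_points, 3) point cloud
        return self.mlp(feat).reshape(-1, self.n_points, 3)
```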
Input RGB | Prediction | Ground Truth |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
Qualitatively, the model performs well, although it has a strong prior toward rectangular chair backs.
2.3. Image to mesh (15 points)
Run `python train_model.py --type 'mesh'` to train this subsection, and `python eval_model.py --type 'mesh' --load_checkpoint` to generate visualizations.
This section uses a model very similar to the point-cloud decoder, with 4 fully connected layers $512\to 1024\to 2048\to 2048\to 3n$, where $n$ is the number of mesh vertices. I experimented with the starting mesh (a torus was attempted), but it gave a worse F1 score after a few epochs.
We use $w_{smooth}=0.1$ to regularize the mesh, with the same training recipe as above.
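A sketch of the mesh decoder, assuming the network predicts per-vertex offsets for an initial ico-sphere (the torus experiment mentioned above would swap out the source mesh); the names and the sphere subdivision level are illustrative:

```python
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoder(nn.Module):
    """512 -> 1024 -> 2048 -> 2048 -> 3n MLP predicting offsets for the n source vertices."""
    def __init__(self, latent_dim=512, sphere_level=4, device="cuda"):
        super().__init__()
        self.src_mesh = ico_sphere(sphere_level, device)   # starting shape to be deformed
        n_verts = self.src_mesh.verts_packed().shape[0]
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, 2048), nn.ReLU(),
            nn.Linear(2048, 3 * n_verts),
        )

    def forward(self, feat):
        # feat: (B, 512) image feature; returns a batch of deformed meshes
        offsets = self.mlp(feat).reshape(-1, 3)            # packed (B * n_verts, 3) offsets
        return self.src_mesh.extend(feat.shape[0]).offset_verts(offsets)
```

The training objective is presumably the same Chamfer-plus-Laplacian combination sketched in section 1.3, here with $w_{smooth}=0.1$.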
Input RGB | Prediction | Ground Truth |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
Qualitatively, the mesh is a little spiky but performs decently. However, it fails spectacularly on unusually shaped chairs.
2.4. Quantitative comparisons (10 points)
We have the following F1 scores:
Mesh | Pointcloud | Voxel grid |
---|---|---|
92.724 | 96.628 | 88.107 |
We find that point clouds perform best, followed by meshes and then voxel grids. Intuitively, point clouds perform best because they have no connectivity constraints: points can be placed freely, and arbitrary holes in the predicted shape are allowed. The mesh adds connectivity and fixed-topology (no-hole) constraints, making the target function harder to learn. Voxel grids perform the worst due to their limited expressivity (only a $32\times 32\times 32$ grid) and their much larger number of outputs ($32768$ voxels compared to $5000$ points).
Additionally, our evaluation metric is slightly unfair, as it requires sampling point clouds from both meshes and voxel grids. Since the point-cloud network directly optimizes its output points with Chamfer loss, it is expected to score higher than networks that have no control over where the evaluation points are sampled.
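For reference, a sketch of how the F1 score is computed from two sampled point clouds (the distance threshold here is an assumption; the starter code's evaluation may use a different value):

```python
import torch

def f1_score(pred_points, gt_points, threshold=0.05):
    # pred_points: (N, 3), gt_points: (M, 3) points sampled from the two shapes
    dists = torch.cdist(pred_points, gt_points)                        # (N, M) pairwise distances
    precision = (dists.min(dim=1).values < threshold).float().mean()   # predicted points near GT
    recall = (dists.min(dim=0).values < threshold).float().mean()      # GT points that are covered
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)
```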
2.5. Analyse effects of hyperparameter variations (10 points)
For this section, I chose to vary $w_{smooth}$ for mesh predictions. This is mostly because I saw that the PyTorch3D paper used $w_{smooth}=19$ for their experiments, which is orders of magnitude greater than the value I used. This could be explained by a different loss scale with their Earth Mover loss, but it is worth exploring. We examine both the variation in F1 score and the qualitative change in the meshes. Run `./gridsearch.sh` to train all attempts, and `./grideval.sh` to evaluate all trained models.
We use the same mesh network as part 2.3, but vary $w_{smooth}$ over $0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0$ (a roughly logarithmic scale). We train each network for 25000 iterations, which took quite some time; luckily, increasing `num_workers` leads to a significant speedup.
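Roughly, `gridsearch.sh` amounts to the following loop (sketched in Python here; the `--w_smooth` and `--max_iter` flags are assumptions about how the training script exposes these options):

```python
import subprocess

# Train one mesh model per smoothing weight (the values swept in this section).
for w in [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0]:
    subprocess.run(
        ["python", "train_model.py", "--type", "mesh",
         "--w_smooth", str(w), "--max_iter", "25000"],
        check=True,
    )
```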
First, we look at the effect of $w_{smooth}$ on F1 score:
Unsurprisingly, the F1 score decreases with higher values of $w_{smooth}$, dropping off precipitously once $w_{smooth}>3$. This is expected: at high values the smoothing term dominates the loss, so a smooth mesh becomes more optimal than an accurate one. However, this does not mean we should use no smoothing. Let's have a look at some qualitative results:
GT | 0.01 | 0.03 | 0.1 | 0.3 | 1.0 | 3.0 | 10.0 | 30.0 | 100.0 |
---|---|---|---|---|---|---|---|---|---|
![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |
We can see that small values of $w_{smooth}$ fit the shape quite well, which is not surprising. However, the resulting meshes contain many artifacts, with some parts that appear to self-intersect, which is undesirable both for rendering and for keeping the mesh watertight.
On the other extreme, large values of $w_{smooth}$ fit the shape very badly, producing strange spiky blobs. Interestingly, larger values lead to smaller predicted meshes, likely because in the limit the optimum is for all vertices to collapse to a single point. The spikiness arises because moving a single vertex away from the average is the cheapest way to decrease Chamfer distance without increasing the Laplacian loss too much.
In general, $w_{smooth}$ should probably be either $0.3$ or $1.0$: the F1 score does not degrade significantly, while the meshes look noticeably smoother (although $1.0$ might already look too smooth for some).
2.6. Interpret your model (15 points)
I thought it would be interesting to get some insight into both the smoothness and stability of the learned latent space, as well as the stability of individual points/vertices in the point-cloud and mesh networks. Run `python latent_model.py --type vox|mesh|point --load_checkpoint` to generate visualizations.
To do this, we pick two examples from our evaluation set, compute the latent vector for each, and linearly interpolate between them, decoding a 3D shape at each interpolation step. We pick these images:
| Start | End |
---|---|---|
RGB | ![]() | ![]() |
Mesh | ![]() | ![]() |
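A rough sketch of the interpolation procedure in `latent_model.py` (the `encoder`/`decoder` split and names are illustrative of how the trained single-view models are reused):

```python
import torch

@torch.no_grad()
def interpolate_latents(encoder, decoder, img_a, img_b, steps=20):
    # Encode both images, linearly blend the latent vectors, and decode each blend.
    z_a, z_b = encoder(img_a), encoder(img_b)     # (1, 512) latent features
    shapes = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1 - t) * z_a + t * z_b               # constant-speed linear blend
        shapes.append(decoder(z))                 # voxel grid / point cloud / mesh
    return shapes
```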
Ideally, if our latent space is nice and smooth, we should see a continuous, constant-speed transformation between the two chairs for each 3D representation type:
Pointcloud | Voxel | Mesh |
---|---|---|
![]() | ![]() | ![]() |
There are a few conclusions we can draw from this. First, the latent space appears to be quite smooth for all of our 3D representation types, even without explicitly training to enforce smoothness (with something like a VAE). Additionally, each interpolated chair is reasonably realistic, especially for the point-cloud representation. The voxel representation has a few floating voxels at certain points, while the mesh representation shows some slight rearrangement of faces during the interpolation.
This difference in performance is expected, as the point cloud is by far the least constrained representation, allowing for smoother interpolations. Honestly, how well the mesh representation interpolates is rather unexpected.
One final observation is that individual points and vertices are fairly stable: points and vertices on the feet stay on the feet of the chair throughout the interpolation, and vertices in other parts of the chair are similarly static. I assume the network outputs points this way because it was easier to learn roughly fixed point positions during training.
3. (Extra Credit) Exploring some recent architectures.
Unfortunately, no extra credit this time :(