16-889 Assignment 2: Single View to 3D
In this assignment, I explored multiple loss functions and decoder architectures for regressing voxel, point cloud, and mesh representations from single-view RGB input.
Three late days were used.
1. Exploring Loss Functions
1.1. Fitting a voxel grid
To fit a voxel grid, I used binary cross entropy loss. Below is the result.
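For reference, a minimal sketch of this loss in PyTorch, assuming the decoder outputs per-voxel occupancy probabilities (the 32^3 grid shape is an assumption):

```python
import torch
import torch.nn.functional as F

def voxel_loss(pred_voxels, gt_voxels):
    # pred_voxels: (B, 32, 32, 32) predicted occupancy probabilities in (0, 1),
    # gt_voxels:   (B, 32, 32, 32) binary occupancy targets.
    return F.binary_cross_entropy(pred_voxels, gt_voxels.float())
```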
1.2. Fitting a point cloud
To fit a point cloud, I used the chamfer loss. Below is the result.
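For illustration, a minimal sketch of the chamfer loss using PyTorch3D's knn_points (the exact implementation may differ in reduction and weighting):

```python
import torch
from pytorch3d.ops import knn_points

def chamfer_loss(src_points, tgt_points):
    # src_points: (B, N, 3) and tgt_points: (B, M, 3) point clouds.
    # K=1 nearest-neighbour squared distances in both directions, then averaged.
    src_to_tgt = knn_points(src_points, tgt_points, K=1).dists[..., 0]  # (B, N)
    tgt_to_src = knn_points(tgt_points, src_points, K=1).dists[..., 0]  # (B, M)
    return src_to_tgt.mean() + tgt_to_src.mean()
```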
1.3. Fitting a mesh
To fit a mesh, I used a combination of chamfer loss on sampled points along with uniform laplacian smoothing. Below is the result.
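For illustration, a minimal sketch of the combined loss, reusing the chamfer_loss sketch above (the sample count and default weights shown here are placeholders):

```python
from pytorch3d.ops import sample_points_from_meshes
from pytorch3d.loss import mesh_laplacian_smoothing

def mesh_loss(pred_mesh, gt_points, w_chamfer=1.0, w_smooth=0.2, n_samples=5000):
    # pred_mesh: a PyTorch3D Meshes object; gt_points: (B, M, 3) target points.
    # Chamfer is computed on points sampled from the predicted surface; the
    # uniform Laplacian term penalises spiky vertices.
    pred_points = sample_points_from_meshes(pred_mesh, n_samples)  # (B, n_samples, 3)
    smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    return w_chamfer * chamfer_loss(pred_points, gt_points) + w_smooth * smooth
```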
2. Reconstructing 3D from single view
2.1. Image to voxel grid
The architecture I chose for the decoder was inspired by Pix2Vox. However, I could not simply re-implement their architecture: their encoder outputs a 2048-length feature vector, whereas the one provided to us outputs a 512-length one. I did try adding an FC layer to resize the 512-length feature vector to 2048 before replicating their architecture, but it failed to perform well. My suspicion is that this is because the training set is different.
After much experimentation, below is the network I came up with. First, the feature vector is reshaped into 512 channels of a 1x1x1 volume. It is then upsampled to a 2x2x2 volume. This result is similar to the Pix2Vox decoder input, except we now have 512 channels instead of 256. The remaining architecture is almost identical. We use five transposed convolutional layers. The first four have a filter size of 4 and use a stride of 2 and a padding of 1; they are followed by batch normalization and a ReLU activation. The final layer has a filter size of 1 and uses a sigmoid activation to predict whether each voxel is occupied.
Image to Voxel Architecture
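Below is a minimal sketch of this decoder; the intermediate channel widths are assumptions, since only the layer count, kernel sizes, strides, and activations are described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VoxelDecoder(nn.Module):
    # Channel widths (512 -> 128 -> 64 -> 32 -> 8 -> 1) are assumptions; the layer
    # count, kernel sizes, strides, and activations follow the description above.
    def __init__(self):
        super().__init__()
        def up_block(c_in, c_out):
            return nn.Sequential(
                nn.ConvTranspose3d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm3d(c_out),
                nn.ReLU(),
            )
        self.layers = nn.Sequential(
            up_block(512, 128), up_block(128, 64), up_block(64, 32), up_block(32, 8),
            nn.Conv3d(8, 1, kernel_size=1),  # filter size 1 leaves the 32^3 grid unchanged
            nn.Sigmoid(),
        )

    def forward(self, feat):                    # feat: (B, 512) encoder output
        x = feat.view(-1, 512, 1, 1, 1)         # 512 channels of a 1x1x1 volume
        x = F.interpolate(x, scale_factor=2)    # upsample to 2x2x2
        return self.layers(x).squeeze(1)        # (B, 32, 32, 32) occupancy probabilities
```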
I trained my network using the cross entropy loss from section 1.1. However, I noticed that this loss biased the network towards predicting unoccupied voxels. This is because the training set has approximately a 20:1 ratio of unoccupied to occupied voxels, so the network learned to classify more voxels as unoccupied. To fix this, I added more weight to the positive class in the loss function. Although the dataset ratio is 20:1, through experimentation I found that weighting the occupied voxels at a 5:1 ratio performed best.
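As a rough sketch, the re-weighted loss can be written as a manually weighted BCE (the epsilon clamp is only for numerical safety):

```python
import torch

def weighted_voxel_loss(pred, gt, pos_weight=5.0, eps=1e-7):
    # pred: (B, 32, 32, 32) occupancy probabilities; gt: binary targets.
    # Occupied voxels are weighted 5x to counter the ~20:1 class imbalance.
    gt = gt.float()
    pred = pred.clamp(eps, 1.0 - eps)
    loss = -(pos_weight * gt * torch.log(pred) + (1.0 - gt) * torch.log(1.0 - pred))
    return loss.mean()
```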
Before training on the full dataset, I overfit the network on a small training set of two images. I did this as a proof of concept to make sure there were no fundamental flaws in my model architecture, hyperparameters, or loss function. Below are the results on one of the two overfit images.
When training on the full dataset, the network was trained in batches of 16 for 12,000 iterations, which is approximately 30 epochs. The learning rate started at 1e-4 and decreased to 1e-5 after 5000 iterations. Below is a plot of the loss over iterations.
Voxel Loss over Iteration
Below are the results on three examples. The three examples are from the evaluation set and not the training set.
Chair 1
Chair 2
Chair 3
2.2. Image to point cloud
The architecture I chose for the decoder was an MLP. It required a fair amount of trial and error to find an architecture that worked: too small an architecture failed to capture finer details of the point clouds, and too large an architecture failed to train well. In the end, I settled on the network below. I needed a deep network, or else it would over-generalize and fail to fit any of the fine details of the shapes. There are seven fully connected layers. The first six are followed by a ReLU activation, while the last one uses a tanh activation.
Image to Point Cloud Architecture
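Below is a minimal sketch of this decoder; the hidden widths and output point count are placeholders, as only the depth and activations are fixed by the description.

```python
import torch
import torch.nn as nn

class PointCloudDecoder(nn.Module):
    def __init__(self, n_points=5000, feat_dim=512):
        super().__init__()
        dims = [feat_dim, 512, 1024, 1024, 2048, 2048, 4096, n_points * 3]  # assumed widths
        layers = []
        for i in range(len(dims) - 1):           # seven fully connected layers
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            layers.append(nn.ReLU() if i < len(dims) - 2 else nn.Tanh())
        self.mlp = nn.Sequential(*layers)
        self.n_points = n_points

    def forward(self, feat):                     # feat: (B, feat_dim)
        return self.mlp(feat).view(-1, self.n_points, 3)  # points in [-1, 1]^3
```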
I also experimented with different weights for the two terms in the chamfer distance (the source-to-target and target-to-source distances). In the end, I found that equal weights performed best.
Once again, before training the network, I tried to overfit the model on a small training set of two images. Here are the results.
When training on the full dataset, the network was trained in batches of 16 for 3000 iterations, which is approximately 8 epochs. This was significantly fewer iterations than when training the voxel network, but it appeared to converge quicker. The learning rate was a constant 4e-5; unlike when training the voxel network, I found that decreasing it partway through training had no noticeable effect on performance. Below is a plot of the loss over iterations.
Chamfer distance over Iteration
Below are the results on three examples. They are the same evaluation set examples as demonstrated in 2.1.
Chair 1
Chair 2
Chair 3
2.3. Image to mesh
The architecture I chose for this decoder was another MLP. At first, I tried the same MLP used in 2.2, both with and without the final tanh activation. However, the loss stopped decreasing very early, even after experimenting with different learning rates and training for thousands of iterations. I ended up having to reduce the size of the hidden layers. I also found an F1@0.05 improvement by removing one of the hidden layers. The final architecture that yielded the best results can be seen below. All fully connected layers except the last have a ReLU activation, and I found that omitting the tanh activation after the final layer performed slightly better than including it.
Image to Mesh Architecture
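Below is a minimal sketch of this decoder; the hidden widths are placeholders, and the assumption that it predicts per-vertex offsets for an ico_sphere template is mine.

```python
import torch
import torch.nn as nn

class MeshDecoder(nn.Module):
    def __init__(self, n_verts, feat_dim=512):
        super().__init__()
        dims = [feat_dim, 512, 512, 1024, 1024, 2048, n_verts * 3]  # assumed widths
        layers = []
        for i in range(len(dims) - 1):            # six FC layers, one fewer than in 2.2
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                layers.append(nn.ReLU())          # no activation after the final layer
        self.mlp = nn.Sequential(*layers)
        self.n_verts = n_verts

    def forward(self, feat):                      # feat: (B, feat_dim)
        # Per-vertex offsets to be applied to the ico_sphere template mesh.
        return self.mlp(feat).view(-1, self.n_verts, 3)
```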
I also experimented with different weights for the two loss functions. Too little smoothing created an oddly shaped mesh; too much smoothing and the mesh failed to fit the training set well. I ended up using a 0.2:1 smoothing-to-chamfer weighting (w_smooth=0.2, w_chamfer=1).
Again, before training the network on the entire dataset, I tried to overfit the model on a small training set of two images. Here are the results.
Note that it is a little sharp and spiky in some areas. I did try to fix this by adjusting the smoothing loss weight and by modifying the network architecture, but it never truly went away: either the result got noticeably worse, or the F1 score went down with no visible improvement. I was somewhat disappointed, because I knew that if this was the result on a training set of size 2, it would not get much better on a training set of 6000 chairs.
When training on the full dataset, the network was trained in batches of 16 for 10,000 iterations, which is approximately 25 epochs. The learning rate started at 4e-4, dropped to 4e-5 after 5000 iterations, and then to 4e-6 after 9000 iterations. Below are two plots of the loss over iterations. The second plot starts 100 iterations in; it is needed because otherwise the loss curve would not really be visible on account of the large drop across the first several iterations.
Mesh loss over Iteration (Chamfer Weight: 1, Smoothing weight: 0.2)

Below are the results on three examples. The meshes are not as smooth as I would like them to be. I tried hard to make them smoother: I went to a couple of office hours and was advised to increase the network to seven layers with larger hidden sizes, but this did not help.
Chair 1
Chair 2
Chair 3
2.4. Quantitative Comparison
Below are the F1@0.05 scores for the architectures described above.
| Network | F1@0.05 |
|---|---|
| Image to Voxel | 82.223 |
| Image to Point Cloud | 93.706 |
| Image to Mesh | 93.725 |
The first noticeable difference is that the F1@0.05 score for the voxel network is lower than the other two. This is because a voxel grid predicts the occupancy of each voxel regardless of how much of the voxel is truly occupied, which makes it harder to represent shapes accurately. The point cloud and mesh networks performed comparably well. The advantage of the mesh network is that it has connectivity information, and points can be sampled on its faces. However, this can also be a disadvantage, as it cannot change its face connectivity. The advantage of the point cloud is that it is not limited by connections, and can therefore accurately represent holes and general shapes; but its lack of connectivity means points cannot be sampled across an entire surface. The result of these trade-offs is two comparable scores that are both much better than the voxel network's.
2.5. Analyse effects of hyperparameter variations
The hyperparameter I played around with the most by far was the learning rate. With too high a learning rate, the loss failed to converge early on; with too low a learning rate, the network took forever to train. This was particularly noticeable for the voxel and mesh networks, where I had to use an annealing learning rate.
Next, for the point cloud, I experimented with different numbers of points in the point cloud. I trained a network using only 1,000 points and wanted to see how that would affect the visualizations and the F1@0.05 score. To my surprise, the visualizations looked better, but the F1@0.05 score was slightly worse at 93.563. I think the visualizations look better because there are fewer points around the detailed areas, giving them a finer shape, whereas the higher-point network places more points in those areas but without enough density, making the detail stand out less. I think this is most noticeable in the chair legs, as seen in the results below. However, this is just my opinion, and quantitatively fewer points resulted in a lower F1@0.05 score.
Chair 1
Chair 2
Chair 3
Another hyperparameter I played around with was the smoothing loss weight when training the mesh network. I tried smaller smoothing weights (0.05) and larger ones (5). With the smaller weights, there were just too many sharp edges. With the larger weights, either there was no noticeable difference other than a lower F1@0.05 score, or the shape appeared to be just an average of different chairs and could not capture fine details. I ended up using a value of 0.2 because it gave me the best F1@0.05 score. Below are results at w_smooth = 0.05, 0.2, and 5 for Chair 2.
As one can see, w_smooth at 0.05 has too many sharp edges. While w_smooth at 5 appears to represent the chair shape better at the bottom, the F1@0.05 score was slightly worse at 91.803.
I also tried modifying the number of levels in the ico_sphere used to create the mesh to deform. The number of vertices increases exponentially with the level, and when I tried using 5 my GPU ran out of memory. When I decreased it to 3, the F1@0.05 score was worse at 91.435 and the visualizations were worse, as can be seen below. As a result, the default value of 4 appeared to be the best.
Chair 1
Chair 2
Chair 3
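For reference, a small sketch of how a template sphere at a given subdivision level is built with PyTorch3D (the variable name src_mesh is mine):

```python
from pytorch3d.utils import ico_sphere

# Vertex count grows roughly 4x per level (level 3: 642, level 4: 2562,
# level 5: 10242 vertices), which is why level 5 ran out of GPU memory.
src_mesh = ico_sphere(level=4)  # the template sphere that the mesh decoder deforms
```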
2.6. Interpret the model
One thing I noticed while training is that outputs for similar-looking chair inputs appear to be extremely similar. All networks are able to capture general chair structure but do not do as well capturing the finer details. I would like to understand whether this is because the network has learned to output a select number of chairs with noticeable feature differences (tall, wide, detached base, etc.) or whether it can infer the presence and absence of chairs in an image.
To test this, I first passed an empty image to each network. Originally, I would have expected the voxel network to output an empty render, since it is supposed to predict occupancy, and the point cloud and mesh networks to output something random, since they need to output vertices. However, this was not the case for any of them, as they all output a chair. What is interesting is that each network outputs a chair of a similar shape. It is as though each network has independently learned what an average chair is and adjusts the output based on features from the input image. While this was unexpected, it gives me confidence that, while each network is different, they are learning to interpret the encoder output in the same way.
Knowing this, I wanted to see what would happen when I passed in a sphere. Again, each network output a similar shape, one of a round circular sofa. This gave me more confidence that each network learned to represent similar features in an image, and that it was most likely classifying general chair structures as opposed to recognizing where in space a chair may be.
To further test my hypothesis, I decided to pass in a weird shape I made: a square on top of three lightning bolts. I wanted to see if the networks would output something similar to a chair with legs, as in Chair 3 above. And that is somewhat what happened. Each output rendering is a shape with a somewhat round top, despite there being a square in the image, with distinct legs.
Lastly, while we trained the networks with different viewpoints of different chairs, none of them had a chair on its side. I wanted to see what would happen if I rotated the image by 90 degrees. Below are the results. It seems as though each network is outputting a large wide sofa, instead of the tall chair rotated.
To conclude, I think it is safe to say that each network has generalized chairs into a few different shapes, as opposed to being able to recognize when a chair is and is not present. What I find really fascinating is that each network seems to have learned very similar generalizations despite being trained independently with different architectures.
3. Exploring some recent architectures
3.1. Occupancy Network
The architecture I chose for the decoder was inspired by Occupancy Networks. However, I added a few more residual layers. Each residual layer is followed by batch normalization.
Occupancy Network Architecture
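Below is a rough sketch of the decoder; the concatenation-based conditioning, hidden width, and block count are assumptions, and only the residual-block-plus-batch-norm structure follows the description above.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Residual MLP block followed by batch norm. Occupancy Networks use
    # conditional batch norm; here the image feature is simply concatenated
    # with the query points instead, which is an assumption.
    def __init__(self, dim):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.bn(x + self.fc2(self.act(self.fc1(self.act(x)))))

class OccupancyDecoder(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, n_blocks=5):
        super().__init__()
        self.fc_in = nn.Linear(feat_dim + 3, hidden)
        self.blocks = nn.Sequential(*[ResBlock(hidden) for _ in range(n_blocks)])
        self.fc_out = nn.Linear(hidden, 1)

    def forward(self, feat, points):
        # feat: (B, feat_dim) image feature; points: (B, N, 3) 3D query points.
        B, N, _ = points.shape
        x = torch.cat([feat.unsqueeze(1).expand(B, N, -1), points], dim=-1)
        x = self.fc_in(x.reshape(B * N, -1))
        x = self.blocks(x)
        return torch.sigmoid(self.fc_out(x)).view(B, N)  # per-point occupancy
```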
Similar to the voxel network, I trained this network using the cross entropy loss from section 1.1, again weighting the occupied class 5:1. I used the voxel grid as the ground truth when training the network. For each image, 1000 3D points were sampled to be classified as occupied or unoccupied; I used 1000 because this number significantly affected inference time. The network was trained in batches of 8 for 6,500 iterations, which is approximately 9 epochs. The learning rate was a constant 0.0001. Below is a plot of the loss over iterations.
Occupancy Loss over Iteration
Below are the results on three examples. The three examples are from the evaluation set and not the training set. One thing I want to mention is that marching cubes would occasionally fail on some images in the training set; this is why one of the images below is different from the others I have visualized. I believe training for a little longer would resolve the issue, or possibly adjusting the number of samples used to generate the meshes.
Chair 1
Chair 2
Chair 3
3.2. Implicit Network
The architecture I chose for the decoder was inspired by AtlasNet. However, I added a few more fully connected layers.
Implicit Network Architecture
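Below is a rough sketch of an AtlasNet-style decoder; the layer widths, depth, and final tanh are assumptions, since the description only states that it is AtlasNet-inspired with a few extra fully connected layers.

```python
import torch
import torch.nn as nn

class AtlasNetDecoder(nn.Module):
    # Maps 2D points sampled on the unit square, concatenated with the image
    # feature, to 3D surface points, trained with chamfer loss.
    def __init__(self, feat_dim=512, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Tanh(),
        )

    def forward(self, feat, n_points=2500):
        # feat: (B, feat_dim). Sample n_points 2D parameters per example.
        B = feat.shape[0]
        uv = torch.rand(B, n_points, 2, device=feat.device)
        x = torch.cat([feat.unsqueeze(1).expand(B, n_points, -1), uv], dim=-1)
        return self.mlp(x)  # (B, n_points, 3) predicted surface points
```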
Similar to the point cloud network, I trained this network using the chamfer loss from section 1.2, with the same point cloud ground truth as in 2.2. The network was trained in batches of 8 for 6,000 iterations, which is approximately 9 epochs. The learning rate was 0.0001 for the first 3000 iterations and then dropped to 0.00001. As in AtlasNet, the number of sampled points was 2500. Below is a plot of the loss over iterations.
Implicit Loss over Iteration
Below are the results on three examples. The three examples are from the evaluation set and not the training set. I think this network needed just a bit more time to train, but I had already used three late days and did not want to risk using another.