16-889 HW2, RGB to 3D

0. Late Days used - 5



I originally received a 2-day extension due to not having GPU access. In the end, I still wasn't approved for AWS and ended up just using Lambda Labs Cloud. It cost some money, but hey, tuition is like $25,000 for my program, so no worries.

1.1


1.2


1.3


Note - for fair comparison, I use the same 3 *random* examples for every method. I do not cherry-pick examples with good F1 scores. Thus, some renders may not look excellent, but they are representative of the achieved F1 scores.

2.1






I originally trained the network with a scaled binary cross-entropy loss weighting the positive class by 20, as that was roughly the ratio calculated across the dataset. However, this resulted in chairs that were very large and exaggerated. I took the same network (to avoid training from scratch) and reduced the positive weight to 6, which improved both the F1 scores and the overall appearance of the chairs.
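As a rough sketch of this loss (assuming PyTorch's built-in `pos_weight` argument; my actual training code may scale things slightly differently):

    import torch
    import torch.nn as nn

    # pos_weight scales the loss on occupied voxels; I started with 20
    # (roughly the ratio measured across the dataset) and later lowered it to 6.
    criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([6.0]))

    # voxel_logits: raw (B, 32, 32, 32) predictions; voxel_gt: {0, 1} occupancy
    voxel_logits = torch.randn(4, 32, 32, 32)
    voxel_gt = (torch.rand(4, 32, 32, 32) > 0.95).float()
    loss = criterion(voxel_logits, voxel_gt)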

2.2






I didn't do anything special for the point network aside from some blocks with skip connections, although I do feel this network is probably over-parameterized. Because of the massive 15,000-unit output (preceded by a 128-unit layer), my decoder alone has over 2 million parameters. The layers before the output are all (128 x 128) and add relatively little to that count.
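For reference, a rough sketch of the decoder's shape (the block count, the 512-dimensional input feature, and the module names are assumptions for illustration, not the exact network I trained):

    import torch
    import torch.nn as nn

    class SkipBlock(nn.Module):
        # Small residual block: two 128 -> 128 linear layers with a skip connection.
        def __init__(self, dim=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, x):
            return torch.relu(x + self.net(x))

    class PointDecoder(nn.Module):
        def __init__(self, feat_dim=512, n_points=5000, n_blocks=4):
            super().__init__()
            self.n_points = n_points
            self.inp = nn.Linear(feat_dim, 128)
            self.blocks = nn.Sequential(*[SkipBlock(128) for _ in range(n_blocks)])
            # This head dominates the parameter count: 128 * 15000 = 1,920,000 weights.
            self.final_layer = nn.Linear(128, n_points * 3)

        def forward(self, feats):
            x = self.blocks(torch.relu(self.inp(feats)))
            return self.final_layer(x).reshape(-1, self.n_points, 3)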

2.3






This network uses the same decoder as the point network, just with a different output dimension.
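As a hedged sketch of how that reuse might look (the ico-sphere template and the per-vertex-offset interpretation are assumptions for illustration, building on the `PointDecoder` sketch above):

    import torch
    from pytorch3d.utils import ico_sphere

    # Same decoder class as the point branch, sized for mesh vertices instead;
    # the output is read as per-vertex offsets applied to a template sphere.
    mesh_src = ico_sphere(level=4)
    n_verts = mesh_src.verts_packed().shape[0]
    mesh_decoder = PointDecoder(feat_dim=512, n_points=n_verts)

    feats = torch.randn(1, 512)                   # stand-in image features
    offsets = mesh_decoder(feats).reshape(-1, 3)  # (n_verts, 3)
    mesh_pred = mesh_src.offset_verts(offsets)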

2.4

Method         F1@0.05
Voxel             73.1
Point Cloud       92.0
Mesh              81.5

The point cloud does best. This makes sense: it is unconstrained, so it does not have to solve an implicit connectivity problem. Additionally, the point cloud directly optimizes its objective function (in contrast to the mesh) and has no class-imbalance problem. The voxel representation shares some of these advantages, but it must optimize over a much larger space (32^3 = 32,768 cells) in which positive examples are sparse. Weighting the positive class alleviates this slightly, but it does not change the fact that the majority of voxels are empty, and sparse targets with high variability are hard to fit because the function must be very precise.
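To make the imbalance concrete, a quick back-of-the-envelope check (the roughly 1:20 occupied-to-empty ratio is the dataset statistic mentioned in 2.1):

    total_voxels = 32 ** 3                # 32,768 cells per grid
    occupied = total_voxels / 21          # roughly 1 occupied cell per 20 empty ones
    print(total_voxels, round(occupied))  # 32768, ~1560 occupied voxels on average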

2.5

Instead of simply changing a hyperparameter, I decided to experiment with a more drastic architecture and training change. As noted above, the vast majority of parameters in my point network come from the final layer, a (128, 15000) linear layer producing 5000 3D points. That is 1,920,000 parameters on its own: a sizeable fraction of the entire encoder, and far more than the rest of the decoder combined.

My idea is to first train the network with only 250 points, i.e. an output of 750 units. The final layer is now (128, 750), which is still large, but this brings the entire decoder down to 245,749 parameters (a fraction of just the last linear layer in the prior network).

The hope is that the second-to-last layer (of dimension 128) will still learn a good representation even with fewer points. Then we can freeze the entire network and train only those last 1,920,000 parameters as a linear layer on the frozen features. This is valuable because it makes the initial training faster, and it shows we can produce a variable number of points by training on the frozen features (i.e. if we want 10,000 points, we don't need to train the entire network end to end again; we just train a new linear head with 10,000 x 3 output units on top of the representation).
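A minimal sketch of the freeze-and-swap step (assuming the decoder layout from the 2.2 sketch; `model` is the loaded 250-point network, and attribute names like `final_layer` come from my sketch, not necessarily the exact names in the code):

    import torch
    import torch.nn as nn

    # Freeze every parameter of the stage-1 (250-point) network.
    for p in model.parameters():
        p.requires_grad = False

    # Swap in a fresh full-size head: 128 -> 15000 (5000 points * xyz),
    # i.e. the ~1.92M weights that dominated the original decoder.
    model.decoder.final_layer = nn.Linear(128, 5000 * 3)

    # Only the new head's parameters are handed to the optimizer.
    optimizer = torch.optim.Adam(model.decoder.final_layer.parameters(), lr=1e-3)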

This code isn't implemented very cleanly, but passing `--custom 1` replaces the final layer with a 250-point head and randomly subsamples the ground truth to that many points. Passing `--custom 2` immediately afterwards loads the resulting network, replaces its final layer with the full 5000-point head, and freezes the remaining layers (no gradients; their weights are not passed to the optimizer). The results are below.

Method                                          F1@0.05
(A) End-to-end (5000 points) for 8k steps          92.0
(B) End-to-end (250 points) for 5k steps           86.3
(C) Frozen linear (5000 points) for 3k steps       90.7

I believe the 250-point model saturated at a lower F1 score because, with fewer points, the network must place each prediction far more precisely to land near the ground-truth points; a denser output covers the surface more easily and generally reaches a higher F1. Although the F1 score is slightly lower for the frozen head trained on features from the network trained with fewer points, I think it's in the same ballpark and still valuable.

Thus, we see that the described way of training is feasible.

2.6

Based on the last experiment, I wanted to see how meaningful the latent space of my network is, given that the final feature vector works well enough as a representation to train a new head on. For this visualization I use network (C) from above. I first take two random images, shown here:

I compute the features of both images, then linearly interpolate between them, passing each interpolated feature vector through my network's final linear layer. The code looks similar to this:

    
        # Model has been modified to only output features with the args
        f1 = model(rgb1, args)
        f2 = model(rgb2, args)

        point_clouds = []
        for step in range(0, 100, 5):
            # Create the interpolated feature vector
            alpha = step / 100
            fx = f1 * (1 - alpha) + f2 * alpha
            # Decode the interpolated features with the final linear layer
            point_clouds.append(model.decoder.final_layer(fx))
    
    

This, with some visualization code, creates the interpolation graphic for the two chairs above. We can see the point cloud morph smoothly between the two shapes in a way that makes a lot of semantic sense. I render the chair from multiple viewpoints as it interpolates.
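For completeness, a rough sketch of the multi-viewpoint rendering (assuming PyTorch3D's point renderer; the camera distance, point radius, and image size are illustrative values, not necessarily those in my visualization code):

    import torch
    from pytorch3d.structures import Pointclouds
    from pytorch3d.renderer import (
        AlphaCompositor, FoVPerspectiveCameras, PointsRasterizationSettings,
        PointsRasterizer, PointsRenderer, look_at_view_transform,
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    raster_settings = PointsRasterizationSettings(image_size=256, radius=0.01)

    images = []
    for out in point_clouds:                              # outputs of the loop above
        pts = out.detach().reshape(1, -1, 3).to(device)   # (1, 5000, 3)
        cloud = Pointclouds(points=pts, features=torch.ones_like(pts))
        for azim in (0, 90, 180, 270):                    # four viewpoints per step
            R, T = look_at_view_transform(dist=2.0, elev=20.0, azim=azim, device=device)
            cameras = FoVPerspectiveCameras(R=R, T=T, device=device)
            renderer = PointsRenderer(
                rasterizer=PointsRasterizer(cameras=cameras, raster_settings=raster_settings),
                compositor=AlphaCompositor(),
            )
            images.append(renderer(cloud)[0, ..., :3].cpu())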