I originally received a 2-day extension because I didn't have GPU access. In the end, I still wasn't approved for AWS and ended up just using Lambda Labs Cloud. It cost some money, but hey, tuition is like $25,000 for my program, so no worries.
I originally trained the network with a scaled binary cross-entropy loss, weighting the positive class by 20, since that was roughly the empty-to-occupied ratio calculated across the dataset. However, this resulted in chairs that were very large and exaggerated. I took the same network (to avoid training from scratch) and reduced the positive weighting term to 6, which improved both the F1 scores and the overall appearance of the chairs.
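A minimal sketch of one way to set up such a weighted loss in PyTorch, assuming the voxel decoder outputs raw logits over the 32³ grid (the tensor shapes and names here are illustrative, not my exact training code):

```python
import torch
import torch.nn as nn

# pos_weight scales the loss contribution of occupied voxels relative to empty ones.
# I started with 20 (the rough empty-to-occupied ratio) and later reduced it to 6.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([6.0]))

# Illustrative shapes: a batch of predicted voxel logits and binary occupancy targets.
pred_logits = torch.randn(8, 32, 32, 32)                 # raw decoder outputs (logits)
target = torch.randint(0, 2, (8, 32, 32, 32)).float()    # ground-truth occupancy

loss = criterion(pred_logits, target)
```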
I didn't do anything special for the point network aside from some blocks with skip connections, although I do feel that this network is probably over-parameterized. Because of the massive 15,000-unit output (preceded by a 128-unit layer), my decoder is over 2 million parameters. The layers prior to the output layer are all (128 x 128), which add relatively little complexity.
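As a rough sketch of where those parameters concentrate (the number of blocks and their exact structure are illustrative assumptions, not my actual architecture):

```python
import torch.nn as nn

# A skip-connection block over the 128-dim feature (an assumed block structure).
class SkipBlock(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.fc = nn.Linear(dim, dim)   # 128 * 128 + 128 = 16,512 params per block
        self.act = nn.ReLU()

    def forward(self, x):
        return x + self.act(self.fc(x))

# The final layer dominates: 128 * 15000 = 1,920,000 weights (plus bias),
# while each hidden 128 x 128 block adds only ~16.5K parameters.
decoder = nn.Sequential(
    SkipBlock(128),
    SkipBlock(128),
    nn.Linear(128, 5000 * 3),  # 5000 3D points flattened to 15,000 outputs
)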
This network has the same decoder as the point network (with a different output dimension).
Method | F1@0.05 |
---|---|
Voxel | 73.1 |
Point Cloud | 92.0 |
Mesh | 81.5 |
The point cloud does the best. This makes sense: it is unconstrained, so it does not have to solve an implicit connectivity problem. Additionally, the point cloud directly optimizes its objective function (in contrast to the mesh) and has no class-imbalance problem. The voxels have similar advantages but must optimize over a much larger space (32³ = 32,768 cells) in which positive examples are sparse. Weighting the positive class alleviates this slightly, but it does not change the fact that the majority of voxels are empty (sparse learning with high variability is difficult because the function must be very precise).
Instead of simply changing a hyperparameter, I decided to experiment with a drastic
architecture and training change. As I noted above, the vast majority of parameters
in my point network come from the final layer, a linear layer of shape (128, 15000)
for 5000 3D points. That is a huge 1,920,000 parameters, a moderate percentage of the
entire encoder's size and dwarfing the rest of the decoder.
My idea to address this is to first train the network with only 250 points, i.e. an output
of 750 values. Now the final layer is (128, 750), which is still large, but it brings the
parameters of the entire decoder down to 245,749 (a fraction of JUST the last linear layer
in the prior network).
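A quick back-of-the-envelope check of where those numbers come from (only the final-layer sizes are taken from the text above; the rest of the decoder is counted separately):

```python
# Final linear layer of the original point decoder: 128 -> 15000 (5000 points * 3 coords)
print(128 * 15000)   # 1,920,000 weights, matching the count above

# Reduced final layer: 128 -> 750 (250 points * 3 coords)
print(128 * 750)     # 96,000 weights; with biases and the earlier
                     # 128 x 128 blocks, the whole decoder comes to ~245K parameters
```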
The hope is that the second-to-last layer (of dimension 128) will still learn a good
representation even with fewer points. Then we can freeze the entire network and train
just those last 1,920,000 parameters as a linear layer on top of the frozen features.
This is valuable because it makes the initial training faster, and it shows that we can
support a variable number of points by training on the frozen features (i.e. if we want
10,000 points, we don't need to train the entire network end-to-end again; we just need
to add linear output units for the extra 5,000 points on top of the representation).
This code isn't implemented very cleanly, but if you use `--custom 1` as an argument,
the final layer is replaced with one that outputs only 250 points, and the ground truth
is randomly sampled down to that many points. If you then run with `--custom 2`, the
resulting network is loaded, has its final layer replaced with the full 5000 points,
and has the rest of its layers frozen (no gradient, weights not passed to the optimizer).
The results are below.
Method | F1@0.05 |
---|---|
(A) End-to-end (5000 points) for 8k steps | 92.0 |
(B) End-to-end (250 points) for 5k steps | 86.3 |
(C) Frozen linear (5000 points) for 3k steps | 90.7 |
Based on the last experiments, I wanted to see how meaningful the latent space of my network is, given that we can use the final feature vector as a representation good enough to train a linear layer over. In this visualization, I am using network (C) from above. I first take two random images, shown here:
I compute the features of both of these images, then linearly interpolate between them, passing the linearly interpolated feature vector to my network's final linear layer. The code looks similar to this:
```python
import torch

# Model has been modified to only output features with the args.
with torch.no_grad():
    f1 = model(rgb1, args)
    f2 = model(rgb2, args)

    point_clouds = []
    for step in range(0, 100, 5):
        # Create the interpolated feature vector.
        alpha = step / 100
        fx = f1 * (1 - alpha) + f2 * alpha
        # Decode the interpolated features with the final linear layer.
        point_clouds.append(model.decoder.final_layer(fx))
```
This, with some visualization code, creates the interesting graphic for the above two chairs. We can see the point cloud morph linearly between the two representations in a way that makes a lot of semantic sense. I render the chair from multiple viewpoints as it interpolates.
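For reference, a minimal way to plot a few of the interpolated clouds; my actual rendering pipeline differs, this is just a quick matplotlib scatter sketch over the `point_clouds` list from the snippet above:

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers 3D projection)

steps = point_clouds[::4]                         # plot every 4th interpolation step
fig = plt.figure(figsize=(4 * len(steps), 4))
for i, pc in enumerate(steps):
    pts = pc.reshape(-1, 3).cpu().numpy()         # (N, 3) points
    ax = fig.add_subplot(1, len(steps), i + 1, projection="3d")
    ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], s=1)
    ax.set_axis_off()
plt.savefig("interpolation.png")
```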