Four late days were used
Note that the same ground truth mesh, shown below, was used for all of the following experiments.
For this, I used binary cross-entropy loss between the two tensors. I used the PyTorch module `BCEWithLogitsLoss` as a more numerically stable version of running sigmoid and then binary cross-entropy. Sigmoid or a similar activation is required because the outputs of the network need to be constrained to $[0, 1]$. An example of fitting a voxel grid to a target is shown below.
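A minimal sketch of this loss, assuming the decoder emits raw logits over a $32 \times 32 \times 32$ grid (the grid size, learning rate, and iteration count are illustrative assumptions):

```python
import torch
import torch.nn as nn

def voxel_loss(pred_logits, gt_voxels):
    # BCEWithLogitsLoss fuses sigmoid and binary cross-entropy, which is more
    # numerically stable than applying them separately.
    # pred_logits: raw decoder outputs, gt_voxels: occupancy in {0, 1}.
    return nn.BCEWithLogitsLoss()(pred_logits, gt_voxels.float())

# Fitting a voxel grid of logits directly to a target occupancy grid.
logits = torch.zeros(1, 32, 32, 32, requires_grad=True)
target = (torch.rand(1, 32, 32, 32) > 0.5).float()
optimizer = torch.optim.Adam([logits], lr=1e-2)
for _ in range(1000):
    optimizer.zero_grad()
    loss = voxel_loss(logits, target)
    loss.backward()
    optimizer.step()
```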
I implemented Chamfer loss using `knn_points`. The first item returned by this function is the squared distance from each point in the first set to its closest point in the second set. The Chamfer distance defined in "A Point Set Generation Network for 3D Object Reconstruction from a Single Image" would simply take the sum of both directional terms. However, I found that to match the values returned by the `pytorch3d` implementation, I had to normalize by the number of points in each set. Below is a fitted point cloud.
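A sketch of this loss, assuming batched `(B, N, 3)` point clouds; the per-set normalization via `.mean()` is what brings it in line with `pytorch3d.loss.chamfer_distance`:

```python
import torch
from pytorch3d.ops import knn_points

def chamfer_loss(points_src, points_tgt):
    # points_src: (B, P1, 3), points_tgt: (B, P2, 3)
    # With K=1, the .dists field holds the squared distance from each point
    # to its nearest neighbour in the other set.
    dists_src = knn_points(points_src, points_tgt, K=1).dists  # (B, P1, 1)
    dists_tgt = knn_points(points_tgt, points_src, K=1).dists  # (B, P2, 1)
    # Averaging (rather than summing) over each set normalizes by the number
    # of points, matching the pytorch3d implementation's default reduction.
    return dists_src.mean() + dists_tgt.mean()
```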
Here I simply used the `mesh_laplacian_smoothing` loss implemented in `pytorch3d`. Below is a fitted mesh with `w_chamfer=1` and `w_smooth=1`.
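A sketch of how the two terms might be combined when fitting a mesh, reusing the `chamfer_loss` sketch above; the sample count and the uniform Laplacian method are assumptions:

```python
from pytorch3d.loss import mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_fitting_loss(pred_mesh, gt_points, w_chamfer=1.0, w_smooth=1.0):
    # Sample a point cloud from the predicted mesh so it can be compared to
    # the ground-truth points with the Chamfer loss defined above.
    pred_points = sample_points_from_meshes(pred_mesh, num_samples=5000)
    loss_chamfer = chamfer_loss(pred_points, gt_points)
    # Uniform Laplacian smoothing penalizes vertices that stray far from the
    # centroid of their neighbours, discouraging spiky geometry.
    loss_smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    return w_chamfer * loss_chamfer + w_smooth * loss_smooth
```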
Here I used an architecture with eight layers of transposed convolution (also known as deconvolution). I started with a `1x1x1` spatial resolution and then expanded the spatial resolution while decreasing the number of channels in the following manner: `(512, 384, 256, 192, 128, 64, 32, 16, 1)`.
Intuitively, I began with singleton spatial dimensions because the features in the image embedding do not necessarily correspond to individual regions of the 3D scene. So we must reason globally at the early stages of the network.
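A sketch of such a decoder: the channel progression follows the description above, while the kernel sizes and strides are assumptions chosen so a `(B, 512)` embedding, viewed as a `1x1x1` feature map, grows to a `32x32x32` grid of occupancy logits:

```python
import torch
import torch.nn as nn

class VoxelDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        channels = [512, 384, 256, 192, 128, 64, 32, 16, 1]
        layers = []
        # First five transposed convolutions expand 1 -> 2 -> 4 -> 8 -> 16 -> 32.
        layers.append(nn.ConvTranspose3d(channels[0], channels[1], kernel_size=2, stride=1))
        for i in range(1, 5):
            layers.append(nn.ReLU())
            layers.append(nn.ConvTranspose3d(channels[i], channels[i + 1],
                                             kernel_size=4, stride=2, padding=1))
        # Last three layers refine at 32^3 without changing resolution.
        for i in range(5, 8):
            layers.append(nn.ReLU())
            layers.append(nn.ConvTranspose3d(channels[i], channels[i + 1],
                                             kernel_size=3, stride=1, padding=1))
        self.decoder = nn.Sequential(*layers)

    def forward(self, embedding):              # embedding: (B, 512)
        x = embedding.view(-1, 512, 1, 1, 1)   # singleton spatial dimensions
        return self.decoder(x)                 # (B, 1, 32, 32, 32) logits
```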
The results were acceptable, but they were generally blobbier and closer to the mean shape than they should have been. Three results can be seen below.
Prediction, ground truth, and image
For this model, I used a fairly simple architecture consisting of fully connected layers with ReLU activations. The hidden layer sizes were $(512, 1024, 2048, 4096)$ and the output size was $8193$, which produced $2731$ output points. I trained with a learning rate of $10^{-4}$.
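A sketch of this decoder; the 512-dimensional input embedding, and treating the leading 512 as a hidden layer rather than the embedding itself, are assumptions:

```python
import torch
import torch.nn as nn

class PointCloudDecoder(nn.Module):
    def __init__(self, embed_dim=512, n_points=2731):
        super().__init__()
        self.n_points = n_points
        # Hidden sizes (512, 1024, 2048, 4096) with ReLU, then an
        # 8193-dim output reshaped into 2731 (x, y, z) points.
        self.layers = nn.Sequential(
            nn.Linear(embed_dim, 512), nn.ReLU(),
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, 4096), nn.ReLU(),
            nn.Linear(4096, n_points * 3),
        )

    def forward(self, embedding):
        return self.layers(embedding).view(-1, self.n_points, 3)
```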
The results were generally quite good qualitatively. Three results can be seen below.
Prediction, ground truth, and image
For this task, I used a model conceptually similar to the point cloud one. However, I was having issues with underfitting, so I chose to increase the model capacity. The final model had seven fully connected layers with the following dimensionality: `(512, 1024, 2048, 4096, 4096, 2562 * 3, 2562 * 3, 2562 * 3)`. As the final nonlinearity, I used `tanh` to avoid very large offsets.
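A sketch of this decoder, predicting per-vertex offsets for a 2562-vertex icosphere; treating the leading 512 as the input embedding size is an assumption:

```python
import torch
import torch.nn as nn

class MeshDecoder(nn.Module):
    def __init__(self, embed_dim=512, n_verts=2562):
        super().__init__()
        self.n_verts = n_verts
        dims = [embed_dim, 1024, 2048, 4096, 4096, n_verts * 3, n_verts * 3, n_verts * 3]
        layers = []
        for i in range(len(dims) - 2):
            layers += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU()]
        # Final tanh keeps each vertex offset in [-1, 1].
        layers += [nn.Linear(dims[-2], dims[-1]), nn.Tanh()]
        self.layers = nn.Sequential(*layers)

    def forward(self, embedding):
        # (B, 2562, 3) offsets to add to the initial icosphere vertices.
        return self.layers(embedding).view(-1, self.n_verts, 3)
```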
The results didn't look great visually due to odd spikes in the mesh, but they matched the global structure fairly well.
Prediction, ground truth, and image
|         | Voxel grid | Point cloud | Mesh   |
|---------|------------|-------------|--------|
| F1@0.05 | 74.750     | 96.194      | 85.684 |
With the voxel grid, I noticed that the predictions consistently over-predicted the volume of the true structure. I think that this may be because the task requires the network to predict whether each grid cell is occupied independently of the others. Since the loss is biased toward the positive points to account for the class imbalance, the predictions get spread out through the plausible region. Finally, there is a lot of information lost in the voxelization process: if you visualize the ground truth voxels as an isosurface, a number of strange artifacts are introduced.
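One way to implement the positive bias mentioned above is the `pos_weight` argument of `BCEWithLogitsLoss`; the weight value here is only an illustrative assumption, not necessarily the one used in my experiments:

```python
import torch
import torch.nn as nn

# pos_weight > 1 upweights occupied cells relative to empty ones; 5.0 is an
# illustrative value for the occupied/empty imbalance, not a tuned choice.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.0))
```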
I think the mesh-based prediction is a challenging task because the network needs to balance the spatial consistency objective and the accuracy one. This means that it is hard for it to capture fine detail. This is especially difficult because the original spherical mesh is genus zero.
My intuition is that each output neuron of the point cloud network is able to predict a given feature. However, unlike the mesh, there is less of a constraint that these features be predicted perfectly or consistently with their neighbors. This intuition is supported by the visualizations in section 2.6, where the neurons learn to predict a given semantic part of the object. Especially for more complex meshes that have a non-standard global structure, this seems to allow the network to capture the local structure more effectively.
This plot shows that the learning rate has a very substantial effect on the final F1@0.05 metric. I was curious about this because in my initial experiments I was struggling with severe underfitting: the model seemed to learn a mean chair shape and did not adapt the shape to the encoded feature vector. Given these trends, I believe that the model had sufficient capacity to fit this function and that the issue was in the optimization strategy.
Two examples with the highest learning rate, `0.001`, can be seen below. Note that they do not capture the variation in these interesting examples.
The ones trained with the lower learning rate of `0.0000125` capture the variation much better, as seen below.
All of these experiments were trained for 14000 epochs, which took roughly 3 hours on my machine. One issue I saw was that, for the lower learning rates, the model did not fully converge in this time. Therefore, in my subsequent experiments, I used learning rate decay, which seemed to help the model quickly learn the general structure and then adapt to the individual instances.
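A sketch of such a decay schedule; the scheduler type (`StepLR`), step size, and decay factor are assumptions rather than the exact values I used:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 8193)  # placeholder for any of the decoders above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2000, gamma=0.5)

for epoch in range(14000):
    # ... forward pass, loss computation, loss.backward(), optimizer.step() ...
    scheduler.step()  # halve the learning rate every 2000 epochs
```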
I was curious whether each output neuron predicted a consistent part of the object. To test this theory, I implemented two visualization strategies, one for point clouds and one for meshes. Both involved coloring the predicted vertices in a manner that was consistent across predictions.
For the case of the mesh, I simply transformed the icosphere's `xyz` vertex coordinates into `rgb` vertex colors. This allowed me to confirm that different neurons did correspond to different semantic features. Further, when I used too little smoothing, it allowed me to infer complex mesh geometries where one portion of the mesh was composed of predictions from multiple regions of the icosphere.
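A sketch of this coloring, assuming a level-4 icosphere (2562 vertices) from `pytorch3d.utils.ico_sphere`; the coordinates are min-max normalized into $[0, 1]$ so they can be used directly as `rgb`:

```python
import torch
from pytorch3d.utils import ico_sphere
from pytorch3d.renderer import TexturesVertex

src_mesh = ico_sphere(4)              # 2562-vertex icosphere
verts = src_mesh.verts_packed()       # (2562, 3) xyz coordinates
lo, hi = verts.min(dim=0).values, verts.max(dim=0).values
colors = (verts - lo) / (hi - lo)     # xyz mapped to rgb in [0, 1]

# The same per-vertex colors are applied to every predicted mesh, so each
# output vertex keeps its color across predictions.
textures = TexturesVertex(verts_features=colors.unsqueeze(0))
```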
I struggled a bit more with the point cloud. While I could have just textured the points consistently across predictions, it is important that the colors are also consistent with those of neighboring points; otherwise, with thousands of points, it would be challenging to track individual ones between objects. Therefore, I ran predictions over the entire test set and averaged the `xyz` locations of each individual point across all the predictions. Then I used these averaged `xyz` points, which represent the mean prediction for each neuron, to create an `rgb` colormap. When I used this on the predictions, it showed that different neurons do roughly correspond to different regions, as seen below. It also shows that in general the predictions in each instance are close to the mean prediction, because the regions of color don't mix much. An unsurprising exception to this is the stem of the office chair, which is much narrower in the x-y plane than most other chairs at that point, causing multiple predictions to be lumped together.
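A sketch of building this colormap; `predictions` is a hypothetical `(N_test, P, 3)` tensor stacking the predicted point clouds for the whole test set:

```python
import torch

def neuron_colormap(predictions):
    # predictions: (N_test, P, 3) predicted point clouds over the test set.
    mean_xyz = predictions.mean(dim=0)   # (P, 3) mean prediction per output neuron
    lo = mean_xyz.min(dim=0).values
    hi = mean_xyz.max(dim=0).values
    return (mean_xyz - lo) / (hi - lo)   # (P, 3) rgb colors in [0, 1]
```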
This effect is not terribly pronounced on normal-looking chairs, since it looks similar to just coloring them by their predicted `xyz` coordinates. However, for nonstandard instances such as the one below, it shows that the semantics are still maintained. In other instances, the orange color corresponds to the front right corner of the chair seat; despite the different structure, that semantic relation is still maintained.