Late days used: 4
Optimized voxel grid:
Ground truth voxel grid:
Optimized point cloud:
Ground truth point cloud:
Optimized mesh:
Ground truth mesh:
Input RGB image:
Ground truth voxel grid:
Predicted voxel grid:
Input RGB image:
Ground truth voxel grid:
Predicted voxel grid:
Input RGB image:
Ground truth voxel grid:
Predicted voxel grid:
Input RGB image:
Ground truth point cloud:
Predicted point cloud:
Input RGB image:
Ground truth point cloud:
Predicted point cloud:
Input RGB image:
Ground truth point cloud:
Predicted point cloud:
Input RGB image:
Ground truth mesh:
Predicted mesh:
Input RGB image:
Ground truth mesh:
Predicted mesh:
Input RGB image:
Ground truth mesh:
Predicted mesh:
The point cloud representation performs the best among the three representations in terms of F1 score (also apparent from the visualizations above). This is expected, as the point cloud representation is essentially unconstrained and can fit the given ground truth relatively easily. In contrast, both the voxel and mesh representations are constrained: the voxel grid by its resolution, and the mesh by the connectivity between points and the mode of initialization. Furthermore, predicting per-voxel occupancy intuitively seems to be a harder task than predicting deformations of a pre-initialized mesh, which makes the lower F1 score for voxels reasonable. Increasing the resolution of the voxel grid would likely boost its F1 score.
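The per-voxel occupancy prediction mentioned above is typically trained with binary cross-entropy on the occupancy logits. A minimal sketch (the grid resolution and tensor names here are my own assumptions, not the actual assignment code):

```python
import torch
import torch.nn.functional as F

# predicted occupancy logits and ground-truth binary grid, at an assumed 32^3 resolution
pred_logits = torch.randn(1, 32, 32, 32)
gt_occupancy = (torch.rand(1, 32, 32, 32) > 0.5).float()

# binary cross-entropy averaged over every voxel in the grid
loss = F.binary_cross_entropy_with_logits(pred_logits, gt_occupancy)
print(loss.item() > 0)  # True
```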
| Representation | F1@0.05 |
| --- | --- |
| Voxel | 85.875 |
| Point Cloud | 94.423 |
| Mesh | 91.832 |
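For reference, the F1@0.05 metric reported above can be computed roughly as follows (a minimal NumPy sketch; the function name is my own, and the actual evaluation code uses batched nearest-neighbour queries):

```python
import numpy as np

def f1_at_thresh(pred, gt, thresh=0.05):
    """F1 between two point sets pred (N, 3) and gt (M, 3).

    A predicted point counts as correct if its nearest ground-truth
    point lies within `thresh` (precision); symmetrically for recall.
    Scores are reported as percentages.
    """
    # pairwise Euclidean distances, shape (N, M)
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = (dists.min(axis=1) < thresh).mean()  # pred -> gt
    recall = (dists.min(axis=0) < thresh).mean()     # gt -> pred
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2 * precision * recall / (precision + recall)

pts = np.random.rand(200, 3)
print(f1_at_thresh(pts, pts))  # identical clouds score 100.0
```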

| n_points | F1@0.05 |
| --- | --- |
| 1000 | 91.253 |
| 2000 | 93.399 |
| 5000 | 94.423 |
| 10000 | 95.119 |
The baseline here is the experiment with 5000 points in the point cloud. Increasing the number of points to 10000 yields a small increase in the F1 score; similarly, decreasing the number of points leads to a slight drop. This is expected, since the representational capacity of the output grows with the number of points. Visually, the 10000-point predictions are slightly richer than the 5000- or 2000-point ones, as visible in the samples below. Further, in example 2 the 1000-point prediction does not match the ground truth well, indicative of that model's limited capacity.
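The only architectural change across these runs is the size of the decoder's output layer. A minimal sketch of such a head follows; the feature dimension, layer widths, and class name are my own assumptions, not the actual assignment architecture:

```python
import torch
import torch.nn as nn

class PointDecoder(nn.Module):
    """Maps an image feature vector to an (n_points, 3) point cloud."""

    def __init__(self, feat_dim=512, n_points=5000):
        super().__init__()
        self.n_points = n_points
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            # output layer grows linearly with n_points
            nn.Linear(1024, n_points * 3),
            nn.Tanh(),  # keep predicted coordinates in [-1, 1]
        )

    def forward(self, feat):
        return self.mlp(feat).view(-1, self.n_points, 3)

feat = torch.randn(2, 512)
for n in (1000, 2000, 5000, 10000):
    out = PointDecoder(n_points=n)(feat)
    print(out.shape)  # (2, n, 3)
```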
Input RGB image:
Ground truth mesh:
Predicted PC with 1k points:
Predicted PC with 2k points:
Predicted PC with 5k points:
Predicted PC with 10k points:
Input RGB image:
Ground truth mesh:
Predicted PC with 1k points:
Predicted PC with 2k points:
Predicted PC with 5k points:
Predicted PC with 10k points:
Input RGB image:
Ground truth mesh:
Predicted PC with 1k points:
Predicted PC with 2k points:
Predicted PC with 5k points:
Predicted PC with 10k points:
| w_smooth | F1@0.05 |
| --- | --- |
| 0 | 92.901 |
| 0.5 | 92.196 |
| 1 | 91.587 |
| 5 | 91.446 |
Changing the w_smooth hyperparameter does not have a large effect on the F1 score. From the visualizations, we can see that low w_smooth values such as 0 produce too many sharp edges/abnormalities, while higher values seem to remove the larger edges/peaks/abnormalities in the mesh (although this is not entirely consistent). Further, in example 3, w_smooth=5 generates a shape that deviates significantly from the ground truth mesh. Therefore, a moderate value of w_smooth, say 0.1 or 0.5, seems to work best.
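The objective weighted by w_smooth combines a data term with a smoothness regularizer. Below is a simplified, unbatched stand-in for PyTorch3D's `mesh_laplacian_smoothing` using a uniform Laplacian; the chamfer term is passed in as a precomputed scalar, and all names are my own:

```python
import torch

def uniform_laplacian_loss(verts, edges):
    """Mean distance of each vertex from the centroid of its neighbours."""
    nbr_sum = torch.zeros_like(verts)
    deg = torch.zeros(verts.shape[0])
    for i, j in edges:
        nbr_sum[i] += verts[j]; deg[i] += 1
        nbr_sum[j] += verts[i]; deg[j] += 1
    lap = verts - nbr_sum / deg.clamp(min=1).unsqueeze(1)
    return lap.norm(dim=1).mean()

def mesh_loss(chamfer_term, verts, edges, w_smooth):
    # total objective: data term + weighted smoothness regularizer
    return chamfer_term + w_smooth * uniform_laplacian_loss(verts, edges)

# a jagged strip is penalized more than a flat one
edges = [(0, 1), (1, 2)]
flat = torch.tensor([[0., 0., 0.], [1., 0., 0.], [2., 0., 0.]])
jagged = torch.tensor([[0., 0., 0.], [1., 1., 0.], [2., 0., 0.]])
print(uniform_laplacian_loss(flat, edges) < uniform_laplacian_loss(jagged, edges))  # tensor(True)
```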
Input RGB image:
Ground truth mesh:
Predicted Mesh with w_smooth 0:
Predicted Mesh with w_smooth 0.5:
Predicted Mesh with w_smooth 1:
Predicted Mesh with w_smooth 5:
Input RGB image:
Ground truth mesh:
Predicted Mesh with w_smooth 0:
Predicted Mesh with w_smooth 0.5:
Predicted Mesh with w_smooth 1:
Predicted Mesh with w_smooth 5:
Input RGB image:
Ground truth mesh:
Predicted Mesh with w_smooth 0:
Predicted Mesh with w_smooth 0.5:
Predicted Mesh with w_smooth 1:
Predicted Mesh with w_smooth 5:
For a unique visualization, I feed the models atypical inputs: black images (i.e., no chair in the input image), image tensors filled entirely with ones, and rotated/inverted versions of one of the input images, and observe what each model outputs.
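Concretely, the probe inputs can be constructed as follows (the input resolution and value range are my assumptions about the dataloader, not taken from the assignment code):

```python
import torch

H = W = 137  # assumed input resolution; RGB values in [0, 1]
img = torch.rand(1, 3, H, W)  # stand-in for one real dataset image

black = torch.zeros(1, 3, H, W)               # no chair at all
white = torch.ones(1, 3, H, W)                # tensor filled with ones
rotated = torch.rot90(img, k=1, dims=(2, 3))  # 90-degree in-plane rotation
inverted = torch.flip(img, dims=(2,))         # upside-down (vertical flip)

for probe in (black, white, rotated, inverted):
    print(probe.shape)  # each is a valid (1, 3, H, W) input to the models
```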
For the given input:
we get the following outputs:
For model predicting voxels:
For model predicting points:
For model predicting meshes:
For the given input:
we get the following outputs:
For model predicting voxels:
For model predicting points:
For model predicting meshes:
For rotations of the form shown in this figure:
The outputs are as follows:
For model predicting voxels:
For model predicting points:
For model predicting meshes:
For inversions of the form shown in this figure:
The outputs are as follows:
For model predicting voxels:
For model predicting points:
For model predicting meshes:
The models predicting point clouds and meshes seem to have learned an intermediate, prototype-like representation of chairs, which is what they predict when fed black images. The voxel model, given a black image, outputs what looks like a chair (or rather part of one). On a white image, the voxel model predicts gibberish (or a part of a chair that is hard to recognize), whereas the other two models predict chairs that differ from their black-image counterparts. Ideally, the models should be sensitive to inversion, since they are trained only on upright chair images; yet none of them, except the voxel model, degenerates into gibberish, and they instead output reasonable chairs. The same holds for rotated input images. Surprisingly, the voxel model predicts a reasonable chair when fed a rotated image (which suggests its output on the upright black image was not mere gibberish either). Another interesting observation is that, for almost all images in this and the previous sections, the point cloud model's predictions are missing one of the chair's legs, indicating a mild bias in the model. Better training losses and hyperparameter tuning should help alleviate this problem.