Source voxel vs. ground truth voxel
Source point cloud vs. ground truth point cloud
Source mesh vs. ground truth mesh
Qualitative results of image2voxel on three examples.
Qualitative results of image2point_cloud on three examples.
Qualitative results of image2mesh on three examples are shown below.
Quantitative results of the three models are summarized in the table below:
|  | vox | point | mesh |
| --- | --- | --- | --- |
| Avg. F1@0.05 | 89.7964 | 96.6206 | 94.5228 |
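As a reference for how this metric is computed, here is a minimal sketch of F1@0.05 on two point sets, assuming pred_points and gt_points are (P, 3) tensors sampled from the predicted and ground-truth surfaces; the helper name f1_score_at_threshold is mine, not from the assignment code:

```python
import torch

def f1_score_at_threshold(pred_points, gt_points, tau=0.05):
    # Pairwise distances between predicted and ground-truth points: (P_pred, P_gt).
    dists = torch.cdist(pred_points, gt_points)
    # Precision: fraction of predicted points within tau of some ground-truth point.
    precision = (dists.min(dim=1).values < tau).float().mean()
    # Recall: fraction of ground-truth points within tau of some predicted point.
    recall = (dists.min(dim=0).values < tau).float().mean()
    # Report F1 as a percentage, matching the tables in this section.
    return (2 * precision * recall / (precision + recall + 1e-8)).item() * 100
```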
Among the three, the image2point_cloud model performs the best. It outperforms image2mesh, since the mesh model includes additional smoothing terms that regularize the predicted mesh at the cost of some geometric accuracy. image2vox is optimized with a binary cross-entropy loss on each voxel, and its output has to be converted into a mesh first so that points can be sampled from it for evaluation.
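For reference, below is a minimal sketch of the training objectives just described, assuming PyTorch3D's chamfer_distance and batched predictions; the point-cloud model is assumed to use a plain chamfer loss, and the tensor shapes are placeholders:

```python
import torch.nn.functional as F
from pytorch3d.loss import chamfer_distance

def voxel_loss(pred_logits, gt_voxels):
    # image2vox: binary cross-entropy on each voxel's occupancy.
    # pred_logits and gt_voxels are float tensors of shape (B, D, H, W).
    return F.binary_cross_entropy_with_logits(pred_logits, gt_voxels)

def point_loss(pred_points, gt_points):
    # image2point_cloud: chamfer distance between the predicted and
    # ground-truth point sets, both of shape (B, P, 3).
    loss, _ = chamfer_distance(pred_points, gt_points)
    return loss
```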
I analyze how w_chamfer affects the performance of image2mesh. The table below shows the quantitative comparison for w_chamfer = 0.1, 10, and 1000:
| w_chamfer | 0.1 | 10 | 1000 |
| --- | --- | --- | --- |
| Avg. F1@0.05 | 91.6955 | 94.5228 | 94.9293 |
We also show the qualitative comparison (top to bottom: w_chamfer = 0.1, 10, 1000).
When w_chamfer is lower, the average F1 score is worse. Without a sufficiently strong chamfer penalty on the sampled point clouds, the predicted mesh is inaccurate and has obvious artifacts. For instance, in the top row, the prediction has four odd triangular legs compared to the other predictions. Increasing w_chamfer to 1000 gives results similar to w_chamfer = 10, meaning that a small amount of smoothness regularization on the mesh is enough in this case.
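For context, here is a minimal sketch of the image2mesh objective that this experiment varies, assuming PyTorch3D's sample_points_from_meshes, chamfer_distance, and mesh_laplacian_smoothing; w_smooth and the number of sampled points are placeholder values, not the exact settings used above:

```python
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def mesh_loss(pred_mesh, gt_points, w_chamfer=10.0, w_smooth=1.0):
    # Chamfer term: compare points sampled from the predicted mesh
    # against points sampled from the ground-truth mesh (B, P, 3).
    pred_points = sample_points_from_meshes(pred_mesh, num_samples=5000)
    chamfer, _ = chamfer_distance(pred_points, gt_points)
    # Smoothness term: Laplacian regularizer on the predicted mesh.
    smooth = mesh_laplacian_smoothing(pred_mesh, method="uniform")
    # w_chamfer controls how strongly geometric accuracy is weighted
    # relative to smoothness (0.1, 10, and 1000 in the table above).
    return w_chamfer * chamfer + w_smooth * smooth
```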
First, I visualize the trained image2vox model by varying the isovalue threshold (isovalue = 0.1, 0.2, ..., 0.9) used when converting the predicted voxels into meshes:
| isovalue | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Avg. F1@0.05 | 68.875 | 80.779 | 86.250 | 89.050 | 90.089 | 88.151 | 84.506 | 76.218 | 60.773 |
The plot shows that performance peaks at isovalue = 0.5, so the threshold we chose during training and evaluation is reasonable.
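A minimal sketch of this sweep, assuming the network outputs a (D, H, W) occupancy probability grid and reusing the hypothetical f1_score_at_threshold helper sketched earlier; pytorch3d.ops.cubify is used here to binarize the grid at each isovalue before meshing, though a marching-cubes implementation would also work:

```python
from pytorch3d.ops import cubify, sample_points_from_meshes

def sweep_isovalues(pred_occupancy, gt_points,
                    isovalues=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    # pred_occupancy: (D, H, W) voxel probabilities from image2vox (after sigmoid).
    # gt_points: (P, 3) points sampled from the ground-truth mesh.
    scores = {}
    for iso in isovalues:
        # Threshold the occupancy grid at `iso` and turn it into a cuboid mesh.
        mesh = cubify(pred_occupancy[None], thresh=iso)
        # Sample surface points and score them against the ground truth using
        # the hypothetical f1_score_at_threshold helper from earlier.
        pred_points = sample_points_from_meshes(mesh, num_samples=5000)[0]
        scores[iso] = f1_score_at_threshold(pred_points, gt_points, tau=0.05)
    return scores
```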
Second, I examine the qualitative results by feeding different views of the same object to the model. For instance, below are the predictions of image2vox given different input views:
When the view is informative (e.g., the first and second rows in the first column), the input is less ambiguous and the predictions are visually more accurate.
However, when the input shows the side or the back of the sofa (e.g., the second and third rows in the second column), the model predicts a solid base for the sofa instead of a hollow one.
This makes sense, since a back or side view provides no information about what the sofa's base looks like. It also suggests that the model is not simply memorizing the 3D models.
I implement an implicit decoder based on occupancy networks, which takes a 3D location (its x, y, z coordinates) and outputs the occupancy value at that location. In particular, I encode the positions with the Fourier features proposed by Tancik et al.
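Below is a minimal sketch of such a decoder, assuming a random Gaussian frequency matrix as in Tancik et al.; the layer widths, sigma, and the number of frequencies are placeholder choices, and the conditioning on image features is omitted for brevity:

```python
import math
import torch
import torch.nn as nn

class FourierOccupancyDecoder(nn.Module):
    def __init__(self, num_frequencies=64, sigma=10.0, hidden_dim=256):
        super().__init__()
        # Random Gaussian frequencies B ~ N(0, sigma^2), kept fixed during training.
        self.register_buffer("B", torch.randn(3, num_frequencies) * sigma)
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_frequencies, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # one occupancy logit per query point
        )

    def forward(self, xyz):
        # xyz: (N, 3) query locations; gamma(x) = [sin(2*pi*x B), cos(2*pi*x B)].
        proj = 2 * math.pi * xyz @ self.B
        features = torch.cat([torch.sin(proj), torch.cos(proj)], dim=-1)
        # Occupancy in [0, 1] at each of the N query locations.
        return torch.sigmoid(self.mlp(features)).squeeze(-1)
```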
Here are some qualitative results:
The average F1 score is 84.7045. Unfortunately, I did not obtain better performance with this implementation. Two potential reasons are: (1) I did not successfully incorporate conditional batch normalization into the network, and (2) the network layers are not yet properly tuned due to time constraints.