16-889 Assignment 2: Single View to 3D

1. Exploring loss functions
1.1. Fitting a voxel grid (5 points)
The PyTorch3D cubify function is used to convert the fitted voxel grid into a cube mesh for visualization.
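The fitting objective here is presumably a binary cross-entropy between predicted occupancy logits and the target grid. A minimal sketch of fitting free per-voxel logits, where the grid contents, optimizer, and learning rate are my assumptions rather than the assignment's exact settings:

```python
import torch

# Hypothetical target: a random binary 32^3 occupancy grid.
target = (torch.rand(1, 32, 32, 32) > 0.5).float()
# Free per-voxel logits optimized directly (no network, just the loss term).
logits = torch.zeros(1, 32, 32, 32, requires_grad=True)

optimizer = torch.optim.Adam([logits], lr=0.1)
loss_fn = torch.nn.BCEWithLogitsLoss()

for _ in range(200):
    optimizer.zero_grad()
    loss = loss_fn(logits, target)
    loss.backward()
    optimizer.step()

# sigmoid(logits) now approaches the binary target; the resulting
# probabilities are what cubify would threshold into a cube mesh.
probs = torch.sigmoid(logits)
accuracy = ((probs > 0.5).float() == target).float().mean()
```

In the assignment the logits come from a decoder network rather than a free tensor, but the loss term is presumably the same.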
1.2. Fitting a point cloud (10 points)
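Point cloud fitting is driven by the chamfer loss (referenced again in section 2.4). A sketch of the symmetric squared-distance form, standing in for pytorch3d.loss.chamfer_distance; the exact reduction details are my assumption:

```python
import torch

def chamfer_distance(x, y):
    """Symmetric chamfer distance between point sets x (N, 3) and y (M, 3):
    mean squared distance from each point to its nearest neighbor in the
    other set, summed over both directions."""
    d2 = torch.cdist(x, y) ** 2                  # (N, M) pairwise squared distances
    return d2.min(dim=1).values.mean() + d2.min(dim=0).values.mean()

# Fitting: optimize free points toward a target cloud with this loss.
target = torch.rand(500, 3)
points = torch.randn(500, 3, requires_grad=True)
optimizer = torch.optim.Adam([points], lr=0.01)
for _ in range(100):
    optimizer.zero_grad()
    loss = chamfer_distance(points, target)
    loss.backward()
    optimizer.step()
```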
1.3. Fitting a mesh (5 points)
2. Reconstructing 3D from single view
2.1. Image to voxel grid (15 points)
To fit the voxel model, I used the following network architecture:
```python
nn.Sequential(
    nn.ConvTranspose3d(1, 6, kernel_size=1, stride=1),
    nn.GELU(),
    nn.ConvTranspose3d(6, 256, kernel_size=3, stride=3, padding=1),
    nn.GELU(),
    nn.ConvTranspose3d(256, 384, kernel_size=5, stride=1),
    nn.GELU(),
    nn.ConvTranspose3d(384, 256, kernel_size=7, stride=1),
    nn.GELU(),
    nn.ConvTranspose3d(256, 96, kernel_size=3, stride=1, padding=1),
    nn.GELU(),
    nn.ConvTranspose3d(96, 48, kernel_size=1, stride=1),
    nn.GELU(),
    nn.Conv3d(48, 1, kernel_size=1, stride=1),
)
```
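For this decoder to start from a single-channel volume, I assume the 512-dimensional encoder feature is reshaped into a 1x8x8x8 grid (512 = 8*8*8); the transposed convolutions then upsample it to the final 32^3 occupancy logits. A quick shape check of that assumption:

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(
    nn.ConvTranspose3d(1, 6, kernel_size=1, stride=1),
    nn.GELU(),
    nn.ConvTranspose3d(6, 256, kernel_size=3, stride=3, padding=1),
    nn.GELU(),
    nn.ConvTranspose3d(256, 384, kernel_size=5, stride=1),
    nn.GELU(),
    nn.ConvTranspose3d(384, 256, kernel_size=7, stride=1),
    nn.GELU(),
    nn.ConvTranspose3d(256, 96, kernel_size=3, stride=1, padding=1),
    nn.GELU(),
    nn.ConvTranspose3d(96, 48, kernel_size=1, stride=1),
    nn.GELU(),
    nn.Conv3d(48, 1, kernel_size=1, stride=1),
)

feat = torch.randn(2, 512)            # stand-in for encoder features
vox_in = feat.view(-1, 1, 8, 8, 8)    # assumed reshape: 512 -> 1x8x8x8
with torch.no_grad():
    logits = decoder(vox_in)
print(logits.shape)                   # torch.Size([2, 1, 32, 32, 32])
```

The spatial size goes 8 -> 22 -> 26 -> 32 through the strided and unpadded layers, then stays at 32 through the padded 3x3 and 1x1 layers, matching the 32x32x32 resolution discussed in section 2.4.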
2.2. Image to point cloud (15 points)
Point cloud architecture:
```python
nn.Sequential(
    nn.Linear(512, 1024),
    nn.GELU(),
    nn.Linear(1024, 2048),
    nn.GELU(),
    nn.Linear(2048, 4096),
    nn.GELU(),
    nn.Linear(4096, self.n_point * 3),
)
```
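The final linear layer emits n_point * 3 values, which I take to be reshaped into per-point xyz coordinates; only the layer sizes come from the architecture above, the reshape itself is my assumption:

```python
import torch
import torch.nn as nn

n_point = 1000  # one of the settings studied in section 2.5

head = nn.Sequential(
    nn.Linear(512, 1024), nn.GELU(),
    nn.Linear(1024, 2048), nn.GELU(),
    nn.Linear(2048, 4096), nn.GELU(),
    nn.Linear(4096, n_point * 3),
)

feat = torch.randn(2, 512)                  # stand-in for encoder features
points = head(feat).view(-1, n_point, 3)    # (B, n_point, 3) xyz coordinates
print(points.shape)                         # torch.Size([2, 1000, 3])
```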
2.3. Image to mesh (15 points)
Mesh architecture (similar to point cloud):
```python
nn.Sequential(
    nn.Linear(512, 1024),
    nn.GELU(),
    nn.Linear(1024, 2048),
    nn.GELU(),
    nn.Linear(2048, 4096),
    nn.GELU(),
    nn.Linear(4096, num_verts * 3),
)
```
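Unlike the point-cloud head, this head predicts num_verts * 3 values, which I assume are per-vertex offsets added to a fixed template mesh (for example a PyTorch3D ico_sphere, whose level-4 version has 2562 vertices); both the template and the offset interpretation are my assumptions:

```python
import torch
import torch.nn as nn

num_verts = 2562  # vertex count of a level-4 ico_sphere (assumed template)

head = nn.Sequential(
    nn.Linear(512, 1024), nn.GELU(),
    nn.Linear(1024, 2048), nn.GELU(),
    nn.Linear(2048, 4096), nn.GELU(),
    nn.Linear(4096, num_verts * 3),
)

template = torch.randn(num_verts, 3)           # stand-in for template vertices
feat = torch.randn(2, 512)
offsets = head(feat).view(-1, num_verts, 3)    # per-vertex displacement
deformed = template.unsqueeze(0) + offsets     # (B, num_verts, 3) deformed mesh
```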
2.4. Quantitative comparisons (10 points)
| | Voxel | Point Cloud | Mesh |
| --- | --- | --- | --- |
| F1@0.05 | 71.103 | 96.692 | 96.831 |
Explanation: from the F1 scores, we can see that the mesh and point cloud models have similar performance, while the voxel representation has the lowest score. This is corroborated by visual inspection, where the voxel model gives the worst results.
This is likely a result of the relatively low voxel resolution (32x32x32), at which the model cannot represent fine geometry. In addition, the chamfer loss used for the point cloud and mesh representations may be more effective and flexible in conforming to the 3D structure of the target meshes.
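The F1@0.05 metric is presumably the standard 3D reconstruction F-score: sample points from the prediction and the ground truth, call a point a match if the other set has a point within distance 0.05, and combine precision and recall. A sketch under that assumption (the assignment's exact sampling and reduction may differ):

```python
import torch

def f_score(pred, gt, threshold=0.05):
    """F1 score (in percent) between point sets pred (N, 3) and gt (M, 3)."""
    d = torch.cdist(pred, gt)                                    # (N, M) distances
    precision = (d.min(dim=1).values < threshold).float().mean()
    recall = (d.min(dim=0).values < threshold).float().mean()
    return (100.0 * 2 * precision * recall / (precision + recall + 1e-8)).item()
```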
2.5. Analyzing the effects of hyperparameter variations (10 points)
n_points - point cloud
For this study, we vary the number of points in the point cloud.
| n_points | 1000 | 5000 | 10000 |
| --- | --- | --- | --- |
| Point cloud F1@0.05 | 92.6 | 96.6 | 96.5 |
Samples:
To study the effect of n_points, I trained three models varying the number of points in the point cloud without changing the model size. Intuitively, more points give the model more freedom and flexibility to fit a mesh, though they put more strain on the learning process and the network. Fewer points may help prevent overfitting and capture high-level structures better.
From both the quantitative and qualitative results, we can see that using fewer points (1000) makes the model perform slightly worse due to its reduced modeling capacity, while using 10000 points does not make a significant difference, showing that the network's modeling power has not yet saturated. However, the 10000-point model takes significantly more time to train and evaluate than the 1000-point model.
w_smooth - mesh
For this test, we keep w_chamfer the same and vary w_smooth.
| w_smooth | 0.01 | 0.1 | 1 |
| --- | --- | --- | --- |
| Mesh F1@0.05 | 93.122 | 96.831 | 86.8 |
From the F1 scores, we can see that the weight w_smooth is quite important: a smoothness weight that is too large or too small causes the model to learn differently. Intuitively, too strong a smoothness constraint makes the predicted mesh overly smooth, while too weak a constraint leaves it jagged and irregular. Both cases should hurt the F1 score.
The results confirm this: with a weight of 0.01, the mesh becomes more jagged and irregular than with 0.1, while with a weight of 1, the mesh becomes overly smooth as expected, and the performance also degrades.
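The objective being traded off here is, presumably, a weighted sum of the form loss = w_chamfer * chamfer + w_smooth * smooth, where the smoothness term penalizes the mesh Laplacian (PyTorch3D provides mesh_laplacian_smoothing for this). A simplified uniform-Laplacian sketch; the exact weighting scheme and edge handling are my assumptions:

```python
import torch

def uniform_laplacian_smoothing(verts, edges):
    """Mean norm of the uniform Laplacian: how far each vertex sits from the
    average of its neighbors. verts: (V, 3) float; edges: (E, 2) long tensor
    of undirected edge indices."""
    V = verts.shape[0]
    nbr_sum = torch.zeros_like(verts)
    deg = torch.zeros(V, 1)
    ones = torch.ones(edges.shape[0], 1)
    # accumulate neighbor positions and degrees in both edge directions
    nbr_sum.index_add_(0, edges[:, 0], verts[edges[:, 1]])
    nbr_sum.index_add_(0, edges[:, 1], verts[edges[:, 0]])
    deg.index_add_(0, edges[:, 0], ones)
    deg.index_add_(0, edges[:, 1], ones)
    lap = nbr_sum / deg.clamp(min=1) - verts   # Laplacian vector per vertex
    return lap.norm(dim=1).mean()

# Hypothetical total objective, with w_smooth the weight varied above:
# loss = w_chamfer * chamfer_loss(pred_points, gt_points) \
#        + w_smooth * uniform_laplacian_smoothing(verts, edges)
```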
2.6. Interpret your model (15 points)
Study of the effect of different views on prediction accuracy.
Since the dataset contains multiple views of the same instance, we study which views yield the highest accuracy.
Instance ID: 395868f297f01d2d492d9da2668ec34c
From this instance, we can see that the model performs best given the frontal or back view of the chair, while the side views provide less information. Notice that when provided with the frontal view (the second and seventh items), the model can create reconstructions with holes in the front, since that region is directly observed.
Instance ID: bf01483d8b58f0819767624530e7fce3
From this instance, we can see that the best performance occurs on the side views, where most of the chair can be seen.
Instance ID: 3925bf96c05c49d362e682c9809bff14
For this instance, we can see that the views with the most information (side and frontal) give the best performance, while the back view has the worst performance, since the seat cannot be seen from it.
Overall, the chosen view plays an important role in reconstruction performance: views that capture more geometric information yield better reconstruction results.