Question 2: Reconstructing 3D from a single view¶
For each of the 3D reconstruction methods, I provide the input RGB image and its rendered GIF, followed by an image and a GIF of the resulting mesh/pointcloud.
Image to Voxel Grid¶
- Loss and Learning Rate graphs:
Image to Point Cloud¶
- Loss and Learning Rate graphs:
Image to Mesh¶
- Loss and Learning Rate graphs:
Quantitative Comparisons¶
- Image to Voxel Grid: this architecture achieves an average F1@0.05 score of 0.73
- Image to Point Cloud: this architecture achieves an average F1@0.05 score of 0.81
- Image to Mesh: this architecture achieves an average F1@0.05 score of 0.78
Through these experiments and results, I observe that point clouds are the most effective representation for the single-view-to-3D task, as evidenced by the highest F1@0.05 score (the harmonic mean of precision and recall, where a predicted or ground-truth point counts as matched if it lies within 0.05 units of a point in the other set). Moreover, this model was the fastest to train, both in wall-clock time and in the total number of epochs. I use the F1 score as the metric for quantitative comparison because the losses (Chamfer, smoothness and binary cross-entropy) all have different output scales.
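For concreteness, here is a minimal sketch of how an F1@0.05-style metric can be computed between a predicted and a ground-truth point set. It is not my exact evaluation code; the function name and the use of `torch.cdist` are illustrative, and the score can be multiplied by 100 to match the percentage values reported later.

```python
import torch

def f1_score(pred_points: torch.Tensor, gt_points: torch.Tensor, threshold: float = 0.05):
    """F1 between two point sets of shape (N, 3) and (M, 3).

    Precision: fraction of predicted points within `threshold` of some GT point.
    Recall:    fraction of GT points within `threshold` of some predicted point.
    F1 is their harmonic mean.
    """
    dists = torch.cdist(pred_points, gt_points)  # (N, M) pairwise distances

    precision = (dists.min(dim=1).values < threshold).float().mean()
    recall = (dists.min(dim=0).values < threshold).float().mean()

    return 2 * precision * recall / (precision + recall + 1e-8)
```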
From the understanding gained while doing this assignment and in class, I would attribute this performance disparity to the following reasons:
The voxel grid prediction is inherently limited by resolution: the 32 × 32 × 32 discrete structure of the output means that we cannot represent the finer details of the ground-truth mesh. Additionally, the marching cubes algorithm used to convert the voxels to a mesh introduces some noise, since in 3D the lookup table is ambiguous for certain configurations of occupied and unoccupied voxels.
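As a reference for this conversion step, below is a minimal sketch of turning a predicted 32 × 32 × 32 occupancy grid into a triangle mesh with marching cubes. It uses scikit-image's implementation purely for illustration; the exact conversion in my pipeline may differ.

```python
import torch
from skimage import measure

def voxels_to_mesh(voxel_logits: torch.Tensor, iso_level: float = 0.5):
    """Convert a predicted (32, 32, 32) occupancy grid to a triangle mesh.

    `voxel_logits` are raw network outputs; the sigmoid probabilities are
    thresholded at `iso_level` when extracting the isosurface.
    """
    occupancy = torch.sigmoid(voxel_logits).detach().cpu().numpy()

    # Marching cubes is ambiguous for certain occupancy patterns, which is one
    # source of noise in the reconstructed surface.
    verts, faces, _normals, _values = measure.marching_cubes(occupancy, level=iso_level)

    # Rescale vertex coordinates from voxel indices to the unit cube.
    verts = verts / (occupancy.shape[0] - 1)
    return verts, faces
```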
A pointcloud is inherently an unordered structure. This allows us to process each point individually, or in our case, produce a representation for each point individually via linear layers, which have no spatial inductive bias and can therefore capture global information. This gives exceptional flexibility in what we can represent, and this is reflected in the highest F1 score.
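A minimal sketch of this kind of decoder is given below, assuming a pretrained image encoder produces a 512-dimensional global feature; the layer sizes and number of points are illustrative, not my exact architecture.

```python
import torch
import torch.nn as nn

class PointCloudDecoder(nn.Module):
    """Map a global image feature to an unordered set of 3D points."""

    def __init__(self, feat_dim: int = 512, n_points: int = 5000):
        super().__init__()
        self.n_points = n_points
        # Plain linear layers: no spatial inductive bias, so every output point
        # is predicted from the full global feature.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 2048),
            nn.ReLU(),
            nn.Linear(2048, n_points * 3),
            nn.Tanh(),  # keep predictions inside the unit cube
        )

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # image_feat: (B, feat_dim) -> points: (B, n_points, 3)
        return self.mlp(image_feat).view(-1, self.n_points, 3)
```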
Representation via mesh regression is inherently limited by the number of vertices and faces of the mesh one starts out with. This also fixes the connectivity of faces in the mesh, which becomes an issue if the desired mesh does not share the same connectivity as the source mesh. This explicit constraint on what a mesh can and cannot represent leads to artifacts in some of the test cases, but overall this method trains reasonably quickly and achieves a high F1 score.
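The sketch below illustrates why the output is tied to the source mesh's vertex count and connectivity: the network only regresses per-vertex offsets of a fixed template (an icosphere here), while the faces are reused unchanged. Layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn
from pytorch3d.utils import ico_sphere

class MeshDecoder(nn.Module):
    """Deform a fixed template mesh by predicting per-vertex offsets."""

    def __init__(self, feat_dim: int = 512, ico_level: int = 4):
        super().__init__()
        self.template = ico_sphere(ico_level)  # fixed vertex count and connectivity
        n_verts = self.template.verts_packed().shape[0]
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, n_verts * 3),
            nn.Tanh(),
        )

    def forward(self, image_feat: torch.Tensor):
        # Predicted offsets are added to the template vertices; the faces are
        # reused as-is, so the output can never change topology.
        batch = image_feat.shape[0]
        offsets = self.mlp(image_feat).view(batch, -1, 3)
        meshes = self.template.extend(batch).to(image_feat.device)
        return meshes.offset_verts(offsets.reshape(-1, 3))
```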
Effect of Hyperparameter variation¶
To analyse the effect of hyperparameter variation on mesh generation, I varied the weights given to the Chamfer and smoothness losses. The models here were trained using a cosine annealing learning rate scheduler. For this section, the Chamfer loss implementation I used took the mean of the distances between points instead of summing them, so that the Chamfer loss would be on the same scale as the smoothing loss and the two could be weighted against each other.
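A minimal sketch of this weighted objective and scheduler is shown below. It assumes the smoothness term is pytorch3d's `mesh_laplacian_smoothing` and uses `chamfer_distance` (which averages point distances by default); `model` and `loader` are placeholders, and the weights shown are one of the ablation settings listed after this code.

```python
import torch
from pytorch3d.loss import chamfer_distance, mesh_laplacian_smoothing
from pytorch3d.ops import sample_points_from_meshes

def train_mesh(model, loader, max_epochs, w_chamfer=0.2, w_smooth=0.8):
    optimizer = torch.optim.Adam(model.parameters(), lr=4e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_epochs)

    for _epoch in range(max_epochs):
        for images, gt_meshes in loader:
            pred_meshes = model(images)

            # chamfer_distance averages point distances by default, keeping it
            # on a similar scale to the smoothing term so the weights are meaningful.
            pred_pts = sample_points_from_meshes(pred_meshes, num_samples=5000)
            gt_pts = sample_points_from_meshes(gt_meshes, num_samples=5000)
            loss_chamfer, _ = chamfer_distance(pred_pts, gt_pts)
            loss_smooth = mesh_laplacian_smoothing(pred_meshes, method="uniform")

            loss = w_chamfer * loss_chamfer + w_smooth * loss_smooth
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
```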
Weight Smoothing: 1.0, Weight Chamfer: 0.0
- Final F1@0.05: 71.0
Weight Smoothing: 0.8, Weight Chamfer: 0.2
- Final F1@0.05: 73.7
Weight Smoothing: 0.5, Weight Chamfer: 0.5
- Final F1@0.05: 72.5
I notice that while the results are quantitatively similar, the effect of the smoothness loss is visible in the rendered 3D GIFs. According to the F1 metric, however, there is little to choose between these models. That is likely because at the 0.05 threshold only the general structure is captured while the finer details are missing, so there is also little to distinguish the three ablations visually.
Model Interpretation¶
Having skimmed the suggested readings mentioned in the course schedule for this topic, as well as the papers referenced for parametric and implicit networks, I think that visualising the model's predictions at various stages of training (as given by the number of epochs elapsed) gives a decent idea of what the model is learning. I evaluate the pix2point model at various stages of training (ideally I would evaluate all three models, but compute constraints prevented this).
- pix2point at various checkpoints
@200 epochs¶
@1200 epochs¶
@2200 epochs¶
@5700 epochs¶
Through these visualizations, I can see how over many iterations the structure of the points regresses towards the pointcloud sampled from the desired mesh. While we get a good representation very quickly, the model appears to concentrate mass in the center of the unit cube first and only then focuses on refining texture and details. It must be noted, however, that the pointcloud representation is not the best choice for observing fine-grained features. Unfortunately, I had to train this section of the assignment on a CPU after facing countless issues compiling pytorch3d with GPU support, and I ran into several AWS issues as well (such as running out of memory and having to wait over a day for a memory quota increase and over five days for a G-instance quota increase).
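For reference, a minimal sketch of how the checkpoint visualisations above could be produced. It assumes checkpoints were written with `torch.save(model.state_dict(), ...)` during training and that the model maps a batched image to a point cloud; the directory layout and epoch list are illustrative.

```python
import torch

# Illustrative checkpoint epochs matching the visualisations above.
CHECKPOINT_EPOCHS = [200, 1200, 2200, 5700]

@torch.no_grad()
def predict_at_checkpoints(model, image, checkpoint_dir="checkpoints"):
    """Load saved weights at several training stages and predict a point cloud
    for the same input image, so the stages can be rendered side by side.

    Assumes checkpoints were saved as f"{checkpoint_dir}/epoch_{epoch}.pth".
    """
    predictions = {}
    for epoch in CHECKPOINT_EPOCHS:
        state = torch.load(f"{checkpoint_dir}/epoch_{epoch}.pth", map_location="cpu")
        model.load_state_dict(state)
        model.eval()
        predictions[epoch] = model(image.unsqueeze(0)).squeeze(0)  # (n_points, 3)
    return predictions
```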
Question 3: Exploring¶
Implicit Network¶
- Loss and Learning Rate graphs:
Unfortunately, the final F1 score was rather poor: I only obtained an F1@0.05 score of 0.41.
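For clarity, a minimal sketch of the implicit decoder idea: an occupancy-network-style MLP that is conditioned on a global image feature and predicts occupancy for arbitrary 3D query points. The hidden sizes and names are illustrative, not my exact architecture; at evaluation time the decoder is queried on a dense grid and the isosurface is extracted with marching cubes for visualisation.

```python
import torch
import torch.nn as nn

class ImplicitDecoder(nn.Module):
    """Predict occupancy for 3D query points conditioned on an image feature."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, image_feat: torch.Tensor, query_points: torch.Tensor):
        # image_feat: (B, feat_dim), query_points: (B, N, 3)
        B, N, _ = query_points.shape
        feat = image_feat.unsqueeze(1).expand(B, N, image_feat.shape[-1])
        x = torch.cat([feat, query_points], dim=-1)
        return self.mlp(x).squeeze(-1)  # occupancy logits, shape (B, N)
```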
Parametric Network¶
- Loss and Learning Rate graphs:
As was the case with the implicit network, my parametric network (which follows a structure similar to AtlasNet) achieved a low loss, but on evaluation the F1@0.05 score was only 0.38 and the resulting render is not of high quality. For the sake of visualization, I converted the representation produced by each of these two approaches to a mesh and visualized that.
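A minimal sketch of the AtlasNet-style idea follows: small MLPs map 2D samples from unit-square patches, concatenated with the image feature, to points on the 3D surface. The number of patches, layer sizes and names are illustrative assumptions, not my exact architecture.

```python
import torch
import torch.nn as nn

class ParametricDecoder(nn.Module):
    """Map 2D samples from K unit-square patches to points on the 3D surface."""

    def __init__(self, feat_dim: int = 512, n_patches: int = 4, hidden: int = 256):
        super().__init__()
        self.patch_mlps = nn.ModuleList(
            nn.Sequential(
                nn.Linear(feat_dim + 2, hidden),
                nn.ReLU(),
                nn.Linear(hidden, hidden),
                nn.ReLU(),
                nn.Linear(hidden, 3),
                nn.Tanh(),
            )
            for _ in range(n_patches)
        )

    def forward(self, image_feat: torch.Tensor, n_samples_per_patch: int = 1024):
        # image_feat: (B, feat_dim); fresh 2D UV samples are drawn per patch
        # on every forward pass, so the surface can be sampled at any density.
        B = image_feat.shape[0]
        points = []
        for mlp in self.patch_mlps:
            uv = torch.rand(B, n_samples_per_patch, 2, device=image_feat.device)
            feat = image_feat.unsqueeze(1).expand(B, n_samples_per_patch, -1)
            points.append(mlp(torch.cat([feat, uv], dim=-1)))
        return torch.cat(points, dim=1)  # (B, n_patches * n_samples, 3)
```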
Extended Dataset for Training¶
- Results:
It is noteworthy that the highest F1 score was achieved when the entire dataset was used to train the mesh prediction network. I used the cosine annealing scheduler with an initial learning rate of 4e-4, a Chamfer weight of 0.6 and a smoothing weight of 0.4. This result indicates that the networks used have far more capacity than is needed to learn a single-class representation. Moreover, seeing improved results on some of the chair renders leads me to believe that my initial models probably overfit the single-class dataset, leading to low loss values but mostly average reconstructions.
- Loss and Learning Rate graphs:
- Some good visualizations:
- Some poor visualizations: