16-889 HW5, PointNet

Late Days used - 0



Part 1

The final train loss oscillates around 1.5, with a best test accuracy of 97.48% at epoch 206. Test accuracy reaches roughly 0.95 within the first 10 epochs. Below I display a few predictions that the network gets correct: one example each of a chair, a vase, and a lamp.


With 97.48% accuracy, the network does pretty well, but here are some examples of objects that it gets wrong.

Predicts: Lamp, Correct: Vase
Predicts: Vase, Correct: Lamp
Predicts: Vase, Correct: Chair

Overall, I believe the first failure case is reasonable, as certain floor lamps look quite similar to this vase / tree. Additionally, the plant is significantly larger than the vase containing it. Viewing the second correct example above, we can see some similarities to correctly classified lamps.
The second failure case is less reasonable, as we know vases must contain plants, and the shape of the lamp is sparse and unable to contain a plant. However, based on the symmetrical, upside-down cone-like shape, it's possible that if the point cloud were sparse, it could pass for a vase.
Finally, the last failure case looks like an out-of-domain issue - the chair is clearly folded, and it's unlikely many chairs have similar representations. There is no place to sit on the chair, and it appears to have only two legs. Without knowing a priori that it's a folded chair, it's not clear to me what object it represents either (I would guess a section of a fence).
Given the large amount of intra-class variety across these three classes, I think 97.48% is a good accuracy, and most failure cases are reasonable given the representation and the data we have.
I believe adding features like color could help classification pretty significantly, particularly for the vase class, where we expect green-colored plants.
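As a rough sketch of what this would entail (the layer widths are illustrative and the RGB input is hypothetical, since the provided point clouds are XYZ-only), the main architectural change is simply widening the first shared-MLP layer from 3 to 6 input channels:

        import torch
        import torch.nn as nn

        class PointNetCls(nn.Module):
            # PointNet-style classifier: shared per-point MLP -> max-pool -> FC head.
            # in_dim=3 for XYZ only; in_dim=6 if per-point RGB is concatenated.
            def __init__(self, in_dim=3, num_classes=3):
                super().__init__()
                self.point_mlp = nn.Sequential(
                    nn.Linear(in_dim, 64), nn.ReLU(),
                    nn.Linear(64, 128), nn.ReLU(),
                    nn.Linear(128, 1024),
                )
                self.head = nn.Sequential(
                    nn.Linear(1024, 256), nn.ReLU(),
                    nn.Linear(256, num_classes),
                )

            def forward(self, points):                 # points: (B, N, in_dim)
                feats = self.point_mlp(points)         # per-point features
                global_feat = feats.max(dim=1).values  # symmetric pooling over points
                return self.head(global_feat)          # (B, num_classes) logits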

Part 2

The final train loss is ~12.5 at epoch 250 and still decreasing, although test accuracy reaches slightly above 0.89 within the first 100 epochs. The final best test accuracy is 90.21% at epoch 214. Below we include 3 examples of good predictions with high accuracy. In each pair of gifs, the first is the ground truth and the second is the prediction.

Accuracy: 98.0%
Accuracy: 96.6%
Accuracy: 98.6%

We can clearly see that all of these chairs are very regular - each has a well-defined base, back, and legs. Because of the way their parts are joined, there is little ambiguity about which part each point belongs to.

Below we include 2 examples of medium accuracy predictions, between 80% and 90%.

Accuracy: 81.8%
Accuracy: 87.1%

These chairs are much less regular, with more parts in more complex designs. There is some inherent ambiguity - for example, in the 81.8% accuracy example, the ground-truth labeling of the seat extends all the way to the ground, which is not how I would personally label that part. In the second chair, there are no legs and the seat extends to the ground, which confuses the network into predicting legs.

Below we include 2 examples of poor accuracy predictions, under 60%.

Accuracy: 48.6%
Accuracy: 55.0%

The worst-scoring example is 48.6%. I believe this is due to a large amount of label ambiguity: the arms extend to the ground, and there are no legs. I think the network's predictions are actually very reasonable, as the back, seat, arms, and legs it predicts are what I would personally label the chair with. The second case, at 55.0%, is also very difficult, as it's not clear whether the arms should be labeled as arms or as part of the sofa. Overall, the network seems to provide semantically reasonable predictions, with a significant amount of label ambiguity arising from the diverse nature of chairs.
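For reference, the per-shape accuracies quoted above are presumably just the fraction of points whose predicted part label matches the ground truth; a minimal sketch, assuming per-point logits of shape (N, num_parts) and integer labels of shape (N,):

        import torch

        def seg_accuracy(logits, labels):
            # logits: (N, num_parts) per-point scores, labels: (N,) ground-truth part ids
            preds = logits.argmax(dim=-1)
            return (preds == labels).float().mean().item()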

Part 3

I first experiment with sparsity, reducing the number of points per model. This is done with the command-line flag --num_points, sweeping from 10,000 (the setting used for the accuracies reported above) down to as few as 50 points. Results are reported below.

Number of Points    Classification Accuracy (%)    Segmentation Accuracy (%)
10,000              97.48                          90.21
8,000               97.38                          90.22
6,000               97.38                          90.25
4,000               97.38                          90.23
2,000               97.17                          90.23
500                 96.22                          88.55
200                 95.49                          85.26
50                  76.92                          79.76

We see a remarkable result: for our classification network, we can drop the number of points to 2,000 (1/5 of the original) without an appreciable decrease in performance, and accuracy remains high down to 200 points.

For our segmentation network, we similarly retain accuracy down to 2,000 points and only see a noticeable drop at 500. In fact, from 10,000 down to 2,000 points, accuracy stays essentially constant (within about 0.05%).

This suggests that our network only needs a small number of points to reason about the structure of an object - most of the information is contained in a few "core" points, something that PointNet++ capitalizes on.
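For completeness, evaluating at a reduced point budget amounts to randomly subsampling each cloud before running the network; a minimal sketch of that kind of subsampling (the exact sampling behind --num_points may differ):

        import torch

        def subsample_points(points, labels, num_points):
            # points: (B, N, 3); labels: (B, N) per-point labels, or None for classification.
            idx = torch.randperm(points.shape[1], device=points.device)[:num_points]
            sub_points = points[:, idx]
            sub_labels = labels[:, idx] if labels is not None else None
            return sub_points, sub_labels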

I now experiment with deviations from the canonical orientation by rotating the objects. I report accuracy as a function of the rotation angle - note that at a 90-degree rotation, certain symmetrical objects will look identical. I first rotate the objects around the z-axis (keeping them level, as if they were spinning on the floor about their base).

The code for rotation is straightforward, especially with pytorch3d, which creates the 3D rotation matrix for us. I include it below for completeness.

        
        import torch
        import pytorch3d.transforms
        from math import pi

        def rotate_points_batch(points, degrees):
            # points: (B, N, 3); rotate every cloud in the batch by the same angle.
            radians = degrees / 180 * pi
            # Change which Euler angle is nonzero to rotate about a different axis.
            rot_mat = pytorch3d.transforms.euler_angles_to_matrix(
                torch.tensor([0.0, radians, 0.0]), "XYZ"
            )
            return points.matmul(rot_mat)
        
    

To ensure correctness, I include two renderings of the same object, before and after rotating by 45 degrees.

The results are below:

Rotation around Z-Axis (degrees)    Classification Accuracy (%)    Segmentation Accuracy (%)
0                                   97.48                          90.21
30                                  94.44                          81.19
60                                  62.01                          73.41
90                                  66.84                          67.52

I am surprised to find that the accuracy is severely impacted, considering that many of the objects are fairly symmetrical - it's clear that orientation provides a strong cue, and without rotation augmentation, performance on non-canonical orientations is poor. I'm curious whether the impact is even more severe for rotation around the y-axis, as this would be very clearly out of domain, even for symmetrical objects.

Rotation around Y-Axis (degrees)    Classification Accuracy (%)    Segmentation Accuracy (%)
0                                   97.48                          90.21
30                                  80.06                          72.95
60                                  40.29                          57.25
90                                  35.57                          45.64

Clearly, the symmetry of the objects does help performance somewhat under rotation around the z-axis, as rotation around the y-axis performs even worse. A separate network that predicts a transformation to a canonical frame would likely help a lot, as would adding a T-Net to the architecture together with rotation augmentation during training.
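A minimal sketch of such a T-Net, in the spirit of the input transform from the original PointNet paper (layer widths are illustrative): a small point network regresses a 3x3 matrix, initialized to the identity, that re-orients the cloud before the main network sees it.

        import torch
        import torch.nn as nn

        class TNet3(nn.Module):
            # Predicts a 3x3 transform applied to the input cloud, initialized to identity.
            def __init__(self):
                super().__init__()
                self.point_mlp = nn.Sequential(
                    nn.Linear(3, 64), nn.ReLU(),
                    nn.Linear(64, 256), nn.ReLU(),
                )
                self.fc = nn.Linear(256, 9)
                nn.init.zeros_(self.fc.weight)            # start at the identity transform
                with torch.no_grad():
                    self.fc.bias.copy_(torch.eye(3).flatten())

            def forward(self, points):                             # points: (B, N, 3)
                feats = self.point_mlp(points).max(dim=1).values   # (B, 256) global feature
                transform = self.fc(feats).view(-1, 3, 3)          # (B, 3, 3)
                return points.bmm(transform)                       # re-oriented cloud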