This assignment was submitted 2 days late.
I implemented the PointNet model, complete with batch norm and dropout. Here's my final test accuracy:
| Metric | Value |
| --- | --- |
| Test Accuracy | 0.97 |
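For reference, below is a minimal sketch of this kind of classifier in PyTorch. The layer widths, dropout rate, and `num_classes=3` are illustrative assumptions rather than the exact configuration used here.

```python
import torch
import torch.nn as nn

class PointNetCls(nn.Module):
    """Minimal PointNet-style classifier: shared per-point MLP -> max pool -> FC head."""

    def __init__(self, num_classes=3):
        super().__init__()
        # Shared MLP applied to every point independently (1x1 Conv1d over the point axis).
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        # Classification head on the pooled global feature, with batch norm and dropout.
        self.head = nn.Sequential(
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):
        # points: (B, N, 3) -> (B, 3, N) so the Conv1d layers act per point.
        feats = self.point_mlp(points.transpose(1, 2))  # (B, 1024, N)
        global_feat = feats.max(dim=2).values           # (B, 1024), symmetric max pool
        return self.head(global_feat)                   # (B, num_classes) logits
```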
Here are some examples of correctly classified point clouds:
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() |
| --- | --- | --- | --- | --- | --- |
| Pred. Label | Chair (0) | Lamp (2) | Chair (0) | Lamp (2) | Vase (1) |
Here are some examples of incorrectly classified point clouds:
| Point Cloud | ![]() | ![]() | ![]() | ![]() | ![]() |
| --- | --- | --- | --- | --- | --- |
| Pred. Label | Lamp (2) | Lamp (2) | Lamp (2) | Vase (1) | Lamp (2) |
In general, the incorrect predictions are fairly out-of-distribution for their class: the objects misclassified as lamps are tall and thin, and the ones misclassified as vases are short and squat. This makes sense for this model, since it cannot assess local features and effectively classifies based on the overall distribution of points in space.
I implemented the PointNet segmentation network, as described in the appendix of the original paper. My final performance is:
| Metric | Value |
| --- | --- |
| Test Accuracy | 0.90 |
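Roughly, the segmentation variant concatenates each point's intermediate feature with the pooled global feature and runs another shared MLP to produce per-point logits. Below is a minimal sketch, with illustrative layer widths and an assumed `num_seg_classes=6` (not necessarily my exact configuration):

```python
import torch
import torch.nn as nn

class PointNetSeg(nn.Module):
    """Minimal PointNet-style segmentation net: per-point features + global feature -> per-point logits."""

    def __init__(self, num_seg_classes=6):
        super().__init__()
        # Shared per-point MLP producing the local (per-point) features.
        self.local_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
        )
        # Deeper shared MLP whose output gets max-pooled into the global feature.
        self.global_mlp = nn.Sequential(
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        # Per-point head over the concatenated [local (64) || global (1024)] features.
        self.seg_head = nn.Sequential(
            nn.Conv1d(64 + 1024, 512, 1), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Conv1d(512, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, num_seg_classes, 1),
        )

    def forward(self, points):
        x = points.transpose(1, 2)                                   # (B, 3, N)
        local_feat = self.local_mlp(x)                                # (B, 64, N)
        global_feat = self.global_mlp(local_feat)                     # (B, 1024, N)
        global_feat = global_feat.max(dim=2, keepdim=True).values     # (B, 1024, 1)
        global_feat = global_feat.expand(-1, -1, x.shape[2])          # broadcast to every point
        fused = torch.cat([local_feat, global_feat], dim=1)           # (B, 1088, N)
        return self.seg_head(fused).transpose(1, 2)                   # (B, N, num_seg_classes) logits
```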
Here are some examples of good performance:
| ID | Ground Truth | Prediction | Accuracy |
| --- | --- | --- | --- |
| 0 | ![]() | ![]() | 0.94 |
| 1 | ![]() | ![]() | 0.98 |
| 2 | ![]() | ![]() | 0.88 |
And here are some bad examples:
| ID | Ground Truth | Prediction | Accuracy |
| --- | --- | --- | --- |
| 163 | ![]() | ![]() | 0.61 |
| 351 | ![]() | ![]() | 0.48 |
In general, the segmentation works best when the chairs have a 'regular' shape, that is, chairs with well-defined sides, backs, seats, and legs. Because the PointNet model can only model global features, the boundaries of these segmentations are not clean or accurate; for instance, on object 0 the model predicts that the tops of the legs are part of the seat, which is not an unreasonable prediction given the geometry alone. The poorly performing objects are very irregular: one has a detached ottoman, and the other has an odd cylindrical shape. Because the network does not use local features, it cannot really handle these outliers.
The experiment is as follows: I rotate every object around the Z axis by a fixed angle and evaluate how performance changes as that angle increases.
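Concretely, the rotation can be applied with a standard rotation matrix about Z. A minimal sketch (the helper name `rotate_z` is just for illustration):

```python
import math
import torch

def rotate_z(points, degrees):
    """Rotate a batch of point clouds of shape (B, N, 3) about the Z axis by `degrees`."""
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    rot = torch.tensor([[c, -s, 0.0],
                        [s,  c, 0.0],
                        [0.0, 0.0, 1.0]], dtype=points.dtype, device=points.device)
    # Row-vector convention: each point p is mapped to p @ R^T.
    return points @ rot.T
```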
| Rotation (degrees) | 0° | 15° | 30° | 60° | 90° | 180° |
| --- | --- | --- | --- | --- | --- | --- |
| Classification accuracy | 0.97 | 0.90 | 0.44 | 0.19 | 0.23 | 0.44 |
| Segmentation accuracy | 0.90 | 0.83 | 0.70 | 0.52 | 0.43 | 0.34 |
As expected, performance was dramatically worse when the test objects were rotated. The training data share a single canonical orientation, the architecture is not rotation invariant, and the global features it learns are tied to the absolute coordinates of that canonical pose, so a rotation shifts the entire point distribution the network was trained on. Performance therefore dropped off a cliff, and at large angles classification accuracy fell to roughly random chance (or worse than random!).
The experiment is as follows: in each trial I sample n points from every object and vary n across trials. The idea is to see how important point density is to each architecture's accuracy when it was trained on the full number of points per object.
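A minimal sketch of the subsampling step, drawing n indices per cloud without replacement (the helper name `subsample_points` is illustrative):

```python
import torch

def subsample_points(points, n):
    """Randomly keep n points from each cloud in a (B, N, 3) batch (requires n <= N)."""
    B, N, _ = points.shape
    # Independent random permutation per cloud, truncated to the first n indices.
    idx = torch.stack([torch.randperm(N, device=points.device)[:n] for _ in range(B)])  # (B, n)
    return torch.gather(points, 1, idx.unsqueeze(-1).expand(-1, -1, 3))                 # (B, n, 3)
```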
| Points per object | 1 | 10 | 100 | 1000 | 10000 |
| --- | --- | --- | --- | --- | --- |
| Classification accuracy | 0.64 | 0.50 | 0.94 | 0.97 | 0.97 |
| Segmentation accuracy | 0.20 | 0.49 | 0.81 | 0.90 | 0.90 |
As expected, having only a handful of points per object dramatically reduces the accuracy of both networks. However, that is only true up to a point: at roughly 1000 points per object, accuracy saturates, and it stays flat up to and including the 10000 points per object that the model was trained on. This lines up with my expectation that the max-pool operation washes out any higher-density signal; points that are near one another produce such similar per-point features that their contribution after pooling is essentially the same as a single point in that region. Funnily enough, classifying on only 1 point also yields reasonably good results, which suggests the network is largely learning a spatial distribution over where each class's points lie in the canonical pose, rather than actually looking at shape or incorporating local information.