Assignment 5: Point Cloud Classification and Segmentation
Name: Edward Li
Andrew ID: edwardli
Late Days Used:
1. Classification Model (40 points)
I implemented the standard PointNet classification architecture presented in the paper, using BatchNorm on each layer and dropout with probability 0.3. I also use a learning rate schedule that halves the learning rate every 20 epochs.
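As a rough sketch, the model and schedule look something like the following (layer widths follow the vanilla PointNet paper, I omit the input/feature transform nets since I don't use them, and the optimizer settings shown are placeholders rather than my exact hyperparameters):

```python
import torch
import torch.nn as nn

class PointNetCls(nn.Module):
    """Vanilla PointNet classifier (no input/feature transform nets)."""
    def __init__(self, num_classes=3):
        super().__init__()
        # Shared per-point MLP (1x1 convolutions), BatchNorm after each layer.
        self.encoder = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 64, 1), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU(),
        )
        # Classification head with 0.3 dropout before the final layer.
        self.head = nn.Sequential(
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):                       # points: (B, N, 3)
        x = self.encoder(points.transpose(1, 2))     # (B, 1024, N)
        x = x.max(dim=2).values                      # symmetric max-pool over points
        return self.head(x)                          # (B, num_classes) logits

# Learning rate schedule: halve the learning rate every 20 epochs.
model = PointNetCls()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
```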
My best performing model achieved a test accuracy of $0.9801$ at epoch 242.
I first visualized two correctly classified objects from each class:
Chair | Vase | Lamp |
---|---|---|
![]() | ![]() | ![]() |
![]() | ![]() | ![]() |
And one incorrectly classified object from each class:
Chair (predicted lamp) | Vase (predicted lamp) | Lamp (predicted vase) |
---|---|---|
![]() | ![]() | ![]() |
Overall, the model is quite accurate, with only one (!) chair misclassification. These misclassifications seem to be due to strangely shaped objects in a certain class, not due to an underperforming model. More concretely:
- The ground truth chair is quite strange, and seems to be out of domain when compared to the training data. The misclassification, while it doesn't exactly make sense, is also not unreasonable - this shape doesn't fit well into any of the trained classes.
- The ground truth vase is strange, especially as the vase is quite shallow and a plant exists in the point cloud. Most training vases don't have a plant, so the model could easily mistake the plant for the neck/light of a lamp.
- The ground truth lamp is quite short, and almost rotationally symmetric, which is approximately what a vase would look like. Perhaps PointNet, in its max-pooling step, is able to ignore the weird lampy shape sticking out sideways from the rotationally symmetric part.
To train the model and generate visualizations, run `python train.py --task cls`, then `python eval_cls.py --load_checkpoint best_model --save_preds`, and finally `python vis_cls.py`.
2. Segmentation Model (40 points)
I implemented the part segmentation model described in the PointNet paper. More specifically, this is the architecture illustrated in Figure 9, which is distinct from the semantic segmentation network. This network concatenates "skip connections" from earlier layers with the global feature, hopefully improving performance.
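A simplified sketch of that idea follows (the real Figure 9 network uses more and wider layers and also concatenates a one-hot object-class vector; the layer widths and `num_seg_classes` below are illustrative placeholders, not my exact configuration):

```python
import torch
import torch.nn as nn

class PointNetSeg(nn.Module):
    """Simplified PointNet part-segmentation sketch: per-point features from
    earlier layers ("skip connections") are concatenated with the tiled global
    feature before a shared per-point classification MLP."""
    def __init__(self, num_seg_classes=6):
        super().__init__()
        self.mlp1 = nn.Sequential(nn.Conv1d(3, 64, 1), nn.BatchNorm1d(64), nn.ReLU())
        self.mlp2 = nn.Sequential(nn.Conv1d(64, 128, 1), nn.BatchNorm1d(128), nn.ReLU())
        self.mlp3 = nn.Sequential(nn.Conv1d(128, 1024, 1), nn.BatchNorm1d(1024), nn.ReLU())
        # Per-point head over concatenated [64 local + 128 local + 1024 global] features.
        self.head = nn.Sequential(
            nn.Conv1d(64 + 128 + 1024, 512, 1), nn.BatchNorm1d(512), nn.ReLU(),
            nn.Conv1d(512, 256, 1), nn.BatchNorm1d(256), nn.ReLU(),
            nn.Conv1d(256, num_seg_classes, 1),
        )

    def forward(self, points):                        # points: (B, N, 3)
        x = points.transpose(1, 2)                    # (B, 3, N)
        f1 = self.mlp1(x)                             # (B, 64, N)
        f2 = self.mlp2(f1)                            # (B, 128, N)
        f3 = self.mlp3(f2)                            # (B, 1024, N)
        g = f3.max(dim=2, keepdim=True).values        # global feature (B, 1024, 1)
        g = g.expand(-1, -1, x.shape[2])              # tile to every point
        per_point = torch.cat([f1, f2, g], dim=1)     # skip connections + global feature
        return self.head(per_point).transpose(1, 2)   # (B, N, num_seg_classes) logits
```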
My best performing model achieved a test accuracy of $0.9056$ at epoch 150.
I first visualized the two best segmented objects:
Accuracy | Ground Truth | Prediction |
---|---|---|
0.9959 | ![]() | ![]() |
0.9957 | ![]() | ![]() |
I speculate that these perform well because they are the easiest chairs to segment. There is a clear back/seat/leg distinction (easily distinguishable by height). Additionally, there is no human labeling ambiguity, so the network can be confident in its predictions.
Next, I chose two random chairs to visualize:
Accuracy | Ground Truth | Prediction |
---|---|---|
0.9386 | ![]() | ![]() |
0.9165 | ![]() | ![]() |
These perform reasonably well, with the majority of the ambiguity coming from the seat/leg separation. We see that the network tends to predict more seat than it should, which is especially a problem in the bottom example, where the seat and legs are not well separated.
One other possibility is that the top chair is relatively tall: the height of the seat could have been slightly out-of-domain for the network, causing it to predict slightly lower points as part of the seat.
Finally, I visualized the two worst-segmented chairs, which serve as my two bad predictions:
Accuracy | Ground Truth | Prediction |
---|---|---|
0.4634 | ![]() | ![]() |
0.4260 | ![]() | ![]() |
Generally, we see the network mispredicting the legs and back of the two chairs. This is caused both by labeling ambiguity and by small features.
Our first chair's legs are very small, which is not represented well in a point cloud lacking connectivity information. This causes our PointNet to imagine some arbitrary boundary between the seat and "legs" of the couch, which understandably hurts prediction accuracy. Additionally, I think the seat back/arm/headrest distinction is just ambiguous, and the network's and the labeler's interpretations are both valid.
Our second chair's low accuracy is primarily due to labeling ambiguity as well. The ground truth labeler does not include any legs at all, while the network treats everything under the seat as a leg. While this is likely due to class bias (most chairs have legs under the seat), I don't think this interpretation is necessarily wrong.
In general, we find that the worst-performing cases are mostly due to strange and ambiguous labels.
To train the model and generate visualizations, run `python train.py --task seg`, then `python eval_seg.py --load_checkpoint best_model --save_preds`, and finally `python vis_seg.py`.
3. Robustness Analysis (20 points)
I conduct three experiments here: varying the number of points, applying rotations, and varying object scale.
Number of Points
First, I vary the number of points by using the existing code to randomly sample a subset of points from each point cloud. I choose to sample a logarithmically decreasing number of points from 10000 downwards. Run this section with `python eval_{cls|seg}.py --load_checkpoint best_model --num_points {num_points}`.
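The sampling itself is conceptually just a random index selection; a minimal sketch (the actual indexing lives in the provided eval scripts, and the names here are illustrative):

```python
import torch

def subsample(points, labels, num_points):
    """Randomly keep num_points points (and their per-point labels) from each cloud.

    points: (B, N, 3), labels: (B, N) for segmentation (omit labels for classification).
    """
    idx = torch.randperm(points.shape[1])[:num_points]
    return points[:, idx], labels[:, idx]
```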
# of points | Classification Acc | Segmentation Acc |
---|---|---|
**10000** | **0.9801** | **0.9056** |
3000 | 0.9790 | 0.9050 |
1000 | 0.9738 | 0.9010 |
300 | 0.9706 | 0.8708 |
100 | 0.8930 | 0.8160 |
30 | 0.4208 | 0.7592 |
Bolded numbers are the original results from Q1/Q2. In general, we find that accuracy remains quite high before dropping precipitously for classification at 30 points. Accuracy likely remains high because PointNet relies only on a max pool at the end and has no local point communication beforehand. This means that as long as some point exists with a strong vote for the correct class, accuracy will remain high.
Our segmentation accuracy remains high as well, likely for similar reasons, although it decreases as the number of points drops because there is more ambiguity over part boundaries.
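To make the max-pool argument concrete, here is a small illustrative check with a random stand-in for the per-point features (not part of the assignment code): the global feature depends only on the few "critical points" that achieve each per-channel maximum, so max-pooling a random subset usually changes it very little.

```python
import torch

# Illustrative only: random stand-in for per-point features after the shared MLP.
feats = torch.rand(10000, 1024)                    # (num_points, feature_dim)

# The global feature depends only on the "critical points" that achieve each
# per-channel maximum -- at most 1024 of the 10000 points here.
critical = feats.argmax(dim=0).unique()
print(f"{critical.numel()} critical points out of {feats.shape[0]}")

# Max-pooling a random 10% subset barely changes the global feature.
subset = feats[torch.randperm(feats.shape[0])[:1000]]
print((feats.max(dim=0).values - subset.max(dim=0).values).abs().max())
```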
Rotations
Next, I rotate the point clouds in various ways. More concretely, I generate random rotations for each object. First, I select random $y$-axis rotations (rotations about the vertical axis), which is naturalistic, as objects tend to be rotated about the vertical axis in the real world. As another test, I select random unconstrained SO(3) rotations by taking the $Q$ factor from a QR decomposition of a random $3\times 3$ matrix with i.i.d. normal entries (a standard way to generate approximately uniform random orthogonal matrices).
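Roughly, the two rotation samplers look like the following sketch (names are illustrative, and my actual implementation may differ in details such as the determinant-sign flip, which guarantees a proper rotation rather than a reflection):

```python
import numpy as np

def random_y_rotation():
    """Random rotation about the vertical (y) axis."""
    t = np.random.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def random_so3():
    """Random rotation via the Q factor of a QR decomposition of a Gaussian matrix."""
    Q, _ = np.linalg.qr(np.random.randn(3, 3))
    if np.linalg.det(Q) < 0:   # Q may be a reflection; flip one column to get det = +1
        Q[:, 0] = -Q[:, 0]
    return Q

# Applied to a point cloud of shape (N, 3): rotated = points @ R.T
```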
Run this section with `python eval_{cls|seg}.py --load_checkpoint best_model --random_{y_rotation|so3}`.
Rotation Type | Classification Acc | Segmentation Acc |
---|---|---|
**None** | **0.9801** | **0.9056** |
$y$-axis | 0.5614 | 0.6928 |
SO(3) | 0.2802 | 0.3106 |
Bolded numbers are original results in Q1/Q2. I find that $y$-axis rotations perform significantly worse than no rotation, which is expected, as PointNet is not rotation invariant (no transform module or vector neurons) and was not trained on augmented data.
However, SO(3) rotations perform far worse overall. I think this is because a lot of information is captured along the $y$ axis, especially in segmentation (e.g., legs are usually under the seat). Without one axis that always remains the same, it is much harder to predict correctly.
Scale Variation
Finally, I thought it would be interesting to explore the robustness of PointNet to various object scales. To do this, I scale all objects by a fixed scaling factor ranging from 0.3 to 3.0. Run this section with `python eval_{cls|seg}.py --load_checkpoint best_model --rescale {scale_factor}`.
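The rescaling itself is just a uniform scaling of each cloud about the origin before it is fed to the unchanged network; conceptually (names are illustrative):

```python
def rescale(points, scale_factor):
    """Uniformly scale a point cloud of shape (N, 3) about the origin."""
    return points * scale_factor
```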
Scale | Classification Acc | Segmentation Acc |
---|---|---|
0.3 | 0.2455 | 0.5842 |
0.5 | 0.5876 | 0.7779 |
0.7 | 0.9140 | 0.8486 |
**1.0** | **0.9801** | **0.9056** |
1.5 | 0.9161 | 0.6768 |
2.0 | 0.7398 | 0.5683 |
3.0 | 0.6485 | 0.5526 |
Bolded numbers are original results in Q1/Q2. Overall, unsurprisingly, accuracy drops substantially as we move away from the original scale. However, some minor scale variation is quite tolerable, likely because the training set of chairs had some variance in scale, making our networks somewhat scale-invariant.
However, scales significantly different from 1 are far out-of-domain and cause the network to fail, since there is no local communication between points and absolute $x$/$y$/$z$ coordinates are all that matter.