In this project, we study the behaviour of a view synthesis algorithm, SynSin. We probe its success and failure modes, and propose potential solutions for improving it.
Single-scene view synthesis aims to synthesize views of a single scene. As shown below, the model takes only a camera pose as input and synthesizes the image corresponding to that view of the scene it was trained on. Once trained, the model can only produce views of that single scene.
Many algorithms fall into this single-scene category, for example the implicit-function-based method NeRF, the deep-feature-voxel-based method DeepVoxels, and many others.
Another class of methods is single-view view synthesis. These methods aim to synthesize new views of a given image, so the model takes not only the new camera pose but also an image of the scene as input. The algorithm then synthesizes a new view conditioned on the given pose and scene. As a result, such methods can synthesize views of multiple scenes (provided they are sufficiently similar to the training scenes). Arguably, single-view view synthesis is much more difficult than single-scene view synthesis. In this project, we study a single-view view synthesis algorithm called SynSin.
The method we study, SynSin (Wiles et al., CVPR 2020), takes an input image and a camera pose as input. Two sub-networks compute a depth map and a set of 2D feature maps from the input image. Using the given camera intrinsics, the depth map is used to lift the feature maps into a 3D point cloud, where each point carries a feature vector. Next, the points are transformed by the camera pose and projected back into 2D. Finally, a decoder network decodes the reprojected 2D feature maps into the new view.
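At a high level, the pipeline can be sketched as follows. This is a hypothetical illustration rather than the actual SynSin implementation: `depth_net`, `feature_net`, `renderer`, and `decoder` stand in for the corresponding sub-networks, and tensor shapes are simplified.

```python
import torch

def synthesize_view(image, K, relative_pose, depth_net, feature_net, renderer, decoder):
    """Hypothetical sketch of the pipeline described above.

    image:         (1, 3, H, W) input RGB image
    K:             (3, 3) camera intrinsics
    relative_pose: (4, 4) transform from the input view to the target view
    """
    _, _, H, W = image.shape

    depth = depth_net(image)     # (1, 1, H, W) per-pixel depth
    feats = feature_net(image)   # (1, C, H, W) 2D feature maps

    # Lift every pixel into 3D using depth and intrinsics: X = d * K^{-1} [u, v, 1]^T
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()    # (H, W, 3)
    points = (pix @ torch.inverse(K).T) * depth[0, 0, ..., None]     # (H, W, 3)

    # Transform the feature point cloud into the target view.
    points_h = torch.cat([points, torch.ones(H, W, 1)], dim=-1)      # homogeneous coords
    points_new = (points_h.reshape(-1, 4) @ relative_pose.T)[:, :3]  # (H*W, 3)

    # Splat the 3D feature points back onto the 2D image plane (SynSin does this
    # with a differentiable point-cloud renderer), then decode into the new view.
    reprojected = renderer(points_new, feats.reshape(1, -1, H * W), K)
    return decoder(reprojected)
```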
The trained model can produce cool, photorealistic view synthesis results, for example:
Staircase | Outdoor | Door | Bedroom | Living room
---|---|---|---|---|
We use a simple pipeline to surface potential problems. Specifically, we train the model on the RealEstate10K dataset and test it on 5K scenes from the test set. For each scene we synthesize views along the ground-truth camera trajectory provided by the dataset, yielding a reconstruction of the original sequence. Each reconstructed sequence is compared against the true video sequence, and an L1 loss is computed between them. We rank the sequences by this loss and analyze the worst-performing ones.
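The ranking step amounts to something like the following sketch; `reconstruct_sequence` and the `seq` attributes are placeholders for illustration, not real SynSin or RealEstate10K APIs.

```python
import torch

def rank_worst_sequences(model, test_sequences, reconstruct_sequence, k=50):
    """Rank test sequences by reconstruction error (highest L1 loss first).

    `reconstruct_sequence` is a placeholder callable that rolls the model out
    along a ground-truth camera trajectory and returns the predicted frames.
    """
    losses = []
    for seq in test_sequences:
        # Re-synthesize every frame of the sequence from its first frame,
        # following the ground-truth camera poses provided by the dataset.
        predicted = reconstruct_sequence(model, seq.input_frame, seq.poses)

        # Per-sequence L1 loss against the true video frames.
        l1 = torch.mean(torch.abs(predicted - seq.true_frames)).item()
        losses.append((l1, seq.name))

    # Worst-performing sequences first.
    return sorted(losses, reverse=True)[:k]
```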
After analyzing the worst sequences, we identify three failure modes: geometry-induced, representation-induced, and learning-induced problems.
We introduce each of these problem classes in the following sections.
Geometry-induced problems are perhaps the easiest to understand: they occur when the camera moves into unseen areas (e.g. occluded or out-of-frame regions). For example, in the following sequence, the view synthesis result in the middle starts to fall apart when the camera moves into the occluded region where the chair and table are located.
Input | Synthesized views | True views |
---|---|---|
Similarly, in the next example, the camera rotates to the right, revealing an out-of-frame region. Because the model cannot infer what lies to the right of the input image, it simply renders a large blank region.
Input | Synthesized views | True views |
---|---|---|
We note that most camera motions (including all rotations and translations, except moving forward in some cases) can reveal unseen regions. We unit-test whether each of these camera motions produces unsatisfactory results. Specifically, we translate the camera in each direction by at most 100 pixels or rotate the camera by at most 40 degrees, and we interpolate between zero and the maximum translation or rotation to generate a video corresponding to the motion. We call each such motion (e.g. moving forward, rotating left) a unit motion.
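A unit motion can be generated roughly as follows. This is a simplified sketch: `render_view` in the usage comment is a hypothetical wrapper around the model, and the rotation convention shown is only one possible choice.

```python
import numpy as np

def unit_motion_poses(axis, max_translation=0.0, max_rotation_deg=0.0, n_frames=30):
    """Camera poses for one unit motion: interpolate from the identity pose to
    the maximum translation along `axis`, or the maximum rotation about `axis`."""
    poses = []
    for alpha in np.linspace(0.0, 1.0, n_frames):
        T = np.eye(4)
        if max_translation != 0.0:
            T[axis, 3] = alpha * max_translation          # e.g. axis=0 moves left/right
        if max_rotation_deg != 0.0:
            theta = np.deg2rad(alpha * max_rotation_deg)
            c, s = np.cos(theta), np.sin(theta)
            if axis == 1:                                  # yaw: rotate left/right
                T[:3, :3] = [[c, 0, s], [0, 1, 0], [-s, 0, c]]
            elif axis == 0:                                # pitch: rotate up/down
                T[:3, :3] = [[1, 0, 0], [0, c, -s], [0, s, c]]
        poses.append(T)
    return poses

# Example: a "rotate left" unit motion up to 40 degrees, rendered frame by frame.
# frames = [render_view(model, input_image, p)
#           for p in unit_motion_poses(axis=1, max_rotation_deg=40)]
```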
Moving Back | Moving Left | Moving Right
---|---|---|
Moving Up | Moving Down | Rotate Left
---|---|---|
Rotate Right | Rotate Up | Rotate Down
---|---|---|
As shown above, apart from the motions in the first row (which produce acceptable results), most motions exhibit strong artefacts and/or large blank areas. Similar results can be found in the second scene shown below.
Moving Back | Moving Left | Moving Right
---|---|---|
Moving Up | Moving Down | Rotate Left
---|---|---|
Rotate Right | Rotate Up | Rotate Down
---|---|---|
We could use an inpainting module to resolve this problem. Because the empty areas are relatively cheap to obtain (they are simply the areas not covered by any reprojected 3D point), a hole-aware inpainting module such as [1] could be used to fill in the blanks, as sketched below. Note that a missing region admits multiple plausible fills, so a probabilistic model could be used to sample possible completions.
[1]: Image Inpainting for Irregular Holes Using Partial Convolutions, Liu et al., ECCV 2018
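A minimal sketch of how such a module could be wired in, assuming the renderer also exposes a per-pixel coverage map (the accumulated splat weights) and that `inpaint_net` is a partial-convolution network in the spirit of [1]:

```python
import torch

def fill_unseen_regions(reprojected_feats, coverage, inpaint_net, threshold=1e-6):
    """Fill blank regions left behind after reprojection.

    reprojected_feats: (1, C, H, W) feature map rendered in the target view
    coverage:          (1, 1, H, W) accumulated splat weights; ~0 wherever no
                       3D point landed, i.e. the occluded / out-of-frame holes
    """
    # 1 = known pixel, 0 = hole, following the partial-convolution convention.
    mask = (coverage > threshold).float()

    # A hole-aware inpainting network takes both the features and the mask,
    # so its convolutions are normalized over valid pixels only.
    return inpaint_net(reprojected_feats, mask)
```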
Representation-induced problems occur when the estimated 3D representation is wrong or insufficient to capture the fine details of the input scene. This happens often because estimating the 3D structure of a scene from a 2D image is an ill-posed problem: there are infinitely many possible 3D structures that correspond to a single 2D image. Although the model learns a prior from the training data, estimating the 3D representation can still fail.
The failure case shown below is confusing at first sight: the camera does not go anywhere unseen, so why does it fail?
Synthesized views | True views |
---|---|
However, it becomes clearer once we look at the estimated depth map (recall that the model uses the depth map to lift 2D points into 3D). As shown below, the estimated depth is completely wrong: the nearby staircase (red box) is not assigned a small depth value, and the sky visible through the door (green box) is not assigned a large depth value. As a result, the model cannot synthesize the correct geometry once the camera moves.
As another example, see the predicted depth map below. The area in the red box should be mostly empty. However, the predicted depth map does not reflect this empty region.
It is easy to guess the view synthesis result: the whole area to the left is treated as a flat patch. As the camera moves, the patch undergoes a single projective transformation, and the generated views do not reflect the geometry of the real scene.
Synthesized views | True views |
---|---|
To alleviate this problem, we note that a recent work [2] argues that an explicit 3D model is not required for view synthesis. This opens a fascinating possibility: remove the explicit 3D representation and synthesize the view directly. Without an intermediate 3D representation to estimate, the risk of a bad representation ruining the synthesized view is greatly reduced. We believe this is a promising direction for future research.
[2]: Geometry-Free View Synthesis: Transformers and no 3D Priors, Rombach et al., arXiv, 2021
The last class of problems we identify is learning-induced problems, which arise from the machine learning procedure itself. Specifically, we identify two types of out-of-distribution test data that rarely appear in the training set, and find that testing on such images yields inferior results.
While examining the worst predictions, one pattern struck us as very odd: almost all stair-related sequences (e.g. walking up or down stairs) yield bad results. Three examples are shown below.
Stair 1 | Stair 2 | Stair 3 |
---|---|---|
What might be the reason for this? Because the dataset is constructed from house-viewing videos, which consist mostly of horizontal and forward/backward movement, movement along the y-axis is almost non-existent. As shown in the left figure below, the mean vertical motion in the training set is very small. We therefore conjecture that the model performs worse on upward and downward motion because such motion is out of the training distribution. We verify this hypothesis by plotting the relationship between the loss and the translation scale along each axis (see the sketch after the figures). As shown in the right figure below, the loss increases with translation scale along every axis, but it is particularly sensitive to translation along the y-axis: even a small vertical translation causes a large increase in loss. This confirms that the model does not handle vertical movement well.
Motion statistics | Loss sensitivity |
---|---|
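In spirit, the sensitivity plot can be reproduced with something like the sketch below, where `records` is assumed to hold (ground-truth translation, per-frame L1 loss) pairs collected during the reconstruction pass.

```python
import numpy as np
import matplotlib.pyplot as plt

def loss_vs_translation_scale(records, axis, bins=20):
    """Average per-frame L1 loss as a function of how far the ground-truth
    camera moved along one axis (0 = x, 1 = y, 2 = z).

    `records` is assumed to be a list of (translation_vector, l1_loss) pairs
    collected while reconstructing the test sequences.
    """
    scales = np.array([abs(t[axis]) for t, _ in records])
    losses = np.array([l for _, l in records])

    # Bin the frames by translation magnitude and average the loss in each bin.
    edges = np.linspace(0.0, scales.max(), bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    mean_loss = [losses[(scales >= lo) & (scales < hi)].mean()
                 for lo, hi in zip(edges[:-1], edges[1:])]
    return centers, mean_loss

def plot_loss_sensitivity(records):
    # One curve per translation axis; the y-axis curve is expected to rise fastest.
    for axis, name in enumerate(["x", "y", "z"]):
        centers, mean_loss = loss_vs_translation_scale(records, axis)
        plt.plot(centers, mean_loss, label=f"translation in {name}")
    plt.xlabel("translation scale")
    plt.ylabel("L1 loss")
    plt.legend()
    plt.show()
```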
Finally, we note that because the training set consists mostly of indoor scenes, the model does not generalize well to outdoor scenes. As an example, the estimated depth map for the outdoor image contains no meaningful information: the sky is not assigned a high depth value, and the bushes are not assigned a low depth value.
As one can imagine, the generated view synthesis results are poor.
Synthesized views | True views |
---|---|
The easiest solution to this problem is to add more diverse data to the training set; for example, collecting more videos that involve vertical movement and more sequences of outdoor scenes would help. Furthermore, data augmentation (e.g. color jittering) could also mitigate the problem.
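For instance, color jittering is a one-line addition with torchvision (the jitter strengths below are illustrative, not tuned values):

```python
import torchvision.transforms as T

# Photometric augmentation applied to training frames; the jitter strengths
# here are arbitrary examples, not values used in the original SynSin setup.
augment = T.Compose([
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    T.ToTensor(),
])
```

Note that for view synthesis the same jitter should be applied to both the input frame and the target frames of a training pair, so that the photometric reconstruction loss remains consistent.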
In this project, we identified and analyzed three failure modes of a view synthesis algorithm. We now have a better understanding of the method's shortcomings, and the potential solutions outlined above can be pursued in future work.