In recent years, research communities in fields such as embodied AI, computer vision, and natural language processing have envisioned agents that operate in environments that may be only partially known and where human collaboration may be required. To achieve this, such agents must be equipped with mechanisms that allow them to understand the semantic context of an environment. One way to extract and understand semantics is to leverage and align information from multiple sensory inputs: visual images, segmentation masks, audio, language, motion, and so on. Toward this challenging objective, we are interested in studying how explicit semantic information can be aligned with visual and motion information to learn to represent 3D structure and ego-motion from monocular vision.
Specifically, we aim to explore the task of view synthesis as a methodology for learning to represent 3D structure and ego-motion from monocular vision. Previous research in view synthesis has shown that models can be trained to produce reasonable ego-motion estimates from dashboard camera video (i.e., the KITTI dataset [7]) [6]. In that work, a pose-and-explainability network was used in conjunction with a depth network to generate 3D predictions of the dashboard video. Other research employs semantics-driven unsupervised learning to train ego-motion estimation models [8].
Our project, SemSynSin (Semantics-driven View Synthesis from a Single Image), builds on these ideas to generate ego-motion videos of indoor scenes.
The remainder of this report is organized as follows. In the Approach section, we discuss the approach we take to generate indoor ego-motion videos; for this project, we use indoor scenes from the Matterport3D dataset together with trajectory information from the Vision-and-Language Navigation dataset. In the Experiments section, we describe the ablation experiments under which we test SemSynSin. In the Results section, we show the results generated by the SemSynSin model under the different experimental conditions. Finally, in the Discussion section, we discuss the results of SemSynSin as well as potential avenues for future research.
In this project, we are interested in learning the 3D structure of complex indoor environments via view synthesis. View synthesis is the task of generating images of a scene from different viewpoints. This is highly challenging, as it requires algorithms capable of understanding the nature of 3D environments, e.g., the semantic information in a scene, the relationships between objects, and the layout of environments and occlusions.
As mentioned above, we build on prior work that focused on learning Structure-from-Motion (SfM) from single-view images in outdoor environments. We first assess the model's performance on complex indoor environments and then explore methods for improving the results. In particular, we are interested in explicitly incorporating semantic knowledge, since it is crucial for scene understanding.
In the following sub-sections, we further describe the procedure followed in this project. Specifically, in Part 1, we describe our methodology for obtaining the training data. Then, in Part 2, we describe the model and the loss functions we used.
Here's the link to our project code.
We use the Matterport3D (MP3D) dataset for our project and the Habitat simulation environment to generate egocentric trajectories for training, validation and testing. This section describes in more detail the procedure followed to generate said dataset.
Matterport3D (MP3D) [1] is a large-scale dataset introduced in 2017, featuring over 10k images of 90 different building-scale indoor scenes. The dataset provides annotations with surface reconstructions, camera poses, color and depth images, as well as semantic segmentation images. For our project, we use a different version of this dataset, which can be obtained through the Habitat simulation environment described in Section 1.2. It is important to note that one of the major differences between this version of the dataset and the original one is that images in the former have lower resolution and quality. As can be observed in Figure 1, the images exhibit visual artifacts, making the task of 3D learning more challenging.
As such, this particular version of the dataset has generally been used for training embodied agents on various multi-modal navigation tasks [2, 3, 4]. We explore this version of the dataset since we are interested in equipping embodied agents with 3D learning and understanding skills within this simulation platform.
To generate the data for training, validation and testing, we used the Vision-and-Language Navigation (VLN) dataset presented in [2]. This dataset consists of natural-language instructions that describe a particular path to follow in an MP3D indoor environment. Each instruction corresponds to a trajectory in the environment, which can be obtained by running a Shortest-Path-Follower (SPF) between the start and goal locations associated with the instruction. For this project, we are not interested in the language instructions, so we do not provide details on how this dataset has been used to train instruction-following agents. However, we leverage the visual trajectories associated with these instructions to create our dataset.
The VLN dataset described above was designed for the Habitat [5] simulation platform. Thus, we use [5] to collect the data for training. Briefly, Habitat is a highly efficient and flexible platform intended for embodied AI research. It allows researchers to easily design and configure agents, sensors, and AI algorithms for a diverse set of navigation tasks [2, 3, 4].
Specifically, we use an SPF, as described above, to obtain sensor information from the simulator for each trajectory in the VLN dataset. For every trajectory we extract color (RGB), depth, and semantic segmentation images, as well as relative pose information. The SPF has an action space consisting of four possible actions: MOVE_FORWARD 0.25m, TURN_LEFT 15deg, TURN_RIGHT 15deg, and STOP. An example of a resulting trajectory is shown in Figure 1.
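To make the collection procedure concrete, below is a minimal sketch of how such trajectories could be gathered with habitat-lab's `ShortestPathFollower`. The config path, sensor keys, and action handling are assumptions that vary across habitat-lab versions; this is not our exact collection script.

```python
import habitat
from habitat.tasks.nav.shortest_path_follower import ShortestPathFollower

# Hypothetical task config path; the exact VLN config name depends on the habitat-lab version.
config = habitat.get_config("configs/tasks/vln_r2r.yaml")
env = habitat.Env(config=config)

# Greedy geodesic follower toward the episode goal; goal_radius matches the 0.25m forward step.
follower = ShortestPathFollower(env.sim, goal_radius=0.25, return_one_hot=False)

observations = env.reset()
goal = env.current_episode.goals[0].position
steps = []
while not env.episode_over:
    steps.append({
        "rgb": observations["rgb"],                 # color image
        "depth": observations["depth"],             # depth image
        "semantic": observations.get("semantic"),   # semantic mask, if the sensor is enabled
        "agent_state": env.sim.get_agent_state(),   # used to derive relative poses
    })
    action = follower.get_next_action(goal)         # MOVE_FORWARD / TURN_LEFT / TURN_RIGHT / STOP
    if action is None:                              # some versions return None once the goal is reached
        break
    observations = env.step(action)
env.close()
```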
Table 1 shows statistics of the dataset we obtained.
Data Split | Num. Environments | Avg. Num. Trajectories per Environment | Total Num. Trajectories | Avg. Num. Steps per Trajectory | Total Num. Steps |
---|---|---|---|---|---|
Train | 33 | 65 | 2,169 | 55 | 119,976 |
Val | 33 | 5 | 142 | 54 | 7,750 |
Test | 11 | 55 | 613 | 54 | 33,412 |
As mentioned before, we focus on learning the 3D structure of an indoor environment from video sequences. We follow prior work [6], which focuses on learning Structure-from-Motion (SfM) in outdoor environments. That model is trained purely on unlabeled color images, using a view-synthesis objective as its main supervisory signal.
In our project, we explore whether explicitly incorporating semantic information, in the form of masks, enables the model to better understand and learn the 3D structure of a given scene. Our model jointly trains two neural networks: one predicts depth from a single-view image represented both in RGB and in semantic labels, and the other predicts the pose transformation between two images. To train the model, we use a view-synthesis objective for both the color images and the segmentations, along with a multi-scale smoothness loss. Section 2.1 and Section 2.2 provide more details on the model implementation, and Section 2.3 dives into the details of the objective functions.
The first component of the model is the Depth Network, a CNN-based model that takes as input a target image represented as both color information and semantic masks, and outputs the corresponding depth map. As shown in Figure 2, the Depth Network is composed of two encoders, one for each input modality (i.e., color and semantic masks), and one decoder that uses the concatenated embeddings from the two encoders to predict the corresponding depth.
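As an illustration, the following is a minimal PyTorch sketch of a two-encoder depth network of this kind. The layer sizes, the assumed number of semantic classes (40), the one-hot mask encoding, and the single-scale output are illustrative choices; the actual Depth Network follows the multi-scale encoder-decoder design depicted in Figure 2.

```python
import torch
import torch.nn as nn

class TwoEncoderDepthNet(nn.Module):
    """Sketch: encode RGB and semantic masks separately, fuse, then decode depth."""

    def __init__(self, num_classes=40):  # 40 is an assumed number of semantic labels
        super().__init__()
        self.rgb_encoder = self._encoder(in_ch=3)
        self.sem_encoder = self._encoder(in_ch=num_classes)  # one-hot semantic masks
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Softplus(),  # strictly positive depth
        )

    @staticmethod
    def _encoder(in_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
        )

    def forward(self, rgb, sem_onehot):
        # Concatenate the two modality embeddings along the channel dimension before decoding.
        feats = torch.cat([self.rgb_encoder(rgb), self.sem_encoder(sem_onehot)], dim=1)
        return self.decoder(feats)

# Example: depth = TwoEncoderDepthNet()(rgb_batch, sem_batch), with rgb_batch of shape
# (B, 3, H, W) and sem_batch of shape (B, 40, H, W), H and W divisible by 4.
```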
The second component of the model is the Pose Network, which is also a CNN-based network. This module takes as input a short sequence of N images, again represented as color images and semantic masks. One of the images in the sequence is the target image \(I_t\), and all other images are source images \(I_s\). The model outputs the pose transformation between each source image and the target image. Like the Depth Network, the Pose Network is composed of two encoders, one for each input modality. The final embeddings of the two encoders are concatenated and used to predict the pose transformations between the images. The model is shown in Figure 3.
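Below is a similarly hedged sketch of such a pose network. A 6-DoF output (3 translation + 3 rotation parameters) per source image is assumed, as in [6]; the channel sizes, sequence length, and class count are again illustrative.

```python
import torch
import torch.nn as nn

class TwoEncoderPoseNet(nn.Module):
    """Sketch: predict 6-DoF relative poses between each source frame and the target frame."""

    def __init__(self, num_frames=3, num_classes=40):  # assumed sequence length and label count
        super().__init__()
        self.num_sources = num_frames - 1
        self.rgb_encoder = self._encoder(in_ch=3 * num_frames)            # frames stacked on channels
        self.sem_encoder = self._encoder(in_ch=num_classes * num_frames)
        # One (tx, ty, tz, rx, ry, rz) vector per source image.
        self.pose_head = nn.Conv2d(128, 6 * self.num_sources, kernel_size=1)

    @staticmethod
    def _encoder(in_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, 32, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
        )

    def forward(self, rgb_seq, sem_seq):
        # rgb_seq: (B, 3*num_frames, H, W), sem_seq: (B, num_classes*num_frames, H, W)
        feats = torch.cat([self.rgb_encoder(rgb_seq), self.sem_encoder(sem_seq)], dim=1)
        pose = self.pose_head(feats).mean(dim=[2, 3])        # global average over spatial dims
        return pose.view(-1, self.num_sources, 6)            # (B, num_sources, 6)
```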
The main objective function in this project comes from a view-synthesis task: given one input view of a scene, \(I_t\), the goal is to synthesize a new image of the scene from a different camera pose. In [6], the synthesis is achieved by predicting both the depth of the target viewpoint, \(D_t\), and the pose transformation between the target view and a nearby view, \(T_{t \rightarrow n}\), where \(n\) indexes the nearby view \(I_n\). The depth and pose are learned through the CNN-based modules explained in the previous sections.
The view-synthesis objective is given by: $$ L_{vs} = \sum_{n} \sum_{p} | I_t(p) - \hat{I}_n(p) | $$ where \(p\) indexes pixels and \(\hat{I}_n\) is a nearby image warped into the target's coordinate frame. To warp the nearby image to the target frame, we project \(p_t\), the homogeneous coordinates of a pixel in the target image, onto the nearby image via $$ p_n \sim K \cdot T_{t \rightarrow n} \cdot D_t(p_t) \cdot K^{-1} \cdot p_t $$ where \(K\) is the camera intrinsics matrix, and \(D_t\) and \(T_{t \rightarrow n}\) are the predicted depth and pose, respectively.
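For concreteness, here is a minimal PyTorch sketch of this projection step, mapping the target pixel grid into a nearby view using the predicted depth and pose. The tensor shapes and the homogeneous 4x4 form of \(T_{t \rightarrow n}\) are assumptions, not the exact implementation.

```python
import torch

def project_to_nearby(depth_t, T_t2n, K):
    """Project the target pixel grid into the nearby view: p_n ~ K T_{t->n} (D_t(p_t) K^{-1} p_t).

    depth_t: (B, 1, H, W) predicted target depth
    T_t2n:   (B, 4, 4) predicted target-to-nearby transform (assumed homogeneous form)
    K:       (B, 3, 3) camera intrinsics
    Returns pixel coordinates of each target pixel in the nearby view, shape (B, H, W, 2).
    """
    B, _, H, W = depth_t.shape
    device = depth_t.device

    # Homogeneous pixel grid p_t = (u, v, 1) for every target pixel.
    v, u = torch.meshgrid(torch.arange(H, device=device, dtype=torch.float32),
                          torch.arange(W, device=device, dtype=torch.float32), indexing="ij")
    p_t = torch.stack([u, v, torch.ones_like(u)], dim=0).view(1, 3, -1).expand(B, -1, -1)

    # Back-project to 3D camera coordinates: X = D_t * (K^{-1} p_t), then homogenize.
    cam_pts = torch.linalg.inv(K) @ p_t * depth_t.view(B, 1, -1)              # (B, 3, H*W)
    cam_pts = torch.cat([cam_pts, torch.ones_like(cam_pts[:, :1])], dim=1)    # (B, 4, H*W)

    # Transform into the nearby camera frame and project with the intrinsics.
    proj = K @ (T_t2n @ cam_pts)[:, :3]                                       # (B, 3, H*W)
    p_n = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                          # perspective divide
    return p_n.permute(0, 2, 1).view(B, H, W, 2)
```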
The coordinates \(p_n\) obtained from the previous equation are continuous values. To obtain the value \(I_n(p_n)\) used to populate \(\hat{I}_n(p_t)\), we use two interpolation methods: 1) bilinear interpolation for color images, which linearly interpolates the values of the top-left, top-right, bottom-left and bottom-right pixel neighbors, and 2) nearest-neighbor interpolation for the semantic masks, to preserve the original label values.
Thus, in summary, the view-synthesis objective is applied to both the color images and the semantic masks by warping the source image into the target frame using the predicted depth and poses, as well as the corresponding interpolation method.
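The warping itself can be sketched with `torch.nn.functional.grid_sample`, using bilinear sampling for color images and nearest-neighbor sampling for semantic masks, as described above. Here, `project_to_nearby` refers to the illustrative helper sketched earlier, and the coordinate normalization simply follows `grid_sample`'s \([-1, 1]\) convention.

```python
import torch
import torch.nn.functional as F

def warp_nearby_to_target(rgb_n, sem_n, depth_t, T_t2n, K):
    """Synthesize the target view from a nearby view via inverse warping (sketch)."""
    B, _, H, W = depth_t.shape
    p_n = project_to_nearby(depth_t, T_t2n, K)                 # (B, H, W, 2), pixel coordinates

    # grid_sample expects sampling locations normalized to [-1, 1].
    grid = torch.empty_like(p_n)
    grid[..., 0] = 2.0 * p_n[..., 0] / (W - 1) - 1.0
    grid[..., 1] = 2.0 * p_n[..., 1] / (H - 1) - 1.0

    # Bilinear interpolation for color; nearest for semantic labels to preserve label values.
    rgb_warped = F.grid_sample(rgb_n, grid, mode="bilinear",
                               padding_mode="zeros", align_corners=True)
    sem_warped = F.grid_sample(sem_n.float(), grid, mode="nearest",
                               padding_mode="zeros", align_corners=True)
    return rgb_warped, sem_warped

def view_synthesis_loss(I_t, I_hat_n):
    """L1 photometric loss, L_vs = sum_p |I_t(p) - I_hat_n(p)| (averaged over pixels here)."""
    return (I_t - I_hat_n).abs().mean()
```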
Unlike the outdoor scenes in [6], the scenes in our indoor environments are static, i.e., there are no dynamic objects at any point in a given sequence. However, the dataset still presents challenges, including 1) occluding objects and 2) visual artifacts at certain viewpoints resulting from the low-quality reconstructions.
To deal with this, the Pose Network is coupled with an Artifact Network, which is trained to predict a per-pixel mask \(E_n(p)\) representing whether a pixel contributes to modeling the 3D structure of a given environment. This mask is used to weigh each pixel in the view-synthesis loss: $$ L_{vs} = \sum_{n} \sum_{p} E_n(p) \, | I_t(p) - \hat{I}_n(p) | $$ To prevent the network from predicting an all-zeros mask, the objective is coupled with a regularization term, \(L_{reg}(E_n)\).
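A minimal sketch of the mask-weighted loss and its regularizer is shown below; the cross-entropy-toward-1 form of \(L_{reg}\) follows [6] and is an assumption about our exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_view_synthesis_loss(I_t, I_hat_n, E_n):
    """Artifact-mask-weighted L1 loss: each pixel's residual is scaled by E_n(p) in [0, 1]."""
    return (E_n * (I_t - I_hat_n).abs()).mean()

def mask_regularization(E_n):
    """Regularizer discouraging the trivial all-zeros mask.

    Assumption: as in [6], a cross-entropy term with a constant "keep this pixel" label of 1,
    which pushes E_n(p) toward 1 unless discounting a pixel clearly lowers the synthesis loss.
    """
    target = torch.ones_like(E_n)
    return F.binary_cross_entropy(E_n.clamp(1e-6, 1 - 1e-6), target)
```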
The last component of the objective explicitly allows gradients to be propagated from larger spatial regions of the image, rather than only from the 4 local pixel neighbors involved in bilinear sampling (Section 2.3.1). To do this, depth maps are predicted at multiple scales, and the \(L_1\) norm of their second-order gradients is minimized, as in [6].
The final loss then becomes: $$ L = \sum_{l} \left( L_{vs}^{l} + \lambda L_{ms}^{l} + \beta \sum_{n} L_{reg}(E^{l}_n) \right) $$ where \(l\) indexes the image scale, \(L_{ms}\) is the multi-scale smoothness loss, and \(\lambda\) and \(\beta\) are weighting hyper-parameters.
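The sketch below illustrates the second-order smoothness term and how the per-scale terms could be combined; it reuses the illustrative loss helpers above, and the default \(\lambda\) and \(\beta\) values correspond to the MS and MR columns of the experiments table in the next section.

```python
import torch

def smoothness_loss(depth):
    """Second-order gradient smoothness: L1 norm of the depth map's second derivatives (as in [6])."""
    dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]     # first-order horizontal gradient
    dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]     # first-order vertical gradient
    dxx = (dx[:, :, :, 1:] - dx[:, :, :, :-1]).abs()
    dyy = (dy[:, :, 1:, :] - dy[:, :, :-1, :]).abs()
    dxy = (dx[:, :, 1:, :] - dx[:, :, :-1, :]).abs()
    dyx = (dy[:, :, :, 1:] - dy[:, :, :, :-1]).abs()
    return dxx.mean() + dyy.mean() + dxy.mean() + dyx.mean()

def total_loss(per_scale_terms, lam=0.1, beta=0.2):
    """Combine per-scale terms: L = sum_l (L_vs^l + lam * L_ms^l + beta * sum_n L_reg(E_n^l)).

    per_scale_terms: list of (L_vs, L_ms, [L_reg for each nearby view n]) tuples, one per scale l.
    """
    total = 0.0
    for l_vs, l_ms, l_regs in per_scale_terms:
        total = total + l_vs + lam * l_ms + beta * sum(l_regs)
    return total
```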
Our experiments were conducted on a server equipped with 4 GeForce RTX 2080 GPUs (10GB of memory each), Ubuntu 18.04, PyTorch 1.11 and CUDA 11.4. With this setup, each experiment took around 4-5 days to complete. For this reason, we were only able to define and train three main experiments, without the possibility of running hyper-parameter search. Below, we list the conducted experiments and their corresponding hyper-parameters:
Experiment | Epochs | Learning Rate | Optimizer | Batch Size | View-Synthesis (VS) Weight | Semantic VS (SVS) Weight | Smoothness (MS) Weight | Mask Reg. (MR) Weight |
---|---|---|---|---|---|---|---|---|
1 | 115 | 0.0002 | Adam | 8 | 1 | N/A | 0.1 | N/A |
2 | 115 | 0.0002 | Adam | 8 | 1 | N/A | 0.1 | 0.2 |
3 | 115 | 0.0002 | Adam | 8 | 1 | 1 | 0.1 | N/A |
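As a rough illustration of the setup above, the snippet below configures the optimizer with these shared hyper-parameters and records the per-experiment loss weights. The model names are placeholders taken from the sketches in the Approach section, not our exact training script.

```python
import torch

# Placeholder model instances based on the earlier sketches (assumed names).
depth_net, pose_net = TwoEncoderDepthNet(), TwoEncoderPoseNet()

# Shared hyper-parameters from the table above: Adam, lr 2e-4, batch size 8, 115 epochs.
optimizer = torch.optim.Adam(
    list(depth_net.parameters()) + list(pose_net.parameters()), lr=2e-4)
NUM_EPOCHS, BATCH_SIZE = 115, 8

# Per-experiment loss weights (None means the corresponding term is disabled).
EXPERIMENTS = {
    1: {"vs": 1.0, "svs": None, "ms": 0.1, "mr": None},
    2: {"vs": 1.0, "svs": None, "ms": 0.1, "mr": 0.2},
    3: {"vs": 1.0, "svs": 1.0,  "ms": 0.1, "mr": None},
}
```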
In this section, we compare the results obtained in each of our proposed experiments. Because the approach is modular, the depth network and pose network can be assessed independently. As such, we designed both qualitative and quantitative evaluations for these modules. In the paragraphs that follow, we explain each evaluation in more detail and show the corresponding results.
Before diving into the qualitative results, we briefly show and discuss the training curves for each experiment in Figure 4. First, we can observe that Experiment 1 underfit: it converged to a high loss, and its validation curve had the highest loss values of all three experiments. Experiment 2 exhibits the lowest loss values in both curves, which is expected since this experiment makes use of the Artifact Mask; however, the model barely improves from its initial loss value. Finally, Experiment 3 has the highest training loss due to the additional semantic loss term, but its validation loss is lower than that of Experiment 1.
For this qualitative evaluation, we simply ran depth inference on a set of trajectories from the test set. It should be noted that, in contrast to the validation set, the environments in the test set were never seen by the networks; the validation set uses the same environments as the training set, differing only in the trajectories.
In Figure 5 we show depth predictions for Experiment 1 (left-most), Experiment 2 (center), and Experiment 3 (right-most). First, analyzing the depth results from Experiment 1 (1st and 2nd columns), we observe that when using only RGB images on this dataset, the model markedly underfit, which is consistent with the curves shown in Figure 4. Regardless of the environment and trajectory, the model always predicts very similar outputs that look like two bright vertical rectangles on the left and right and a darker rectangle in the center. The model also does not seem to understand depth: it tends to predict darker spots at the bottom-center, whereas it should be the opposite, since in these visualizations darker means farther away.
In the second experiment (3rd and 4th columns), which studies the effect of the Artifact Mask, the model seems to better understand depth: it tends to predict ceilings as farther away (darker) and the floor at the bottom-center as closer (lighter). Nonetheless, its depth predictions are generally blurrier than those of the other two experiments.
Finally, in the third experiment (5th and 6th columns), we observe that with semantic information taken into account, the predicted depth better captures the structure of the environments; for example, features like door frames and stairs appear much clearer. However, every few frames the model predicts highly blurry images that follow the same diagonal gradient pattern, with a bright patch at the top-left corner and a dark patch at the bottom-right corner. We posit that this strange behavior may be due to several reasons, for instance:
For this evaluation, we ran pose inference on the same set of trajectories as we did with the depth network. Specifically, we provide the network with a viewpoint at some time step \(t\) and another viewpoint at time step \(t+1\). Then, we predict the pose transformation between them using either the ground truth depth or the predicted depth. Finally, we warp the image at time-step \(t+1\) to the coordinate frame of the image at time-step \(t\) and display the corresponding result.
We show the resulting warps for Experiment 1 in Figure 6. In each row, the left-most image always shows the frame at time step \(t\), the right-most shows the frame at time step \(t+1\), and the ones in the center are the warps. Here, we used the corresponding ground truth depth. Figure 7 shows one example comparing the resulting warps when using ground truth depth (top) vs. predicted depth (bottom).
As shown in Figure 7, pose warping based on ground truth depth and on predicted depth gives somewhat similar results. However, with predicted depth, the outputs tend to be slightly more bent and crooked. This holds for all three experiments.
The resulting warps for Experiment 2 are shown in Figure 8. Since this experiment considers the Artifact Mask, we also display the corresponding mask for each time-step. Like the previous experiment, we also compare the results when using ground truth depth vs predicted depth in Figure 9.
Analyzing the predicted masks, we conclude that this method did not achieve what we intended: the predicted masks capture neither occluding objects nor visual artifacts. In fact, in several cases the masks cover large parts of the image. This may explain why Experiment 2's training and validation losses were the lowest and why they did not improve significantly throughout training. As can be seen in the third column, there is significant distortion in the outputs generated using the artifact mask.
The resulting warps for Experiment 3 are shown in Figure 10. Since this experiment leverages semantic information, we also display the corresponding warps for the semantic masks. Finally, Figure 11 compares the results when using ground truth depth vs predicted depth.
As can be seen, there is significant distortion in some of the generated outputs. We acknowledge that this could be due to the relatively small number of epochs we were able to train the model for, and to the fact that we were unable to run hyper-parameter search experiments. We expect these results to improve as more experiments are carried out and the effect of each hyper-parameter is more thoroughly understood.
Similar to Figure 7, Figure 11 compares warps of the next frame based on ground truth depth and on predicted depth. Again, warping based on ground truth depth produces slightly better results than warping based on predictions.
In Table 3 we compare the ground truth depth and the predicted depth for each experiment, reporting three error metrics also used in [6]: absolute relative difference (Abs Rel), squared relative difference (Sq Rel), and root mean squared error (RMSE).
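For reference, a minimal sketch of these standard depth metrics is shown below; any median scaling or depth capping applied in the actual evaluation is omitted here.

```python
import torch

def depth_error_metrics(pred, gt):
    """Standard depth errors (as used in [6]): Abs Rel, Sq Rel, RMSE.

    pred, gt: tensors of the same shape containing valid (positive) depth values.
    """
    abs_rel = ((pred - gt).abs() / gt).mean()
    sq_rel = ((pred - gt) ** 2 / gt).mean()
    rmse = torch.sqrt(((pred - gt) ** 2).mean())
    return {"abs_rel": abs_rel.item(), "sq_rel": sq_rel.item(), "rmse": rmse.item()}
```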
These results show that Experiment 1 achieved significantly lower error than the other two experiments. Nonetheless, from the visual inspection in Section 4.2.1 we observed that Experiment 1 underfit during training. The reported errors for the remaining two experiments are similar. As discussed before, these experiments showed various limitations: in Experiment 2 the predicted depths are very blurry, and in Experiment 3 the repeated diagonal pattern may have drastically affected the predictions.
Experiment | Abs Rel | Sq Rel | RMSE |
---|---|---|---|
1 | 0.706 | 0.113 | 0.146 |
2 | 2.091 | 1.841 | 0.325 |
3 | 2.300 | 4.900 | 0.451 |
In Table 4 we compare the ground truth poses obtained from the Habitat simulator with the predicted poses for each experiment, reporting two error metrics also used in [6]: Absolute Trajectory Error (ATE) and Rotation Error (RE).
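As an illustration, one common way to compute ATE over a trajectory snippet is sketched below: the predicted positions are aligned to the ground truth with a single least-squares scale factor (following the evaluation in [6]) before taking the RMSE of the position errors. Whether our exact evaluation uses the same alignment is an assumption.

```python
import torch

def absolute_trajectory_error(pred_xyz, gt_xyz):
    """Sketch of ATE for a snippet of N predicted and ground-truth 3D positions, shape (N, 3)."""
    # Zero-center both trajectories so only relative motion is compared.
    pred = pred_xyz - pred_xyz.mean(dim=0, keepdim=True)
    gt = gt_xyz - gt_xyz.mean(dim=0, keepdim=True)
    # Least-squares scale factor that best maps the prediction onto the ground truth.
    scale = (gt * pred).sum() / (pred * pred).sum().clamp(min=1e-8)
    return torch.sqrt(((scale * pred - gt) ** 2).sum(dim=1).mean())
```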
These results show that Experiment 2 achieved slightly lower pose error than the other two experiments. Nonetheless, from the visual inspection in Section 4.2.2, it is difficult to assess whether this model's predicted poses are in fact "better", since the warping distortions are more pronounced for the latter two experiments.
Experiment | ATE | RE |
---|---|---|
1 | 0.0203 | 0.2006 |
2 | 0.0164 | 0.1429 |
3 | 0.0219 | 0.1554 |
Through the three experiments conducted, we discovered the following:
Despite these interesting quirks and all the challenges we faced, e.g., a low-quality dataset and limited resources for training more experiments and better analyzing the effect of our hyper-parameters and loss functions, we obtained interesting results on the indoor scenes and trajectories from the Matterport3D dataset. We believe the model would portray these indoor trajectories even more accurately given more time and resources to train additional experiments and better understand the hyper-parameters.
Some things that we can try going forward include: