Single View to 3D: Assignment 2 (16-889)
Name: Shefali Srivastava
Andrew ID: shefalis
Goals:
In this assignment, you will explore loss functions and decoder architectures for regressing to voxel, point cloud, and mesh representations from single-view RGB input.
Late Days:
1. Exploring Loss Functions
This section will involve defining loss functions for fitting voxels, point clouds, and meshes.
1.1. Fitting a voxel grid (5 points)
In this subsection, we will define a binary cross-entropy loss that can help us fit a 3D binary voxel grid. Define the loss function in the losses.py file; for this you can use the pre-defined losses in the PyTorch library.
Run python fit_data.py --type 'vox' to fit the source voxel grid to the target voxel grid, and visualize the optimized voxel grid alongside the ground truth voxel grid using the tools learnt in the previous section.
Command:
python main.py --command "python fit_data.py --type 'vox'"
Code:
# Voxel Loss
def voxel_loss(voxel_src, voxel_tgt):
    # Binary cross-entropy on logits: voxel_src holds raw predictions,
    # voxel_tgt is the binary target occupancy grid.
    loss = torch.nn.BCEWithLogitsLoss()
    prob_loss = loss(voxel_src, voxel_tgt)
    return prob_loss
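For context, fitting reduces to repeatedly stepping an optimizer on this loss. Below is a minimal, illustrative sketch of such a loop, not the actual fit_data.py script; the shapes, learning rate, and iteration count are assumptions.

import torch
from losses import voxel_loss  # the loss defined above

# Hypothetical fitting loop: optimise learnable source logits toward a fixed binary target grid.
voxel_src = torch.randn(1, 32, 32, 32, requires_grad=True)   # source logits (learnable)
voxel_tgt = (torch.rand(1, 32, 32, 32) > 0.5).float()        # binary target grid (fixed)
optimizer = torch.optim.Adam([voxel_src], lr=1e-2)

for step in range(1000):
    optimizer.zero_grad()
    loss = voxel_loss(voxel_src, voxel_tgt)
    loss.backward()
    optimizer.step()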
Visualisation:


1.2. Fitting a point cloud (10 points)
In this subsection, we will define a Chamfer loss that can help us fit a 3D point cloud. Define the loss function in the losses.py file. We expect you to write your own code for this and not use any PyTorch3D utilities, although you are allowed to use functions inside pytorch3d.ops.knn such as knn_gather or knn_points.
Run python fit_data.py --type 'point' to fit the source point cloud to the target point cloud, and visualize the optimized point cloud alongside the ground truth point cloud using the tools learnt in the previous section.
Command:
python main.py --command "python fit_data.py --type 'point'"
Code:
# Point Cloud Loss
def chamfer_loss(point_cloud_src, point_cloud_tgt):
    # Squared distance from each source point to its nearest target point, and vice versa.
    dists_st, _, _ = knn_points(point_cloud_src, point_cloud_tgt, K=1)
    dists_ts, _, _ = knn_points(point_cloud_tgt, point_cloud_src, K=1)
    # Symmetric Chamfer loss: mean of both directional distances.
    loss_chamfer = torch.mean(dists_st) + torch.mean(dists_ts)
    return loss_chamfer
Visualisation:


1.3. Fitting a mesh (5 points)
In this subsection, we will define an additional smoothing loss that can help us fit a mesh. Define the loss function in the losses.py file; for this you can use the pre-defined losses in the PyTorch library.
Run python fit_data.py --type 'mesh' to fit the source mesh to the target mesh, and visualize the optimized mesh alongside the ground truth mesh using the tools learnt in the previous section.
Command:
python main.py --command "python fit_data.py --type 'mesh'"
Code:
# Mesh Loss
def smoothness_loss(mesh_src):
    # Laplacian smoothing regulariser (mesh_laplacian_smoothing from pytorch3d.loss).
    loss_laplacian = mesh_laplacian_smoothing(mesh_src)
    return loss_laplacian
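When fitting and training meshes, this smoothness term is combined with a Chamfer term computed on points sampled from the source and target meshes. Below is a hedged sketch of that combination; the weights and sample count mirror the hyperparameters used later in this report, and sample_points_from_meshes comes from pytorch3d.ops. The helper name mesh_fitting_loss is illustrative.

from pytorch3d.ops import sample_points_from_meshes
from losses import chamfer_loss, smoothness_loss  # the losses defined above

def mesh_fitting_loss(mesh_src, mesh_tgt, w_chamfer=1.0, w_smooth=0.1, n_points=5000):
    # Compare point clouds sampled from both meshes with the Chamfer loss,
    # and regularise the source mesh with the Laplacian smoothness term.
    points_src = sample_points_from_meshes(mesh_src, n_points)
    points_tgt = sample_points_from_meshes(mesh_tgt, n_points)
    return w_chamfer * chamfer_loss(points_src, points_tgt) + w_smooth * smoothness_loss(mesh_src)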
Visualisation:


2. Reconstructing 3D from single view
This section will involve training a single-view-to-3D pipeline for voxels, point clouds, and meshes. Refer to the save_freq argument in train_model.py to save the model checkpoint more or less frequently.
2.1. Image to voxel grid (15 points)
In this subsection, we will define a neural network to decode binary voxel grids. Define the decoder network in the model.py file, then reference your decoder in model.py.
Run python train_model.py --type 'vox' to train the single-view-to-voxel-grid pipeline; feel free to tune the hyperparameters as needed.
After training, visualize the input RGB, ground truth voxel grid, and predicted voxel grid in eval_model.py using:
python eval_model.py --type 'vox' --load_checkpoint
You need to add the respective visualization code in eval_model.py.
On your webpage, you should include visuals of any three examples from the test set. For each example, show the input RGB, a render of the predicted 3D voxel grid, and a render of the ground truth mesh.
Training
Command:
python main.py --command "python train_model.py --type 'vox' --log_freq 100 --batch_size 32 --num_workers 4 --lr 1e-5 --save_freq 100 --tensorboard_freq 100 --cuda_device 0"
Code:
# Voxel Decoder:
self.voxel_size = 32
self.decoder = torch.nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2048),
    nn.ReLU(),
    nn.Linear(2048, 4096),
    nn.ReLU(),
    nn.Linear(4096, 8192),
    nn.ReLU(),
    nn.Linear(8192, self.voxel_size * self.voxel_size * self.voxel_size)
)
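At the output side, the 8192-dimensional activation maps to 32³ logits that are reshaped into a voxel grid (the reshape matches the one used in the merge_images code in Section 2.6). A small self-contained sketch with an illustrative batch size:

import torch
import torch.nn as nn

# Illustrative forward step through a decoder with the same layer sizes as above.
voxel_size = 32
decoder = nn.Sequential(
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, 2048), nn.ReLU(),
    nn.Linear(2048, 4096), nn.ReLU(),
    nn.Linear(4096, 8192), nn.ReLU(),
    nn.Linear(8192, voxel_size ** 3),
)
encoded_feat = torch.randn(4, 512)                      # (B, 512) image features from the encoder
voxels_pred = decoder(encoded_feat)                     # (B, 32*32*32) occupancy logits
voxels_pred = voxels_pred.reshape(-1, 1, voxel_size, voxel_size, voxel_size)  # (B, 1, 32, 32, 32)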
Hyperparameters:
--lr 1e-5
--max_iter 10000
--log_freq 100
--batch_size 32
--num_workers 4
--type 'vox'
--n_points 5000
--w_chamfer 1.0
--w_smooth 0.1
--save_freq 100
--tensorboard_freq 100
--cuda_device 2
Training Tensorboards:


Evaluation
Command:
python main.py --command "python eval_model.py --type 'vox' --load_checkpoint --pred_gif_name 'q21_source' --pred_path 'outputs' --gt_gif_name 'q21_target' --gt_path 'outputs' --vis_freq 100"
Code:
# Voxel Visualisation:
def visualise_voxels(
    voxels,
    gif_name,
    path,
    device=None,
    image_size=256,
    save_gif_=True
):
    '''
    Visualise & save incoming voxels.
    voxels: Single incoming voxel grid of shape (length, width, height).
    device: Running device.
    return: Rendered views of the voxel grid.
    '''
    if device is None:
        device = get_device()
    min_value = 0.0
    max_value = 1.0
    voxel_size = voxels.shape[0]
    # Extract a mesh from the (smoothed) voxel grid with marching cubes.
    vertices, faces = mcubes.marching_cubes(
        mcubes.smooth(voxels.detach().cpu().numpy()),
        isovalue=0
    )
    vertices = torch.tensor(vertices).float()
    faces = torch.tensor(faces.astype(int))
    # Normalise vertices to [min_value, max_value] and colour each vertex by its position.
    vertices = (vertices / voxel_size) * (max_value - min_value) + min_value
    textures = (vertices - vertices.min()) / (vertices.max() - vertices.min())
    textures = pytorch3d.renderer.TexturesVertex(
        textures.unsqueeze(0)
    )
    mesh = pytorch3d.structures.Meshes(
        [vertices],
        [faces],
        textures=textures
    ).to(device)
    lights = pytorch3d.renderer.PointLights(
        location=[[0, 0.0, -4.0]],
        device=device,
    )
    renderer = get_mesh_renderer(
        image_size=image_size,
        device=device
    )
    # Render the mesh from several viewpoints along a 360-degree trajectory.
    num_views = 15
    _, _, many_cameras = render_360(num_views)
    images = renderer(
        mesh.extend(num_views),
        cameras=many_cameras,
        lights=lights
    )
    if save_gif_:
        gif_name = f'{gif_name}.gif'
        save_gif(path, gif_name, images.cpu().numpy())
    return images.detach().cpu().numpy()
Visualisation:


















2.2. Image to point cloud (15 points)
In this subsection, we will define a neural network to decode point clouds. As above, define the decoder network in the model.py file, then reference your decoder in model.py.
Run python train_model.py --type 'point' to train the single-view-to-point-cloud pipeline; feel free to tune the hyperparameters as needed.
After training, visualize the input RGB, ground truth point cloud, and predicted point cloud in eval_model.py using:
python eval_model.py --type 'point' --load_checkpoint
You need to add the respective visualization code in eval_model.py.
On your webpage, you should include visuals of any three examples from the test set. For each example, show the input RGB, a render of the predicted 3D point cloud, and a render of the ground truth mesh.
Training
Command:
python main.py --command "python train_model.py --type 'point' --log_freq 100 --batch_size 32 --num_workers 4 --lr 1e-5 --save_freq 100 --tensorboard_freq 100 --cuda_device 0"
Code:
# Point Cloud Decoder:
self.n_point = args.n_points
self.decoder = torch.nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2048),
    nn.ReLU(),
    nn.Linear(2048, 4096),
    nn.ReLU(),
    nn.Linear(4096, 8192),
    nn.ReLU(),
    nn.Linear(8192, self.n_point * 3)
)
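The decoder output here is simply reshaped into per-point xyz coordinates (again matching the reshape in the Section 2.6 code). A short illustrative snippet with an assumed batch size:

import torch

# Illustrative reshape of the point-cloud decoder output into xyz coordinates.
B, n_points = 4, 5000
decoder_out = torch.randn(B, n_points * 3)              # (B, n_points * 3) from the decoder
pointclouds_pred = decoder_out.reshape(B, n_points, 3)  # (B, n_points, 3)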
Hyperparameters:
--lr 1e-5
--max_iter 10000
--log_freq 100
--batch_size 32
--num_workers 4
--type 'point'
--n_points 5000
--w_chamfer 1.0
--w_smooth 0.1
--save_freq 100
--tensorboard_freq 100
--cuda_device 2
Training Tensorboards:


Evaluation
Command:
python main.py --command "python eval_model.py --type 'point' --load_checkpoint --pred_gif_name 'q22_source' --pred_path 'outputs' --gt_gif_name 'q22_target' --gt_path 'outputs' --vis_freq 100"
Code:
def visualise_point_clouds(
    point_cloud,
    gif_name,
    path,
    image_size=512,
    background_color=(1, 1, 1),
    device=None,
    save_gif_=True
):
    '''
    Renders a point cloud.
    '''
    if device is None:
        device = get_device()
    renderer = get_points_renderer(
        image_size=image_size,
        background_color=background_color
    )
    verts = point_cloud.to(device).unsqueeze(0)
    rgb = torch.rand(verts.shape).to(device)
    point_cloud = pytorch3d.structures.Pointclouds(
        points=verts,
        features=rgb
    )
    num_views = 15
    _, _, many_cameras = render_360(num_views)
    images = renderer(point_cloud.extend(num_views), cameras=many_cameras)[:, :, :, :3]
    if save_gif_:
        gif_name = f'{gif_name}.gif'
        save_gif(path, gif_name, images.detach().cpu().numpy())
    return images.detach().cpu().numpy()
Visualisation:


















2.3. Image to mesh (15 points)
In this subsection, we will define a neural network to decode meshes. As above, define the decoder network in the model.py file, then reference your decoder in model.py.
Run python train_model.py --type 'mesh' to train the single-view-to-mesh pipeline; feel free to tune the hyperparameters as needed. We also encourage you to try different mesh initializations here.
After training, visualize the input RGB, ground truth mesh, and predicted mesh in eval_model.py using:
python eval_model.py --type 'mesh' --load_checkpoint
You need to add the respective visualization code in eval_model.py.
On your webpage, you should include visuals of any three examples from the test set. For each example, show the input RGB, a render of the predicted mesh, and a render of the ground truth mesh.
Training
Command:
python main.py --command "python train_model.py --type 'mesh' --log_freq 100 --batch_size 32 --num_workers 4 --lr 1e-5 --save_freq 100 --tensorboard_freq 100 --cuda_device 0"
Code:
# Mesh Decoder:
mesh_pred = ico_sphere(4, 'cuda')
self.mesh_pred = pytorch3d.structures.Meshes(
    mesh_pred.verts_list() * args.batch_size,
    mesh_pred.faces_list() * args.batch_size
)
self.decoder = torch.nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 2048),
    nn.ReLU(),
    nn.Linear(2048, 4096),
    nn.ReLU(),
    nn.Linear(4096, 8192),
    nn.ReLU(),
    nn.Linear(8192, self.mesh_pred.verts_list()[0].shape[0] * self.mesh_pred.verts_list()[0].shape[1])
)
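Here the decoder predicts per-vertex offsets that deform the ico-sphere template via Meshes.offset_verts (as used in the merge_images code in Section 2.6). A self-contained sketch follows, with a single linear layer standing in for the MLP above:

import torch
import torch.nn as nn
import pytorch3d.structures
from pytorch3d.utils import ico_sphere

B = 2
template = ico_sphere(4)                                      # level-4 ico-sphere template
mesh_pred = pytorch3d.structures.Meshes(
    template.verts_list() * B, template.faces_list() * B      # replicate the template per batch item
)
V = template.verts_list()[0].shape[0]                         # number of template vertices
decoder = nn.Linear(512, V * 3)                               # stand-in for the decoder MLP above
encoded_feat = torch.randn(B, 512)                            # (B, 512) image features
offsets = decoder(encoded_feat)                               # (B, V * 3) per-vertex offsets
mesh_out = mesh_pred.offset_verts(offsets.reshape(-1, 3))     # deformed predicted meshes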
Hyperparameters:
--lr 1e-5
--max_iter 10000
--log_freq 100
--batch_size 32
--num_workers 4
--type 'mesh'
--n_points 5000
--w_chamfer 1.0
--w_smooth 0.1
--save_freq 100
--tensorboard_freq 100
--cuda_device 2
Training Tensorboards:


Evaluation
Command:
python main.py --command "python eval_model.py --type 'mesh' --load_checkpoint --pred_gif_name 'q23_source' --pred_path 'outputs' --gt_gif_name 'q23_target' --gt_path 'outputs' --vis_freq 100"
Code:
def visualise_mesh(
    mesh,
    gif_name,
    path,
    image_size=512,
    background_color=(1, 1, 1),
    device=None,
    save_gif_=True
):
    '''
    Renders a mesh.
    '''
    if device is None:
        device = get_device()
    # Assign random per-vertex colours as textures.
    mesh.textures = pytorch3d.renderer.TexturesVertex(
        torch.rand((mesh.verts_list()[0].shape)).unsqueeze(0).to(device)
    )
    mesh = mesh.to(device)
    lights = pytorch3d.renderer.PointLights(
        location=[[0, 0.0, -4.0]],
        device=device
    )
    renderer = get_mesh_renderer(
        image_size=image_size,
        device=device
    )
    num_views = 15
    _, _, many_cameras = render_360(num_views)
    images = renderer(
        mesh.extend(num_views),
        cameras=many_cameras,
        lights=lights
    )[:, :, :, :3]
    if save_gif_:
        gif_name = f'{gif_name}.gif'
        save_gif(path, gif_name, images.detach().cpu().numpy())
    return images.detach().cpu().numpy()
Visualisation:


















2.4. Quantitative comparisons (10 points)
Quantitatively compare the F1 score of 3D reconstruction for meshes vs. point clouds vs. voxel grids, and provide an intuitive explanation justifying the comparison.
For evaluation you can run:
python eval_model.py --type voxel|mesh|point --load_checkpoint
On your webpage, you should include the average test F1 score at the 0.05 threshold for the voxel grid, point cloud, and mesh networks.
Command:
python main.py --command "python eval_model.py --type voxel|mesh|point --load_checkpoint --pred_gif_name 'q23_source' --pred_path 'outputs' --gt_gif_name 'q23_target' --gt_path 'outputs' --vis_freq 1000"
Quantitative Comparison:
Representation | Average F1@0.05 |
---|---|
Voxels | 78.573 |
Point Clouds | 96.062 |
Meshes | 91.385 |
Intuitive Explanation:
The average F1 score is highest for point clouds at 96.062 and lowest for voxels at 78.573. This is intuitive because voxel reconstructions are confined to a 32×32×32 grid, a restrictive resolution that may not be enough to model the shape accurately. The mesh network scores 91.385, the next highest after point clouds. Meshes are limited by the initial mesh we start from, and they explicitly model connectivity through faces, which makes it harder to reconstruct every example in a dataset with large intra-class variation. Point clouds perform best since the points are free to take any position in 3D space and can be spread out or concentrated as needed, depending on the number of points, to best capture the shape.
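For reference, F1@0.05 is the harmonic mean of precision and recall computed from nearest-neighbour distances between sampled predicted and ground-truth points. A hedged sketch of that metric (not the exact code used by eval_model.py) is:

import torch
from pytorch3d.ops import knn_points

def f1_score(points_pred, points_gt, threshold=0.05):
    # knn_points returns squared distances to the nearest neighbour; take the square root.
    dist_pred_to_gt = knn_points(points_pred, points_gt, K=1).dists[..., 0].sqrt()
    dist_gt_to_pred = knn_points(points_gt, points_pred, K=1).dists[..., 0].sqrt()
    # Precision: predicted points close to the ground truth; recall: ground-truth points covered.
    precision = (dist_pred_to_gt < threshold).float().mean()
    recall = (dist_gt_to_pred < threshold).float().mean()
    return 100.0 * 2 * precision * recall / (precision + recall + 1e-8)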
2.5. Analyse effects of hyperparameter variations (10 points)
Analyse the results by varying a hyperparameter of your choice, for example n_points, vox_size, w_chamfer, or the initial mesh (ico_sphere). Try to be unique and conclusive in your analysis.
1. Number of Points in the Point Cloud Reduced to 2500 (Halved)
Training
Command:
python main.py --command "python train_model.py --type 'point' --log_freq 100 --batch_size 32 --num_workers 4 --lr 1e-5 --save_freq 100 --tensorboard_freq 100 --cuda_device 0 --n_points 2500"
Hyperparameters:
--lr 1e-5
--max_iter 10000
--log_freq 100
--batch_size 32
--num_workers 4
--type 'point'
--n_points 2500
--w_chamfer 1.0
--w_smooth 0.1
--save_freq 100
--tensorboard_freq 100
--cuda_device 2
Training Tensorboards:


Evaluation
Command:
python main.py --command "python eval_model.py --type 'point' --load_checkpoint --pred_gif_name 'q22_source' --pred_path 'outputs' --gt_gif_name 'q22_target' --gt_path 'outputs' --vis_freq 100"
Visualisation:


















Quantitative Comparison:
Representation | Hyperparameter | Average F1@0.05 |
---|---|---|
Point Clouds | --n_points 5000 | 96.062 |
Point Clouds | --n_points 2500 | 95.746 |
Analysis:
On decreasing the number of points in the point cloud from 5000 to 2500 (half), we see almost no drop in performance: the F1 score is nearly unchanged, and the network trains much faster. The 5000-point visualisations were very dense, while the 2500-point ones are sparser but still visually clear. So it is safe to say that 2500 points suffice for this problem.
2. Smoothing Loss Weight in Mesh Prediction (w_smooth) Increased to 0.5
Training
Command:
python main.py --command "python train_model.py --type 'mesh' --log_freq 100 --batch_size 32 --num_workers 4 --lr 1e-5 --save_freq 100 --tensorboard_freq 100 --cuda_device 0 --w_smooth 0.5"
Hyperparameters:
--lr 1e-5
--max_iter 10000
--log_freq 100
--batch_size 32
--num_workers 4
--type 'mesh'
--n_points 5000
--w_chamfer 1.0
--w_smooth 0.5
--save_freq 100
--tensorboard_freq 100
--cuda_device 2
Training Tensorboards:


Evaluation
Command:
python main.py --command "python eval_model.py --type 'mesh' --load_checkpoint --pred_gif_name 'q23_source' --pred_path 'outputs' --gt_gif_name 'q23_target' --gt_path 'outputs' --vis_freq 100"
Visualisation:


















Quantitative Comparison:
Representation | Hyperparameter | Average F1@0.05 |
---|---|---|
Mesh | --w_smooth 0.1 | 91.385 |
Mesh | --w_smooth 0.5 | 85.719 |
Analysis:
Loss Computation:
loss_reg = losses.chamfer_loss(sample_pred, sample_trg)
loss_smooth = losses.smoothness_loss(predictions)
loss = args.w_chamfer * loss_reg + args.w_smooth * loss_smooth
Increasing w_smooth from 0.1 to 0.5 gives the Laplacian smoothness term more weight relative to the Chamfer term, which biases the predictions toward smoother, less detailed meshes and lowers the F1 score from 91.385 to 85.719.
3. Changing Initial Mesh from Sphere to Torus
Training
Command:
python main.py --command "python train_model.py --type 'mesh' --log_freq 100 --batch_size 32 --num_workers 4 --lr 1e-5 --save_freq 100 --tensorboard_freq 100 --cuda_device 0"
Code:
# Mesh Decoder:
r = 2
R = 5
sides = 100
rings = 100
mesh_pred = torus(r=r, R=R, sides=sides, rings=rings, device='cuda')
self.mesh_pred = pytorch3d.structures.Meshes(mesh_pred.verts_list()*args.batch_size, mesh_pred.faces_list()*args.batch_size)
self.decoder = torch.nn.Sequential(
nn.Linear(512, 1024),
nn.ReLU(),
nn.Linear(1024, 2048),
nn.ReLU(),
nn.Linear(2048, 4096),
nn.ReLU(),
nn.Linear(4096, 8192),
nn.ReLU(),
nn.Linear(8192, self.mesh_pred.verts_list()[0].shape[0] * self.mesh_pred.verts_list()[0].shape[1])
)
Hyperparameters:
--lr 1e-5
--max_iter 10000
--log_freq 100
--batch_size 32
--num_workers 4
--type 'mesh'
--n_points 5000
--w_chamfer 1.0
--w_smooth 0.1
--save_freq 100
--tensorboard_freq 100
--cuda_device 2
Training Tensorboards:


Evaluation
Command:
python main.py --command "python eval_model.py --type 'mesh' --load_checkpoint --pred_gif_name 'q23_source' --pred_path 'outputs' --gt_gif_name 'q23_target' --gt_path 'outputs' --vis_freq 100"
Visualisation:


















Quantitative Comparison:
Representation | Hyperparameter: Initial Mesh | Average F1@0.05 |
---|---|---|
Mesh | Sphere | 91.385 |
Mesh | Torus | 86.269 |
Analysis:
Starting with a torus instead of a sphere degrades performance, as is apparent from both the F1 score (91.385 → 86.269) and the visualisations.
2.6. Interpret your model (15 points)
Simply seeing final predictions and numerical evaluations is not always insightful. Can you create some visualizations that help highlight what your learned model does? Be creative and think of what visualizations would help you gain insights. There is no right answer, although reading some papers to get inspiration might give you ideas.
Idea:
To interpret the model, we can examine how it performs when predicting from the merged features of two images.
Command:
python main.py --command "python eval_model_merge.py --type 'point' --load_checkpoint --pred_path 'eval_merged/' --gt_path 'eval_merged/' --vis_freq 100 --checkpoint_path 'checkpoint_point.pth' --merge_images_ws 10"
Code:
def merge_images(self, images_1, images_2, weights, args):
results = dict()
total_loss = 0.0
start_time = time.time()
B = images_1.shape[0]
images_1_normalize = self.normalize(images_1.permute(0, 3, 1, 2))
images_2_normalize = self.normalize(images_2.permute(0, 3, 1, 2))
encoded_feat_1 = self.encoder(images_1_normalize).squeeze(-1).squeeze(-1) # Output Shape = (batch_size, 512)
encoded_feat_2 = self.encoder(images_2_normalize).squeeze(-1).squeeze(-1)
preds = []
for weight in weights:
merged_encoded_features = weight * encoded_feat_1 + (1 - weight) * encoded_feat_2 # Output Shape = (batch_size, 512)
if args.type == "vox":
pred = self.decoder(merged_encoded_features).reshape((B, 1, 32, 32, 32))
elif args.type == "point":
pred = self.decoder(merged_encoded_features).reshape((B, self.n_point, 3))
elif args.type == "mesh":
deform_vertices_pred = self.decoder(merged_encoded_features)
pred = self.mesh_pred.offset_verts(deform_vertices_pred.reshape([-1,3]))
return pred
preds.append(pred)
return preds
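A hedged usage example for the function above, assuming --merge_images_ws 10 corresponds to ten evenly spaced interpolation weights; model, images_1, images_2, and args are placeholders for the loaded model and its inputs.

import torch

# Interpolate from the features of image 1 (w = 1) to the features of image 2 (w = 0) in ten steps.
weights = torch.linspace(1.0, 0.0, steps=10).tolist()
preds = model.merge_images(images_1, images_2, weights, args)  # one prediction per weight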
Visualisation:
Visualisation 1:










Visualisation 2:










Visualisation 3:










Visualisation 4:










Analysis:
To try to understand what the model learns, I assessed how it performs on a weighted average of the feature vectors of two images. The results are shown while varying w from 1 to 0 in w * feat_1 + (1 - w) * feat_2, so that the merged feature moves from being more descriptive of the first image to being more descriptive of the second. This is the progression plotted above. We see that the model learns a good feature encoding-decoding system and is able to move smoothly through the latent space.
3. (Extra Credit) Exploring some recent architectures.
3.1 Implicit network (10 points)
Implement an implicit decoder that takes 3D locations as input and outputs the occupancy value. Some papers for inspiration: [1, 2].
Training
Command:
python main.py --command "python occupancy_net.py --type 'vox'"
Code:
# Single View To Occupancy Net Prediction Network
class SingleViewToOccupancyNet(nn.Module):
    def __init__(self, args):
        '''
        Predict occupancy from images.
        '''
        super(SingleViewToOccupancyNet, self).__init__()
        self.device = "cuda"
        vision_model = torchvision_models.__dict__[args.arch](pretrained=True)
        self.normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        self.image_encoder = torch.nn.Sequential(
            *(list(vision_model.children())[:-1])
        )
        self.image_encoder_linear = torch.nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU()
        )
        self.point_encoder = torch.nn.Sequential(
            nn.Linear(3, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
        )
        self.decoder = torch.nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 1)
        )

    def forward(self, images, points):
        '''
        Forward pass through the network.
        images: (B, H, W, 3)
        points: (B, N, 3)
        '''
        # Image features
        images = self.normalize(images.permute(0, 3, 1, 2))  # (H, W, C) --> (C, H, W)
        image_feat_ = self.image_encoder(images).squeeze(-1).squeeze(-1)  # (B, 512)
        image_feat = self.image_encoder_linear(image_feat_)  # (B, 128)
        image_feat = image_feat.unsqueeze(1)  # (B, 1, 128)
        # Point features
        points_feat = self.point_encoder(points)  # (B, N, 128)
        # Added features
        feat = image_feat + points_feat  # (B, N, 128)
        # Network output
        outputs = self.decoder(feat)  # (B, N, 1)
        return outputs  # (B, N, 1)

# Loss Function
def get_loss(outputs, ground_truths, step, writer):
    '''
    Return the binary cross-entropy loss.
    outputs: (B, N, 1)
    ground_truths: (B, N, 1)
    '''
    criterion = torch.nn.BCEWithLogitsLoss()
    loss = criterion(outputs, ground_truths)
    writer.add_scalar('Loss/train', loss, step)
    return loss

# Construction of Dataset
def construct_dataset(images, voxels, N=2500):
    '''
    Create the dataset for SingleViewToOccupancyNet.
    images: (B, H, W, 3)
    voxels: (B, 1, 32, 32, 32)
    return:
        images: (B, H, W, 3)
        points: (B, N, 3)
        ground_truths: (B, N, 1)
    '''
    num_batches = images.shape[0]
    voxel_range = voxels[0][0].shape[0]
    points = []
    ground_truths = []
    for idx in range(num_batches):
        image, voxel = images[idx], voxels[idx][0]
        # Sample points from the voxel grid; args is taken from the enclosing script scope.
        x_indices, y_indices, z_indices = np.random.choice(voxel_range, N), np.random.choice(voxel_range, N), np.random.choice(voxel_range, N)
        ground_truth = voxel[x_indices, y_indices, z_indices].cpu().numpy()
        points.append(list(zip(x_indices, y_indices, z_indices)))
        ground_truths.append(ground_truth)
    return images.to(args.cuda_device), torch.tensor(points, dtype=torch.float32).to(args.cuda_device), torch.tensor(ground_truths).unsqueeze(2).to(args.cuda_device)
Hyperparameters:
--lr 1e-5
--max_iter 10000
--log_freq 100
--batch_size 32
--num_workers 4
--type 'vox'
--n_points 5000
--w_chamfer 1.0
--w_smooth 0.1
--save_freq 100
--tensorboard_freq 100
--cuda_device 2
Training Tensorboards:

Evaluation
Command:
python main.py --command "python occupancy_net_predict.py --type 'vox' --load_checkpoint --vis_freq 100"
Visualisation:





















Analysis and Explanation:
For the network:
- Images are encoded into a feature vector of size 128.
- 3D points (x, y, z) are encoded into a feature vector of size 128.
- The two feature vectors are added.
- The resulting 128-dimensional feature is passed through the decoder to regress the occupancy of each 3D point.
Although the reconstructions look reasonably good, the loss curve suggests the network could still be trained for longer. Also, the visualised voxels are not oriented with the right viewpoint.
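One way to render the implicit predictions as a voxel grid (a sketch under the assumption of a dense 32³ query grid; model, images, and args are placeholders for the trained network and its inputs) is to query every grid location and threshold the sigmoid of the logits:

import torch

voxel_size = 32
# Dense (1, 32^3, 3) grid of integer query coordinates.
coords = torch.stack(torch.meshgrid(
    torch.arange(voxel_size), torch.arange(voxel_size), torch.arange(voxel_size),
    indexing='ij'
), dim=-1).reshape(1, -1, 3).float().to(args.cuda_device)

with torch.no_grad():
    logits = model(images[:1], coords)                    # (1, 32^3, 1) occupancy logits
voxels = (torch.sigmoid(logits) > 0.5).float().reshape(voxel_size, voxel_size, voxel_size)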
3.2 Parametric network (10 points)
Implement a parametric function that takes sampled 2D points as input and outputs their corresponding 3D points. Some papers for inspiration: [1, 2].
Training
Command:
python main.py --command "python parametric_net.py --type 'vox'"
Code:
# Parametric Network Prediction
class ParametricNet(nn.Module):
    def __init__(self, args):
        '''
        Predict a 3D point cloud from images by mapping sampled 2D points to 3D.
        '''
        super(ParametricNet, self).__init__()
        self.device = "cuda"
        vision_model = torchvision_models.__dict__[args.arch](pretrained=True)
        self.normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        self.image_encoder = torch.nn.Sequential(
            *(list(vision_model.children())[:-1])
        )
        self.image_encoder_linear = torch.nn.Sequential(
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU()
        )
        self.point_encoder = torch.nn.Sequential(
            nn.Linear(2, 32),
            nn.ReLU(),
            nn.Linear(32, 64),
            nn.ReLU(),
            nn.Linear(64, 128),
            nn.ReLU(),
        )
        self.decoder = torch.nn.Sequential(
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 3)
        )

    def forward(self, images, points):
        '''
        Forward pass through the network.
        images: (B, H, W, 3)
        points: (B, N, 2)
        '''
        # Image features
        images = self.normalize(images.permute(0, 3, 1, 2))  # (H, W, C) --> (C, H, W)
        image_feat_ = self.image_encoder(images).squeeze(-1).squeeze(-1)  # (B, 512)
        image_feat = self.image_encoder_linear(image_feat_)  # (B, 128)
        image_feat = image_feat.unsqueeze(1)  # (B, 1, 128)
        # Point features
        points_feat = self.point_encoder(points)  # (B, N, 128)
        # Added features
        feat = image_feat + points_feat  # (B, N, 128)
        # Network output
        outputs = self.decoder(feat)  # (B, N, 3)
        return outputs  # (B, N, 3)

# Loss Function
def get_loss(point_clouds, ground_truths, step, writer):
    '''
    Return the Chamfer loss between the ground truth and predicted point clouds.
    point_clouds: (B, N, 3)
    ground_truths: (B, N, 3)
    '''
    loss = chamfer_loss(point_clouds, ground_truths)
    writer.add_scalar('Loss/train', loss, step)
    return loss

def construct_dataset(images, args, range_x=100, range_y=100, N=2500):
    '''
    Create the dataset for ParametricNet.
    images: (B, H, W, 3)
    return:
        images: (B, H, W, 3)
        points: (B, N, 2)
    '''
    num_batches = images.shape[0]
    points = []
    for idx in range(num_batches):
        # Randomly sample points from the given 2D range (a 2D surface).
        x_indices, y_indices = np.random.choice(range_x, N), np.random.choice(range_y, N)
        points.append(list(zip(x_indices, y_indices)))
    return images.to(args.cuda_device), torch.tensor(points, dtype=torch.float32).to(args.cuda_device)
Hyperparameters:
--lr 1e-5
--max_iter 10000
--log_freq 100
--batch_size 32
--num_workers 4
--type 'vox'
--n_points 2500
--w_chamfer 1.0
--w_smooth 0.1
--save_freq 100
--tensorboard_freq 100
--cuda_device 2
Training Tensorboards:

Evaluation
Command:
python main.py --command "python parametric_net_predict.py --type 'point' --load_checkpoint --vis_freq 100"
Visualisation:
The number of points is varied over 2500, 5000, 7500, and 10000, in that order. The source image is shown alongside these visualisations, and the ground truth is plotted at the end.










































Analysis and Explanation:
For the network:
- Images are encoded into a feature vector of size 128.
- Randomly sampled 2D surface points (x, y) are encoded into a feature vector of size 128.
- The two feature vectors are added.
- The resulting 128-dimensional feature is passed through the decoder to regress the 3D point cloud.
The reconstructions work well and are correctly oriented!
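A short inference sketch for the varying point counts shown above (model, images, and args are placeholders; the sampling range of 100 follows construct_dataset):

import numpy as np
import torch

# Sample N random 2D points and map them through the parametric network to a 3D point cloud.
for N in [2500, 5000, 7500, 10000]:
    xy = np.stack([np.random.choice(100, N), np.random.choice(100, N)], axis=1)   # (N, 2)
    points_2d = torch.tensor(xy, dtype=torch.float32).unsqueeze(0).to(args.cuda_device)
    with torch.no_grad():
        point_cloud = model(images[:1], points_2d)        # (1, N, 3) predicted 3D points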