16-825: Learning for 3D Vision
Assignment 4 - 3D Gaussian Splatting and Diffusion Guided Optimization
Long Vân Tran Ha (ltranha)
Monday, April 1st, 2024
Please view this page on Wi-Fi, as it contains many GIFs.
Table of Contents
- 1.1 3D Gaussian Rasterization
- 1.2 Training 3D Gaussian Representations
- 1.3.1 Rendering Using Spherical Harmonics
- 1.3.2 Training On a Harder Scene
- 2.1 SDS Loss + Image Optimization
- 2.2 Texture Map Optimization for Mesh
- 2.3 NeRF Optimization
- 2.4.1 View-dependent text embedding
- 2.4.3 Variation of the SDS loss implementation
1.1 3D Gaussian Rasterization
```
python render.py --out_path ._output/q11
```
| Colors, depth and mask (view-independent) |
| --- |
| ![]() |
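For context, the renderer composites depth-sorted 2D Gaussians front-to-back, and the colour, depth, and mask outputs all come from the same compositing weights. Below is a minimal sketch of the colour pass, assuming the Gaussians are already projected to 2D and sorted by depth (the names are mine, not the assignment API):

```python
import torch

def composite_colours(means2d, cov2d_inv, opacities, colours, pixels):
    # pixels: (P, 2); means2d: (N, 2); cov2d_inv: (N, 2, 2) inverse 2D
    # covariances; opacities: (N,); colours: (N, 3).
    out = torch.zeros(pixels.shape[0], 3)
    transmittance = torch.ones(pixels.shape[0])
    for i in range(means2d.shape[0]):  # front-to-back over sorted Gaussians
        d = pixels - means2d[i]
        power = -0.5 * torch.einsum("pi,ij,pj->p", d, cov2d_inv[i], d)
        alpha = (opacities[i] * torch.exp(power)).clamp(max=0.99)
        out = out + transmittance[:, None] * alpha[:, None] * colours[i]
        transmittance = transmittance * (1.0 - alpha)
    return out  # for depth / mask, swap colours[i] for the depth / 1.0
```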
1.2 Training 3D Gaussian Representations
The learning rates used were taken from the official implementation of Gaussian Splatting, accessible [here](https://github.com/graphdeco-inria/gaussian-splatting).
| Hyperparameter | Learning Rate |
| --- | --- |
| Opacities | 0.05 |
| Scales | 0.005 |
| Colours | 0.0025 |
| Means | 0.00016 |
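These per-parameter learning rates can be set up with Adam parameter groups. A minimal sketch follows; the attribute names (e.g. `pre_act_opacities`) are my assumption, not necessarily the assignment's exact API.

```python
import torch

def make_optimizer(gaussians) -> torch.optim.Adam:
    # One parameter group per learnable attribute, each with its own
    # learning rate (values from the official 3DGS implementation).
    return torch.optim.Adam([
        {"params": [gaussians.pre_act_opacities], "lr": 0.05},
        {"params": [gaussians.pre_act_scales],    "lr": 0.005},
        {"params": [gaussians.colours],           "lr": 0.0025},
        {"params": [gaussians.means],             "lr": 0.00016},
    ])
```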
The model was trained for 1000 iterations.
| Final PSNR | Final SSIM |
| --- | --- |
| 28.553 | 0.959 |
```
python train.py --out_path ._output/q12
```
| Training progress | Final render |
| --- | --- |
| ![]() | ![]() |
1.3.1 Rendering Using Spherical Harmonics
```
python render.py --out_path ._output/q11   # without spherical harmonics
python render.py --out_path ._output/q13   # with spherical harmonics
```
| | Colors, depth and mask |
| --- | --- |
| View-independent | ![]() |
| View-dependent | ![]() |
| | View 0 | View 4 | View 19 |
| --- | --- | --- | --- |
| View-independent | ![]() | ![]() | ![]() |
| View-dependent | ![]() | ![]() | ![]() |
| Differences | Sweater (darker, more detailed) | Pompom of the beanie (darker, more detailed) | Wheels (reflections) |
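The view-dependent colours come from evaluating each Gaussian's spherical-harmonic coefficients in the viewing direction. A minimal degree-1 sketch (the full pipeline goes up to degree 3; the constants are those of the standard real SH basis):

```python
import torch

C0 = 0.28209479177387814  # SH constant for degree 0
C1 = 0.4886025119029199   # SH constant for degree 1

def sh_to_colour(sh: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    # sh: (N, 4, 3) per-Gaussian SH coefficients; dirs: (N, 3) unit
    # viewing directions from the camera to each Gaussian.
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    colour = (C0 * sh[:, 0]
              - C1 * y * sh[:, 1]
              + C1 * z * sh[:, 2]
              - C1 * x * sh[:, 3])
    # Shift by 0.5 so that all-zero coefficients give mid-grey, then clamp.
    return (colour + 0.5).clamp(0.0, 1.0)
```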
1.3.2 Training On a Harder Scene
First, the model is trained on the materials
dataset with the same hyperparameters as in 1.2. The training progress and final render after 1000 iterations are shown below.
| Final PSNR | Final SSIM |
| --- | --- |
| 17.838 | 0.658 |
```
python train_harder_scene.py --out_path ._output/materials --num_itrs 1001
```
| Learning rates | Training progress | Final render (all views) | Final render (one view) |
| --- | --- | --- | --- |
| Opacities: 0.05 <br> Scales: 0.005 <br> Colours: 0.0025 <br> Means: 0.00016 | ![]() | ![]() | ![]() |
The previous model didn't manage to capture all the objects in the scene (e.g. the spheres in the corners). I retrained the model for 1000 iterations with the following modifications:

- Used anisotropic Gaussians instead of isotropic ones, which adds a `pre_act_quats` (per-Gaussian rotation) parameter to the model.
- Increased the learning rates.
- Added a learning rate scheduler for the means (see the sketch below).
- Added an SSIM term to the loss (with a weight of 0.2): $$\mathcal{L} = 0.8\,\mathcal{L}_{\text{L1}} + 0.2\,\mathcal{L}_{\text{SSIM}}.$$
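A minimal sketch of the scheduler and the combined loss. It assumes an `optimizer` whose last parameter group holds the means and an `ssim` function from the assignment codebase; I also read the SSIM loss as 1 − SSIM, since SSIM itself is a similarity.

```python
import torch

def make_means_scheduler(optimizer, num_itrs=1000, lr_init=1e-3, lr_final=1e-5):
    # Exponential decay of the means' learning rate from lr_init to
    # lr_final over num_itrs iterations; all other parameter groups
    # keep a constant learning rate.
    gamma = (lr_final / lr_init) ** (1.0 / num_itrs)
    n_groups = len(optimizer.param_groups)
    lambdas = [lambda it: 1.0] * (n_groups - 1) + [lambda it: gamma ** it]
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambdas)

def combined_loss(pred, gt, ssim, w_ssim=0.2):
    # L = 0.8 * L1 + 0.2 * (1 - SSIM)
    l1 = torch.abs(pred - gt).mean()
    return (1.0 - w_ssim) * l1 + w_ssim * (1.0 - ssim(pred, gt))
```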
| Final PSNR | Final SSIM |
| --- | --- |
| 18.084 | 0.672 |
```
python train_harder_scene.py --out_path ._output/materials_better --num_itrs 1001
```
| Learning rates | Training progress | Final render (all views) | Final render (one view) |
| --- | --- | --- | --- |
| Opacities: 0.05 <br> Scales: 0.01 <br> Colours: 0.01 <br> Rotations: 0.005 <br> Means (init): 1e-3 <br> Means (final): 1e-5 | ![]() | ![]() | ![]() |
All of these modifications led to only a slight improvement in the final PSNR and SSIM and in the final render (three spheres are now missing instead of four). The training may simply need to be longer (up to 7000 iterations, or even 30000 as in the original paper).
2.1 SDS Loss + Image Optimization
The image was always optimized for 1000 iterations.
| Without guidance (same for any prompt) |
| --- |
| ![]() |
I had some fun with the prompts. Also, inspired by Anonymous Atom, I wanted to see what happened if I swapped the `uncond` and `default` prompt embeddings.

The results are quite interesting. Swapping `uncond` and `default` seems to lead to either:

- a more realistic image (e.g. the swapped "the solar system" gives a more realistic castle than the regular "a castle" prompt), or
- some orange soup with broccoli.
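Concretely, the swap only changes the direction of the classifier-free-guidance update inside the SDS step. A minimal sketch, assuming a diffusers-style UNet call and a DreamFusion-style `guidance_scale` (the exact assignment API may differ):

```python
def guided_noise(unet, latents_noisy, t, uncond_emb, text_emb,
                 guidance_scale=100.0, swap=False):
    # Classifier-free guidance inside the SDS step.
    noise_uncond = unet(latents_noisy, t, encoder_hidden_states=uncond_emb).sample
    noise_text = unet(latents_noisy, t, encoder_hidden_states=text_emb).sample
    if swap:
        # Swapped variant: start from the text-conditioned prediction
        # and push towards the unconditional one instead.
        return noise_text + guidance_scale * (noise_uncond - noise_text)
    # Regular CFG: start from the unconditional prediction and push
    # towards the text-conditioned one.
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)
```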
| Prompt | Final output | Evolution | Final output (`uncond`/`default` swapped) | Evolution (`uncond`/`default` swapped) |
| --- | --- | --- | --- | --- |
| "a hamburger" | ![]() | ![]() | ![]() | ![]() |
| "the solar system" | ![]() | ![]() | ![]() | ![]() |
| "a galaxy" | ![]() | ![]() | ![]() | ![]() |
| "the big bang" | ![]() | ![]() | ![]() | ![]() |
| "pikachu" | ![]() | ![]() | ![]() | ![]() |
| "a standing corgi dog" | ![]() | ![]() | ![]() | ![]() |
| "black" | ![]() | ![]() | ![]() | ![]() |
| "black horse" | ![]() | ![]() | ![]() | ![]() |
| "3d rendering" | ![]() | ![]() | ![]() | ![]() |
| "a castle" | ![]() | ![]() | ![]() | ![]() |
| "flag of France" | ![]() | ![]() | ![]() | ![]() |
| "tennis in France" | ![]() | ![]() | ![]() | ![]() |
| "a wood house" | ![]() | ![]() | ![]() | ![]() |
| "amazing house" | ![]() | ![]() | ![]() | ![]() |
| "beautiful flowers" | ![]() | ![]() | ![]() | ![]() |
In the swapped case, the model sometimes attempted to generate something (visible after 300 iterations) before collapsing into the orange soup.
| Prompt | Final output | Output after 300 iterations (`uncond`/`default` swapped) | Final output (`uncond`/`default` swapped) |
| --- | --- | --- | --- |
| "a standing corgi dog" | ![]() | ![]() | ![]() |
| "black" | ![]() | ![]() | ![]() |
| "3d rendering" | ![]() | ![]() | ![]() |
| "a castle" | ![]() | ![]() | ![]() |
| "tennis in France" | ![]() | ![]() | ![]() |
2.2 Texture Map Optimization for Mesh
The texture map was always optimized for 1000 iterations.
| Initial mesh (same for any prompt) |
| --- |
| ![]() |
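The loop renders the mesh with a learnable texture from random viewpoints and backpropagates the SDS loss from 2.1 into the texture map. A hedged sketch, where `renderer`, `sample_random_camera`, `sds_loss`, `mesh`, `faces_uvs`, `verts_uvs`, and `text_emb` are placeholders rather than the assignment's actual names:

```python
import torch
from pytorch3d.renderer import TexturesUV

texture = torch.nn.Parameter(torch.rand(1, 512, 512, 3))  # learnable UV map
optimizer = torch.optim.Adam([texture], lr=1e-2)

for _ in range(1000):
    mesh.textures = TexturesUV(texture, faces_uvs, verts_uvs)
    image = renderer(mesh, cameras=sample_random_camera())  # (1, H, W, 3)
    loss = sds_loss(image.permute(0, 3, 1, 2), text_emb)    # SDS from 2.1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```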
| Prompt | Final output | Evolution |
| --- | --- | --- |
| "a hamburger" | ![]() | ![]() |
| "a blue and red cow" | ![]() | ![]() |
| "zebra cow" | ![]() | ![]() |
| "a cow mesh with black and white colors" | ![]() | ![]() |
2.3 NeRF Optimization
The hyperparameter values used were taken from the official implementation of DreamFusion, accessible here.
| Hyperparameter | Value |
| --- | --- |
| `lambda_entropy` | 0.001 |
| `lambda_orient` | 0.01 |
| `latent_iter_ratio` | 0.2 |
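For context, here is a hedged sketch of how these hyperparameters typically enter a stable-dreamfusion-style training loop; the function and tensor names are my assumptions, not the assignment's exact API.

```python
import torch

def entropy_loss(alphas: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Binary entropy of the per-ray opacities: pushes each ray towards
    # fully opaque or fully transparent, which reduces floaters.
    a = alphas.clamp(eps, 1.0 - eps)
    return (-a * torch.log(a) - (1.0 - a) * torch.log(1.0 - a)).mean()

def orient_loss(normals: torch.Tensor, view_dirs: torch.Tensor,
                weights: torch.Tensor) -> torch.Tensor:
    # Orientation penalty (as in Ref-NeRF): penalizes visible normals
    # that point away from the camera, weighted by rendering weights.
    return (weights * (normals * view_dirs).sum(-1).clamp(min=0.0) ** 2).mean()

def total_loss(loss_sds, alphas, normals, view_dirs, weights,
               lambda_entropy=0.001, lambda_orient=0.01):
    # `latent_iter_ratio` is not a loss weight: for the first 20% of
    # iterations the rendered output is treated directly as latents,
    # skipping the VAE encoder, which stabilizes early training.
    return (loss_sds
            + lambda_entropy * entropy_loss(alphas)
            + lambda_orient * orient_loss(normals, view_dirs, weights))
```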
| Initial RGB (same for any prompt) | Initial depth (same for any prompt) |
| --- | --- |
| ![]() | ![]() |
| Prompt | Final RGB | Final Depth | Evolution RGB | Evolution Depth |
| --- | --- | --- | --- | --- |
| "a standing corgi dog" | ![]() | ![]() | ![]() | ![]() |
| "a car" | ![]() | ![]() | ![]() | ![]() |
| "a baby bunny sitting on top of a stack of pancakes" | ![]() | ![]() | ![]() | ![]() |
2.4.1 View-dependent text embedding
Due to limited time, I only ran the view-dependent text embedding for the "a standing corgi dog" prompt.
| Prompt | Final RGB | Final Depth | Evolution RGB | Evolution Depth |
| --- | --- | --- | --- | --- |
| "a standing corgi dog" | ![]() | ![]() | ![]() | ![]() |
For reference, this is what the official implementation of DreamFusion produces for the "a standing corgi dog" prompt:
| DreamFusion | This implementation (view-dependent text embedding) | This implementation (view-independent text embedding) |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
When training without view-dependent text embeddings for the "a standing corgi dog" prompt, the model generates a dog with five legs, three ears, and two faces (the classic "Janus problem"). This is because the model is not aware of the viewing angle, so it tries to generate a dog that faces the camera from every angle.
When training with view-dependent text embeddings, the model generates a more consistent dog (four legs, two ears, one face).
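View-dependent embedding can be implemented by appending a view word to the prompt according to the sampled camera pose before encoding it. A minimal sketch; the thresholds and the azimuth convention are my assumptions, not the exact assignment values.

```python
def view_dependent_prompt(prompt: str, azimuth_deg: float, elevation_deg: float) -> str:
    # DreamFusion-style view suffix; azimuth assumed in [-180, 180]
    # with 0 degrees facing the front of the object.
    if elevation_deg > 60.0:
        view = "overhead view"
    elif abs(azimuth_deg) < 45.0:
        view = "front view"
    elif abs(azimuth_deg) > 135.0:
        view = "back view"
    else:
        view = "side view"
    return f"{prompt}, {view}"

# view_dependent_prompt("a standing corgi dog", azimuth_deg=90, elevation_deg=10)
# -> "a standing corgi dog, side view"
```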
2.4.3 Variation of the SDS loss implementation
Instead of computing the MSE loss directly in the latent space, I first decode the latents into an RGB image and then compute the MSE loss in pixel space.
```python
# Latent space: inject the SDS gradient through a detached target.
target = (latents - grad).detach()
loss = 0.5 * F.mse_loss(latents, target)

# Pixel space: decode both the shifted target and the current latents
# through the VAE, then compute the MSE on the decoded RGB images.
target = self.vae.decode((latents - grad) / self.vae.config.scaling_factor)['sample'].detach()
img = self.vae.decode(latents / self.vae.config.scaling_factor)['sample']
loss = 0.5 * F.mse_loss(img, target)
```
Moreover, I also experimented with swapping the `uncond` and `default` prompts to see if the model would generate results similar to those in 2.1.
| Prompt | Final output | Evolution | Final output (`uncond`/`default` swapped) | Evolution (`uncond`/`default` swapped) |
| --- | --- | --- | --- | --- |
| "a hamburger" (pixel space) | ![]() | ![]() | ![]() | ![]() |
| "a hamburger" (latent space) | ![]() | ![]() | ![]() | ![]() |
| "a standing corgi dog" (pixel space) | ![]() | ![]() | ![]() | ![]() |
| "a standing corgi dog" (latent space) | ![]() | ![]() | ![]() | ![]() |
| "the solar system" (pixel space) | ![]() | ![]() | ![]() | ![]() |
| "the solar system" (latent space) | ![]() | ![]() | ![]() | ![]() |
In most cases, pixel-space optimization leads to more realistic images than latent-space optimization, though they are a bit blurry, which is expected with an L2 loss. However, pixel-space optimization takes longer to train and requires more GPU memory (the latents are decoded twice per step).
Swapping the `uncond` and `default` prompts leads to similar results in both pixel- and latent-space optimization, but the pixel-space images are more realistic (with some blurriness).
I didn't run the NeRF optimization with the pixel-space loss due to memory issues and time constraints.