16-825: Learning for 3D Vision
Assignment 4 - 3D Gaussian Splatting and Diffusion Guided Optimization
Long Vân Tran Ha (ltranha)
Monday, April 1st, 2024
Please view this page on Wi-Fi, as it contains many GIFs.
Table of Contents
- 1.1 3D Gaussian Rasterization
- 1.2 Training 3D Gaussian Representations
- 1.3.1 Rendering Using Spherical Harmonics
- 1.3.2 Training On a Harder Scene
- 2.1 SDS Loss + Image Optimization
- 2.2 Texture Map Optimization for Mesh
- 2.3 NeRF Optimization
- 2.4.1 View-dependent text embedding
- 2.4.3 Variation of the SDS loss implementation
1.1 3D Gaussian Rasterization
```
python render.py --out_path ._output/q11
```
| Colors, depth and mask (view-independent) |
| --- |
| ![]() |
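For context, the renderer composites depth-sorted 2D Gaussians front-to-back, and the colour, depth, and mask outputs all come from the same compositing weights. Below is a minimal sketch of the colour pass, assuming the Gaussians are already projected to 2D and sorted by depth (the names are mine, not the assignment API):

```python
import torch

def composite_colours(means2d, cov2d_inv, opacities, colours, pixels):
    # pixels: (P, 2); means2d: (N, 2); cov2d_inv: (N, 2, 2) inverse 2D
    # covariances; opacities: (N,); colours: (N, 3).
    out = torch.zeros(pixels.shape[0], 3)
    transmittance = torch.ones(pixels.shape[0])
    for i in range(means2d.shape[0]):  # front-to-back over sorted Gaussians
        d = pixels - means2d[i]
        power = -0.5 * torch.einsum("pi,ij,pj->p", d, cov2d_inv[i], d)
        alpha = (opacities[i] * torch.exp(power)).clamp(max=0.99)
        out = out + transmittance[:, None] * alpha[:, None] * colours[i]
        transmittance = transmittance * (1.0 - alpha)
    return out  # for depth / mask, swap colours[i] for the depth / 1.0
```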
1.2 Training 3D Gaussian Representations
The learning rates used were taken from the official implementation of Gaussian Splatting, accessible [here](https://github.com/graphdeco-inria/gaussian-splatting).
| Hyperparameter | Learning Rate |
| --- | --- |
| Opacities | 0.05 |
| Scales | 0.005 |
| Colours | 0.0025 |
| Means | 0.00016 |
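These per-parameter learning rates can be set up with Adam parameter groups. A minimal sketch follows; the attribute names (e.g. `pre_act_opacities`) are my assumption, not necessarily the assignment's exact API.

```python
import torch

def make_optimizer(gaussians) -> torch.optim.Adam:
    # One parameter group per learnable attribute, each with its own
    # learning rate (values from the official 3DGS implementation).
    return torch.optim.Adam([
        {"params": [gaussians.pre_act_opacities], "lr": 0.05},
        {"params": [gaussians.pre_act_scales],    "lr": 0.005},
        {"params": [gaussians.colours],           "lr": 0.0025},
        {"params": [gaussians.means],             "lr": 0.00016},
    ])
```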
The model was trained for 1000 iterations.
| Final PSNR | Final SSIM |
| --- | --- |
| 28.553 | 0.959 |
```
python train.py --out_path ._output/q12
```
| Training progress | Final render |
| --- | --- |
| ![]() | ![]() |
1.3.1 Rendering Using Spherical Harmonics
```
python render.py --out_path ._output/q11   # without spherical harmonics
python render.py --out_path ._output/q13   # with spherical harmonics
```
| | Colors, depth and mask |
| --- | --- |
| View-independent | ![]() |
| View-dependent | ![]() |
| | View 0 | View 4 | View 19 |
| --- | --- | --- | --- |
| View-independent | ![]() | ![]() | ![]() |
| View-dependent | ![]() | ![]() | ![]() |
| Differences | Sweater (darker, more detailed) | Pompom of the beanie (darker, more detailed) | Wheels (reflections) |
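The view-dependent colours come from evaluating each Gaussian's spherical-harmonic coefficients in the viewing direction. A minimal degree-1 sketch (the full pipeline goes up to degree 3; the constants are those of the standard real SH basis):

```python
import torch

C0 = 0.28209479177387814  # SH constant for degree 0
C1 = 0.4886025119029199   # SH constant for degree 1

def sh_to_colour(sh: torch.Tensor, dirs: torch.Tensor) -> torch.Tensor:
    # sh: (N, 4, 3) per-Gaussian SH coefficients; dirs: (N, 3) unit
    # viewing directions from the camera to each Gaussian.
    x, y, z = dirs[:, 0:1], dirs[:, 1:2], dirs[:, 2:3]
    colour = (C0 * sh[:, 0]
              - C1 * y * sh[:, 1]
              + C1 * z * sh[:, 2]
              - C1 * x * sh[:, 3])
    # Shift by 0.5 so that all-zero coefficients give mid-grey, then clamp.
    return (colour + 0.5).clamp(0.0, 1.0)
```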
1.3.2 Training On a Harder Scene
First, the model is trained on the materials
dataset with the same hyperparameters as in 1.2. The training progress and final render after 1000 iterations are shown below.
| Final PSNR | Final SSIM |
| --- | --- |
| 17.838 | 0.658 |
```
python train_harder_scene.py --out_path ._output/materials --num_itrs 1001
```
| Learning rates | Training progress | Final render (all views) | Final render (one view) |
| --- | --- | --- | --- |
| Opacities: 0.05 <br> Scales: 0.005 <br> Colours: 0.0025 <br> Means: 0.00016 | ![]() | ![]() | ![]() |
The previous model didn't manage to capture all the objects in the scene (e.g. the spheres in the corners). I retrained the model for 1000 iterations with the following modifications:

- Used anisotropic Gaussians instead of isotropic ones, which adds a `pre_act_quats` (per-Gaussian rotation) parameter to the model.
- Increased the learning rates.
- Added a learning rate scheduler for the means (see the sketch below).
- Added an SSIM term to the loss (with a weight of 0.2): $$\mathcal{L} = 0.8\,\mathcal{L}_{\text{L1}} + 0.2\,\mathcal{L}_{\text{SSIM}}.$$
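A minimal sketch of the scheduler and the combined loss. It assumes an `optimizer` whose last parameter group holds the means and an `ssim` function from the assignment codebase; I also read the SSIM loss as 1 − SSIM, since SSIM itself is a similarity.

```python
import torch

def make_means_scheduler(optimizer, num_itrs=1000, lr_init=1e-3, lr_final=1e-5):
    # Exponential decay of the means' learning rate from lr_init to
    # lr_final over num_itrs iterations; all other parameter groups
    # keep a constant learning rate.
    gamma = (lr_final / lr_init) ** (1.0 / num_itrs)
    n_groups = len(optimizer.param_groups)
    lambdas = [lambda it: 1.0] * (n_groups - 1) + [lambda it: gamma ** it]
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambdas)

def combined_loss(pred, gt, ssim, w_ssim=0.2):
    # L = 0.8 * L1 + 0.2 * (1 - SSIM)
    l1 = torch.abs(pred - gt).mean()
    return (1.0 - w_ssim) * l1 + w_ssim * (1.0 - ssim(pred, gt))
```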
| Final PSNR | Final SSIM |
| --- | --- |
| 18.084 | 0.672 |
```
python train_harder_scene.py --out_path ._output/materials_better --num_itrs 1001
```
| Learning rates | Training progress | Final render (all views) | Final render (one view) |
| --- | --- | --- | --- |
| Opacities: 0.05 <br> Scales: 0.01 <br> Colours: 0.01 <br> Rotations: 0.005 <br> Means (init): 1e-3 <br> Means (final): 1e-5 | ![]() | ![]() | ![]() |
All of these modifications led to only a slight improvement in the final PSNR and SSIM and in the final render (three spheres are now missing instead of four). The training may simply need to be longer (up to 7000 iterations, or even 30000 as in the original paper).
2.1 SDS Loss + Image Optimization
The image was always optimized for 1000 iterations.
| Without guidance (same for any prompt) |
| --- |
| ![]() |
I had some fun with the prompts. Also, inspired by Anonymous Atom, I wanted to see what happened if I swapped the `uncond` and `default` prompt embeddings.

The results are quite interesting. Swapping `uncond` and `default` seems to lead to either:

- a more realistic image (e.g. the swapped "the solar system" gives a more realistic castle than the regular "a castle" prompt), or
- some orange soup with broccoli.
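Concretely, the swap only changes the direction of the classifier-free-guidance update inside the SDS step. A minimal sketch, assuming a diffusers-style UNet call and a DreamFusion-style `guidance_scale` (the exact assignment API may differ):

```python
def guided_noise(unet, latents_noisy, t, uncond_emb, text_emb,
                 guidance_scale=100.0, swap=False):
    # Classifier-free guidance inside the SDS step.
    noise_uncond = unet(latents_noisy, t, encoder_hidden_states=uncond_emb).sample
    noise_text = unet(latents_noisy, t, encoder_hidden_states=text_emb).sample
    if swap:
        # Swapped variant: start from the text-conditioned prediction
        # and push towards the unconditional one instead.
        return noise_text + guidance_scale * (noise_uncond - noise_text)
    # Regular CFG: start from the unconditional prediction and push
    # towards the text-conditioned one.
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)
```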
| Prompt | Final output | Evolution | Final output (`uncond`/`default` swapped) | Evolution (`uncond`/`default` swapped) |
| --- | --- | --- | --- | --- |
| "a hamburger" | ![]() | ![]() | ![]() | ![]() |
| "the solar system" | ![]() | ![]() | ![]() | ![]() |
| "a galaxy" | ![]() | ![]() | ![]() | ![]() |
| "the big bang" | ![]() | ![]() | ![]() | ![]() |
| "pikachu" | ![]() | ![]() | ![]() | ![]() |
| "a standing corgi dog" | ![]() | ![]() | ![]() | ![]() |
| "black" | ![]() | ![]() | ![]() | ![]() |
| "black horse" | ![]() | ![]() | ![]() | ![]() |
| "3d rendering" | ![]() | ![]() | ![]() | ![]() |
| "a castle" | ![]() | ![]() | ![]() | ![]() |
| "flag of France" | ![]() | ![]() | ![]() | ![]() |
| "tennis in France" | ![]() | ![]() | ![]() | ![]() |
| "a wood house" | ![]() | ![]() | ![]() | ![]() |
| "amazing house" | ![]() | ![]() | ![]() | ![]() |
| "beautiful flowers" | ![]() | ![]() | ![]() | ![]() |
In the swapped case, the model sometimes attempted to generate something (visible after 300 iterations) before collapsing into the orange soup.
| Prompt | Final output | Output after 300 iterations (`uncond`/`default` swapped) | Final output (`uncond`/`default` swapped) |
| --- | --- | --- | --- |
| "a standing corgi dog" | ![]() | ![]() | ![]() |
| "black" | ![]() | ![]() | ![]() |
| "3d rendering" | ![]() | ![]() | ![]() |
| "a castle" | ![]() | ![]() | ![]() |
| "tennis in France" | ![]() | ![]() | ![]() |
2.2 Texture Map Optimization for Mesh
The texture map was always optimized for 1000 iterations.
| Initial mesh (same for any prompt) |
| --- |
| ![]() |
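The loop renders the mesh with a learnable texture from random viewpoints and backpropagates the SDS loss from 2.1 into the texture map. A hedged sketch, where `renderer`, `sample_random_camera`, `sds_loss`, `mesh`, `faces_uvs`, `verts_uvs`, and `text_emb` are placeholders rather than the assignment's actual names:

```python
import torch
from pytorch3d.renderer import TexturesUV

texture = torch.nn.Parameter(torch.rand(1, 512, 512, 3))  # learnable UV map
optimizer = torch.optim.Adam([texture], lr=1e-2)

for _ in range(1000):
    mesh.textures = TexturesUV(texture, faces_uvs, verts_uvs)
    image = renderer(mesh, cameras=sample_random_camera())  # (1, H, W, 3)
    loss = sds_loss(image.permute(0, 3, 1, 2), text_emb)    # SDS from 2.1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```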
| Prompt | Final output | Evolution |
| --- | --- | --- |
| "a hamburger" | ![]() | ![]() |
| "a blue and red cow" | ![]() | ![]() |
| "zebra cow" | ![]() | ![]() |
| "a cow mesh with black and white colors" | ![]() | ![]() |
2.3 NeRF Optimization
The hyperparameter values used were taken from the official implementation of DreamFusion, accessible here.
| Hyperparameter | Value |
| --- | --- |
| `lambda_entropy` | 0.001 |
| `lambda_orient` | 0.01 |
| `latent_iter_ratio` | 0.2 |
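For context, here is a hedged sketch of how these hyperparameters typically enter a stable-dreamfusion-style training loop; the function and tensor names are my assumptions, not the assignment's exact API.

```python
import torch

def entropy_loss(alphas: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Binary entropy of the per-ray opacities: pushes each ray towards
    # fully opaque or fully transparent, which reduces floaters.
    a = alphas.clamp(eps, 1.0 - eps)
    return (-a * torch.log(a) - (1.0 - a) * torch.log(1.0 - a)).mean()

def orient_loss(normals: torch.Tensor, view_dirs: torch.Tensor,
                weights: torch.Tensor) -> torch.Tensor:
    # Orientation penalty (as in Ref-NeRF): penalizes visible normals
    # that point away from the camera, weighted by rendering weights.
    return (weights * (normals * view_dirs).sum(-1).clamp(min=0.0) ** 2).mean()

def total_loss(loss_sds, alphas, normals, view_dirs, weights,
               lambda_entropy=0.001, lambda_orient=0.01):
    # `latent_iter_ratio` is not a loss weight: for the first 20% of
    # iterations the rendered output is treated directly as latents,
    # skipping the VAE encoder, which stabilizes early training.
    return (loss_sds
            + lambda_entropy * entropy_loss(alphas)
            + lambda_orient * orient_loss(normals, view_dirs, weights))
```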
| Initial RGB (same for any prompt) | Initial depth (same for any prompt) |
| --- | --- |
| ![]() | ![]() |
| Prompt | Final RGB | Final Depth | Evolution RGB | Evolution Depth |
| --- | --- | --- | --- | --- |
| "a standing corgi dog" | ![]() | ![]() | ![]() | ![]() |
| "a car" | ![]() | ![]() | ![]() | ![]() |
| "a baby bunny sitting on top of a stack of pancakes" | ![]() | ![]() | ![]() | ![]() |
2.4.1 View-dependent text embedding
Due to limited time, I only ran the view-dependent text embedding for the "a standing corgi dog" prompt.
| Prompt | Final RGB | Final Depth | Evolution RGB | Evolution Depth |
| --- | --- | --- | --- | --- |
| "a standing corgi dog" | ![]() | ![]() | ![]() | ![]() |
For reference, this is what the official implementation of DreamFusion produces for the "a standing corgi dog" prompt:
| DreamFusion | This implementation (view-dependent text embedding) | This implementation (view-independent text embedding) |
| --- | --- | --- |
| ![]() | ![]() | ![]() |
When training without view-dependent text embeddings for the "a standing corgi dog" prompt, the model generates a dog with five legs, three ears, and two faces (the classic "Janus problem"). This is because the model is not aware of the viewing angle, so it tries to generate a dog that faces the camera from every angle.
When training with view-dependent text embeddings, the model generates a more consistent dog (four legs, two ears, one face).
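View-dependent embedding can be implemented by appending a view word to the prompt according to the sampled camera pose before encoding it. A minimal sketch; the thresholds and the azimuth convention are my assumptions, not the exact assignment values.

```python
def view_dependent_prompt(prompt: str, azimuth_deg: float, elevation_deg: float) -> str:
    # DreamFusion-style view suffix; azimuth assumed in [-180, 180]
    # with 0 degrees facing the front of the object.
    if elevation_deg > 60.0:
        view = "overhead view"
    elif abs(azimuth_deg) < 45.0:
        view = "front view"
    elif abs(azimuth_deg) > 135.0:
        view = "back view"
    else:
        view = "side view"
    return f"{prompt}, {view}"

# view_dependent_prompt("a standing corgi dog", azimuth_deg=90, elevation_deg=10)
# -> "a standing corgi dog, side view"
```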
2.4.3 Variation of the SDS loss implementation
Instead of computing the MSE loss directly in the latent space, I first decode the latents into an RGB image and then compute the MSE loss in pixel space.
```python
# Latent space: inject the SDS gradient through a detached target.
target = (latents - grad).detach()
loss = 0.5 * F.mse_loss(latents, target)

# Pixel space: decode both the shifted target and the current latents
# through the VAE, then compute the MSE on the decoded RGB images.
target = self.vae.decode((latents - grad) / self.vae.config.scaling_factor)['sample'].detach()
img = self.vae.decode(latents / self.vae.config.scaling_factor)['sample']
loss = 0.5 * F.mse_loss(img, target)
```
Moreover, I also experimented with swapping the `uncond` and `default` prompts to see if the model would generate results similar to those in 2.1.
| Prompt | Final output | Evolution | Final output (`uncond`/`default` swapped) | Evolution (`uncond`/`default` swapped) |
| --- | --- | --- | --- | --- |
| "a hamburger" (pixel space) | ![]() | ![]() | ![]() | ![]() |
| "a hamburger" (latent space) | ![]() | ![]() | ![]() | ![]() |
| "a standing corgi dog" (pixel space) | ![]() | ![]() | ![]() | ![]() |
| "a standing corgi dog" (latent space) | ![]() | ![]() | ![]() | ![]() |
| "the solar system" (pixel space) | ![]() | ![]() | ![]() | ![]() |
| "the solar system" (latent space) | ![]() | ![]() | ![]() | ![]() |
In most cases, pixel-space optimization leads to more realistic images than latent-space optimization, though they are a bit blurry, which is expected with an L2 loss. However, pixel-space optimization takes longer to train and requires more GPU memory (the latents are decoded twice per step).
Swapping the `uncond` and `default` prompts leads to similar results in both pixel- and latent-space optimization, but the pixel-space images are more realistic (with some blurriness).
I didn't run the NeRF optimization with the pixel-space loss due to memory issues and time constraints.