This project is part of the Algonauts 2023 competition, which evaluates computational models that predict human brain activity as people view natural images. Understanding how the human brain works is one of the great challenges facing both science and society. With every glance we receive a flood of photons, yet we perceive the visual world as organized and meaningful. The central goal of this project is to predict human brain responses to complex natural visual scenes, using the largest brain dataset available for this purpose.
Region-of-Interest (ROI):
The visual cortex is divided into multiple areas with different functional properties, referred to as regions-of-interest (ROIs).
My initial approach took inspiration from a related study (Link). In that study, the authors reconstructed visual images from subjects' fMRI responses: they applied a pre-trained diffusion model to specific ROIs of the brain, as described above, and conditioned it on the remaining ROIs to infer what the subject was seeing. The following diagrams illustrate their methodology and findings:
I attempted to fine-tune the diffusion model to predict fMRI responses from a given image and subject ID. Initial results were promising: the correlation between the predicted and actual fMRI data increased with each epoch. However, computational limitations prevented me from fully exploring this approach. The final correlation score achieved with the diffusion model was 0.24.
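To make the idea concrete, here is a heavily simplified sketch of what conditional fine-tuning of this kind can look like. This is not the actual Dreambooth-Stable-Diffusion pipeline; it is a minimal DDPM-style denoiser over fMRI vectors, conditioned on an image embedding and a subject ID, and every name and dimension in it (`CondDenoiser`, `FMRI_DIM`, and so on) is a hypothetical placeholder.

```python
import torch
import torch.nn as nn

# Hypothetical sizes -- real Algonauts fMRI vectors and image
# embeddings differ per subject and per backbone.
FMRI_DIM, IMG_EMB_DIM, N_SUBJECTS, T = 1000, 512, 8, 1000

class CondDenoiser(nn.Module):
    """Predicts the noise added to an fMRI vector, conditioned on an
    image embedding, a subject ID, and the diffusion timestep."""
    def __init__(self):
        super().__init__()
        self.subj_emb = nn.Embedding(N_SUBJECTS, 64)
        self.t_emb = nn.Embedding(T, 64)
        self.net = nn.Sequential(
            nn.Linear(FMRI_DIM + IMG_EMB_DIM + 64 + 64, 2048),
            nn.SiLU(),
            nn.Linear(2048, FMRI_DIM),
        )

    def forward(self, noisy_fmri, img_emb, subj_id, t):
        h = torch.cat([noisy_fmri, img_emb,
                       self.subj_emb(subj_id), self.t_emb(t)], dim=-1)
        return self.net(h)

# Standard DDPM bookkeeping: linear noise schedule.
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, fmri, img_emb, subj_id):
    """One denoising-objective step: noise the target fMRI vector,
    ask the model to recover the noise, score with MSE."""
    t = torch.randint(0, T, (fmri.size(0),))
    noise = torch.randn_like(fmri)
    ab = alpha_bar[t].unsqueeze(-1)
    noisy = ab.sqrt() * fmri + (1 - ab).sqrt() * noise
    pred = model(noisy, img_emb, subj_id, t)
    return nn.functional.mse_loss(pred, noise)
```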
One benefit of GANs over diffusion or autoregressive models is their quicker training and inference times. As a result, I was able to explore several different approaches.
To keep things brief, I will only discuss the most promising approaches to fMRI prediction through GAN training:
| Model | Loss | Epochs | Pre-trained | Config | More training | Time per epoch | Final Correlation Score (Max: 1) |
|---|---|---|---|---|---|---|---|
| Dreambooth-Stable-Diffusion | Mean Squared Error | 2 | Yes | Standard | Can improve further | 1 day | 0.24 |
| Vanilla GANs | L1 + GAN loss | 25 | No | Spectral Norm | Cannot be improved | 5 min | 0.15 |
| Vision Transformer GANs | L1 + GAN loss | 50 | Yes | Single ViT discriminator | Can improve further | 2 hrs | 0.54 |
| Vanilla GANs | Correlation + L1 + GAN loss | 6 | No | Two discriminators, Spectral Norm | Cannot be improved further | 8 min | 0.30 |
| U-Net | Correlation + L1 + GAN loss | 1 | No | Two discriminators, Spectral Norm | Need to stabilize correlation loss | 20 min | 0.45 |
| Vanilla GANs | Correlation + L1 + GAN loss | 25 | No | Custom Patch Multi-discriminators, Spectral Norm | Running... | 23 min | __ |
| Vision Transformer GANs | Correlation + L1 + GAN loss | 25 | Yes | Custom Patch Multi-discriminators | Running... | 3 hrs | __ |
| SOTA | Unknown | Unknown | Unknown | Unknown | Unknown | Unknown | 0.61 |
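Several rows in the table use a correlation term in the objective, and the evaluation metric itself is a correlation score. Below is a sketch of how such a loss can be written in PyTorch; the function names, loss weights, and `eps` guard are illustrative assumptions, not the exact implementation used in these runs.

```python
import torch

def correlation_loss(pred, target, eps=1e-8):
    """Negative Pearson correlation, computed per sample across voxels
    and averaged over the batch. The eps guard on the denominator is
    one simple way to keep the loss from blowing up when a prediction
    is nearly constant -- the kind of instability noted for the U-Net
    run above."""
    pred = pred - pred.mean(dim=1, keepdim=True)
    target = target - target.mean(dim=1, keepdim=True)
    cov = (pred * target).sum(dim=1)
    denom = pred.norm(dim=1) * target.norm(dim=1) + eps
    return -(cov / denom).mean()

def generator_loss(pred, target, disc_fake_logits,
                   w_corr=1.0, w_l1=1.0, w_gan=0.1):
    """A hypothetical combined objective in the spirit of the
    'Correlation + L1 + GAN loss' rows; the weights are illustrative."""
    gan = torch.nn.functional.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    l1 = torch.nn.functional.l1_loss(pred, target)
    return w_corr * correlation_loss(pred, target) + w_l1 * l1 + w_gan * gan
```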
The results in the table above suggest that incorporating multiple discriminators, adding a correlation loss, and using a vision transformer can each lead to a significant improvement. I am currently running an experiment that combines all three enhancements into a single architecture. Additionally, I am experimenting with customized Patch Multi-discriminators.
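Since "Patch Multi-discriminators" is my own configuration, here is a rough sketch of the idea under stated assumptions: split the fMRI voxel vector into contiguous patches and give each patch its own spectrally normalized critic, so that no single discriminator has to judge the whole vector. The patch count, widths, and class names are illustrative placeholders, not the exact configuration in the running experiments.

```python
import torch
import torch.nn as nn

def sn_mlp(in_dim):
    """Small spectrally normalized critic for one fMRI patch."""
    return nn.Sequential(
        nn.utils.spectral_norm(nn.Linear(in_dim, 256)),
        nn.LeakyReLU(0.2),
        nn.utils.spectral_norm(nn.Linear(256, 1)),
    )

class PatchMultiDiscriminator(nn.Module):
    """Splits a (batch, fmri_dim) voxel vector into contiguous patches
    and scores each patch with its own critic; the GAN loss can then
    average the per-patch real/fake logits."""
    def __init__(self, fmri_dim=1000, n_patches=4):
        super().__init__()
        self.n_patches = n_patches
        patch = fmri_dim // n_patches
        self.critics = nn.ModuleList(
            sn_mlp(patch) for _ in range(n_patches))

    def forward(self, fmri):
        patches = fmri.chunk(self.n_patches, dim=1)
        # One logit per patch, concatenated along the last dim.
        return torch.cat([c(p) for c, p in zip(self.critics, patches)],
                         dim=1)
```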
Currently, the state of the art (SOTA) for the Algonauts competition is a correlation score of 0.61, and my ViT-based approach achieved 0.54. With the integration of all three techniques and a few more epochs of training, the results may surpass the SOTA. The outcome should be available in a week.