Brain-Conditional Multimodal Synthesis: A Survey and Taxonomy
Weijian Mai, Jian Zhang, Pengfei Fang, Zhijun Zhang
Abstract
In the era of Artificial Intelligence Generated Content (AIGC), conditional multimodal synthesis technologies (e.g., text-to-image, text-to-video, text-to-audio, etc.) are gradually reshaping the natural content in the real world. The key to multimodal synthesis technology is to establish the mapping relationship between different modalities. Brain signals, serving as potential reflections of how the brain interprets external information, exhibit a distinctive One-to-Many correspondence with various external modalities. This correspondence makes brain signals emerge as a promising guiding condition for multimodal content synthesis. Brain-conditional multimodal synthesis refers to decoding brain signals back to percep
Core problem
The core problem is establishing a mapping between brain signals (e.g., EEG, fMRI, MEG) and external modalities (e.g., images, videos, audio) for brain-conditional multimodal content synthesis. The One-to-Many correspondence between brain signals and external stimuli complicates this mapping, making it hard to decode brain signals into accurate reconstructions of perceptual experiences across modalities.
Key findings and contributions
- MindEye model performance on the NSD dataset showed PixCorr of 0.309 and SSIM of 0.323.
- UniBrain model performance on the NSD dataset showed PixCorr of 0.249 and SSIM of 0.330.
- SSNIR_SC model performance on the GOD dataset showed PixCorr of 0.459 and SSIM of 0.607.
- BrainICG model performance on the GOD dataset showed PixCorr of 0.223 and SSIM of 0.453.
- Mind_Video model performance on the DNV dataset showed MSE of 0.171 and ACC of 0.202.
- The BSR model reconstructed approximate spectral patterns and perceptual qualities akin to the actual sound stimuli.
- The taxonomy categorizes AIGC-Brain decoding models into six types: Map, BPM, BPFA, MTF, E2E, and CAEA.
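The low-level metrics quoted above can be illustrated with a minimal sketch. It assumes PixCorr is the Pearson correlation between the flattened pixel values of the reconstruction and the ground-truth image, and that MSE is the mean squared pixel difference. The `global_ssim` function is a deliberate simplification: it applies the standard SSIM formula once over the whole image, whereas published SSIM scores are normally averaged over local windows, so its values are not comparable to the numbers listed above. All function names here are hypothetical and are not taken from the surveyed models. Higher is better for PixCorr and SSIM; lower is better for MSE.

```python
import math


def flatten(image):
    """Flatten a 2-D list of pixel intensities into a 1-D list of floats."""
    return [float(p) for row in image for p in row]


def pixcorr(recon, target):
    """Pearson correlation between pixel values (assumed PixCorr definition).

    Raises ZeroDivisionError if either image is constant (zero variance).
    """
    x, y = flatten(recon), flatten(target)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def mse(recon, target):
    """Mean squared error between pixel values."""
    x, y = flatten(recon), flatten(target)
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def global_ssim(recon, target, data_range=1.0):
    """Single-window SSIM over the whole image (simplified, not windowed)."""
    x, y = flatten(recon), flatten(target)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    # Stabilizing constants from the standard SSIM formulation.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


# Toy 2x2 grayscale images with intensities in [0, 1] (made-up data).
target = [[0.0, 0.5], [0.5, 1.0]]
inverted = [[1.0, 0.5], [0.5, 0.0]]

print(pixcorr(target, target), mse(target, target), global_ssim(target, target))
print(pixcorr(inverted, target), mse(inverted, target))
```

A perfect reconstruction scores a PixCorr and SSIM of 1 and an MSE of 0, while the intensity-inverted toy image gives a PixCorr of -1. These metrics compare pixels only, so they say little about whether the semantic content of a reconstruction is correct.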
Limitations
- Non-invasive neuroimaging data (e.g., EEG, fMRI, MEG) have low signal-to-noise ratios and high variability.
- Despite advancements, interpretability remains a challenge, especially in understanding neural processes during decoding.
- Achieving real-time processing is critical for BCI applications, but current models face challenges in responsiveness and feedback.
Key quotes
MindEye model performance on the NSD dataset showed PixCorr of 0.309 and SSIM of 0.323, indicating high fidelity in image reconstruction from brain signals. The model leverages a mapping network and pretrained generative models to achieve this level of detail and semantic accuracy.
Type: Findings
Non-invasive neuroimaging data (e.g., EEG, fMRI, MEG) have low signal-to-noise ratios and high variability, making it difficult to learn and generalize effective mappings. This variability poses significant challenges for deep neural networks, which require large datasets to perform well.
Type: Limitations