Abstract

Recent advances in fMRI-based visual decoding have enabled compelling reconstructions of perceived images. However, most approaches rely on subject-specific training, limiting scalability and practical deployment. VoxelFormer is a lightweight transformer architecture that enables multi-subject training for visual decoding from fMRI. VoxelFormer integrates a Token Merging Transformer (ToMer) for efficient voxel compression and a query-driven Q-Former that produces fixed-size neural representations aligned with the CLIP image embedding space.
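To make the architecture concrete, below is a minimal sketch of the query-driven pooling idea the abstract describes: a fixed set of learned queries cross-attends to a variable-length sequence of (merged) voxel tokens and is projected into the CLIP image-embedding space. All module names, dimensions, and hyperparameters here are illustrative assumptions, not the released VoxelFormer code.

```python
import torch
import torch.nn as nn

class QueryPooler(nn.Module):
    """Q-Former-style pooler (illustrative sketch, not the paper's implementation):
    learned queries attend over voxel tokens and are mapped to CLIP dimension."""
    def __init__(self, token_dim=256, num_queries=77, clip_dim=768, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, token_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)
        self.to_clip = nn.Linear(token_dim, clip_dim)

    def forward(self, voxel_tokens):
        # voxel_tokens: (batch, n_tokens, token_dim); n_tokens may vary,
        # e.g. after ToMer-style token merging compresses the voxel sequence.
        b = voxel_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.cross_attn(q, voxel_tokens, voxel_tokens)
        pooled = self.norm(pooled + q)
        return self.to_clip(pooled)  # fixed-size output: (batch, num_queries, clip_dim)

# Usage sketch with hypothetical shapes: a batch of 2 scans, 1500 merged voxel tokens each.
pooler = QueryPooler()
x = torch.randn(2, 1500, 256)
clip_aligned = pooler(x)  # (2, 77, 768), regardless of the input token count
```

The key property, regardless of exact dimensions, is that the output size is fixed by the number of queries, so subjects with different voxel counts can share the same decoding head.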

My role in this project

I proposed the problem formulation and the architecture, and guided the students through the project.


Citation

@ARTICLE{Le2025-vd,
  title         = "{VoxelFormer}: Parameter-efficient multi-subject visual
                   decoding from {fMRI}",
  author        = "Le, Chenqian and Zhao, Yilin and Emami, Nikasadat and Yadav,
                   Kushagra and Liu, Xujin ``Chris'' and Chen, Xupeng and Wang,
                   Yao",
  journal       = "arXiv [cs.CV]",
  month         =  sep,
  year          =  2025,
  archivePrefix = "arXiv",
  primaryClass  = "cs.CV",
  eprint        = "2509.09015"
}