CellVTA is a novel method that enhances the performance of vision foundation models for cell instance segmentation.
It incorporates a CNN-based adapter module to extract high-resolution spatial information from input images and injects it into the Vision Transformer (ViT) through a cross-attention mechanism.
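
The injection step can be illustrated with a minimal NumPy sketch, in which ViT tokens act as queries and adapter features act as keys and values. This is an illustration under stated assumptions, not the CellVTA implementation: the function name `inject`, the single-head dense attention, the projection matrices, and the residual scale `gamma` are all hypothetical choices made here for clarity, and the released code may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inject(vit_tokens, adapter_feats, w_q, w_k, w_v, gamma):
    """Single-head cross-attention (hypothetical sketch).

    vit_tokens:    (N, d) token embeddings from the ViT (queries)
    adapter_feats: (M, d) flattened CNN adapter features (keys/values)
    gamma:         assumed residual scale controlling how much is injected
    """
    q = vit_tokens @ w_q                              # (N, d)
    k = adapter_feats @ w_k                           # (M, d)
    v = adapter_feats @ w_v                           # (M, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (N, M), rows sum to 1
    # Residual update: ViT tokens plus attended spatial features.
    return vit_tokens + gamma * (attn @ v)

rng = np.random.default_rng(0)
d = 8
vit_tokens = rng.normal(size=(16, d))       # e.g. a 4x4 grid of patch tokens
adapter_feats = rng.normal(size=(64, d))    # e.g. an 8x8 higher-resolution map
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out = inject(vit_tokens, adapter_feats, w_q, w_k, w_v, gamma=0.1)
unchanged = inject(vit_tokens, adapter_feats, w_q, w_k, w_v, gamma=0.0)
```

In this sketch the output keeps the ViT token shape, so the result can be passed to the next transformer block, and setting `gamma` to zero leaves the tokens unchanged.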
CellVTA achieves 0.538 mPQ on the CoNIC dataset and 0.506 mPQ on the PanNuke dataset, surpassing state-of-the-art cell segmentation methods.
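
For readers unfamiliar with the metric, the sketch below shows how a panoptic quality score is typically computed, assuming mPQ denotes panoptic quality averaged over cell classes. The helper names and the toy numbers are hypothetical, and the exact matching and averaging protocol differs between benchmarks, so this does not reproduce the CoNIC or PanNuke evaluation.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = sum of IoUs over true positives / (TP + 0.5*FP + 0.5*FN).

    matched_ious: IoU of each predicted/ground-truth pair counted as a
                  true positive (by convention, pairs with IoU > 0.5)
    num_fp:       predicted instances with no match
    num_fn:       ground-truth instances with no match
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0  # no instances at all; a convention chosen for this sketch
    return sum(matched_ious) / denom

def mean_pq(per_class_stats):
    # Unweighted mean of per-class PQ (assumed averaging scheme).
    scores = [panoptic_quality(*stats) for stats in per_class_stats.values()]
    return sum(scores) / len(scores)

# Toy statistics (made up for illustration): (matched IoUs, FP count, FN count)
example = {
    "class_a": ([0.8, 0.9], 1, 1),
    "class_b": ([0.75], 0, 1),
}
pq_a = panoptic_quality(*example["class_a"])
pq_b = panoptic_quality(*example["class_b"])
mpq = mean_pq(example)
```

The score rewards both detection (through the true-positive, false-positive, and false-negative counts) and segmentation quality (through the IoU of matched instances).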
The code and models for CellVTA are publicly available on GitHub at https://github.com/JieZheng-ShanghaiTech/CellVTA.