Retrieval augmentation, the practice of retrieving additional data from large auxiliary pools, has emerged as an effective technique for enhancing model performance in the low-data regime.
Prior approaches have relied solely on nearest-neighbor-based strategies for data selection, which retrieve auxiliary samples with high similarity to instances in the target task.
COBRA (COmBinatorial Retrieval Augmentation) is a new approach that employs an alternative Combinatorial Mutual Information (CMI) measure, one that considers both diversity and similarity to a target dataset when selecting data for retrieval augmentation.
COBRA consistently outperforms previous retrieval approaches, providing substantial gains in downstream model performance without incurring significant computational overhead.
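To make the contrast between nearest-neighbor retrieval and diversity-aware retrieval concrete, the following is a minimal sketch and not the COBRA objective itself. It assumes precomputed embedding vectors and cosine similarity, and it swaps in a simple facility-location-style coverage objective, maximized greedily, as a stand-in for a selection measure that rewards similarity to the target while discouraging redundancy. The function names (`knn_retrieve`, `greedy_coverage_retrieve`) and the toy data are hypothetical.

```python
import numpy as np


def cosine_sim(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def knn_retrieve(aux, target, k):
    """Nearest-neighbor baseline: keep the k auxiliary items whose best
    similarity to any target item is highest. Redundancy is ignored."""
    scores = cosine_sim(aux, target).max(axis=1)
    return np.argsort(-scores)[:k].tolist()


def greedy_coverage_retrieve(aux, target, k):
    """Illustrative stand-in (an assumption, not the paper's measure):
    greedily maximize f(A) = sum_t max_{a in A} sim(a, t).
    A near-duplicate of an already selected item adds almost nothing,
    so the selection spreads across the target set."""
    sim = np.clip(cosine_sim(aux, target), 0.0, None)  # keep gains non-negative
    cover = np.zeros(sim.shape[1])  # best similarity achieved so far per target item
    selected = []
    for _ in range(k):
        # Marginal gain of adding each auxiliary item to the current selection.
        gains = np.maximum(sim, cover).sum(axis=1) - cover.sum()
        if selected:
            gains[selected] = -np.inf  # never pick the same item twice
        best = int(np.argmax(gains))
        selected.append(best)
        cover = np.maximum(cover, sim[best])
    return selected


# Toy data: two target directions; three near-duplicate auxiliary items
# close to the first target and a single item close to the second.
target = np.array([[1.0, 0.0], [0.0, 1.0]])
aux = np.array([[1.0, 0.01], [1.0, 0.02], [1.0, 0.03], [0.2, 1.0]])

print(knn_retrieve(aux, target, k=2))              # [0, 1]
print(greedy_coverage_retrieve(aux, target, k=2))  # [3, 0]
```

On this toy input the nearest-neighbor baseline returns two near-duplicates that serve only the first target item, while the coverage-based selection returns one auxiliary item for each target item. This is the kind of behavior a measure that considers both diversity and similarity is meant to encourage.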