Recent advances in language modeling and vision stem from training large models on diverse, multi-task data.
Value-based reinforcement learning (RL), by contrast, has typically relied on small models trained in single-task settings, because challenges such as sparse rewards and gradient conflicts make multi-task training difficult.
This work introduces high-capacity value models that are trained with a cross-entropy loss and conditioned on learnable task embeddings, and shows that they improve multi-task training in online RL.
The approach achieves state-of-the-art single-task and multi-task performance across various benchmarks and enables sample-efficient transfer to new tasks.
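
To make the idea of a value model "trained via cross-entropy and conditioned on learnable task embeddings" concrete, the following is a minimal sketch rather than the method used in this work. It assumes that scalar value targets are discretized into evenly spaced bins with a two-hot encoding, which is one common way to turn value regression into classification, and that task conditioning is done by concatenating a per-task embedding vector to the state features. All function names, the bin range, and the embedding layout are hypothetical choices made for illustration.

```python
import math


def bin_centers(v_min, v_max, num_bins):
    """Evenly spaced value-bin centers (the support is an assumption)."""
    step = (v_max - v_min) / (num_bins - 1)
    return [v_min + i * step for i in range(num_bins)]


def two_hot(target, v_min, v_max, num_bins):
    """Encode a scalar target as probabilities on its two nearest bins."""
    target = min(max(target, v_min), v_max)  # clip to the supported range
    step = (v_max - v_min) / (num_bins - 1)
    pos = (target - v_min) / step            # fractional bin index
    lower = min(int(math.floor(pos)), num_bins - 2)
    upper_weight = pos - lower
    probs = [0.0] * num_bins
    probs[lower] = 1.0 - upper_weight
    probs[lower + 1] = upper_weight
    return probs


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits, target_probs):
    """Cross-entropy between a soft target and the predicted distribution."""
    return -sum(p * lp for p, lp in zip(target_probs, log_softmax(logits)))


def expected_value(logits, v_min, v_max):
    """Recover a scalar value estimate as the mean of the predicted distribution."""
    centers = bin_centers(v_min, v_max, len(logits))
    return sum(math.exp(lp) * c for lp, c in zip(log_softmax(logits), centers))


def condition_on_task(features, task_embeddings, task_id):
    """Concatenate state features with the embedding row of the given task.

    In a real model the embedding table would be a learnable parameter
    updated by the same loss; here it is a plain list of lists.
    """
    return list(features) + list(task_embeddings[task_id])
```

In this sketch, a network would map the output of `condition_on_task` to one logit per bin, `cross_entropy` would replace the usual squared temporal-difference error as the training loss, and `expected_value` would convert the predicted distribution back to a scalar when a value estimate is needed.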