Researchers studied the corruption robustness of in-context reinforcement learning, focusing on the Decision-Pretrained Transformer (DPT).
They introduced the Adversarially Trained Decision-Pretrained Transformer (AT-DPT), a framework designed to defend against reward-poisoning attacks on the DPT.
The AT-DPT framework pits two learners against each other: an attacker is trained to poison environment rewards so as to minimize the true reward earned by the DPT, while the DPT is simultaneously trained to infer optimal actions from the poisoned data, yielding an adversarial (minimax-style) training loop.
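To make the alternating structure concrete, here is a minimal sketch of such an adversarial training loop in a simple K-armed Gaussian bandit. This is an illustration under stated assumptions, not the paper's implementation: the model sizes, the bounded additive attacker, the poisoning budget EPS, and names such as DPTModel, PoisonAttacker, and sample_tasks are all hypothetical.

```python
# Hypothetical sketch of an AT-DPT-style adversarial training loop.
import torch
import torch.nn as nn

K, CTX, EPS = 5, 20, 0.5  # arms, in-context rounds, poisoning budget (assumed)

class DPTModel(nn.Module):
    """Tiny transformer mapping a context of (arm one-hot, reward) pairs
    to logits over arms, mimicking DPT-style in-context action inference."""
    def __init__(self, d=32):
        super().__init__()
        self.embed = nn.Linear(K + 1, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, K)

    def forward(self, ctx):                  # ctx: (B, CTX, K+1)
        h = self.encoder(self.embed(ctx))
        return self.head(h[:, -1])            # logits over arms

class PoisonAttacker(nn.Module):
    """Produces a bounded additive perturbation for each context reward."""
    def __init__(self, d=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(K + 1, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, ctx):
        return EPS * torch.tanh(self.net(ctx)).squeeze(-1)   # (B, CTX)

def sample_tasks(batch):
    """Random bandit tasks: arm means, a context of pulls, and the optimal arm."""
    means = torch.rand(batch, K)
    arms = torch.randint(K, (batch, CTX))
    rewards = means.gather(1, arms) + 0.1 * torch.randn(batch, CTX)
    ctx = torch.cat([nn.functional.one_hot(arms, K).float(),
                     rewards.unsqueeze(-1)], dim=-1)
    return ctx, means, means.argmax(dim=1)

dpt, attacker = DPTModel(), PoisonAttacker()
opt_dpt = torch.optim.Adam(dpt.parameters(), lr=1e-3)
opt_atk = torch.optim.Adam(attacker.parameters(), lr=1e-3)

for step in range(1000):
    ctx, means, opt_arm = sample_tasks(64)

    # Attacker step: poison context rewards so that the DPT's action
    # distribution earns low *true* expected reward.
    delta = attacker(ctx)
    poisoned = torch.cat([ctx[..., :K], (ctx[..., -1] + delta).unsqueeze(-1)], -1)
    probs = torch.softmax(dpt(poisoned), dim=-1)
    atk_loss = (probs * means).sum(dim=1).mean()   # expected true reward
    opt_atk.zero_grad(); atk_loss.backward(); opt_atk.step()

    # DPT step: infer the optimal arm from the (detached) poisoned context.
    delta = attacker(ctx).detach()
    poisoned = torch.cat([ctx[..., :K], (ctx[..., -1] + delta).unsqueeze(-1)], -1)
    dpt_loss = nn.functional.cross_entropy(dpt(poisoned), opt_arm)
    opt_dpt.zero_grad(); dpt_loss.backward(); opt_dpt.step()
```

The alternation mirrors the minimax objective: the attacker's loss is the DPT's expected true reward (to be driven down), while the DPT's supervised loss pushes it to recover the optimal action despite the corruption; detaching the perturbation in the second step keeps each update targeted at one player.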
In bandit settings, AT-DPT outperformed standard bandit algorithms and robustness-oriented baselines, even against adaptive attackers, and its robustness carried over to more complex environments beyond bandit scenarios.