Recent research explores combining non-uniform exploration with supervised learning in decision-making systems, with the aim of improving immediate performance while preserving the ability to learn off-policy.
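As a rough illustration of how non-uniform exploration can coexist with off-policy learning, the sketch below samples actions in proportion to a softmax over estimated rewards and logs the sampling probability (propensity) of each decision. An inverse propensity scoring (IPS) estimate can then evaluate a different policy from the same logs. The softmax rule, the temperature parameter, the context-free setting, and all function names are assumptions made for this example; the summary does not say which exploration scheme or estimator the study used.

```python
import math
import random


def softmax_policy(estimates, temperature=1.0):
    """Map per-action reward estimates to a non-uniform sampling distribution.

    `estimates` stands in for the output of a regression oracle (hypothetical).
    """
    top = max(estimates)  # subtract the max for numerical stability
    weights = [math.exp((e - top) / temperature) for e in estimates]
    total = sum(weights)
    return [w / total for w in weights]


def log_decision(estimates, rng, temperature=1.0):
    """Sample one action and return it with its propensity for later reuse."""
    probs = softmax_policy(estimates, temperature)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, probs[action]


def ips_value(logs, target_policy):
    """IPS estimate of a target policy's average reward.

    logs: list of (action, propensity, reward) tuples.
    target_policy: list with the target probability of each action.
    """
    return sum(target_policy[a] * r / p for a, p, r in logs) / len(logs)


if __name__ == "__main__":
    rng = random.Random(0)
    estimates = [0.2, 0.5, 0.8]  # hypothetical oracle outputs
    action, propensity = log_decision(estimates, rng, temperature=0.5)
    print(action, propensity)
```

Because every action keeps a non-zero propensity, the logged data can still be reweighted to evaluate other policies, which is the property a purely greedy policy would lose.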
An analysis conducted at Adyen, a global payments processor, demonstrates that regression oracles can enhance system performance but may introduce challenges due to rigid algorithmic assumptions.
The study finds that policy improvements can lead to subsequent performance degradation, because they shift the reward distribution and increase class imbalance in the training data.
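One plausible reading of the class-imbalance point, offered here as an assumption rather than the study's stated mechanism: if the positive outcome is already the majority class, a policy that concentrates on the best actions logs even fewer negative examples for the next round of training. The sketch below computes the expected positive rate of logged data under two policies; the success rates and policy probabilities are hypothetical numbers chosen for illustration.

```python
def logged_positive_rate(policy, success_rates):
    """Expected share of positive outcomes in data logged under `policy`."""
    return sum(p * s for p, s in zip(policy, success_rates))


def imbalance_ratio(positive_rate):
    """Ratio of majority-class to minority-class examples."""
    minority = min(positive_rate, 1.0 - positive_rate)
    if minority == 0.0:
        return float("inf")
    return (1.0 - minority) / minority


if __name__ == "__main__":
    success_rates = [0.60, 0.80, 0.95]       # hypothetical per-action success rates
    uniform_policy = [1 / 3, 1 / 3, 1 / 3]   # early, exploratory policy
    improved_policy = [0.05, 0.15, 0.80]     # later policy favouring the best action

    for name, policy in [("uniform", uniform_policy), ("improved", improved_policy)]:
        rate = logged_positive_rate(policy, success_rates)
        print(name, round(rate, 3), round(imbalance_ratio(rate), 2))
```

Under these assumed numbers the improved policy yields a higher positive rate and a larger majority-to-minority ratio, so a model retrained on its logs sees proportionally fewer failures.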
The analysis also identifies a potential 'oscillation effect': regression oracles influence the probability estimates, which in turn affects the stability and performance consistency of policy models over successive iterations.