<ul><li>A new log-sum-exponential estimator is introduced for off-policy learning and evaluation in logged bandit feedback datasets.</li><li>The estimator addresses challenges such as high variance, low-quality propensity scores, and heavy-tailed reward distributions.</li><li>It demonstrates variance reduction and robustness under heavy-tailed conditions, outperforming traditional inverse propensity score estimators.</li><li>Theoretical analysis and empirical evaluations confirm the practical advantages of the new estimator in off-policy learning and evaluation scenarios.</li></ul>

Log-Sum-Exponential Estimator for Off-Policy Evaluation and Learning
