A new log-sum-exponential (LSE) estimator is introduced for off-policy learning and evaluation with logged bandit feedback. The estimator addresses challenges such as high variance, low-quality propensity scores, and heavy-tailed reward distributions. It achieves variance reduction and remains robust under heavy-tailed conditions, outperforming traditional inverse propensity score (IPS) estimators. Theoretical analysis and empirical evaluations confirm its practical advantages in off-policy learning and evaluation scenarios.
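The contrast with IPS can be sketched in code. This is a minimal illustration, not the paper's exact method: it assumes the LSE estimator takes the common log-mean-exp form \( \frac{1}{\alpha}\log\big(\frac{1}{n}\sum_i e^{\alpha w_i r_i}\big) \) with a negative temperature parameter `alpha`, which acts as a soft minimum over the importance-weighted rewards and damps heavy-tailed outliers; the function names and the choice of `alpha` are illustrative assumptions.

```python
import numpy as np

def ips_estimate(rewards, weights):
    """Standard IPS estimator: plain mean of importance-weighted rewards.

    A single sample with a huge weight or reward can dominate this mean,
    which is the high-variance / heavy-tail failure mode discussed above.
    """
    return np.mean(weights * rewards)

def lse_estimate(rewards, weights, alpha=-1.0):
    """Sketch of a log-sum-exponential estimator (illustrative, not the
    paper's exact formulation).

    With alpha < 0, log-mean-exp behaves as a soft minimum of the
    weighted rewards, so extreme outliers are down-weighted instead of
    dominating the estimate.
    """
    z = weights * rewards
    # Numerically stable log-mean-exp via the standard max-shift trick.
    m = np.max(alpha * z)
    return (m + np.log(np.mean(np.exp(alpha * z - m)))) / alpha
```

On a batch with one extreme weighted reward, `lse_estimate` stays close to the bulk of the data while `ips_estimate` is pulled toward the outlier; when all weighted rewards are identical, the two estimates coincide.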