Researchers propose a class of instrumental variable-based reinforcement learning (IV-RL) algorithms to address reinforcement bias in data analysis. Reinforcement bias arises from the interaction between data generation and data analysis, and it exacerbates the endogeneity problem. The proposed IV-RL algorithms are formulated within a stochastic approximation framework, and the authors establish their theoretical properties. The analysis also provides formulas for inference on optimal policies and highlights how intertemporal dependencies affect that inference.
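
To make the idea of an instrumental variable inside a stochastic approximation recursion concrete, here is a minimal toy sketch in Python. It is not the IV-RL algorithm from the paper: it is a one-parameter linear regression with an endogenous regressor, and every name in it (`run_sa`, `theta_true`, the data-generating process, the step-size constant `t0`) is an assumption made up for illustration. The naive recursion uses the regressor itself in the update and settles on a biased value, while the IV recursion uses an instrument that is correlated with the regressor but not with the noise.

```python
import random


def run_sa(n_steps=20000, seed=0, theta_true=2.0, t0=50.0):
    """Toy comparison of a naive and an IV stochastic approximation recursion.

    Hypothetical data-generating process (an assumption, not from the source):
        z ~ N(0, 1)            instrument
        u ~ N(0, 1)            unobserved confounder
        x = z + u              endogenous regressor (correlated with the noise)
        y = theta_true * x + u outcome, with noise e = u
    """
    rng = random.Random(seed)
    theta_iv = 0.0     # iterate driven by the instrument z
    theta_naive = 0.0  # iterate driven by the endogenous regressor x

    for t in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, 1.0)
        x = z + u
        y = theta_true * x + u

        step = 1.0 / (t + t0)  # Robbins-Monro style decaying step size

        # IV update: seeks the root of E[z * (y - x * theta)] = 0,
        # which is theta_true because z is uncorrelated with the noise u.
        theta_iv += step * z * (y - x * theta_iv)

        # Naive update: seeks the root of E[x * (y - x * theta)] = 0,
        # which is theta_true + E[x*u] / E[x^2] = theta_true + 0.5 here.
        theta_naive += step * x * (y - x * theta_naive)

    return theta_iv, theta_naive


if __name__ == "__main__":
    iv, naive = run_sa()
    print(f"IV estimate:    {iv:.3f}")
    print(f"naive estimate: {naive:.3f}")
```

In this toy setup the naive iterate drifts toward a value about 0.5 above `theta_true`, while the IV iterate stays close to it. The decaying step size is one common choice for stochastic approximation; other schedules would also work under the usual conditions.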