Reinforcement Learning (RL) is a paradigm in which an agent learns to take sequential actions in an uncertain environment so as to maximize cumulative reward. Its effectiveness has been established across diverse domains, demonstrating an aptitude for handling complex tasks [1,2]. Unlike other machine learning paradigms, RL actively acquires information: the agent gathers data through firsthand interaction with the environment, a setting termed 'online' RL. In many real-world scenarios, however, such as robotics, autonomous driving, and healthcare, direct experimentation can be infeasible or even hazardous, and interactions are often costly. It can therefore be beneficial for an agent to learn from existing datasets instead, eliminating the need for direct experimentation. This approach, termed 'offline' RL or batch RL, learns exclusively from previously collected data.

Offline RL, while advantageous in exploiting existing datasets, presents unique challenges [3]. A central concern is balancing improved generalization against undesired out-of-distribution (OOD) behavior, largely a consequence of 'distributional shift'. During policy evaluation, Bellman updates can query values at OOD state-action pairs, potentially triggering a cascade of extrapolation errors; the problem intensifies with high-capacity function approximators such as neural networks. To address these OOD issues, several offline RL algorithms, both model-free and model-based, incorporate some form of pessimism [4-8]. Recent model-based approaches adjust the Markov Decision Process (MDP) model estimated from the offline dataset to encourage conservative behavior [6,7,9], penalizing the policy or the rewards wherever the model's estimated accuracy is low. Various methods pursue this balance by estimating model uncertainty or by regularizing the value function. Which notion of model uncertainty to use, and what form of conservatism is most suitable in practice, nevertheless remain open questions.
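To make the pessimism idea concrete, the following is a minimal illustrative sketch, not any paper's actual implementation: the true dynamics are unknown, so an ensemble of learned dynamics models stands in for them, ensemble disagreement serves as an uncertainty proxy u(s, a), and the reward is penalized as r̃(s, a) = r̂(s, a) − λ·u(s, a). All function names and the toy linear "models" here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ensemble: each "dynamics model" is a random linear map s' = W @ [s; a].
# In practice these would be neural networks trained on the offline dataset.
ensemble = [rng.normal(size=(3, 5)) for _ in range(5)]

def predict(W, s, a):
    return W @ np.concatenate([s, a])

def uncertainty(s, a):
    """u(s, a): max std-dev of next-state predictions across the ensemble,
    one common heuristic proxy for model error in offline RL."""
    preds = np.stack([predict(W, s, a) for W in ensemble])
    return preds.std(axis=0).max()

def penalized_reward(r_hat, s, a, lam=1.0):
    """Conservative reward: r~(s, a) = r^(s, a) - lam * u(s, a)."""
    return r_hat - lam * uncertainty(s, a)

s0, a0 = np.zeros(3), np.zeros(2)
print(penalized_reward(1.0, s0, a0))  # -> 1.0: all linear models agree at zero input
```

A policy optimized against this penalized reward is discouraged from exploiting regions where the learned models disagree, which is the shared intuition behind the uncertainty-based conservative methods cited above.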

In a recent paper published at ICML 2023, the research group of Prof. Min-hwan Oh introduces a novel model-based offline RL method named 'Count-MORL'. The method uses the frequency of state-action pairs in the offline dataset to quantify model estimation error and penalizes the reward function based on these estimates. A distinctive aspect of Count-MORL is its use of hash-code heuristics to approximate state-action counts, which is particularly useful for continuous state and action spaces. The work underscores the significance of count-based uncertainty and its impact on performance: numerical evaluations show that Count-MORL consistently surpasses existing offline RL methods. The main contributions of the study are the introduction of count-based conservatism to model-based offline deep RL, theoretical analyses of the proposed method, and the demonstration of superior empirical performance.
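The following sketch illustrates the count-based idea in a continuous space, under explicit assumptions: a SimHash-style projection (random hyperplanes producing a short bit code) stands in for the paper's hash-code heuristic, and the penalty takes the common form λ/√n(s, a). The exact hashing scheme, penalty form, and all names here are illustrative, not Count-MORL's actual implementation.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

# SimHash-style projection: K random hyperplanes map a continuous [s; a]
# vector to a K-bit code, so nearby pairs tend to share a code and can be
# counted together. K and DIM are toy values for this sketch.
K, DIM = 8, 5
A = rng.normal(size=(K, DIM))

def hash_code(s, a):
    x = np.concatenate([s, a])
    return tuple((A @ x > 0).astype(int))

counts = Counter()

def update_counts(dataset):
    for s, a in dataset:
        counts[hash_code(s, a)] += 1

def count_penalized_reward(r_hat, s, a, lam=1.0):
    """r~(s, a) = r^(s, a) - lam / sqrt(n(s, a)): rarely visited
    (low-count) pairs receive a large pessimism penalty."""
    n = counts[hash_code(s, a)]
    return r_hat - lam / np.sqrt(max(n, 1))

dataset = [(np.ones(3), np.ones(2))] * 100   # one frequently visited pair
update_counts(dataset)

frequent = count_penalized_reward(1.0, np.ones(3), np.ones(2))
rare = count_penalized_reward(1.0, -np.ones(3), -np.ones(2))  # never seen
# frequent keeps a small penalty (lam / sqrt(100) = 0.1), while the unseen
# pair is penalized by the full lam.
```

The appeal of count-based uncertainty is that it requires no model ensemble: the dataset itself says how well-covered each state-action region is, and the hash makes counting tractable when exact pairs never repeat.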

  1. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  2. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
  3. Levine, S., Kumar, A., Tucker, G., and Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  4. Kumar, A., Zhou, A., Tucker, G., and Levine, S. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  5. Shi, L., Li, G., Wei, Y., Chen, Y., and Chi, Y. Pessimistic Q-learning for offline reinforcement learning: Towards optimal sample complexity. In International Conference on Machine Learning, pp. 19967–20025. PMLR, 2022.
  6. Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J. Y., Levine, S., Finn, C., and Ma, T. MOPO: Model-based offline policy optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
  7. Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims, T. MOReL: Model-based offline reinforcement learning. Advances in Neural Information Processing Systems, 33:21810–21823, 2020.
  8. Lu, C., Ball, P., Parker-Holder, J., Osborne, M., and Roberts, S. J. Revisiting design choices in offline model-based reinforcement learning. In International Conference on Learning Representations, 2022.
  9. Rafailov, R., Yu, T., Rajeswaran, A., and Finn, C. Offline reinforcement learning from images with latent space models. In Learning for Dynamics and Control, pp. 1154–1168. PMLR, 2021.