Curiosity in Deep Reinforcement Learning

Understanding Random Network Distillation

Michael Klear, Dec 3

[Figure: Learning to play Montezuma's Revenge, a previously difficult deep RL task, has seen big breakthroughs with Exploration by Random Network Distillation (source: Parker Brothers Blog).]

Learning to play Atari games is a popular benchmark task for deep reinforcement learning (RL) algorithms. A popular way to drive exploration is curiosity based on next-state prediction: the agent receives an intrinsic reward proportional to how poorly it predicts the next observation. The trouble is that a source of inherently unpredictable noise, such as a TV showing static, produces high prediction error no matter how long the agent watches. The end result is an agent that prefers to stop and watch the TV rather than continue exploring the maze.

[Figure: The next-state prediction curious agent ends up "procrastinating" when faced with a TV, or source of random noise, in the environment (source).]

Avoiding Procrastination with Random Network Distillation

A solution to the noisy TV problem is proposed in Exploration by Random Network Distillation (RND), a very recent paper published by some of the good folks at OpenAI. The novel idea here is to apply a technique similar to the next-state prediction method described above, but to remove the dependence on the previous state.

[Figure: Next-state prediction vs. random network distillation.]

In RND, each observation is passed through a second neural network whose weights are randomly initialized and then frozen. The output of this function itself is actually unimportant; what's important is that we have some unknown, deterministic function (a randomly initialized neural network) that transforms observations in some way. The task of our predictive model, then, is not to predict the next state, but to predict the output of this random network given an observed state. We can train the predictive model using the outputs of the random network as labels.

When the agent is in a familiar state, the predictive model should make good predictions of the expected output from the random network. When the agent is in an unfamiliar state, the predictive model will make poor predictions about the random network output. In this way, we can define an intrinsic reward function that is again proportional to the loss of the predictive model.

[Figure: Conceptual overview of the intrinsic reward computation.]

The predictive model no longer tries to predict the unpredictable next frame on the screen; it only needs to learn how observed frames get transformed by the random network.

Exploring Montezuma's Revenge

Previous next-state prediction curiosity mechanisms failed to solve Montezuma's Revenge because agents converged on bad solutions like the procrastination problem described above, but RND seems to have overcome these issues. Agents driven by curiosity explore rooms and learn to collect keys which allow them to unlock new rooms. Despite this success, the agent only "occasionally" passes the first level.
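To make the mechanism concrete, here is a minimal sketch of the RND intrinsic reward computation, assuming PyTorch, a flattened observation vector, and illustrative network sizes and learning rate (these are hypothetical choices, not the architecture or hyperparameters used in the paper):

```python
import torch
import torch.nn as nn

obs_dim, feat_dim = 84 * 84, 128  # hypothetical observation/feature sizes

# Fixed, randomly initialized target network: its outputs serve as the labels.
target = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
for p in target.parameters():
    p.requires_grad = False  # the random network is never trained

# Predictor network: trained to match the target's output on visited states.
predictor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    """Return a per-state curiosity bonus and update the predictor.

    The bonus is the predictor's mean-squared error against the fixed
    random network: large in unfamiliar states, small in familiar ones.
    """
    with torch.no_grad():
        target_feat = target(obs)
    pred_feat = predictor(obs)
    error = (pred_feat - target_feat).pow(2).mean(dim=-1)  # per-sample MSE

    # Training the predictor on visited states is what makes familiar
    # states yield low error (and hence low curiosity) over time.
    loss = error.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return error.detach()  # added to the extrinsic reward as a curiosity bonus
```

In practice the paper also normalizes observations and intrinsic rewards before combining them with the extrinsic reward, but the core idea is just this predictor-versus-fixed-random-network error.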
