Inverse Reinforcement Learning

Once we have the right reward function, the problem reduces to finding a good policy, and can be solved with standard reinforcement learning methods. The main problem with recovering a reward function from demonstrated behaviour is that a given policy may be optimal for many different reward functions: that is, even though we have the actions of an expert, there are many different reward functions that the expert might be attempting to maximize.

Our goal is to model an agent taking actions in a given environment. We therefore suppose that we have a state space S (the set of states the agent and environment can be in), an action space A (the set of actions the agent can take), and a transition function T(s′|s,a), which gives the probability of moving from state s to state s′ when taking action a. For instance, for an AI learning to control a car, the state space would be the possible locations and orientations of the car, the action space would be the set of control signals that the AI could send to the car, and the transition function would be the dynamics model for the car. The tuple (S, A, T) is called an MDP∖R, which is a Markov Decision Process without a reward function. (The MDP∖R will also have either a known horizon or a discount rate γ, but we leave these out for simplicity.)

The inference problem for IRL is to infer a reward function R given an optimal policy π∗ : S → A for the MDP∖R. We learn about the policy π∗ from samples (s, a) of states and the corresponding action according to π∗ (which may be random). Typically, these samples come from a trajectory, which records the full history of the agent's states and actions in a single episode:

(s_0, a_0), (s_1, a_1), …, (s_n, a_n)

In the car example, this would correspond to the actions taken by an expert human driver who is demonstrating the desired driving behaviour (where the actions would be recorded as the signals to the steering wheel, brake, etc.).

Given the MDP∖R and the observed trajectory, the goal is to infer the reward function R. In a Bayesian framework, if we specify a prior on R we have:

P(R | s_0, a_0, …, s_n, a_n) ∝ P(s_0, a_0, …, s_n, a_n | R) · P(R) ∝ P(R) · ∏_{i=0}^{n} P(a_i | s_i, R)

(the initial-state and transition terms do not depend on R, so they drop out of the proportionality). The likelihood P(a_i | s_i, R) is just π_R(s_i)[a_i], where π_R is the optimal policy under the reward function R. Note that computing the optimal policy given the reward is in general non-trivial; except in simple cases, we typically approximate the policy using reinforcement learning. Due to the challenges of specifying priors, computing optimal policies and integrating over reward functions, most work in IRL uses some kind of approximation to the Bayesian objective (a small worked sketch is given at the end of this section).

Reward Signal

In most reinforcement learning tasks there is no natural source for the reward signal. Instead, it has to be hand-crafted and carefully designed to accurately represent the task. Often it is necessary to manually tweak the rewards of the RL agent until the desired behavior is observed.
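
To make the MDP∖R and the trajectory above concrete, here is a minimal sketch in Python. The two-state environment, the transition probabilities, and the demonstration are illustrative assumptions, not taken from any particular example in the text.

```python
# A minimal sketch of an MDP\R (S, A, T) and a demonstration trajectory,
# assuming a toy two-state, two-action environment (all numbers are illustrative).
import numpy as np

n_states, n_actions = 2, 2          # S = {0, 1}, A = {0, 1}

# T[s, a, s'] = probability of moving to state s' when taking action a in state s.
T = np.array([
    [[0.9, 0.1],    # state 0, action 0
     [0.2, 0.8]],   # state 0, action 1
    [[0.1, 0.9],    # state 1, action 0
     [0.8, 0.2]],   # state 1, action 1
])
assert np.allclose(T.sum(axis=2), 1.0)   # each (s, a) row is a probability distribution

# A trajectory records the history (s_0, a_0), (s_1, a_1), ..., (s_n, a_n)
# produced by the expert's policy pi*.
trajectory = [(0, 1), (1, 0), (1, 0), (0, 1)]
```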
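
And here is a minimal sketch of the Bayesian objective itself, under two common simplifying assumptions that are not spelled out above: the reward is a function of the state only, and a Boltzmann-rational (softmax) policy stands in for the optimal policy π_R, which keeps the likelihood P(a_i | s_i, R) smooth. The posterior is computed over a small, hand-picked set of candidate reward functions with a uniform prior.

```python
# A sketch of the Bayesian IRL posterior over a finite set of candidate rewards.
# Assumptions: state-only rewards R(s), a softmax (Boltzmann-rational) stand-in
# for the optimal policy pi_R, and a uniform prior over the candidates.
import numpy as np

gamma = 0.9   # discount rate
beta = 5.0    # Boltzmann rationality: larger values approach the optimal policy

# Same toy MDP\R as above: T[s, a, s'] = P(s' | s, a).
T = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.1, 0.9], [0.8, 0.2]],
])
n_states, n_actions = T.shape[0], T.shape[1]

# Observed demonstration: pairs (s_i, a_i) from the expert's trajectory.
trajectory = [(0, 1), (1, 0), (1, 0), (0, 1)]

def q_values(R, n_iters=500):
    """Approximate Q*(s, a) under reward R(s) by value iteration."""
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = R[:, None] + gamma * (T @ V)   # Q[s, a] = R(s) + gamma * E[V(s')]
        V = Q.max(axis=1)
    return Q

def boltzmann_policy(R):
    """Softmax over Q-values: a smooth stand-in for the optimal policy pi_R."""
    logits = beta * q_values(R)
    logits -= logits.max(axis=1, keepdims=True)    # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)   # pi_R(s)[a]

# Hand-picked candidate reward functions R(s) and a uniform prior P(R).
candidates = [np.array([1.0, 0.0]),
              np.array([0.0, 1.0]),
              np.array([0.5, 0.5])]
prior = np.full(len(candidates), 1.0 / len(candidates))

# P(R | trajectory) is proportional to P(R) * prod_i P(a_i | s_i, R).
posterior = []
for p_R, R in zip(prior, candidates):
    pi = boltzmann_policy(R)
    likelihood = np.prod([pi[s, a] for s, a in trajectory])
    posterior.append(p_R * likelihood)
posterior = np.array(posterior)
posterior /= posterior.sum()

print(posterior)   # which candidate reward best explains the demonstration
```

In practice the space of reward functions is far too large to enumerate like this, which is exactly why most IRL methods approximate the Bayesian objective rather than computing it directly.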
