# Building a Crawling Robot With Q Learning

This post is a part of our Introduction to Machine Learning course at Code Heroku.

In our Introduction to Reinforcement Learning post, we have seen how to model a Reinforcement Learning problem using Markov Decision Process (MDP).

In model-based learning, we try to create a model of the reinforcement learning problem using a Markov Decision Process (MDP).

We try to calculate all the possible probabilities associated with each state (and the probabilities of getting rewards) and then extract a policy from them.

But is calculating all those probabilities computationally feasible for most real-world problems? No, right? That's why the concept of model-free learning was introduced in reinforcement learning.

Prefer videos? Subscribe to our YouTube channel to watch the video lessons.

In model-free learning, the only thing we are concerned about is: being in the current state, what is the next best action to take?

There are two types of model-free learning: Monte Carlo Learning and TD Learning.

Monte Carlo Learning takes a sequence of actions to solve a reinforcement learning problem and then, at the end, tries to evaluate how good that sequence of actions was.

TD Learning, on the other hand, takes a single action at a time and tries to evaluate how good that individual action was.

Q Learning is a type of TD Learning, and it is what we are going to learn today.

The problem with Monte Carlo Learning is that unless we reach an end state, we can't evaluate our actions.

So it can't be applied to problems that don't have an end state.

For example, in our crawling robot problem, there is no end state; the robot's only goal is to walk.

That’s why we are applying Q Learning for solving this problem.

We know that in case of reinforcement learning problems, we can take multiple actions from a state.

For example, consider the following image of PacMan.

The agent can move up, down, left, or right.

But, to achieve the final goal, the agent needs to move in the direction which gives maximum reward (value).

Q(s,a) is a function that takes a state-action pair as input and returns the value (reward) associated with taking action 'a' in state 's'.

In order to maximize our rewards, we should take the maximum Q(s,a) value as the value of the state, V(s).

Let’s see if you actually understood this idea or not.

Given the following table of Q(s,a) values, how would you calculate V(s) for each state?

We can see that the top-left block (state) has two possible Q(s,a) values.

If it goes to the state at its right, it will get a Q(s,a) of 90.

If it goes to the state below it, it will get a Q(s,a) of 72.

So, in order to get maximum rewards, we should select V(s) as maximum of all possible Q values from the current state = max(90,72) = 90.

This way we can calculate the V(s) for the rest of the states as well.
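The calculation above can be sketched in a few lines of Python. The dictionary-based Q-table and the state/action names here are illustrative, not the course code:

```python
# A minimal sketch of deriving V(s) from a Q-table, using the example
# values above. Keys are (state, action) pairs; values are Q(s, a).
q_table = {
    ("top_left", "right"): 90,
    ("top_left", "down"): 72,
}

def state_value(q_table, state):
    """V(s) = max over all actions a of Q(s, a)."""
    return max(q for (s, a), q in q_table.items() if s == state)

print(state_value(q_table, "top_left"))  # -> 90
```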

Over here, we have seen how we can calculate V(s) from Q(s,a) values.

But, we still don’t know how to calculate the Q(s,a) values itself.

Let’s see how to do this.

The Q value of a state-action pair is simply the sum of the instant reward and all future rewards.

Q(s,a) = Instant Reward + Future Reward

And to calculate the value of the "future reward", we take the maximum Q value of the next state and multiply it by a discount factor, gamma (Ɣ):

Q(s,a) = R + Ɣ · max Q(s', a')

Based on this equation, we are going to update our Q values.
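A quick worked example of this update, with made-up numbers: suppose taking action a in state s gives an instant reward of 5, the next state s' has Q-values {a1: 10, a2: 4}, and Ɣ = 0.9:

```python
# Worked example of Q(s,a) = R + gamma * max Q(s', a'); all numbers are made up.
gamma = 0.9
instant_reward = 5
next_q_values = {"a1": 10, "a2": 4}

q_value = instant_reward + gamma * max(next_q_values.values())
print(q_value)  # 5 + 0.9 * 10 = 14.0
```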

Now that we have a clear idea of Q Learning, let’s try to implement a crawling robot with this.

To implement this, we will use a crawling environment that was originally created for UC Berkeley's CS188 course.

Along with this, we will create two more files, agents.py and play.py, for simulating the whole environment.

Inside the play.py file, let us first initialize the environment and agent.

For this crawling robot problem, can you figure out what the states are? What are the actions? And what are the rewards?

We can think of the position of the arms as the states, the movement of the arms as the actions, and the velocity of the robot as a good reward signal.

We will keep track of all the rewards using a total_rewards variable.

Then we will iterate the learning process 30,000 times and print the average reward after every 5,000 steps.
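The shape of that loop in play.py might look like the sketch below. The real crawler environment is replaced here by a trivial two-action toy stand-in so the loop can run end to end; all class, method, and variable names are illustrative, not the course code:

```python
import random

# Toy stand-in for the crawler environment (hypothetical, for illustration
# only): action "move" earns reward 1.0, action "stay" earns 0.0.
class ToyEnv:
    def __init__(self):
        self.state = 0

    def step(self, action):
        reward = 1.0 if action == "move" else 0.0
        self.state = (self.state + 1) % 2   # toggle between two arm positions
        return self.state, reward

env = ToyEnv()
state = env.state
total_rewards = 0.0

for step in range(1, 30001):
    # A real agent would call choose_action() here instead of picking randomly.
    action = random.choice(["move", "stay"])
    next_state, reward = env.step(action)
    total_rewards += reward                  # running total of all rewards
    state = next_state
    if step % 5000 == 0:
        print(f"step {step}: average reward = {total_rewards / step:.3f}")
```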

To make this work, we need to define the learn() function in the agents.py file.

We will implement the formula of Q(s,a) inside this function.

But instead of directly assigning the new value to Q, we will use a parameter, alpha, to gradually blend the new value into the old value.

It is similar to the learning rate in supervised learning.
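A sketch of what learn() might look like, with alpha blending the new estimate into the old one. The function signature and the dictionary-based Q-table are assumptions for illustration, not the actual course code:

```python
# Q-learning update: move Q(s, a) a fraction alpha of the way toward
# the new target, R + gamma * max Q(s', a').
def learn(q_table, state, action, reward, next_state, actions,
          alpha=0.2, gamma=0.9):
    old = q_table.get((state, action), 0.0)
    future = max(q_table.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * future              # instant + discounted future reward
    q_table[(state, action)] = old + alpha * (target - old)
```

With alpha = 1 this would overwrite the old value entirely; smaller alpha values average over many noisy updates, which is why it plays the same role as a learning rate.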

Now we need to define the choose_action() function.

In this function, we will usually choose the action with the maximum Q value for the current state.

But, sometimes we will intentionally choose random actions to add some randomness to our learning process.

This will make sure that our learning process won’t be biased towards taking some specific action.
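This is the familiar epsilon-greedy strategy, and choose_action() could be sketched as follows (the function name matches the text, but the Q-table layout and epsilon parameter are assumptions):

```python
import random

# Epsilon-greedy selection: explore with probability epsilon,
# otherwise exploit the action with the highest Q-value.
def choose_action(q_table, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)    # explore: random action
    # exploit: action with maximum Q(s, a), defaulting to 0 for unseen pairs
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```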

Now, if we run the file, we will see an output screen where the robot is trying to crawl and getting better at it as the number of steps increases.