The Fundamentals of Reinforcement Learning

A reward and penalty system is implemented, so the program can learn from its mistakes.

As soon as the computer is able to differentiate between “A”s and “non-A”s, the training phase is completed and the computer is tested with new data.

Combined with deep learning, reinforcement learning has allowed computers to master Atari games, Dota 2, and the Chinese board game Go.

Creating personalized recommendation lists on Netflix or Amazon is possible because of reinforcement learning.

While reinforcement learning has allowed for large developments in artificial intelligence, it is not without its shortcomings.

Sample Efficiency

Montezuma’s Revenge

Sample efficiency is a big discrepancy between human learning and computer learning.

A human could look at a simple video game challenge and figure out what you need to avoid and what the object of the game is.

A robot takes much longer to catch on.

A modern reinforcement learning algorithm needs about 4 million frames to solve a level like this one successfully and consistently.

The equivalent of 4 million frames in human time is 37 hours of uninterrupted gameplay, if the engine is running at 30 frames per second.

This is about 2,000x slower than a human being.
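The arithmetic behind those figures can be checked with a quick back-of-the-envelope sketch. The frame count, frame rate, and 2,000x factor come from the text above; the implied human solve time is derived here from those numbers and is not stated in the original.

```python
# Back-of-the-envelope check of the figures quoted above.
FRAMES_NEEDED = 4_000_000   # frames quoted in the text
FRAMES_PER_SECOND = 30      # engine speed quoted in the text
SLOWDOWN_FACTOR = 2_000     # "about 2,000x slower" quoted in the text

seconds_of_play = FRAMES_NEEDED / FRAMES_PER_SECOND
hours_of_play = seconds_of_play / 3600

# Derived, not stated in the text: the human solve time implied by the 2,000x figure.
implied_human_seconds = seconds_of_play / SLOWDOWN_FACTOR

print(f"Agent play time: {hours_of_play:.1f} hours")
print(f"Implied human solve time: {implied_human_seconds:.0f} seconds")
```

In other words, the 2,000x figure is consistent with a human figuring out the level in roughly a minute.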

Priors Removed / With Priors

Researchers at Berkeley argue that humans have the advantage because of prior knowledge.

We know the shape of ladders, doors, and keys, and we know that we should probably avoid fire and angry skulls because that indicates danger.

Researchers at Berkeley created different versions of the same video game: one with object priors intact, and others with these priors removed one at a time.

In a regularly designed video game with object priors intact, humans were able to complete a level in 1.8 minutes with an average of 3.3 deaths before completion.

Playing the versions with priors removed is actually pretty challenging, but it gives a sense of what the game looks like to a computer.

You can try them out yourself here.

In fact, when all priors are removed, the learning curve for the level is sharply increased for humans, but computers do not require any additional time to solve the level consistently.

With all of these object priors eliminated, the time humans needed to complete the level increased to 20 minutes and average deaths rose to 40.

Credit Assignment Problem

Machine learning and reinforcement learning are not new ideas.

Minsky addressed these topics in his innovative paper from 1961.

In this paper, Minsky discusses many concepts that were ahead of their time, including the credit assignment problem.

Credit assignment involves determining which actions should be credited with a reward and which actions should be blamed for a penalty.

Temporal credit assignment is a big problem for reinforcement learning and, if addressed, may reduce the length of time it takes for a computer to “learn” a video game.

Suppose a computer is playing Pong pretty well, volleying the ball back and forth for a while.

The computer makes smart and quick moves throughout, but at the end it narrowly misses the ball, which results in a loss.

Despite a strong performance in this round, a computer trained with reinforcement learning will be less likely to use those sequences of moves in the future because it will associate them with a loss.

The computer is unable to determine which few moves preceding the loss actually caused the negative result, so its good performance is thrown out.
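Here is a minimal sketch of why that happens, assuming the agent scores each move by the discounted sum of future rewards, as simple policy-gradient methods do. That scoring rule, the rally length, and the discount factor `gamma` are assumptions for illustration, not details from the Pong example itself.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical rally: 199 good moves that earn nothing, then one miss (-1).
rally_rewards = [0.0] * 199 + [-1.0]
returns = discounted_returns(rally_rewards, gamma=0.99)

print(f"credit for the first move: {returns[0]:.3f}")
print(f"credit for the final move: {returns[-1]:.3f}")
print("every move penalized:", all(g < 0 for g in returns))
```

Every move in the rally ends up with a negative score, including the good ones. Discounting only shrinks the blame for earlier moves; it does not tell the agent which moves actually caused the miss.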

Temporal credit assignment has received the most attention in the reinforcement learning community.

Structural credit assignment involves generalizing which sequence of actions will result in the same outcome.

Transfer credit assignment is generalizing which sequence of moves can be applied to different tasks successfully.

Big-picture pattern recognition would be critical for this skill.

Quantifying and addressing these discrepancies in reinforcement learning would greatly reduce the disparity of learning time between computers and humans.

Multi-Armed Bandit

The multi-armed bandit problem is a classic problem in reinforcement learning.

A fixed, limited set of resources must be allocated among a set of competing choices in order to maximize the total reward.

The properties of each choice are hidden at the beginning of the trial.

Information about each choice is revealed over time or by allocating resources to that choice.

This is a dilemma of exploration vs exploitation.

The only true way to move through this problem is with trial-and-error exploration.

This problem is often conceptualized as 10 different slot machines.

Each slot machine either pays out or it doesn’t, but some machines pay out more often than others.

The goal is to find the machine with the highest win-rate.

One solution to this problem allocates 10% of resources to exploration and dedicates the rest to exploiting the machine with the highest perceived payout rate.
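A 10% exploration split like this is commonly called an epsilon-greedy strategy. Below is a rough sketch of it on 10 simulated slot machines. The payout rates, the number of pulls, the random seed, and the function name `run_epsilon_greedy` are all made up for illustration.

```python
import random


def run_epsilon_greedy(payout_probs, pulls=10_000, epsilon=0.1, seed=0):
    """Play a win/lose bandit; return (estimates, counts, total_reward)."""
    rng = random.Random(seed)
    n_arms = len(payout_probs)
    counts = [0] * n_arms       # how often each machine was pulled
    estimates = [0.0] * n_arms  # observed win rate of each machine
    total_reward = 0
    for _ in range(pulls):
        if rng.random() < epsilon:
            # Explore: spend roughly 10% of pulls on a random machine.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pull the machine that looks best so far.
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1 if rng.random() < payout_probs[arm] else 0
        counts[arm] += 1
        # Incremental update of the running average win rate.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


# Hypothetical payout rates; hidden from the player in the real problem.
payout_probs = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.80]
estimates, counts, total_reward = run_epsilon_greedy(payout_probs)
best_guess = max(range(len(estimates)), key=lambda a: estimates[a])

print("machine believed to be best:", best_guess)
print("pulls per machine:", counts)
print("total wins:", total_reward)
```

The player never sees `payout_probs` directly. It only learns about a machine by pulling it, which is why some share of the pulls has to go to machines that currently look worse.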

Applications

There is no shortage of applications for reinforcement learning.

It does well in video games, but it can also be used to optimize energy consumption.

There is a website that showcases different algorithms of AI using reinforcement learning.

Check it out here!

Google’s DeepMind had a computer teach itself how to walk.

And if a robot can teach itself how to flip pancakes, I’m all in.

Machine learning is where data science and robotics meet, and they all have a symbiotic relationship.

Teaching computers how to learn like we do is the next big hurdle to jump.
