One of the most fundamental questions in the field of reinforcement learning for scientists across the globe has been – “How to learn a new skill?”. The desire to understand the answer is obvious – if we can understand this, we can enable human species to do things we might not have thought before. Alternately, we can train machines using reinforcement learning to do more “human” tasks and create true artificial intelligence.
While we don’t have a complete answer to the above question yet, there are a few things which are clear. Irrespective of the skill, we first learn by interacting with the environment. Whether we are learning to drive a car or whether it an infant learning to walk, the learning is based on the interaction with the environment. Learning from interaction is the foundational underlying concept for all theories of learning and intelligence.
Today, we will explore Reinforcement Learning – a goal-oriented learning based on interaction with environment. Reinforcement Learning is said to be the hope of true artificial intelligence. And it is rightly said so, because the potential that Reinforcement Learning possesses is immense.
Reinforcement Learning is growing rapidly, producing wide variety of learning algorithms for different applications. Hence it is important to be familiar with the techniques of reinforcement learning. If you are not familiar with reinforcement learning, I will suggest you to go through my previous article on introduction to reinforcement learning and the open source RL platforms.
Once you have an understanding of underlying fundamentals, proceed with this article. By the end of this article you will have a thorough understanding of Reinforcement Learning and its practical implementation.
P.S. For implementation, we assume that you have basic knowledge of Python. If you don’t know Python, you should first go through this tutorial
Reinforcement Learning is learning what to do and how to map situations to actions. The end result is to maximize the numerical reward signal. The learner is not told which action to take, but instead must discover which action will yield the maximum reward. Let’s understand this with a simple example below.
Consider an example of a child learning to walk.
Here are the steps a child will take while learning  to walk:
Sounds like a difficult task right? It actually is a bit challenging to get up and start walking, but you have become so use to it that you are not fazed by the task. But now you can get the gist of how difficult it is for a child.
Let’s formalize the above example, the “problem statement” of the example is to walk, where the child is an agent trying to manipulate the environment (which is the surface on which it walks) by taking actions (viz walking) and he/she tries to go from one state (viz each step he/she takes) to another. The child gets a reward (let’s say chocolate) when he/she accomplishes a submodule of the task (viz taking couple of steps) and will not receive any chocolate (a.k.a negative reward) when he/she is not able to walk. This is a simplified description of a reinforcement learning problem.
Here’s a good introductory video on Reinforcement Learning.
Reinforcement Learning belongs to a bigger class of machine learning algorithm. Below is the description of types of machine learning methodologies.
Let’s see a comparison between RL and others:
There is also a fourth type of machine learning methodology called semi-supervised learning, which is essentially a combination of supervised and unsupervised learning. It differs from reinforcement learning as similar to supervised and semi-supervised learning has direct mapping whereas reinforcement does not.
To understand how to solve a reinforcement learning problem, let’s go through a classic example of reinforcement learning problem – Multi-Armed Bandit Problem. First, we would understand the fundamental problem of exploration vs exploitation and then go on to define the framework to solve RL problems.
Suppose you have many slot machines with random payouts. A slot machine would look something like this.
Now you want to do is get the maximum bonus from the slot machines as fast as possible. What would you do?
One naive approach might be to select only one slot machine and keep pulling the lever all day long. Sounds boring, but it may give you “some” payouts. With this approach, you might hit the jackpot (with a probability close to 0.00000….1) but most of the time you may just be sitting in front of the slot machine losing money. Formally, this can be defined as a pure exploitation approach. Is this the optimal choice? The answer is NO.
Let’s look at another approach. We could pull a lever of each & every slot machine and pray to God that at least one of them would hit the jackpot. This is another naive approach which would keep you pulling levers all day long, but give you sub-optimal payouts. Formally this approach is a pure exploration approach.
Both of these approaches are not optimal, and we have to find a proper balance between them to get maximum reward. This is said to be exploration vs exploitation dilemma of reinforcement learning.
First, we formally define the framework for reinforcement learning problem and then list down the probable approaches to solve the problem.
The mathematical framework for defining a solution in reinforcement learning scenario is called Markov Decision Process. This can be designed as:
We have to take an action (A) to transition from our start state to our end state (S). In return getting rewards (R) for each action we take. Our actions can lead to a positive reward or negative reward.
The set of actions we took define our policy (Ï€) and the rewards we get in return defines our value (V). Our task here is to maximize our rewards by choosing the correct policy. So we have to maximize for all possible values of S for a time t.
Let me take you through another example to make it clear.
This is a representation of a shortest path problem. The task is to go from place A to place F, with as low cost as possible. The numbers at each edge between two places represent the cost taken to traverse the distance. The negative cost are actually some earnings on the way. We define Value is the total cumulative reward when you do a policy.
Here,
Now suppose you are at place A, the only visible path is your next destination and anything beyond that is not known at this stage (a.k.a observable space).
You can take a greedy approach and take the best possible next step, which is going from {A -> D} from a subset of {A -> (B, C, D, E)}. Similarly now you are at place D and want to go to place F, you can choose from {D -> (B, C, F)}. We see that {D -> F} has the lowest cost and hence we take that path.
So here, our policy was to take {A -> D -> F} and our Value is -120.
Congratulations! You have just implemented a reinforcement learning algorithm. This algorithm is known as epsilon greedy, which is literally a greedy approach to solving the problem. Now if you (the salesman) want to go from place A to place F again, you would always choose the same policy.
Can you guess which category does our policy belong to i.e. (pure exploration vs pure exploitation)?
Notice that the policy we took is not an optimal policy. We would have to “explore” a little bit to find the optimal policy. The approach which we took here is policy based learning, and our task is to find the optimal policy among all the possible policies. There are different ways to solve this problem, I’ll briefly list down the major categories
I would try to cover in-depth reinforcement learning algorithms in future articles.
We will be using Deep Q-learning algorithm. Q-learning is a policy based learning algorithm with the function approximator as a neural network. This algorithm was used by Google to beat humans at Atari games!
Let’s see a pseudocode of Q-learning:
A simple description of Q-learning can be summarized as follows:
We will first see what Cartpole problem is then go on to coding up a solution
When I was a kid, I remember that I would pick a stick and try to balance it on one hand. Me and my friends used to have this competition where whoever balances it for more time would get a “reward”, a chocolate!
Here’s a short video description of a real cart-pole system
Let’s code it up!
To setup our code, we need to first install a few things,
From terminal, run the following commands:
git clone https://github.com/matthiasplappert/keras-rl.git
cd keras-rl
python setup.py install
Assuming you have pip installed, you need to install the following libraries
pip install h5py pip install gym
First we have to import modules that are necessary
import numpy as np import gym from keras.models import Sequential from keras.layers import Dense, Activation, Flatten from keras.optimizers import Adam from rl.agents.dqn import DQNAgent from rl.policy import EpsGreedyQPolicy from rl.memory import SequentialMemory
Then set the relevant variables
ENV_NAME = 'CartPole-v0' # Get the environment and extract the number of actions available in the Cartpole problem env = gym.make(ENV_NAME) np.random.seed(123) env.seed(123) nb_actions = env.action_space.n
Next, we build a very simple single hidden layer neural network model.
model = Sequential() model.add(Flatten(input_shape=(1,) + env.observation_space.shape)) model.add(Dense(16)) model.add(Activation('relu')) model.add(Dense(nb_actions)) model.add(Activation('linear')) print(model.summary())
Next, we configure and compile our agent. We set our policy as Epsilon Greedy and we also set our memory as Sequential Memory because we want to store the result of actions we performed and the rewards we get for each action.
policy = EpsGreedyQPolicy() memory = SequentialMemory(limit=50000, window_length=1) dqn = DQNAgent(model=model, nb_actions=nb_actions, memory=memory, nb_steps_warmup=10, target_model_update=1e-2, policy=policy) dqn.compile(Adam(lr=1e-3), metrics=['mae']) # Okay, now it's time to learn something! We visualize the training here for show, but this slows down training quite a lot. dqn.fit(env, nb_steps=5000, visualize=True, verbose=2)
Now we test our reinforcement learning model
dqn.test(env, nb_episodes=5, visualize=True)
This will be the output of our model:
And Voila! You have just built a reinforcement learning bot!
Now that you have seen a basic implementation of Re-inforcement learning, let us start moving towards a few more problems, increasing the complexity little bit every time.
For those, who don’t know the game – it was invented in 1883 and consists of 3 rods along with a number of sequentially-sized disks (3 in the figure above) starting at the leftmost rod. The objective is to move all the disks from the leftmost rod to the rightmost rod with the least number of moves. (You can read more on wikipedia)
If we have to map this problem, let us start with states:
All possible states:Â
Here are our 27 possible states:
All disks in a rod | One disk in a Rod | (13) disks in a rod | (23) disks in a rod | (12) disks in a rod |
    (123)** | 321 |  (13)2* |  (23)1* |  (12)3* |
     *(123)* | 312 |  (13)*2 | (23)*1 | (12)*3 |
      **(123) | 231 | 2(13)* | 1(23)* | 3(12)* |
132 | *(13)2 | *(23)1 | *(12)3 | |
213 | 2*(13) | 1*(23) | 3*(12) | |
123 | *2(13) | *1(23) | *3(12) |
Where (12)3* represents disks 1 and 2 in leftmost rod (top to bottom) 3 in middle rod and * denotes an empty rightmost rod
Numerical Reward:
Since we want to solve the problem in least number of steps, we can attach a reward of -1 to each step.
Policy:
Now, without going in any technical details, we can map possible transitions between above states. For example (123)** -> (23)1* with reward -1. It can also go to (23)*1
If you can now see a parallel, each of these 27 states mentioned above can represent a graph similar to that of shortest path algorithm above and we can find the most optimal solutions by experimenting various states and paths.
While I can solve this for you as well, I would want you to do this by yourself. Follow the same line of thought I used above and you should be good.
Start by defining the Starting state and the end state. Next, define all possible states and their transitions along with reward and policy. Finally, you should be able to create a solution for solving a rubix cube using the same approach.
As you would realize that the complexity of this Rubix Cube is many folds higher than the Towers of Hanoi. You can also understand how the possible number of options have increased in number. Now, think of number of states and options in a game of Chess and then in Go! Google DeepMind recently created a deep reinforcement learning algorithm which defeated Lee Sedol!
With the recent success in Deep Learning, now the focus is slowly shifting to applying deep learning to solve reinforcement learning problems. The news recently has been flooded with the defeat of Lee Sedol by a deep reinforcement learning algorithm developed by Google DeepMind. Similar breakthroughs are being seen in video games, where the algorithms developed are achieving human-level accuracy and beyond. Research is still at par, with both industrial and academic masterminds working together to accomplish the goal of building better self-learning robots
Some major domains where RL has been applied are as follows:
There are so many things unexplored and with the current craze of deep learning applied to reinforcement learning, there certainly are breakthroughs incoming!
Here is one of the recent news:
Excited to share an update on #AlphaGo! pic.twitter.com/IT5HGBmYDr
— Demis Hassabis (@demishassabis) January 4, 2017
I hope now you have in-depth understanding of how reinforcement learning works. Here are some additional resources to help you explore more about reinforcement learning
I hope you liked reading this article. If you have any doubts or questions, feel free to post them below. If you have worked with Reinforcement Learning before then share your experience below. Through this article I wanted to provide you an overview of reinforcement learning with its practical implementation. Hope you make found it useful.
Lorem ipsum dolor sit amet, consectetur adipiscing elit,
Thanks for posting the article. Very interesting and clear though my first trying for atari(https://github.com/matthiasplappert/keras-rl.git/example/dqn_atari.py) training has ended up with ResourceExhaustedError.. Since I'm using GT730 GDDR5 1G RAM and I'm rather noob in this field my question is that is it a normal situation that I can't fit my model(atari example) with my GPU?
Hi David! Thanks for the feedback. To be frank, I think 1GB of VRAM is somewhat low to do "not-so-simple" tasks and especially deep neural networks. I have too faced with problem in the past when I had a 2GB NVidia card before moving on to a bigger machine. If possible, you can try increasing memory or use the available cloud services like AWS. Having said that, I would suggest you to try a few things
Its realy good article to understand reinforcement learning
It’s amazing in support of me to have a site, which is useful in support of my know-how. thanks admin, you surely come with remarkable articles. Cheers for sharing your website page.Excellent blog here...
Thank you Shepherd!
brilliant
Thanks Sandeep
Brilliant Article Faizan , There are 10 tutorials of Reinforcement Learning which is taught by David Silver. David is one of the founding fathers of Reinforcement Learning. Here is the link, Its really amazing & worth watching, https://www.youtube.com/watch?v=2pWv7GOvuf0 You can post these videos
Thanks Vaibhav! And thanks for your suggestion. Added!
Semi supervised learning .... in this context ... an overkill Well presented introduction to RL . .. Thanks
Thanks harry
Thanku for this great article :)
Thanks Ashok
The way of presenting the article is fantastic. I get so amazed after reading this article. Thank You Sir.
You are welcome Saksham!
I I encountered a problem with the terminal. when I ran the "git clone https://github.com/matthiasplappert/keras-rl.git" it had an error saying "'git' is not recognized...". I think that I did something wrong. I am a beginner at programming so maybe I didn't do something that i should.
however this is a very good article and I understood a lot
Hi, You would have to install git to clone this project. Here is the link for installation steps https://git-scm.com/downloads
Thanks for the great article that explains basics! It really helps! :) Best, Paul
Thanks Paul
hiii some one can take give me example with markov dcision process in matlab
Hi - you can easily search for this on the web
Hi! Can I use your article as a source for our group report and show the examples and explanations in this article to my classmates? This is a really great article, it was easy to understand and has helped me understand beginner concepts in so I thought I could use your explanations to introduce my classmates to RL. If it is not okay, I can understand.
Hi - you can absolutely use this as a resource. It would be great if you could mention the appropriate source too. Thanks
Great article. Can reinforcement learning be applied on financial transnational data to learn customer spending patters and predict future cash flow. I see unsupervised learning has been research for the above problem but yet reinforcement learning is not researched on transactional data. Do you think researching the same would benefit?
Although I haven't searched much - there definitely may be research going on in this field too. For example, you can see applications of reinforcement learning in stock market prediction etc
Thanks for a great article. You explain it in a clear way that helps me move forward on the ideas.
Hello, I'm a master student in Taiwan that study in Industrial Engineering, I want to apply this reinforcement learning for my Thesis. that is about inventory problem, that the reinforcement learning will be the one who makes the decision for example to order the items. Do you where I can study something like that? another question is do I need to make the model of reinforcement learning all by myself or actually in python there are already in the library that I can use? or if there any simulation way to do this reinforcement learning. Thank you very much.
Assalam.u.alaikum I really love your article. I am.student of undergraduate and have planed to work on this feild. This article really help me out to understand the basic concept of RL. I am beginner and I love to ask some questions if possible in this or Any Other platform. [email protected]
Its a good Article but you explain how I can define my own environments and test it to learn
Very well done. Thank you very much for this beginner's guide. However I'm a bit lost when you start coding. I mean, I understand the python syntax, but it's something like "you import this and this and then do this, this and this". Could you please explain a bit your code? Thanks again.
Hi Faizan, Thanks for the great article. When I tried to use the code, the below step gave me an error: dqn.fit(env, nb_steps=5000, visualize=True, verbose=2) Error: TypeError: get_updates() takes 3 positional arguments but 4 were given
For the Rubik's cube problem, how do you "define all possible states and their transitions along with reward and policy"? There are 43,252,003,274,489,856,000 possible states. Am I understanding correctly? Starting state is arbitrary. I know the final state. There are 12 possible moves at each point in time (6 faces, counterclockwise or clockwise). I believe it takes at most 20 moves to reach any other state.
Hi - The article is very good but i am getting this eroor : "gym.error.UnregisteredEnv: No registered env with id: CartPOle-v0" , I had follow the mentioned steps. Please response how to register this environment.
Great post! I've been struggling to wrap my head around reinforcement learning, but your explanation of the basics was crystal clear. The implementations in Python were also very helpful. Can't wait to dive deeper into this topic!