Reinforcement Learning

Oftentimes when we think of A.I., we think of Terminator, Ultron, the Matrix, or any of those fancy Hollywood movies. To date, reinforcement learning is probably the closest thing we have to that vision of what A.I. looks like.

When we learn in real life, we often use a combination of supervised and unsupervised learning. We have labelled datasets that we learn from, but at the same time we work with unlabelled data, looking for underlying structures or patterns that help us better grasp the material at hand. What neither approach captures, however, is the concept of time. Our time is a scarce resource, and we cannot keep learning forever; we have deadlines and other priorities to meet. With each passing second, a penalty is imposed on us, because that is a second we could have spent on a different task.

What makes reinforcement learning (RL) powerful is this idea of incentive. Every move the agent makes has a corresponding incentive; think of it like points you receive with every score. At the same time, each move can carry a penalty, which incentivizes the RL agent to move faster. The agent tries to maximize its expected future rewards by searching for an optimal way to perform the task at hand.
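
To make the incentive idea a little more concrete, here is a minimal sketch of how a discounted sum of future rewards could be computed. The reward values and the discount factor gamma below are purely illustrative, not taken from any particular system.

```python
# A minimal sketch of "expected future reward": the agent weighs rewards
# it will receive later by a discount factor, so earlier rewards (and
# earlier penalties) count more. All numbers below are made up.

def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by how far in the future it arrives."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

# +1 for scoring, -0.01 per step as a small time penalty (illustrative values).
episode_rewards = [-0.01, -0.01, -0.01, 1.0]
print(discounted_return(episode_rewards))
```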

A Reinforcement Learning model gets thrown into an environment and is made to run multiple trials. It tries to answer questions like, “In state A, what action should I take? What does my total future reward look like? Is this really the right thing to do toward that goal?” In a way, you could say it is like supervised learning, since the outcome is well defined. However, you could also say it is unsupervised learning, because it looks for patterns and structures within the environment without being told beforehand how to maximize future rewards. In a sense, it is the best of both worlds.

So far we have discussed ways an A.I. can learn either with or without help. Is there a way to combine the two? We humans use both all the time, so why should our A.I. have to use them separately? In reinforcement learning (RL), we use insights and techniques from both to model decision-making processes.

The key to Reinforcement Learning models is the concept of trial and error and the incentives attached to it. In the real world, we tend to repeat things that give us satisfaction and avoid things that cause us pain or suffering. This explanation is vague for a reason. For example, gruelling exercise to build strength and speed is difficult and may even bring you to the brink of utter exhaustion. Yet people keep doing it because of the satisfaction it promises in the future: with every repetition, they bring themselves closer to their fitness goals.

In real life, “satisfaction” is subjective; it varies from person to person. In Reinforcement Learning, however, we get to define what counts as “satisfaction” and what does not. In a first-person shooter, a game bot programmed with RL might define satisfaction as the number of kills scored against the opposing team, while its own deaths are defined as unsatisfactory. The main insight is that actions with satisfying results are more likely to be repeated than those that offer no satisfaction whatsoever.

If you have not noticed by now, most of the insights we get from our understanding of how we learn are eventually transformed into math, including decision making. However, our mathematical interpretations are never completely accurate, which is why an iterative process of trial and error is used to check where we fall short and figure out how to fix it.

Markov Decision Process

Decision making can be seen as a generalization of both supervised and unsupervised learning. Informed decisions are learned through multiple iterations of something like supervised learning: more examples mean more opportunities to learn from past mistakes and improve in the future. When left to our own devices, making decisions based on “hunches” or “gut feeling” mirrors unsupervised learning: we decide by spotting patterns or structures when thrown into a new situation. Reinforcement Learning is about as close as we can get right now to actual AGI.

Reinforcement Learning Diagram

The Markov decision process (MDP) is the centrepiece of most Reinforcement Learning models. An MDP is a mathematical framework for decision making in which outcomes are partly random and partly under the control of the agent. MDPs are defined by a set of states, actions, rewards, and transitions. For example, playing tic-tac-toe can be modelled as an MDP. In each state, there are certain actions you can take, and each action grants you a reward of some sort, whether that is gaining an edge over your opponent or winning the game itself.
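
To make those pieces concrete, here is a minimal sketch of how the components of an MDP might be written down in Python. The tiny two-state example is hypothetical and much simpler than tic-tac-toe; it only exists to show the shape of the framework.

```python
from typing import NamedTuple, Dict, Tuple, List

class MDP(NamedTuple):
    states: List[str]                                    # S: the possible situations
    actions: List[str]                                    # A: the moves available
    # transitions[(state, action)] -> list of (next_state, probability)
    transitions: Dict[Tuple[str, str], List[Tuple[str, float]]]
    # rewards[(state, action, next_state)] -> reward received
    rewards: Dict[Tuple[str, str, str], float]
    gamma: float                                          # discount factor

# Tiny illustrative MDP: a good move sometimes wins, a bad move never does.
toy = MDP(
    states=["playing", "won"],
    actions=["good_move", "bad_move"],
    transitions={
        ("playing", "good_move"): [("won", 0.6), ("playing", 0.4)],
        ("playing", "bad_move"):  [("playing", 1.0)],
    },
    rewards={("playing", "good_move", "won"): 1.0},
    gamma=0.9,
)
```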

The goal of the game agent is to find a policy that results in the greatest future rewards. This policy is learned through experience: given a scenario, past experience tells you what to do. For example, when you are in city X past midnight, do you take route A or route B? Experience helps your decision making because by then you know what the proper course of action should be.

The beauty of the MDP is that it also works for non-deterministic situations. These are situations in which “going up” or “pressing the button” does not necessarily result in the desired outcome. For example, maybe “going up” happens with 80% probability, “going down” with 10%, and “going left” and “going right” with 5% each.

4x3 Grid World

The image above shows an environment with a total of 12 states in the 4×3 grid. The agent has four moves: up, down, left, and right. But because of stochasticity, which is a fancy way of saying the environment is probabilistic, the intended consequence of an action is never certain. This is also the case in real life.
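
A minimal sketch of that stochasticity in the grid world might look like the following. The 80/10/5/5 probabilities reuse the illustrative split from earlier, and the wall-bumping rule is an assumption about this particular grid world.

```python
import random

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}
WIDTH, HEIGHT = 4, 3  # the 4x3 grid world

def slip_distribution(intended):
    """The intended move happens 80% of the time, the opposite move 10%,
    and each remaining direction 5% (illustrative probabilities)."""
    others = [a for a in MOVES if a not in (intended, OPPOSITE[intended])]
    return [(intended, 0.80), (OPPOSITE[intended], 0.10),
            (others[0], 0.05), (others[1], 0.05)]

def step(pos, intended):
    """Sample what actually happens when the agent tries an action."""
    outcomes, weights = zip(*slip_distribution(intended))
    actual = random.choices(outcomes, weights=weights)[0]
    dx, dy = MOVES[actual]
    x, y = pos[0] + dx, pos[1] + dy
    # Bumping into a wall leaves the agent where it started (assumed rule).
    if not (0 <= x < WIDTH and 0 <= y < HEIGHT):
        return pos
    return (x, y)

print(step((0, 0), "up"))  # usually (0, 1), but not always
```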

For example, when we are playing basketball, shooting from the three-point line does not always result in a field goal. To be more specific, shooting under your most favourable conditions, however they may be defined, does not always yield the result you want. You may be wide open for a three-point shot while being “on fire” the entire game, but that does not mean the shot is assured to go in. There is always a chance it might not, and we do not have control over that chance.

(Deep) Q-Learning

When creating Reinforcement Learning agents, we want them to be as independent of any particular game as possible. Q-learning is called a “model-free” RL technique: the techniques our RL agents use as their medium for training must be generalizable. Recall from earlier that intelligence is defined not only as the ability to recall past knowledge, but also as the ability to apply the same methods to other subject matters. This is where Q-learning comes into play.

Convolutional Agent

The “Q” in Q-learning can be thought of as “quality.” Each combination of state and action has a “quality” to it, usually denoted by a number, with larger numbers meaning higher quality. For example, in basketball, it may make more sense to shoot the ball when you are wide open and in your “sweet spot.” Which action to take in a given state is dictated by our policy. The beauty of Q-learning is that it handles stochastic situations well without needing model-specific adjustments. Under fairly mild conditions, Q-learning also converges to an optimal solution, meaning training does not get stuck going in circles.
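
As a rough sketch, tabular Q-learning keeps a table of these quality numbers and nudges them after every move. The hyperparameters and the basketball-flavoured state and action names below are made up for illustration.

```python
from collections import defaultdict
import random

Q = defaultdict(float)                   # Q[(state, action)] -> estimated "quality"
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1   # illustrative hyperparameters

def choose_action(state, actions):
    """Epsilon-greedy policy: mostly pick the highest-quality action,
    occasionally explore a random one."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state, actions):
    """Q-learning update: nudge Q toward the reward plus the discounted
    value of the best action available in the next state."""
    best_next = max(Q[(next_state, a)] for a in actions)
    target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

# Hypothetical example: shooting while wide open paid off this time.
update("wide_open", "shoot", 1.0, "made_shot", ["shoot", "pass"])
```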

However, to improve Q-learning, we have to make it deep. Much as hidden layers in DNNs represent deeper abstractions of ideas, making our Q-learning deeper allows for better learning and, more often than not, more creative moves. This is where we finally introduce the deep Q-network (DQN).

Let us use game bots for our example. Reinforcement Learning is widely used for creating game bots because games come closest to mimicking real-world scenarios. As inputs, we can take in raw pixel values. A familiar model for this is the CNN. However, unlike earlier, we do not use max pooling, since adding it makes our CNN spatially invariant. Max pooling is fine if you are doing classification, but for our purposes, the RL agent needs to stay sensitive to where things are in the pixel input. The main difference between Q-learning and a DQN is that the latter uses layers of CNNs and DNNs to approximate the Q-values.
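
A minimal PyTorch-style sketch of the kind of convolutional network a DQN might use is shown below. Note that there is no max pooling, so the network remains sensitive to where things appear on screen. The 84×84 input, four stacked frames, and layer sizes follow a common Atari-style setup and are assumptions here, not a prescription.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Convolutional Q-network: pixels in, one Q-value per action out.
    No max pooling, so the network stays sensitive to *where* things are."""

    def __init__(self, n_actions, in_channels=4):  # 4 stacked grayscale frames (assumed)
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # 7x7 comes from an 84x84 input
            nn.Linear(512, n_actions),               # one Q-value per possible action
        )

    def forward(self, pixels):
        return self.head(self.conv(pixels / 255.0))  # scale raw pixel values to [0, 1]

# q_values = DQN(n_actions=6)(torch.zeros(1, 4, 84, 84))  # shape: (1, 6)
```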

Current State of RL

Right now, RL research is a hot race between Google’s DeepMind and Sam Altman and Elon Musk’s OpenAI. Both are world-renowned research organizations that focus their energy on A.I. research, and both seek to advance A.I. by creating machines that can learn without needing to be taught. The beauty of RL is that an agent can be completely bad at what it does in the beginning and, a week later, become an expert at whatever you programmed it to do. This is a powerful approach compared to supervised learning, where you have to spoon-feed your models the answers in order for them to minimize future errors.

Google DeepMind’s most famous success is its program AlphaGo defeating world-renowned Go player Lee Sedol of South Korea four times in a five-game series. This was the first time a program beat a top-ranked Go professional without any handicaps. It marked a foundation for how far our current Reinforcement Learning implementations can be stretched.

Sam Altman and Elon Musk’s OpenAI also had its share of the spotlight recently when its DotA game bot beat world-class player Dendi in a 1v1 battle during The International, the annual DotA 2 tournament. Chief Technology Officer Greg Brockman explained that their game bot was trained by playing against itself for two weeks. It is amazing that the bot took only weeks to become world-class at DotA when human world-class players took years of hard work to get there.

OpenAI has also released an open-source library called Gym, a collection of environments, many of them games, that developers can play around with. By democratizing access to this library, OpenAI has turned RL development into a competition over which machines score better. This should rapidly accelerate RL development as more unique solutions are generated every day.
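
A minimal sketch of the Gym interaction loop looks something like this. CartPole-v1 is one of Gym’s built-in environments; note that the exact reset and step signatures changed in later Gym releases, so the returned values may differ depending on the version installed.

```python
import gym

env = gym.make("CartPole-v1")   # one of Gym's built-in environments
obs = env.reset()               # newer Gym versions return (obs, info) instead

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()           # random policy, just to show the loop
    obs, reward, done, info = env.step(action)   # newer versions split done into
    total_reward += reward                       # terminated/truncated
print("episode reward:", total_reward)
env.close()
```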

Possibilities

Game bots are only the beginning. They serve as a foundation for what Reinforcement Learning can do, since they mimic what can happen in real life. An immediate practical use for RL is in self-driving cars. Tesla is currently using RL and other neural network-based algorithms to improve its self-driving cars. By having RL agents train on simulations first, or by having them observe actual drivers, their learning can be accelerated.

One benefit of having an RL-based driver is hivemind communication. Human drivers communicate poorly with other drivers on the road, for obvious reasons. By nature, we tend to be easily distracted: one moment we may be dozing off into the sunset on the drive home, and the next we are laser-focused on our driving to avoid getting tangled in an accident. RL-based drivers have no such variability. A network of communicating A.I. is much faster and less prone to miscommunication than one driver trying to tell the driver two cars away to get moving. One can imagine traffic inefficiencies eventually being smoothed out once machines are the ones communicating in this hivemind.

Apart from self-driving cars, there is also surgery. It sounds far-fetched now, because we would need to run many more tests over time, but we are gradually getting closer. Imagine a robot performing surgery with absolute precision with its scalpel. It does not shake, get nervous, or suffer from fatigue. If a DotA game bot can become world-class in two weeks’ time, imagine what surgery-related RL research could achieve if an agent were trained for a year.

Another exciting application of Reinforcement Learning is customer service. Chatbots also count as RL research if we add natural language processing into the mix. A company can cut its expenditures by deploying bots to chat with customers on social media or to talk with customers calling in by phone. These bots can answer quickly and are not fazed by an influx of customers all trying to get a response at the same time. Much like self-driving cars and surgical robots, these bots do not suffer from fatigue, meaning they can work 24/7.

RL-based agents can help automate many jobs that: 1) require less than a few seconds to think about or do; or 2) are ones where fatigue becomes a factor over time. The possibilities are endless, as long as the imagination and creativity behind them are boundless.
