RL Fundamentals

This page covers some key concepts around reinforcement learning, including Markov Decision Processes, agent components, and types of RL Agents.

What is Reinforcement Learning?

I would capture the essence of RL with the following sentence:

Reinforcement Learning is an optimization method in which an agent takes actions based on its state to maximize reward when interacting with an environment.

Unlike traditional problem-solving methods, the person designing a reinforcement learning agent does not need to know how to solve the problem. This is really important if you think about it! The agent learns on its own as an artificial intelligence (AI), and we can tackle an entirely new class of problems in entirely new ways.

RL and Markov Decision Processes

RL is well suited to interactions that can be modeled as a Markov Decision Process (MDP), where:

  • a system/agent/thing is in some state s, with a set of actions A to take

  • the system/agent/thing selects some action a in that state with probability P(a | s)

  • the selected action results in a transition to a new state s' with reward R, with some probability P(s', R | s, a)

 

Example MDP: There are four states, and each state has multiple actions that transition to the next state. State transition labels are <probability of selecting the action>, <reward for taking the action from the state>. For example, looking at the rightmost arrow: the agent in State 3 would select this action with a 33% chance and receive a +2 reward for taking that action from State 3.
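To make this concrete, here's a minimal sketch of how an MDP like this could be written down in Python, with the transition outcomes stored as P(s', R | s, a) and a policy supplying P(a | s). The state names, actions, and most of the numbers are hypothetical; only the rightmost arrow from the figure (State 3, 33% selection chance, +2 reward) is carried over.

```python
import random

# A minimal sketch of an MDP as nested dictionaries (hypothetical numbers).
# For each state and action, we list the possible (next_state, reward, probability)
# outcomes, i.e., P(s', R | s, a).
mdp = {
    "State 3": {
        "go_right": [("State 4", +2, 1.0)],   # deterministic transition, +2 reward
        "go_left":  [("State 2",  0, 0.8),    # stochastic transition
                     ("State 3", -1, 0.2)],
    },
    # ... remaining states omitted for brevity
}

# The policy defines P(a | s): State 3 picks "go_right" 33% of the time.
policy = {"State 3": {"go_right": 0.33, "go_left": 0.67}}

def step(state):
    """Sample one transition of the MDP under the policy."""
    actions, action_probs = zip(*policy[state].items())
    action = random.choices(actions, weights=action_probs)[0]
    outcomes = mdp[state][action]
    idx = random.choices(range(len(outcomes)), weights=[p for _, _, p in outcomes])[0]
    next_state, reward, _ = outcomes[idx]
    return action, next_state, reward

print(step("State 3"))  # e.g., ('go_right', 'State 4', 2)
```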

 

To model a process (or more specifically, an agent-environment interaction) as an MDP, it must satisfy a specific condition known as the Markov Property. Taking a well-worded definition I found:

The Markov Property states that the conditional probability distribution of future states of the process (conditional on both past and present values) depends only upon the present state; that is, given the present, the future does not depend on the past

For problems on longer temporal scales, where past behavior has a "delayed" impact on the present or future, the interaction can still be defined as an MDP as long as that prior experience is somehow captured in the features of the present state (think of using a rolling average of previous values to fold the past into the present state).
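As a rough sketch of what that can look like in code (the window size and feature choice here are purely hypothetical), the state handed to the agent can bundle the current observation with a rolling average that summarizes recent history:

```python
from collections import deque

def make_markov_state(raw_observation, history):
    """
    Fold recent history into the present state so the Markov property holds:
    the 'memory' is a rolling average of the last few raw observations.
    `history` is a deque (with a maxlen) maintained by the caller.
    """
    history.append(raw_observation)
    rolling_avg = sum(history) / len(history)
    # The state the agent sees = current value + a summary of the recent past.
    return (raw_observation, rolling_avg)

# Usage: the deque's maxlen drops old values automatically.
history = deque(maxlen=5)
for obs in [1.0, 2.0, 4.0, 3.0, 5.0, 6.0]:
    state = make_markov_state(obs, history)
print(state)  # (6.0, 4.0): current value plus the rolling average of the last 5
```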

(Here’s a Wikipedia link on Markov Decision Processes for those that want a deeper dive into the subject)

From MDPs to RL Agents

So, if we have an MDP defined for some problem we’re trying to solve, why do we need AI and reinforcement learning?

—> Rarely, if ever, do we have an MDP with states, actions, rewards, and transition probabilities known for the kinds of problems we’re trying to solve with RL.

The beauty of RL is that the agent will learn these things about the problem on its own, and then learn to act optimally based on them to maximize some reward that we specify. An agent interacts with an environment in the following manner:

Here, the environment block encapsulates all of the dynamics of some MDP which we (and the agent) don’t know. By repeatedly interacting with the environment, the agent will learn which actions to perform in each state to maximize the reward received.
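In code, that interaction looks something like the loop below. This is just a sketch: `env` and `agent` are hypothetical objects (an environment with reset/step methods hiding the MDP dynamics, and an agent with act/learn methods holding the policy and value updates), not any particular library's API.

```python
def run_episode(env, agent):
    """One episode of the standard agent-environment loop (hypothetical interfaces)."""
    state = env.reset()          # environment provides the initial state
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                        # policy: pick an action for this state
        next_state, reward, done = env.step(action)      # environment dynamics: the unknown MDP
        agent.learn(state, action, reward, next_state)   # update value estimates from experience
        state = next_state
        total_reward += reward
    return total_reward
```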

Now, while an RL agent can learn these things, that doesn't mean it's straightforward how it learns them or how well it performs. There are still plenty of elements of an RL agent that come down to the human designing it, primarily:

  • Problem formulation: How should states and actions be represented? How does the agent interact with the environment?

  • Reward design: How should reward be defined based on interactions with the environment? Are there additional reward factors to encourage certain behavior from the agent?

  • Agent design: How should the agent learn? How should the agent decide which actions to take? How should it “remember” outcomes and environment dynamics?

We can break each of these down into some fundamental pieces of any RL agent.

Building Blocks of an RL Agent

Expanding on the diagram above, let's think about the agent-environment interaction in four blocks. The green block represents agent-external things we can edit, and the blue blocks represent agent-internal things we can edit.

 
 

Let's think of a human performing some task to contextualize this difference. Say that John Appleseed has given up on planting apple trees and has taken up a life of gambling. Each morning, he wakes up and has to decide which casino to go to, at which he plays slots all day (see the sketch after this list).

  • Agent external edits:

    • Reward function: Reward could be purely the amount of money made in a day, or reward could be the amount of money made minus some fuel expense to get to each casino, based on the distance from his home at the apple orchard

  • Agent internal edits:

    • Policy: How does John ultimately make the decision about which casino to go to (i.e., which action to take)?

    • Value representation: How does John “remember” and keep track of what is good vs what is bad?

    • Hyperparameters: How quickly does John update his value of a given casino?

    • State and action representation: Should yesterday's results, the day of the week, how much fuel is left, etc. be features of the state? How should actions be quantified and stored?
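To tie these pieces together, here's a toy sketch of John's daily decision written as a simple bandit-style agent. The casino names, payout numbers, and hyperparameter values are all made up; the point is just to show where the policy (epsilon-greedy selection), the value representation (a table of estimates), the hyperparameters (epsilon and step size), and the reward design (winnings minus fuel cost) each live.

```python
import random

casinos = ["Lucky Star", "Golden Apple", "Riverside"]   # hypothetical action set
values = {c: 0.0 for c in casinos}    # value representation: one estimate per casino
epsilon = 0.1                         # hyperparameter: how often John explores
step_size = 0.2                       # hyperparameter: how quickly values update
fuel_cost = {"Lucky Star": 5.0, "Golden Apple": 1.0, "Riverside": 3.0}

def choose_casino():
    """Policy: epsilon-greedy over the current value estimates."""
    if random.random() < epsilon:
        return random.choice(casinos)                # explore
    return max(casinos, key=lambda c: values[c])     # exploit

def one_day(winnings_by_casino):
    casino = choose_casino()
    # Reward design: money won minus the fuel expense to reach that casino.
    reward = winnings_by_casino[casino] - fuel_cost[casino]
    # Value update: nudge the estimate toward today's outcome.
    values[casino] += step_size * (reward - values[casino])
    return casino, reward

# Simulate 100 days with made-up payout distributions.
for _ in range(100):
    one_day({"Lucky Star": random.gauss(20, 10),
             "Golden Apple": random.gauss(10, 5),
             "Riverside": random.gauss(15, 8)})
print(values)
```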

All of these different decisions fall on the designer of the RL agent, and the choices made can have serious implications for agent performance. Unfortunately, there’s no clear-cut, objective answer to these questions, and it takes experimentation and experience to develop good agents.

Types of RL Algorithms

While we can’t say outright what the best implementation is for every type of problem, we can group problems into categories based on some features and have a good starting place for algorithm selection. Some of these criteria are:

  • Value representation: Are we using function approximation or not (tabular)? (See the sketch after this list.)

  • Models: Is the environment already understood and well modeled? Will we try to learn a model of the environment?

  • Learning frequency: Will we learn on each time step?

  • Action representation: Are the actions discrete or continuous?

  • Control application: Is this a problem of optimal control? Will the agent be controlling some system?
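As a quick illustration of the value-representation criterion above, here's a hedged sketch contrasting a tabular representation with a simple linear function approximator. The state features and dimensions are hypothetical; the point is that a table stores one value per discrete state, while the approximator generalizes across states through shared weights.

```python
import numpy as np

# Tabular: one stored value per discrete state (works when states are enumerable).
tabular_values = {}                        # e.g., {"State 3": 1.7, ...}

def tabular_value(state):
    return tabular_values.get(state, 0.0)

# Function approximation: value = weights . features(state).
# Useful when states are continuous or too numerous to enumerate.
weights = np.zeros(3)                      # hypothetical 3-dimensional feature vector

def features(state):
    """Hypothetical feature mapping; in practice this is problem-specific."""
    position, velocity = state
    return np.array([1.0, position, velocity])   # bias term + raw features

def approx_value(state):
    return float(weights @ features(state))

def approx_update(state, target, step_size=0.1):
    """Gradient-style update: adjusting the weights also shifts similar states."""
    global weights
    weights = weights + step_size * (target - approx_value(state)) * features(state)
```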

The RL Specialization from Coursera relates many of the RL algorithms/approaches to the features above in an eloquent way, which I’ve adapted into the chart below:


Going Deeper

Although there are many more nuances of RL to cover, this is as in-depth as I'll go on my site explaining the general concepts of RL. For those that want to continue learning about RL, I'd encourage you to check out the RL resources page I've put together (linked below). I specifically recommend the RL textbook Reinforcement Learning: An Introduction (Second Edition) by Sutton and Barto. I found it to be the most comprehensive source of RL material, and it pairs very well with the RL Coursera Specialization.