Basics of reinforcement learning

A discussion of various core topics in RL.

Ashish Gaurav

Changelog
  • (2024-01-12) Added basic definitions of MDP framework
  • (2024-01-15) Added Bellman recurrence for value functions

Reinforcement learning is a subfield of machine learning that models various scenarios as decision-making processes. Typically, the objective is either to predict future decisions or to learn to take optimal actions that maximize some notion of reward.

MDP Framework #

A decision-making scenario is often formulated as a Markov decision process (MDP), or a variant thereof. In the most basic formulation, where the states $s\in\mathcal S$ are fully observable, an agent takes actions $a\in\mathcal A$ in an environment. The interaction happens in episodes, each of which yields a trajectory of transitions.

  • The agent observes an initial state $S_0\sim \mu$ (where $\mu$ is the initial state distribution)
  • The agent takes an action $A_0$ in this state $S_0$ and transitions to state $S_1\sim \mathbf p(S_1|S_0,A_0)$
  • The agent takes the action $A_1$ in the state $S_1$ and transitions to $S_2\sim \mathbf p(S_2|S_1,A_1)$, and this continues until the end of the episode
  • Every transition to the next state is governed by the same next-state distribution $\mathbf p(S^\prime|S,A)$
    • $S\in\mathcal S$ is the current state
    • $A\in \mathcal A$ is the action taken in the current state
    • $S^\prime\in\mathcal S$ is the next state sampled from $\mathbf p$
  • The episode ends if
    • the agent reaches a terminal state $S_T$, or
    • the horizon $H$ is reached (for a finite horizon setting)

The trajectory of transitions that arises from this episode is, $$ \tau := \langle (S_0,A_0,S_1), (S_1,A_1,S_2),\cdots,(S_{T-1},A_{T-1},S_T) \rangle $$
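
As a concrete sketch, the interaction loop above can be written as a short rollout routine. The `env` object with `reset()` and `step(action)` methods and the `policy` function below are hypothetical interfaces (loosely in the style of common RL simulators), not part of any particular library.

```python
def rollout(env, policy, horizon=1000):
    """Collect one episode as a list of transitions (S_t, A_t, S_{t+1}).

    Assumes env.reset() samples and returns S_0 ~ mu, and
    env.step(a) returns (next_state, terminal_flag); both are
    illustrative interfaces.
    """
    trajectory = []
    state = env.reset()                           # S_0 ~ mu
    for _ in range(horizon):                      # finite-horizon cutoff H
        action = policy(state)                    # A_t chosen by the policy
        next_state, terminal = env.step(action)   # S_{t+1} ~ p(.|S_t, A_t)
        trajectory.append((state, action, next_state))
        if terminal:                              # terminal state S_T reached
            break
        state = next_state
    return trajectory
```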

Given a reward function $r:\mathcal S\times \mathcal A\times\mathcal S \rightarrow \mathbb R$, the return is defined as:

$$ G(\tau) := \sum_{t=0}^{T-1} \gamma^t r(S_t, A_t, S_{t+1}), \qquad 0\leq \gamma \leq 1 $$ where $\gamma$ is the discount factor.
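
For instance, given a recorded trajectory and a reward function, the return can be computed directly from this definition (a minimal sketch; `r` and `gamma` are whatever the problem specifies):

```python
def discounted_return(trajectory, r, gamma=0.99):
    """G(tau) = sum_t gamma^t * r(S_t, A_t, S_{t+1})."""
    return sum(gamma ** t * r(s, a, s_next)
               for t, (s, a, s_next) in enumerate(trajectory))
```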

The agent’s actions are chosen according to a policy $\pi$, which can be:

  • deterministic, i.e. $\pi: \mathcal S\rightarrow \mathcal A$, or
  • stochastic, i.e. $\pi:\mathcal S\rightarrow \Delta(\mathcal A)$, where $\Delta(\mathcal A)$ denotes the set of probability distributions over $\mathcal A$
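
As a toy illustration (a made-up scenario with integer states and two actions), a deterministic policy is just a function from states to actions, while a stochastic policy returns a sample from a distribution over actions:

```python
import random

def pi_deterministic(state):
    """pi: S -> A, one fixed action per state."""
    return "right" if state < 5 else "left"

def pi_stochastic(state):
    """pi: S -> Delta(A), here sampled rather than returned as a distribution."""
    probs = {"left": 0.3, "right": 0.7}            # pi(.|state)
    actions = list(probs)
    return random.choices(actions, weights=[probs[a] for a in actions])[0]
```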

Given the policy, the probability of this trajectory can be written as, $$ \begin{align*} p(\tau) &= p(S_0) p(A_0|S_0) p(S_1|S_0,A_0) p(A_1|S_1) p(S_2|S_1,A_1) \cdots\\ \end{align*} $$
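
In a tabular setting this factorization can be evaluated directly. The sketch below assumes `mu`, `pi`, and `p` are given as dictionaries of probabilities (a hypothetical data layout) and works in log space for numerical stability:

```python
import math

def log_prob_trajectory(trajectory, mu, pi, p):
    """log p(tau) = log mu(S_0) + sum_t [log pi(A_t|S_t) + log p(S_{t+1}|S_t, A_t)].

    Assumed layout: mu[s], pi[s][a], and p[(s, a)][s_next] hold the
    corresponding probabilities.
    """
    s0 = trajectory[0][0]
    logp = math.log(mu[s0])
    for s, a, s_next in trajectory:
        logp += math.log(pi[s][a]) + math.log(p[(s, a)][s_next])
    return logp
```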

It is also helpful to define the value and action-value functions for a policy $\pi$: $$ \begin{align*} V^\pi(s) &:= \mathbb E_{\tau\sim\pi}[G(\tau)|S_0=s]\\ Q^\pi(s,a) &:= \mathbb E_{\tau\sim\pi}[G(\tau)|S_0=s,A_0=a] \end{align*} $$
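
Both expectations can be approximated by Monte Carlo: roll out many episodes that start in the conditioning state and average the resulting returns. The sketch below assumes a hypothetical `env.reset(state=...)` that lets the initial state be chosen, alongside the same illustrative `env.step` interface as before:

```python
def mc_value_estimate(env, policy, r, start_state, gamma=0.99,
                      n_episodes=1000, horizon=1000):
    """Monte Carlo estimate of V^pi(start_state).

    Assumes env.reset(state=...) sets S_0 and env.step(a) returns
    (next_state, terminal_flag); both interfaces are illustrative.
    """
    total = 0.0
    for _ in range(n_episodes):
        state = env.reset(state=start_state)      # condition on S_0 = s
        episode_return, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(state)
            next_state, terminal = env.step(action)
            episode_return += discount * r(state, action, next_state)
            discount *= gamma
            if terminal:
                break
            state = next_state
        total += episode_return
    return total / n_episodes
```

Fixing $A_0=a$ on the first step before handing control back to $\pi$ gives the corresponding estimate of $Q^\pi(s,a)$.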

Finally, the MDP is defined as a 5-tuple:

$$ \mathcal M := (\mathcal S,\mathcal A, r, \mathbf p, \gamma) $$
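
In code, these five components can be bundled into a single object; a minimal sketch using a named tuple (field names are arbitrary):

```python
from typing import Callable, NamedTuple

class MDP(NamedTuple):
    states: list        # S
    actions: list       # A
    r: Callable         # reward function r(s, a, s') -> float
    p: dict             # dynamics: p[(s, a)][s_next] -> probability
    gamma: float        # discount factor
```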

Bellman recurrence #

Value functions satisfy the following recurrence by definition, also called the Bellman recurrence (see Sutton & Barto, Reinforcement Learning: An Introduction, Ch. 3):

$$ \begin{align*} V^\pi(s) &= \mathbb E_{\tau\sim\pi}[G(\tau)|S_0=s]\\ &= \sum_{a} \pi(a|s) \sum_{s^\prime\in \mathcal S}\mathbf p(s^\prime|s,a)\,\mathbb E_{\tau\sim\pi}[r(s,a,s^\prime)+\gamma G(\tau)|S_0=s^\prime]\\ &= \sum_{a} \pi(a|s) \sum_{s^\prime\in \mathcal S}\mathbf p(s^\prime|s,a)[r(s,a,s^\prime)+\gamma \mathbb E_{\tau\sim\pi}[G(\tau)|S_0=s^\prime]]\\ &= \sum_{a} \pi(a|s) \sum_{s^\prime\in \mathcal S}\mathbf p(s^\prime|s,a)[r(s,a,s^\prime)+\gamma V^\pi(s^\prime)] \end{align*} $$ The second line expands the expectation over the first action and transition (using the Markov property), and the last line substitutes the definition of $V^\pi(s^\prime)$.
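
Read as a fixed-point equation, this recurrence gives a direct way to compute $V^\pi$ in the tabular case: start from arbitrary values and repeatedly apply the right-hand side until convergence (iterative policy evaluation). The two-state MDP and uniform policy below are made up purely for illustration:

```python
# Iterative policy evaluation on a made-up two-state MDP.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9

# Dynamics p[(s, a)][s_next]; each row sums to 1.
p = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.8, "s1": 0.2},
}

def r(s, a, s_next):
    return 1.0 if s_next == "s1" else 0.0      # reward for landing in s1

def pi(a, s):
    return 0.5                                  # uniform policy pi(a|s)

V = {s: 0.0 for s in states}                    # arbitrary initialization
for _ in range(1000):
    # Apply the right-hand side of the Bellman recurrence to every state.
    V_new = {
        s: sum(
            pi(a, s) * sum(
                p[(s, a)][s_next] * (r(s, a, s_next) + gamma * V[s_next])
                for s_next in states
            )
            for a in actions
        )
        for s in states
    }
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-8:   # converged
        V = V_new
        break
    V = V_new

print(V)   # approximate V^pi(s0), V^pi(s1)
```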