Reinforcement Learning Basics
A long introduction to RL is presented in [1]; I have selected and adapted some of its content in the following.
Basic Concepts
The agent acts in an environment. How the environment reacts to the agent's actions is defined by a model, which we may or may not know. The model defines the reward function and the transition probabilities.

When we know the model, the optimal solution can be found via dynamic programming.

When we don’t know the model, we can either (1) learn the model explicitly, or (2) do model-free RL.
The agent can stay in one of many states (\(s\in \mathcal S\)) of the environment, and choose to take one of many actions (\(a\in \mathcal A\)) to switch from one state to another. Which state the agent will arrive in is decided by transition probabilities between states (\(\mathcal P\)). Once an action is taken, the environment delivers a reward (\(r\in \mathcal R\)) as feedback [1].
The agent’s policy \(\pi(s)\) provides the guideline on which action to take in a given state, with the goal of maximizing the total rewards. A value function \(V(s)\) represents the expected return obtained by following policy \(\pi\) from state \(s\). In other words, the value function quantifies how good a state is. Both the policy and the value function are what we try to learn in reinforcement learning.
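To make the ingredients above concrete, here is a minimal sketch (not from the source) of a hypothetical two-state MDP, with the state space \(\mathcal S\), action space \(\mathcal A\), transition probabilities \(\mathcal P\), and rewards \(\mathcal R\) written out as plain Python data structures:

```python
S = ["s0", "s1"]        # state space
A = ["stay", "move"]    # action space

# P[(s, a)] is a list of (next_state, probability) pairs: the
# transition probabilities for taking action a in state s.
P = {
    ("s0", "stay"): [("s0", 1.0)],
    ("s0", "move"): [("s1", 0.9), ("s0", 0.1)],
    ("s1", "stay"): [("s1", 1.0)],
    ("s1", "move"): [("s0", 0.9), ("s1", 0.1)],
}

# R[(s, a)] is the immediate reward delivered by the environment
# when the agent takes action a in state s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 0.5,
    ("s1", "move"): 0.0,
}

# Sanity check: each transition distribution must sum to 1.
assert all(abs(sum(p for _, p in dist) - 1.0) < 1e-9 for dist in P.values())
```

The particular states, actions, and numbers are made up for illustration; real environments expose these quantities implicitly through interaction rather than as explicit tables.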
Basic Components
Value Function
The future reward, also known as return, is a total sum of discounted rewards going forward.
\[G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+ \cdots =\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}\]
The state-value of a state \(s\) is the expected return if we are in this state at time \(t\), \(S_t = s\):
\[V_{\pi}(s) = \mathbb{E}_{\pi}[G_t \vert S_t = s]\]
Similarly, we define the action-value (“Q-value”) of a state-action pair as:
\[Q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \vert S_t = s, A_t = a]\]
The difference between the action-value and the state-value is the action advantage function (“A-value”):
\[A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s)\]
Approaches
Dynamic Programming
When the model is fully known, we can, following the Bellman equations, use Dynamic Programming (DP) to iteratively evaluate the value functions and improve the policy.
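As a sketch of the DP idea, here is value iteration on a hypothetical two-state MDP (the states, rewards, and transition probabilities are made up for illustration): we repeatedly apply the Bellman optimality backup \(V(s) \leftarrow \max_a \big[ R(s,a) + \gamma \sum_{s'} P(s' \vert s,a) V(s') \big]\) until \(V\) converges, then read off the greedy policy.

```python
gamma = 0.9  # discount factor

# P[s][a] -> list of (prob, next_state); R[s][a] -> immediate reward.
P = {
    "s0": {"stay": [(1.0, "s0")], "move": [(1.0, "s1")]},
    "s1": {"stay": [(1.0, "s1")], "move": [(1.0, "s0")]},
}
R = {
    "s0": {"stay": 0.0, "move": 1.0},
    "s1": {"stay": 0.5, "move": 0.0},
}

V = {s: 0.0 for s in P}  # initialize V(s) = 0
for _ in range(1000):
    # Bellman optimality backup for every state (synchronous sweep).
    V = {
        s: max(
            R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
            for a in P[s]
        )
        for s in P
    }

# Greedy policy with respect to the converged value function.
pi = {
    s: max(P[s], key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
    for s in P
}
# For this toy MDP, pi = {"s0": "move", "s1": "stay"}.
```

Policy iteration, the other classic DP method, alternates full policy evaluation with greedy policy improvement instead of folding the max into every backup.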
Monte-Carlo Methods
Monte-Carlo (MC) methods use a simple idea: they learn from episodes of raw experience without modeling the environment dynamics. MC uses the observed mean return as an approximation of the expected return. To compute the empirical return \(G_t\), MC methods need to learn from complete episodes \(S_1, A_1, R_2, \dots, S_T\), and all episodes must eventually terminate.
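A minimal sketch of first-visit MC prediction under these assumptions: each episode is a hypothetical list of (state, reward) pairs (the reward being \(R_{t+1}\), received on leaving \(S_t\)), returns are computed backward from the terminal step, and \(V(s)\) is estimated as the mean of the returns observed at first visits to \(s\).

```python
from collections import defaultdict

gamma = 1.0  # undiscounted for simplicity

# Hypothetical complete (terminated) episodes of (state, reward) pairs.
episodes = [
    [("s0", 1.0), ("s1", 0.0), ("s0", 2.0)],
    [("s1", 1.0), ("s0", 3.0)],
]

returns = defaultdict(list)
for episode in episodes:
    # Record the first-visit index of each state in this episode.
    firsts = {}
    for t, (s, _) in enumerate(episode):
        firsts.setdefault(s, t)
    # Compute returns backward: G_t = R_{t+1} + gamma * G_{t+1}.
    G = 0.0
    Gs = [0.0] * len(episode)
    for t in reversed(range(len(episode))):
        G = episode[t][1] + gamma * G
        Gs[t] = G
    for s, t in firsts.items():
        returns[s].append(Gs[t])

# V(s) is the observed mean return over episodes.
V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
```

The "every-visit" variant averages the returns from all visits to \(s\) rather than only the first one per episode.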
Temporal-Difference Learning
Similar to Monte-Carlo methods, Temporal-Difference (TD) learning is model-free and learns from episodes of experience. However, TD learning can learn from incomplete episodes, so we don’t need to track an episode all the way to termination.
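A sketch of the tabular TD(0) update that makes this concrete: after every single transition \((s, r, s')\) we move \(V(s)\) toward the bootstrapped target \(r + \gamma V(s')\), with no need to wait for the episode to end. The transitions and step sizes here are made up for illustration.

```python
gamma, alpha = 0.9, 0.1           # discount factor and learning rate
V = {"s0": 0.0, "s1": 0.0}        # tabular value estimates

# Hypothetical stream of observed (state, reward, next_state) transitions.
transitions = [("s0", 1.0, "s1"), ("s1", 0.0, "s0"), ("s0", 1.0, "s1")]

for s, r, s_next in transitions:
    td_target = r + gamma * V[s_next]   # bootstrapped one-step target
    td_error = td_target - V[s]         # the TD error, delta_t
    V[s] += alpha * td_error            # incremental update toward target
```

SARSA and Q-Learning below apply the same one-step bootstrapping idea to the action-value \(Q(s, a)\), on-policy and off-policy respectively.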
SARSA
Q-Learning
Deep Q-Network (DQN)
Policy Gradient
All the above methods aim to learn the state/action value function and then select actions accordingly (i.e., those with the maximum expected return \(G_t\)). Policy Gradient methods instead learn the policy directly as a function parameterized with respect to \(\theta\), \(\pi(a \vert s; \theta)\).
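The core of these methods is the score-function (REINFORCE-style) update: nudge \(\theta\) along \(\nabla_\theta \log \pi(a \vert s; \theta) \, G_t\), which raises the probability of actions that led to high returns. A minimal sketch for a softmax policy over two actions in a single state, with hypothetical sampled (action, return) pairs:

```python
import math

theta = [0.0, 0.0]   # one preference per action (single-state policy)

def pi(theta):
    """Softmax policy pi(a; theta) over the two actions."""
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [p / total for p in z]

def grad_log_pi(theta, a):
    """grad_theta log pi(a; theta); for softmax this is e_a - pi."""
    probs = pi(theta)
    return [(1.0 if i == a else 0.0) - probs[i] for i in range(len(theta))]

alpha = 0.1
# Hypothetical sampled (action, return G_t) pairs from rollouts.
samples = [(0, 2.0), (1, -1.0), (0, 1.0)]
for a, G in samples:
    g = grad_log_pi(theta, a)
    theta = [t + alpha * G * gi for t, gi in zip(theta, g)]
# Action 0 yielded higher returns, so its probability has increased.
```

In practice a baseline (often \(V(s)\), giving the advantage \(A(s, a)\) defined earlier) is subtracted from \(G_t\) to reduce the variance of this gradient estimate.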
Actor-Critic
If the value function is learned in addition to the policy, we get the Actor-Critic algorithm.
- Critic: updates the value-function parameters \(w\); depending on the algorithm, it could be the action-value \(Q(a \vert s; w)\) or the state-value \(V(s; w)\).
- Actor: updates the policy parameters \(\theta\), in the direction suggested by the critic, for \(\pi(a \vert s; \theta)\).
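A sketch of one actor-critic step under this description, with a tabular critic \(V(s; w)\) and a softmax actor with per-(state, action) preferences \(\theta\); all states, actions, and step sizes are made up for illustration. The critic's TD error plays the role of the advantage signal that scales the actor's update.

```python
import math

states, actions = ["s0", "s1"], [0, 1]
w = {s: 0.0 for s in states}                            # critic parameters
theta = {(s, a): 0.0 for s in states for a in actions}  # actor parameters
gamma, alpha_w, alpha_theta = 0.9, 0.5, 0.1

def pi(s):
    """Softmax policy pi(a | s; theta)."""
    z = [math.exp(theta[(s, a)]) for a in actions]
    tot = sum(z)
    return [p / tot for p in z]

def step(s, a, r, s_next):
    """One actor-critic update from an observed transition (s, a, r, s')."""
    td_error = r + gamma * w[s_next] - w[s]    # critic's TD error
    w[s] += alpha_w * td_error                 # critic: move V(s) to target
    probs = pi(s)
    for b in actions:                          # actor: follow the critic's
        grad = (1.0 if b == a else 0.0) - probs[b]   # grad log pi(a|s)
        theta[(s, b)] += alpha_theta * td_error * grad

step("s0", 0, 1.0, "s1")   # a positive TD error reinforces action 0 in s0
```

Because the critic's one-step TD error replaces the full-episode return \(G_t\), the actor can be updated online after every transition, just as in TD learning.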
Ref.
[1] Lilian Weng, A (Long) Peek into Reinforcement Learning, Blog.