Off-Policy
Off-policy learning means that the policy used to generate new samples differs from the policy whose parameters the network is updating. A typical example is the Q-learning algorithm.
The off-policy idea
Off-policy means that the learned policy (the target policy) differs from the policy used for sampling (the behavior policy). A large amount of behavior data is first generated under some probability distribution, and the target policy is then extracted from this data, even though the data was collected "off" the target policy's own distribution.
This scheme requires a coverage condition: let π be the target policy and μ the behavior policy; then π can be learned from samples of μ only if, whenever π(a|s) > 0, μ(a|s) > 0 also holds.
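For tabular policies, this coverage condition can be checked numerically. Below is a minimal sketch, assuming policies are stored as |S| × |A| arrays of action probabilities; the function name `covers` and the array sizes are illustrative, not from the original text.

```python
import numpy as np

def covers(pi, mu):
    """Coverage condition for off-policy learning: wherever the target
    policy pi(a|s) > 0, the behavior policy mu(a|s) must also be > 0.
    Both arguments are |S| x |A| arrays of action probabilities."""
    return bool(np.all((pi == 0) | (mu > 0)))

# A greedy (deterministic) target policy is covered by an epsilon-greedy
# behavior policy, since epsilon-greedy gives every action a nonzero
# probability in every state.
n_states, n_actions = 4, 2
pi = np.zeros((n_states, n_actions))
pi[:, 0] = 1.0                                   # always take action 0
epsilon = 0.1
mu = np.full((n_states, n_actions), epsilon / n_actions)
mu[:, 0] += 1.0 - epsilon                        # epsilon-greedy around action 0
print(covers(pi, mu))                            # True
```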
Q-learning algorithm
The Q-learning algorithm learns how to choose the next action from the rewards and penalties it perceives. Here Q denotes the quality (action-value) function of the policy π, which maps each state-action pair (s, a) to the expected total future reward obtained after observing state s and taking action a.
Q-learning is model-free: it does not model the dynamics of the MDP, but directly estimates a Q value for each action in each state; the resulting policy then simply selects, in each state, the action with the highest Q value.
If every state-action pair is visited infinitely often (with an appropriately decaying learning rate), Q-learning converges to the optimal Q function.
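To make the update concrete, here is a minimal tabular Q-learning sketch. It assumes a Gymnasium-style environment with discrete observation and action spaces; the function name and hyperparameter values are illustrative, not from the original text. The split between the ε-greedy behavior policy (used to act) and the greedy target policy (the max inside the update) is exactly what makes Q-learning off-policy.

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch, assuming a Gymnasium-style env:
    reset() -> (state, info), step(a) -> (s', r, terminated, truncated, info),
    with discrete observation and action spaces."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy (exploratory).
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # Target policy: greedy, via the max over next actions.
            target = r + gamma * (0.0 if terminated else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```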
Advantages of off-policy learning
- It can learn from demonstration samples provided by humans or from guidance samples provided by other agents;
- Experience generated by old policies can be reused (see the replay-buffer sketch after this list);
- A deterministic policy can be learned while following an exploratory policy;
- A single policy can be used to sample while learning multiple policies simultaneously.
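The second advantage, reusing experience from old policies, is what experience replay exploits in algorithms such as DQN. A minimal sketch of a replay buffer, with illustrative names and capacity, might look like this:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience-replay buffer (illustrative). Transitions
    collected under older policies remain in the buffer and can still
    be used to update the current policy, which is only valid because
    off-policy methods do not require data from the learned policy."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop off

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniformly sample a mini-batch of stored transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```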