On Policy
On-policy ("same strategy") means that the policy used to generate samples is the same as the policy being updated. The agent selects its next action according to the current policy, executes it, and then uses the resulting sample to update that same policy: the policy that generates the data and the policy being learned are one and the same.
SARSA algorithm
SARSA (State-Action-Reward-State-Action) is an algorithm for learning a Markov decision process policy, commonly used in reinforcement learning.
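Concretely, SARSA updates its action-value estimate at every step with the standard temporal-difference rule below (α is the learning rate and γ the discount factor; the symbols follow the usual convention rather than anything defined in this text):

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r + \gamma\, Q(s', a') - Q(s, a) \big]
$$

The quintuple (s, a, r, s', a'), i.e. state, action, reward, next state, next action, is what gives the algorithm its name.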
Key points of the SARSA algorithm
- When in state s', the agent already knows which action a' it will take, and it actually takes that action;
- Action a is selected by the ε-greedy policy, and the target Q value is computed from the action a' that the same ε-greedy policy selects, so SARSA is on-policy learning (see the sketch after this list).
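To make the on-policy structure concrete, here is a minimal tabular SARSA sketch in Python. The toy chain environment, hyperparameters, and helper names (`step`, `epsilon_greedy`) are illustrative assumptions, not from the original text; the point is that a' is drawn from the same ε-greedy policy that is then actually executed.

```python
# A minimal tabular SARSA sketch on a hypothetical 1-D chain environment.
import random

N_STATES = 5          # states 0..4; state 4 is the goal (reward 1)
ACTIONS = [0, 1]      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # illustrative hyperparameters

Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

def step(s, a):
    """Hypothetical chain dynamics: reach state 4 for reward 1."""
    s2 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def epsilon_greedy(s):
    """Behavior policy: explore with probability EPSILON, else exploit."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[s][a])

for episode in range(200):
    s = 0
    a = epsilon_greedy(s)          # a comes from the current policy
    done = False
    while not done:
        s2, r, done = step(s, a)
        a2 = epsilon_greedy(s2)    # a' comes from the SAME epsilon-greedy
                                   # policy, which makes this on-policy
        # TD target uses Q(s', a'), the action that will really be taken
        target = r + (0.0 if done else GAMMA * Q[s2][a2])
        Q[s][a] += ALPHA * (target - Q[s][a])
        s, a = s2, a2              # the chosen a' is then executed

print(Q)
```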
Advantages and disadvantages of on-policy learning
- Advantages: the Q value can be updated at every step, so learning is fast; it also works in continuing tasks with no terminal outcome, giving it a wide range of applications.
- Disadvantages: it faces the exploration-exploitation trade-off. Exploiting only the currently known best action may prevent the agent from ever finding the optimal policy and cause convergence to a local optimum, while adding exploration to avoid this in turn reduces learning efficiency.
On-policy vs. off-policy
The difference between on-policy and off-policy learning is whether the Q value is updated using the same policy that generated the data or a different one.
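For a concrete contrast (Q-learning is not named in the original text, but it is the standard off-policy counterpart): SARSA's TD target evaluates the action the behavior policy actually takes, while Q-learning's target evaluates a greedy policy that may differ from the behavior policy:

$$
\text{SARSA (on-policy):} \quad r + \gamma\, Q(s', a'), \qquad a' \text{ chosen by the } \varepsilon\text{-greedy behavior policy}
$$

$$
\text{Q-learning (off-policy):} \quad r + \gamma \max_{a'} Q(s', a')
$$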