HyperAI

Q-Learning

Q-Learning is a model-free, off-policy reinforcement learning algorithm that finds the best course of action given the agent's current state. Based on where the agent is in the environment, it decides which action to take next. "Q" refers to the function the algorithm learns: the expected reward for an action taken in a given state.

The goal of Q-learning is to find the best course of action given the current state. To do this, the agent may learn from actions outside its current policy, such as random exploratory steps; because the value function it learns does not depend on the policy actually being followed, the method is called "off-policy". For any finite Markov decision process, Q-learning finds an optimal policy that maximizes the expected value of the total reward over all successive steps, starting from the current state. Given infinite exploration time and a partly random policy, Q-learning can identify an optimal action-selection policy for any finite Markov decision process.
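The learning process described above rests on a single update rule; a sketch of the textbook formula, where α is the learning rate and γ is the discount factor:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

Here r is the reward received after taking action a in state s, and s' is the resulting next state. Repeatedly applying this update drives Q toward the expected total reward of acting optimally from each state.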

An ad recommendation system illustrates the idea. In a conventional recommendation system, the viewer gets ads based on previous purchases or websites visited: if the viewer has bought a TV, the system recommends more TVs. A system based on Q-learning can instead optimize for long-term reward, for example recommending complementary products the viewer is more likely to want next.

Important terms in Q-Learning

  1. State: The state S represents the current position of the agent in the environment.
  2. Action: An action is a step taken by an agent when it is in a specific state.
  3. Reward: For each action, the agent receives a positive or negative reward.
  4. Episode: An episode ends when the agent reaches a terminal state and can take no further actions.
  5. Q value: A measure of how good it is to take action A when in a specific state S. It is written Q(S, A).
  6. Temporal Difference: The rule for updating a Q value using the reward just received and the estimated value of the next state and action.
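The terms above can be combined into a minimal tabular Q-learning sketch. The `step(state, action)` environment interface and the toy chain environment below are assumptions for illustration, not part of any library:

```python
import random

def q_learning(n_states, n_actions, step, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.

    `step(state, action)` must return (next_state, reward, done);
    this interface is an assumption of the sketch, not a fixed API.
    """
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0          # assume every episode starts in state 0
        done = False
        while not done:
            # epsilon-greedy: explore with probability epsilon,
            # otherwise take the action with the highest Q value
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])
            s2, r, done = step(s, a)
            # temporal-difference update toward r + gamma * max_a' Q(s', a')
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

# Toy chain environment: states 0..4, action 1 moves right, action 0 left.
# Reaching state 4 yields reward 1 and ends the episode.
def chain_step(s, a):
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

random.seed(0)
Q = q_learning(n_states=5, n_actions=2, step=chain_step)
policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(4)]
print(policy)  # the learned policy should move right in every state
```

Because the greedy target `max(Q[s2])` is used regardless of which action the epsilon-greedy behavior actually took, the sketch learns the optimal policy's values while following a different, exploratory policy, which is exactly the off-policy property discussed above.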