From Failed Landings to Success: My Journey Mastering Actor-Critic in Deep Reinforcement Learning
The journey of implementing Actor-Critic methods for a drone landing task reveals both the power and the pitfalls of deep reinforcement learning. At first glance, the idea seems simple: instead of waiting for an episode to end to learn, use a critic network to estimate future rewards in real time, enabling immediate feedback after every action. This is exactly what Actor-Critic brings to the table: online learning, faster convergence, and better sample efficiency than REINFORCE.

The core insight is bootstrapping. Rather than computing the full return $ G_t = r_t + \gamma r_{t+1} + \cdots $, the critic estimates $ V(s_{t+1}) $, and the TD error becomes $ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $. This error acts as a state-specific advantage signal, telling the actor whether the current action was better or worse than expected. Unlike REINFORCE’s global baseline, this signal is dynamic and context-aware, which makes learning far more efficient. In practice, it meant a 68% success rate in 600 iterations: half the training time of REINFORCE with a 13% better success rate. The key was updating both the actor and the critic at every step, not just at episode ends. But getting there required overcoming three critical bugs that nearly derailed the entire project.

The first was the moving-target problem. Without detaching the next-state value from the computation graph, the critic was trying to optimize both its prediction and its target simultaneously. This created a feedback loop in which the target kept shifting, causing the loss to oscillate wildly. The fix was simple: wrap the next-value computation in torch.no_grad(), so the TD target is treated as a fixed label rather than a trainable variable. Once this was done, the critic loss dropped smoothly from 500 to 8 over 200 iterations.

The second bug was a low discount factor. With $ \gamma = 0.90 $, the effective horizon was only about 10 steps ($ \approx 1/(1-\gamma) $), making the +500 reward for landing effectively invisible after 150 steps ($ 500 \times 0.90^{150} \approx 7 \times 10^{-5} $). The agent had no incentive to land. Raising $ \gamma $ to 0.99 made the terminal reward visible over long horizons, and the agent began to learn meaningful behavior within 50 iterations.

The third and most insidious issue was reward exploitation. The original reward function only looked at the current state: proximity to the platform. The agent quickly learned to exploit this, either zooming past the platform at high speed to collect approach rewards before crashing, or hovering in place with tiny movements to collect small rewards indefinitely. The problem wasn’t the algorithm; it was the reward function. The agent was doing exactly what it was told: maximize the scalar reward. The fix was to reward state transitions, not snapshots. By tracking the previous state and computing the change in distance and the drone’s speed, the reward function could distinguish meaningful progress from reward farming. The new reward gives a large positive signal only when the drone moves toward the platform with sufficient speed and actually reduces its distance; slow micro-movements are penalized. This single change eliminated the exploits and led to clean, purposeful landings.

The result? A robust, high-performing agent that not only landed more often but did so in a way that made sense. The success of Actor-Critic wasn’t due to a new algorithm, but to correct implementation: proper gradient handling, appropriate discounting, and thoughtful reward design.
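To make the per-step update and the torch.no_grad() fix concrete, here is a minimal PyTorch sketch of a single Actor-Critic step. The names (`critic`, `action_log_prob`, the two optimizers) and the function signature are illustrative assumptions, not the project’s actual code; the point is simply that the TD target is computed under torch.no_grad() and the actor loss uses the detached TD error.

```python
import torch
import torch.nn.functional as F

def actor_critic_step(actor_opt, critic_opt, critic,
                      state, action_log_prob, reward, next_state, done,
                      gamma=0.99):
    """One online update after a single environment step (illustrative sketch)."""
    value = critic(state)  # V(s_t)

    # Moving-target fix: the TD target is a fixed label,
    # so no gradients flow through V(s_{t+1}).
    with torch.no_grad():
        next_value = torch.zeros_like(value) if done else critic(next_state)
        td_target = reward + gamma * next_value

    td_error = td_target - value  # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)

    # Critic: regress V(s_t) toward the fixed TD target.
    critic_loss = F.mse_loss(value, td_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: increase the log-probability of the action, weighted by the
    # detached TD error so critic gradients never leak into the actor loss.
    actor_loss = -(action_log_prob * td_error.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```

Calling a function like this after every environment step, rather than once per episode, is exactly what gives Actor-Critic its online, per-step learning advantage over REINFORCE.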
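The discount-factor bug can be verified with a two-line calculation; the snippet below just evaluates the discounting arithmetic from the paragraph above.

```python
# How much of the +500 landing reward is visible 150 steps earlier?
for gamma in (0.90, 0.99):
    print(f"gamma={gamma}: 500 * gamma^150 = {500 * gamma**150:.1e}")

# gamma=0.9  -> ~6.8e-05 : the terminal reward is numerically invisible
# gamma=0.99 -> ~1.1e+02 : still worth about 110 reward units
```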
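Finally, a sketch of the transition-based reward described above. Everything here is an illustrative reconstruction under stated assumptions: the state representation (positions as NumPy arrays), the speed threshold, and the bonus and penalty magnitudes are placeholders, not the values used in the project.

```python
import numpy as np

def transition_reward(prev_pos, pos, platform_pos, dt,
                      min_speed=0.5, progress_scale=10.0,
                      loiter_penalty=-1.0, landing_bonus=500.0, landed=False):
    """Reward the transition (previous state -> current state), not a snapshot.

    A positive signal is given only when the drone both closes distance to the
    platform and moves fast enough that hovering micro-movements cannot farm
    reward. All thresholds and scales are illustrative, not tuned values.
    """
    if landed:
        return landing_bonus                     # terminal success

    prev_dist = np.linalg.norm(platform_pos - prev_pos)
    dist = np.linalg.norm(platform_pos - pos)
    speed = np.linalg.norm(pos - prev_pos) / dt  # distance covered this step

    progress = prev_dist - dist                  # > 0 means it got closer
    if progress > 0 and speed >= min_speed:
        return progress_scale * progress         # meaningful approach
    return loiter_penalty                        # hovering or creeping is punished
```

Because the reward now depends on where the drone came from as well as where it is, zooming past the platform stops paying (the distance starts growing again) and hovering stops paying (progress and speed are both near zero).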
This experience underscores a key truth in reinforcement learning: 90% of the work is reward engineering, and the other 90% is debugging why it didn’t work. The algorithm is only as good as the specification. Actor-Critic, with its real-time feedback and state-specific learning, is a major step forward—but it still depends on a well-designed objective. The next step? Moving to Proximal Policy Optimization (PPO), which further stabilizes training and is widely used in practice, including by OpenAI for training large models.
