A chronicle of how one small step for hyper-parameters became a giant leap for continuous-action control.
What’s harder than parallel parking in traffic?
Landing a moon-lander with throttle values that slide, not click. In the Lunar Lander-Continuous environment you steer two thrusters—main and side—each able to fire anywhere between −1 and +1. Infinite actions, unforgiving physics, and exactly zero parking sensors.
Why classic Q-learning sulks here
Deep Q-Networks shine when you can simply say “action #3,” but they pick actions with an argmax over the whole action set; sweeping an infinite continuum of thrust values on every step would melt your GPU (or at least your patience). We need a policy that outputs real-valued thrust directly. Enter policy-gradient actor–critic land!
Meet today’s hero: Deep Deterministic Policy Gradient (DDPG)
Think of DDPG as two neural-network buddies:
| Role | Job | Fun twist |
|---|---|---|
| Actor | Decides: “Fire main at 0.42, side at –0.08.” | Gets exploratory jitters from Gaussian noise. |
| Critic | Judges that move’s long-term value. | Uses a slowly moving target twin to stay chill. |
Both share an experience-replay scrapbook so they don’t overreact to any single bumpy touchdown.
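For readers who prefer code to metaphors, here is a minimal sketch of the update both buddies run on every sampled batch (PyTorch-style; `actor`, `critic`, their `*_target` twins, the optimizers, and `replay_buffer.sample()` are illustrative stand-ins, not our exact implementation):

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, replay_buffer,
                batch_size=128, gamma=0.99):
    # Sample a batch from the shared experience-replay scrapbook.
    state, action, reward, next_state, done = replay_buffer.sample(batch_size)

    # Critic: regress Q(s, a) toward the bootstrapped target
    # r + gamma * Q'(s', mu'(s')), computed with the slow target twins.
    with torch.no_grad():
        next_action = actor_target(next_state)
        target_q = reward + gamma * (1 - done) * critic_target(next_state, next_action)
    critic_loss = F.mse_loss(critic(state, action), target_q)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: push the policy toward actions the critic rates highly.
    actor_loss = -critic(state, actor(state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # (The target twins then get a slow, soft update toward the online nets.)
```

The minus sign on the actor loss is the whole deterministic-policy-gradient trick: gradient ascent on the critic’s opinion of the actor’s own actions.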
Exploration hack: Instead of the fancy Ornstein-Uhlenbeck process, we simply sprinkled zero-mean Gaussian noise on every thrust command—less tuning, same lunar chaos.
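In code, that hack is basically one line (a sketch; `actor(state)` is assumed to return a NumPy action for the two thrusters, and `sigma` is an illustrative noise scale):

```python
import numpy as np

def noisy_action(actor, state, sigma=0.1, low=-1.0, high=1.0):
    # Deterministic thrust command from the actor, jittered with zero-mean
    # Gaussian noise and clipped back into the valid range [-1, 1].
    action = actor(state)
    action = action + np.random.normal(0.0, sigma, size=np.shape(action))
    return np.clip(action, low, high)
```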
Rocket science, but reproducible
- Training: 1 000 episodes × 1 000 steps max, ~1 h on a humble 4-core CPU.
- Networks: 2 hidden layers, 256 neurons each, ReLU everywhere except a tanh at the actor’s output to keep actions in [−1, 1] (sketched in code after this list).
- Lifesavers: gradient clipping (±5) after a few wild loss spikes.
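Here is roughly what those bullets look like as PyTorch modules. LunarLander’s 8-dimensional observation and 2-dimensional action are the only hard numbers; everything else simply mirrors the list above and is a sketch, not our exact code:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=8, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),  # keeps thrust in [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim=8, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),  # scalar Q-value
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# The lifesaver: clamp every gradient to [-5, 5] right before each optimizer step,
# e.g. torch.nn.utils.clip_grad_value_(critic.parameters(), clip_value=5.0)
```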
First light: the baseline launch log
Early episodes looked like SpaceX prototypes: spectacular hops, frequent explosions, reward graph doing the samba. But by episode ~800 the curve flirted with the “mission-success” line of 200 points average. Default settings finally landed at ≈ 249 ± 57 on the last run—mission technically accomplished!
Turbo-tuning with Optuna
Four knobs went into the Bayesian blender (search space sketched in code below):
- Discount γ (0.90 – 0.99)
- Soft target τ (0.0001 – 0.01)
- Batch size {32, 64, 128}
- Update frequency {1 – 5 actor/critic steps per env step}
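As an Optuna objective, that search space is only a few lines. In this sketch, `train_ddpg` is a placeholder for the full training-and-evaluation loop and is assumed to return the trial’s average reward:

```python
import optuna

def objective(trial):
    # The same four knobs and ranges as the list above.
    gamma = trial.suggest_float("gamma", 0.90, 0.99)
    tau = trial.suggest_float("tau", 1e-4, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [32, 64, 128])
    update_every = trial.suggest_int("update_every", 1, 5)

    # Placeholder: run training and return the mean evaluation reward.
    return train_ddpg(gamma=gamma, tau=tau,
                      batch_size=batch_size, update_every=update_every)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```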
Surprise winners
- γ ≈ 0.985—big-picture thinking pays off, but γ > 0.99 made the agent dreamy and slow.
- τ ≈ 0.0045—fast enough to learn, slow enough to stay zen (the soft-update rule is sketched after this list).
- Batch 128—more samples, smoother gradients, no memory drama.
- Updates every 2-4 steps—too eager and you chase noise; too lazy and you stagnate.
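That τ line is literally one Polyak-averaging step after every update; with τ ≈ 0.0045 the target twins drift just under half a percent of the way toward the online networks each time. A minimal sketch:

```python
def soft_update(online_net, target_net, tau=0.0045):
    # Nudge each target parameter toward its online counterpart:
    # theta_target <- tau * theta_online + (1 - tau) * theta_target
    for param, target_param in zip(online_net.parameters(), target_net.parameters()):
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
```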
Best trial (call sign T-4) cruised to 264 ± 23, with steadier landings and noticeably less run-to-run variance. Not bad for 20 tries!
(Fun fact: a hand-coded PID heuristic still scored ~290, proving that humans with equations can out-fly RL—for now.)
What nearly cratered us
| Oops moment | Quick fix |
|---|---|
| Critic loss skyrocketed → actor followed bad advice | Gradient clipping ±5 |
| Vanishing actor gradients (tanh saturation) | Considering Leaky ReLU next round |
| Critic over-estimates Q (DDPG classic flaw) | TD3 beckons on the roadmap |
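On that last roadmap row: TD3’s cure for over-estimation is to keep two target critics and bootstrap from the more pessimistic one, plus a little smoothing noise on the target action. A sketch of just the target computation, with every module name illustrative:

```python
import torch

def td3_target(reward, done, next_state, actor_target,
               critic1_target, critic2_target,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    with torch.no_grad():
        # Target-policy smoothing: perturb the target action with clipped noise.
        next_action = actor_target(next_state)
        noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-1.0, 1.0)
        # Clipped double-Q: trust the smaller of the two target critics.
        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)
        return reward + gamma * (1 - done) * torch.min(q1, q2)
```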
Lessons from low lunar orbit
- Continuous action ≠ chaos when you hand the wheel to an actor network.
- Noise still matters—Gaussian simplicity worked fine.
- Hyper-parameters are rocket fuel; a tiny τ tweak can tank or skyrocket performance.
- Baseline DDPG is good—but TD3, SAC, or PPO could push beyond heuristic pilots.
Final transmission
“It’s not just about landing softly; it’s about landing softly every time.”
DDPG—with a dash of Optuna pixie dust—does exactly that. Next mission: slap entropy bonuses on a Soft Actor-Critic and see if we can beat the human PID wiz. But for tonight, enjoy the view: our neural network just left the lander upright, engines off, and flags flying. 🌔✨