A light-hearted tour of my AWS DeepRacer reinforcement-learning project
Warm-up Lap — Why race toy cars with RL?
If you can teach a 1/18-scale racer to hug corners, dodge cones and outsmart rival bots, you’ve basically distilled every juicy challenge in autonomous driving into a bite-size sandbox. AWS DeepRacer gives us that sandbox: simulated tracks, stereo cameras, a LIDAR ring and just enough physics to make you fist-pump when the car finishes a lap cleanly. My mission? Train one brain that blasts around three tracks and survives three very different race modes—solo time-trials, static-obstacle slaloms and chaotic head-to-head heats—without rage-quitting into a wall.
Meet the Pit-Crew Algorithms
Proximal Policy Optimization (PPO)—the steady race engineer
I picked PPO (clipped) because it updates the policy in baby steps—like tightening wheel-nuts a quarter-turn at a time instead of yanking them off. The clipped objective keeps the new policy from wandering too far from the safe baseline, which is gold when every bad update sends your car lawn-mowing.
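For the terminally curious, here is a minimal sketch of that clipped surrogate loss in PyTorch. The function name and arguments (`new_log_probs`, `advantages`, `clip_eps`) are illustrative, not the project's actual training code:

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective: only 'quarter-turn' policy updates allowed."""
    # Probability ratio between the new and old policy for each action taken.
    ratio = torch.exp(new_log_probs - old_log_probs)
    # Unclipped vs. clipped objective; the pessimistic minimum caps the update size.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negate because optimizers minimize; we want to maximize the surrogate.
    return -torch.min(unclipped, clipped).mean()
```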
A camera-LIDAR dream team
The agent sees the world through a 3-layer CNN that chews on twin 160 × 120 grayscale images, while a mini-MLP digests a 64-ray LIDAR scan. Mash those together, pass through a 512-unit “combiner,” and you have a fused ego-view that feeds both actor (which button to press) and critic (was that smart?).
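Here's a rough PyTorch sketch of that fusion architecture, going only on the details above (twin 160 × 120 grayscale frames, a 64-ray LIDAR scan, a 3-layer CNN and a 512-unit combiner). The conv channel counts, kernel sizes and the 128-unit LIDAR layer are my guesses, not published specs:

```python
import torch
import torch.nn as nn

class CameraLidarNet(nn.Module):
    """Sketch of the fused camera + LIDAR encoder; layer sizes are illustrative."""
    def __init__(self, n_actions, lidar_rays=64):
        super().__init__()
        # 3-layer CNN over the two stacked 160x120 grayscale frames (2 input channels).
        self.cnn = nn.Sequential(
            nn.Conv2d(2, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Mini-MLP that digests the 64-ray LIDAR scan.
        self.lidar_mlp = nn.Sequential(nn.Linear(lidar_rays, 128), nn.ReLU())
        # Work out the flattened CNN feature size with one dummy forward pass.
        with torch.no_grad():
            cnn_features = self.cnn(torch.zeros(1, 2, 120, 160)).shape[1]
        # 512-unit "combiner" fusing both streams into one ego-view embedding.
        self.combiner = nn.Sequential(nn.Linear(cnn_features + 128, 512), nn.ReLU())
        self.actor = nn.Linear(512, n_actions)   # which button to press
        self.critic = nn.Linear(512, 1)          # was that smart?

    def forward(self, images, lidar):
        fused = self.combiner(
            torch.cat([self.cnn(images), self.lidar_mlp(lidar)], dim=1)
        )
        return self.actor(fused), self.critic(fused)
```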
Track Walk & Scenarios
| Track nickname | Real name | Length | Width | Personality |
|---|---|---|---|---|
| A-Z Speedway | reInvent2019 Wide | 16.6 m | 1.07 m | Friendly oval for first dates |
| Smile Speedway | reInvent2019 Track | 23.1 m | 1.07 m | Twisty grin-shaped sprint |
| Empire City | New York Track | 21.9 m | 0.76 m | Skinny, skyscraper-tight corners |
Each track is tackled in three flavors:
- Time-Trial – no traffic, just vibes
- Obstacle-Avoidance – six static barrels begging for bumper kisses
- Head-to-Bot – three rambunctious opponent cars that never heard of personal space
Reward Shaping — bribing the driver
Think of rewards as snacks you toss at the car to encourage good manners:
- Stay centered: exponential bonus for keeping your wheels near the mid-line
- Make progress: a linear treat every time you inch forward
- Celebrate the finish: big cookie for ≥ 98 % progress, scaled by how few steps you used
- Fear the barrel: exponential penalty when LIDAR screams “too close!”
- Crash tax: -10 if you smack a wall; +20 if you see the chequered flag
Weights? One-third each to center, progress and obstacle penalties—because democracy.
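In code, the shaping might look roughly like this. Treat it as a sketch of the ideas above: the exact decay rates, scaling constants and the step-efficiency bonus are placeholders of mine, not the project's tuned numbers:

```python
import numpy as np

def shaped_reward(dist_from_center, track_width, progress_delta, progress,
                  steps, min_lidar_dist, crashed, finished):
    """Illustrative shaping terms; constants differ from the actual project."""
    # Stay centered: exponential bonus that decays as we drift off the mid-line.
    center = np.exp(-(dist_from_center / (0.5 * track_width)) ** 2)
    # Make progress: a linear treat for every bit of forward motion.
    forward = progress_delta
    # Fear the barrel: exponential penalty as LIDAR reports something too close.
    obstacle = -np.exp(-min_lidar_dist)
    # Democracy: one-third weight each to the three shaping terms.
    reward = (center + forward + obstacle) / 3.0
    # Celebrate the finish: big cookie for >= 98% progress, scaled by step count.
    if finished and progress >= 98.0:
        reward += 20.0 + 100.0 / max(steps, 1)
    # Crash tax for smacking a wall.
    if crashed:
        reward -= 10.0
    return float(reward)
```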
Training Regimen — from baby steps to Nürburgring
- Fixed start on A-Z – learn to drive in a straight line without crying.
- Random starts – spawn the agent anywhere on the track so it can’t memorise scenery.
- Random direction + harder tracks – reverse laps and graduate to the slim-fit Empire City.
Twenty thousand simulation steps (~2 hours on a plain laptop) with PPO hyperparameters straight from the OpenAI cookbook (γ = 0.99, ε = 0.2, 5 epochs per 512-step rollout).
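Spelled out as a config (key names are mine, values come from the recipe above), the regimen looks something like:

```python
# Curriculum stages: the track mix per stage is my reading of the text.
curriculum = [
    {"track": "A-Z Speedway", "start": "fixed",  "direction": "forward"},   # learn to drive
    {"track": "A-Z Speedway", "start": "random", "direction": "forward"},   # no memorizing scenery
    {"tracks": ["Smile Speedway", "Empire City"], "start": "random", "direction": "random"},
]

# PPO hyperparameters as quoted in the post.
ppo_config = {
    "total_steps": 20_000,    # ~2 hours on a plain laptop
    "gamma": 0.99,            # discount factor
    "clip_eps": 0.2,          # PPO clipping range
    "epochs_per_rollout": 5,  # gradient epochs per update
    "rollout_length": 512,    # steps collected before each update
}
```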
Race-Day Results
Time-Trials: Flawless victory
- 100 % lap completion on all tracks
- Lap times: 92 s (A-Z), 302 s (Smile), 431 s (Empire)
Obstacle slalom: Traffic-cone trauma
- Completion plummets to 15–22 %. The car either tip-toes till timeout or swerves into the wall after a heroic dodge.
Head-to-Head: Bumper-car chaos
- Completion: ~50–60 % on the roomy tracks, nosedives to ≈30 % on skinny Empire City. Opponent body-checks are brutal.
Moral: Our solo-racing prodigy panics when the playground gets crowded.
Lessons from the Crash Logs
- Curriculum blind spot: Training never showed the agent a single cone or rival, so evaluation felt like throwing a toddler onto the Autobahn.
- Reward tug-of-war: “Stay centered” vs. “swerve to live” sends mixed signals; weight tuning is black-art territory.
- Tiny data diet: 20 k steps are plenty for smooth solo laps but peanuts for learning 4-D chess with moving bots.
Next Upgrades
- Multi-task curriculum – sprinkle obstacles and bots during training; maybe even self-play.
- Lean rewards – axe the micromanagement; reward progress & penalize collisions, let the network figure out the jazz steps.
- Opponent modelling – give the car a crystal ball (recurrent net) to guess rival moves.
- More track time & hyper-tuning – because GPUs don’t need sleep.
Checkered Flag
We turned a timid toy into a track-lapping champ—as long as nobody else shows up. The project proves that curriculum learning plus PPO can nail geometry generalization, yet true robustness demands training that mirrors real-world mayhem. Next season, the car’s getting street-smarts, thicker skin and maybe a flamenco horn for overtakes. Stay tuned! 🏁