Choosing the Right Reinforcement Learning Framework
Stable Baselines3, Ray RLlib, or something else? This guide explains what reinforcement learning is, when it's useful for robotics, and how to choose and get started with the right framework.
Reinforcement learning (RL) is one of the most exciting areas of robotics research. It’s the technology behind robots that learn to walk, manipulate objects, and play games — all without being explicitly programmed with the rules. But it’s also one of the most misunderstood.
This guide will give you a clear picture of what RL is, when it makes sense to use it, and how to choose a framework to get started.
What is Reinforcement Learning?
In reinforcement learning, an agent learns to take actions in an environment to maximize a cumulative reward. The agent starts knowing nothing and learns through trial and error — taking actions, observing what happens, and gradually figuring out which actions lead to better outcomes.
The key components:
- Agent: the learner/decision-maker (your robot)
- Environment: everything the agent interacts with
- State: what the agent can observe about the environment
- Action: what the agent can do
- Reward: a scalar signal telling the agent how well it’s doing
- Policy: the agent’s strategy (what action to take in each state)
The goal is to learn a policy that maximizes the total reward over time.
[Figure: the agent-environment loop. The agent sends an action to the environment; the environment returns a new state and a reward. Repeat, episode after episode, until the policy improves.]
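To make the loop concrete, here is a minimal sketch in plain Python with no RL library involved. The corridor environment, the two policies, and the reward values are all made up for illustration; the discount factor gamma weights later rewards less than earlier ones when summing the total.

```python
import random


class CorridorEnv:
    """Toy 1-D corridor (hypothetical): start at cell 0, finish at the last cell."""

    def __init__(self, length=5, max_steps=50):
        self.length = length
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position  # the state is just the current cell index

    def step(self, action):
        # action: 0 = move left, 1 = move right (clamped at the walls)
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        self.steps += 1
        reached_goal = self.position == self.length - 1
        reward = 1.0 if reached_goal else -0.1  # small cost for every extra step
        done = reached_goal or self.steps >= self.max_steps
        return self.position, reward, done


def random_policy(state, rng):
    """A policy maps a state to an action; this one ignores the state."""
    return rng.choice([0, 1])


def always_right(state, rng):
    """A hand-written policy that happens to be optimal for this corridor."""
    return 1


def run_episode(env, policy, rng, gamma=0.99):
    """Run one episode and return (discounted return, number of steps)."""
    state = env.reset()
    discounted_return, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state, rng)             # agent acts
        state, reward, done = env.step(action)  # environment responds
        discounted_return += discount * reward
        discount *= gamma
    return discounted_return, env.steps


rng = random.Random(0)
env = CorridorEnv()
for name, policy in [("random", random_policy), ("always_right", always_right)]:
    ret, steps = run_episode(env, policy, rng)
    print(f"{name}: return={ret:.3f} after {steps} steps")
```

An RL algorithm automates the move from the first policy toward the second: it adjusts the policy so that the return goes up.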
When to Use RL for Robotics
RL is powerful but not always the right tool. Use it when:
- The task is too complex to program explicitly (e.g., learning to walk on uneven terrain)
- You have a good simulation environment to train in
- You can define a clear reward function
- You have significant compute resources and time
Don’t use RL when:
- A simple control algorithm (PID, etc.) would work fine
- You can’t simulate the environment accurately
- You need guaranteed safety (RL policies can behave unpredictably)
- You need to explain the robot’s decisions
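For a sense of what "a simple control algorithm" looks like, here is a rough sketch of a textbook PID controller holding a velocity setpoint. The gains, the time step, and the first-order plant model are assumptions chosen only so the example runs; a real robot would need tuning and usually anti-windup as well.

```python
class PID:
    """Textbook PID controller with a fixed time step (no anti-windup)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0  # no previous sample yet
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps=1000, dt=0.01, target=1.0):
    """Drive a hypothetical first-order plant toward `target`; return the final value."""
    pid = PID(kp=5.0, ki=5.0, kd=0.0, dt=dt)
    velocity = 0.0
    for _ in range(steps):
        command = pid.update(target, velocity)
        # Assumed plant: velocity relaxes toward the commanded value.
        velocity += (command - velocity) * dt
    return velocity


print(f"final velocity: {simulate():.4f}")
```

If a few dozen lines like these meet your requirements, training a policy is probably unnecessary.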
The Frameworks
Stable Baselines3 (SB3)
Best for: beginners, research, quick experiments
Stable Baselines3 is the most beginner-friendly RL library. It implements the most popular algorithms (PPO, SAC, TD3, A2C, DDPG) with clean, well-documented APIs. If you’re learning RL or doing research that doesn’t require massive scale, start here.
import gymnasium as gym
from stable_baselines3 import PPO

# Create the training environment (no rendering, so training stays fast)
env = gym.make("CartPole-v1")

# Create and train the agent
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)

# Evaluate the trained agent in a separate environment that opens a window;
# with render_mode="human", Gymnasium renders on every step() call
eval_env = gym.make("CartPole-v1", render_mode="human")
obs, _ = eval_env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = eval_env.step(action)
    if terminated or truncated:
        obs, _ = eval_env.reset()
eval_env.close()
env.close()
Pros: Simple API, excellent documentation, well-tested implementations
Cons: Single-machine only, not designed for large-scale distributed training
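One practical detail when picking among the SB3 algorithms: not all of them handle every action space. The lookup below is a small sketch of that constraint as I understand it from the SB3 documentation (DQN is included for comparison; the helper name and the two category labels are my own). Check the algorithm table in the docs for your installed version before relying on it.

```python
# Action-space support per SB3 algorithm (simplified to two categories).
SUPPORTED_ACTIONS = {
    "PPO": {"discrete", "continuous"},
    "A2C": {"discrete", "continuous"},
    "DQN": {"discrete"},
    "SAC": {"continuous"},
    "TD3": {"continuous"},
    "DDPG": {"continuous"},
}


def candidate_algorithms(action_type):
    """Return the algorithm names that can handle the given action type."""
    if action_type not in {"discrete", "continuous"}:
        raise ValueError(f"unknown action type: {action_type!r}")
    return sorted(
        name for name, kinds in SUPPORTED_ACTIONS.items() if action_type in kinds
    )


print("discrete (e.g. CartPole):", candidate_algorithms("discrete"))
print("continuous (e.g. joint velocities):", candidate_algorithms("continuous"))
```

PPO appears in both lists, which is part of why it makes a reasonable first choice.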
Ray RLlib
Best for: production systems, distributed training, large-scale experiments
Ray RLlib is built on top of the Ray distributed computing framework. It supports dozens of algorithms and can scale from a laptop to a cluster of hundreds of machines. It’s more complex than SB3 but much more powerful.
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .env_runners(num_env_runners=4)
    .training(
        train_batch_size=4000,
        lr=3e-4,
        gamma=0.99,
    )
)

algo = config.build_algo()
for i in range(10):
    result = algo.train()
    reward = result["env_runners"]["episode_return_mean"]
    print(f"Iteration {i}: reward = {reward:.2f}")
Heads up: RLlib’s API changes more often than most libraries. The snippet above targets the Ray 2.x “new API stack” (.env_runners() with num_env_runners, and metrics under env_runners/episode_return_mean). Older tutorials use .rollouts(num_rollout_workers=...) and result['episode_reward_mean'], which are deprecated, so always check the docs for your installed Ray version.
Pros: Scalable, many algorithms, production-ready
Cons: Steeper learning curve, more complex setup
Gymnasium (formerly OpenAI Gym)
Gymnasium is not an RL algorithm library — it’s a standard interface for RL environments. Almost every RL framework uses Gymnasium environments. Understanding it is essential.
import gymnasium as gym

env = gym.make("Pendulum-v1", render_mode="human")
obs, info = env.reset()
for step in range(200):
    # Random action (replace with your policy)
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
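Note that step() returns two separate end-of-episode flags. In Gymnasium, terminated means the task itself reached a terminal state (success or failure), while truncated means the episode was cut off from outside, typically by a time limit. The loop above resets on either one, but learning algorithms usually treat them differently. The sketch below shows the idea with a one-step value target; the function name and the numbers are hypothetical, and libraries like SB3 handle this for you internally.

```python
def td_target(reward, next_value, terminated, gamma=0.99):
    """One-step bootstrapped target: r + gamma * V(s'), unless s' is terminal."""
    if terminated:
        # A true terminal state has no future reward to account for.
        return reward
    # Also used when the episode was only truncated: the task could have
    # continued, so the estimated value of the next state still counts.
    return reward + gamma * next_value


# Same transition, different reasons for the episode ending:
print("terminated:", td_target(reward=1.0, next_value=10.0, terminated=True))
print("truncated: ", td_target(reward=1.0, next_value=10.0, terminated=False))
```

Mixing the two flags up tends to teach the agent that running out of time is the same as failing, which can distort the learned values.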
Framework Comparison
| Feature | Stable Baselines3 | Ray RLlib |
|---|---|---|
| Difficulty | Beginner-friendly | Intermediate |
| Algorithms | 7 core algorithms | 30+ algorithms |
| Distributed training | No | Yes |
| Documentation | Excellent | Good |
| Community | Large | Large |
| Best for | Learning, research | Production, scale |
Getting Started: Your First RL Robot
The standard workflow for applying RL to a robot:
- Build a simulation of your robot in a physics engine (Gazebo, Isaac Sim, PyBullet, MuJoCo)
- Wrap it as a Gymnasium environment (implement reset(), step(), observation_space, action_space)
- Define a reward function (what behavior do you want to encourage?)
- Train with SB3 or RLlib in simulation
- Transfer to the real robot (the “sim-to-real” problem — a whole topic in itself)
import gymnasium as gym
import numpy as np
from gymnasium import spaces


class SimpleRobotEnv(gym.Env):
    """A minimal custom robot environment."""

    def __init__(self):
        super().__init__()
        # Define action space: continuous joint velocities
        self.action_space = spaces.Box(
            low=-1.0, high=1.0, shape=(2,), dtype=np.float32
        )
        # Define observation space: joint angles + target position
        self.observation_space = spaces.Box(
            low=-np.pi, high=np.pi, shape=(4,), dtype=np.float32
        )
        self.target = np.array([0.5, 0.5], dtype=np.float32)
        self.state = np.zeros(4, dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random for reproducibility
        # Sample joint angles from the seeded generator, not the global np.random
        angles = self.np_random.uniform(-0.5, 0.5, size=2)
        # Observation layout: [joint angle 1, joint angle 2, target x, target y]
        self.state = np.concatenate([angles, self.target]).astype(np.float32)
        return self.state.copy(), {}

    def step(self, action):
        # Update the joint angles based on the action; the target half stays fixed
        self.state[:2] += np.asarray(action, dtype=np.float32) * 0.1
        self.state = np.clip(self.state, -np.pi, np.pi)
        # Reward: negative distance to target
        distance = float(np.linalg.norm(self.state[:2] - self.target))
        reward = -distance
        terminated = distance < 0.05  # Success condition
        # No time limit here; wrap with gymnasium.wrappers.TimeLimit to cap episode length
        truncated = False
        return self.state.copy(), reward, terminated, truncated, {}
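The environment above uses plain negative distance as its reward. A common variation is to add a small penalty on large actions and a bonus for reaching the target. Here is one possible sketch; the weights and the function name are hypothetical, and the success radius simply mirrors the 0.05 threshold used above. Treat the numbers as starting points to tune, not recommendations.

```python
import math


def reaching_reward(position, target, action,
                    success_radius=0.05, action_cost=0.01, success_bonus=10.0):
    """Return (reward, success) for a simple reaching task.

    reward = -distance - action_cost * sum(action^2), plus a bonus on success.
    """
    distance = math.dist(position, target)
    effort = sum(a * a for a in action)  # discourages large, jerky commands
    reward = -distance - action_cost * effort
    success = distance < success_radius
    if success:
        reward += success_bonus
    return reward, success


# Far from the target with a large action: negative reward, no success.
print(reaching_reward([0.0, 0.0], [0.5, 0.5], [1.0, 1.0]))
# On the target with no action: only the bonus remains.
print(reaching_reward([0.5, 0.5], [0.5, 0.5], [0.0, 0.0]))
```

Whatever terms you add, it is worth checking that the agent cannot collect more reward by exploiting one of them than by actually completing the task.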
Recommended Learning Path
- Complete the Gymnasium documentation tutorial
- Train an agent on CartPole with Stable Baselines3
- Build a custom environment for a simple task
- Experiment with different algorithms (PPO is a good default)
- When you need scale or more algorithms, migrate to RLlib
Next week, we’ll dive into ROS2 — the Robot Operating System — and why it’s become the standard middleware for serious robotics projects.