Choosing the Right Reinforcement Learning Framework

Reinforcement learning (RL) is one of the most exciting areas of robotics research. It’s the technology behind robots that learn to walk, manipulate objects, and play games — all without being explicitly programmed with the rules. But it’s also one of the most misunderstood.

This guide will give you a clear picture of what RL is, when it makes sense to use it, and how to choose a framework to get started.

What is Reinforcement Learning?

In reinforcement learning, an agent learns to take actions in an environment to maximize a cumulative reward. The agent starts knowing nothing and learns through trial and error — taking actions, observing what happens, and gradually figuring out which actions lead to better outcomes.

The key components:

Agent: the learner/decision-maker (your robot)
Environment: everything the agent interacts with
State: what the agent can observe about the environment
Action: what the agent can do
Reward: a scalar signal telling the agent how well it’s doing
Policy: the agent’s strategy (what action to take in each state)

The goal is to learn a policy that maximizes the total reward over time.
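
"Total reward over time" is usually made precise as the return: the sum of an episode's rewards, often discounted by a factor gamma between 0 and 1 so that near-term rewards count for more. Here is a minimal sketch of that calculation. The function name discounted_return and the sample rewards are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma=0.99):
    """Return r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode."""
    total = 0.0
    # Work backwards so each step needs only one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: three small step penalties, then a success bonus.
episode_rewards = [-1.0, -1.0, -1.0, 10.0]

undiscounted = discounted_return(episode_rewards, gamma=1.0)
discounted = discounted_return(episode_rewards, gamma=0.9)
print(undiscounted, discounted)
```

With gamma=1.0 this is a plain sum. Lowering gamma shrinks the contribution of the late bonus, which is why the discount factor (the gamma setting you'll see in training configs) changes how far ahead an agent effectively plans.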

[Figure: the agent sends an action to the environment; the environment returns a new state and a reward. The loop repeats, episode after episode, until the policy improves.]

The core RL loop: the agent picks an action, the environment responds with a new state and a reward, and the agent uses that feedback to get a little better next time.
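
To make the loop concrete, here is a small self-contained sketch in plain Python. It uses tabular Q-learning, swapped in here only because it is one of the simplest learners to write out by hand. The CorridorEnv class, the train function, and every hyperparameter value are hypothetical choices for illustration, not anything from the frameworks discussed in this guide.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..4, start at 0, goal at 4."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the move.
        move = 1 if action == 1 else -1
        self.position = min(max(self.position + move, 0), self.length - 1)
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.01  # small step cost, bonus at the goal
        return self.position, reward, done


def train(episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    env = CorridorEnv()
    # Q-table: one row per state, one estimated value per action.
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Agent picks an action: mostly greedy, sometimes random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # Environment responds with a new state and a reward.
            next_state, reward, done = env.step(action)
            # Agent nudges its estimate toward the observed outcome.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
# Greedy action for each non-goal cell (0 = left, 1 = right).
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
print(greedy_policy)
```

After training, the greedy policy should choose "right" in every cell, which is the shortest route to the goal. Real robotics problems replace the table with a neural network and the corridor with a simulator, but the three commented steps inside the while loop are the same.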

When to Use RL for Robotics

RL is powerful but not always the right tool. Use it when:

The task is too complex to program explicitly (e.g., learning to walk on uneven terrain)
You have a good simulation environment to train in
You can define a clear reward function
You have significant compute resources and time

Don’t use RL when:

A simple control algorithm (PID, etc.) would work fine
You can’t simulate the environment accurately
You need guaranteed safety (RL policies can behave unpredictably)
You need to explain the robot’s decisions

The Frameworks

Stable Baselines3 (SB3)

Best for: beginners, research, quick experiments

Stable Baselines3 is the most beginner-friendly RL library. It implements the most popular algorithms (PPO, SAC, TD3, A2C, DDPG) with clean, well-documented APIs. If you’re learning RL or doing research that doesn’t require massive scale, start here.

import gymnasium as gym
from stable_baselines3 import PPO

# Create the environment
env = gym.make("CartPole-v1")

# Create and train the agent
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)

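# Evaluate the trained agent in a separate environment that opens a window.
# Passing render_mode="human" makes Gymnasium draw every step automatically;
# calling env.render() on an environment created without a render mode does nothing.
eval_env = gym.make("CartPole-v1", render_mode="human")
obs, _ = eval_env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = eval_env.step(action)
    if terminated or truncated:
        obs, _ = eval_env.reset()

eval_env.close()
env.close()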

Pros: Simple API, excellent documentation, well-tested implementations

Cons: Single-machine only, not designed for large-scale distributed training

Ray RLlib

Best for: production systems, distributed training, large-scale experiments

Ray RLlib is built on top of the Ray distributed computing framework. It supports dozens of algorithms and can scale from a laptop to a cluster of hundreds of machines. It’s more complex than SB3 but much more powerful.

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .env_runners(num_env_runners=4)
    .training(
        train_batch_size=4000,
        lr=3e-4,
        gamma=0.99,
    )
)

algo = config.build_algo()

for i in range(10):
    result = algo.train()
    reward = result["env_runners"]["episode_return_mean"]
    print(f"Iteration {i}: reward = {reward:.2f}")

Heads up: RLlib’s API changes more often than most libraries. The snippet above targets the “new API stack” in recent Ray 2.x releases: .env_runners() / num_env_runners, metrics under env_runners/episode_return_mean, and config.build_algo(). Older tutorials use .rollouts(num_rollout_workers=...), result['episode_reward_mean'], and config.build(), which are deprecated, so always check the docs for your installed Ray version.

Pros: Scalable, many algorithms, production-ready

Cons: Steeper learning curve, more complex setup

Gymnasium (formerly OpenAI Gym)

Gymnasium is not an RL algorithm library — it’s a standard interface for RL environments. Almost every RL framework uses Gymnasium environments. Understanding it is essential.

import gymnasium as gym

env = gym.make("Pendulum-v1", render_mode="human")
obs, info = env.reset()

for step in range(200):
    # Random action (replace with your policy)
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    
    if terminated or truncated:
        obs, info = env.reset()

env.close()
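
The loop above checks two separate flags. In Gymnasium, terminated means the environment reached a terminal state (the task succeeded or failed), while truncated means the episode was cut off from outside, typically by a time limit. The sketch below imitates that behavior in plain Python so it runs without any dependencies. NeverEndingEnv and StepCap are hypothetical names made up for illustration; Gymnasium ships its own TimeLimit wrapper for real use.

```python
class NeverEndingEnv:
    """Hypothetical stand-in environment whose task never finishes."""

    def reset(self, seed=None, options=None):
        return 0.0, {}

    def step(self, action):
        obs, reward = 0.0, 1.0
        terminated = False  # no terminal state is ever reached
        truncated = False   # the environment itself imposes no time limit
        return obs, reward, terminated, truncated, {}


class StepCap:
    """Illustrative wrapper: reports truncated=True after max_steps steps."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.elapsed = 0

    def reset(self, seed=None, options=None):
        self.elapsed = 0
        return self.env.reset(seed=seed, options=options)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_steps:
            truncated = True  # cut off from outside, not a task outcome
        return obs, reward, terminated, truncated, info


env = StepCap(NeverEndingEnv(), max_steps=5)
obs, info = env.reset()
steps = 0
terminated = truncated = False
while not (terminated or truncated):
    obs, reward, terminated, truncated, info = env.step(action=None)
    steps += 1

print(steps, terminated, truncated)
```

The episode ends because of the cap, not because the task finished, so terminated stays False. Training code generally wants to know the difference: a truncated episode could have kept earning reward, while a terminated one could not.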

Framework Comparison

Feature	Stable Baselines3	Ray RLlib
Difficulty	Beginner-friendly	Intermediate
Algorithms	7 core algorithms	30+ algorithms
Distributed training	No	Yes
Documentation	Excellent	Good
Community	Large	Large
Best for	Learning, research	Production, scale

Getting Started: Your First RL Robot

The standard workflow for applying RL to a robot:

Build a simulation of your robot in a physics engine (Gazebo, Isaac Sim, PyBullet, MuJoCo)
Wrap it as a Gymnasium environment (implement reset(), step(), observation_space, action_space)
Define a reward function (what behavior do you want to encourage?)
Train with SB3 or RLlib in simulation
Transfer to the real robot (the “sim-to-real” problem — a whole topic in itself)

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class SimpleRobotEnv(gym.Env):
    """A minimal custom robot environment."""

    def __init__(self):
        super().__init__()
        # Define action space: continuous joint velocities
        self.action_space = spaces.Box(
            low=-1.0, high=1.0, shape=(2,), dtype=np.float32
        )
        # Define observation space: joint angles + target position
        self.observation_space = spaces.Box(
            low=-np.pi, high=np.pi, shape=(4,), dtype=np.float32
        )
        self.target = np.array([0.5, 0.5], dtype=np.float32)
        self.state = np.zeros(4, dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # seeds self.np_random for reproducibility
        # Draw the start pose from the seeded generator, not global np.random
        self.state[:2] = self.np_random.uniform(-0.5, 0.5, size=2)
        self.state[2:] = self.target  # expose the target in the observation
        return self.state.copy(), {}

    def step(self, action):
        # Update joint angles based on the commanded velocities
        action = np.clip(action, self.action_space.low, self.action_space.high)
        self.state[:2] = np.clip(self.state[:2] + action * 0.1, -np.pi, np.pi)

        # Reward: negative distance to target
        distance = float(np.linalg.norm(self.state[:2] - self.target))
        reward = -distance

        terminated = distance < 0.05  # Success condition
        # No step limit here; wrap with gymnasium.wrappers.TimeLimit to cap episodes
        truncated = False

        # Return a copy so callers cannot mutate the internal state
        return self.state.copy(), reward, terminated, truncated, {}
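
The environment above rewards the agent with nothing but the negative distance to the target. Step 3 of the workflow (defining the reward) is often where most of the tuning happens, so here is a hedged sketch of a slightly richer reward in plain Python. The function shaped_reward and its weights (action_cost, success_bonus) are assumptions chosen for illustration, not tuned values or part of any library.

```python
import math


def shaped_reward(position, target, action,
                  success_radius=0.05, action_cost=0.01, success_bonus=1.0):
    """Negative distance to target, minus an effort penalty, plus a success bonus."""
    distance = math.dist(position, target)
    effort = sum(a * a for a in action)  # squared magnitude of the command
    reward = -distance - action_cost * effort
    success = distance < success_radius
    if success:
        reward += success_bonus
    return reward, success


target = (0.5, 0.5)
far_reward, far_done = shaped_reward((0.0, 0.0), target, action=(1.0, 1.0))
near_reward, near_done = shaped_reward((0.49, 0.5), target, action=(0.1, 0.0))
print(far_reward, far_done)
print(near_reward, near_done)
```

The distance term gives the agent a signal at every step, the effort penalty discourages thrashing the joints, and the bonus marks the actual goal. Each extra term is also a new way for the agent to find a loophole, so it's usually worth adding them one at a time and watching what behavior comes out.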

Recommended Learning Path

Complete the Gymnasium documentation tutorial
Train an agent on CartPole with Stable Baselines3
Build a custom environment for a simple task
Experiment with different algorithms (PPO is a good default)
When you need scale or more algorithms, migrate to RLlib

Next week, we’ll dive into ROS 2, the Robot Operating System, and why it’s become the standard middleware for serious robotics projects.