Claude Agent Skill · by Aradotso

OpenClaw-RL Training

Install the OpenClaw-RL Training skill for Claude Code from aradotso/trending-skills.

Works with Paperclip

How OpenClaw-RL Training fits into a Paperclip company.

OpenClaw-RL Training drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

Source file: SKILL.md (470 lines)
---
name: openclaw-rl-training
description: OpenClaw-RL framework for training personalized AI agents via reinforcement learning from natural conversation feedback
triggers:
  - train an agent with OpenClaw-RL
  - set up reinforcement learning for my AI agent
  - use GRPO or OPD with OpenClaw
  - configure async RL training pipeline
  - train agent from conversation feedback
  - set up agentic RL for terminal or GUI
  - use on-policy distillation with OpenClaw
  - deploy OpenClaw-RL with Tinker or local GPU
---

# OpenClaw-RL Training

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

OpenClaw-RL is a fully asynchronous reinforcement learning framework that converts live multi-turn conversations into training signals for personalized AI agents. It wraps a self-hosted model as an OpenAI-compatible API via [OpenClaw](https://openclaw.ai), intercepts conversations, and continuously optimizes the policy in the background without interrupting usage. It also supports scalable RL for terminal, GUI, SWE, and tool-call agents.

## Architecture Overview

Four independent async loops that never block each other:

1. **Agent Serving** — OpenClaw-compatible API serving rollouts
2. **Rollout Collection** — Captures multi-turn conversations as training trajectories
3. **PRM/Judge Evaluation** — Scores turns using next-state feedback (majority voting optional)
4. **Policy Training** — GRPO/OPD/Combine training via [slime](https://github.com/THUDM/slime) or [Tinker](https://thinkingmachines.ai/tinker/)

## Installation

```bash
git clone https://github.com/Gen-Verse/OpenClaw-RL
cd OpenClaw-RL

# Install core dependencies
pip install -r requirements.txt

# Install slime (training backend)
cd slime && pip install -e . && cd ..

# Optional: install SGLang for fast inference
pip install sglang
```

## Project Structure

```
OpenClaw-RL/
├── openclaw-rl/          # Binary RL (GRPO) method
├── openclaw-opd/         # On-Policy Distillation method
├── openclaw-combine/     # Combined Binary RL + OPD
├── openclaw-test/        # Evaluation utilities
├── terminal-rl/          # Track 2: Terminal agent RL
├── gui-rl/               # Track 2: GUI agent RL
├── swe-rl/               # Track 2: SWE agent RL
├── toolcall-rl/          # Track 2: Tool-call agent RL
├── slime/                # Core training framework
└── openclaw/             # Runtime / API server
```

## Three Learning Paradigms

### 1. Binary RL (GRPO)

A Process Reward Model scores each turn from next-state feedback. Uses GRPO advantage estimation with a PPO-style clipped surrogate loss.

### 2. On-Policy Distillation (OPD)

When the next state reveals useful hindsight, a judge extracts a textual hint to augment the prompt, creating an enhanced teacher. The token-level log-probability gap becomes a directional advantage signal.

### 3. Combination Method (Recommended)

Merges Binary RL scalar supervision with the OPD token-level directional signal. The strongest and most robust optimization.
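The group-relative advantage behind Binary RL can be sketched in a few lines. This is a minimal illustration of the GRPO idea, not code from the framework; the helper name and reward values here are hypothetical. Each prompt gets several rollouts, each rollout gets a scalar PRM reward, and the advantage is the reward normalized against the other rollouts in the same group:

```python
# Minimal sketch of GRPO-style group-relative advantages (hypothetical helper,
# not part of the OpenClaw-RL codebase).
from statistics import mean, pstdev

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Four rollouts for the same prompt, scored by the PRM in [0, 1]
rewards = [0.9, 0.1, 0.7, 0.3]
advs = grpo_advantages(rewards)
# Above-mean rollouts get positive advantage, below-mean get negative
assert advs[0] > 0 and advs[1] < 0
```

Because advantages are centered within each group, no separate value network is needed; the clipped surrogate loss then weights each rollout's tokens by this scalar.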
## Quick Start — Personal Agent (Track 1)

### Binary RL Launch Script

```bash
# openclaw-rl/run_qwen3_7b_openclaw_rl.sh
export MODEL_PATH=/path/to/qwen3-7b
export DATA_PATH=/path/to/conversation/data
export CKPT_SAVE_DIR=/path/to/checkpoints

bash openclaw-rl/run_qwen3_7b_openclaw_rl.sh
```

### OPD Launch Script

```bash
export MODEL_PATH=/path/to/qwen3-7b
export JUDGE_MODEL_PATH=/path/to/judge-model
export DATA_PATH=/path/to/conversation/data

bash openclaw-opd/run_qwen3_7b_openclaw_opd.sh
```

### Combination Method (One Line)

```bash
# Launch with combined Binary RL + OPD
bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh
```

## Configuration — Key Environment Variables

```bash
# Model configuration
export MODEL_PATH=/path/to/base/model
export JUDGE_MODEL_PATH=/path/to/judge/model   # For OPD
export PRM_MODEL_PATH=/path/to/prm/model       # For Binary RL

# Training configuration
export CKPT_SAVE_DIR=./checkpoints
export CKPT_ARGS="--save-interval 100 --save-dir $CKPT_SAVE_DIR"

# Rollout configuration
export ROLLOUT_ARGS="--rollout-batch-size 64 --num-rollouts-per-prompt 4"

# Optimizer configuration
export OPTIMIZER_ARGS="--lr 1e-6 --weight-decay 0.01 --adam-beta1 0.9 --adam-beta2 0.999"

# GPU partitioning (e.g., 8 GPUs: 4 for training, 4 for rollout)
export TRAIN_GPUS="0,1,2,3"
export ROLLOUT_GPUS="4,5,6,7"

# LoRA (optional, reduces GPU memory)
export LORA_ARGS="--lora-rank 64 --lora-alpha 128 --lora-dropout 0.05"
```

## LoRA Training

```bash
# Add LoRA args to any launch script
export LORA_ARGS="--use-lora --lora-rank 64 --lora-alpha 128"

# Example: LoRA Binary RL
bash openclaw-rl/run_qwen3_7b_lora_openclaw_rl.sh
```

## Custom Loss / Rollout Functions (Plugin API)

The slime framework exposes extension points without modifying core code:

```bash
# Custom loss function
--custom-loss-function-path ./my_method/custom_loss.py

# Custom rollout function
--rollout-function-path ./my_method/custom_rollout.py

# Custom generation function
--custom-generate-function-path ./my_method/custom_generate.py

# Custom reward model
--custom-rm-path ./my_method/custom_rm.py
```

### Example Custom Loss

```python
# my_method/custom_loss.py
import torch
from typing import Dict, Any

def compute_loss(
    policy_logits: torch.Tensor,
    reference_logits: torch.Tensor,
    rewards: torch.Tensor,
    advantages: torch.Tensor,
    config: Dict[str, Any],
) -> torch.Tensor:
    """
    Custom GRPO-style loss with clipped surrogate objective.
    """
    # Log-ratio between policy and reference
    log_ratio = policy_logits - reference_logits
    ratio = torch.exp(log_ratio)

    clip_range = config.get("clip_range", 0.2)

    # PPO-style clipped objective
    clipped = torch.clamp(ratio, 1 - clip_range, 1 + clip_range)
    loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # KL penalty
    kl_coeff = config.get("kl_coeff", 0.01)
    kl_penalty = kl_coeff * log_ratio.mean()

    return loss + kl_penalty
```

### Example Custom Reward Model

```python
# my_method/custom_rm.py
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

class CustomPRM:
    def __init__(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_path, torch_dtype=torch.bfloat16
        )
        self.model.eval()

    def score(self, prompt: str, response: str, next_state: str) -> float:
        """
        Score a turn given prompt, response, and next-state feedback.
        """
        combined = f"Prompt: {prompt}\nResponse: {response}\nOutcome: {next_state}"
        inputs = self.tokenizer(combined, return_tensors="pt", truncation=True, max_length=2048)

        with torch.no_grad():
            logits = self.model(**inputs).logits

        # Binary reward: positive class probability
        return torch.softmax(logits, dim=-1)[0, 1].item()


def get_reward_model(config):
    return CustomPRM(config["prm_model_path"])
```

## Deploying on Tinker (Cloud)

```bash
# One-line cloud deployment — Hybrid RL, OPD, Binary RL all supported
export TINKER_API_KEY=$TINKER_API_KEY
export TINKER_ENDPOINT=$TINKER_ENDPOINT

# Submit job via Ray
ray job submit --address $TINKER_ENDPOINT \
  --working-dir . \
  -- bash openclaw-combine/run_qwen3_7b_openclaw_combine.sh
```

## Track 2 — General Agentic RL

### Terminal Agent RL

```bash
export ENV_TYPE=terminal
export MAX_STEPS=20
export PARALLEL_ENVS=32   # Number of parallel environment instances

bash terminal-rl/run_terminal_rl.sh
```

### GUI Agent RL

```bash
export ENV_TYPE=gui
export SCREENSHOT_BACKEND=playwright   # or selenium
export PARALLEL_ENVS=16

bash gui-rl/run_gui_rl.sh
```

### Tool-Call Agent RL

```bash
export ENV_TYPE=toolcall
export TOOLS_CONFIG=./toolcall-rl/tools_config.json
export PARALLEL_ENVS=64

bash toolcall-rl/run_toolcall_rl.sh
```

### SWE Agent RL

```bash
export ENV_TYPE=swe
export SWE_BENCH_PATH=/path/to/swe-bench
export PARALLEL_ENVS=8   # SWE environments are heavier

bash swe-rl/run_swe_rl.sh
```

## Data Format — Conversation Trajectories

OpenClaw-RL automatically classifies API messages.
Manual format for custom data:

```json
{
  "session_id": "user_session_abc123",
  "turns": [
    {
      "type": "main",
      "prompt": "Help me refactor this function to use async/await",
      "response": "Here's the refactored version: ...",
      "next_state": "User accepted the change and said 'perfect, thanks!'",
      "trainable": true
    },
    {
      "type": "side",
      "prompt": "What is 2+2?",
      "response": "4",
      "trainable": false
    }
  ]
}
```

- **`main` turns**: Multi-turn interactions that form training trajectories
- **`side` turns**: Non-trainable system/utility turns excluded from training

## OpenClaw API Server Setup

```bash
# Start OpenClaw-compatible API server wrapping your model
export BASE_MODEL_PATH=/path/to/your/model
export OPENCLAW_PORT=8000
export OPENCLAW_HOST=0.0.0.0

# Using SGLang backend (recommended for speed).
# --enable-rl-intercept enables conversation capture for RL;
# --rl-buffer-dir sets where captured trajectories are stored.
python -m openclaw.server \
  --model-path $BASE_MODEL_PATH \
  --port $OPENCLAW_PORT \
  --backend sglang \
  --enable-rl-intercept \
  --rl-buffer-dir ./rl_buffer
```

```typescript
// Using the server as an OpenAI-compatible API in TypeScript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:8000/v1",
  apiKey: process.env.OPENCLAW_API_KEY ?? "local",
});

const response = await client.chat.completions.create({
  model: "your-model-name",
  messages: [
    { role: "user", content: "Help me write a sorting algorithm" }
  ],
  stream: true,
});

for await (const chunk of response) {
  process.stdout.write(chunk.choices[0]?.delta?.content ?? "");
}
```

## Majority Voting for Robust PRM Scoring

```bash
# Enable majority voting for more robust reward estimation
export MAJORITY_VOTE_N=5   # Number of judge calls per turn
export MAJORITY_VOTE_THRESHOLD=0.6

# Add to your launch script args:
--majority-vote-n $MAJORITY_VOTE_N \
--majority-vote-threshold $MAJORITY_VOTE_THRESHOLD
```

## Adding a New Method (Contribution Pattern)

```bash
# 1. Create a new top-level folder
mkdir my-new-method
cd my-new-method

# 2. Required files
touch README.md                   # Document what, how, env vars
touch run_qwen3_7b_my_method.sh   # Launch script
touch custom_loss.py              # If custom loss needed
touch custom_rollout.py           # If custom rollout needed
```

```bash
#!/bin/bash
# run_qwen3_7b_my_method.sh — follow existing conventions
set -e

MODEL_SIZE="7b"
MODEL_PATH=${MODEL_PATH:-/path/to/qwen3-7b}
CKPT_SAVE_DIR=${CKPT_SAVE_DIR:-./checkpoints/my-method}

CKPT_ARGS="--save-interval 50 --save-dir $CKPT_SAVE_DIR"
ROLLOUT_ARGS="--rollout-batch-size 32 --num-rollouts-per-prompt 4"
OPTIMIZER_ARGS="--lr 1e-6 --weight-decay 0.01"

ray job submit --working-dir .. -- \
  python slime/train.py \
    --model-path $MODEL_PATH \
    --custom-loss-function-path my-new-method/custom_loss.py \
    $CKPT_ARGS $ROLLOUT_ARGS $OPTIMIZER_ARGS
```

## Common Patterns

### Monitor Training Progress

```bash
# View Ray dashboard
ray dashboard  # Opens at http://localhost:8265

# Watch checkpoint saves
watch -n 10 ls -la $CKPT_SAVE_DIR

# Stream training logs
tail -f ./logs/training.log
```

### Resume from Checkpoint

```bash
export RESUME_CKPT=$CKPT_SAVE_DIR/checkpoint-500
# Add to launch script:
--resume-from-checkpoint $RESUME_CKPT
```

### Evaluate Trained Checkpoints

```bash
bash openclaw-test/run_eval.sh \
  --model-path $CKPT_SAVE_DIR/checkpoint-latest \
  --eval-tasks "conversation,coding,tool-use"
```

## Troubleshooting

**Out of GPU memory during rollout + training:**

```bash
# Use LoRA to reduce memory footprint
export LORA_ARGS="--use-lora --lora-rank 32"
# Or reduce parallel environments
export PARALLEL_ENVS=8
# Or use offloading
--offload-optimizer-state
```

**Async loop falling behind (buffer overflow):**

```bash
# Reduce rollout batch size or increase judge throughput
export ROLLOUT_ARGS="--rollout-batch-size 16"
# Or add more judge workers
--num-judge-workers 4
```

**PRM scores all near 0.5 (reward collapse):**

- Verify `next_state` fields contain meaningful feedback signals
- Check that the judge model prompt template matches the expected format
- Try increasing majority vote N: `--majority-vote-n 7`

**SGLang server not starting:**

```bash
# Check SGLang version compatibility
pip install sglang==0.4.x  # Check slime/requirements.txt for pinned version
# Fallback to vLLM backend
--backend vllm
```

**Ray job submission fails:**

```bash
# Start Ray cluster first
ray start --head --num-gpus=$(nvidia-smi -L | wc -l)
# Then submit job
ray job submit --address auto -- bash run.sh
```

## Key References

- [Technical Report (arXiv)](https://arxiv.org/abs/2603.10165)
- [OpenClaw Plugin](https://openclaw.ai)
- [Slime Training Framework](https://github.com/THUDM/slime)
- [Tinker Cloud Platform](https://thinkingmachines.ai/tinker/)
- [SDFT Paper](https://arxiv.org/abs/2601.19897) — integrated in openclaw-opd
- [SDPO Paper](https://arxiv.org/abs/2601.20802) — integrated in openclaw-opd