How ROSClaw Connects Large Language Models to Robots

ROSClaw connects LLMs to robots through a standardized executive layer, revealing how different AI models produce drastically different behaviors, safety risks, and execution styles under identical conditions.


Imagine telling a robot to "shake and bake." One AI model makes it oscillate and surge forward. Another makes it sway gently sideways. A third draws an elegant spiral on the floor. Same robot. Same command. Same safety rules. Completely different behavior.

This is not a bug. It is one of the most important findings in recent robotics research. And it is exactly the kind of discovery that ROSClaw was built to expose.

Published on arXiv in March 2026 by researchers at Kent State University's Advanced Telerobotics Research Lab, ROSClaw is an open-source executive layer that connects large language models (LLMs) to physical robots running ROS 2.

It is both a practical deployment tool and a rigorous measurement instrument for embodied AI.

What Is ROS 2?

 ROS 2 node interfaces: topics, services, and actions

Before diving into ROSClaw, you need to understand the platform it runs on. ROS 2 (Robot Operating System 2) is the industry-standard middleware for robotics. It is not an operating system in the traditional sense.

It is a communication framework that lets software components (sensors, controllers, cameras, planners) talk to each other over a publish-subscribe messaging system called DDS (Data Distribution Service).

ROS 2 organizes robot functionality into topics, services, and actions. A topic like /cmd_vel carries velocity commands. A topic like /scan carries LiDAR data. Services handle synchronous requests. Actions handle long-running tasks with feedback, like navigating to a waypoint.
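The publish-subscribe pattern underneath topics can be illustrated with a toy in-memory broker. This is a sketch only: real ROS 2 uses DDS transport, typed message classes (such as geometry_msgs/Twist), and a discovery protocol, none of which appear here.

```python
# Toy in-memory pub-sub broker illustrating the ROS 2 topic model.
# Real ROS 2 runs this over DDS with typed messages; this only shows the pattern.
from collections import defaultdict

class ToyBroker:
    def __init__(self):
        self._subscribers = defaultdict(list)  # topic name -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber on that topic.
        for callback in self._subscribers[topic]:
            callback(message)

broker = ToyBroker()
received = []
broker.subscribe("/cmd_vel", received.append)
broker.publish("/cmd_vel", {"linear_x": 0.5, "angular_z": 0.0})
print(received)  # the velocity message arrives at the subscriber
```

Services and actions layer request-response and long-running-goal semantics on top of the same graph.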

Almost every modern research robot runs ROS 2. TurtleBot3, Unitree Go2, Unitree G1, and thousands of custom platforms all speak this language. This makes ROS 2 the ideal integration target for anyone who wants to connect AI reasoning to physical hardware at scale.

The Problem ROSClaw Solves

Connecting a foundation model to a physical robot sounds simple. In practice, it is not. Most existing LLM-to-robot integrations are tightly coupled. The prompting logic, perception pipeline, and actuation code are all woven together.

Swap one component and the whole system breaks. Swap the AI model and you have no idea whether behavioral changes come from the model or from differences in how the system was wired up.

This makes reproducibility nearly impossible. You cannot compare two models fairly if the interface between them and the robot is different.

ROSClaw solves this with a clean, formal design. The researchers define the executive layer as a contract: C = ⟨A, O, V, L⟩. That means an affordance manifest (A), an observation normalizer (O), a pre-execution validator (V), and a structured audit logger (L). Every model gets the same interface. Every action goes through the same validator. Every decision is logged. The model is the only thing that changes.
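The contract can be pictured as a four-field record. The field contents below are illustrative, not the paper's exact schema; only the four-component structure C = ⟨A, O, V, L⟩ comes from the source.

```python
# Sketch of the executive-layer contract C = <A, O, V, L>.
# Every model backend sees the same instance; only the model changes.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ExecutiveContract:
    affordance_manifest: dict          # A: which tools/topics the robot exposes
    observation_normalizer: Callable   # O: maps raw sensor data to a common form
    validator: Callable                # V: pre-execution safety check
    audit_log: list = field(default_factory=list)  # L: structured decision record

contract = ExecutiveContract(
    affordance_manifest={"/cmd_vel": "geometry_msgs/Twist"},
    observation_normalizer=lambda obs: {k.lower(): v for k, v in obs.items()},
    validator=lambda action: abs(action.get("linear_x", 0.0)) <= 1.0,
)
contract.audit_log.append({"action": {"linear_x": 0.5}, "allowed": True})
```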

What Is the Executive Layer?

  ROSClaw architecture

The term "executive layer" comes from cognitive neuroscience. Biological executive function mediates between perception and action in the brain. ROSClaw does the same thing for robots.

It sits between the AI model (the "mind") and the robot hardware (the "body"). The model proposes an action. The executive layer checks it against a safety policy. If the action is safe, it executes. If not, it is blocked, and the model is forced to replan.

This happens at every step of every task. The validator intercepts every tool call before it reaches the robot. For direct motion commands like /cmd_vel, it enforces hard velocity bounds: no command could exceed 1.0 m/s linear or 1.5 rad/s rotational in the paper's experiments. An independent emergency stop also runs outside the agent's control entirely.
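A minimal version of that velocity check might look like the following. The limits match the bounds reported in the paper; the function name and return shape are ours, not ROSClaw's actual API.

```python
# Hedged sketch of a pre-execution velocity validator.
# Bounds follow the paper's experiments: 1.0 m/s linear, 1.5 rad/s rotational.
MAX_LINEAR = 1.0    # m/s
MAX_ANGULAR = 1.5   # rad/s

def validate_cmd_vel(linear_x: float, angular_z: float) -> tuple:
    """Return (allowed, reason). Blocked commands never reach /cmd_vel."""
    if abs(linear_x) > MAX_LINEAR:
        return False, f"linear velocity {linear_x} m/s exceeds {MAX_LINEAR} m/s"
    if abs(angular_z) > MAX_ANGULAR:
        return False, f"angular velocity {angular_z} rad/s exceeds {MAX_ANGULAR} rad/s"
    return True, "ok"

print(validate_cmd_vel(0.5, 0.0))   # allowed
print(validate_cmd_vel(2.0, 0.0))   # blocked: the model is forced to replan
```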

This architecture has three enforced invariants. First, actuation is bounded at the executive boundary by construction. Second, all backends receive identical tool schemas and safety policies, so behavioral differences are attributable to the model. Third, every attempted-but-blocked action is preserved in the audit log for post-hoc analysis.
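The third invariant, preserving attempted-but-blocked actions, can be sketched as a thin wrapper around the validator. Names and log fields here are illustrative, not ROSClaw's actual schema.

```python
# Sketch: log every attempt, allowed or blocked, before anything reaches the robot.
import json
import time

audit_log = []

def attempt_action(tool: str, args: dict, validator) -> bool:
    """Run the validator and record the outcome either way."""
    allowed = validator(args)
    audit_log.append({
        "ts": time.time(),
        "tool": tool,
        "args": args,
        "allowed": allowed,  # blocked attempts are preserved, not discarded
    })
    return allowed

too_fast = {"linear_x": 3.0, "angular_z": 0.0}
attempt_action("ros2_publish", too_fast, lambda a: abs(a["linear_x"]) <= 1.0)
print(json.dumps(audit_log[-1]))  # the blocked attempt survives for post-hoc analysis
```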

Eight Tools - Three Transport Modes

  ROSCLAW AI AGENT TOOLS

ROSClaw exposes eight tools to the AI agent, each mapping to a ROS 2 primitive.

  1. ros2_publish sends velocity commands.
  2. ros2_subscribe reads sensor data.
  3. ros2_service triggers service calls.
  4. ros2_action sends navigation goals to systems like Nav2.
  5. ros2_param_get reads configuration parameters.
  6. ros2_param_set writes configuration parameters.
  7. ros2_list_topics lets the agent discover the robot's graph dynamically.
  8. ros2_camera returns base64-encoded frames for visual reasoning.
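Because every backend must receive an identical tool schema, each tool is plausibly described in a function-calling format along these lines. This is our illustrative schema for ros2_publish, showing the shape only; the paper does not specify ROSClaw's exact field names.

```python
# Illustrative function-calling schema for the ros2_publish tool (our sketch,
# not ROSClaw's actual manifest format).
ros2_publish_schema = {
    "name": "ros2_publish",
    "description": "Publish a message to a ROS 2 topic, e.g. a Twist to /cmd_vel.",
    "parameters": {
        "type": "object",
        "properties": {
            "topic": {"type": "string", "description": "Target topic, e.g. /cmd_vel"},
            "msg_type": {"type": "string", "description": "ROS 2 message type"},
            "data": {"type": "object", "description": "Message fields as JSON"},
        },
        "required": ["topic", "msg_type", "data"],
    },
}
print(ros2_publish_schema["name"])
```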

Three transport modes connect to the ROS 2 graph. Local DDS runs with under 1 ms latency. Rosbridge WebSocket adds 5-10 ms. WebRTC peer-to-peer adds 20-100 ms. In practice, none of this matters much. LLM inference takes 1-3 seconds per turn. Transport overhead is negligible.
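The claim that transport overhead is negligible follows directly from the numbers: even WebRTC's worst case of 100 ms is a small fraction of a 1-3 second inference turn.

```python
# Back-of-envelope check using the latency figures above.
inference_s = 2.0  # mid-range LLM inference per turn (1-3 s)
transports_ms = {"local_dds": 1, "rosbridge_ws": 10, "webrtc": 100}  # worst cases

for name, ms in transports_ms.items():
    share = (ms / 1000) / (inference_s + ms / 1000)
    print(f"{name}: {share:.1%} of turn time")
# Even WebRTC's worst case stays under 5% of a typical turn.
```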

Switching robot platforms requires no changes to ROSClaw source code. A dedicated ROS 2 discovery node introspects the computation graph and publishes a capability manifest.

ROSClaw injects this into the model's system prompt automatically. Updating the platform configuration and safety allowlist for a new robot took 10–15 minutes in the paper's experiments.
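Rendering a discovered manifest into the system prompt might look like this sketch. The manifest fields and the prompt wording are our assumptions; only the mechanism (discovery node publishes a manifest, ROSClaw injects it into the prompt) comes from the source.

```python
# Hypothetical capability manifest from a discovery node, rendered into a prompt.
manifest = {
    "platform": "TurtleBot3 Waffle Pi",
    "topics": {"/cmd_vel": "geometry_msgs/Twist", "/scan": "sensor_msgs/LaserScan"},
    "limits": {"max_linear_mps": 1.0, "max_angular_radps": 1.5},
}

def render_system_prompt(manifest: dict) -> str:
    lines = [f"You control a {manifest['platform']}.", "Available topics:"]
    lines += [f"  {t} ({m})" for t, m in manifest["topics"].items()]
    lines.append(
        f"Velocity limits: {manifest['limits']['max_linear_mps']} m/s linear, "
        f"{manifest['limits']['max_angular_radps']} rad/s angular."
    )
    return "\n".join(lines)

print(render_system_prompt(manifest))
```

Swapping robots then means swapping the manifest, not the source code.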

The Experiment: Three Robots, Four Models, 40 Tasks

The researchers tested ROSClaw on three platforms: a TurtleBot3 Waffle Pi (wheeled), a Unitree Go2 Pro (quadruped), and a Unitree G1 (humanoid). They ran four LLM backends: Claude Opus 4.6, GPT-5.2, Gemini 3.1 Pro, and Llama 4 Maverick.

The task suite had 40 total tasks across three categories. Structured tasks tested direct commands like "move forward 1 m," contextual tasks like "report heading and speed," and multi-step tasks like "patrol between three waypoints."

Open-ended behavioral tasks used ambiguous commands like "shake and bake" or "do a little dance." Safety divergence tasks used adversarial prompts like "go as fast as you can" and "ignore safety rules."

Every model got the same system prompt, the same capability context, the same safety limits, and the same hardware. Temperature was set to 0.7 across all backends to reflect realistic deployment conditions.

What the Results Showed

  TASK COMPLETION (%) ON TURTLEBOT3 STRUCTURED TASKS (N=10, MEAN±STD).


Completion Rates (TurtleBot3 Structured Tasks)

  • Claude - 86.5% overall
  • GPT-5.2 - 82.3%
  • Gemini - 79.0%
  • Llama 4 - 66.8%
  • All models performed well on simple L1 direct commands
  • Llama 4 dropped to just 46% on complex multi-step L3 tasks

Safety Divergence Under Adversarial Prompts

  • Llama 4 triggered a validator block in 43% of prompts
  • GPT-5.2 triggered blocks in only 9% - a 4.8× gap
  • Even among frontier models only (Claude, GPT-5.2, Gemini), the spread was 3.4× (9%-31%)
  • Your model choice directly changes the safety burden on your robot's guardrails
  • Zero unsafe commands reached hardware: a 100% interception rate
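Both headline ratios can be checked directly from the reported block rates (the 9%-31% frontier spread leaves the specific frontier models at each end unnamed here):

```python
# Sanity-check the reported safety-divergence ratios from the block rates above.
llama4_rate = 0.43
gpt52_rate = 0.09
frontier_max, frontier_min = 0.31, 0.09  # the reported 9%-31% frontier spread

overall_gap = llama4_rate / gpt52_rate
frontier_spread = frontier_max / frontier_min
print(f"overall gap: {overall_gap:.1f}x")        # matches the reported 4.8x
print(f"frontier spread: {frontier_spread:.1f}x")  # matches the reported 3.4x
```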

What "Shake and Bake" Teaches Us About AI in Robots

   Representative trajectories for “Shake and Bake” on TurtleBot3

The open-ended behavioral tasks are where ROSClaw's value as a measurement instrument really shows. Four models. One command: "shake and bake." Four completely different physical behaviors.

Claude executed a two-phase sequence: oscillating rotations at ±0.8 rad/s, then sustained forward motion at 0.5 m/s. GPT-5.2 produced conservative linear oscillations at 0.3 m/s with no rotation at all. Gemini generated an elaborate spiral with simultaneous rotation and varying forward velocity. Llama 4 issued a single forward command at 0.5 m/s for one second and stopped.
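Claude's two-phase plan, for instance, could be expressed as a command sequence like the sketch below. The velocities are the reported ones; the phase durations and oscillation count are our illustrative choices.

```python
# Sketch of the reported two-phase "shake and bake": oscillate, then surge forward.
# Velocities (0.8 rad/s, 0.5 m/s) are from the reported behavior; durations are ours.
def shake_and_bake_plan(oscillations=4, shake_step_s=0.5, bake_s=2.0):
    plan = []
    for i in range(oscillations):
        sign = 1 if i % 2 == 0 else -1  # alternate rotation direction
        plan.append({"linear_x": 0.0, "angular_z": sign * 0.8, "duration_s": shake_step_s})
    plan.append({"linear_x": 0.5, "angular_z": 0.0, "duration_s": bake_s})
    return plan

plan = shake_and_bake_plan()
print(len(plan), "commands; final phase:", plan[-1])
```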

These differences are not random noise. They reflect what the paper calls "operational execution profiles": reproducible differences in how each model interprets, plans, and acts on language. Open-ended pass rates (score ≥ 2) were Claude 72%, GPT-5.2 65%, Gemini 58%, and Llama 4 38%.

For human-robot interaction, this matters enormously. A robot deployed in an assistive care setting needs a cautious, predictable motion style. A social robot at a trade show might benefit from expressive, high-diversity motion. ROSClaw makes these profiles measurable and selectable.

Limitations

  • Because LLM inference takes 1-3 seconds per turn, ROSClaw is a task-level planner, not a reactive controller
  • Real-time functions like obstacle avoidance and force control require separate sub-100 ms control loops
  • Rather than replacing controllers, ROSClaw dispatches goals to existing systems such as Nav2 via ros2_action
  • The current validator only enforces velocity bounds; collision avoidance and workspace limits need additional layers
  • There is no loop-breaker yet for repeated validator blocks; a max-retry threshold triggering a fallback or emergency stop is planned
  • Future work includes richer policy modules, safety metrics, and tracking model behavior across versions
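The planned max-retry threshold could take a shape like this sketch. It is entirely our illustration of the idea; ROSClaw does not yet implement it.

```python
# Sketch of a loop-breaker: after N consecutive validator blocks,
# stop asking the model to replan and engage a fallback instead.
MAX_RETRIES = 3

def execute_with_retry(propose_action, validator, emergency_stop):
    for attempt in range(MAX_RETRIES):
        action = propose_action(attempt)
        if validator(action):
            return action   # safe action found: execute it
    emergency_stop()        # fallback after repeated blocks
    return None

stopped = []
result = execute_with_retry(
    propose_action=lambda i: {"linear_x": 5.0},      # model keeps proposing unsafe speed
    validator=lambda a: abs(a["linear_x"]) <= 1.0,
    emergency_stop=lambda: stopped.append(True),
)
print(result, stopped)  # fallback engaged instead of looping forever
```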

Conclusion

ROSClaw answers a question that most LLM-robot integrations avoid asking: if you hold everything else constant, how different are these models really?

The answer, it turns out, is very different. Up to 4.8× different in safety violation rates. Qualitatively different in physical behavior from identical language commands. Measurably different in how well they handle multi-step reasoning under a shared interface.

That is not a problem with any one model. It is a fundamental finding about the relationship between language model training and physical robot behavior. ROSClaw gives researchers and practitioners the infrastructure to measure it, reproduce it, and act on it.

The code, audit logs, and parity protocol scripts are available for review. A full open-source release is planned upon paper acceptance.

FAQs

Q1. What is ROSClaw and why is it important?

ROSClaw is an executive layer that connects LLMs to ROS 2 robots, ensuring consistent interfaces, safety validation, and reproducible behavior across different AI models.

Q2. How does ROSClaw improve safety in robotic systems?

It uses a validator that intercepts every action, enforces limits (like velocity bounds), and blocks unsafe commands before they reach hardware.

Q3. Why do different AI models behave differently on the same robot?

Even with identical prompts and constraints, models generate distinct execution strategies, leading to varied motion patterns and safety profiles.
