Vessel Reinforcement Learning

Introduction

In the Python Client of ASVSim, we have implemented a reinforcement learning (RL) example that enables researchers and developers to train autonomous vessel navigation agents. The RL environment provides a simulation for training agents to steer a vessel through procedurally generated port environments along a sequence of waypoints while avoiding static and dynamic obstacles.

As the generalizability of RL methods depends heavily on the variety of environments seen during training, we have implemented a procedural generation system that randomizes the port environment during training. The terrain is regenerated at configurable intervals; see Procedural Generation for more details. Furthermore, you can spawn static and dynamic obstacles in the environment using the simAddObstacle() function; see Vessel API for more details.

The reinforcement learning system is located in PythonClient/Vessel/ and consists of:

  • pcg_vessel_env.py — Gymnasium environment with PCG terrain randomization
  • train.py — Training script with CrossQ, wandb logging, and checkpointing
  • eval.py — Evaluation script for trained checkpoints

Environment Overview (pcg_vessel_env.py)

Environment Specifications

The PCGVesselEnv environment implements a continuous control task where an agent learns to navigate a vessel through procedurally generated port sections to reach waypoints while avoiding obstacles. Note that this is an example: the environment, reward, action space, and observation space are not tuned for optimal RL training performance.

Specification       Details
Action Space        Box(2,) - [thrust, rudder_angle]
Action Range        thrust: [0, 0.7], rudder: [0.48, 0.52]
Observation Space   Box(54,) - waypoint distances + vessel state + LiDAR
Episode Length      Maximum 800 timesteps (adjustable with action_repeat)
Success Condition   Reach within 10 meters of each waypoint

Observation Space Details

The observation vector contains 54 elements:

obs = [
    # Waypoint distances (8)
    dx_prev_waypoint,             # X distance to previous waypoint
    dy_prev_waypoint,             # Y distance to previous waypoint
    dx_current_waypoint,          # X distance to current waypoint
    dy_current_waypoint,          # Y distance to current waypoint
    distance_to_current_waypoint, # Euclidean distance to current waypoint
    dx_next_waypoint,             # X distance to next waypoint (0 if none)
    dy_next_waypoint,             # Y distance to next waypoint (0 if none)
    distance_to_next_waypoint,    # Euclidean distance to next waypoint (0 if none)

    # Heading (3)
    heading_error,                # Angle error to current waypoint [-pi, pi]
    sin(heading),                 # Sine of vessel heading
    cos(heading),                 # Cosine of vessel heading

    # Dynamics (5)
    linear_velocity_x,            # X-axis linear velocity
    linear_velocity_y,            # Y-axis linear velocity
    linear_acceleration_x,        # X-axis linear acceleration
    linear_acceleration_y,        # Y-axis linear acceleration
    angular_acceleration_z,       # Z-axis angular acceleration

    # Previous actions (2)
    prev_thrust,                  # Previous thrust action
    prev_rudder_angle,            # Previous rudder angle action

    # LiDAR (36)
    lidar_sectors[0:36]           # 36 min-pooled sectors (10° each, 360° coverage)
]

The LiDAR observations are created by min-pooling 3600 raw points into 36 sectors of 100 points each. Ground and vessel labels are filtered out so only obstacle distances remain.
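The min-pooling step described above can be sketched in plain NumPy. The helper name and the exact filtering of ground/vessel labels are illustrative here, not the actual implementation in pcg_vessel_env.py:

```python
import numpy as np

def min_pool_lidar(ranges: np.ndarray, n_sectors: int = 36) -> np.ndarray:
    """Min-pool raw LiDAR ranges into fixed angular sectors.

    With 3600 points and 36 sectors, each sector covers 100 consecutive
    points (10 degrees) and keeps only the closest return.
    """
    points_per_sector = ranges.shape[0] // n_sectors  # 3600 // 36 = 100
    return ranges[: n_sectors * points_per_sector].reshape(
        n_sectors, points_per_sector
    ).min(axis=1)

raw = np.full(3600, 100.0)   # no obstacles: max range everywhere
raw[250] = 12.5              # one close return, falls into sector 2
sectors = min_pool_lidar(raw)
print(sectors.shape)         # (36,)
print(sectors[2])            # 12.5
```

Min-pooling (rather than averaging) is the conservative choice for obstacle avoidance: the agent always sees the nearest obstacle in each sector.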

Action Space Details

Action         Range          Description
thrust         [0, 0.7]       Forward propulsion control
rudder_angle   [0.48, 0.52]   Steering control (0.5 = straight)
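Because the two action dimensions have asymmetric bounds, a policy that outputs actions in [-1, 1] must be rescaled before being sent to the environment. This rescaling helper is an illustration; the bounds match the table above, but the function name is not part of the environment's API:

```python
import numpy as np

LOW = np.array([0.0, 0.48])   # [thrust, rudder_angle] lower bounds
HIGH = np.array([0.7, 0.52])  # upper bounds

def scale_action(a: np.ndarray) -> np.ndarray:
    """Map a policy output in [-1, 1]^2 onto the environment's action bounds."""
    a = np.clip(a, -1.0, 1.0)
    return LOW + (a + 1.0) * 0.5 * (HIGH - LOW)

print(scale_action(np.array([0.0, 0.0])))  # mid-range: approx [0.35 0.5]
```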

Waypoint Navigation

The environment supports multi-waypoint navigation through PCG-generated port sections. The goal_distance parameter controls how many sections (waypoints) the agent must navigate through. When an intermediate waypoint is reached (within 10m), the agent automatically advances to the next waypoint. The episode ends successfully only when the final waypoint is reached.
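The waypoint-advance rule described above can be sketched as a small helper, assuming a list of (x, y) waypoints and the 10 m acceptance radius from the specification table. The names are illustrative, not the actual internals of PCGVesselEnv:

```python
import math

WAYPOINT_RADIUS = 10.0  # meters, per the success condition

def advance_waypoint(pos, waypoints, idx):
    """Return (new_idx, success): advance within 10 m; success at the last one."""
    wx, wy = waypoints[idx]
    if math.hypot(pos[0] - wx, pos[1] - wy) <= WAYPOINT_RADIUS:
        if idx == len(waypoints) - 1:
            return idx, True       # final waypoint reached: episode succeeds
        return idx + 1, False      # intermediate waypoint: advance to the next
    return idx, False

idx, done = advance_waypoint((3.0, 4.0), [(0.0, 0.0), (50.0, 0.0)], 0)
print(idx, done)  # 1 False — within 5 m of the first waypoint, so advance
```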

Reward Structure

Condition                  Reward
Progress toward waypoint   1.0 * (prev_distance - current_distance)
Time penalty (per step)    -0.1
Collision                  -100.0 (episode terminates)
Final waypoint reached     +500.0 (episode terminates)
Timeout                    Episode truncated
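Putting the table together, the per-step reward combines dense progress shaping with a constant time penalty, plus terminal bonuses and penalties. The constants match the table; the function itself is an illustrative sketch, not the environment's actual reward method:

```python
def step_reward(prev_dist, curr_dist, collided=False, reached_final=False):
    """Per-step reward: terminal cases first, else progress minus time penalty."""
    if collided:
        return -100.0                             # collision ends the episode
    if reached_final:
        return 500.0                              # final waypoint bonus
    return 1.0 * (prev_dist - curr_dist) - 0.1    # progress shaping - time penalty

print(round(step_reward(20.0, 18.5), 3))  # 1.4 (moved 1.5 m closer, -0.1 per step)
```

The progress term rewards closing distance to the current waypoint each step, so the agent receives a learning signal long before any terminal event.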

Training (train.py)

Overview

The training script uses CrossQ from sb3-contrib, a sample-efficient off-policy algorithm for continuous control. Training includes observation normalization via VecNormalize, periodic model checkpointing, and optional Weights & Biases logging.
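VecNormalize standardizes observations against running mean/variance statistics collected during training. The idea can be sketched in plain NumPy; this is an illustration of the technique, not the stable-baselines3 implementation:

```python
import numpy as np

class RunningNorm:
    """Online observation normalization via a parallel Welford-style update."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps  # tiny initial count avoids division by zero

    def update(self, x):
        """Merge a batch of observations into the running statistics."""
        batch_mean, batch_var, n = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + n
        self.mean += delta * n / total
        m_a = self.var * self.count
        m_b = batch_var * n
        self.var = (m_a + m_b + delta**2 * self.count * n / total) / total
        self.count = total

    def normalize(self, x):
        """Standardize observations with the current running statistics."""
        return (x - self.mean) / np.sqrt(self.var + 1e-8)

rn = RunningNorm((2,))
rn.update(np.arange(10.0).reshape(5, 2))  # column means are 4 and 5
print(np.round(rn.mean, 2))               # approximately [4. 5.]
```

When evaluating a trained model, the same statistics must be applied, which is why VecNormalize state is saved alongside checkpoints.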

Basic Usage

# Minimal — start training with defaults
python train.py --sim-path path/to/Blocks.exe

# Full example with all options
python train.py \
    --sim-path path/to/Blocks.exe \
    --timesteps 2500000 \
    --terrain-regen 10 \
    --num-obstacles 4 \
    --num-dynamic-obstacles 0 \
    --num-waypoints 1 \
    --action-repeat 1 \
    --seed 43 \
    --wandb-key YOUR_WANDB_KEY

Command Line Arguments

Argument                  Default            Description
--ip                      127.0.0.1          Simulator IP address
--timesteps               2500000            Total training timesteps
--terrain-regen           10                 Regenerate PCG terrain every N episodes
--num-obstacles           4                  Number of static obstacles per episode
--num-dynamic-obstacles   0                  Number of moving obstacles per episode
--num-waypoints           1                  Number of waypoints to navigate (1 or 2)
--sim-path                Blocks/Blocks.exe  Path to simulator executable
--sim-wait                10                 Seconds to wait for simulator startup
--sim-log                 false              Log simulator output to logs/sim.log
--action-repeat           1                  Number of times to repeat each action
--seed                    43                 Random seed for reproducibility
--wandb-key               None               Weights & Biases API key (wandb logging is disabled if unset)

CrossQ Hyperparameters

The default hyperparameters used in training:

model = CrossQ(
    "MlpPolicy",
    env,
    learning_rate=0.0003,
    gamma=0.99,
    batch_size=256,
    buffer_size=500000,
    learning_starts=5000,
    train_freq=1,
    stats_window_size=10,
    policy_kwargs=dict(net_arch=[512, 512]),
)

Output Structure

Training outputs are saved to the logs/ directory:

logs/
├── training/
│   ├── models/           # Checkpoints every 25k steps + final model
│   └── tb/               # TensorBoard logs
└── sim.log               # Simulator output (if --sim-log is set)

Evaluation (eval.py)

Usage

# Evaluate a checkpoint
python eval.py --checkpoint logs/training/models/crossq_pcg_vessel_25000_steps.zip

# Evaluate with more episodes
python eval.py --checkpoint logs/training/models/crossq_pcg_vessel_policy.zip --episodes 200

The evaluation script loads a trained CrossQ checkpoint, runs it for a number of episodes, and prints per-episode results along with aggregate statistics including success rate, collision rate, timeout rate, mean reward, and mean final distance to goal.
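The aggregate statistics listed above reduce a list of per-episode results to a few summary numbers. A sketch of that aggregation, with illustrative field names rather than eval.py's actual data structures:

```python
def summarize(episodes):
    """Aggregate per-episode results into the statistics eval reports."""
    n = len(episodes)
    return {
        "success_rate": sum(e["success"] for e in episodes) / n,
        "collision_rate": sum(e["collision"] for e in episodes) / n,
        "timeout_rate": sum(e["timeout"] for e in episodes) / n,
        "mean_reward": sum(e["reward"] for e in episodes) / n,
        "mean_final_distance": sum(e["final_distance"] for e in episodes) / n,
    }

stats = summarize([
    {"success": True, "collision": False, "timeout": False,
     "reward": 400.0, "final_distance": 2.0},
    {"success": False, "collision": True, "timeout": False,
     "reward": -120.0, "final_distance": 80.0},
])
print(stats["success_rate"])  # 0.5
```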

Environment Configuration

AirSim Settings

Ensure your settings.json includes proper vessel and sensor configuration for RL training:

{
  "SettingsVersion": 2.0,
  "SimMode": "Vessel",
  "Vehicles": {
    "milliampere": {
      "VehicleType": "MilliAmpere",
      "HydroDynamics": {
        "hydrodynamics_engine": "FossenCurrent"
      },
      "PawnPath": "DefaultVessel",
      "AutoCreate": true,
      "RC": {
        "RemoteControlID": 0
      },
      "Sensors": {
        "lidar1": {
          "SensorType": 6,
          "Enabled": true,
          "NumberOfChannels": 8,
          "RotationsPerSecond": 1,
          "MeasurementsPerCycle": 450,
          "range": 100000,
          "X": 0,
          "Y": 0,
          "Z": -8.2,
          "Roll": 0,
          "Pitch": 0,
          "Yaw": 0,
          "VerticalFOVUpper": -2,
          "VerticalFOVLower": -10,
          "GenerateNoise": false,
          "DrawDebugPoints": false,
          "HorizontalFOVStart": -180,
          "HorizontalFOVEnd": 180
        },
        "Distance": {
          "SensorType": 5,
          "Enabled": true,
          "MaxDistance": 600,
          "DrawDebugPoints": false
        },
        "Imu": {
          "SensorType": 2,
          "Enabled": true
        }
      }
    }
  }
}

Dependencies

Install the required Python packages:

pip install gymnasium stable-baselines3 sb3-contrib wandb tensorboard numpy

Or install from the provided requirements.txt:

pip install -r requirements.txt

Spawning Obstacles

The environment automatically spawns static buoys and dynamic boats at each episode reset based on the --num-obstacles and --num-dynamic-obstacles arguments. For manual obstacle spawning and cleanup, see Vessel API — Spawning Obstacles.

Simulator Crash Recovery

The environment automatically handles simulator crashes. If a connection error occurs during step() or reset(), the environment will:

  1. Kill the existing simulator process
  2. Restart a fresh simulator instance
  3. Reconnect and re-activate PCG
  4. Continue training from the next episode
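The recovery loop above can be sketched as a retry wrapper around simulator calls. The function and parameter names are illustrative placeholders, not the environment's actual implementation:

```python
import time

def safe_call(fn, restart, retries=3, wait=10):
    """Call fn(); on a connection error, restart the simulator and retry."""
    for _ in range(retries):
        try:
            return fn()
        except ConnectionError:
            restart()         # kill the old simulator process, launch a fresh one
            time.sleep(wait)  # give the new instance time to start up
    raise RuntimeError("simulator did not recover after restarts")
```

Wrapping step() and reset() this way means a mid-episode crash costs at most one episode: the failed episode is abandoned and training resumes with the next reset.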

For additional examples and advanced usage, see the Vessel API documentation, Procedural Generation, and AirSim API reference.