Vessel Reinforcement Learning
Introduction
In the Python Client of ASVSim, we have implemented a reinforcement learning (RL) example that enables researchers and developers to train autonomous vessel navigation agents. The RL environment provides a simulation for training agents to navigate vessels through procedurally generated port environments with waypoint-based navigation while avoiding static and dynamic obstacles.
As the generalizability of RL methods is highly dependent on variety during training, we have implemented a procedural generation system that randomizes the port environment during training. The terrain is regenerated at configurable intervals. See Procedural Generation for more details. Furthermore, you can spawn static and dynamic obstacles in the environment using the simAddObstacle() function. See Vessel API for more details.
The reinforcement learning system is located in PythonClient/Vessel/ and consists of:
- `pcg_vessel_env.py` — Gymnasium environment with PCG terrain randomization
- `train.py` — Training script with CrossQ, wandb logging, and checkpointing
- `eval.py` — Evaluation script for trained checkpoints
Environment Overview (pcg_vessel_env.py)
Environment Specifications
The PCGVesselEnv environment implements a continuous control task where an agent learns to navigate a vessel through procedurally generated port sections to reach waypoints while avoiding obstacles. Note that this is an example: the environment, reward, action space, and observation space are not tuned for optimal RL training.
| Specification | Details |
|---|---|
| Action Space | Box(2,) - [thrust, rudder_angle] |
| Action Range | thrust: [0, 0.7], rudder: [0.48, 0.52] |
| Observation Space | Box(54,) - waypoint distances + vessel state + LiDAR |
| Episode Length | Maximum 800 timesteps (adjustable with action_repeat) |
| Success Condition | Reach within 10 meters of each waypoint |
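A higher `action_repeat` holds each action for several simulator steps, so the agent makes fewer decisions within the fixed 800-timestep budget. A minimal sketch of that wrapper (the `step_fn` callable is a hypothetical stand-in for the environment step, returning `(obs, reward, done)`):

```python
def step_with_repeat(step_fn, action, action_repeat=1):
    """Apply one action `action_repeat` times, accumulating reward.

    `step_fn` is a hypothetical stand-in for the simulator step and
    returns (obs, reward, done). Repetition stops early if the
    episode ends mid-repeat.
    """
    total_reward, obs, done = 0.0, None, False
    for _ in range(action_repeat):
        obs, reward, done = step_fn(action)
        total_reward += reward
        if done:
            break
    return obs, total_reward, done
```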
Observation Space Details
The observation vector contains 54 elements:
```python
obs = [
    # Waypoint distances (8)
    dx_prev_waypoint,               # X distance to previous waypoint
    dy_prev_waypoint,               # Y distance to previous waypoint
    dx_current_waypoint,            # X distance to current waypoint
    dy_current_waypoint,            # Y distance to current waypoint
    distance_to_current_waypoint,   # Euclidean distance to current waypoint
    dx_next_waypoint,               # X distance to next waypoint (0 if none)
    dy_next_waypoint,               # Y distance to next waypoint (0 if none)
    distance_to_next_waypoint,      # Euclidean distance to next waypoint (0 if none)
    # Heading (3)
    heading_error,                  # Angle error to current waypoint [-pi, pi]
    sin(heading),                   # Sine of vessel heading
    cos(heading),                   # Cosine of vessel heading
    # Dynamics (5)
    linear_velocity_x,              # X-axis linear velocity
    linear_velocity_y,              # Y-axis linear velocity
    linear_acceleration_x,          # X-axis linear acceleration
    linear_acceleration_y,          # Y-axis linear acceleration
    angular_acceleration_z,         # Z-axis angular acceleration
    # Previous actions (2)
    prev_thrust,                    # Previous thrust action
    prev_rudder_angle,              # Previous rudder angle action
    # LiDAR (36)
    lidar_sectors[0:36],            # 36 min-pooled sectors (10° each, 360° coverage)
]
```
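The `heading_error` entry is the signed angle between the vessel's heading and the bearing to the current waypoint, wrapped into [-pi, pi]. A sketch of that computation (hypothetical helper, not the environment's verbatim code):

```python
import math

def heading_error(vessel_heading, pos, waypoint):
    """Signed angle from the vessel's heading to the bearing of the
    current waypoint, wrapped into [-pi, pi].

    `pos` and `waypoint` are (x, y) tuples; angles are in radians.
    """
    bearing = math.atan2(waypoint[1] - pos[1], waypoint[0] - pos[0])
    return (bearing - vessel_heading + math.pi) % (2 * math.pi) - math.pi
```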
The LiDAR observations are created by min-pooling 3600 raw points into 36 sectors of 100 points each. Ground and vessel labels are filtered out so only obstacle distances remain.
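The min-pooling step can be sketched with NumPy, assuming a flat array of 3600 azimuth-ordered distances with ground/vessel returns already filtered out (an illustration, not the environment's exact implementation):

```python
import numpy as np

def pool_lidar(distances, num_sectors=36):
    """Min-pool raw LiDAR distances into fixed angular sectors.

    Assumes `distances` is a 1-D array ordered by azimuth; each
    sector keeps the nearest return (3600 points -> 36 sectors of
    100 points each).
    """
    per_sector = len(distances) // num_sectors  # 3600 // 36 = 100
    trimmed = distances[: per_sector * num_sectors]
    return trimmed.reshape(num_sectors, per_sector).min(axis=1)
```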
Action Space Details
| Action | Range | Description |
|---|---|---|
| `thrust` | [0, 0.7] | Forward propulsion control |
| `rudder_angle` | [0.48, 0.52] | Steering control (0.5 = straight) |
Waypoint Navigation
The environment supports multi-waypoint navigation through PCG-generated port sections. The goal_distance parameter controls how many sections (waypoints) the agent must navigate through. When an intermediate waypoint is reached (within 10m), the agent automatically advances to the next waypoint. The episode ends successfully only when the final waypoint is reached.
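The advancement logic described above can be sketched as follows (hypothetical helper; `waypoints` is a list of (x, y) tuples and the 10 m radius comes from the environment spec):

```python
import math

WAYPOINT_RADIUS = 10.0  # meters, from the environment specification

def advance_waypoint(pos, waypoints, current_idx):
    """Advance to the next waypoint when the vessel is within 10 m.

    Returns (new_index, done): done is True only when the final
    waypoint has been reached.
    """
    wx, wy = waypoints[current_idx]
    if math.hypot(pos[0] - wx, pos[1] - wy) <= WAYPOINT_RADIUS:
        if current_idx == len(waypoints) - 1:
            return current_idx, True   # final waypoint reached
        return current_idx + 1, False  # advance to the next waypoint
    return current_idx, False
```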
Reward Structure
| Condition | Reward |
|---|---|
| Progress toward waypoint | 1.0 * (prev_distance - current_distance) |
| Time penalty (per step) | -0.1 |
| Collision | -100.0 (episode terminates) |
| Final waypoint reached | +500.0 (episode terminates) |
| Timeout | Episode truncated |
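Putting the table together, a per-step reward sketch (values taken from the table above; not the environment's verbatim code):

```python
def compute_reward(prev_dist, curr_dist, collided=False, reached_final=False):
    """Per-step reward following the table above.

    Returns (reward, terminated). Progress shaping rewards closing
    the distance to the current waypoint; every step also pays a
    small time penalty.
    """
    if collided:
        return -100.0, True
    if reached_final:
        return 500.0, True
    return 1.0 * (prev_dist - curr_dist) - 0.1, False
```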
Training (train.py)
Overview
The training script uses CrossQ from sb3-contrib, a sample-efficient off-policy algorithm for continuous control. Training includes observation normalization via VecNormalize, periodic model checkpointing, and optional Weights & Biases logging.
Basic Usage
```bash
# Minimal — start training with defaults
python train.py --sim-path path/to/Blocks.exe

# Full example with all options
python train.py \
    --sim-path path/to/Blocks.exe \
    --timesteps 2500000 \
    --terrain-regen 10 \
    --num-obstacles 4 \
    --num-dynamic-obstacles 0 \
    --num-waypoints 1 \
    --action-repeat 1 \
    --seed 43 \
    --wandb-key YOUR_WANDB_KEY
```
Command Line Arguments
| Argument | Default | Description |
|---|---|---|
| `--ip` | `127.0.0.1` | Simulator IP address |
| `--timesteps` | `2500000` | Total training timesteps |
| `--terrain-regen` | `10` | Regenerate PCG terrain every N episodes |
| `--num-obstacles` | `4` | Number of static obstacles per episode |
| `--num-dynamic-obstacles` | `0` | Number of moving obstacles per episode |
| `--num-waypoints` | `1` | Number of waypoints to navigate (1 or 2) |
| `--sim-path` | `Blocks/Blocks.exe` | Path to simulator executable |
| `--sim-wait` | `10` | Seconds to wait for simulator startup |
| `--sim-log` | `false` | Log simulator output to logs/sim.log |
| `--action-repeat` | `1` | Number of times to repeat each action |
| `--seed` | `43` | Random seed for reproducibility |
| `--wandb-key` | `None` | Weights & Biases API key (wandb disabled if not set) |
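These flags follow standard argparse conventions; a trimmed sketch of how a few of them might be declared (illustrative, not the script's verbatim parser):

```python
import argparse

# Hypothetical parser mirroring a subset of the flags above.
parser = argparse.ArgumentParser(description="Train CrossQ on PCGVesselEnv")
parser.add_argument("--ip", default="127.0.0.1", help="Simulator IP address")
parser.add_argument("--timesteps", type=int, default=2_500_000)
parser.add_argument("--terrain-regen", type=int, default=10,
                    help="Regenerate PCG terrain every N episodes")
parser.add_argument("--num-waypoints", type=int, default=1, choices=(1, 2))
parser.add_argument("--seed", type=int, default=43)
parser.add_argument("--wandb-key", default=None,
                    help="wandb is disabled when not provided")
args = parser.parse_args([])  # defaults; the real script passes sys.argv
```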
CrossQ Hyperparameters
The default hyperparameters used in training:
```python
model = CrossQ(
    "MlpPolicy",
    env,
    learning_rate=0.0003,
    gamma=0.99,
    batch_size=256,
    buffer_size=500000,
    learning_starts=5000,
    train_freq=1,
    stats_window_size=10,
    policy_kwargs=dict(net_arch=[512, 512]),
)
```
Output Structure
Training outputs are saved to the logs/ directory:
```
logs/
├── training/
│   ├── models/   # Checkpoints every 25k steps + final model
│   └── tb/       # TensorBoard logs
└── sim.log       # Simulator output (if --sim-log is set)
```
Evaluation (eval.py)
Usage
```bash
# Evaluate a checkpoint
python eval.py --checkpoint logs/training/models/crossq_pcg_vessel_25000_steps.zip

# Evaluate with more episodes
python eval.py --checkpoint logs/training/models/crossq_pcg_vessel_policy.zip --episodes 200
```
The evaluation script loads a trained CrossQ checkpoint, runs it for a number of episodes, and prints per-episode results along with aggregate statistics including success rate, collision rate, timeout rate, mean reward, and mean final distance to goal.
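The aggregate statistics can be computed along these lines (the per-episode result structure here is hypothetical, for illustration only):

```python
def aggregate(results):
    """Aggregate per-episode outcomes into summary statistics.

    `results` is a list of dicts with keys 'outcome' ('success' |
    'collision' | 'timeout'), 'reward', and 'final_distance' —
    a hypothetical structure mirroring what eval.py reports.
    """
    n = len(results)
    stats = {k: sum(r["outcome"] == k for r in results) / n
             for k in ("success", "collision", "timeout")}
    stats["mean_reward"] = sum(r["reward"] for r in results) / n
    stats["mean_final_distance"] = sum(r["final_distance"] for r in results) / n
    return stats
```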
Environment Configuration
AirSim Settings
Ensure your settings.json includes proper vessel and sensor configuration for RL training:
```json
{
    "SettingsVersion": 2.0,
    "SimMode": "Vessel",
    "Vehicles": {
        "milliampere": {
            "VehicleType": "MilliAmpere",
            "HydroDynamics": {
                "hydrodynamics_engine": "FossenCurrent"
            },
            "PawnPath": "DefaultVessel",
            "AutoCreate": true,
            "RC": {
                "RemoteControlID": 0
            },
            "Sensors": {
                "lidar1": {
                    "SensorType": 6,
                    "Enabled": true,
                    "NumberOfChannels": 8,
                    "RotationsPerSecond": 1,
                    "MeasurementsPerCycle": 450,
                    "range": 100000,
                    "X": 0,
                    "Y": 0,
                    "Z": -8.2,
                    "Roll": 0,
                    "Pitch": 0,
                    "Yaw": 0,
                    "VerticalFOVUpper": -2,
                    "VerticalFOVLower": -10,
                    "GenerateNoise": false,
                    "DrawDebugPoints": false,
                    "HorizontalFOVStart": -180,
                    "HorizontalFOVEnd": 180
                },
                "Distance": {
                    "SensorType": 5,
                    "Enabled": true,
                    "MaxDistance": 600,
                    "DrawDebugPoints": false
                },
                "Imu": {
                    "SensorType": 2,
                    "Enabled": true
                }
            }
        }
    }
}
```
Dependencies
Install the required Python packages:
```bash
pip install gymnasium stable-baselines3 sb3-contrib wandb tensorboard numpy
```
Or install from the provided requirements.txt:
```bash
pip install -r requirements.txt
```
Spawning Obstacles
The environment automatically spawns static buoys and dynamic boats at each episode reset based on the --num-obstacles and --num-dynamic-obstacles arguments. For manual obstacle spawning and cleanup, see Vessel API — Spawning Obstacles.
Simulator Crash Recovery
The environment automatically handles simulator crashes. If a connection error occurs during step() or reset(), the environment will:
- Kill the existing simulator process
- Restart a fresh simulator instance
- Reconnect and re-activate PCG
- Continue training from the next episode
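The recovery flow amounts to a retry loop around the client call; a sketch of that pattern (the `step_fn` and `restart_fn` callables are hypothetical stand-ins for the environment's internals):

```python
def call_with_recovery(step_fn, restart_fn, max_restarts=3):
    """Invoke a simulator call, restarting the simulator on
    connection errors, mirroring the recovery steps listed above.

    `step_fn` performs the simulator call; `restart_fn` kills the
    process, relaunches it, reconnects, and re-activates PCG.
    Both are hypothetical stand-ins for illustration.
    """
    for _ in range(max_restarts + 1):
        try:
            return step_fn()
        except ConnectionError:
            restart_fn()
    raise RuntimeError("simulator did not recover after restarts")
```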
For additional examples and advanced usage, see the Vessel API documentation, Procedural Generation, and AirSim API reference.