Agents must beat unmanaged baseline #6

@jkbennitt

The Problem

First paired benchmark against a live RimWorld colony shows agents are not helping:

Agent:    0.801 ± 0.03
Baseline: 0.830 ± 0.00
Delta:    -0.029 (p = 0.37)

The unmanaged colony (RimWorld's built-in pawn AI) scores higher than our 6-agent team. The gap is not statistically significant (p = 0.37), but the agents are clearly not helping and are likely net-negative: they issue actions that fail or disrupt colonist routines.

Why Agents Are Losing

1. High action failure rate

  • set_growing_zone → RIMAPI 500 every time (fork bug, tracked separately)
  • place_blueprint → agent doesn't include x,z coordinates
  • toggle_power → agent sends building_id=0 (no valid IDs in state)
  • haul_resource → RIMAPI rejects the job assignment

Agents propose ~14 actions per tick but only ~6 execute (about 43%). The rest fail silently, so those ticks are spent without any benefit to the colony.
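To make these failures visible instead of silent, the per-tick execution rate can be computed from the action results. This is a minimal sketch, assuming each result is a dict with hypothetical `action` and `ok` keys. The real result shape in this repo may differ, and the sample tick is illustrative, not logged data.

```python
from collections import Counter

def execution_rate(results):
    """Return (rate, failures_by_action) for one tick's action results.

    `results` is assumed to be a list of dicts with hypothetical keys
    "action" (str) and "ok" (bool).
    """
    if not results:
        return 0.0, Counter()
    failures = Counter(r["action"] for r in results if not r["ok"])
    executed = sum(1 for r in results if r["ok"])
    return executed / len(results), failures

# Illustrative tick shaped like the rough numbers above: 14 proposed, 6 executed.
tick = [{"action": "set_work_priority", "ok": True}] * 6
tick += [{"action": "set_growing_zone", "ok": False}] * 8

rate, failures = execution_rate(tick)
print(f"execution rate: {rate:.0%}, failures: {dict(failures)}")
```

Logging the failure counter per action type each tick would show which of the four failure modes dominates.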

2. Agents disrupt productive colonist behavior

  • RimWorld's built-in AI already assigns colonists to work, eat, sleep, haul
  • Our agents override work priorities, draft colonists away from tasks, reassign researchers
  • If the override is wrong or the action fails, the colonist is worse off than if we'd done nothing

3. No understanding of what's already working

  • Agents see a snapshot of colony state but don't know what colonists are currently doing
  • They propose "set_work_priority growing=1" but the colonist is already growing
  • The action succeeds but adds no value — and may disrupt the colonist's current task queue

4. 10-second tick interval means minimal game progression

  • Colony runs for 10 seconds between deliberation cycles
  • Not enough time for actions to have measurable impact before the next override

What Needs to Change

Fix action reliability first

  • Fix set_growing_zone RIMAPI fork bug
  • Teach agents to include coordinates for blueprints
  • Expose building IDs in filtered state for toggle_power
  • Get execution rate from 43% to 90%+
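One way to raise the execution rate is to reject malformed actions before they reach RIMAPI. The sketch below covers only the two argument problems listed above (missing `x`, `z` for `place_blueprint`, invalid `building_id` for `toggle_power`). The action dict shape and the `known_building_ids` set are assumptions, not the project's actual schema.

```python
def validate_action(action, known_building_ids):
    """Return a list of problems; an empty list means the action looks sendable.

    `action` is assumed to be a dict with a "type" key plus parameters
    (hypothetical shape). `known_building_ids` is the set of IDs that the
    filtered state would expose.
    """
    problems = []
    kind = action.get("type")
    if kind == "place_blueprint":
        for coord in ("x", "z"):
            if not isinstance(action.get(coord), int):
                problems.append(f"place_blueprint missing integer {coord}")
    elif kind == "toggle_power":
        if action.get("building_id") not in known_building_ids:
            problems.append("toggle_power has unknown building_id")
    return problems

# A blueprint without z and a toggle with building_id=0 are both rejected.
print(validate_action({"type": "place_blueprint", "x": 3}, {12, 15}))
print(validate_action({"type": "toggle_power", "building_id": 0}, {12, 15}))
```

Rejected actions could be fed back to the agent as an error message rather than being sent and failing.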

Make agents aware of current colonist activity

  • Add current_activity or current_job to colonist state (if RIMAPI exposes it)
  • Agents should propose NO_ACTION when colonists are already doing the right thing
  • Penalize unnecessary overrides in the scoring
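If colonist state does gain a current-job field, redundant overrides can be filtered before execution. This is a sketch under assumptions: `colonist_id`, `work_type`, and `current_job` are hypothetical field names, and whether RIMAPI exposes the current job is still open.

```python
NO_ACTION = {"type": "NO_ACTION"}

def drop_redundant(action, colonists):
    """Return NO_ACTION if a work-priority change targets work already in progress.

    Assumes hypothetical fields: action["colonist_id"], action["work_type"],
    and colonists[colonist_id]["current_job"].
    """
    if action.get("type") != "set_work_priority":
        return action
    colonist = colonists.get(action.get("colonist_id"), {})
    if colonist.get("current_job") == action.get("work_type"):
        return NO_ACTION
    return action

colonists = {"c1": {"current_job": "growing"}, "c2": {"current_job": "hauling"}}
redundant = {"type": "set_work_priority", "colonist_id": "c1", "work_type": "growing"}
useful = {"type": "set_work_priority", "colonist_id": "c2", "work_type": "growing"}

print(drop_redundant(redundant, colonists))  # replaced by NO_ACTION
print(drop_redundant(useful, colonists))     # passed through unchanged
```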

Increase tick interval for meaningful progression

  • Test with 30-60 second tick intervals so colony state actually changes between ticks
  • Fewer but higher-quality interventions > many disruptive ones

Add "do no harm" principle to agent prompts

  • System prompt: "Only propose actions that improve on the colony's current trajectory. If colonists are already productive, propose NO_ACTION."
  • Weight NO_ACTION higher in the conflict resolver when no crisis exists
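The NO_ACTION weighting could be as simple as a score bonus applied only when no crisis is active. The sketch below is not the existing conflict resolver. The `(action_type, score)` proposal shape, the `crisis` flag, and the `no_action_bonus` value are all illustrative assumptions.

```python
def resolve(proposals, crisis, no_action_bonus=0.5):
    """Pick the best-scoring action type, boosting NO_ACTION outside a crisis.

    `proposals` is assumed to be a list of (action_type, score) pairs.
    """
    if not proposals:
        return "NO_ACTION"

    def weighted(proposal):
        action_type, score = proposal
        if action_type == "NO_ACTION" and not crisis:
            return score + no_action_bonus
        return score

    return max(proposals, key=weighted)[0]

proposals = [("set_work_priority", 0.6), ("NO_ACTION", 0.3)]
print(resolve(proposals, crisis=False))  # NO_ACTION wins with the bonus
print(resolve(proposals, crisis=True))   # the intervention wins during a crisis
```

The bonus would need tuning against the benchmark. Too high and the agents never act, too low and the disruption continues.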

Success Criteria

The target benchmark result looks like this:

Agent:    0.85 ± 0.05
Baseline: 0.75 ± 0.03
Delta:    +0.10** (p < 0.05)

Agents must demonstrably improve colony outcomes. Until then, the benchmark is failing honestly.
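The pass condition can be written down as a small check: a positive delta that is also statistically significant. This is a sketch, assuming the mean scores and p-value are already computed by the benchmark. The function name and the `alpha` parameter are hypothetical.

```python
def benchmark_passes(agent_mean, baseline_mean, p_value, alpha=0.05):
    """True only if the agents score higher and the difference is significant."""
    delta = agent_mean - baseline_mean
    return delta > 0 and p_value < alpha

# First paired run reported above: negative delta, p = 0.37.
print(benchmark_passes(0.801, 0.830, 0.37))  # False
```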

How to Reproduce

# Requires: RimWorld running, RIMAPI mod, LM Studio with Nemotron Nano 4B
# Save a Crashlanded colony as "rle_crashlanded_v1"

# Agent run
python scripts/run_scenario.py crashlanded_survival \
  --provider openai --model nvidia/nemotron-3-nano-4b \
  --base-url http://localhost:1234/v1 \
  --no-think --ticks 10

# Unmanaged baseline (no agents)
python scripts/run_scenario.py crashlanded_survival --no-agent --ticks 10

Compare the two final scores. The agent run must score higher than the baseline.
