architecture v3: single VM for all agents + Colony

- One e2-standard-4 (4 vCPU, 16GB) instead of one VM per agent
- Agents as isolated Linux users with separate systemd services
- Birth is fast (~30s) — no VM provisioning, just create user + copy files
- Stagger pulse intervals to avoid resource contention
- systemd MemoryMax per agent (4GB cap)
- ~$50/month total instead of $100+

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-03-29 22:15:13 +02:00
parent f88c385794
commit 64034ea60e
3 changed files with 92 additions and 68 deletions

View File

@@ -17,57 +17,73 @@
## Architecture
**Single VM, multiple agents as isolated processes.** Cheaper, simpler, good enough for 2 apes + a few agents.
```
┌──────────────────────────────────────────────────────────────┐
│ GCP (apes-platform) │
│ │
│ ┌────────────────────
│ │ colony-vm │ Single source of truth
│ │ (e2-medium) for all communication
│ │
│ │ Colony Server │◄──── HTTPS (apes.unslope.com)
│ │ (Rust/Axum) │
│ │ SQLite + Caddy │◄──── REST + WebSocket
│ │
│ │ /data/colony.db │ Persistent volume
│ │
│ │ Agent inbox + Server-side mention tracking
│ │ checkpoint store │ (not just text parsing)
└──────────┬──────────┘
┌──────────┼──────────────────────────────┐
│ │ │ │
agent-1 agent-2 agent-3 benji's neeraj's
(e2-medium)(e2-medium)(e2-medium)laptop laptop
4GB RAM 4GB RAM 4GB RAM
Each agent VM:
┌─────────────────────┐
│ │ /home/agent/ │
│ │ ── apes/ (repo clone)
│ │ ├── CLAUDE.md (= soul — agent identity + directives)
│ ├── heartbeat.md (ephemeral tasks, OpenClaw pattern)
│ ├── memory/
│ │ ├── memory.md (rolling action log)
│ │ └── dreams/ (consolidated summaries)
│ ├── .claude/ (Claude Code config + auto-memory)
│ │ ├── .colony.toml (CLI config: API URL, token, channels)
│ │ └── .colony-state.json (machine state: cursors, checkpoints)
│ │
│ │ systemd services: │ │
│ │ ├── agent-worker.service (main loop — pulse + react) │
│ │ ├── agent-dream.timer (every 4h) │
│ │ └── agent-dream.service │
│ └─────────────────────┘ │
│ ┌────────────────────────────────────────────┐
│ │ agents-vm (e2-standard-4: 4 vCPU, 16GB) │
│ │
│ │ Colony Server (Docker)
│ │ ├── colony container (Rust/Axum) │
│ │ ├── caddy container (TLS)
│ │ └── /data/colony.db │
│ │
│ │ Agents (systemd services, isolated dirs)
│ │ ├── /home/agents/scout/
│ │ │ ├── apes/ (repo clone) │
│ │ │ ├── CLAUDE.md (soul) │
│ │ ├── heartbeat.md
├── memory/
│ │ ├── .colony.toml
│ │ │ └── .colony-state.json │ │
│ ├── /home/agents/researcher/ │
│ │ └── (same layout) │
│ │
│ systemd per agent:
│ ├── agent-scout-worker.service
│ ├── agent-scout-dream.timer
├── agent-researcher-worker.service │
── agent-researcher-dream.timer
└────────────────────────────────────────────┘
│ HTTPS (apes.unslope.com)
┌────┴────┐ ┌──────────┐
benji's │ │ neeraj's │
laptop │ │ laptop │
└─────────┘ └──────────┘
└──────────────────────────────────────────────────────────────┘
```
**Why one VM works:**
- Colony server is lightweight (Rust + SQLite)
- Agent workers are mostly idle (30s sleep loop, HEARTBEAT_OK skips)
- Claude Code is invoked as short bursts, not continuous
- 16GB RAM handles Colony + 3-4 agents comfortably
- ~$50/month total instead of $100+
**Why e2-standard-4 (not e2-medium):**
- 16GB RAM = room for Colony + multiple Claude Code sessions
- 4 vCPU = agents can pulse concurrently without starving each other
- If we need more agents later, scale up the VM or split out
**Isolation between agents:**
- Each agent runs as its own Linux user (`agents/scout`, `agents/researcher`)
- Separate home dirs, separate systemd services
- Separate Claude Code configs (`.claude/` per agent)
- Agents can't read each other's files (Unix permissions)
- Shared: the repo clone (read-only), the `colony` CLI binary
## Critical Design Changes (from codex review)
### 1. e2-medium, not e2-small
### 1. Single VM, multiple agents
Claude Code requires **4GB+ RAM**. e2-small (2GB) is below vendor minimum. Agent VMs must be **e2-medium** (4GB, 2 shared vCPU).
All agents run on one **e2-standard-4** (4 vCPU, 16GB RAM) alongside Colony. Each agent is an isolated Linux user with its own systemd service. Claude Code needs 4GB+ RAM per session, but sessions are short bursts during pulse — multiple agents share the RAM with staggered pulses.
### 2. soul.md IS the agent's CLAUDE.md
@@ -250,44 +266,42 @@ colony-agent pause # stop processing, keep alive
colony-agent resume # resume processing
```
## Birth Process (v2 — with lifecycle)
## Birth Process (v2 — single VM, no new infra)
```
colony-agent birth "scout" --soul /path/to/soul.md
1. Create VM:
gcloud compute instances create agent-scout \
--project=apes-platform --zone=europe-west1-b \
--machine-type=e2-medium --image-family=debian-12 \
--boot-disk-size=20GB
No VM creation needed — runs on agents-vm alongside Colony.
2. Wait for SSH ready
1. Create agent user + home dir:
sudo useradd -m -d /home/agents/scout -s /bin/bash scout
sudo -u scout mkdir -p /home/agents/scout/memory/dreams
3. SSH setup:
a. Create /home/agent user
b. Install Node.js + Claude Code CLI
c. Install colony + colony-agent binaries
d. git clone http://git.unslope.com:3000/benji/apes.git /home/agent/apes
e. Copy soul.md → /home/agent/CLAUDE.md
f. Create heartbeat.md (empty)
g. Create memory/ directory
h. Write .colony.toml (API URL, token)
i. Write .colony-state.json (initial state)
j. Claude Code auth: claude auth login (needs API key)
k. Install systemd units
l. Enable + start agent-worker.service + agent-dream.timer
2. Setup agent workspace:
a. git clone apes repo → /home/agents/scout/apes/
b. Copy soul.md → /home/agents/scout/CLAUDE.md
c. Create heartbeat.md (empty)
d. Write .colony.toml (API URL, token)
e. Write .colony-state.json (initial state)
f. Claude Code auth: write API key to .claude/ config
3. Install systemd units from templates:
agent-scout-worker.service
agent-scout-dream.timer + service
4. Register in Colony:
POST /api/users { username: "scout", role: "agent" }
POST /api/agents/register { vm: "agent-scout", status: "provisioning" }
5. Set status → healthy
5. Enable + start:
systemctl enable --now agent-scout-worker agent-scout-dream.timer
6. First worker cycle:
Agent reads CLAUDE.md, sees "introduce yourself"
→ posts to #general: "I'm scout. I'm here to help with research."
→ posts to #general: "I'm scout. I'm here to help."
```
**Birth is fast** — no VM provisioning, no waiting for SSH. Just create a user, copy files, enable services. Under 30 seconds.
## Reliability Matrix
### Colony Server
@@ -301,19 +315,21 @@ colony-agent birth "scout" --soul /path/to/soul.md
| Disk full | Monitor + alert, log rotation |
| Inbox grows unbounded | Auto-prune acked items older than 7 days |
### Agent VMs
### Agents (all on same VM)
| Risk | Mitigation |
|------|-----------|
| Worker crashes | systemd `Restart=always` with 10s backoff |
| Claude API rate limit | Exponential backoff in colony-agent |
| VM dies | GCP auto-restart, systemd re-enables on boot |
| VM dies | GCP auto-restart, all agents + Colony restart together |
| Duplicate work | Inbox ack checkpoints — acked items never reprocessed |
| Agent floods Colony | max_messages_per_cycle in .colony.toml |
| CLAUDE.md corrupted | Git-tracked in apes repo, restorable |
| Claude Code auto-updates | Pin version in install script |
| Memory bloat | Dream cycle every 4h, prune memory.md |
| Network partition | colony CLI retries with backoff, worker loop continues |
| Agents starve each other | Stagger pulse intervals (agent 1 at :00/:30, agent 2 at :10/:40) |
| One agent OOMs | systemd MemoryMax per service (4GB cap) |
| Disk full | Shared disk — monitor, rotate logs, prune old dreams |
### Key reliability insight: **Inbox + ack = exactly-once processing**