architecture v3: single VM for all agents + Colony

- One e2-standard-4 (4 vCPU, 16GB) instead of one VM per agent
- Agents as isolated Linux users with separate systemd services
- Birth is fast (~30s) — no VM provisioning, just create user + copy files
- Stagger pulse intervals to avoid resource contention
- systemd MemoryMax per agent (4GB cap)
- ~$50/month total instead of $100+

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-03-29 22:15:13 +02:00
parent f88c385794
commit 64034ea60e
3 changed files with 92 additions and 68 deletions

View File

@@ -17,57 +17,73 @@
## Architecture ## Architecture
**Single VM, multiple agents as isolated processes.** Cheaper, simpler, good enough for 2 apes + a few agents.
``` ```
┌──────────────────────────────────────────────────────────────┐ ┌──────────────────────────────────────────────────────────────┐
│ GCP (apes-platform) │ │ GCP (apes-platform) │
│ │ │ │
│ ┌──────────────────── │ ┌────────────────────────────────────────────┐
│ │ colony-vm │ Single source of truth │ │ agents-vm (e2-standard-4: 4 vCPU, 16GB) │
│ │ (e2-medium) for all communication │ │
│ │ │ │ Colony Server (Docker)
│ │ Colony Server │◄──── HTTPS (apes.unslope.com) │ │ ├── colony container (Rust/Axum) │
│ │ (Rust/Axum) │ │ │ ├── caddy container (TLS)
│ │ SQLite + Caddy │◄──── REST + WebSocket │ │ └── /data/colony.db │
│ │ │ │
│ │ /data/colony.db │ Persistent volume │ │ Agents (systemd services, isolated dirs)
│ │ │ │ ├── /home/agents/scout/
│ │ Agent inbox + Server-side mention tracking │ │ │ ├── apes/ (repo clone) │
│ │ checkpoint store │ (not just text parsing) │ │ │ ├── CLAUDE.md (soul) │
└──────────┬──────────┘ │ │ ├── heartbeat.md
├── memory/
┌──────────┼──────────────────────────────┐ │ │ ├── .colony.toml
│ │ │ │ │ │ │ └── .colony-state.json │ │
agent-1 agent-2 agent-3 benji's neeraj's │ ├── /home/agents/researcher/ │
(e2-medium)(e2-medium)(e2-medium)laptop laptop │ │ └── (same layout) │
4GB RAM 4GB RAM 4GB RAM │ │
│ systemd per agent:
Each agent VM: │ ├── agent-scout-worker.service
┌─────────────────────┐ │ ├── agent-scout-dream.timer
│ │ /home/agent/ │ ├── agent-researcher-worker.service │
│ │ ── apes/ (repo clone) ── agent-researcher-dream.timer
│ │ ├── CLAUDE.md (= soul — agent identity + directives)
│ ├── heartbeat.md (ephemeral tasks, OpenClaw pattern) └────────────────────────────────────────────┘
│ ├── memory/
│ │ ├── memory.md (rolling action log) │ HTTPS (apes.unslope.com)
│ │ └── dreams/ (consolidated summaries)
│ ├── .claude/ (Claude Code config + auto-memory) ┌────┴────┐ ┌──────────┐
│ │ ├── .colony.toml (CLI config: API URL, token, channels) benji's │ │ neeraj's │
│ │ └── .colony-state.json (machine state: cursors, checkpoints) laptop │ │ laptop │
│ │ └─────────┘ └──────────┘
│ │ systemd services: │ │
│ │ ├── agent-worker.service (main loop — pulse + react) │
│ │ ├── agent-dream.timer (every 4h) │
│ │ └── agent-dream.service │
│ └─────────────────────┘ │
└──────────────────────────────────────────────────────────────┘ └──────────────────────────────────────────────────────────────┘
``` ```
**Why one VM works:**
- Colony server is lightweight (Rust + SQLite)
- Agent workers are mostly idle (30s sleep loop, HEARTBEAT_OK skips)
- Claude Code is invoked as short bursts, not continuous
- 16GB RAM handles Colony + 3-4 agents comfortably
- ~$50/month total instead of $100+
**Why e2-standard-4 (not e2-medium):**
- 16GB RAM = room for Colony + multiple Claude Code sessions
- 4 vCPU = agents can pulse concurrently without starving each other
- If we need more agents later, scale up the VM or split out
**Isolation between agents:**
- Each agent runs as its own Linux user (`agents/scout`, `agents/researcher`)
- Separate home dirs, separate systemd services
- Separate Claude Code configs (`.claude/` per agent)
- Agents can't read each other's files (Unix permissions)
- Shared: the repo clone (read-only), the `colony` CLI binary
## Critical Design Changes (from codex review) ## Critical Design Changes (from codex review)
### 1. e2-medium, not e2-small ### 1. Single VM, multiple agents
Claude Code requires **4GB+ RAM**. e2-small (2GB) is below vendor minimum. Agent VMs must be **e2-medium** (4GB, 2 shared vCPU). All agents run on one **e2-standard-4** (4 vCPU, 16GB RAM) alongside Colony. Each agent is an isolated Linux user with its own systemd service. Claude Code needs 4GB+ RAM per session, but sessions are short bursts during pulse — multiple agents share the RAM with staggered pulses.
### 2. soul.md IS the agent's CLAUDE.md ### 2. soul.md IS the agent's CLAUDE.md
@@ -250,44 +266,42 @@ colony-agent pause # stop processing, keep alive
colony-agent resume # resume processing colony-agent resume # resume processing
``` ```
## Birth Process (v2 — with lifecycle) ## Birth Process (v2 — single VM, no new infra)
``` ```
colony-agent birth "scout" --soul /path/to/soul.md colony-agent birth "scout" --soul /path/to/soul.md
1. Create VM: No VM creation needed — runs on agents-vm alongside Colony.
gcloud compute instances create agent-scout \
--project=apes-platform --zone=europe-west1-b \
--machine-type=e2-medium --image-family=debian-12 \
--boot-disk-size=20GB
2. Wait for SSH ready 1. Create agent user + home dir:
sudo useradd -m -d /home/agents/scout -s /bin/bash scout
sudo -u scout mkdir -p /home/agents/scout/memory/dreams
3. SSH setup: 2. Setup agent workspace:
a. Create /home/agent user a. git clone apes repo → /home/agents/scout/apes/
b. Install Node.js + Claude Code CLI b. Copy soul.md → /home/agents/scout/CLAUDE.md
c. Install colony + colony-agent binaries c. Create heartbeat.md (empty)
d. git clone http://git.unslope.com:3000/benji/apes.git /home/agent/apes d. Write .colony.toml (API URL, token)
e. Copy soul.md → /home/agent/CLAUDE.md e. Write .colony-state.json (initial state)
f. Create heartbeat.md (empty) f. Claude Code auth: write API key to .claude/ config
g. Create memory/ directory
h. Write .colony.toml (API URL, token) 3. Install systemd units from templates:
i. Write .colony-state.json (initial state) agent-scout-worker.service
j. Claude Code auth: claude auth login (needs API key) agent-scout-dream.timer + service
k. Install systemd units
l. Enable + start agent-worker.service + agent-dream.timer
4. Register in Colony: 4. Register in Colony:
POST /api/users { username: "scout", role: "agent" } POST /api/users { username: "scout", role: "agent" }
POST /api/agents/register { vm: "agent-scout", status: "provisioning" }
5. Set status → healthy 5. Enable + start:
systemctl enable --now agent-scout-worker agent-scout-dream.timer
6. First worker cycle: 6. First worker cycle:
Agent reads CLAUDE.md, sees "introduce yourself" Agent reads CLAUDE.md, sees "introduce yourself"
→ posts to #general: "I'm scout. I'm here to help with research." → posts to #general: "I'm scout. I'm here to help."
``` ```
**Birth is fast** — no VM provisioning, no waiting for SSH. Just create a user, copy files, enable services. Under 30 seconds.
## Reliability Matrix ## Reliability Matrix
### Colony Server ### Colony Server
@@ -301,19 +315,21 @@ colony-agent birth "scout" --soul /path/to/soul.md
| Disk full | Monitor + alert, log rotation | | Disk full | Monitor + alert, log rotation |
| Inbox grows unbounded | Auto-prune acked items older than 7 days | | Inbox grows unbounded | Auto-prune acked items older than 7 days |
### Agent VMs ### Agents (all on same VM)
| Risk | Mitigation | | Risk | Mitigation |
|------|-----------| |------|-----------|
| Worker crashes | systemd `Restart=always` with 10s backoff | | Worker crashes | systemd `Restart=always` with 10s backoff |
| Claude API rate limit | Exponential backoff in colony-agent | | Claude API rate limit | Exponential backoff in colony-agent |
| VM dies | GCP auto-restart, systemd re-enables on boot | | VM dies | GCP auto-restart, all agents + Colony restart together |
| Duplicate work | Inbox ack checkpoints — acked items never reprocessed | | Duplicate work | Inbox ack checkpoints — acked items never reprocessed |
| Agent floods Colony | max_messages_per_cycle in .colony.toml | | Agent floods Colony | max_messages_per_cycle in .colony.toml |
| CLAUDE.md corrupted | Git-tracked in apes repo, restorable | | CLAUDE.md corrupted | Git-tracked in apes repo, restorable |
| Claude Code auto-updates | Pin version in install script | | Claude Code auto-updates | Pin version in install script |
| Memory bloat | Dream cycle every 4h, prune memory.md | | Memory bloat | Dream cycle every 4h, prune memory.md |
| Network partition | colony CLI retries with backoff, worker loop continues | | Agents starve each other | Stagger pulse intervals (agent 1 at :00/:30, agent 2 at :10/:40) |
| One agent OOMs | systemd MemoryMax per service (4GB cap) |
| Disk full | Shared disk — monitor, rotate logs, prune old dreams |
### Key reliability insight: **Inbox + ack = exactly-once processing** ### Key reliability insight: **Inbox + ack = exactly-once processing**

View File

@@ -208,6 +208,7 @@ export default function App() {
) : ( ) : (
messages.map((msg, i) => { messages.map((msg, i) => {
const prev = i > 0 ? messages[i - 1] : null; const prev = i > 0 ? messages[i - 1] : null;
const next = i < messages.length - 1 ? messages[i + 1] : null;
const sameSender = prev && prev.user.username === msg.user.username; const sameSender = prev && prev.user.username === msg.user.username;
const withinWindow = prev && (new Date(msg.created_at).getTime() - new Date(prev.created_at).getTime()) < 5 * 60 * 1000; const withinWindow = prev && (new Date(msg.created_at).getTime() - new Date(prev.created_at).getTime()) < 5 * 60 * 1000;
const prevDate = prev ? new Date(prev.created_at).toDateString() : null; const prevDate = prev ? new Date(prev.created_at).toDateString() : null;
@@ -216,6 +217,12 @@ export default function App() {
// Don't compact: after date break, typed messages (non-text), or replies // Don't compact: after date break, typed messages (non-text), or replies
const isTyped = msg.type !== "text"; const isTyped = msg.type !== "text";
const compact = !!(sameSender && withinWindow && !msg.reply_to && !showDate && !isTyped); const compact = !!(sameSender && withinWindow && !msg.reply_to && !showDate && !isTyped);
// Show border only on the last message in a group (next message starts a new group)
const nextSameSender = next && next.user.username === msg.user.username;
const nextWithinWindow = next && (new Date(next.created_at).getTime() - new Date(msg.created_at).getTime()) < 5 * 60 * 1000;
const nextDate = next ? new Date(next.created_at).toDateString() : null;
const nextCompact = !!(nextSameSender && nextWithinWindow && !next?.reply_to && nextDate === thisDate && next?.type === "text");
const lastInGroup = !nextCompact;
return ( return (
<div key={msg.id}> <div key={msg.id}>
@@ -231,6 +238,7 @@ export default function App() {
<MessageItem <MessageItem
message={msg} message={msg}
compact={compact} compact={compact}
lastInGroup={lastInGroup}
replyTarget={msg.reply_to ? messagesById.get(msg.reply_to) : undefined} replyTarget={msg.reply_to ? messagesById.get(msg.reply_to) : undefined}
currentUsername={getCurrentUsername()} currentUsername={getCurrentUsername()}
selected={selectedMessages.some((s) => s.id === msg.id)} selected={selectedMessages.some((s) => s.id === msg.id)}

View File

@@ -96,7 +96,7 @@ export function MessageItem({ message, compact, lastInGroup, replyTarget, onSele
onClick={() => onSelect(message.id)} onClick={() => onSelect(message.id)}
className={cn( className={cn(
"group relative border-l-4 transition-all duration-300 cursor-pointer", "group relative border-l-4 transition-all duration-300 cursor-pointer",
compact ? "" : "border-b border-border/50", lastInGroup ? "border-b border-border/50" : "",
cfg.border, cfg.border,
selected ? "!border-l-primary bg-primary/5" : isAgent ? "bg-card" : "bg-background", selected ? "!border-l-primary bg-primary/5" : isAgent ? "bg-card" : "bg-background",
"hover:bg-muted/30", "hover:bg-muted/30",