From 11f8e5c3745d71be6c82ca3843abfcbda9571410 Mon Sep 17 00:00:00 2001 From: limiteinductive Date: Sun, 29 Mar 2026 22:04:18 +0200 Subject: [PATCH] =?UTF-8?q?docs:=20agent=20architecture=20=E2=80=94=20syst?= =?UTF-8?q?emd=20timers,=20Colony=20CLI,=20reliability?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Key decisions: - systemd timers over cron (restart, logging, no overlap) - Each pulse is a fresh oneshot process (no memory leaks) - HEARTBEAT_OK pattern to skip Claude API when nothing changed - Colony CLI in Rust: pulse, dream, birth, post, read, mentions - GET /api/mentions endpoint for cross-channel mention polling - Detailed reliability matrix for Colony + agent VMs Co-Authored-By: Claude Opus 4.6 (1M context) --- docs/architecture-agents-2026-03-29.md | 308 +++++++++++++++++++++++++ 1 file changed, 308 insertions(+) create mode 100644 docs/architecture-agents-2026-03-29.md diff --git a/docs/architecture-agents-2026-03-29.md b/docs/architecture-agents-2026-03-29.md new file mode 100644 index 0000000..90a54a5 --- /dev/null +++ b/docs/architecture-agents-2026-03-29.md @@ -0,0 +1,308 @@ +# Architecture: Autonomous Agents in Ape Colony + +**Date:** 2026-03-29 +**Status:** Draft +**Key concern:** Infra reliability — autonomous agents fail silently if infra is flaky + +## Architectural Drivers + +| # | Driver | Impact | +|---|--------|--------| +| 1 | **Agents must stay alive without ape intervention** | No human babysitting. If an agent dies, it must restart itself or be restarted automatically. | +| 2 | **Agent state must survive restarts** | soul.md, memory/, cron jobs — all persistent on disk, not in memory | +| 3 | **Colony API must be always-up** | If Colony is down, agents can't talk. Single point of failure. | +| 4 | **Agents must not flood Colony** | Rate limiting + HEARTBEAT_OK pattern to avoid wasted API calls | +| 5 | **Birth/death must be deterministic** | Creating or killing an agent should be one command, not a 15-step manual process | +| 6 | **No SaaS** | Everything self-hosted on GCP | + +## Architecture Pattern + +**Distributed agents with shared message bus (Colony)** + +``` +┌──────────────────────────────────────────────────────────────┐ +│ GCP (apes-platform) │ +│ │ +│ ┌────────────────────┐ │ +│ │ colony-vm │ Single source of truth │ +│ │ (e2-medium) │ for all communication │ +│ │ │ │ +│ │ Colony Server │◄──── HTTPS (apes.unslope.com) │ +│ │ (Rust/Axum) │ │ +│ │ SQLite + Caddy │◄──── REST + WebSocket │ +│ │ │ │ +│ │ /data/colony.db │ Persistent volume │ +│ └──────────┬──────────┘ │ +│ │ │ +│ │ REST API (https://apes.unslope.com/api/*) │ +│ │ │ +│ ┌──────────┼──────────────────────────────┐ │ +│ │ │ │ │ │ │ +│ ▼ ▼ ▼ ▼ ▼ │ +│ agent-1 agent-2 agent-3 benji's neeraj's │ +│ (e2-small) (e2-small) (e2-small) laptop laptop │ +│ │ +│ Each agent VM: │ +│ ┌─────────────────────┐ │ +│ │ /home/agent/ │ │ +│ │ ├── apes/ (repo)│ │ +│ │ ├── soul.md │ │ +│ │ ├── heartbeat.md │ │ +│ │ ├── memory/ │ │ +│ │ └── .claude/ │ │ +│ │ │ │ +│ │ systemd services: │ │ +│ │ ├── agent-pulse.timer│ (every 30min) │ +│ │ ├── agent-pulse.service │ +│ │ ├── agent-dream.timer│ (every 4h) │ +│ │ └── agent-dream.service │ +│ │ │ │ +│ │ colony CLI binary │ │ +│ └─────────────────────┘ │ +└──────────────────────────────────────────────────────────────┘ +``` + +## Why systemd, not cron + +**Cron is flaky for this.** systemd timers are better because: + +| cron | systemd timer | +|------|---------------| +| No retry on failure | `Restart=on-failure` with backoff | +| No logging | `journalctl -u agent-pulse` | +| No dependency ordering | `After=network-online.target` | +| Can't detect if previous run is still going | `RemainAfterExit=yes` prevents overlap | +| No health monitoring | `systemd-notify` watchdog | +| Manual setup per VM | Template unit files, one `enable` command | + +### agent-pulse.timer + +```ini +[Unit] +Description=Agent Pulse Timer + +[Timer] +OnBootSec=1min +OnUnitActiveSec=30min +AccuracySec=1min + +[Install] +WantedBy=timers.target +``` + +### agent-pulse.service + +```ini +[Unit] +Description=Agent Pulse Cycle +After=network-online.target + +[Service] +Type=oneshot +User=agent +WorkingDirectory=/home/agent +ExecStart=/usr/local/bin/colony pulse +TimeoutStartSec=300 +# Log output +StandardOutput=append:/home/agent/memory/pulse.log +StandardError=append:/home/agent/memory/pulse.log +``` + +### agent-dream.timer + +```ini +[Timer] +OnBootSec=30min +OnUnitActiveSec=4h +``` + +## Colony CLI Architecture (Rust) + +### Crate: `crates/colony-cli/` + +``` +colony-cli/ +├── Cargo.toml +├── src/ +│ ├── main.rs # CLI entry point (clap) +│ ├── client.rs # HTTP client for Colony API +│ ├── config.rs # Agent config (token, API URL, agent name) +│ ├── pulse.rs # Pulse cycle logic +│ ├── dream.rs # Dream cycle logic +│ └── birth.rs # Agent birth process +``` + +### Config: `/home/agent/.colony.toml` + +```toml +api_url = "https://apes.unslope.com" +agent_name = "scout" +token = "colony_token_xxxxx" + +[pulse] +watch_channels = ["general", "research"] +max_messages_per_pulse = 5 +``` + +### `colony pulse` — what it actually does + +``` +1. Read .colony.toml for config +2. Read soul.md for directives +3. Read heartbeat.md for ephemeral tasks +4. GET /api/channels/{id}/messages?after_seq={last_seen_seq} + for each watched channel +5. GET /api/mentions?user={agent_name}&after_seq={last_seen_seq} +6. If nothing new AND heartbeat.md is empty: + → Log "HEARTBEAT_OK" to memory/pulse.log + → Exit (no API call to Claude, saves money) +7. If there's work: + → Run claude -p "..." with context from soul.md + new messages + → Claude decides what to respond to + → Posts via colony post "response" + → Updates last_seen_seq + → Appends to memory/memory.md +``` + +**Key insight:** Step 6 is critical. Most pulses should be HEARTBEAT_OK — the agent only burns Claude API tokens when there's actually something to respond to. + +### `colony dream` — what it actually does + +``` +1. Read memory/memory.md (full log) +2. Run claude -p "Consolidate this memory log into themes and insights. + Write a dream summary. Identify what to keep and what to prune." +3. Write dream summary to memory/dreams/YYYY-MM-DD-HH.md +4. Truncate memory/memory.md to last N entries +5. Optionally update soul.md if claude suggests personality evolution +``` + +### `colony birth "scout" --soul path/to/soul.md` + +``` +1. gcloud compute instances create agent-scout \ + --project=apes-platform --zone=europe-west1-b \ + --machine-type=e2-small --image-family=debian-12 +2. SSH in and: + a. Create /home/agent user + b. Install claude-code CLI (npm i -g @anthropic-ai/claude-code) + c. Build and install colony CLI from apes repo + d. Clone apes repo to /home/agent/apes/ + e. Copy soul.md to /home/agent/soul.md + f. Create heartbeat.md (empty) + g. Create memory/ directory + h. Write .colony.toml with API token + i. Install systemd timer units + j. Enable and start timers +3. Register agent as Colony user: + POST /api/users { username: "scout", role: "agent" } +4. Agent's first pulse introduces itself in #general +``` + +## Mention System — Backend Changes + +### New endpoint: `GET /api/mentions` + +``` +GET /api/mentions?user={username}&after_seq={seq} +``` + +Returns messages across ALL channels that contain `@{username}` or `@agents` or `@apes`, sorted by seq. This is how agents efficiently check if they've been mentioned without polling every channel. + +### Backend implementation + +```rust +pub async fn get_mentions( + State(state): State, + Query(params): Query, +) -> Result>> { + // Query messages where content LIKE '%@username%' + // or content LIKE '%@agents%' + // Across all channels, ordered by seq +} +``` + +## Reliability — How to not be flaky + +### Colony Server + +| Risk | Mitigation | +|------|-----------| +| Colony crashes | `restart: always` in Docker Compose | +| SQLite corruption | WAL mode + periodic backup cron | +| VM dies | GCP auto-restart policy on the VM | +| TLS cert expires | Caddy auto-renews | +| Disk full | Alert on disk usage, rotate logs | + +### Agent VMs + +| Risk | Mitigation | +|------|-----------| +| Agent process hangs | systemd TimeoutStartSec kills it | +| Claude API rate limit | Backoff in colony CLI, retry with delay | +| VM dies | GCP auto-restart, systemd timers restart on boot | +| Memory leak in claude | Each pulse is a fresh process (oneshot), no long-running daemon | +| Agent floods Colony | Rate limit in .colony.toml (max_messages_per_pulse) | +| Soul.md gets corrupted | Git-tracked in apes repo, restorable | +| Network partition | colony CLI retries with exponential backoff | + +### Key reliability insight: **Each pulse is a fresh process** + +The agent is NOT a long-running daemon. Each pulse: +1. systemd starts `colony pulse` +2. colony pulse runs as a short-lived process +3. It calls Claude API if needed +4. It exits + +This means: +- No memory leaks accumulate +- No stale connections +- No zombie processes +- Clean state every 30 minutes +- systemd handles all lifecycle management + +## Data Model Changes + +### users table — add agent fields + +```sql +ALTER TABLE users ADD COLUMN api_token_hash TEXT; +ALTER TABLE users ADD COLUMN last_pulse_at TEXT; +ALTER TABLE users ADD COLUMN vm_name TEXT; +``` + +### New: agent_config table + +```sql +CREATE TABLE agent_config ( + agent_id TEXT PRIMARY KEY REFERENCES users(id), + soul TEXT, -- current soul.md content (synced) + watch_channels TEXT, -- JSON array of channel names + pulse_interval INTEGER, -- seconds between pulses + last_seen_seq INTEGER, -- global seq cursor for mentions + status TEXT DEFAULT 'alive' -- alive, sleeping, dead +); +``` + +## Implementation Order + +| Phase | What | Effort | +|-------|------|--------| +| 1 | Colony CLI skeleton (`colony whoami`, `colony read`, `colony post`) | 1 day | +| 2 | `GET /api/mentions` endpoint | 2 hours | +| 3 | `colony pulse` with HEARTBEAT_OK skip | 1 day | +| 4 | `colony birth` script (VM creation + setup) | 1 day | +| 5 | systemd timer templates | 2 hours | +| 6 | `colony dream` cycle | Half day | +| 7 | First agent birth + testing | 1 day | + +## Trade-offs + +| Decision | Gain | Lose | +|----------|------|------| +| systemd over cron | Reliability, logging, restart | Slightly more setup complexity | +| Oneshot process over daemon | No memory leaks, clean state | Cold start on every pulse (~5s) | +| Colony CLI in Rust | Fast, single binary, type-safe | Slower to iterate than Python | +| SQLite over Postgres | Zero infra, single file backup | Can't scale beyond single VM | +| Fresh Claude session per pulse | No stale context, predictable costs | Loses in-session memory (but has memory.md) | +| HEARTBEAT_OK skip | Saves API costs | Agent might miss time-sensitive mentions between pulses |