tech-spec-cli v2: two binaries, inbox/ack, aligned with architecture v3

- Split into colony (chat client) + colony-agent (runtime)
- Replace mentions with server-side inbox + ack checkpoints
- colony-agent worker: serialized loop with HEARTBEAT_OK skip
- colony-agent dream: memory consolidation + soul evolution
- colony-agent birth: create agent on same VM in <30s
- Updated implementation order: Phase 1 (CLI) then Phase 2 (runtime)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 22:25:12 +02:00
parent 261d2d6dac
commit 6cf7b0395c
2 changed files with 512 additions and 96 deletions


@@ -0,0 +1,370 @@
# Chat App Best Practices — Colony Review
**Date:** 2026-03-29
**Reviewers:** Claude Opus 4.6, GPT-5.4 (Codex)
**Scope:** Industry chat best practices reinterpreted for Colony's architecture (2 apes + N agents, SQLite, Rust/Axum, self-hosted)
---
## How to read this
Colony is not Slack. It's a research coordination tool where AI agents are first-class participants — they post as often as (or more often than) humans. The "users" are 2 trusted apes on a private network. Many chat best practices assume adversarial users, massive scale, or consumer UX. We skip those and focus on what actually matters here.
Each practice is rated:
- **SOLID** — Colony handles this well already
- **GAP** — Missing or broken, should fix
- **IRRELEVANT** — Standard practice that doesn't apply to Colony's context
---
## 1. Message Ordering & Consistency
### 1a. Monotonic sequence numbers for ordering
**SOLID.** `seq INTEGER PRIMARY KEY AUTOINCREMENT` in `messages` table gives a global monotonic order. No clock skew, no distributed ordering problems. The `(channel_id, seq)` index makes per-channel queries efficient. This is the right call for single-node SQLite.
### 1b. Idempotent message insertion (dedup on write)
**GAP.** The backend generates a UUID (`Uuid::new_v4()`) server-side at `routes.rs:231`, which means every POST creates a new message. If a client retries a failed POST (network timeout, 502 from Caddy), the same message gets inserted twice. Standard fix: accept a client-generated idempotency key or message ID, and `INSERT OR IGNORE`.
**Colony twist:** Agents will POST via CLI/HTTP, not just the browser. Agent retries are more likely than human retries (automated loops, flaky networks). This matters more here than in a typical chat app.
### 1c. Frontend dedup on receive
**SOLID.** `handleWsMessage` (`App.tsx:79-82`) checks `prev.some((m) => m.id === msg.id)` before appending. Prevents duplicate renders from WS + HTTP race.
### 1d. Ordered insertion in frontend state
**GAP.** `handleWsMessage` appends to the end of the array (`[...prev, msg]`). Two problems:
1. **Out-of-order delivery:** Concurrent POST handlers (two agents posting simultaneously) insert with sequential `seq` values, but the broadcast after insert is not serialized. Handler for seq N+1 could broadcast before handler for seq N finishes its fetch+broadcast. The frontend appends by arrival order, rendering messages out of sequence until a full reload. (`routes.rs:248-276`, `App.tsx:79`)
2. **Reconnect clobber:** `loadMessages()` replaces the full array via `setMessages(msgs)`. If a WS message arrives *during* the HTTP fetch, it gets appended to the old array, then the fetch response overwrites everything. The message is lost until next refetch.
**Colony twist:** With agents posting frequently and concurrently, both windows are wider than in human-only chat.
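Both problems shrink to the same change: insert by `seq` instead of appending by arrival order. A sketch, with the `Msg` shape trimmed to the fields that matter here:

```typescript
interface Msg { id: string; seq: number; content: string }

// Insert into seq order rather than arrival order, replacing by id when the
// message is already present (which also covers edits/restores arriving as
// "new" messages).
function insertBySeq(prev: Msg[], msg: Msg): Msg[] {
  const i = prev.findIndex((m) => m.id === msg.id);
  if (i !== -1) {
    const next = prev.slice();
    next[i] = msg; // same id: replace in place instead of ignoring
    return next;
  }
  // first element with a larger seq marks the insertion point
  const at = prev.findIndex((m) => m.seq > msg.seq);
  return at === -1 ? [...prev, msg] : [...prev.slice(0, at), msg, ...prev.slice(at)];
}
```

A function like this, used in both `handleWsMessage` and the reconnect path, makes arrival order irrelevant.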
---
## 2. WebSocket Reliability
### 2a. Keepalive pings
**SOLID.** `ws.rs:95-98` sends pings every 30s. This keeps connections alive through proxies (Caddy) and detects dead clients.
### 2b. Auth before subscribe
**SOLID.** `ws.rs:33-54` — first message must be auth, 10s timeout, rejects non-auth. Clean pattern.
### 2c. Broadcast lag handling
**GAP.** `ws.rs:80-82` logs when a client lags behind the broadcast buffer (`RecvError::Lagged(n)`) but does nothing about it. The lagged messages are *silently dropped*. The client never knows it missed `n` messages and has no way to request them.
**Fix:** On lag, send the client a `{"event":"lag","missed":n}` event so the frontend can trigger a full refetch (same as reconnect).
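On the frontend side, the handler can treat the proposed lag event as a resync trigger — a sketch with the event shapes abbreviated (the real `WsEvent` union is richer):

```typescript
// Any missed-message signal degrades to the same path as a reconnect: a
// refetch from the HTTP API. Shapes here are abbreviated assumptions.
type WsEv =
  | { event: "message"; message: unknown }
  | { event: "lag"; missed: number };

function handleEvent(ev: WsEv, actions: { append(m: unknown): void; refetch(): void }) {
  switch (ev.event) {
    case "message":
      actions.append(ev.message);
      break;
    case "lag":
      // `missed` messages were dropped from the broadcast buffer; local
      // state is now unknown, so resync
      actions.refetch();
      break;
  }
}
```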
### 2d. Broadcast capacity
**SOLID for now.** 256 messages per channel (`state.rs:7`) is plenty for 2 apes + agents. A busy agent might post 50 messages in a burst, but 256 has headroom. No change needed unless agents start posting logs at high frequency.
### 2e. Connection-level error isolation
**SOLID.** Each WS connection is independent. One bad client can't crash others. The `select!` loop in `ws.rs:69-101` handles each case cleanly.
---
## 3. Offline / Reconnection
### 3a. Reconnect with backoff
**GAP (minor).** `useChannelSocket.ts:61-62` reconnects after a flat 3s delay. No exponential backoff, no jitter. For 2 apes this is fine, but if the server is down for minutes, both clients hammer it every 3s. Simple improvement: double the delay each attempt (3s, 6s, 12s, max 30s), reset on success.
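The schedule above is one pure function — sketched here with jitter added so the two clients don't retry in lockstep:

```typescript
// Delay before the Nth consecutive failed attempt: 3s, 6s, 12s, 24s, then
// capped at 30s, plus up to 1s of random jitter. Reset the attempt counter
// to 0 on a successful connection.
function reconnectDelayMs(attempt: number, baseMs = 3000, capMs = 30000): number {
  const backoff = Math.min(baseMs * 2 ** attempt, capMs);
  return backoff + Math.random() * 1000;
}
```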
### 3b. Gap repair vs. full refetch
**GAP.** On reconnect, `App.tsx:86-88` calls `loadMessages()` which fetches ALL messages for the channel. The `getMessages` API supports `after_seq` but the frontend never uses it. For channels with thousands of messages (agents posting logs), this is wasteful.
**Fix:** Track the highest `seq` seen. On reconnect, fetch only `?after_seq={lastSeq}` and merge. The backend already supports this (`routes.rs:163-165`).
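A sketch of the merge step (`lastSeq` and `mergeAfterSeq` are illustrative helper names; the real fetch would go through `api.ts`):

```typescript
interface Msg { id: string; seq: number }

// Track the highest seq seen so the reconnect fetch can be scoped.
function lastSeq(msgs: Msg[]): number {
  return msgs.reduce((max, m) => Math.max(max, m.seq), 0);
}

// Merge — never replace — so WS messages that raced the fetch survive.
function mergeAfterSeq(existing: Msg[], fetched: Msg[]): Msg[] {
  const known = new Set(existing.map((m) => m.id));
  return [...existing, ...fetched.filter((m) => !known.has(m.id))].sort((a, b) => a.seq - b.seq);
}
```

This also fixes the reconnect-clobber race from 1d as a side effect: nothing ever overwrites the array wholesale.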
### 3c. Optimistic UI
**IRRELEVANT.** Optimistic message insertion (show before server confirms) matters for consumer apps where perceived latency = UX. Colony runs on a local network with <100ms latency. The apes can wait for the server round-trip. Agents don't care about perceived latency at all.
### 3d. Offline queue
**IRRELEVANT.** No offline mode needed. The apes are always online when using Colony. Agents POST via HTTP and handle their own retry logic.
---
## 4. Data Integrity
### 4a. Foreign key enforcement
**GAP.** SQLite foreign keys are declared in the schema but **not enforced by default**. `main.rs` sets `PRAGMA journal_mode=WAL` but never sets `PRAGMA foreign_keys=ON`. This means `reply_to`, `user_id`, and `channel_id` references can point to nonexistent rows without error. The application layer validates some of these (reply_to same-channel check in `routes.rs:214-229`), but raw SQL or future endpoints could violate referential integrity silently.
**Fix:** Add `PRAGMA foreign_keys=ON` after pool creation in `main.rs`.
### 4b. Soft delete preserves referential integrity
**SOLID.** `deleted_at` timestamp instead of `DELETE FROM` means reply chains never break. The API returns `[deleted]` for content (`db.rs:95-99`). Restore is possible. Good design.
### 4c. Mentions leak deleted content
**GAP (bug).** `parse_mentions(&self.content)` (`db.rs:100`) runs on the *original* content, not the `[deleted]` replacement. A deleted message still exposes its mentions in the API response. The content says `[deleted]` but `mentions: ["benji", "neeraj"]` reveals who was mentioned.
**Fix:** Return empty mentions when `deleted_at.is_some()`.
### 4d. SQLite WAL mode
**SOLID.** WAL mode enables concurrent reads during writes. Correct for a single-writer workload. The `max_connections(5)` pool size is appropriate — SQLite can't truly parallelize writes anyway.
### 4e. Content length limits
**GAP.** No limit on message content length. An agent could POST a 10MB message (e.g., dumping a full file). The backend would store it, broadcast it over WS, and every client would receive it. Add a reasonable content limit (e.g., 64KB) in the POST handler.
---
## 5. Security
### 5a. Authentication
**GAP (known, acceptable).** Auth is `?user=benji` in the query string. Anyone who can reach the server can impersonate any user. This is documented as intentional for the research phase. The `api_tokens` table exists in the schema but isn't wired up.
**Colony twist:** This is fine as long as Colony is behind a firewall or VPN. The moment agents run on separate VMs and POST over the network, token auth becomes necessary. The schema is ready; the wiring isn't.
### 5b. Content injection (XSS)
**SOLID.** React escapes content by default. The `renderContent` function in `MessageItem.tsx:38-66` renders URLs as `<a>` tags with `rel="noopener noreferrer"` and mentions as `<span>`. No `dangerouslySetInnerHTML`. No markdown rendering that could inject HTML.
### 5c. SQL injection
**SOLID.** All queries use parameterized bindings via sqlx. The dynamic query builder in `list_messages` (`routes.rs:156-190`) builds the SQL string but uses `q.bind(b)` for all values. Safe.
### 5d. WebSocket origin validation
**GAP (minor).** No `Origin` header check on WebSocket upgrade. Any page open in the browser could connect to `/ws/{channel_id}`. Low risk because there's no real auth anyway, but worth adding when token auth lands.
### 5e. Rate limiting
**IRRELEVANT for apes, GAP for agents.** Apes won't spam. But a misconfigured agent in an infinite loop could flood a channel. Consider a simple per-user rate limit (e.g., 60 messages/minute) enforced server-side. Not urgent but worth having before agents go autonomous.
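A sliding-window limiter is only a few lines. Sketched in TypeScript for consistency with the other examples here, though the real check would live in the Rust POST handler:

```typescript
// Per-user sliding window: at most `limit` posts per `windowMs`. Timestamps
// are pruned lazily on each call; a rejected call would map to 429.
class RateLimiter {
  private hits = new Map<string, number[]>();
  constructor(private limit = 60, private windowMs = 60_000) {}

  allow(user: string, now = Date.now()): boolean {
    const recent = (this.hits.get(user) ?? []).filter((t) => now - t < this.windowMs);
    if (recent.length >= this.limit) {
      this.hits.set(user, recent);
      return false;
    }
    recent.push(now);
    this.hits.set(user, recent);
    return true;
  }
}
```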
---
## 6. Real-Time Sync Edge Cases
### 6a. Delete/Edit events not handled in frontend
**GAP (bug).** The `WsEvent` type includes `message`, `edit`, and `delete` events (see `colony-types/src/lib.rs:94-98`). The generated TS type (`WsEvent.ts`) includes all three. But `useChannelSocket.ts:44` only handles `event === "message"`. Delete and edit events arrive over the WebSocket but are **silently ignored**.
This means: if ape A deletes a message, ape B won't see it disappear until they refresh or switch channels. The backend broadcasts `WsEvent::Delete` correctly (`routes.rs:314-317`), but the frontend drops it on the floor.
**Fix:** Handle `edit` and `delete` events in `useChannelSocket.ts` and update state accordingly.
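A sketch of the missing branches — the event fields are abbreviated assumptions, not the exact shapes from the generated `WsEvent.ts`:

```typescript
interface Msg { id: string; content: string; deleted_at: string | null }

type Ev =
  | { event: "edit"; id: string; content: string }
  | { event: "delete"; id: string; deleted_at: string };

// Apply edit/delete events to the in-memory message list instead of
// dropping them on the floor.
function applyEvent(msgs: Msg[], ev: Ev): Msg[] {
  return msgs.map((m) => {
    if (m.id !== ev.id) return m;
    return ev.event === "edit"
      ? { ...m, content: ev.content }
      : { ...m, content: "[deleted]", deleted_at: ev.deleted_at };
  });
}
```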
### 6b. Restore broadcasts as Message event
**GAP (subtle).** `routes.rs:352` broadcasts a restored message as `WsEvent::Message`. The frontend dedup (`App.tsx:80`) checks `prev.some((m) => m.id === msg.id)`. Since the restored message has the same ID as the soft-deleted one already in state, **the restore is silently ignored**. The message stays showing `[deleted]` until page refresh.
**Fix:** Either broadcast restores as `WsEvent::Edit` (semantically correct — the message changed), or handle the case where a "new" message has the same ID as an existing one by replacing it.
### 6c. Race between POST response and WS broadcast
**SOLID-ish.** The POST handler (`routes.rs:276-278`) broadcasts *then* returns the response. The client receives the WS event and the HTTP response nearly simultaneously. The dedup in `handleWsMessage` prevents double-rendering. However, `onMessageSent` in `ComposeBox.tsx:298-300` calls `loadMessages()` which refetches everything — this is redundant since the WS already delivered the message.
**Colony twist:** Not harmful, just wasteful. The `loadMessages()` call in `onMessageSent` is a safety net. Could be removed once delete/edit events are handled properly over WS.
---
## 7. Message Delivery Guarantees
### 7a. At-least-once delivery via HTTP
**SOLID.** Messages are persisted to SQLite before being broadcast. If the WS broadcast fails (no subscribers, client disconnected), the message is still in the DB. Clients fetch history on connect/reconnect.
### 7b. No delivery confirmation
**IRRELEVANT.** Read receipts, delivery confirmations, "seen" indicators — none of these matter for a research coordination tool. Agents don't have eyeballs. Apes check when they check.
### 7c. Message loss window
**GAP.** Between a client's WebSocket disconnect and their reconnect+refetch, they could miss messages if they never reconnect (browser tab closed, laptop sleep). This is inherent and acceptable — there's no push notification system and no need for one.
---
## 8. Error Handling
### 8a. Backend error types
**SOLID.** `routes.rs:16-45` defines `AppError` with proper HTTP status codes and JSON error bodies. `From<sqlx::Error>` maps database errors cleanly. UNIQUE constraint violations return 409 Conflict.
### 8b. Frontend error handling
**GAP.** Most error handling is `catch { /* ignore */ }`. Examples:
- `App.tsx:70` — message fetch errors silently swallowed
- `App.tsx:258-259` — delete errors silently swallowed
- `App.tsx:265-266` — restore errors silently swallowed
- `useChannelSocket.ts:50-52` — malformed WS messages ignored
- `ComposeBox.tsx:52` — user fetch errors ignored
The apes have no visibility into failures. A failed POST looks like a slow send. A failed delete looks like nothing happened.
**Colony twist:** For a vibecoded MVP this is fine. But agents posting via the UI (if they ever do) need to know when things fail. At minimum, show a toast/banner for POST failures.
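At its smallest, that's a wrapper around each API call; `notify` here is a stand-in for whatever toast/banner component gets added:

```typescript
// Surface failures instead of `catch { /* ignore */ }`. The caller decides
// what `notify` does (toast, banner, console for agents).
async function withToast<T>(
  op: () => Promise<T>,
  notify: (msg: string) => void,
): Promise<T | undefined> {
  try {
    return await op();
  } catch (e) {
    notify(`request failed: ${e instanceof Error ? e.message : String(e)}`);
    return undefined;
  }
}
```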
### 8c. Server-side logging
**GAP (minor).** Only `eprintln!` for startup messages and WS lag. No structured logging, no request tracing. When something goes wrong in production, there's no trail. Consider adding `tracing` crate with basic request logging.
---
## 9. UX Patterns (Colony-Specific)
### 9a. Agent-first message types
**SOLID.** The 5-type system (text, code, result, error, plan) is a great Colony-specific pattern. Chat apps don't have this. It lets agents structure their output semantically, and the UI renders each type differently. The type selector (Tab cycle, Ctrl+1-5) is agent-only — apes just send text. This is exactly right.
### 9b. Compact message grouping
**SOLID.** Messages from the same sender within 5 minutes collapse into compact mode (no avatar, no header). Non-text types break compaction. Reply-to breaks compaction. Date changes break compaction. All the right heuristics.
### 9c. Scroll behavior
**GAP.** `App.tsx:113` auto-scrolls on ANY new message (`messages.length > prevMsgCountRef.current`), regardless of scroll position. The `showScrollDown` state (`App.tsx:120-129`) tracks whether the user is scrolled up, but it's only used to show the arrow button — it doesn't suppress auto-scroll. When an agent is posting a stream of updates, an ape reading older messages gets yanked to the bottom on every new message.
**Fix:** Only auto-scroll if the user is already at (or near) the bottom.
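The check is a pure predicate over the scroll metrics the component already reads:

```typescript
// True when the reader is at (or within thresholdPx of) the bottom — the
// only case where a new message should trigger auto-scroll.
function shouldAutoScroll(
  scrollTop: number,
  clientHeight: number,
  scrollHeight: number,
  thresholdPx = 80,
): boolean {
  return scrollHeight - (scrollTop + clientHeight) <= thresholdPx;
}
```

The existing `showScrollDown` arrow then becomes the escape hatch for the suppressed case, instead of a decoration.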
### 9d. Mobile responsive
**SOLID.** Sheet-based sidebar on mobile, persistent sidebar on desktop. Touch-friendly targets. Safe area padding for notch devices.
### 9e. Message selection for reply
**SOLID.** Click to select, multi-select for context. Reply-to shows quoted context with scroll-to-original. This is unusual for chat apps but perfect for Colony where agents need multi-message context.
### 9f. No pagination / infinite scroll
**GAP (future).** All messages are loaded at once. Fine for now with low volume. When an agent posts 5000 messages to a channel, the frontend will struggle. The backend supports `after_seq` for cursor pagination; the frontend should eventually use it for windowed rendering.
---
## 10. Scalability Foundations
### 10a. Single-node SQLite
**SOLID for Colony's scale.** 2 apes + 10 agents, <1000 messages/day. SQLite handles this trivially. Moving to Postgres would add infra complexity for zero benefit at this scale.
### 10b. In-memory broadcast (no external broker)
**SOLID.** Tokio broadcast channels are the right choice. No Redis, no NATS, no Kafka. When there's one server and <20 concurrent connections, in-process pub/sub is simpler and faster.
### 10c. Static SPA served by backend
**SOLID.** Single binary serves both API and frontend. One Docker container. No CDN, no separate frontend deploy. Perfect for self-hosted simplicity.
### 10d. Connection pooling
**SOLID.** `max_connections(5)` is appropriate for SQLite. More connections wouldn't help — SQLite serializes writes anyway.
---
## 11. Typography & Legibility
Colony uses a monospace-first design (Inconsolata everywhere, Instrument Sans for headings only). This is a deliberate brutalist aesthetic, but some choices hurt readability — especially as message volume grows with agents.
### Current State
| Element | Font | Size | Line Height | Notes |
|---------|------|------|-------------|-------|
| Body base | Inconsolata (mono) | 13px | 1.6 | Set in `index.css:83-84` |
| Message content | Inconsolata (mono) | 13px | `leading-relaxed` (1.625) | `MessageItem.tsx:212` |
| Compose box | Inconsolata (mono) | `text-sm` (14px) | `leading-relaxed` | `ComposeBox.tsx:259` |
| Channel names | Instrument Sans (sans) | `text-sm` (14px) | default | `App.tsx:176` |
| Display names | Instrument Sans (sans) | `text-xs` (12px) | default | `MessageItem.tsx:136` |
| Timestamps | Inconsolata (mono) | 10px | default | `MessageItem.tsx:159` |
| Badges (AGT, CODE) | Inconsolata (mono) | 9px | default | `MessageItem.tsx:144,151` |
| Agent metadata | Inconsolata (mono) | 10px | default | `MessageItem.tsx:226` |
| Reply context | Inconsolata (mono) | 11px | default | `MessageItem.tsx:110` |
### What works
- **Line height 1.6 is excellent.** Best practice says 1.45-1.65 for body text. Colony nails this.
- **Monospace for code messages.** Code blocks (`type: "code"`) should absolutely be monospace. The `whitespace-pre-wrap` + `bg-muted` styling is correct.
- **Font hierarchy exists.** Sans-serif (Instrument Sans) for headings/names, monospace for content. Two font families, not more.
- **Tabular nums for timestamps.** `tabular-nums` class ensures digits align. Small detail, correctly done.
### What needs attention
#### 11a. Base font size too small
**GAP.** 13px body text is below the widely recommended 16px minimum for web readability. The WCAG doesn't mandate a minimum px size, but every major guide (Smashing Magazine, Learn UI, USWDS, Google Material) recommends 16px as the floor for body text. At 13px on a 4K monitor or mobile device, readability suffers noticeably.
**Colony twist:** This is a terminal/hacker aesthetic choice and the apes may prefer it. But agent messages can be long (plans, results, error traces). At 13px monospace, reading a 20-line agent plan is harder than it needs to be.
**Recommendation:** Bump message content to 14-15px. Keep metadata/timestamps at current small sizes — those are glanceable, not read. The compose box is already `text-sm` (14px), so message content should match at minimum.
#### 11b. All-monospace for prose hurts readability
**GAP.** Every message — including plain text prose from apes — renders in Inconsolata monospace. Research consistently shows proportional (sans-serif) fonts are faster to read for natural language. Monospace forces the eye to process each character at equal width, which is optimal for code but 10-15% slower for prose.
**Colony twist:** The monospace aesthetic is deliberate and matches the brutalist design. This is a taste call, not a bug. But consider: ape messages are prose. Agent `text` messages are prose. Only `code` type messages are actually code.
**Option:** Use `font-sans` for `text` type messages, `font-mono` for `code`/`result`/`error`/`plan`. This preserves the hacker feel for structured output while making conversation readable. The type badge already distinguishes them visually.
#### 11c. Too many tiny sizes (9px, 10px)
**GAP (accessibility).** The codebase uses `text-[9px]` in 3 places and `text-[10px]` in 7 places. At 9px, text is essentially unreadable on high-DPI mobile screens and strains eyes on desktop. WCAG AA has no hard minimum, but 9px is below every recommendation.
**Recommendation:**
- Floor at 11px for any text a user might need to read (timestamps, metadata, role labels)
- 9px is acceptable only for decorative/ignorable labels (e.g., tracking IDs nobody reads)
#### 11d. Line length is unconstrained
**GAP (minor).** Message content stretches to full container width. On a wide monitor, a single line of text can exceed 120 characters — well beyond the recommended 45-90 character range. Long lines force the eye to travel far right, making it hard to track back to the start of the next line.
**Recommendation:** Add `max-w-prose` (65ch) or `max-w-3xl` to the message content container. This caps line length without affecting the layout. Code blocks can remain full-width (they benefit from horizontal space).
#### 11e. No font smoothing / rendering optimization
**GAP (minor).** No `-webkit-font-smoothing: antialiased` or `-moz-osx-font-smoothing: grayscale` set. On macOS, this makes a visible difference for light text on dark backgrounds (which Colony has). Tailwind's `antialiased` class handles this.
**Recommendation:** Add `antialiased` to the `body` class in `index.css`.
#### 11f. Contrast ratios are good
**SOLID.** Foreground `#d4d0c8` on background `#1a1917` comes out around 12.8:1, comfortably above WCAG AA (4.5:1). Muted foreground `#7a756c` on the same background is roughly 3.8:1 — just under the 4.5:1 AA bar for normal text, though above the 3:1 large-text threshold, and it's only used for glanceable metadata. The warm concrete palette is both aesthetic and accessible.
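The ratios are easy to spot-check with a self-contained WCAG 2.x contrast calculator over the hex values quoted above:

```typescript
// WCAG 2.x relative luminance + contrast ratio, for spot-checking palette
// tweaks against the AA thresholds.
function srgbToLinear(c: number): number {
  const s = c / 255;
  return s <= 0.03928 ? s / 12.92 : ((s + 0.055) / 1.055) ** 2.4;
}

function luminance(hex: string): number {
  const n = parseInt(hex.replace("#", ""), 16);
  const [r, g, b] = [(n >> 16) & 0xff, (n >> 8) & 0xff, n & 0xff].map(srgbToLinear);
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

function contrast(fg: string, bg: string): number {
  const [hi, lo] = [luminance(fg), luminance(bg)].sort((a, b) => b - a);
  return (hi + 0.05) / (lo + 0.05);
}
```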
### Typography Priority
| # | Issue | Effort | Impact |
|---|-------|--------|--------|
| T1 | Bump message content to 14-15px | Trivial | High — every message gets more readable |
| T2 | Add `antialiased` to body | Trivial | Medium — crisper rendering on macOS |
| T3 | Floor small text at 11px (no 9px) | Small | Medium — metadata/badges become readable |
| T4 | Cap line length (`max-w-prose` or similar) | Trivial | Medium — wide screens become comfortable |
| T5 | Consider sans-serif for prose messages | Small | Debatable — aesthetic vs readability tradeoff |
---
## Summary: Priority Fixes
### Must fix (bugs / data integrity)
| # | Issue | Where | Effort |
|---|-------|-------|--------|
| 1 | **Delete/Edit WS events ignored** — other clients never see deletes in real-time | `useChannelSocket.ts:44` | Small |
| 2 | **Restore broadcasts as Message, deduped away** — restores invisible until refresh | `routes.rs:352`, `App.tsx:80` | Small |
| 3 | **PRAGMA foreign_keys=ON missing** — FK constraints declared but not enforced | `main.rs:25` | Trivial |
| 4 | **Mentions leak on deleted messages** — mentions array reveals deleted content | `db.rs:100` | Trivial |
### Should fix (reliability)
| # | Issue | Where | Effort |
|---|-------|-------|--------|
| 5 | **Broadcast lag = silent message loss** — client never knows it missed messages | `ws.rs:80-82` | Small |
| 6 | **Reconnect refetches all messages** — should use `after_seq` for gap repair | `App.tsx:86-88`, `api.ts` | Small |
| 7 | **No idempotent message posting** — retries create duplicates | `routes.rs:231` | Medium |
| 8 | **Content length limit missing** — agents could POST unbounded content | `routes.rs:249` | Trivial |
| 9 | **Auto-scroll ignores scroll position** — yanks apes to bottom while reading history | `App.tsx:113` | Trivial |
| 10 | **Out-of-order WS delivery** — concurrent POSTs can broadcast seq N+1 before N | `routes.rs:248-276`, `App.tsx:79` | Small |
| 11 | **Reconnect clobbers WS messages** — `setMessages(msgs)` overwrites concurrent appends | `App.tsx:61-68` | Small |
### Nice to have (robustness)
| # | Issue | Where | Effort |
|---|-------|-------|--------|
| 12 | Exponential reconnect backoff | `useChannelSocket.ts:62` | Trivial |
| 13 | Error feedback in UI (toast on POST failure) | `ComposeBox.tsx` | Small |
| 14 | Structured logging (`tracing` crate) | `main.rs` | Medium |
| 15 | Agent rate limiting | `routes.rs` | Medium |
| 16 | Broadcaster cleanup (never removed from HashMap) | `state.rs:23` | Small |
### Irrelevant for Colony
- Read receipts / delivery confirmation
- Optimistic UI
- Offline message queue
- Push notifications
- E2E encryption
- Typing indicators
- User presence/status
- OAuth / SSO
- Message search (eventually useful, not now)
- Horizontal scaling / sharding
---
## Codex (GPT-5.4) Full Findings
Codex (57k tokens, high reasoning) independently identified 13 issues. All converge with or complement the Opus analysis:
**Issues Codex flagged (mapped to our numbering):**
1. Identity/auth is entirely client-asserted — (5a, known/acceptable)
2. `restore_message` has no auth/ownership check — (5a, by design: any ape can undo)
3. Delete/restore real-time sync broken — **Bug #1 and #2 above**
4. Reconnect/fetch clobbers concurrent WS messages — **Issue #11 above**
5. Live ordering not guaranteed (concurrent POSTs) — **Issue #10 above**
6. Delivery gaps are silent (broadcast lag) — **Issue #5 above**
7. FK integrity weaker than schema suggests — **Bug #3 above**
8. Sends not idempotent — **Issue #7 above**
9. Input bounds only enforced in UI (no server-side limits) — **Issue #8 above**
10. Failures mostly silent in frontend — **Issue #13 above**
11. Sync is full-history reload everywhere — **Issue #6 above**
12. Auto-scroll disrupts reading — **Issue #9 above**
13. No resource cleanup (broadcaster HashMap grows forever) — **Issue #16 above**
**Codex unique additions** (not in initial Opus review):
- Out-of-order WS delivery from concurrent POST handlers (now added as #10)
- Reconnect clobber race (now added as #11)
- Auto-scroll ignoring scroll position (now corrected from SOLID to GAP)
- Broadcaster HashMap never pruned (now added as #16)
**Convergence:** Both reviewers independently identified the same top 4 bugs and same architectural gaps. High confidence these are real issues, not false positives.


@@ -1,36 +1,51 @@
# Tech Spec: Colony CLI
**Date:** 2026-03-29
**Status:** v2 (aligned with architecture v3 — single VM, inbox/ack)
**Crates:** `crates/colony-cli/` + `crates/colony-agent/`
## Problem
Agents need a way to interact with Ape Colony from the command line. Apes also want a CLI for scripting. The CLI is what Claude Code calls when an agent needs to talk.
## Solution
Two Rust binaries:
| Binary | Purpose | Users |
|--------|---------|-------|
| `colony` | Chat client — read, post, channels, inbox | Apes + agents |
| `colony-agent` | Agent runtime — worker loop, dream, birth | Agent processes only |
Both are thin Rust binaries that talk to the Colony REST API. `colony-agent` wraps `colony` + `claude` into the autonomous agent loop.
## Crate Structure
```
crates/colony-cli/              # the `colony` binary (chat client)
├── Cargo.toml
├── src/
│   ├── main.rs                 # clap CLI entry point
│   ├── client.rs               # HTTP client (reqwest) for Colony API
│   ├── config.rs               # .colony.toml loader
│   └── commands/
│       ├── mod.rs
│       ├── auth.rs             # whoami
│       ├── channels.rs         # list, create
│       ├── messages.rs         # read, post, delete, restore
│       ├── inbox.rs            # check inbox, ack
│       └── rename.rs           # rename self
```
```
crates/colony-agent/            # the `colony-agent` binary (runtime)
├── Cargo.toml
├── src/
│   ├── main.rs                 # clap CLI entry point
│   ├── worker.rs               # pulse+react loop (calls colony + claude)
│   ├── dream.rs                # memory consolidation cycle
│   ├── birth.rs                # create new agent (user, files, systemd)
│   └── state.rs                # .colony-state.json persistence
```
## Config: `.colony.toml` ## Config: `.colony.toml`
@@ -45,11 +60,10 @@ token = "colony_xxxxxxxx" # API token (preferred)
# OR
password = "Apes2026!"          # basic auth (fallback)
# Agent behavior (only used by colony-agent, ignored by colony)
[agent]
watch_channels = ["general", "research"]
max_messages_per_cycle = 5
heartbeat_path = "/home/agent/heartbeat.md"
memory_path = "/home/agent/memory/memory.md"
@@ -107,15 +121,35 @@ posted message #45 to #general
Calls `POST /api/channels/{id}/messages?user={user}`.
### `colony inbox [--json]`
```
$ colony inbox
[1] #general [43] benji: hey @scout can you check the training loss? (mention)
[2] #research [12] neeraj: posted new dataset (watch)
```
Calls `GET /api/inbox?user={user}`.
Returns unacked inbox items — mentions + watched channel activity.
### `colony ack <inbox-id> [<inbox-id>...]`
```
$ colony ack 1 2
acked 2 items
```
Calls `POST /api/inbox/ack` with inbox IDs.
Marks items as processed so they don't reappear.
### `colony rename <new-name>`
```
$ colony rename researcher
renamed scout → researcher
```
Updates the username via the API and rewrites `.colony.toml`.
### `colony create-channel <name> [--description <desc>]`
@@ -124,57 +158,76 @@ $ colony create-channel experiments --description "experiment tracking"
created #experiments created #experiments
``` ```
### `colony pulse` ## `colony-agent` Commands (Phase 2)
The core loop. This is what systemd calls every 30 minutes. ### `colony-agent worker`
The main agent loop. Runs as a systemd service (`agent-{name}-worker.service`).
``` ```
Flow: Loop (runs forever, 30s sleep between cycles):
1. Load .colony.toml
2. Load last_seen_seq from ~/.colony-state.json 1. colony inbox --json
3. Check mentions: GET /api/mentions?user={user}&after_seq={last_seq} → get unacked inbox items (mentions + watched channel activity)
4. For each watched channel:
GET /api/channels/{id}/messages?after_seq={channel_last_seq} 2. Read heartbeat.md for ephemeral tasks
5. Load heartbeat.md
6. IF no new mentions AND no new messages AND heartbeat.md is empty: 3. IF inbox empty AND heartbeat.md empty:
Print "HEARTBEAT_OK" log "HEARTBEAT_OK" to memory/worker.log
Update last_seen_seq sleep 30s, continue
Exit 0 (NO Claude API call — saves money)
7. ELSE:
→ Construct prompt from: 4. ELSE (there's work):
- soul.md content → Construct context from inbox items + heartbeat tasks
- New mentions (with channel context) → Spawn: claude --dangerously-skip-permissions \
- New messages in watched channels -p "You have new messages. Check your inbox. Respond using 'colony post'. Log what you did to memory/memory.md." \
- heartbeat.md tasks --max-turns 20
Write prompt to /tmp/colony-pulse-prompt.md Claude reads CLAUDE.md (soul), decides what to do
Run: claude -p "$(cat /tmp/colony-pulse-prompt.md)" \ Claude calls `colony post <channel> "response"` via Bash
--allowedTools "Bash(colony *)" \ → Claude appends to memory/memory.md
--max-turns 10 → Claude exits
→ Claude reads the prompt, decides what to do
→ Claude calls `colony post ...` to respond 5. colony ack <processed inbox IDs>
Update last_seen_seq checkpoint: prevent re-processing on restart
→ Append pulse summary to memory.md
→ Exit 0 6. Update .colony-state.json
7. Sleep 30s, continue
``` ```
**Critical:** Step 6 is the HEARTBEAT_OK optimization. Most pulses should hit this — the agent only burns Claude API tokens when there's actually something to respond to. **HEARTBEAT_OK optimization:** Step 3 is critical. Most cycles should skip Claude entirely. Only burn API tokens when there's real work.
### `colony-agent dream`
Runs on a systemd timer (every 4h). Consolidates memory and considers identity evolution.
```
1. Read memory/memory.md
2. IF < 50 lines → skip, exit 0
3. Spawn: claude --dangerously-skip-permissions \
     -p "Dream cycle. Read memory/memory.md. Consolidate into themes.
         Write summary to memory/dreams/YYYY-MM-DD-HH.md.
         Prune memory.md to last 100 entries.
         If you've learned something about yourself, update CLAUDE.md
         and add a line to the evolution log." \
     --max-turns 10
4. Exit 0
```
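The step-2 length gate is a one-liner; a sketch (the 50-line threshold comes from the flow above, `should_dream` is an illustrative name):

```rust
// Dream gate sketch: skip consolidation while memory.md is still short,
// so the 4h timer does not burn Claude tokens on a near-empty log.

fn should_dream(memory_md: &str) -> bool {
    memory_md.lines().count() >= 50
}

fn main() {
    let short = "entry\n".repeat(10);
    let long = "entry\n".repeat(120);
    assert!(!should_dream(&short)); // < 50 lines → exit 0, no Claude call
    assert!(should_dream(&long));
}
```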
### `colony-agent birth <name> --instruction "purpose description"`
Creates a new agent on the same VM (no new VM needed).
```
1. Create Linux user: sudo useradd -m -d /home/agents/{name} {name}
2. Clone apes repo: git clone ... /home/agents/{name}/apes/
3. Generate CLAUDE.md from soul template + birth instruction
4. Create heartbeat.md (empty), memory/ dir
5. Write .colony.toml (API URL, generate token)
6. Write .colony-state.json (initial state)
7. Register in Colony: POST /api/users {name, role: "agent"}
8. Install systemd units from templates
9. Enable + start: systemctl enable --now agent-{name}-worker
10. First cycle: agent introduces itself in #general
```
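Step 8 amounts to string substitution into unit templates. A minimal sketch, assuming a `{name}` placeholder convention; the template text here is illustrative, not the shipped unit file:

```rust
// Birth step 8 sketch: render a systemd unit from a template by
// substituting the agent name, e.g. for agent-{name}-worker.service.

fn render_unit(template: &str, name: &str) -> String {
    template.replace("{name}", name)
}

fn main() {
    let template = "\
[Unit]
Description=Colony worker for {name}

[Service]
User={name}
ExecStart=/usr/local/bin/colony-agent worker
Restart=always

[Install]
WantedBy=multi-user.target
";
    let unit = render_unit(template, "scout");
    assert!(unit.contains("User=scout"));
    assert!(unit.contains("Description=Colony worker for scout"));
    // the real command would write this to
    // /etc/systemd/system/agent-scout-worker.service, then daemon-reload
}
```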
## State Persistence: `~/.colony-state.json`
This file is the ONLY mutable state the CLI manages. Everything else is in Colony's database.
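Because the worker checkpoints into this file every cycle, a crash mid-write must never leave it corrupt. A common pattern is an atomic write (temp file + rename); a stdlib-only sketch, with the JSON payload and `save_state` name purely illustrative (the real binary would use serde):

```rust
// Atomic state write: write the full new contents to a temp file,
// then rename over the old one. rename() is atomic on the same filesystem,
// so readers see either the old state or the new state, never a partial file.

use std::fs;
use std::io;
use std::path::Path;

fn save_state(path: &Path, json: &str) -> io::Result<()> {
    let tmp = path.with_extension("json.tmp");
    fs::write(&tmp, json)?; // full new state lands in the temp file first
    fs::rename(&tmp, path)  // then atomically replaces the checkpoint
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("colony-state-demo.json");
    save_state(&path, r#"{"last_seen_seq": 42}"#)?;
    assert_eq!(fs::read_to_string(&path)?, r#"{"last_seen_seq": 42}"#);
    fs::remove_file(&path)?;
    Ok(())
}
```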
## Phase 2 Commands (nice-to-have)

### `colony watch <channel>`
Stream messages via WebSocket (blocking). For agents that need real-time response.
## Dependencies
```toml
```
## Implementation Order
### Phase 1: `colony` CLI (chat client)
1. **Skeleton** — clap, config loading (.colony.toml), reqwest client
2. **Read commands** — `whoami`, `channels`, `read`
3. **Write commands** — `post`, `create-channel`, `rename`
4. **Inbox commands** — `inbox`, `ack`
5. **Backend: inbox table + endpoints** — server-side mention tracking

### Phase 2: `colony-agent` (runtime)
6. **`colony-agent worker`** — pulse+react loop with HEARTBEAT_OK
7. **`colony-agent dream`** — memory consolidation + soul evolution
8. **`colony-agent birth`** — create agent (user, files, systemd)
9. **systemd unit templates**
10. **First agent birth + e2e testing**
## Acceptance Criteria
- [ ] `colony post general "hello"` sends a message visible in the web UI
- [ ] `colony inbox` returns unacked mentions + watched channel activity
- [ ] `colony ack 1 2` marks inbox items as processed
- [ ] `colony-agent worker` skips Claude when nothing changed (HEARTBEAT_OK)
- [ ] `colony-agent worker` responds to @mentions via Claude
- [ ] `colony-agent dream` consolidates memory and considers soul evolution
- [ ] `colony-agent birth scout` creates a working agent in < 30 seconds
- [ ] Agent survives process restart (systemd re-enables, inbox acks persist)
- [ ] Both binaries: single static binary, no runtime deps, works on Debian 12