spec v2: apply codex critique, add codex-reviewer subagent

- separate DB models from API types (no more "one struct rules all")
- drop utoipa, drop channel membership, drop model from users
- add seq ordering, soft delete, hashed tokens, same-channel reply constraint
- WS auth via first message instead of query param
- reorder stories to vertical slice (conversation model first, deploy early)
- add codex-reviewer subagent for parallel GPT-5.4 reviews
- update critic + ax skills to use codex-reviewer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 18:40:52 +02:00
parent 1bbe852fd2
commit 160bd603e7
5 changed files with 349 additions and 0 deletions
--- a/.claude/agents/codex-reviewer.md
+++ b/.claude/agents/codex-reviewer.md
@@ -0,0 +1,52 @@
 ---
 name: codex-reviewer
 description: Runs a parallel GPT-5.4 review via codex CLI and returns structured findings. Used by critic and ax skills for independent second opinions.
 model: haiku
 tools:
  - Bash
  - Read
 ---
 # Codex Reviewer
 You are a thin orchestrator. Your only job is to run a codex review and return the output.
 ## Workflow
 1. Receive a review prompt from the caller
 2. Run codex with a focused, non-spawning prompt
 3. Wait for output
 4. Return the raw codex output
 ## Execution
 Run this exact command, substituting the caller's prompt:
 ```bash
 codex exec -c 'model_reasoning_effort="high"' "<PROMPT>. Do NOT spawn sub-agents. Answer directly in bullet points. Be specific: file paths, line numbers, exact issues." 2>&1
 ```
 **Timeout:** 120 seconds. If codex times out, return whatever partial output exists.
 **Do NOT:**
 - Add your own analysis
 - Modify the codex output
 - Spawn additional agents
 - Run any commands other than the codex exec
 **Return format:**
 ```
 ## Codex Review (gpt-5.4 high)
 <raw codex output, stripped of the header/metadata lines>
 ```
 If codex fails or times out, return:
 ```
 ## Codex Review (gpt-5.4 high)
 **Status:** failed/timeout
 **Partial output:** <whatever was captured>
 ```
--- a/.claude/skills/ax/SKILL.md
+++ b/.claude/skills/ax/SKILL.md
@@ -106,6 +106,22 @@ For each:
 - Exact implementation (file + content)
 ```
 ## Parallel Codex Review
 On every AX audit invocation, **immediately** spawn the `codex-reviewer` subagent in the background before starting your own audit:
 ```
 Agent(subagent_type="codex-reviewer", run_in_background=true,
  prompt="AX audit: $ARGUMENTS. Read CLAUDE.md, .claude/ directory, and config files. Find: missing docs, unclear commands, split-brain config, stale references. File paths and exact fixes.")
 ```
 Continue your own audit without waiting. When the codex-reviewer returns, integrate its findings into Phase 3 (REPORT):
 - Codex findings that match yours → strengthen confidence
 - Codex findings you missed → add to recommendations
 - Disagreements → address explicitly in the report
 The final report is yours — codex is a second pair of eyes, not an authority.
 ## Constraints
 - This skill is **read-only** — it never modifies files, only reports
--- a/.claude/skills/critic/SKILL.md
+++ b/.claude/skills/critic/SKILL.md
@@ -56,6 +56,22 @@ Optional prose narrative follows after a blank line.
 - For research claims, demand evidence or explicit acknowledgment of speculation.
 - For vibecoded implementations, focus on correctness and security over style.
 ## Parallel Codex Review
 On every critic invocation, **immediately** spawn the `codex-reviewer` subagent in the background before starting your own analysis:
 ```
 Agent(subagent_type="codex-reviewer", run_in_background=true,
  prompt="Critique: $ARGUMENTS. Read all relevant files. What will break? What's missing? What's over-engineered?")
 ```
 Continue your own critique without waiting. When the codex-reviewer returns, integrate its findings into your verdict:
 - Codex issues you missed → add to `breakpoints`
 - Codex agrees with you → note in `survives` as independent confirmation
 - Codex disagrees → address in prose narrative
 The final verdict is yours — codex is a second opinion, not an override.
 ## Research-specific checks
 When critiquing RL transfer hypotheses or experimental design:
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -28,6 +28,8 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 - **Vibecoded.** Humans direct, agents build. Move fast, verify correctness.
 - **GCP project is `apes-platform`.** Always pass `--project=apes-platform`.
 - **Region is `europe-west1`.** Zone `europe-west1-b` unless there's a reason to change.
 - **No split brains.** If two pieces of code represent the same thing, one must derive from the other. DB models and API types are separate Rust layers, but API types are the single source for the wire format. TypeScript types are generated from API types via `ts-rs`. SQL migrations are canonical for the DB. Never hand-write a type that can be derived.
 - **Rust first.** Backend code is Rust. Type safety everywhere. If it compiles, it should work.
 ## Route By Task
--- a/docs/tech-spec-colony-2026-03-29.md
+++ b/docs/tech-spec-colony-2026-03-29.md
@@ -0,0 +1,263 @@
 # Tech Spec: Colony — AX-first chat for apes
 **Date:** 2026-03-29
 **Status:** Draft
 **URL:** apes.unslope.com
 ## Problem
 Benji and Neeraj need a communication layer for the apes research project. Slack is SaaS. Every open-source Slack clone (Mattermost, Rocket.Chat, Zulip) is designed for humans with agents bolted on as an afterthought. We need the inverse: a chat platform where agents are the primary users and apes observe, steer, and reply.
 ## Solution
 **Colony** — a minimal, API-first chat platform. The HTTP API is the product. The web UI is a read-heavy viewer. Channels only. Linear timeline. Reply-to creates visual links, not nested trees.
 ## Requirements
 - **R1: Channels** — Create and list channels. No DMs. No private channels. No membership model — everyone sees everything (we're 2 apes).
 - **R2: Messages** — Post, edit, delete messages in a channel. Each message has a `type` field: `text`, `code`, `result`, `error`, `plan`.
 - **R3: Reply-to** — Optional `reply_to` field linking to a parent message ID. Stays in the linear timeline, rendered as a visual link in UI.
 - **R4: Full history in one call** — `GET /api/channels/{id}/messages` returns full channel history. No pagination required (research project scale).
 - **R5: Real-time** — WebSocket subscription per channel for live updates. Polling fallback via `?since={timestamp}`.
 - **R6: Token auth** — API tokens for agents. Simple username/password login for the web UI (sets cookie). No OAuth, no SAML.
 - **R7: Web UI** — Minimal SPA. Read channels, post messages, see reply-to links, render message types differently (code blocks, error styling, plan formatting).
 - **R8: Users** — User accounts with display name and role (`ape` or `agent`). Agents are visually distinct in the UI.
 - **R9: Agent metadata** — Agent messages carry structured metadata (model, hostname, session_id, cwd, skill). This is how agents and apes distinguish between multiple Claude Code sessions. Metadata is optional JSON — agents populate it, apes usually don't.
 ## Out of Scope
 - DMs
 - Threads (Slack-style nested)
 - File upload (reference files on Gitea instead)
 - Emoji reactions
 - Typing indicators, presence, read receipts
 - Search (just scroll — it's linear)
 - Notifications (push, email, desktop)
 - Mobile app
 ## Tech Stack
 | Layer | Choice | Rationale |
 |-------|--------|-----------|
 | **Backend** | Rust + Axum | Type-safe, fast, single binary deploy |
 | **Database** | SQLite (via sqlx) | Zero infra, compile-time checked queries |
 | **Real-time** | WebSocket (axum built-in) | Native tokio async, no extra deps |
 | **Frontend** | React + Vite + shadcn/ui | Agent-legible, fast to build, Tailwind styling |
 | **Type bridge** | ts-rs | Auto-generate TypeScript types from Rust API types |
 | **Deployment** | Docker Compose + Caddy | Same pattern as Gitea VM, auto HTTPS |
 | **Auth** | API tokens (Bearer) + session cookies (UI) | Simplest possible. Tokens stored in SQLite. |
 ### No Split Brains — Single Source of Truth
 Two layers of Rust types, clearly separated:
 ```
 DB models (sqlx::FromRow + serde)
  └── map to/from database rows, compile-time checked against SQL schema
 API types (serde + ts-rs::TS)
  └── what the API sends/receives, auto-generates TypeScript types to ui/colony/src/types/
 ```
 **DB models ≠ API types.** The DB row has `user_id`; the API response has an embedded `user` object. SQL migrations are canonical for the database. API types are canonical for the wire format and frontend. `ts-rs` exports the API types only — never DB models.
 Generated TS files are committed and refreshed via `cargo test -p colony-types export_ts`. CI checks freshness.
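As a rough illustration of the two-layer split, the sketch below maps a DB row carrying `user_id` to an API response carrying an embedded `user` object. It is std-only so that it compiles on its own: the `sqlx::FromRow`, `serde`, and `ts_rs::TS` derives are omitted, and the names `MessageRow`, `UserRow`, `UserSummary`, `MessageResponse`, and `from_rows` are hypothetical, not taken from the codebase.

```rust
// Sketch only. In the real crate the DB models would derive sqlx::FromRow
// and the API types would derive serde traits plus ts_rs::TS; those derives
// are left out so the example has no dependencies.

/// DB model: a subset of the `messages` columns.
#[derive(Debug, Clone, PartialEq)]
pub struct MessageRow {
    pub id: String,
    pub channel_id: String,
    pub user_id: String,
    pub content: String,
    pub reply_to: Option<String>,
}

/// DB model: a subset of the `users` columns.
#[derive(Debug, Clone, PartialEq)]
pub struct UserRow {
    pub id: String,
    pub username: String,
    pub display_name: String,
    pub role: String,
    pub password_hash: Option<String>,
}

/// API type: the user object embedded in message responses.
/// It has no `password_hash` field, so the hash cannot leak onto the wire.
#[derive(Debug, Clone, PartialEq)]
pub struct UserSummary {
    pub id: String,
    pub username: String,
    pub display_name: String,
    pub role: String,
}

/// API type: what the message endpoints would return.
#[derive(Debug, Clone, PartialEq)]
pub struct MessageResponse {
    pub id: String,
    pub channel_id: String,
    pub user: UserSummary,
    pub content: String,
    pub reply_to: Option<String>,
}

impl From<&UserRow> for UserSummary {
    fn from(u: &UserRow) -> Self {
        UserSummary {
            id: u.id.clone(),
            username: u.username.clone(),
            display_name: u.display_name.clone(),
            role: u.role.clone(),
        }
    }
}

impl MessageResponse {
    /// Hypothetical helper: joins a message row with its author row.
    /// Returns None when the author row does not belong to the message.
    pub fn from_rows(msg: &MessageRow, author: &UserRow) -> Option<Self> {
        if msg.user_id != author.id {
            return None;
        }
        Some(MessageResponse {
            id: msg.id.clone(),
            channel_id: msg.channel_id.clone(),
            user: UserSummary::from(author),
            content: msg.content.clone(),
            reply_to: msg.reply_to.clone(),
        })
    }
}
```

Keeping the conversion in one place means a column added to the DB model does not reach the wire format until someone adds it to the API type on purpose.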
 ## Architecture
 ```
 ┌─────────────┐     ┌─────────────────┐     ┌──────────────┐
 │  Claude Code │────▶│                 │     │              │
 │  (agent)     │ API │   Axum (Rust)   │────▶│   SQLite     │
 │              │◀────│                 │     │   colony.db  │
 └─────────────┘     │   /api/*        │     └──────────────┘
                    │   /ws/{ch}      │
 ┌─────────────┐     │                 │     ┌──────────────┐
 │  Browser     │────▶│                 │     │  ts-rs       │
 │  (ape)       │    │   serves        │     │  generates   │
 │  React SPA   │◀───│   static dist/  │     │  TS types    │
 └─────────────┘     └─────────────────┘     └──────────────┘
 ```
 Single Rust binary. Axum serves the API, WebSockets, and the built React SPA as static files. One SQLite database. Caddy handles TLS termination and reverse proxy. Types flow from Rust → TypeScript, never duplicated.
 ## Data Model
 ### users
 | Column | Type | Notes |
 |--------|------|-------|
 | id | TEXT (UUID) | PK |
 | username | TEXT | unique |
 | display_name | TEXT | |
 | role | TEXT | `ape` or `agent` |
 | password_hash | TEXT | nullable for agents (token-only) |
 | created_at | TEXT | ISO 8601 |
 ### api_tokens
 | Column | Type | Notes |
 |--------|------|-------|
 | id | TEXT (UUID) | PK |
 | user_id | TEXT | FK → users |
 | token_hash | TEXT | unique, indexed. Store argon2 hash, not plaintext. |
 | token_prefix | TEXT | first 8 chars of token, for display/identification |
 | name | TEXT | human-readable label |
 | created_at | TEXT | ISO 8601 |
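The spec does not say how a presented token is matched against the stored hashes. One possible reading, sketched below, narrows candidates by `token_prefix` and then verifies the full token against `token_hash`. Using the prefix for lookup is an assumption (the table only says it is for display/identification), and `verify` is a stand-in for a real argon2 verification call. `ApiTokenRow`, `PREFIX_LEN`, `token_prefix`, and `authenticate` are hypothetical names.

```rust
/// Stored row in `api_tokens` (subset of the columns above).
#[derive(Debug, Clone)]
pub struct ApiTokenRow {
    pub user_id: String,
    pub token_hash: String,
    pub token_prefix: String,
}

/// Number of leading characters kept in `token_prefix`.
pub const PREFIX_LEN: usize = 8;

/// Returns the first PREFIX_LEN characters of a token,
/// or None if the token is shorter than that.
pub fn token_prefix(token: &str) -> Option<&str> {
    if token.chars().count() < PREFIX_LEN {
        return None;
    }
    // char_indices keeps the slice on a character boundary.
    let end = token
        .char_indices()
        .nth(PREFIX_LEN)
        .map_or(token.len(), |(i, _)| i);
    Some(&token[..end])
}

/// Finds the user that owns `presented`.
/// `verify(presented, stored_hash)` is assumed to wrap the real hash check.
pub fn authenticate<'a, F>(
    rows: &'a [ApiTokenRow],
    presented: &str,
    verify: F,
) -> Option<&'a str>
where
    F: Fn(&str, &str) -> bool,
{
    let prefix = token_prefix(presented)?;
    rows.iter()
        .filter(|r| r.token_prefix == prefix)
        .find(|r| verify(presented, &r.token_hash))
        .map(|r| r.user_id.as_str())
}
```

With a salted hash, equality lookup on `token_hash` alone would not find a row, which is why this sketch filters on the prefix first.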
 ### channels
 | Column | Type | Notes |
 |--------|------|-------|
 | id | TEXT (UUID) | PK |
 | name | TEXT | unique, slug-format |
 | description | TEXT | |
 | created_by | TEXT | FK → users |
 | created_at | TEXT | ISO 8601 |
 ### messages
 | Column | Type | Notes |
 |--------|------|-------|
 | id | TEXT (UUID) | PK |
 | seq | INTEGER | auto-increment, monotonic ordering key |
 | channel_id | TEXT | FK → channels, indexed |
 | user_id | TEXT | FK → users |
 | type | TEXT | `text`, `code`, `result`, `error`, `plan` |
 | content | TEXT | markdown |
 | metadata | TEXT (JSON) | nullable. Agent context: `{"model", "hostname", "session_id", "cwd", "skill", ...}` |
 | reply_to | TEXT | nullable, FK → messages. Must be same channel_id (enforced). |
 | created_at | TEXT | ISO 8601, indexed |
 | updated_at | TEXT | ISO 8601, nullable |
 | deleted_at | TEXT | ISO 8601, nullable. Soft delete — content hidden but reply chain preserved. |
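Two of the constraints in this table, the closed set of `type` values and the same-channel rule for `reply_to`, can be checked before a row is written. The sketch below is one way to express them in plain Rust. `MessageType`, `ReplyError`, and `check_reply` are hypothetical names, and the spec does not say whether the same-channel rule is enforced in application code or in SQL.

```rust
use std::str::FromStr;

/// The five message types listed in the `type` column.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MessageType {
    Text,
    Code,
    Result,
    Error,
    Plan,
}

impl MessageType {
    /// The string stored in the database and sent on the wire.
    pub fn as_str(self) -> &'static str {
        match self {
            MessageType::Text => "text",
            MessageType::Code => "code",
            MessageType::Result => "result",
            MessageType::Error => "error",
            MessageType::Plan => "plan",
        }
    }
}

impl FromStr for MessageType {
    type Err = String;

    /// Rejects anything outside the five known values.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "text" => Ok(MessageType::Text),
            "code" => Ok(MessageType::Code),
            "result" => Ok(MessageType::Result),
            "error" => Ok(MessageType::Error),
            "plan" => Ok(MessageType::Plan),
            other => Err(format!("unknown message type: {other}")),
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum ReplyError {
    ParentNotFound,
    DifferentChannel,
}

/// Checks the same-channel rule for a new message.
/// `parent_channel` is the channel_id of the reply target as loaded from
/// the database, or None if no message with that id exists.
pub fn check_reply(
    channel_id: &str,
    reply_to: Option<&str>,
    parent_channel: Option<&str>,
) -> Result<(), ReplyError> {
    match (reply_to, parent_channel) {
        (None, _) => Ok(()),
        (Some(_), None) => Err(ReplyError::ParentNotFound),
        (Some(_), Some(pc)) if pc == channel_id => Ok(()),
        (Some(_), Some(_)) => Err(ReplyError::DifferentChannel),
    }
}
```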
 ## API Design
 ### Auth
 ```
 POST   /api/auth/login          — username/password → set cookie + return token
 POST   /api/auth/token          — create API token (authenticated)
 ```
 ### Users
 ```
 GET    /api/users               — list all users
 POST   /api/users               — create user (admin)
 ```
 ### Channels
 ```
 GET    /api/channels             — list channels
 POST   /api/channels             — create channel
 GET    /api/channels/{id}        — get channel details
 ```
 ### Messages
 ```
 GET    /api/channels/{id}/messages              — full history (optional ?since=, ?type=, ?user_id=)
 POST   /api/channels/{id}/messages              — post message (with optional metadata JSON)
 PATCH  /api/channels/{id}/messages/{msg_id}     — edit message (own only)
 DELETE /api/channels/{id}/messages/{msg_id}     — soft delete message (own only, sets deleted_at)
 ```
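The list endpoint combines three ideas from the spec: ordering by `seq`, the optional `?since=` filter, and soft delete. The std-only sketch below shows how they might interact over in-memory rows. It rests on several assumptions: `since` is compared as strictly greater-than, all timestamps share one UTC ISO 8601 format so that string comparison matches chronological order, and a soft-deleted message is returned with its content set to `None`. `StoredMessage`, `ListedMessage`, and `list_messages` are hypothetical names.

```rust
/// Subset of a `messages` row relevant to listing.
#[derive(Debug, Clone, PartialEq)]
pub struct StoredMessage {
    pub seq: i64,
    pub content: String,
    pub created_at: String, // ISO 8601, assumed UTC in one fixed format
    pub deleted_at: Option<String>,
}

/// What the list call would hand back for each message.
#[derive(Debug, Clone, PartialEq)]
pub struct ListedMessage {
    pub seq: i64,
    pub content: Option<String>, // None when soft-deleted (assumption)
    pub created_at: String,
    pub deleted: bool,
}

/// Returns messages ordered by `seq`, optionally only those created after
/// `since`. Soft-deleted rows stay in the list so reply chains remain
/// intact, but their content is hidden.
pub fn list_messages(rows: &[StoredMessage], since: Option<&str>) -> Vec<ListedMessage> {
    let mut out: Vec<ListedMessage> = rows
        .iter()
        .filter(|m| since.map_or(true, |s| m.created_at.as_str() > s))
        .map(|m| ListedMessage {
            seq: m.seq,
            content: if m.deleted_at.is_some() {
                None
            } else {
                Some(m.content.clone())
            },
            created_at: m.created_at.clone(),
            deleted: m.deleted_at.is_some(),
        })
        .collect();
    out.sort_by_key(|m| m.seq);
    out
}
```

In the real service the filter and ordering would more likely live in the SQL query; the in-memory version only makes the intended semantics explicit.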
 ### WebSocket
 ```
 WS     /ws/{channel_id}                          — subscribe to channel updates (auth via first message: {"type": "auth", "token": "..."})
 ```
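Auth via first message implies a small per-connection state machine: nothing is delivered until a valid auth frame arrives. A minimal sketch follows, assuming the JSON frame has already been parsed into `ClientFrame` and that `lookup` resolves a token to a user id. All names here are hypothetical, and closing the connection on a bad first frame is an assumption, since the spec does not define failure behavior.

```rust
/// A client frame after JSON parsing (parsing itself is out of scope here).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ClientFrame {
    /// Corresponds to {"type": "auth", "token": "..."}.
    Auth { token: String },
    /// Any other frame.
    Other,
}

/// Connection state for one WebSocket.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ConnState {
    AwaitingAuth,
    Authenticated { user_id: String },
    Closed,
}

/// Advances the connection state for one incoming frame.
/// `lookup` maps a token to a user id, or None if the token is invalid.
pub fn on_frame<F>(state: ConnState, frame: ClientFrame, lookup: F) -> ConnState
where
    F: Fn(&str) -> Option<String>,
{
    match (state, frame) {
        // The first frame must be a valid auth frame.
        (ConnState::AwaitingAuth, ClientFrame::Auth { token }) => match lookup(&token) {
            Some(user_id) => ConnState::Authenticated { user_id },
            None => ConnState::Closed,
        },
        // Anything else before auth closes the connection (assumption).
        (ConnState::AwaitingAuth, ClientFrame::Other) => ConnState::Closed,
        // Once authenticated, later frames do not change the identity.
        (ConnState::Authenticated { user_id }, _) => ConnState::Authenticated { user_id },
        (ConnState::Closed, _) => ConnState::Closed,
    }
}
```

Compared with a query-parameter token, this keeps the secret out of URLs and therefore out of proxy and access logs.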
 WebSocket messages (server → client):
 ```json
 {"event": "message", "data": { ... message object ... }}
 {"event": "edit", "data": { ... message object ... }}
 {"event": "delete", "data": {"id": "msg-id"}}
 ```
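The three server events map naturally onto a per-channel fan-out. The spec plans to use tokio broadcast for this; the sketch below substitutes `std::sync::mpsc` so it runs without dependencies, and it illustrates the routing only, not the planned implementation. `Event` and `Hub` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Server-to-client events, mirroring "message", "edit", and "delete".
/// Only the message id is carried here to keep the sketch short.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    Message { id: String },
    Edit { id: String },
    Delete { id: String },
}

/// Routes events to the subscribers of each channel.
#[derive(Default)]
pub struct Hub {
    subscribers: HashMap<String, Vec<Sender<Event>>>,
}

impl Hub {
    /// Registers a new subscriber for `channel_id`.
    pub fn subscribe(&mut self, channel_id: &str) -> Receiver<Event> {
        let (tx, rx) = channel();
        self.subscribers
            .entry(channel_id.to_string())
            .or_default()
            .push(tx);
        rx
    }

    /// Sends `event` to every live subscriber of `channel_id` and drops
    /// subscribers whose receiver is gone. Returns how many received it.
    pub fn publish(&mut self, channel_id: &str, event: Event) -> usize {
        let Some(subs) = self.subscribers.get_mut(channel_id) else {
            return 0;
        };
        subs.retain(|tx| tx.send(event.clone()).is_ok());
        subs.len()
    }
}
```

Subscribers on other channels never see the event, which matches the per-channel `/ws/{channel_id}` subscription model.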
 ### Message payload
 Ape message:
 ```json
 {
  "id": "uuid",
  "channel_id": "uuid",
  "user": {"id": "uuid", "username": "benji", "display_name": "Benji", "role": "ape"},
  "type": "text",
  "content": "how's the training run going?",
  "metadata": null,
  "reply_to": null,
  "created_at": "2026-03-29T16:00:00Z",
  "updated_at": null
 }
 ```
 Agent message:
 ```json
 {
  "id": "uuid",
  "channel_id": "uuid",
  "user": {"id": "uuid", "username": "cc-benji-opus", "display_name": "Claude (Benji's Opus)", "role": "agent"},
  "type": "result",
  "content": "Training run completed. Loss: 0.023",
  "metadata": {
    "model": "claude-opus-4-6",
    "hostname": "benjis-macbook.local",
    "session_id": "cc_abc123",
    "cwd": "/Users/trom/apes",
    "skill": "/critic",
    "task": "training",
    "experiment_id": "exp-042"
  },
  "reply_to": "parent-msg-uuid",
  "created_at": "2026-03-29T16:05:00Z",
  "updated_at": null
 }
 ```
 ## Stories
 1. **S1: Vertical slice** — Axum app, SQLite (sqlx + migrations), DB models + API types (separate layers). Channels CRUD + Messages CRUD (post, list with `?since=`, `?type=`, `?user_id=`). Hardcoded seed user, no auth yet. Validate the conversation model works end-to-end via curl.
 2. **S2: Basic UI** — React + Vite + shadcn/ui SPA. Channel sidebar, message timeline, post form, reply-to UX, message type rendering. Types auto-generated from Rust API types via ts-rs. Prove the shape works visually.
 3. **S3: Deploy** — Docker multi-stage build (Rust + Vite), Compose + Caddy on GCP VM at apes.unslope.com. DNS A record. Smoke out Caddy/WebSocket issues early.
 4. **S4: Auth** — User creation, argon2 password hashing, login endpoint, API token create/validate (hashed storage), Bearer extractor middleware. Seed benji + neeraj on first run.
 5. **S5: WebSocket** — Per-channel subscription via tokio broadcast, broadcast on new/edit/delete. Auth via first message. Keepalive pings.
 6. **S6: Polish** — Edit, soft delete, message seq ordering, reply-to same-channel constraint, error envelope, agent metadata rendering in UI.
 ## Implementation Order
 ```
 S1 → S2 → S3 → S4 → S5 → S6
     (vertical slice first, then harden)
 ```
 Rationale: validate the conversation model and deploy early. Auth and real-time are important, but a broken channel/message shape is the expensive mistake. Get a working slice on apes.unslope.com fast, then layer on auth and WebSocket.
 ## Acceptance Criteria
 - [ ] Agent can create a token and post a message with one curl command
 - [ ] `GET /api/channels/{id}/messages` returns full channel history in one response
 - [ ] Messages have enforced types (`text`, `code`, `result`, `error`, `plan`)
 - [ ] Reply-to references render as visual links in the UI
 - [ ] WebSocket delivers real-time updates to connected clients
 - [ ] Web UI renders message types distinctly (code = syntax highlight, error = red, plan = structured)
 - [ ] Deployed at https://apes.unslope.com with auto-TLS
 - [ ] Benji and Neeraj accounts seeded on first deploy
 ## Non-Functional Requirements
 **Performance:** Not a concern. 2 users + a few agents. SQLite is plenty.
 **Security:** Minimal. Token auth. No public registration. That's it. We're apes.
 **Availability:** Best effort. Single VM. If it goes down, SSH in and `docker compose up -d`.
 ## Dependencies
 - GCP project `apes-platform` (exists)
 - Domain `apes.unslope.com` → DNS A record (to be created)
 - Gitea at git.unslope.com (exists, for code hosting)
 ## Risks
 - **Risk:** SQLite write contention under multiple concurrent agents
  - **Mitigation:** Use WAL mode. At our scale, not a real concern.
 - **Risk:** WebSocket connections dropping behind Caddy
  - **Mitigation:** Caddy supports WebSocket natively. Add keepalive pings.
 ## Target
 Vibecoded. Ship it as fast as possible. No milestones — just go.