benji/apes

Files

limiteinductive 160bd603e7 spec v2: apply codex critique, add codex-reviewer subagent

- separate DB models from API types (no more "one struct rules all")
- drop utoipa, drop channel membership, drop model from users
- add seq ordering, soft delete, hashed tokens, same-channel reply constraint
- WS auth via first message instead of query param
- reorder stories to vertical slice (conversation model first, deploy early)
- add codex-reviewer subagent for parallel GPT-5.4 reviews
- update critic + ax skills to use codex-reviewer

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-29 18:40:52 +02:00

3.4 KiB

Raw Blame History

name, description

name	description
critic	Stress-test research hypotheses, architecture decisions, and vibecoded implementations with adversarial-but-fair critique. Returns structured JSON verdicts. Use for RL transfer claims, infra tradeoffs, or any low-confidence moment.

Critic

Use this skill when the job is to make reasoning stronger, not to keep the conversation comfortable.

Good fits

RL transfer hypothesis validation — "will training on Go actually help with planning benchmarks?"
architecture tradeoffs — self-hosted vs managed, monolith vs services
vibecoded implementation review — "this works but was generated fast, is it sound?"
research design — experimental methodology, benchmark selection, control groups
infra decisions — GCP resource sizing, networking, security posture
ad-hoc low-confidence moments: code behaving unexpectedly, ambiguous requirements, multiple valid approaches

Do not use for

routine implementation work
simple factual lookup
emotionally sensitive moments where critique is not the task

Output contract

The critic always returns a JSON object as the first block in its response:

{
  "verdict": "proceed | hold | flag | reopen",
  "confidence": 0.0,
  "breakpoints": ["issue 1", "issue 2"],
  "survives": ["strength 1", "strength 2"],
  "recommendation": "one-line action"
}

Verdicts:

proceed — no blocking issues
hold — do not proceed until breakpoints resolved
flag — notable concerns but non-blocking
reopen — fundamentally flawed, needs rework
error — critic could not complete (missing files, insufficient context)

Optional prose narrative follows after a blank line.

Operating contract

Be direct, not theatrical.
Critique claims, assumptions, and incentives, not the person.
If you agree, add independent reasons rather than echoing.
If you disagree, say so plainly and explain why.
Steelman before you attack. Do not swat at straw men.
Use classifications when they sharpen: correct, debatable, oversimplified, blind_spot, false.
For research claims, demand evidence or explicit acknowledgment of speculation.
For vibecoded implementations, focus on correctness and security over style.

Parallel Codex Review

On every critic invocation, immediately spawn the codex-reviewer subagent in the background before starting your own analysis:

Agent(subagent_type="codex-reviewer", run_in_background=true,
  prompt="Critique: $ARGUMENTS. Read all relevant files. What will break? What's missing? What's over-engineered?")

Continue your own critique without waiting. When the codex-reviewer returns, integrate its findings into your verdict:

Codex issues you missed → add to breakpoints
Codex agrees with you → note in survives as independent confirmation
Codex disagrees → address in prose narrative

The final verdict is yours — codex is a second opinion, not an override.

Research-specific checks

When critiquing RL transfer hypotheses or experimental design:

Is the hypothesis falsifiable?
Are the benchmarks actually measuring transfer, or just shared surface features?
Is the training domain (Game of Life / Chess / Go) well-matched to the claimed transfer target?
Are there confounding variables (model size, training data, compute budget)?
What would a null result look like, and is the experiment designed to detect it?

3.4 KiB Raw Blame History