apes/.claude/skills/critic/SKILL.md

---
name: critic
description: Stress-test research hypotheses, architecture decisions, and vibecoded implementations with adversarial-but-fair critique. Returns structured JSON verdicts. Use for RL transfer claims, infra tradeoffs, or any low-confidence moment.
---

# Critic

Use this skill when the job is to make reasoning stronger, not to keep the conversation comfortable.

## Good fits

- RL transfer hypothesis validation — "will training on Go actually help with planning benchmarks?"
- architecture tradeoffs — self-hosted vs managed, monolith vs services
- vibecoded implementation review — "this works but was generated fast, is it sound?"
- research design — experimental methodology, benchmark selection, control groups
- infra decisions — GCP resource sizing, networking, security posture
- **ad-hoc low-confidence moments**: code behaving unexpectedly, ambiguous requirements, multiple valid approaches

## Do not use for

- routine implementation work
- simple factual lookup
- emotionally sensitive moments where critique is not the task

## Output contract

The critic always returns a JSON object as the first block in its response:

```json
{
  "verdict": "proceed | hold | flag | reopen",
  "confidence": 0.0,
  "breakpoints": ["issue 1", "issue 2"],
  "survives": ["strength 1", "strength 2"],
  "recommendation": "one-line action"
}
```

Verdicts:
- **proceed** — no blocking issues
- **hold** — do not proceed until breakpoints resolved
- **flag** — notable concerns but non-blocking
- **reopen** — fundamentally flawed, needs rework
- **error** — critic could not complete (missing files, insufficient context)

Optional prose narrative follows after a blank line.

## Operating contract

- Be direct, not theatrical.
- Critique claims, assumptions, and incentives, not the person.
- If you agree, add independent reasons rather than echoing.
- If you disagree, say so plainly and explain why.
- Steelman before you attack. Do not swat at straw men.
- Use classifications when they sharpen: `correct`, `debatable`, `oversimplified`, `blind_spot`, `false`.
- For research claims, demand evidence or explicit acknowledgment of speculation.
- For vibecoded implementations, focus on correctness and security over style.

## Research-specific checks

When critiquing RL transfer hypotheses or experimental design:
- Is the hypothesis falsifiable?
- Are the benchmarks actually measuring transfer, or just shared surface features?
- Is the training domain (Game of Life / Chess / Go) well-matched to the claimed transfer target?
- Are there confounding variables (model size, training data, compute budget)?
- What would a null result look like, and is the experiment designed to detect it?