
hamel-evals-skills

Use when auditing or building an LLM-product eval pipeline — Hamel Husain's evals skills cover error analysis, judge validation, synthetic data, and review interfaces, distilled from advising 50+ AI teams.

Install confidence

Install command: `curl --create-dirs -fsSL https://skillmake.xyz/i/hamel-evals-skills -o ~/.claude/skills/hamel-evals-skills/SKILL.md`

- Pinned content: sha:ca50f84f872b3cc4
- Generated with: manual
- Source: github.com

The file served at /api/marketplace/hamel-evals-skills-ca50f84f/raw matches this hash. Inspect before install, then copy the command.

5,429 chars · ~1,357 tokens
---
name: hamel-evals-skills
description: "Use when auditing or building an LLM-product eval pipeline — Hamel Husain's evals skills cover error analysis, judge validation, synthetic data, and review interfaces, distilled from advising 50+ AI teams."
source: https://github.com/hamelsmu/evals-skills
generated: 2026-05-17T04:18:20.985Z
category: concept
audience: ai
---

## When to use

- Auditing an existing eval setup that 'feels off' — no error taxonomy, no judge calibration, no labeled data
- Bootstrapping evals from scratch on a new LLM product (chatbot, RAG, agent) with zero traces
- Designing an LLM-as-Judge for a subjective criterion (tone, helpfulness, faithfulness) and proving it tracks human labels
- Validating that a judge isn't biased before you trust it to gate releases or rank prompts
- Building a lightweight annotation interface so a domain expert can label traces quickly
- Evaluating a RAG pipeline end-to-end — retrieval quality separately from generation quality

## Key concepts

### error analysis (open coding)

Read raw traces, write a free-text note describing each failure, then cluster notes into a taxonomy. The error-analysis skill walks an agent through this loop so failures get named before metrics get invented. Without a taxonomy you cannot tell whether a prompt change actually helped or just shifted failure modes.
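The last step of that loop, turning clustered notes into a ranked taxonomy, can be sketched as a simple tally. This is a minimal illustration, not the skill's own implementation: the trace IDs, notes, category names, and the `failure_taxonomy` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical open-coding output: one free-text note per failing trace,
# plus the category a reviewer assigned when clustering the notes.
coded_notes = [
    {"trace_id": "t1", "note": "quoted a refund policy that does not exist",
     "category": "hallucinated policy"},
    {"trace_id": "t2", "note": "ignored the order-lookup tool result",
     "category": "ignored tool result"},
    {"trace_id": "t3", "note": "invented a 90-day return window",
     "category": "hallucinated policy"},
    {"trace_id": "t4", "note": "curt reply to an upset customer",
     "category": "wrong tone"},
]


def failure_taxonomy(notes):
    """Count failures per category, most frequent first."""
    counts = Counter(n["category"] for n in notes)
    return counts.most_common()


taxonomy = failure_taxonomy(coded_notes)
for category, count in taxonomy:
    print(f"{category}: {count}")
```

Re-running the same tally after a prompt change shows whether a category actually shrank or whether failures only moved to a different category.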

### LLM-as-Judge validation

A judge prompt is itself a model that needs evaluation. The validate-evaluator skill calibrates a judge against a held-out labeled set, reporting true-positive rate (TPR) and true-negative rate (TNR), and corrects for class imbalance. A judge with 90% accuracy on a 95%-pass dataset is worthless, because a judge that always answers "pass" would score 95%. The skill forces you to measure TPR and TNR, not raw agreement.
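To see why raw agreement misleads on an imbalanced set, here is a small sketch that computes accuracy, TPR, and TNR for a degenerate judge. It assumes boolean labels where `True` means "pass"; the `judge_rates` helper and the toy data are illustrative and not part of the skill.

```python
def judge_rates(human, judge):
    """Return (accuracy, TPR, TNR), treating True as 'pass'.

    TPR: share of human-passed traces the judge also passes.
    TNR: share of human-failed traces the judge also fails.
    """
    pairs = list(zip(human, judge))
    tp = sum(1 for h, j in pairs if h and j)
    tn = sum(1 for h, j in pairs if not h and not j)
    pos = sum(1 for h, _ in pairs if h)
    neg = len(pairs) - pos
    accuracy = (tp + tn) / len(pairs)
    tpr = tp / pos if pos else float("nan")
    tnr = tn / neg if neg else float("nan")
    return accuracy, tpr, tnr


# 95 passing traces and 5 failing ones; this judge says "pass" every time.
human_labels = [True] * 95 + [False] * 5
always_pass = [True] * 100

accuracy, tpr, tnr = judge_rates(human_labels, always_pass)
print(f"accuracy={accuracy:.2f} TPR={tpr:.2f} TNR={tnr:.2f}")
```

The always-pass judge reaches 95% accuracy while catching none of the failures (TNR of 0), which is exactly the case a single agreement number hides.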

### data splits for judges

Labeled traces get split into train (used to write the judge prompt), dev (used to iterate the judge), and test (held out, touched once). Touching test more than once leaks signal and inflates judge confidence. validate-evaluator enforces this split discipline.
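A minimal sketch of such a split, assuming a seeded shuffle over a flat list of labeled traces. The 20/40/40 proportions and the `split_labeled_traces` helper are assumptions chosen for illustration; the skill may prescribe different ratios.

```python
import random


def split_labeled_traces(traces, train_frac=0.2, dev_frac=0.4, seed=0):
    """Shuffle once with a fixed seed, then cut train/dev/test slices.

    The fractions are illustrative defaults, not values from the skill.
    Whatever is left after train and dev becomes the held-out test set.
    """
    shuffled = list(traces)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_dev = round(len(shuffled) * dev_frac)
    return {
        "train": shuffled[:n_train],
        "dev": shuffled[n_train:n_train + n_dev],
        "test": shuffled[n_train + n_dev:],
    }


all_traces = [f"trace-{i}" for i in range(100)]
splits = split_labeled_traces(all_traces)
print({name: len(items) for name, items in splits.items()})
```

Fixing the seed keeps the test slice stable across runs, so iterating on the judge against dev never quietly reshuffles test examples into view.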

### synthetic data via dimensions

generate-synthetic-data builds diverse inputs by enumerating tuples across dimensions (user persona x intent x edge case x channel) instead of asking an LLM for '100 examples,' which collapses to a narrow mode. Tuple-based generation surfaces failure modes you would never script by hand.
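The tuple enumeration can be sketched with `itertools.product`. The dimension values, the `sample_tuples` and `to_generation_prompt` helpers, and the prompt wording are hypothetical; only the persona x intent x edge case x channel shape comes from the description above.

```python
import itertools
import random

# Hypothetical dimension values for a customer-support assistant.
dimensions = {
    "persona": ["new user", "power user", "frustrated customer"],
    "intent": ["refund request", "billing question", "cancel account"],
    "edge_case": ["none", "ambiguous request", "missing order id"],
    "channel": ["chat", "email"],
}


def sample_tuples(dims, k, seed=0):
    """Enumerate every combination, then sample k distinct ones."""
    names = list(dims)
    combos = [dict(zip(names, values))
              for values in itertools.product(*dims.values())]
    return random.Random(seed).sample(combos, k)


def to_generation_prompt(combo):
    """Turn one tuple into an instruction for a generator LLM."""
    spec = ", ".join(f"{name}: {value}" for name, value in combo.items())
    return f"Write one realistic user message for a support assistant. {spec}"


sampled = sample_tuples(dimensions, k=10)
for combo in sampled[:3]:
    print(to_generation_prompt(combo))
```

Each sampled tuple becomes its own generation request, so coverage comes from the enumeration and does not depend on the generator's sense of "diverse".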

### eval-audit (severity-ranked findings)

eval-audit walks an existing pipeline in parallel diagnostic passes (taxonomy? labels? judge calibrated? data splits? regression suite?) and returns prioritized findings. It is the recommended entry point for teams who already have 'some evals' and want to know which gap matters most.

### review interface as a first-class artifact

build-review-interface generates a custom annotation UI tailored to the product's trace shape. A spreadsheet is not enough: labelers need to see tool calls, retrieved chunks, and full conversations inline. A purpose-built interface can speed labeling by roughly 5-10x, which can be the difference between 50 and 500 labels per session.

## API reference

```
/plugin marketplace add hamelsmu/evals-skills
```

Add the evals-skills marketplace to Claude Code, then install the plugin, so that all seven skills (eval-audit, error-analysis, generate-synthetic-data, write-judge-prompt, validate-evaluator, evaluate-rag, build-review-interface) become slash commands.

```
/plugin marketplace add hamelsmu/evals-skills
/plugin install evals-skills@hamelsmu-evals-skills
```

```
npx skills add https://github.com/hamelsmu/evals-skills
```

Install the same skills via the framework-agnostic skills CLI for agents outside Claude Code. Run `npx skills update` to pull new versions.

```
npx skills add https://github.com/hamelsmu/evals-skills
npx skills update
```

```
/eval-audit
```

Run a severity-ranked diagnostic over an existing eval pipeline. Surfaces missing taxonomies, uncalibrated judges, leaked test sets, and metric/UX mismatches with prioritized fixes.

```
/error-analysis
```

Drive an agent through open-coding traces, clustering failures into a taxonomy, and producing a labeled dataset suitable for judge calibration.

```
/write-judge-prompt
/validate-evaluator
```

Author an LLM-as-Judge for a subjective criterion, then validate it against held-out human labels with TPR/TNR and bias correction before trusting it.
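One way to apply a bias correction once TPR and TNR are known is the standard misclassification adjustment (often called the Rogan-Gladen estimator). Whether validate-evaluator uses exactly this formula is an assumption; the `corrected_pass_rate` helper below is a sketch for illustration only.

```python
def corrected_pass_rate(observed_pass_rate, tpr, tnr):
    """Estimate the true pass rate from a judge's observed pass rate.

    Assumed model (not confirmed as the skill's method):
        observed = true * TPR + (1 - true) * (1 - TNR)
    Solving for the true rate gives:
        true = (observed + TNR - 1) / (TPR + TNR - 1)
    """
    denominator = tpr + tnr - 1.0
    if denominator <= 0:
        raise ValueError("judge is no better than chance; cannot correct")
    estimate = (observed_pass_rate + tnr - 1.0) / denominator
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion


# Judge with TPR 0.9 and TNR 0.8 reports 69% of unlabeled traces as passing.
estimate = corrected_pass_rate(0.69, tpr=0.9, tnr=0.8)
print(f"corrected pass rate: {estimate:.2f}")
```

The correction divides by TPR + TNR - 1, so a judge close to chance produces an unstable estimate. That is one more reason to validate both rates before trusting the judge.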

```
/evaluate-rag
```

Score retrieval and generation separately: retrieval gets ranking metrics on a labeled query set, and generation gets a calibrated judge applied to answers produced from correct context.
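For the retrieval half, two common ranking metrics are Recall@k and mean reciprocal rank (MRR). Which metrics evaluate-rag actually reports is not stated here, so treat the choice of metrics, the helper names, and the document IDs below as assumptions.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical labeled query set: ranked retrieval output plus gold documents.
labeled_queries = [
    {"retrieved": ["d3", "d1", "d7"], "relevant": ["d1"]},
    {"retrieved": ["d2", "d9", "d4"], "relevant": ["d4", "d5"]},
]

n = len(labeled_queries)
mean_recall = sum(recall_at_k(q["retrieved"], q["relevant"], k=3)
                  for q in labeled_queries) / n
mrr = sum(reciprocal_rank(q["retrieved"], q["relevant"])
          for q in labeled_queries) / n
print(f"Recall@3={mean_recall:.2f} MRR={mrr:.3f}")
```

Scoring retrieval on its own like this keeps a generation failure from being blamed on the retriever, and the reverse.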

## Gotchas

- Raw judge accuracy lies when classes are imbalanced — always report TPR and TNR separately, not a single number
- If you touch the test set more than once while tuning the judge, you have leaked it; cut a new test slice
- 'Generate 100 diverse examples' from an LLM collapses to a narrow distribution — enumerate dimensions and sample tuples instead
- Spreadsheets are not annotation tools — labelers need to see tool calls and retrieved chunks inline or they will skim
- Skipping error analysis and jumping straight to metrics produces eval scores that move without telling you why
- A judge that agrees with humans on training traces may still be biased on the production distribution — validate on held-out data drawn from real traffic

---
Generated by SkillMake from https://github.com/hamelsmu/evals-skills on 2026-05-17T04:18:20.985Z.
Verify against source before relying on details.
