The Guide · 2026-05-27

SkillOpt on skillmake.xyz

SkillOpt treats a SKILL.md file as a trainable parameter — frozen model, optimized markdown. This page is the single doc for understanding what is live on skillmake.xyz, how the A/B routing decides which version of a skill you see, and how to operate the system end to end. Read top to bottom: newcomer first, then product/PM, then engineer.

Phase 1: live · Phase 2: next · Phase 3–4: cron
Layer 1 — newcomer

01 What SkillOpt is

A SKILL.md is the markdown file an AI agent reads to learn how to do one specific task — install a library, refactor a file, render a video. The agent loads it on demand, treats it as authoritative, and behaves accordingly. skillmake.xyz is the public marketplace where these skills live; each one has an install URL like skillmake.xyz/i/superpowers that returns the raw markdown so any agent can fetch it.

The SkillOpt paper (Wang et al., arXiv 2605.23904) makes one simple move: treat the SKILL.md as a trainable parameter. The model stays frozen — we never fine-tune Claude or Cursor — but the markdown gets optimized over time the same way a neural net's weights would. Bounded edits per step. Held-out validation gate. Strict-improvement promotion. Sections you mark as "do not touch" are byte-locked against the edit machinery.

That framing matters because "self-improving agents" without those guardrails mostly produce slop. SkillOpt is the discipline that turns a freeform "edit this skill" button into a stable optimization loop.

Frozen model plus trained context window is the practical adaptation. We are not training Claude. We are training the document Claude reads.

02 Three FAQs

What is a SKILL.md and what is skillmake.xyz?

A SKILL.md is a plain-text recipe an agent loads to do one job. skillmake.xyz hosts a curated marketplace of them — 94 skills today across engineers, creators, devops, design, AI, marketing, and general audiences. Each skill has a stable install URL (/i/<name>) that returns the raw markdown so any agent — Claude Code, Cursor, Codex, Aider, Cline — can install it in one curl. The site itself is a Next.js app on Cloudflare Workers (via OpenNext), with KV for storage and Analytics Engine for usage telemetry.

Are we collecting usage data?

Yes — every install hit, page view, and click writes a row to Cloudflare Workers Analytics Engine. No personally identifying info is stored. The daily visitor id is a SHA-256 of ip + ua + day truncated to 8 bytes — same person on the same day collapses to one id, but cannot be tracked across days. Search queries are hashed before write. There are 14 MetricEvent types in src/lib/metrics-core.ts:

| # | Event | What it tells us |
| --- | --- | --- |
| 1 | install_hit | Someone fetched /i/<name>. Encoded as name@hash8 since Phase 1b so we can split by version. This is the gradient signal. |
| 2 | marketplace_view | The listing page for a skill loaded. Denominator for the conversion ratio. |
| 3 | home_view | Homepage loaded. blob2 carries the audience filter slug if active. |
| 4 | submit_started | Top of the submit funnel: someone began creating a new skill. |
| 5 | submit_completed | Bottom of the submit funnel: submission saved. |
| 6 | convert_success | The /api/convert extraction pipeline succeeded. |
| 7 | convert_error | The extraction pipeline failed. blob2 = error code, double2 = HTTP status. |
| 8 | tricks_view | The /tricks subsection loaded. |
| 9 | powerhouse_view | The /powerhouse subsection loaded. |
| 10 | search_submitted | A search query submitted (hashed). Volume + funnel signal, not literal text. |
| 11 | github_click | Outbound click to the repo. Catches "intent leak": read the listing, bounced to GitHub instead of installing. |
| 12 | page_dwell | Bucketed dwell time: 0–5s / 5–15s / 15–30s / 30–60s / 60–300s / 300s+. |
| 13 | scroll_depth | Bucketed reading depth: 0 / 25 / 50 / 75 / 100. |
| 14 | api_error | Any 4xx/5xx from a server route. Mirrored to a structured JSON log via observability.ts. |
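The daily visitor id described in the FAQ above can be sketched as follows. This is a minimal illustration, not the site's code: the function name, the `|` separator, and the field order are assumptions, since the guide only states "SHA-256 of ip + ua + day truncated to 8 bytes".

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: the real separator and field order are not shown in the guide.
export function dailyVisitorId(ip: string, userAgent: string, day: string): string {
  const digest = createHash("sha256")
    .update(`${ip}|${userAgent}|${day}`) // "|" separator is an assumption
    .digest();
  // Keep only the first 8 bytes, which is 16 hex characters.
  return digest.subarray(0, 8).toString("hex");
}

const today = dailyVisitorId("203.0.113.7", "curl/8.5.0", "2026-05-27");
const sameDay = dailyVisitorId("203.0.113.7", "curl/8.5.0", "2026-05-27");
const nextDay = dailyVisitorId("203.0.113.7", "curl/8.5.0", "2026-05-28");
```

Because the day is part of the hashed input, `today` and `sameDay` are equal while `nextDay` differs, which is the "one id per person per day" property the FAQ describes.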

What can go wrong?

Three failure modes are real enough that we built explicit mitigations:

03 Glossary

| Term | Definition |
| --- | --- |
| SKILL.md | The plain-text recipe an agent loads to learn one task. The trainable parameter. |
| marketplace | The browse surface at skillmake.xyz where humans pick skills; each entry has an /i/<name> install URL. |
| install_hit | The telemetry event fired when an agent fetches a skill's markdown. The gradient signal for the optimizer. |
| candidate | A new version of a skill behind the A/B share. Not yet promoted. |
| current | The version served to the majority of /i/<name> hits. There is exactly one per skill family. |
| retired | A previous version, kept queryable for 30 days for rollback and audit. |
| A/B split | Deterministic per-visitor routing between candidate and current; the same visitor sees the same arm across reloads. |
| held-out gate | The dynamic conversion comparison that decides promotion. Welch's t-test, p < 0.05, strict improvement. |
| bounded edit | One of the 9 atomic SkillEdit ops. Max 8 per candidate: the textual learning rate. |
| protected section | A section id (whenToUse, keyConcepts, apiReference, or gotchas) that the edit machinery refuses to modify. |
Layer 2 — product / PM

04 SkillOpt figure mapping

The paper's Figure 1 walks through the gradient-descent analogy. Here is the exact translation we use on skillmake:

| Paper concept | skillmake equivalent |
| --- | --- |
| Parameter θ | A published SKILL.md (MarketplaceEntry.markdown in KV) |
| Frozen model f | Whatever agent the visitor runs (Claude Code, Cursor, Codex, …); we never touch it |
| Gradient direction | Trajectory-derived edits from the optimizer: 4–8 atomic SkillEdit ops derived from telemetry + failure signal |
| Learning rate | Bounded edit budget: MAX_EDITS_PER_STEP = 8; forbidden ops are replace_name, replace_apiReference.signature, replace_category, replace_audience |
| Validation check | Held-out conversion gate: install_hit / marketplace_view strictly improves on A/B-routed traffic, p < 0.05 |
| Batch / minibatch | N-install threshold before the gate fires; default 200 installs per arm |
| Rejected side updates | Candidates that fail the static or dynamic gate, logged as negative feedback for the next optimizer pass |
| Protected section invariant | protectedSections + <!-- @protected:<id> --> markers, byte-identical between candidate and parent |

05 The four phases

Phase 1: Foundation (live)

Versioned KV storage, bounded edits, static gate, protected sections, per-version Grafana panels.

Phase 2: A/B routing (pending)

Deterministic per-visitor split at /i/<name>:

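// Look up both versions in parallel, then pick one arm for this visitor.
const [current, candidate] = await Promise.all([getCurrentSkillByName(name), getCandidateSkillByName(name)]);
// No candidate means no A/B in flight, so everyone gets the current version.
const useCandidate = candidate && shouldRouteCandidate(req, candidate.abTrafficShare);
const served = useCandidate ? candidate : current;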

shouldRouteCandidate hashes (cf-connecting-ip + name), so the same visitor always sees the same arm. Default share: 20% candidate / 80% current. While a candidate exists for a skill, cache headers flip to private, no-store.
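One way the deterministic split could work is sketched here. The guide does not show the body of shouldRouteCandidate, so `routeToCandidate`, the choice of SHA-256, and the `:` separator are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical stand-in for shouldRouteCandidate; the real one takes the request object.
export function routeToCandidate(ip: string, name: string, share: number): boolean {
  const digest = createHash("sha256").update(`${ip}:${name}`).digest();
  // Read the first 4 bytes as an unsigned int and scale into [0, 1).
  const bucket = digest.readUInt32BE(0) / 2 ** 32;
  // Visitors whose bucket falls under the share see the candidate.
  return bucket < share;
}

const firstVisit = routeToCandidate("203.0.113.7", "caveman", 0.2);
const reload = routeToCandidate("203.0.113.7", "caveman", 0.2);
```

The same (ip, name) pair always hashes to the same bucket, so `firstVisit` and `reload` agree. Raising the share only moves the threshold, so visitors already on the candidate arm stay there.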

Phase 3: Dynamic validation gate (cron · hourly)

Hourly Cloudflare Cron at /api/cron/validate-candidates reads Analytics Engine for the last 7 days. For each candidate: query marketplace_view count (route-agnostic, shared denominator), query install_hit per arm, require ≥ 200 installs per arm, Welch's t-test on conversion ratios with p < 0.05 AND strict improvement. Pass → promoteCandidate(name) atomically. Regress → retireCandidate(name) + write the diff to the negative-feedback store.

Phase 4: Optimizer loop (cron · weekly)

Weekly cron /api/cron/propose-edits: find approved skills below the 25th percentile in conversion with no active candidate, pull 14 days of telemetry + last N rejected candidates, render to a narrow prompt, require a Zod-validated SkillEdit[] of length 4–8, apply → static gate → save as candidate. A SKILLOPT_ENABLED=true|false env var gates every cron, so the kill switch is one flip.
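The target-selection step of the weekly cron might look like the sketch below. The `SkillStats` shape and function names are hypothetical, and the nearest-rank percentile is an assumption, because the guide does not say which percentile definition is used.

```typescript
interface SkillStats {
  name: string;
  conversion: number;    // install_hit / marketplace_view
  hasCandidate: boolean; // true while an A/B is in flight
}

// Nearest-rank percentile over an ascending array (an assumed definition).
function percentile(sorted: number[], p: number): number {
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Skills strictly below the 25th percentile that have no active candidate.
export function pickOptimizationTargets(skills: SkillStats[]): string[] {
  if (skills.length === 0) return [];
  const sorted = skills.map((s) => s.conversion).sort((a, b) => a - b);
  const cutoff = percentile(sorted, 25);
  return skills
    .filter((s) => s.conversion < cutoff && !s.hasCandidate)
    .map((s) => s.name);
}

// Illustrative numbers only, not real telemetry.
const corpus: SkillStats[] = ["a", "b", "c", "d", "e", "f", "g", "h"].map((name, i) => ({
  name,
  conversion: (i + 1) / 100,
  hasCandidate: false,
}));
const targets = pickOptimizationTargets(corpus);
```

Skipping skills that already have a candidate keeps the optimizer from stacking a second experiment on a family whose A/B is still running.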

06 A/B evaluation playbook: 7 steps

Once Phase 2 lands, here is exactly what running an experiment looks like from the operator side.

STEP 0

Pre-flight static gate

The candidate goes through skill-validator on insert. Token count under 1500 (the paper's compactness lesson — best skills land around 920). Description shares a keyword with each whenToUse item — the cheap router/body proxy. Protected sections byte-identical to the parent. Edit budget ≤ 8 ops. Fail = the candidate never enters the A/B; the curator sees the reasons and can revise or force-pass with a logged override.
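The description/whenToUse keyword check from the static gate could be approximated as below. The tokenizer, the stop-word list, and the function names are assumptions; the guide only says each whenToUse item must share a keyword with the description.

```typescript
// Hypothetical stop-word list; the real keyword rule is not shown in the guide.
const STOP_WORDS = new Set(["the", "and", "for", "you", "with", "when", "use", "want", "that", "this"]);

function keywords(text: string): string[] {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return words.filter((w) => w.length >= 3 && !STOP_WORDS.has(w));
}

// Returns the whenToUse items that share no keyword with the description.
export function incoherentItems(description: string, whenToUse: string[]): string[] {
  const descWords = new Set(keywords(description));
  return whenToUse.filter((item) => !keywords(item).some((w) => descWords.has(w)));
}

const drift = incoherentItems("Use when you want to cut output tokens in agent replies", [
  "Reducing output tokens on long sessions",
  "Rendering a video",
]);
```

Here `drift` contains only "Rendering a video", the item that has wandered away from what the description promises.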

STEP 1

Start at 20% candidate share

Default abTrafficShare: 0.2. shouldRouteCandidate hashes (cf-connecting-ip + name). Cache header drops to private, no-store for the duration — otherwise the edge cache poisons the split.

STEP 2

Wait for ≥ 200 installs per arm — or 14 days

The hourly cron reports "insufficient data" below 200 per arm. If 14 days pass without either arm crossing the threshold, the LLM-judge fallback decides instead. Curator can manually nudge the share higher (50% / 80%) to accelerate low-traffic skills.

STEP 3

Read the dashboard

Grafana exposes the per-version panels needed to sanity-check the test before the cron decides:

STEP 4

Cron decides — Welch's t-test, p < 0.05, strict improvement

Conversion ratios per arm, Welch's t-test (unequal variance — realistic for differently-sized arms), p < 0.05 AND strict improvement. Ties reject. Regressions retire the candidate and the diff lands in the negative-feedback store keyed at skill:<id>:rejected:v<hash>.
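The decision rule above can be sketched as a Welch's t-test on two conversion rates. Several things here are assumptions: the guide shares one marketplace_view denominator across arms, so the per-arm `exposures` field is hypothetical, and the p-value uses a normal approximation (reasonable only because each arm has hundreds of observations) rather than an exact t distribution.

```typescript
interface Arm {
  exposures: number; // views attributed to this arm (assumed; the guide shares one denominator)
  installs: number;
}

// Abramowitz-Stegun 7.1.26 approximation of erf (max abs error about 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

export function welchGate(current: Arm, candidate: Arm, alpha = 0.05) {
  const stats = (a: Arm) => {
    const mean = a.installs / a.exposures;
    // Unbiased sample variance of a 0/1 outcome.
    const variance = (mean * (1 - mean) * a.exposures) / (a.exposures - 1);
    return { mean, n: a.exposures, se2: variance / a.exposures };
  };
  const c = stats(current);
  const k = stats(candidate);
  const t = (k.mean - c.mean) / Math.sqrt(c.se2 + k.se2);
  // Welch-Satterthwaite degrees of freedom.
  const df = (c.se2 + k.se2) ** 2 / (c.se2 ** 2 / (c.n - 1) + k.se2 ** 2 / (k.n - 1));
  // Two-sided p-value via the normal approximation.
  const pValue = 1 - erf(Math.abs(t) / Math.SQRT2);
  // Strict improvement AND significance; ties and regressions never promote.
  const promote = k.mean > c.mean && pValue < alpha;
  return { t, df, pValue, promote };
}

// Illustrative counts only, not real telemetry.
const clearWin = welchGate({ exposures: 1000, installs: 200 }, { exposures: 1000, installs: 300 });
const tie = welchGate({ exposures: 1000, installs: 200 }, { exposures: 1000, installs: 200 });
```

The unequal-variance form matters with a 20/80 split, because the smaller arm contributes a larger standard error and should not borrow precision from the larger one.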

STEP 5

Post-promotion sanity check

Pointer flips atomically via promoteCandidate(name). Cache header restores to public, max-age=300, must-revalidate. Previous current moves to retired but stays queryable for 30 days from /admin/audit — every auto-promote is reversible. Operator pulls per-version panels for the next 24 hours and confirms the conversion lift holds at 100% traffic.
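The cache-header flip around an experiment reduces to one branch. The header values are taken from the guide; the function name is hypothetical.

```typescript
// Chooses the Cache-Control value for /i/<name> responses.
export function installCacheControl(hasActiveCandidate: boolean): string {
  return hasActiveCandidate
    ? "private, no-store" // a shared edge cache would serve one arm to every visitor
    : "public, max-age=300, must-revalidate";
}

const duringTest = installCacheControl(true);
const afterPromotion = installCacheControl(false);
```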

STEP 6

Per-skill effect size > corpus average

The test that matters. The paper's central insight: aggregate accuracy is the wrong unit. You can ship a candidate that raises the corpus mean by 2pp while making three high-traffic skills measurably worse. The cron decides per-skill; the operator spot-checks that this skill's effect size exceeds the corpus average over the last 14 days.

STEP 7

Failure modes to watch

07 Live numbers: last 24h

Pulled live from Cloudflare Analytics Engine SQL API on 2026-05-27 against the skillmake_metrics dataset.

Installs · 24h: 2,505
Versioned installs: 153
Versioned share: 6.1%
Distinct skills hit: 191

The 6.1% versioned share is the rate at which install rows now carry the name@hash8 suffix. That number will climb as edge caches expire and older clients fetch fresh markdown — by the time Phase 2 lands and we actually run a real candidate, every install row will be versioned.

Top 5 skills by install count (last 24h)

  1. superpowers: 83
  2. claude-api: 77
  3. anthropic-frontend-design: 77
  4. mcp-builder: 77
  5. karpathy-claude-md: 72

Top 3 countries (last 24h)

  1. United States · US: 611
  2. India · IN: 607
  3. United Arab Emirates · AE: 511

Deploy verification

The first row that landed in Analytics Engine right after the Phase 1b deploy proved the slug encoder was wired correctly end to end:

blob1: "install_hit"
blob2: "github-actions-docs@cf993479"   // slug@hash8 — Phase 1b live
blob3: "US"
blob5: "curl"

Before Phase 1b, blob2 would have been the bare slug github-actions-docs. The @cf993479 suffix is the first 8 hex chars of the rendered markdown's SHA-256 — exactly what encodeVersionedSlug produces (§11).

Layer 3 — engineer

08 The 9 SkillEdit ops

Each op is atomic and reversible. The edit budget (MAX_EDITS_PER_STEP = 8) is the textual analog of a learning rate — remove the cap and the optimizer destabilizes. Forbidden on this surface (would require a new submission, not an edit): replace_name (breaks /i/<name>), replace_apiReference.signature (changes the contract), replace_category, replace_audience (both change marketplace placement).

| Op | Signature | Touches |
| --- | --- | --- |
| add_gotcha | { value: string(8..280) } | gotchas[] (push) |
| delete_gotcha | { index: int } | gotchas[] (splice) |
| replace_gotcha | { index: int, value: string(8..280) } | gotchas[index] |
| add_keyConcept | { value: { term, explanation } } | keyConcepts[] (push) |
| replace_keyConcept | { index, value: { term, explanation } } | keyConcepts[index] |
| delete_keyConcept | { index: int } | keyConcepts[] (splice) |
| replace_description | { value: string(20..220) } | description (the router) |
| replace_whenToUse_item | { index, value: string(8..200) } | whenToUse[index] |
| replace_apiReference_example | { index, value: string(0..2400) } | apiReference[index].example only; signature stays locked |

The Zod schema (verbatim from src/lib/skill-edit.ts):

export const SkillEditSchema = z.discriminatedUnion("op", [
  z.object({ op: z.literal("add_gotcha"), value: z.string().min(8).max(280) }),
  z.object({ op: z.literal("delete_gotcha"), index: z.number().int().nonnegative() }),
  z.object({
    op: z.literal("replace_gotcha"),
    index: z.number().int().nonnegative(),
    value: z.string().min(8).max(280),
  }),
  z.object({
    op: z.literal("add_keyConcept"),
    value: z.object({
      term: z.string().min(2).max(80),
      explanation: z.string().min(10).max(500),
    }),
  }),
  z.object({
    op: z.literal("replace_keyConcept"),
    index: z.number().int().nonnegative(),
    value: z.object({
      term: z.string().min(2).max(80),
      explanation: z.string().min(10).max(500),
    }),
  }),
  z.object({ op: z.literal("delete_keyConcept"), index: z.number().int().nonnegative() }),
  z.object({ op: z.literal("replace_description"), value: z.string().min(20).max(220) }),
  z.object({
    op: z.literal("replace_whenToUse_item"),
    index: z.number().int().nonnegative(),
    value: z.string().min(8).max(200),
  }),
  z.object({
    op: z.literal("replace_apiReference_example"),
    index: z.number().int().nonnegative(),
    value: z.string().max(2400),
  }),
]);

export const SkillEditsPayload = z
  .array(SkillEditSchema)
  .min(1, "at least one edit required")
  .max(MAX_EDITS_PER_STEP, `bounded to 8 ops per step (learning rate)`);
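To show how validated edits might be applied, here is a reduced sketch covering four of the nine ops. It is not the code in src/lib/skill-edit.ts: the `SkillDraft` shape and this `applyEdits` body are assumptions, written without Zod so the block runs on its own.

```typescript
type Edit =
  | { op: "add_gotcha"; value: string }
  | { op: "delete_gotcha"; index: number }
  | { op: "replace_gotcha"; index: number; value: string }
  | { op: "replace_description"; value: string };

interface SkillDraft {
  description: string;
  gotchas: string[];
}

class ApplyError extends Error {}

const MAX_EDITS_PER_STEP = 8;

export function applyEdits(skill: SkillDraft, edits: Edit[]): SkillDraft {
  if (edits.length < 1 || edits.length > MAX_EDITS_PER_STEP) {
    throw new ApplyError(`expected 1..${MAX_EDITS_PER_STEP} edits, got ${edits.length}`);
  }
  // Work on a copy so a failed edit leaves the parent untouched.
  const next: SkillDraft = { description: skill.description, gotchas: skill.gotchas.slice() };
  for (const edit of edits) {
    switch (edit.op) {
      case "add_gotcha":
        next.gotchas.push(edit.value);
        break;
      case "delete_gotcha":
      case "replace_gotcha":
        if (edit.index < 0 || edit.index >= next.gotchas.length) {
          throw new ApplyError(`${edit.op}: index ${edit.index} out of bounds`);
        }
        if (edit.op === "delete_gotcha") next.gotchas.splice(edit.index, 1);
        else next.gotchas[edit.index] = edit.value;
        break;
      case "replace_description":
        next.description = edit.value;
        break;
    }
  }
  return next;
}

const parent: SkillDraft = { description: "d", gotchas: ["a", "b"] };
const child = applyEdits(parent, [
  { op: "add_gotcha", value: "c" },
  { op: "replace_gotcha", index: 0, value: "A" },
  { op: "delete_gotcha", index: 1 },
]);
```

Edits apply in order, so each index refers to the array as left by the previous op. In the usage above, `child.gotchas` ends as `["A", "c"]` while `parent` is unchanged.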

09 The 6 validator checks

Runs synchronously on every candidate insert. Returns { pass, reasons, warnings, tokenCount }. A fail can still be force-merged by a curator (the override is logged), matching the paper's researcher-override pattern. From src/lib/skill-validator.ts:

| Check | Threshold | Mitigation on fail |
| --- | --- | --- |
| 1. Schema shape | Zod SkillSchema.safeParse must succeed | Hard reject. Fix the field and retry; the validator surfaces the path of every Zod issue. |
| 2. Edit budget | edits.length ≤ 8 | Hard reject. Split into two candidates; the cap is the learning rate. |
| 3. Protected sections | No op touches a section listed in protectedSections | Hard reject. Drop the offending op or remove the section from protectedSections (curator-only). |
| 4. Token count | Warn at 1200, hard fail at 1500 (target 920) | Drop low-signal lines or split into two skills. Compactness wins. |
| 5. Description ↔ whenToUse coherence | Each whenToUse item shares ≥ 1 keyword with the description | Warning only. Surfaces likely router/body drift before it shows up in the conversion gate. |
| 6. Protected-section byte identity | Rendered <!-- @protected:<id> --> blocks match parent byte-for-byte | Hard reject. Defense in depth at the rendered-markdown level. |
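Check 3 can be pictured as a lookup from op name to the section it touches. The mapping follows the ops table in section 08; the function name and the reason-string format are hypothetical.

```typescript
type SectionId = "whenToUse" | "keyConcepts" | "apiReference" | "gotchas";

// Which section each op touches, per the ops table in section 08.
const OP_SECTION: Record<string, SectionId | null> = {
  add_gotcha: "gotchas",
  delete_gotcha: "gotchas",
  replace_gotcha: "gotchas",
  add_keyConcept: "keyConcepts",
  replace_keyConcept: "keyConcepts",
  delete_keyConcept: "keyConcepts",
  replace_description: null, // description is not one of the four protectable ids
  replace_whenToUse_item: "whenToUse",
  replace_apiReference_example: "apiReference",
};

// Returns one reason per op that targets a protected section.
export function protectedOpViolations(ops: string[], protectedSections: SectionId[]): string[] {
  const reasons: string[] = [];
  ops.forEach((op, i) => {
    const section = OP_SECTION[op];
    if (section && protectedSections.indexOf(section) >= 0) {
      reasons.push(`protected_section: edit ${i} (${op}) touches '${section}'`);
    }
  });
  return reasons;
}

const violations = protectedOpViolations(
  ["add_gotcha", "replace_description", "replace_keyConcept"],
  ["gotchas"],
);
```

With only `gotchas` protected, the single violation is the `add_gotcha` op at position 0.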

The token estimator is intentionally cheap — chars/4, accurate to ~15% — fast enough to run on every candidate without an extra dependency:

function estimateTokens(markdown: string): number {
  // chars/4 ≈ tokens. Cheap, close enough for gating.
  return Math.ceil(markdown.length / 4);
}
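A possible way to turn the estimate into the warn/fail verdict from check 4 is shown below. The thresholds come from the table above; the verdict type, the function name, and the inclusive boundary handling are assumptions.

```typescript
type TokenVerdict = "ok" | "warn" | "fail";

function estimateTokens(markdown: string): number {
  return Math.ceil(markdown.length / 4);
}

export function tokenVerdict(markdown: string): { tokenCount: number; verdict: TokenVerdict } {
  const tokenCount = estimateTokens(markdown);
  // Treating the thresholds as inclusive (>=) is an assumption.
  const verdict: TokenVerdict = tokenCount >= 1500 ? "fail" : tokenCount >= 1200 ? "warn" : "ok";
  return { tokenCount, verdict };
}

const atTarget = tokenVerdict("x".repeat(3680)); // 3680 chars / 4 = 920 tokens
```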

And the protected-section enforcement (verbatim, the part that matters):

// 6. Protected-section byte-identity at the rendered-markdown level.
if (protectedIds.size > 0) {
  const parentBlocks = extractProtectedBlocks(parent.markdown);
  const candidateBlocks = extractProtectedBlocks(candidate.markdown);
  for (const id of protectedIds) {
    if (parentBlocks.get(id) !== candidateBlocks.get(id)) {
      reasons.push(
        `protected_byte_identity: section '${id}' bytes differ between parent and candidate.`
      );
    }
  }
}

10 KV layout

The Phase 1a migration is additive — old reads keep working through the legacy findApprovedByName path. New writes also populate the family namespace.

KV namespace: MARKETPLACE_KV
│
├─ skill:<id>              MarketplaceEntry (JSON), one row per VERSION
│                          id = `${name}-${contentHash.slice(0, 8)}`
│
├─ index:all               { ids: string[] }
├─ index:snapshot:v1       SnapshotShape, edge-cached for 1h
│
└─ family:<name>:
   ├─ current              { id }        ← getCurrentSkillByName reads here
   ├─ candidate            { id }        ← present iff A/B in flight
   └─ history              { ids: [] }   ← ordered oldest → newest
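The key shapes in this layout can be written out as small builders. The helper names are hypothetical; the id formula and the use of SHA-256 over the rendered markdown come from the guide.

```typescript
import { createHash } from "node:crypto";

// Full hex SHA-256 of the rendered markdown.
export function contentHash(markdown: string): string {
  return createHash("sha256").update(markdown, "utf8").digest("hex");
}

// Key builders that mirror the layout above; the helper names are hypothetical.
export const kvKeys = {
  skill: (name: string, hash: string) => `skill:${name}-${hash.slice(0, 8)}`,
  current: (name: string) => `family:${name}:current`,
  candidate: (name: string) => `family:${name}:candidate`,
  history: (name: string) => `family:${name}:history`,
};

const versionKey = kvKeys.skill("caveman", contentHash("# caveman\n"));
```

Because the id embeds the first 8 hex chars of the content hash, any change to the markdown yields a new `skill:<id>` row, while the `family:<name>:*` pointers decide which row is served.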

Read paths (from src/lib/storage.ts):

Write paths:

11 The slug@hash encoding

Every install_hit row's blob2 column carries <skill-name>@<contentHash[:8]>. The Analytics Engine schema is unchanged; the version is encoded inside the existing column by convention. Old slug-only dashboards still work via splitByChar('@', blob2)[1] — which returns the whole string on legacy rows written before versioning, so old data attributes correctly. From src/lib/metrics-core.ts:

export function encodeVersionedSlug(name: string, contentHash: string): string {
  return `${name}@${contentHash.slice(0, 8)}`;
}

export function parseVersionedSlug(slug: string): { name: string; versionHash: string | null } {
  const at = slug.indexOf("@");
  if (at < 0) return { name: slug, versionHash: null };
  return { name: slug.slice(0, at), versionHash: slug.slice(at + 1) };
}

Round-trip property: parseVersionedSlug(encodeVersionedSlug(name, hash)).versionHash === hash.slice(0, 8). Legacy slugs round-trip too — parseVersionedSlug("legacy-skill").versionHash === null. The metrics.test.ts suite covers both directions and the legacy fallback (7 cases, all green).
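As a usage illustration, the parser can group raw rows by skill and version on the client side. `installsByVersion` and the row shape are hypothetical; parseVersionedSlug is copied from the excerpt above so the block runs on its own.

```typescript
// Copied from the excerpt above so the sketch is self-contained.
function parseVersionedSlug(slug: string): { name: string; versionHash: string | null } {
  const at = slug.indexOf("@");
  if (at < 0) return { name: slug, versionHash: null };
  return { name: slug.slice(0, at), versionHash: slug.slice(at + 1) };
}

interface InstallRow {
  blob2: string;          // slug or slug@hash8
  sampleInterval: number; // Analytics Engine sampling weight
}

// Sums weighted installs per name and version; legacy rows land under "legacy".
export function installsByVersion(rows: InstallRow[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    const { name, versionHash } = parseVersionedSlug(row.blob2);
    const key = `${name}@${versionHash ?? "legacy"}`;
    totals[key] = (totals[key] ?? 0) + row.sampleInterval;
  }
  return totals;
}

// Illustrative rows only, not real telemetry.
const totals = installsByVersion([
  { blob2: "caveman@cf993479", sampleInterval: 1 },
  { blob2: "caveman@cf993479", sampleInterval: 4 },
  { blob2: "caveman", sampleInterval: 2 },
]);
```

Summing `sampleInterval` rather than counting rows mirrors the sum(_sample_interval) rule that applies to sampled Analytics Engine data.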

12 Protected sections in action

Rendered SKILL.md output wraps protected sections in HTML-comment markers the validator can find with a regex. Here is what a candidate of the caveman skill would render if protectedSections: ["gotchas", "keyConcepts"] were set (synthesized from scripts/seeds/caveman.json via the renderSkillMarkdown logic in src/lib/skill-schema.ts):

---
name: caveman
description: "Use when you want Claude Code (or 30+ agents) to cut ~75% of output tokens..."
source: https://github.com/JuliusBrussee/caveman/blob/main/README.md
generated: 2026-05-27T12:00:00Z
category: tool
audience: engineers
---

<!-- @protected:keyConcepts -->
## Key concepts

### talk like caveman

Skill instructs the agent to drop filler words, hedging, and pleasantries —
keep substance in telegraphic fragments. Same fix, ~75% fewer output tokens,
100% technical accuracy. Brain still big, mouth small.

### four grunt levels

/caveman accepts a level argument: `lite`, `full`, `ultra`, `wenyan`.
Level persists until end of session or `normal mode`.
<!-- /@protected:keyConcepts -->

## API reference

```
One-line install (Mac / Linux / WSL / Git Bash)
```
Finds every agent on the machine and installs the caveman skill into each.

<!-- @protected:gotchas -->
## Gotchas

- Caveman mode persists for the whole session unless you say `normal mode`
- /caveman-compress is destructive — back up CLAUDE.md before running
- ULTRA level loses some nuance — fine for code, risky for prose
<!-- /@protected:gotchas -->

Deep dive: how the byte-identity check actually works

The validator extracts each <!-- @protected:<id> -->…<!-- /@protected:<id> --> block from both the parent and candidate markdown using this regex:

const re = /<!-- @protected:([a-zA-Z]+) -->\n?([\s\S]*?)\n?<!-- \/@protected:\1 -->/g;

The \1 backreference forces the closing marker's id to match the opening marker's id — so a stray @protected:gotchas opener paired with a /@protected:keyConcepts closer is treated as malformed, not as a valid block. The body between markers is captured as group 2 with no normalization. The validator then compares parentBlocks.get(id) !== candidateBlocks.get(id) byte-for-byte — a single trailing space difference fails the check. Defense in depth: even if the structured SkillEdit[] somehow let a protected op through, the rendered-markdown check catches it.
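A minimal reimplementation of the extraction step, using the regex quoted above, might read as follows. The function name matches the one in the validator excerpt, but this body is an assumption rather than the source code.

```typescript
export function extractProtectedBlocks(markdown: string): Map<string, string> {
  const re = /<!-- @protected:([a-zA-Z]+) -->\n?([\s\S]*?)\n?<!-- \/@protected:\1 -->/g;
  const blocks = new Map<string, string>();
  let match: RegExpExecArray | null;
  while ((match = re.exec(markdown)) !== null) {
    blocks.set(match[1], match[2]); // group 1 = section id, group 2 = raw body
  }
  return blocks;
}

const parentMd = "<!-- @protected:gotchas -->\n## Gotchas\n- back up first\n<!-- /@protected:gotchas -->";
// Same block with one trailing space added inside the protected body.
const candidateMd = "<!-- @protected:gotchas -->\n## Gotchas\n- back up first \n<!-- /@protected:gotchas -->";

const parentBody = extractProtectedBlocks(parentMd).get("gotchas");
const candidateBody = extractProtectedBlocks(candidateMd).get("gotchas");
```

The two bodies differ by a single trailing space, so a strict `!==` comparison flags the candidate. A block whose closer names a different id produces no match at all.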

The marker emitter (in renderSkillMarkdown → pushSection) trims the trailing blank line that section emitters push, so the closing marker hugs the section content — that way reformatting the surrounding markdown (e.g. adding a blank line between sections) doesn't shift bytes inside a protected block.

13 SQL recipes

All three queries run against the Cloudflare Analytics Engine SQL API. Full recipe collection in docs/analytics-queries.md. Analytics Engine samples on write, so always use sum(_sample_interval) — not count() — to get the true total.
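For readers who want to run these recipes from a script, the request can be assembled as below. The endpoint shape follows Cloudflare's documented SQL API for Analytics Engine as far as I know it; `buildSqlRequest` is a hypothetical helper, and the account id and token shown are placeholders.

```typescript
// Builds (but does not send) a request for the Analytics Engine SQL API.
export function buildSqlRequest(accountId: string, apiToken: string, sql: string) {
  return {
    url: `https://api.cloudflare.com/client/v4/accounts/${accountId}/analytics_engine/sql`,
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiToken}` },
      body: sql, // the query travels as the raw request body
    },
  };
}

// Placeholder credentials; substitute real values before sending.
const request = buildSqlRequest(
  "ACCOUNT_ID",
  "API_TOKEN",
  "SELECT sum(_sample_interval) AS installs FROM skillmake_metrics WHERE index1 = 'install_hit'",
);
```

Passing `request.url` and `request.init` to `fetch` sends the query. Keeping construction separate from sending makes the helper easy to test without network access.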

All-time install leaderboard (with the splitByChar wrapper)

SELECT
  splitByChar('@', blob2)[1] AS slug,
  sum(_sample_interval) AS installs
FROM skillmake_metrics
WHERE index1 = 'install_hit' AND blob2 != ''
GROUP BY slug
ORDER BY installs DESC
LIMIT 50

Active A/B candidates (versions_seen > 1)

SELECT
  splitByChar('@', blob2)[1] AS slug,
  count(DISTINCT splitByChar('@', blob2)[2]) AS versions_seen,
  sum(_sample_interval) AS total_installs
FROM skillmake_metrics
WHERE index1 = 'install_hit'
  AND blob2 LIKE '%@%'
  AND timestamp >= NOW() - INTERVAL '14' DAY
GROUP BY slug
HAVING versions_seen > 1
ORDER BY total_installs DESC

When this returns rows, a candidate is live in production. versions_seen = 2 is a clean current+candidate A/B; 3+ means a candidate was promoted and a new candidate is already being tested against it.

Per-version conversion ratio for one skill (caveman example)

WITH
  installs AS (
    SELECT
      splitByChar('@', blob2)[2] AS version_hash,
      sum(_sample_interval) AS installs
    FROM skillmake_metrics
    WHERE index1 = 'install_hit'
      AND splitByChar('@', blob2)[1] = 'caveman'
      AND timestamp >= NOW() - INTERVAL '7' DAY
    GROUP BY version_hash
  ),
  views AS (
    SELECT sum(_sample_interval) AS views
    FROM skillmake_metrics
    WHERE index1 = 'marketplace_view'
      AND blob2 = 'caveman'
      AND timestamp >= NOW() - INTERVAL '7' DAY
  )
SELECT
  installs.version_hash AS version_hash,
  installs.installs AS installs,
  views.views AS detail_page_views,
  round(installs.installs / views.views, 3) AS conversion_ratio
FROM installs CROSS JOIN views
ORDER BY conversion_ratio DESC

marketplace_view is route-agnostic (same slug for both arms), so the denominator is shared. This is the exact ratio the Phase 3 cron runs Welch's t-test against.

14 Test summary

All 25 tests green by suite. Vitest, no integration mocks needed — the libs are pure.

| Suite | Cases | What it covers |
| --- | --- | --- |
| src/lib/metrics.test.ts | 7 | Round-trip of encodeVersionedSlug / parseVersionedSlug, legacy slug fallback (no @), buildMetricDataPoint shape for events with and without status. |
| src/lib/skill-edit.test.ts | 10 | Zod schema accept/reject for every op kind, edit-budget enforcement, applyEdits happy path + ApplyError on out-of-bounds index. |
| src/lib/skill-validator.test.ts | 8 | All 6 checks: schema fail, edit budget overflow, protected-section op rejection, token cap, coherence warning, and the byte-identity check for protected blocks. |
| Total | 25 | |

15 Commit ledger

Phase 1 shipped in four commits on main:

| SHA | Title |
| --- | --- |
| 2c5a04e | feat(skillopt): phase 1a+1b — versioned KV storage + slug@hash on install_hit |
| 103480c | feat(skillopt): phase 1c+1d+1e — bounded edits, static validator, protected sections |
| fac8738 | feat(skillopt): phase 1f — per-version Grafana panels + slug@hash-aware SQL |
| 03cd440 | Add steipete agent script skills (seed adds, same window) |

16 Deploy state

Worker version: c846c4c3-771f-49ec-8d75-6d678539d8ab
Live at: skillmake.xyz · www.skillmake.xyz
Account: Cloudflare f9d7efb12a2713ce5af52d882165c543
Runtime: Next.js on Cloudflare Workers via OpenNext + nodejs_compat
Data plane: KV namespace MARKETPLACE_KV · Analytics Engine dataset skillmake_metrics

17 What's next

Phase 2: A/B routing. The install route at /i/<name> needs to call getCandidateSkillByName in parallel with the current lookup, hash (cf-connecting-ip + name) to assign an arm, and flip cache headers to private, no-store for the duration of any in-flight A/B. The deterministic hash means visitors never bounce between arms, even across reloads or fresh tabs; the assignment resets only when the candidate is promoted or retired.

Phase 3: dynamic validation gate. Hourly cron at /api/cron/validate-candidates reads the last 7 days of Analytics Engine, computes the conversion ratio per arm, runs Welch's t-test, and either calls promoteCandidate(name) or retireCandidate(name). If a candidate goes 14 days without crossing the 200-installs-per-arm threshold, the long-tail fallback hands the decision to the skill-judge LLM-judge skill. The cron is gated by SKILLOPT_ENABLED, so one env flip kills the whole loop.

Phase 4: optimizer loop. Weekly cron at /api/cron/propose-edits picks the bottom-quartile skills by conversion, pulls 14 days of telemetry, reads the rejected-edits store for that skill, and prompts a frontier model for a Zod-validated SkillEdit[] of length 4–8. Apply → static gate → save as candidate → enters A/B at 20% share. Budget at current corpus size is roughly $10/month.