The Guide · 2026-05-27

SkillOpt on skillmake.xyz

SkillOpt treats a SKILL.md file as a trainable parameter — frozen model, optimized markdown. This page is the single doc for understanding what is live on skillmake.xyz, how the A/B routing decides which version of a skill you see, and how to operate the system end to end. Read top to bottom: newcomer first, then product/PM, then engineer.

Phase 1: live · Phase 2: next · Phase 3–4: cron

Layer 1 — newcomer

01 What SkillOpt is

A SKILL.md is the markdown file an AI agent reads to learn how to do one specific task — install a library, refactor a file, render a video. The agent loads it on demand, treats it as authoritative, and behaves accordingly. skillmake.xyz is the public marketplace where these skills live; each one has an install URL like skillmake.xyz/i/superpowers that returns the raw markdown so any agent can fetch it.

The SkillOpt paper (Wang et al., arXiv 2605.23904) makes one simple move: treat the SKILL.md as a trainable parameter. The model stays frozen — we never fine-tune Claude or Cursor — but the markdown gets optimized over time the same way a neural net's weights would. Bounded edits per step. Held-out validation gate. Strict-improvement promotion. Sections you mark as "do not touch" are byte-locked against the edit machinery.

That framing matters because "self-improving agents" without those guardrails mostly produce slop. SkillOpt is the discipline that turns a freeform "edit this skill" button into a stable optimization loop.

Frozen model plus trained context window is the practical adaptation. We are not training Claude. We are training the document Claude reads.

02 Three FAQs

What is a SKILL.md and what is skillmake.xyz?

A SKILL.md is a plain-text recipe an agent loads to do one job. skillmake.xyz hosts a curated marketplace of them — 94 skills today across engineers, creators, devops, design, AI, marketing, and general audiences. Each skill has a stable install URL (/i/<name>) that returns the raw markdown so any agent — Claude Code, Cursor, Codex, Aider, Cline — can install it in one curl. The site itself is a Next.js app on Cloudflare Workers (via OpenNext), with KV for storage and Analytics Engine for usage telemetry.

Are we collecting usage data?

Yes — every install hit, page view, and click writes a row to Cloudflare Workers Analytics Engine. No personally identifying info is stored. The daily visitor id is a SHA-256 of ip + ua + day truncated to 8 bytes — same person on the same day collapses to one id, but cannot be tracked across days. Search queries are hashed before write. There are 14 MetricEvent types in src/lib/metrics-core.ts:

#	Event	What it tells us
1	`install_hit`	Someone fetched `/i/<name>`. Encoded as `name@hash8` since Phase 1b so we can split by version. This is the gradient signal.
2	`marketplace_view`	The listing page for a skill loaded. Denominator for the conversion ratio.
3	`home_view`	Homepage loaded. `blob2` carries the audience filter slug if active.
4	`submit_started`	Top of the submit funnel — someone began creating a new skill.
5	`submit_completed`	Bottom of the submit funnel — submission saved.
6	`convert_success`	The `/api/convert` extraction pipeline succeeded.
7	`convert_error`	The extraction pipeline failed. `blob2` = error code, `double2` = HTTP status.
8	`tricks_view`	The /tricks subsection loaded.
9	`powerhouse_view`	The /powerhouse subsection loaded.
10	`search_submitted`	A search query submitted (hashed). Volume + funnel signal, not literal text.
11	`github_click`	Outbound click to the repo. Catches "intent leak" — read the listing, bounced to GitHub instead of installing.
12	`page_dwell`	Bucketed dwell time: `0–5s / 5–15s / 15–30s / 30–60s / 60–300s / 300s+`.
13	`scroll_depth`	Bucketed reading depth: `0 / 25 / 50 / 75 / 100`.
14	`api_error`	Any 4xx/5xx from a server route. Mirrored to a structured JSON log via `observability.ts`.
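The daily visitor id described above can be sketched in a few lines. This is a minimal illustration, not the code in src/lib/metrics-core.ts: the function name `dailyVisitorId`, the `|` separator between the three inputs, and the hex encoding of the truncated bytes are all assumptions.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: the name, the "|" separator, and the hex output are assumptions.
function dailyVisitorId(ip: string, userAgent: string, day: string): string {
  // SHA-256 over ip + ua + day, as the text describes.
  const digest = createHash("sha256").update(`${ip}|${userAgent}|${day}`).digest();
  // Keep only the first 8 bytes (16 hex characters).
  return digest.subarray(0, 8).toString("hex");
}

// Same visitor, same day: one id. Same visitor, next day: a different id.
const sameDayA = dailyVisitorId("203.0.113.7", "curl/8.5.0", "2026-05-27");
const sameDayB = dailyVisitorId("203.0.113.7", "curl/8.5.0", "2026-05-27");
const nextDay = dailyVisitorId("203.0.113.7", "curl/8.5.0", "2026-05-28");
console.log(sameDayA === sameDayB, sameDayA === nextDay);
```

Because the day is part of the hashed input, ids from different days cannot be joined without the original ip and user agent, which are never stored.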

What can go wrong?

Three failure modes are real enough that we built explicit mitigations:

Cache poisoning the A/B split. Cloudflare's edge cache can pin a candidate or current response, breaking the per-visitor route. Mitigation: while a candidate exists for a skill, /i/<name> flips to Cache-Control: private, no-store. After promotion, headers restore to public, max-age=300, must-revalidate.
Low-traffic skills never decide. A long-tail skill might never cross 200 installs per arm. Mitigation: any candidate that fails to reach the threshold in 14 days routes to an LLM-judge fallback against the skill-judge peer skill — it must score strictly higher than the current before promotion.
Optimizer loops. The model proposing edits can get stuck in a local minimum, re-suggesting near-duplicate diffs that keep getting rejected. Mitigation: rejected candidates write to a negative-feedback store at skill:<id>:rejected:v<hash>. The optimizer prompt includes the last N rejections so it doesn't re-propose them.
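The cache mitigation reduces to one header decision per request. A minimal sketch, assuming a hypothetical helper named `installCacheControl`; only the two header values come from the text above, and the real route handler is not shown in this guide.

```typescript
// Hypothetical helper; only the two header values are taken from the guide.
function installCacheControl(hasActiveCandidate: boolean): string {
  // While an A/B is in flight nothing may be cached at the edge,
  // otherwise one arm's response could be pinned for every visitor.
  return hasActiveCandidate
    ? "private, no-store"
    : "public, max-age=300, must-revalidate";
}

const duringTest: Record<string, string> = { "Cache-Control": installCacheControl(true) };
const afterPromotion: Record<string, string> = { "Cache-Control": installCacheControl(false) };
console.log(duringTest, afterPromotion);
```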

03 Glossary

Term	Definition
`SKILL.md`	The plain-text recipe an agent loads to learn one task. The trainable parameter.
marketplace	The browse surface at skillmake.xyz where humans pick skills; each entry has an `/i/<name>` install URL.
`install_hit`	The telemetry event fired when an agent fetches a skill's markdown. The gradient signal for the optimizer.
candidate	A new version of a skill behind the A/B share. Not yet promoted.
current	The version served to the majority of `/i/<name>` hits. There is exactly one per skill family.
retired	A previous version, kept queryable for 30 days for rollback and audit.
A/B split	Deterministic per-visitor routing between candidate and current — same visitor sees the same arm across reloads.
held-out gate	The dynamic conversion comparison that decides promotion. Welch's t-test, p < 0.05, strict improvement.
bounded edit	One of the 9 atomic `SkillEdit` ops. Max 8 per candidate — the textual learning rate.
protected section	A section id (`whenToUse`, `keyConcepts`, `apiReference`, or `gotchas`) that the edit machinery refuses to modify.

Layer 2 — product / PM

04 SkillOpt figure mapping

The paper's Figure 1 walks through the gradient-descent analogy. Here is the exact translation we use on skillmake:

Paper concept	skillmake equivalent
Parameter `θ`	A published `SKILL.md` (`MarketplaceEntry.markdown` in KV)
Frozen model `f`	Whatever agent the visitor runs (Claude Code, Cursor, Codex, …) — we never touch it
Gradient direction	Trajectory-derived edits from the optimizer — 4–8 atomic `SkillEdit` ops derived from telemetry + failure signal
Learning rate	Bounded edit budget — `MAX_EDITS_PER_STEP = 8`; forbidden ops are `replace_name`, `replace_apiReference.signature`, `replace_category`, `replace_audience`
Validation check	Held-out conversion gate — `install_hit / marketplace_view` strictly improves on A/B-routed traffic, p < 0.05
Batch / minibatch	N-install threshold before the gate fires — default 200 installs per arm
Rejected side updates	Candidates that fail the static or dynamic gate — logged as negative feedback for the next optimizer pass
Protected section invariant	`protectedSections` + `<!-- @protected:<id> -->` markers, byte-identical between candidate and parent

05 The four phases

Phase 1 — Foundation live

Versioned KV storage, bounded edits, static gate, protected sections, per-version Grafana panels.

Versioned storage: skill:<id> per-version rows, plus family pointers family:<name>:current, family:<name>:candidate, family:<name>:history. Old reads keep working — the migration is additive.
Version binding on install: install_hit.blob2 = name@hash8. Existing dashboards keep working via the splitByChar('@', blob2)[1] wrapper.
Bounded-edit endpoint: structured SkillEdit[] payload, 9 op kinds, ≤ 8 ops per candidate.
Static gate: 5 checks run synchronously on every candidate (see §09).
Protected sections: byte-identity enforced at both schema and rendered-markdown level.

Phase 2 — A/B routing pending

Deterministic per-visitor split at /i/<name>:

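// candidate is null when no A/B is in flight for this skill.
const [current, candidate] = await Promise.all([getCurrentSkillByName(name), getCandidateSkillByName(name)]);
const useCandidate = candidate !== null && shouldRouteCandidate(req, candidate.abTrafficShare);
const served = useCandidate ? candidate : current;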

shouldRouteCandidate hashes (cf-connecting-ip + name), so the same visitor always sees the same arm. Default share: 20% candidate / 80% current. While a candidate exists for a skill, cache headers flip to private, no-store.
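The deterministic split can be sketched as a hash mapped onto the interval [0, 1) and compared with the candidate share. The guide does not say which hash `shouldRouteCandidate` uses, so the FNV-1a variant below, the `:` separator, and the helper name `routeToCandidate` (which takes the IP directly instead of the request) are assumptions for illustration.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units. The choice of hash is an assumption.
function fnv1a32(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Hypothetical stand-in for shouldRouteCandidate(req, share).
function routeToCandidate(ip: string, name: string, share: number): boolean {
  // Map the hash onto [0, 1); the same (ip, name) pair always lands in the same bucket.
  const bucket = fnv1a32(`${ip}:${name}`) / 0x100000000;
  return bucket < share;
}

const firstVisit = routeToCandidate("203.0.113.7", "caveman", 0.2);
const reload = routeToCandidate("203.0.113.7", "caveman", 0.2);
console.log(firstVisit === reload);
```

Including the skill name in the key means a visitor who lands in the candidate arm for one skill is not automatically in the candidate arm for every other skill.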

Phase 3 — Dynamic validation gate cron · hourly

An hourly Cloudflare Cron at /api/cron/validate-candidates reads the last 7 days of Analytics Engine data. For each candidate it queries the marketplace_view count (route-agnostic, so both arms share the denominator) and install_hit per arm, requires ≥ 200 installs per arm, then runs Welch's t-test on the conversion ratios. A candidate passes only with p < 0.05 and a strict improvement. Pass → promoteCandidate(name), atomically. Regress → retireCandidate(name), and the diff is written to the negative-feedback store.

Phase 4 — Optimizer loop cron · weekly

A weekly cron at /api/cron/propose-edits finds approved skills below the 25th percentile in conversion that have no active candidate, pulls 14 days of telemetry plus the last N rejected candidates, and renders them into a narrow prompt. The response must be a Zod-validated SkillEdit[] of length 4–8; the pipeline then runs apply → static gate → save as candidate. A SKILLOPT_ENABLED=true|false env var gates every cron, so the kill switch is one flip.
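The target-selection step of the weekly cron might look like the sketch below. The `SkillStats` shape, the function names, and the nearest-rank percentile definition are assumptions; the guide only states "below the 25th percentile in conversion with no active candidate". Skills at or below the cutoff are treated as the bottom quartile here.

```typescript
// Hypothetical shape; field names are not taken from the codebase.
interface SkillStats {
  name: string;
  conversion: number; // install_hit / marketplace_view
  hasActiveCandidate: boolean;
}

// Nearest-rank percentile. Which percentile definition the cron uses is an assumption.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((x, y) => x - y);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

function pickOptimizationTargets(skills: SkillStats[]): string[] {
  if (skills.length === 0) return [];
  const cutoff = percentile(skills.map((s) => s.conversion), 25);
  return skills
    .filter((s) => s.conversion <= cutoff && !s.hasActiveCandidate)
    .map((s) => s.name);
}

const corpus: SkillStats[] = [1, 2, 3, 4, 5, 6, 7, 8].map((i) => ({
  name: `s${i}`,
  conversion: i / 100,
  hasActiveCandidate: i === 2, // s2 is already in an A/B, so it is skipped
}));
console.log(pickOptimizationTargets(corpus));
```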

06 A/B evaluation playbook: 7 steps

Once Phase 2 lands, here is exactly what running an experiment looks like from the operator side.

STEP 0

Pre-flight static gate

The candidate goes through skill-validator on insert. Token count under 1500 (the paper's compactness lesson — best skills land around 920). Description shares a keyword with each whenToUse item — the cheap router/body proxy. Protected sections byte-identical to the parent. Edit budget ≤ 8 ops. Fail = the candidate never enters the A/B; the curator sees the reasons and can revise or force-pass with a logged override.
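The description ↔ whenToUse coherence check can be approximated with simple keyword overlap. This is a sketch only: the tokenizer, the stop-word list, and the function names are assumptions, and the real rules in skill-validator are not shown in this guide.

```typescript
// Hypothetical stop-word list; the validator's real list is not documented here.
const STOP_WORDS = new Set([
  "the", "and", "for", "with", "when", "you", "use", "that", "this", "your", "from", "into",
]);

// Lowercase alphanumeric words longer than two characters, minus stop words.
function keywords(text: string): string[] {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return words.filter((w) => w.length > 2 && !STOP_WORDS.has(w));
}

// Returns the whenToUse items that share no keyword with the description.
function incoherentItems(description: string, whenToUse: string[]): string[] {
  const descWords = new Set(keywords(description));
  return whenToUse.filter((item) => !keywords(item).some((w) => descWords.has(w)));
}

const flagged = incoherentItems("Refactor TypeScript files safely with codemods", [
  "Renaming a symbol across TypeScript files",
  "Deploying to production",
]);
console.log(flagged);
```

An item that comes back from `incoherentItems` would only raise a warning, since the guide treats coherence as a cheap proxy rather than a hard gate.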

STEP 1

Start at 20% candidate share

Default abTrafficShare: 0.2. shouldRouteCandidate hashes (cf-connecting-ip + name). Cache header drops to private, no-store for the duration — otherwise the edge cache poisons the split.

STEP 2

Wait for ≥ 200 installs per arm — or 14 days

The hourly cron reports "insufficient data" below 200 per arm. If 14 days pass before both arms reach the threshold, the LLM-judge fallback decides instead. A curator can manually raise the candidate share (to 50% or 80%) to accelerate low-traffic skills.
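The waiting rule in this step is a small state decision. A minimal sketch, assuming hypothetical names (`gateStatus`, `GateStatus`) and assuming the fallback fires once the candidate is 14 days old and at least one arm is still short of 200 installs:

```typescript
type GateStatus = "ready_for_test" | "insufficient_data" | "judge_fallback";

const MIN_INSTALLS_PER_ARM = 200; // threshold from the guide
const MAX_WAIT_DAYS = 14; // fallback deadline from the guide

// Hypothetical helper: decides what the hourly cron should do with one candidate.
function gateStatus(candidateInstalls: number, currentInstalls: number, ageDays: number): GateStatus {
  const bothArmsReady =
    candidateInstalls >= MIN_INSTALLS_PER_ARM && currentInstalls >= MIN_INSTALLS_PER_ARM;
  if (bothArmsReady) return "ready_for_test";
  // Not enough data yet: keep waiting until the deadline, then hand off to the LLM judge.
  return ageDays >= MAX_WAIT_DAYS ? "judge_fallback" : "insufficient_data";
}

console.log(gateStatus(250, 900, 3), gateStatus(120, 900, 5), gateStatus(120, 900, 14));
```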

STEP 3

Read the dashboard

Grafana exposes the per-version panels needed to sanity-check the test before the cron decides:

Top skill versions by install rate (14d) — candidate next to current
Conversion delta per skill version — install_hit / marketplace_view
UA split per version: is the candidate winning on agents (curl) but losing on browsers? That pattern points to router/body drift
Top countries per version — geographic skew hides audience-fit problems
GitHub-click ratio per version — if github_click goes up while installs go down, the candidate is causing intent leak
Errors per hour — a candidate that breaks downstream agent flow shows up here first

STEP 4

Cron decides — Welch's t-test, p < 0.05, strict improvement

Conversion ratios per arm, Welch's t-test (unequal variance — realistic for differently-sized arms), p < 0.05 AND strict improvement. Ties reject. Regressions retire the candidate and the diff lands in the negative-feedback store keyed at skill:<id>:rejected:v<hash>.
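The decision rule can be sketched as below. Several things here are assumptions rather than the cron's actual code: the names `Arm` and `welchGate`, the idea that each arm has its own `exposures` count (the guide describes a shared marketplace_view denominator, so a per-arm figure would have to be derived, for example by scaling views by the traffic share), and the use of a normal approximation for the p-value, which is reasonable only because the degrees of freedom are large at these sample sizes.

```typescript
interface Arm {
  installs: number; // successes (install_hit rows for this arm)
  exposures: number; // trials; how this is derived per arm is an assumption
}

// Abramowitz and Stegun 7.1.26 approximation of erf (absolute error around 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

function welchGate(candidate: Arm, current: Arm) {
  const stats = (arm: Arm) => {
    const mean = arm.installs / arm.exposures;
    // Unbiased sample variance of a 0/1 outcome.
    const variance = (mean * (1 - mean) * arm.exposures) / (arm.exposures - 1);
    return { mean, n: arm.exposures, se2: variance / arm.exposures };
  };
  const a = stats(candidate);
  const b = stats(current);
  const t = (a.mean - b.mean) / Math.sqrt(a.se2 + b.se2);
  // Welch-Satterthwaite degrees of freedom (reported, not used by the normal approximation).
  const df = (a.se2 + b.se2) ** 2 / (a.se2 ** 2 / (a.n - 1) + b.se2 ** 2 / (b.n - 1));
  const pTwoSided = 2 * (1 - normalCdf(Math.abs(t)));
  // Promote only on a significant AND strictly positive difference; ties reject.
  const promote = pTwoSided < 0.05 && a.mean > b.mean;
  return { t, df, pTwoSided, promote };
}

const lift = welchGate({ installs: 260, exposures: 1000 }, { installs: 200, exposures: 1000 });
const tie = welchGate({ installs: 200, exposures: 1000 }, { installs: 200, exposures: 1000 });
console.log(lift.promote, tie.promote);
```

A production gate would more likely use an exact Student t CDF from a statistics library; the approximation keeps the sketch dependency-free.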

STEP 5

Post-promotion sanity check

Pointer flips atomically via promoteCandidate(name). Cache header restores to public, max-age=300, must-revalidate. Previous current moves to retired but stays queryable for 30 days from /admin/audit — every auto-promote is reversible. Operator pulls per-version panels for the next 24 hours and confirms the conversion lift holds at 100% traffic.

STEP 6

Per-skill effect size > corpus average

The test that matters. The paper's central insight: aggregate accuracy is the wrong unit. You can ship a candidate that raises the corpus mean by 2pp while making three high-traffic skills measurably worse. The cron decides per-skill; the operator spot-checks that this skill's effect size exceeds the corpus average over the last 14 days.

STEP 7

Failure modes to watch

Low traffic + judge fallback drift. If a skill never crosses 200 installs/arm and the judge keeps promoting candidates that don't show up in conversion later, the judge prompt has drifted. Re-seed skill-judge.
Optimizer loops. Same skill, five rejected candidates in a row → telemetry input is too narrow. Widen the context.
Cache poisoning. Always confirm no-store is set before running. A cached /i/<name> at the CDN edge silently breaks the split.
Curator surprise. Every auto-promote writes to audit:<timestamp>:<id>. Retired versions kept 30 days — reverting a bad promote is one click.

07 Live numbers (last 24h)

Pulled live from Cloudflare Analytics Engine SQL API on 2026-05-27 against the skillmake_metrics dataset.

Installs · 24h: 2,505
Versioned installs: 153
Versioned share: 6.1%
Distinct skills hit: 191

The 6.1% versioned share is the rate at which install rows now carry the name@hash8 suffix. That number will climb as edge caches expire and older clients fetch fresh markdown — by the time Phase 2 lands and we actually run a real candidate, every install row will be versioned.

Top 5 skills by install count (last 24h)

superpowers: 83
claude-api: 77
anthropic-frontend-design: 77
mcp-builder: 77
karpathy-claude-md: 72

Top 3 countries (last 24h)

United States · US: 611
India · IN: 607
United Arab Emirates · AE: 511

Deploy verification

The first row that landed in Analytics Engine right after the Phase 1b deploy proved the slug encoder was wired correctly end to end:

blob1: "install_hit"
blob2: "github-actions-docs@cf993479"   // slug@hash8 — Phase 1b live
blob3: "US"
blob5: "curl"

Before Phase 1b, blob2 would have been the bare slug github-actions-docs. The @cf993479 suffix is the first 8 hex chars of the rendered markdown's SHA-256 — exactly what encodeVersionedSlug produces (§11).
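Deriving the 8-character suffix takes one hash call. A minimal sketch, assuming a hypothetical helper name `contentHash8` and UTF-8 encoding of the rendered markdown; the guide only states that the suffix is the first 8 hex characters of the SHA-256.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper name; UTF-8 input encoding is an assumption.
function contentHash8(renderedMarkdown: string): string {
  return createHash("sha256").update(renderedMarkdown, "utf8").digest("hex").slice(0, 8);
}

// Any byte-level change to the markdown yields a different suffix, and so a different version.
const before = contentHash8("# Example skill\n");
const after = contentHash8("# Example skill\n\n");
console.log(`example-skill@${before}`, before === after);
```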

Layer 3 — engineer

08 The 9 SkillEdit ops

Each op is atomic and reversible. The edit budget (MAX_EDITS_PER_STEP = 8) is the textual analog of a learning rate — remove the cap and the optimizer destabilizes. Forbidden on this surface (would require a new submission, not an edit): replace_name (breaks /i/<name>), replace_apiReference.signature (changes the contract), replace_category, replace_audience (both change marketplace placement).

Op	Signature	Touches
`add_gotcha`	`{ value: string(8..280) }`	`gotchas[]` (push)
`delete_gotcha`	`{ index: int }`	`gotchas[]` (splice)
`replace_gotcha`	`{ index: int, value: string(8..280) }`	`gotchas[index]`
`add_keyConcept`	`{ value: { term, explanation } }`	`keyConcepts[]` (push)
`replace_keyConcept`	`{ index, value: { term, explanation } }`	`keyConcepts[index]`
`delete_keyConcept`	`{ index: int }`	`keyConcepts[]` (splice)
`replace_description`	`{ value: string(20..220) }`	`description` (the router)
`replace_whenToUse_item`	`{ index, value: string(8..200) }`	`whenToUse[index]`
`replace_apiReference_example`	`{ index, value: string(0..2400) }`	`apiReference[index].example` only — signature stays locked

The Zod schema (verbatim from src/lib/skill-edit.ts):

export const SkillEditSchema = z.discriminatedUnion("op", [
  z.object({ op: z.literal("add_gotcha"), value: z.string().min(8).max(280) }),
  z.object({ op: z.literal("delete_gotcha"), index: z.number().int().nonnegative() }),
  z.object({
    op: z.literal("replace_gotcha"),
    index: z.number().int().nonnegative(),
    value: z.string().min(8).max(280),
  }),
  z.object({
    op: z.literal("add_keyConcept"),
    value: z.object({
      term: z.string().min(2).max(80),
      explanation: z.string().min(10).max(500),
    }),
  }),
  z.object({
    op: z.literal("replace_keyConcept"),
    index: z.number().int().nonnegative(),
    value: z.object({
      term: z.string().min(2).max(80),
      explanation: z.string().min(10).max(500),
    }),
  }),
  z.object({ op: z.literal("delete_keyConcept"), index: z.number().int().nonnegative() }),
  z.object({ op: z.literal("replace_description"), value: z.string().min(20).max(220) }),
  z.object({
    op: z.literal("replace_whenToUse_item"),
    index: z.number().int().nonnegative(),
    value: z.string().min(8).max(200),
  }),
  z.object({
    op: z.literal("replace_apiReference_example"),
    index: z.number().int().nonnegative(),
    value: z.string().max(2400),
  }),
]);

export const SkillEditsPayload = z
  .array(SkillEditSchema)
  .min(1, "at least one edit required")
  .max(MAX_EDITS_PER_STEP, `bounded to 8 ops per step (learning rate)`);
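To show how a validated payload turns into a new version, here is a sketch of applying edits to the `gotchas[]` array only. The test summary mentions `applyEdits` and `ApplyError`, but their source is not reproduced in this guide, so the function below (`applyGotchaEdits`), its signature, and its error messages are assumptions.

```typescript
type GotchaEdit =
  | { op: "add_gotcha"; value: string }
  | { op: "delete_gotcha"; index: number }
  | { op: "replace_gotcha"; index: number; value: string };

class ApplyError extends Error {}

const MAX_EDITS_PER_STEP = 8;

// Applies edits in order to a copy; the input array is never mutated.
function applyGotchaEdits(gotchas: readonly string[], edits: readonly GotchaEdit[]): string[] {
  if (edits.length > MAX_EDITS_PER_STEP) throw new ApplyError("edit budget exceeded");
  const next = [...gotchas];
  for (const edit of edits) {
    if (edit.op === "add_gotcha") {
      next.push(edit.value);
      continue;
    }
    // Both remaining ops address an existing element, so bounds-check first.
    if (edit.index < 0 || edit.index >= next.length) {
      throw new ApplyError(`${edit.op}: index ${edit.index} out of bounds`);
    }
    if (edit.op === "delete_gotcha") next.splice(edit.index, 1);
    else next[edit.index] = edit.value;
  }
  return next;
}

const parentGotchas = ["Mode persists for the session", "Back up CLAUDE.md first"];
const edited = applyGotchaEdits(parentGotchas, [
  { op: "delete_gotcha", index: 0 },
  { op: "add_gotcha", value: "ULTRA level loses nuance" },
]);
console.log(edited);
```

Indices are resolved against the array as it stands after earlier edits in the same payload, so the order of ops matters.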

09 The 5 validator checks

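The validator runs synchronously on every candidate insert and returns { pass, reasons, warnings, tokenCount }. A fail can still be force-merged by a curator (the override is logged), matching the paper's researcher-override pattern. The table has six rows because protected sections are enforced twice: row 3 at the structured-op level and row 6 at the rendered-markdown level. From src/lib/skill-validator.ts: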

Check	Threshold	Mitigation on fail
1. Schema shape	Zod `SkillSchema.safeParse` must succeed	Hard reject. Fix the field and retry — the validator surfaces the path of every Zod issue.
2. Edit budget	`edits.length ≤ 8`	Hard reject. Split into two candidates — the cap is the learning rate.
3. Protected sections	No op touches a section listed in `protectedSections`	Hard reject. Drop the offending op or remove the section from `protectedSections` (curator-only).
4. Token count	Warn at 1200, hard fail at 1500 (target 920)	Drop low-signal lines or split into two skills. Compactness wins.
5. Description ↔ whenToUse coherence	Each `whenToUse` item shares ≥ 1 keyword with the description	Warning only — surfaces likely router/body drift before it shows up in the conversion gate.
6. Protected-section byte identity	Rendered `<!-- @protected:<id> -->` blocks match parent byte-for-byte	Hard reject. Defense in depth at the rendered-markdown level.

The token estimator is intentionally cheap — chars/4, accurate to ~15% — fast enough to run on every candidate without an extra dependency:

function estimateTokens(markdown: string): number {
  // chars/4 ≈ tokens. Cheap, close enough for gating.
  return Math.ceil(markdown.length / 4);
}

And the protected-section enforcement (verbatim, the part that matters):

// 6. Protected-section byte-identity at the rendered-markdown level.
if (protectedIds.size > 0) {
  const parentBlocks = extractProtectedBlocks(parent.markdown);
  const candidateBlocks = extractProtectedBlocks(candidate.markdown);
  for (const id of protectedIds) {
    if (parentBlocks.get(id) !== candidateBlocks.get(id)) {
      reasons.push(
        `protected_byte_identity: section '${id}' bytes differ between parent and candidate.`
      );
    }
  }
}

10 KV layout

The Phase 1a migration is additive — old reads keep working through the legacy findApprovedByName path. New writes also populate the family namespace.

KV namespace: MARKETPLACE_KV
│
├─ skill:<id>             MarketplaceEntry (JSON), one row per VERSION
│                         id = `${name}-${contentHash.slice(0, 8)}`
│
├─ index:all              { ids: string[] }
├─ index:snapshot:v1      SnapshotShape, edge-cached for 1h
│
└─ family:<name>:
   ├─ current             { id }       ← getCurrentSkillByName reads here
   ├─ candidate           { id }       ← present iff A/B in flight
   └─ history             { ids: [] }  ← ordered oldest → newest

Read paths (from src/lib/storage.ts):

getCurrentSkillByName(name) — reads family:<name>:current, falls back to legacy name scan if no pointer exists.
getCandidateSkillByName(name) — reads family:<name>:candidate, returns null if no A/B is in flight.
listVersionsByName(name) — reads the history pointer, returns every recorded version oldest-first.

Write paths:

saveSkillVersion(parent, edits, input) — writes a new skill:<id> with versionStatus: "candidate", appends to history, sets the candidate pointer.
promoteCandidate(name) — flips candidate → current atomically, demotes previous current to retired, clears the candidate pointer.
retireCandidate(name) — marks candidate as retired and clears the candidate pointer. The row stays queryable for audit and negative-feedback reuse.
ensureFamilyIndexed(entry) — lazy migration: the first read of a legacy entry repairs the family index in place. No separate migration job.
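The pointer transitions performed by the write paths can be modelled in memory. The sketch below is not the storage.ts implementation: the `Family` shape, the `statuses` map, and the function names `promote` and `retire` are assumptions, and it says nothing about how the real functions make the flip atomic on KV.

```typescript
type VersionStatus = "candidate" | "current" | "retired";

// Hypothetical in-memory mirror of the three family pointers.
interface Family {
  current: string; // family:<name>:current
  candidate: string | null; // family:<name>:candidate
  history: string[]; // family:<name>:history, oldest first
}

// Candidate becomes current, previous current becomes retired, candidate pointer clears.
function promote(family: Family, statuses: Map<string, VersionStatus>): Family {
  if (family.candidate === null) throw new Error("no candidate in flight");
  statuses.set(family.current, "retired");
  statuses.set(family.candidate, "current");
  return { current: family.candidate, candidate: null, history: family.history };
}

// Candidate is marked retired and the pointer clears; current is untouched.
function retire(family: Family, statuses: Map<string, VersionStatus>): Family {
  if (family.candidate === null) throw new Error("no candidate in flight");
  statuses.set(family.candidate, "retired");
  return { current: family.current, candidate: null, history: family.history };
}

const statuses = new Map<string, VersionStatus>([
  ["caveman-aaaaaaaa", "current"],
  ["caveman-bbbbbbbb", "candidate"],
]);
const inFlight: Family = {
  current: "caveman-aaaaaaaa",
  candidate: "caveman-bbbbbbbb",
  history: ["caveman-aaaaaaaa", "caveman-bbbbbbbb"],
};
const promoted = promote(inFlight, statuses);
console.log(promoted.current, statuses.get("caveman-aaaaaaaa"));
```

In both transitions the history list is left alone, which is what keeps retired versions queryable for rollback and audit.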

11 The `slug@hash` encoding

Every install_hit row's blob2 column carries <skill-name>@<contentHash[:8]>. The Analytics Engine schema is unchanged; the version is encoded inside the existing column by convention. Old slug-only dashboards still work via splitByChar('@', blob2)[1] — which returns the whole string on legacy rows written before versioning, so old data attributes correctly. From src/lib/metrics-core.ts:

export function encodeVersionedSlug(name: string, contentHash: string): string {
  return `${name}@${contentHash.slice(0, 8)}`;
}

export function parseVersionedSlug(slug: string): { name: string; versionHash: string | null } {
  const at = slug.indexOf("@");
  if (at < 0) return { name: slug, versionHash: null };
  return { name: slug.slice(0, at), versionHash: slug.slice(at + 1) };
}

Round-trip property: parseVersionedSlug(encodeVersionedSlug(name, hash)).versionHash === hash.slice(0, 8). Legacy slugs round-trip too — parseVersionedSlug("legacy-skill").versionHash === null. The metrics.test.ts suite covers both directions and the legacy fallback (7 cases, all green).

12 Protected sections in action

Rendered SKILL.md output wraps protected sections in HTML-comment markers the validator can find with a regex. Here is what a candidate of the caveman skill would render if protectedSections: ["gotchas", "keyConcepts"] were set (synthesized from scripts/seeds/caveman.json via the renderSkillMarkdown logic in src/lib/skill-schema.ts):

---
name: caveman
description: "Use when you want Claude Code (or 30+ agents) to cut ~75% of output tokens..."
source: https://github.com/JuliusBrussee/caveman/blob/main/README.md
generated: 2026-05-27T12:00:00Z
category: tool
audience: engineers
---

<!-- @protected:keyConcepts -->
## Key concepts

### talk like caveman

Skill instructs the agent to drop filler words, hedging, and pleasantries —
keep substance in telegraphic fragments. Same fix, ~75% fewer output tokens,
100% technical accuracy. Brain still big, mouth small.

### four grunt levels

/caveman accepts a level argument: `lite`, `full`, `ultra`, `wenyan`.
Level persists until end of session or `normal mode`.
<!-- /@protected:keyConcepts -->

## API reference

```
One-line install (Mac / Linux / WSL / Git Bash)
```
Finds every agent on the machine and installs the caveman skill into each.

<!-- @protected:gotchas -->
## Gotchas

- Caveman mode persists for the whole session unless you say `normal mode`
- /caveman-compress is destructive — back up CLAUDE.md before running
- ULTRA level loses some nuance — fine for code, risky for prose
<!-- /@protected:gotchas -->

Deep dive: how the byte-identity check actually works

The validator extracts each <!-- @protected:<id> --> … <!-- /@protected:<id> --> block from both the parent and candidate markdown using this regex:

const re = /<!-- @protected:([a-zA-Z]+) -->\n?([\s\S]*?)\n?<!-- \/@protected:\1 -->/g;

The \1 backreference forces the closing marker's id to match the opening marker's id — so a stray @protected:gotchas opener paired with a /@protected:keyConcepts closer is treated as malformed, not as a valid block. The body between markers is captured as group 2 with no normalization. The validator then compares parentBlocks.get(id) !== candidateBlocks.get(id) byte-for-byte — a single trailing space difference fails the check. Defense in depth: even if the structured SkillEdit[] somehow let a protected op through, the rendered-markdown check catches it.

The marker emitter (in renderSkillMarkdown → pushSection) trims the trailing blank line that section emitters push, so the closing marker hugs the section content — that way reformatting the surrounding markdown (e.g. adding a blank line between sections) doesn't shift bytes inside a protected block.
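To make the byte-identity behaviour concrete, here is a sketch of an extractor built on the regex above. The regex is copied from the guide; the body of `extractProtectedBlocks` is an assumed implementation, since only its call sites appear in the validator excerpt.

```typescript
// The pattern is the guide's regex; the surrounding function body is an assumption.
function extractProtectedBlocks(markdown: string): Map<string, string> {
  const re = /<!-- @protected:([a-zA-Z]+) -->\n?([\s\S]*?)\n?<!-- \/@protected:\1 -->/g;
  const blocks = new Map<string, string>();
  let match: RegExpExecArray | null;
  while ((match = re.exec(markdown)) !== null) {
    blocks.set(match[1], match[2]); // id → raw body, no normalization
  }
  return blocks;
}

const parentMd =
  "<!-- @protected:gotchas -->\n## Gotchas\n\n- back up CLAUDE.md\n<!-- /@protected:gotchas -->\n";
// One trailing space inside the block is enough to fail the comparison.
const candidateMd = parentMd.replace("CLAUDE.md\n", "CLAUDE.md \n");
// Mismatched opener and closer ids do not form a block at all.
const malformedMd = "<!-- @protected:gotchas -->\nx\n<!-- /@protected:keyConcepts -->";

const same = extractProtectedBlocks(parentMd).get("gotchas") === extractProtectedBlocks(candidateMd).get("gotchas");
console.log(same, extractProtectedBlocks(malformedMd).size);
```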

13 SQL recipes

All three queries run against the Cloudflare Analytics Engine SQL API. Full recipe collection in docs/analytics-queries.md. Analytics Engine samples on write, so always use sum(_sample_interval) — not count() — to get the true total.

All-time install leaderboard (with the splitByChar wrapper)

SELECT
  splitByChar('@', blob2)[1] AS slug,
  sum(_sample_interval) AS installs
FROM skillmake_metrics
WHERE index1 = 'install_hit' AND blob2 != ''
GROUP BY slug
ORDER BY installs DESC
LIMIT 50

Active A/B candidates (versions_seen > 1)

SELECT
  splitByChar('@', blob2)[1] AS slug,
  count(DISTINCT splitByChar('@', blob2)[2]) AS versions_seen,
  sum(_sample_interval) AS total_installs
FROM skillmake_metrics
WHERE index1 = 'install_hit'
  AND blob2 LIKE '%@%'
  AND timestamp >= NOW() - INTERVAL '14' DAY
GROUP BY slug
HAVING versions_seen > 1
ORDER BY total_installs DESC

When this returns rows, a candidate is live in production. versions_seen = 2 is a clean current+candidate A/B; 3+ means a candidate was promoted and a new candidate is already being tested against it.

Per-version conversion ratio for one skill (caveman example)

WITH
  installs AS (
    SELECT
      splitByChar('@', blob2)[2] AS version_hash,
      sum(_sample_interval) AS installs
    FROM skillmake_metrics
    WHERE index1 = 'install_hit'
      AND splitByChar('@', blob2)[1] = 'caveman'
      AND timestamp >= NOW() - INTERVAL '7' DAY
    GROUP BY version_hash
  ),
  views AS (
    SELECT sum(_sample_interval) AS views
    FROM skillmake_metrics
    WHERE index1 = 'marketplace_view'
      AND blob2 = 'caveman'
      AND timestamp >= NOW() - INTERVAL '7' DAY
  )
SELECT
  installs.version_hash AS version_hash,
  installs.installs AS installs,
  views.views AS detail_page_views,
  round(installs.installs / views.views, 3) AS conversion_ratio
FROM installs CROSS JOIN views
ORDER BY conversion_ratio DESC

marketplace_view is route-agnostic (same slug for both arms), so the denominator is shared. This is the exact ratio the Phase 3 cron runs Welch's t-test against.

14 Test summary

All 25 tests are green across the three suites below. They run on Vitest with no integration mocks, because the libs are pure.

Suite	Cases	What it covers
`src/lib/metrics.test.ts`	7	Round-trip of `encodeVersionedSlug` / `parseVersionedSlug`, legacy slug fallback (no `@`), `buildMetricDataPoint` shape for events with and without status.
`src/lib/skill-edit.test.ts`	10	Zod schema accept/reject for every op kind, edit-budget enforcement, `applyEdits` happy path + `ApplyError` on out-of-bounds index.
`src/lib/skill-validator.test.ts`	8	All 5 checks: schema fail, edit budget overflow, protected-section op rejection, token cap, coherence warning, and the byte-identity check for protected blocks.
Total	25

15 Commit ledger

Phase 1 shipped in four commits on main:

SHA	Title
`2c5a04e`	feat(skillopt): phase 1a+1b — versioned KV storage + slug@hash on install_hit
`103480c`	feat(skillopt): phase 1c+1d+1e — bounded edits, static validator, protected sections
`fac8738`	feat(skillopt): phase 1f — per-version Grafana panels + slug@hash-aware SQL
`03cd440`	Add steipete agent script skills (seed adds, same window)

16 Deploy state

Worker version: c846c4c3-771f-49ec-8d75-6d678539d8ab
Live at: skillmake.xyz · www.skillmake.xyz
Account: Cloudflare f9d7efb12a2713ce5af52d882165c543
Runtime: Next.js on Cloudflare Workers via OpenNext + nodejs_compat
Data plane: KV namespace MARKETPLACE_KV · Analytics Engine dataset skillmake_metrics

17 What's next

Phase 2 — A/B routing. The install route at /i/<name> needs to call getCandidateSkillByName in parallel with the current lookup, hash (cf-connecting-ip + name) to assign an arm, and flip cache headers to private, no-store for the duration of any in-flight A/B. The deterministic hash means visitors never bounce between arms, even across reloads or fresh tabs — the only thing that resets is when the candidate gets promoted or retired.

Phase 3 — dynamic validation gate. Hourly cron at /api/cron/validate-candidates reads the last 7 days of Analytics Engine, computes the conversion ratio per arm, runs Welch's t-test, and either calls promoteCandidate(name) or retireCandidate(name). The long-tail fallback to the skill-judge LLM-judge skill kicks in at 14 days without crossing the 200-installs-per-arm threshold. The cron is gated by SKILLOPT_ENABLED — one env flip kills the whole loop.

Phase 4 — optimizer loop. Weekly cron at /api/cron/propose-edits picks the bottom-quartile skills by conversion, pulls 14 days of telemetry, reads the rejected-edits store for that skill, and prompts a frontier model for a Zod-validated SkillEdit[] of length 4–8. Apply → static gate → save as candidate → enters A/B at 20% share. Budget at current corpus size is roughly $10/month.
