SkillOpt treats a SKILL.md file as a trainable parameter — frozen model, optimized markdown. This page is the single doc for understanding what is live on skillmake.xyz, how the A/B routing decides which version of a skill you see, and how to operate the system end to end. Read top to bottom: newcomer first, then product/PM, then engineer.
A SKILL.md is the markdown file an AI agent reads to learn how to do one specific task — install a library, refactor a file, render a video. The agent loads it on demand, treats it as authoritative, and behaves accordingly. skillmake.xyz is the public marketplace where these skills live; each one has an install URL like skillmake.xyz/i/superpowers that returns the raw markdown so any agent can fetch it.
The SkillOpt paper (Wang et al., arXiv 2605.23904) makes one simple move: treat the SKILL.md as a trainable parameter. The model stays frozen — we never fine-tune Claude or Cursor — but the markdown gets optimized over time the same way a neural net's weights would. Bounded edits per step. Held-out validation gate. Strict-improvement promotion. Sections you mark as "do not touch" are byte-locked against the edit machinery.
That framing matters because "self-improving agents" without those guardrails mostly produce slop. SkillOpt is the discipline that turns a freeform edit this skill button into a stable optimization loop.
Frozen model plus trained context window is the practical adaptation. We are not training Claude. We are training the document Claude reads.
A SKILL.md is a plain-text recipe an agent loads to do one job. skillmake.xyz hosts a curated marketplace of them — 94 skills today across engineers, creators, devops, design, AI, marketing, and general audiences. Each skill has a stable install URL (/i/<name>) that returns the raw markdown so any agent — Claude Code, Cursor, Codex, Aider, Cline — can install it in one curl. The site itself is a Next.js app on Cloudflare Workers (via OpenNext), with KV for storage and Analytics Engine for usage telemetry.
Yes — every install hit, page view, and click writes a row to Cloudflare Workers Analytics Engine. No personally identifying info is stored. The daily visitor id is a SHA-256 of ip + ua + day truncated to 8 bytes — same person on the same day collapses to one id, but cannot be tracked across days. Search queries are hashed before write. There are 14 MetricEvent types in src/lib/metrics-core.ts:
| # | Event | What it tells us |
|---|---|---|
| 1 | install_hit | Someone fetched /i/<name>. Encoded as name@hash8 since Phase 1b so we can split by version. This is the gradient signal. |
| 2 | marketplace_view | The listing page for a skill loaded. Denominator for the conversion ratio. |
| 3 | home_view | Homepage loaded. blob2 carries the audience filter slug if active. |
| 4 | submit_started | Top of the submit funnel — someone began creating a new skill. |
| 5 | submit_completed | Bottom of the submit funnel — submission saved. |
| 6 | convert_success | The /api/convert extraction pipeline succeeded. |
| 7 | convert_error | The extraction pipeline failed. blob2 = error code, double2 = HTTP status. |
| 8 | tricks_view | The /tricks subsection loaded. |
| 9 | powerhouse_view | The /powerhouse subsection loaded. |
| 10 | search_submitted | A search query submitted (hashed). Volume + funnel signal, not literal text. |
| 11 | github_click | Outbound click to the repo. Catches "intent leak" — read the listing, bounced to GitHub instead of installing. |
| 12 | page_dwell | Bucketed dwell time: 0–5s / 5–15s / 15–30s / 30–60s / 60–300s / 300s+. |
| 13 | scroll_depth | Bucketed reading depth: 0 / 25 / 50 / 75 / 100. |
| 14 | api_error | Any 4xx/5xx from a server route. Mirrored to a structured JSON log via observability.ts. |
Three failure modes are real enough that we built explicit mitigations:
/i/<name> flips to Cache-Control: private, no-store. After promotion, headers restore to public, max-age=300, must-revalidate.skill-judge peer skill — it must score strictly higher than the current before promotion.skill:<id>:rejected:v<hash>. The optimizer prompt includes the last N rejections so it doesn't re-propose them.| Term | Definition |
|---|---|
SKILL.md | The plain-text recipe an agent loads to learn one task. The trainable parameter. |
| marketplace | The browse surface at skillmake.xyz where humans pick skills; each entry has an /i/<name> install URL. |
install_hit | The telemetry event fired when an agent fetches a skill's markdown. The gradient signal for the optimizer. |
| candidate | A new version of a skill behind the A/B share. Not yet promoted. |
| current | The version served to the majority of /i/<name> hits. There is exactly one per skill family. |
| retired | A previous version, kept queryable for 30 days for rollback and audit. |
| A/B split | Deterministic per-visitor routing between candidate and current — same visitor sees the same arm across reloads. |
| held-out gate | The dynamic conversion comparison that decides promotion. Welch's t-test, p < 0.05, strict improvement. |
| bounded edit | One of the 9 atomic SkillEdit ops. Max 8 per candidate — the textual learning rate. |
| protected section | A section id (whenToUse, keyConcepts, apiReference, or gotchas) that the edit machinery refuses to modify. |
The paper's Figure 1 walks through the gradient-descent analogy. Here is the exact translation we use on skillmake:
| Paper concept | skillmake equivalent |
|---|---|
Parameter θ | A published SKILL.md (MarketplaceEntry.markdown in KV) |
Frozen model f | Whatever agent the visitor runs (Claude Code, Cursor, Codex, …) — we never touch it |
| Gradient direction | Trajectory-derived edits from the optimizer — 4–8 atomic SkillEdit ops derived from telemetry + failure signal |
| Learning rate | Bounded edit budget — MAX_EDITS_PER_STEP = 8; forbidden ops are replace_name, replace_apiReference.signature, replace_category, replace_audience |
| Validation check | Held-out conversion gate — install_hit / marketplace_view strictly improves on A/B-routed traffic, p < 0.05 |
| Batch / minibatch | N-install threshold before the gate fires — default 200 installs per arm |
| Rejected side updates | Candidates that fail the static or dynamic gate — logged as negative feedback for the next optimizer pass |
| Protected section invariant | protectedSections + <!-- @protected:<id> --> markers, byte-identical between candidate and parent |
Versioned KV storage, bounded edits, static gate, protected sections, per-version Grafana panels.
skill:<id> per-version rows, plus family pointers family:<name>:current, family:<name>:candidate, family:<name>:history. Old reads keep working — the migration is additive.install_hit.blob2 = name@hash8. Existing dashboards keep working via the splitByChar('@', blob2)[1] wrapper.SkillEdit[] payload, 9 op kinds, ≤ 8 ops per candidate.Deterministic per-visitor split at /i/<name>:
const candidate = await getCandidateSkillByName(name);
const useCandidate = candidate && shouldRouteCandidate(req, candidate.abTrafficShare);
const served = useCandidate ? candidate : current;
shouldRouteCandidate hashes (cf-connecting-ip + name), so the same visitor always sees the same arm. Default share: 20% candidate / 80% current. While a candidate exists for a skill, cache headers flip to private, no-store.
Hourly Cloudflare Cron at /api/cron/validate-candidates reads Analytics Engine for the last 7 days. For each candidate: query marketplace_view count (route-agnostic, shared denominator), query install_hit per arm, require ≥ 200 installs per arm, Welch's t-test on conversion ratios with p < 0.05 AND strict improvement. Pass → promoteCandidate(name) atomically. Regress → retireCandidate(name) + write the diff to the negative-feedback store.
Weekly cron /api/cron/propose-edits: find approved skills below the 25th percentile in conversion with no active candidate, pull 14 days of telemetry + last N rejected candidates, render to a narrow prompt, require a Zod-validated SkillEdit[] of length 4–8, apply → static gate → save as candidate. An SKILLOPT_ENABLED=true|false env var gates every cron — kill switch is one flip.
Once Phase 2 lands, here is exactly what running an experiment looks like from the operator side.
The candidate goes through skill-validator on insert. Token count under 1500 (the paper's compactness lesson — best skills land around 920). Description shares a keyword with each whenToUse item — the cheap router/body proxy. Protected sections byte-identical to the parent. Edit budget ≤ 8 ops. Fail = the candidate never enters the A/B; the curator sees the reasons and can revise or force-pass with a logged override.
Default abTrafficShare: 0.2. shouldRouteCandidate hashes (cf-connecting-ip + name). Cache header drops to private, no-store for the duration — otherwise the edge cache poisons the split.
The hourly cron reports "insufficient data" below 200 per arm. If 14 days pass without either arm crossing the threshold, the LLM-judge fallback decides instead. Curator can manually nudge the share higher (50% / 80%) to accelerate low-traffic skills.
Grafana exposes the per-version panels needed to sanity-check the test before the cron decides:
install_hit / marketplace_viewcurl) but losing on browsers? Router/body driftgithub_click goes up while installs go down, the candidate is causing intent leakConversion ratios per arm, Welch's t-test (unequal variance — realistic for differently-sized arms), p < 0.05 AND strict improvement. Ties reject. Regressions retire the candidate and the diff lands in the negative-feedback store keyed at skill:<id>:rejected:v<hash>.
Pointer flips atomically via promoteCandidate(name). Cache header restores to public, max-age=300, must-revalidate. Previous current moves to retired but stays queryable for 30 days from /admin/audit — every auto-promote is reversible. Operator pulls per-version panels for the next 24 hours and confirms the conversion lift holds at 100% traffic.
The test that matters. The paper's central insight: aggregate accuracy is the wrong unit. You can ship a candidate that raises the corpus mean by 2pp while making three high-traffic skills measurably worse. The cron decides per-skill; the operator spot-checks that this skill's effect size exceeds the corpus average over the last 14 days.
skill-judge.no-store is set before running. A cached /i/<name> at the CDN edge silently breaks the split.audit:<timestamp>:<id>. Retired versions kept 30 days — reverting a bad promote is one click.The 6.1% versioned share is the rate at which install rows now carry the name@hash8 suffix. That number will climb as edge caches expire and older clients fetch fresh markdown — by the time Phase 2 lands and we actually run a real candidate, every install row will be versioned.
The first row that landed in Analytics Engine right after the Phase 1b deploy proved the slug encoder was wired correctly end to end:
blob1: "install_hit"
blob2: "github-actions-docs@cf993479" // slug@hash8 — Phase 1b live
blob3: "US"
blob5: "curl"
Before Phase 1b, blob2 would have been the bare slug github-actions-docs. The @cf993479 suffix is the first 8 hex chars of the rendered markdown's SHA-256 — exactly what encodeVersionedSlug produces (§11).
Each op is atomic and reversible. The edit budget (MAX_EDITS_PER_STEP = 8) is the textual analog of a learning rate — remove the cap and the optimizer destabilizes. Forbidden on this surface (would require a new submission, not an edit): replace_name (breaks /i/<name>), replace_apiReference.signature (changes the contract), replace_category, replace_audience (both change marketplace placement).
| Op | Signature | Touches |
|---|---|---|
add_gotcha | { value: string(8..280) } | gotchas[] (push) |
delete_gotcha | { index: int } | gotchas[] (splice) |
replace_gotcha | { index: int, value: string(8..280) } | gotchas[index] |
add_keyConcept | { value: { term, explanation } } | keyConcepts[] (push) |
replace_keyConcept | { index, value: { term, explanation } } | keyConcepts[index] |
delete_keyConcept | { index: int } | keyConcepts[] (splice) |
replace_description | { value: string(20..220) } | description (the router) |
replace_whenToUse_item | { index, value: string(8..200) } | whenToUse[index] |
replace_apiReference_example | { index, value: string(0..2400) } | apiReference[index].example only — signature stays locked |
The Zod schema (verbatim from src/lib/skill-edit.ts):
export const SkillEditSchema = z.discriminatedUnion("op", [
z.object({ op: z.literal("add_gotcha"), value: z.string().min(8).max(280) }),
z.object({ op: z.literal("delete_gotcha"), index: z.number().int().nonnegative() }),
z.object({
op: z.literal("replace_gotcha"),
index: z.number().int().nonnegative(),
value: z.string().min(8).max(280),
}),
z.object({
op: z.literal("add_keyConcept"),
value: z.object({
term: z.string().min(2).max(80),
explanation: z.string().min(10).max(500),
}),
}),
z.object({
op: z.literal("replace_keyConcept"),
index: z.number().int().nonnegative(),
value: z.object({
term: z.string().min(2).max(80),
explanation: z.string().min(10).max(500),
}),
}),
z.object({ op: z.literal("delete_keyConcept"), index: z.number().int().nonnegative() }),
z.object({ op: z.literal("replace_description"), value: z.string().min(20).max(220) }),
z.object({
op: z.literal("replace_whenToUse_item"),
index: z.number().int().nonnegative(),
value: z.string().min(8).max(200),
}),
z.object({
op: z.literal("replace_apiReference_example"),
index: z.number().int().nonnegative(),
value: z.string().max(2400),
}),
]);
export const SkillEditsPayload = z
.array(SkillEditSchema)
.min(1, "at least one edit required")
.max(MAX_EDITS_PER_STEP, `bounded to 8 ops per step (learning rate)`);
Runs synchronously on every candidate insert. Returns { pass, reasons, warnings, tokenCount }. A fail can still be force-merged by a curator (the override is logged), matching the paper's researcher-override pattern. From src/lib/skill-validator.ts:
| Check | Threshold | Mitigation on fail |
|---|---|---|
| 1. Schema shape | Zod SkillSchema.safeParse must succeed | Hard reject. Fix the field and retry — the validator surfaces the path of every Zod issue. |
| 2. Edit budget | edits.length ≤ 8 | Hard reject. Split into two candidates — the cap is the learning rate. |
| 3. Protected sections | No op touches a section listed in protectedSections | Hard reject. Drop the offending op or remove the section from protectedSections (curator-only). |
| 4. Token count | Warn at 1200, hard fail at 1500 (target 920) | Drop low-signal lines or split into two skills. Compactness wins. |
| 5. Description ↔ whenToUse coherence | Each whenToUse item shares ≥ 1 keyword with the description | Warning only — surfaces likely router/body drift before it shows up in the conversion gate. |
| 6. Protected-section byte identity | Rendered <!-- @protected:<id> --> blocks match parent byte-for-byte | Hard reject. Defense in depth at the rendered-markdown level. |
The token estimator is intentionally cheap — chars/4, accurate to ~15% — fast enough to run on every candidate without an extra dependency:
function estimateTokens(markdown: string): number {
// chars/4 ≈ tokens. Cheap, close enough for gating.
return Math.ceil(markdown.length / 4);
}
And the protected-section enforcement (verbatim, the part that matters):
// 6. Protected-section byte-identity at the rendered-markdown level.
if (protectedIds.size > 0) {
const parentBlocks = extractProtectedBlocks(parent.markdown);
const candidateBlocks = extractProtectedBlocks(candidate.markdown);
for (const id of protectedIds) {
if (parentBlocks.get(id) !== candidateBlocks.get(id)) {
reasons.push(
`protected_byte_identity: section '${id}' bytes differ between parent and candidate.`
);
}
}
}
The Phase 1a migration is additive — old reads keep working through the legacy findApprovedByName path. New writes also populate the family namespace.
Read paths (from src/lib/storage.ts):
getCurrentSkillByName(name) — reads family:<name>:current, falls back to legacy name scan if no pointer exists.getCandidateSkillByName(name) — reads family:<name>:candidate, returns null if no A/B is in flight.listVersionsByName(name) — reads the history pointer, returns every recorded version oldest-first.Write paths:
saveSkillVersion(parent, edits, input) — writes a new skill:<id> with versionStatus: "candidate", appends to history, sets the candidate pointer.promoteCandidate(name) — flips candidate → current atomically, demotes previous current to retired, clears the candidate pointer.retireCandidate(name) — marks candidate as retired and clears the candidate pointer. The row stays queryable for audit and negative-feedback reuse.ensureFamilyIndexed(entry) — lazy migration: the first read of a legacy entry repairs the family index in place. No separate migration job.slug@hash encodingEvery install_hit row's blob2 column carries <skill-name>@<contentHash[:8]>. The Analytics Engine schema is unchanged; the version is encoded inside the existing column by convention. Old slug-only dashboards still work via splitByChar('@', blob2)[1] — which returns the whole string on legacy rows written before versioning, so old data attributes correctly. From src/lib/metrics-core.ts:
export function encodeVersionedSlug(name: string, contentHash: string): string {
return `${name}@${contentHash.slice(0, 8)}`;
}
export function parseVersionedSlug(slug: string): { name: string; versionHash: string | null } {
const at = slug.indexOf("@");
if (at < 0) return { name: slug, versionHash: null };
return { name: slug.slice(0, at), versionHash: slug.slice(at + 1) };
}
Round-trip property: parseVersionedSlug(encodeVersionedSlug(name, hash)).versionHash === hash.slice(0, 8). Legacy slugs round-trip too — parseVersionedSlug("legacy-skill").versionHash === null. The metrics.test.ts suite covers both directions and the legacy fallback (7 cases, all green).
Rendered SKILL.md output wraps protected sections in HTML-comment markers the validator can find with a regex. Here is what a candidate of the caveman skill would render if protectedSections: ["gotchas", "keyConcepts"] were set (synthesized from scripts/seeds/caveman.json via the renderSkillMarkdown logic in src/lib/skill-schema.ts):
---
name: caveman
description: "Use when you want Claude Code (or 30+ agents) to cut ~75% of output tokens..."
source: https://github.com/JuliusBrussee/caveman/blob/main/README.md
generated: 2026-05-27T12:00:00Z
category: tool
audience: engineers
---
<!-- @protected:keyConcepts -->
## Key concepts
### talk like caveman
Skill instructs the agent to drop filler words, hedging, and pleasantries —
keep substance in telegraphic fragments. Same fix, ~75% fewer output tokens,
100% technical accuracy. Brain still big, mouth small.
### four grunt levels
/caveman accepts a level argument: `lite`, `full`, `ultra`, `wenyan`.
Level persists until end of session or `normal mode`.
<!-- /@protected:keyConcepts -->
## API reference
```
One-line install (Mac / Linux / WSL / Git Bash)
```
Finds every agent on the machine and installs the caveman skill into each.
<!-- @protected:gotchas -->
## Gotchas
- Caveman mode persists for the whole session unless you say `normal mode`
- /caveman-compress is destructive — back up CLAUDE.md before running
- ULTRA level loses some nuance — fine for code, risky for prose
<!-- /@protected:gotchas -->
The validator extracts each <!-- @protected:<id> -->…<!-- /@protected:<id> --> block from both the parent and candidate markdown using this regex:
const re = /<!-- @protected:([a-zA-Z]+) -->\n?([\s\S]*?)\n?<!-- \/@protected:\1 -->/g;
The \1 backreference forces the closing marker's id to match the opening marker's id — so a stray @protected:gotchas opener paired with a /@protected:keyConcepts closer is treated as malformed, not as a valid block. The body between markers is captured as group 2 with no normalization. The validator then compares parentBlocks.get(id) !== candidateBlocks.get(id) byte-for-byte — a single trailing space difference fails the check. Defense in depth: even if the structured SkillEdit[] somehow let a protected op through, the rendered-markdown check catches it.
The marker emitter (in renderSkillMarkdown → pushSection) trims the trailing blank line that section emitters push, so the closing marker hugs the section content — that way reformatting the surrounding markdown (e.g. adding a blank line between sections) doesn't shift bytes inside a protected block.
All three queries run against the Cloudflare Analytics Engine SQL API. Full recipe collection in docs/analytics-queries.md. Analytics Engine samples on write, so always use sum(_sample_interval) — not count() — to get the true total.
SELECT
splitByChar('@', blob2)[1] AS slug,
sum(_sample_interval) AS installs
FROM skillmake_metrics
WHERE index1 = 'install_hit' AND blob2 != ''
GROUP BY slug
ORDER BY installs DESC
LIMIT 50
SELECT
splitByChar('@', blob2)[1] AS slug,
count(DISTINCT splitByChar('@', blob2)[2]) AS versions_seen,
sum(_sample_interval) AS total_installs
FROM skillmake_metrics
WHERE index1 = 'install_hit'
AND blob2 LIKE '%@%'
AND timestamp >= NOW() - INTERVAL '14' DAY
GROUP BY slug
HAVING versions_seen > 1
ORDER BY total_installs DESC
When this returns rows, a candidate is live in production. versions_seen = 2 is a clean current+candidate A/B; 3+ means a candidate was promoted and a new candidate is already being tested against it.
WITH
installs AS (
SELECT
splitByChar('@', blob2)[2] AS version_hash,
sum(_sample_interval) AS installs
FROM skillmake_metrics
WHERE index1 = 'install_hit'
AND splitByChar('@', blob2)[1] = 'caveman'
AND timestamp >= NOW() - INTERVAL '7' DAY
GROUP BY version_hash
),
views AS (
SELECT sum(_sample_interval) AS views
FROM skillmake_metrics
WHERE index1 = 'marketplace_view'
AND blob2 = 'caveman'
AND timestamp >= NOW() - INTERVAL '7' DAY
)
SELECT
installs.version_hash AS version_hash,
installs.installs AS installs,
views.views AS detail_page_views,
round(installs.installs / views.views, 3) AS conversion_ratio
FROM installs CROSS JOIN views
ORDER BY conversion_ratio DESC
marketplace_view is route-agnostic (same slug for both arms), so the denominator is shared. This is the exact ratio the Phase 3 cron runs Welch's t-test against.
All 25 tests green by suite. Vitest, no integration mocks needed — the libs are pure.
| Suite | Cases | What it covers |
|---|---|---|
src/lib/metrics.test.ts | 7 | Round-trip of encodeVersionedSlug / parseVersionedSlug, legacy slug fallback (no @), buildMetricDataPoint shape for events with and without status. |
src/lib/skill-edit.test.ts | 10 | Zod schema accept/reject for every op kind, edit-budget enforcement, applyEdits happy path + ApplyError on out-of-bounds index. |
src/lib/skill-validator.test.ts | 8 | All 5 checks: schema fail, edit budget overflow, protected-section op rejection, token cap, coherence warning, and the byte-identity check for protected blocks. |
| Total | 25 |
Phase 1 shipped in four commits on main:
| SHA | Title |
|---|---|
2c5a04e | feat(skillopt): phase 1a+1b — versioned KV storage + slug@hash on install_hit |
103480c | feat(skillopt): phase 1c+1d+1e — bounded edits, static validator, protected sections |
fac8738 | feat(skillopt): phase 1f — per-version Grafana panels + slug@hash-aware SQL |
03cd440 | Add steipete agent script skills (seed adds, same window) |
Worker version: c846c4c3-771f-49ec-8d75-6d678539d8ab
Live at: skillmake.xyz · www.skillmake.xyz
Account: Cloudflare f9d7efb12a2713ce5af52d882165c543
Runtime: Next.js on Cloudflare Workers via OpenNext + nodejs_compat
Data plane: KV namespace MARKETPLACE_KV · Analytics Engine dataset skillmake_metrics
Phase 2 — A/B routing. The install route at /i/<name> needs to call getCandidateSkillByName in parallel with the current lookup, hash (cf-connecting-ip + name) to assign an arm, and flip cache headers to private, no-store for the duration of any in-flight A/B. The deterministic hash means visitors never bounce between arms, even across reloads or fresh tabs — the only thing that resets is when the candidate gets promoted or retired.
Phase 3 — dynamic validation gate. Hourly cron at /api/cron/validate-candidates reads the last 7 days of Analytics Engine, computes the conversion ratio per arm, runs Welch's t-test, and either calls promoteCandidate(name) or retireCandidate(name). The long-tail fallback to the skill-judge LLM-judge skill kicks in at 14 days without crossing the 200-installs-per-arm threshold. The cron is gated by SKILLOPT_ENABLED — one env flip kills the whole loop.
Phase 4 — optimizer loop. Weekly cron at /api/cron/propose-edits picks the bottom-quartile skills by conversion, pulls 14 days of telemetry, reads the rejected-edits store for that skill, and prompts a frontier model for a Zod-validated SkillEdit[] of length 4–8. Apply → static gate → save as candidate → enters A/B at 20% share. Budget at current corpus size is roughly $10/month.