Skillcheck Cloud: forced-injection A/B evaluation for agent skills
10 free runs · no credit card
Agent skill evaluation

Does your skill actually help?

Skillcheck runs a forced-injection A/B test: your runner model solves generated tasks with and without the skill, a separate grader scores the outputs blind, and you get back an effect in percentage points with a confidence interval. Sign in, get an API key, and check any skill from the CLI.

View the methodology
npm install -g @sx4im/skillcheck
skillcheck check ./SKILL.md
How it works

Three steps to a verdict

1. Sign in

Continue with Google or GitHub. We create your account and issue a Skillcheck API key tied to it.

2. Get 10 free runs

Your key includes 10 free skillcheck check runs. The model provider key stays on our server, never in your terminal.

3. Run skillcheck

Paste two commands, point the CLI at your key, and measure any SKILL.md, AGENTS.md, CLAUDE.md, or .cursorrules.

What you get

Evidence, not anecdotes

Skillcheck produces numbers you can cite. Every run is reproducible, every score has a confidence interval, and every skill gets a verdict.

Forced-injection A/B

We inject your skill as a system prompt and run the same tasks twice — with and without the skill. The delta is your effect in percentage points.
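As a rough illustration of the arithmetic only (not Skillcheck's actual implementation), the effect is the pass rate with the skill minus the pass rate without it, in percentage points. The function name `effect_pp` and the pass/fail lists are hypothetical.

```python
def effect_pp(with_skill: list[bool], without_skill: list[bool]) -> float:
    """Pass rate with the skill minus pass rate without it, in percentage points."""
    if not with_skill or not without_skill:
        raise ValueError("both arms need at least one graded task")
    rate_with = sum(with_skill) / len(with_skill)
    rate_without = sum(without_skill) / len(without_skill)
    return 100.0 * (rate_with - rate_without)


# Hypothetical results: 8 of 10 tasks pass with the skill, 6 of 10 without.
with_skill = [True] * 8 + [False] * 2
without_skill = [True] * 6 + [False] * 4
print(round(effect_pp(with_skill, without_skill), 1))  # 20.0
```

A positive number means the skill arm passed more tasks; a negative number means it passed fewer.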

Blind grading

A separate grader model scores outputs without knowing which arm produced them. No self-evaluation bias, no cherry-picking.
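One way to picture blind grading, as a minimal sketch under our own assumptions: outputs from both arms are shuffled and stripped of their arm label before the grader sees them, and the labels are re-attached only after scoring. The helpers `blind` and `unblind` are hypothetical names, not part of the Skillcheck CLI.

```python
import random


def blind(outputs_with, outputs_without, seed=0):
    """Shuffle both arms together; return (anonymous outputs, hidden arm key)."""
    rng = random.Random(seed)
    labelled = [("with", o) for o in outputs_with]
    labelled += [("without", o) for o in outputs_without]
    rng.shuffle(labelled)
    key = [arm for arm, _ in labelled]          # kept away from the grader
    anonymous = [text for _, text in labelled]  # all the grader ever sees
    return anonymous, key


def unblind(scores, key):
    """Re-attach arm labels to the grader's scores after grading is done."""
    by_arm = {"with": [], "without": []}
    for arm, score in zip(key, scores):
        by_arm[arm].append(score)
    return by_arm


anonymous, key = blind(["answer A", "answer B"], ["answer C", "answer D"])
# Stand-in for the grader model: it scores text without access to `key`.
scores = [1 if text in ("answer A", "answer B") else 0 for text in anonymous]
print(unblind(scores, key))
```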

Bootstrap CI

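We resample the results 1,000 times to build a 95% confidence interval. If the whole interval lies above zero, the skill helps; if the interval includes zero, the effect cannot be distinguished from no effect.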
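For readers who want to see the idea in code, here is a minimal sketch of a percentile bootstrap over paired task results. The pairing by task, the percentile method, and the name `bootstrap_ci` are our assumptions for illustration; the exact variant Skillcheck uses is described in the methodology, not here.

```python
import random


def bootstrap_ci(with_skill, without_skill, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap CI for the effect in percentage points.

    Assumes both lists hold pass/fail results for the same tasks, in the same order.
    """
    if len(with_skill) != len(without_skill) or not with_skill:
        raise ValueError("arms must be non-empty and the same length")
    rng = random.Random(seed)
    # Per-task difference: +1 skill-only pass, -1 baseline-only pass, 0 tie.
    diffs = [int(a) - int(b) for a, b in zip(with_skill, without_skill)]
    n = len(diffs)
    # Resample tasks with replacement and recompute the effect each time.
    deltas = sorted(
        100.0 * sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lower = deltas[round((1 - level) / 2 * n_resamples)]
    upper = deltas[round((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Hypothetical results for 10 tasks.
with_skill = [True] * 8 + [False] * 2
without_skill = [True] * 6 + [False] * 4
lower, upper = bootstrap_ci(with_skill, without_skill)
print(f"95% CI: [{lower:.1f}, {upper:.1f}] percentage points")
```

With only 10 tasks the interval will be wide; more tasks narrow it.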

Rot detection

Re-run the same skill on new model releases. If the verdict flips from helps to placebo, you know the skill rotted.
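A minimal sketch of what a rot check could look like if you keep your own log of verdicts per model release. The `rotted` helper and the release names are hypothetical; only the verdict labels "helps" and "placebo" come from the text above.

```python
def rotted(verdicts_by_release):
    """Return the releases at which the verdict flipped from 'helps' to 'placebo'."""
    flips = []
    pairs = zip(verdicts_by_release, verdicts_by_release[1:])
    for (_, previous), (release, current) in pairs:
        if previous == "helps" and current == "placebo":
            flips.append(release)
    return flips


# Hypothetical history of one skill across three model releases, oldest first.
history = [("model-v1", "helps"), ("model-v2", "helps"), ("model-v3", "placebo")]
print(rotted(history))  # ['model-v3']
```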

Reproducible

Every result records the skill hash, task suite, model version, and config. Anyone can re-run and verify the number.
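To make the idea concrete, here is a sketch of the kind of record that pins a result to its inputs. The field names, the choice of SHA-256, and the `run_manifest` helper are assumptions for illustration, not Skillcheck's actual result schema.

```python
import hashlib
import json


def run_manifest(skill_text, task_suite, model_version, config):
    """Bundle everything needed to re-run an evaluation into one record."""
    return {
        # Hash of the exact skill contents; any edit to the skill changes it.
        "skill_sha256": hashlib.sha256(skill_text.encode("utf-8")).hexdigest(),
        "task_suite": task_suite,
        "model_version": model_version,
        "config": config,
    }


# Hypothetical values throughout.
manifest = run_manifest("# My skill\n", "suite-v1", "model-v1", {"seed": 0})
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Two runs with the same manifest should be comparable; a different hash means a different skill was measured.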

Token-aware

We count the extra tokens the skill costs and compute value-per-1k-tokens. A big effect that burns context is not always a win.
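As a back-of-the-envelope sketch, value-per-1k-tokens can be read as the effect divided by the extra tokens in thousands. That definition and the function name are our assumption here; the page does not spell out the exact formula.

```python
def value_per_1k_tokens(effect_pp: float, extra_tokens: int) -> float:
    """Percentage points gained per 1,000 extra tokens the skill adds."""
    if extra_tokens <= 0:
        raise ValueError("extra_tokens must be positive")
    return effect_pp / (extra_tokens / 1000)


# Hypothetical skills: same 12-point effect, very different context cost.
print(value_per_1k_tokens(12.0, 3000))  # 4.0
print(value_per_1k_tokens(12.0, 500))   # 24.0
```

On this reading, the shorter skill delivers six times the value per token for the same effect.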

Supported formats

One command for every skill file

Drop any supported skill file and Skillcheck normalises it into a common shape for evaluation.

SKILL.md

The Anthropic skill-creator format. Extracts instructions, domain, and bundled assets into a normalised skill object.

AGENTS.md

The multi-agent format with role definitions and routing rules. Each role is evaluated independently.

CLAUDE.md

The Claude-specific project context format. Injected as system context during the runner phase.

.cursorrules

Cursor IDE rules files. Parsed as instruction sets and evaluated for code-generation and reasoning tasks.

URL

Remote URL

Point the CLI at a raw GitHub URL or gist. Skillcheck fetches, normalises, and evaluates without a local clone.

FOLDER

Directory scan

Pass a folder path and Skillcheck discovers all supported skill files inside, evaluating each one.
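A minimal sketch of what discovery over a folder might involve, assuming a recursive walk and matching on the four file names listed above. Whether the CLI recurses into subfolders, and the `discover` helper itself, are assumptions for illustration.

```python
import os
from pathlib import Path

# The four supported file names listed on this page.
SUPPORTED = {"SKILL.md", "AGENTS.md", "CLAUDE.md", ".cursorrules"}


def discover(folder):
    """Return every supported skill file under `folder`, sorted by path."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name in SUPPORTED:
                found.append(Path(root) / name)
    return sorted(found)


# List whatever skill files exist under the current directory.
for path in discover("."):
    print(path)
```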

Pricing

Start free. Upgrade when it pays off.

Free
$0
  • 10 Skillcheck runs included
  • Full CLI: check, eval, verify
  • Blind grading + bootstrap CI
  • Community support

The price shown is an example; set your own in Stripe. The upgrade option appears in your dashboard after sign-in.

Ready to measure your skills?

Sign in with Google or GitHub and get your API key in seconds.