Proof, not vibes

Stop guessing whether your prompt got better.

Rubrkit grades your instruction against a rubric, rewrites it, and runs the eval that proves the new version passes what the old one failed. Score delta, diff, evidence — not vibes.

Specimen RBR-114 (example)

Before: 38
After: 86 (+48)

The proof screen

What a rewrite cannot give you.

One specimen, four marks: a structural finding with a line number, the diff, the eval case that flips from fail to pass, and the score delta on the dimension that moved. Shown here as an example.

1 · Finding

L1 No success criteria — “good launch post” has no testable definition of done.

2 · Diff

− Write a good launch post for my product.

+ Write a launch post for [AUDIENCE] introducing [PRODUCT]… end with a single [CTA].

3 · Eval case

v1 · FAIL
v2 · PASS

“A first-time reader can name the product, problem, outcome, and CTA in under 30 seconds.”

4 · Score delta

38 → 86

Output specification +4 → pass

What a rewrite will not catch

Deterministic findings, not opinions.

An inline "make it better" hands back a new paragraph and a vibe. Rubrkit grades against a rubric, so the weaknesses it surfaces are checks you can re-run — and the proof is a behaviour change, not a nicer sentence.

Structural gaps

A missing output contract, no success criteria, an unbounded step — each flagged with the line it occurs on, not described in the abstract.
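As a rough illustration of a finding tied to a line number, here is a minimal Python sketch. The word list, the `Finding` type, and `find_vague_criteria` are hypothetical names invented for this example, not Rubrkit's actual checks.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical list of quality words that carry no testable definition of done.
VAGUE_WORDS = ("good", "great", "nice", "better")


@dataclass(frozen=True)
class Finding:
    line: int      # 1-based line number within the instruction
    message: str


def find_vague_criteria(instruction: str) -> list[Finding]:
    """Flag each line that leans on a vague quality word."""
    findings = []
    for number, text in enumerate(instruction.splitlines(), start=1):
        for word in VAGUE_WORDS:
            if re.search(rf"\b{word}\b", text, flags=re.IGNORECASE):
                findings.append(
                    Finding(number, f'"{word}" has no testable definition of done')
                )
    return findings


for finding in find_vague_criteria("Write a good launch post for my product."):
    print(f"L{finding.line} {finding.message}")
# prints: L1 "good" has no testable definition of done
```

Because the check is a plain function of the text, running it twice on the same instruction reports the same line and the same message.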

Scored dimensions

Objective clarity, output specification, evaluation criteria, and more — each on a 0–5 scale with the evidence behind the mark.
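One way to picture per-dimension marks and the delta between two grades is the sketch below. The individual marks are illustrative assumptions; only the +4 on output specification echoes the example above, and `dimension_deltas` is a hypothetical helper rather than Rubrkit's scoring code.

```python
SCALE_MAX = 5  # each dimension is marked on a 0-5 scale


def dimension_deltas(before, after):
    """Return after-minus-before for every dimension present in both grades."""
    deltas = {}
    for name in before:
        if name not in after:
            continue
        for score in (before[name], after[name]):
            if not 0 <= score <= SCALE_MAX:
                raise ValueError(f"{name}: score {score} is outside 0-{SCALE_MAX}")
        deltas[name] = after[name] - before[name]
    return deltas


# Illustrative marks only, chosen so output specification moves by +4.
before = {"objective clarity": 3, "output specification": 1, "evaluation criteria": 2}
after = {"objective clarity": 4, "output specification": 5, "evaluation criteria": 4}

deltas = dimension_deltas(before, after)
moved_most = max(deltas, key=deltas.get)
print(f"{moved_most} {deltas[moved_most]:+d}")
# prints: output specification +4
```

The sketch makes no attempt to derive the overall 38 and 86 figures, since the page does not say how dimension marks roll up into a total.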

A behaviour change

The eval case the old version fails and the new one passes. Improvement you can watch happen rather than assert.
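A fail-to-pass flip can be sketched as the same check applied to the output of each version. Everything here is an assumption made for illustration: the `EvalCase` type, the `CTA:` marker, and both placeholder outputs. A real eval for the 30-second reader test would need a far richer check than a string count.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    check: Callable[[str], bool]  # True when an output passes the case


def behaviour_change(case: EvalCase, old_output: str, new_output: str) -> str:
    """Describe how the verdict moved between two versions' outputs."""
    before = case.check(old_output)
    after = case.check(new_output)
    if before == after:
        return "unchanged"
    return "FAIL -> PASS" if after else "PASS -> FAIL"


# Hypothetical stand-in check: the output contains exactly one call to action,
# marked here with the placeholder token "CTA:".
single_cta = EvalCase(
    name="ends with a single CTA",
    check=lambda output: output.count("CTA:") == 1,
)

old_output = "A post with no call to action."        # placeholder output of v1
new_output = "A post with one.\nCTA: Try it today."   # placeholder output of v2
print(behaviour_change(single_cta, old_output, new_output))
# prints: FAIL -> PASS
```

Reporting "PASS -> FAIL" as well as "FAIL -> PASS" means a regression shows up in the same way an improvement does.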

Reproducible and versioned

Run the same grade and eval on the same version and get the same result. The delta is anchored to a specific artifact, not a moment.
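Anchoring a result to a specific artifact could look something like this sketch, which assumes the artifact is identified by a hash of its exact text. The `artifact_id` and `record` helpers and the in-memory `results` store are hypothetical; the page does not describe how Rubrkit versions instructions.

```python
import hashlib


def artifact_id(instruction: str) -> str:
    """Derive a short, stable identifier from the instruction's exact text."""
    digest = hashlib.sha256(instruction.encode("utf-8")).hexdigest()
    return digest[:12]


results = {}  # artifact id -> recorded score


def record(instruction: str, score: int) -> str:
    """Store a score under the artifact id; reject a conflicting re-run."""
    key = artifact_id(instruction)
    if key in results and results[key] != score:
        raise ValueError(f"{key}: score changed from {results[key]} to {score}")
    results[key] = score
    return key


v1 = record("Write a good launch post for my product.", 38)
# Re-running the same grade on the same text lands on the same id.
assert record("Write a good launch post for my product.", 38) == v1
```

Changing a single character of the instruction produces a different id, so a score can never be silently attached to text it was not computed from.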

FAQ

Answers before you start.

A rewrite gives you a new version with no account of what changed or whether it is actually better. Rubrkit grades the original against a rubric, names the weak dimensions with evidence, and runs an eval case the old version fails and the new one passes. You see the score delta and the diff, not just a different paragraph.

Know which instructions are ready to run.

Grade an instruction

Newsletter

Follow the review loop as it ships.

Notes on AI artifact testing, rubr_flow conversion, evals, and proof reports.