Instructions you can trust to run.
Rubrkit grades prompts, agent specs, commands, skills, workflows, and rubr_flow procedures against a clear rubric, then turns the weak ones into testable, version-tracked rewrites. We exist so that “this prompt is ready” is a claim you can prove, not a hope.
AI instructions ship on vibes.
A prompt or agent spec goes straight from a draft into production without the kind of review a reasonable engineer would require for code. There is no score, no diff against a standard, and no test that proves it behaves. When it drifts, no one can say what changed or whether the fix actually worked.
We think the instructions agents follow deserve the same rigor as the software around them. So Rubrkit makes the standard explicit: a rubric to grade against, a rewrite that fixes the specific weaknesses, and an eval that proves the rewrite holds. Version history means you always know what is running and why.
The principles behind the review loop.
Clarity over vibes
An instruction is only ready when its objective, scope, and success criteria are explicit. We grade against a rubric, not a gut feeling, so "good enough" has a definition.
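To make the idea of grading against a rubric concrete, here is a minimal sketch in Python. It is not Rubrkit's actual rubric or API. The `Criterion` type, the `has_section` check, and the `grade` function are hypothetical names, and the checks only look for labeled lines such as "Objective:", which is a deliberately naive stand-in for real grading.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Criterion:
    """One rubric line: a name plus a check that passes or fails (hypothetical type)."""
    name: str
    check: Callable[[str], bool]


def has_section(label: str) -> Callable[[str], bool]:
    """Build a naive check: does any line start with '<label>:'? (illustrative only)"""
    prefix = label.lower() + ":"

    def check(text: str) -> bool:
        return any(line.strip().lower().startswith(prefix) for line in text.splitlines())

    return check


# Assumed rubric, taken from the three things the text says must be explicit.
RUBRIC = [
    Criterion("objective", has_section("Objective")),
    Criterion("scope", has_section("Scope")),
    Criterion("success criteria", has_section("Success criteria")),
]


def grade(instruction: str, rubric: List[Criterion] = RUBRIC) -> Tuple[float, List[str]]:
    """Return the fraction of criteria met and the names of the ones that are missing."""
    missing = [c.name for c in rubric if not c.check(instruction)]
    score = (len(rubric) - len(missing)) / len(rubric)
    return score, missing
```

Calling `grade("Summarize the ticket.")` reports every criterion as missing, which is the point: the gap is named rather than felt.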
Testable or it didn't happen
Every rewrite ships with the eval that proves it holds. If we cannot show the old version failing and the new one passing, we have not improved anything — we have just moved words around.
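The "old version fails, new version passes" test can be sketched in a few lines of Python. Everything here is an assumption made for illustration: `stub_model` stands in for a real model call, the pass condition is "output is JSON with a summary key", and none of these names come from Rubrkit itself.

```python
import json
from typing import Callable, List


def stub_model(prompt: str, ticket: str) -> str:
    """Stand-in for a real model call (assumption): emits JSON only when the prompt asks for it."""
    if "json" in prompt.lower():
        return json.dumps({"summary": ticket[:40]})
    return f"Summary: {ticket[:40]}"


def output_is_valid(output: str) -> bool:
    """Pass condition for this sketch: output parses as a JSON object with a 'summary' key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "summary" in data


def run_eval(prompt: str, cases: List[str],
             model: Callable[[str, str], str] = stub_model) -> bool:
    """An instruction passes only if every case produces valid output."""
    return all(output_is_valid(model(prompt, case)) for case in cases)


def rewrite_is_proven(old: str, new: str, cases: List[str],
                      model: Callable[[str, str], str] = stub_model) -> bool:
    """The rewrite counts as an improvement only if the old version fails and the new one passes."""
    return (not run_eval(old, cases, model)) and run_eval(new, cases, model)


OLD = "Summarize the ticket."
NEW = "Summarize the ticket. Respond as JSON with a 'summary' key."
CASES = ["Login fails after password reset", "Export button is disabled"]
```

With these inputs, `rewrite_is_proven(OLD, NEW, CASES)` is true, while comparing `NEW` against itself is false because there is no failing baseline to improve on.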
Version everything
Prompts, agents, and skills are production code. They deserve history, provenance, and a single source of truth you can roll back, not a paste buried in a chat log.
Built for people who ship
Rubrkit is for prompt engineers, AI power users, and developers who need their instructions to hold under real workloads. The tooling meets you where you work: API, CLI, and MCP.
See it on your own instructions.
Grade a prompt, command, skill, agent spec, or workflow, watch it get rewritten, and get the eval that proves the new version is better. Browse worked before-and-after specimens, or run the loop yourself.