Evals Are Your Competitive Edge: DIY Eval System vs. Eval Platform

Vibecoding an eval system is easy. Assisted by your favorite coding agent, you can have LLM-as-judge scorers running against your RAG pipeline, passing assertions, and returning justifications. A perfect starting point.

The harder question is whether the vibe-coded evals are actually measuring what you think they are and whether you'd know if they weren't.

This is the distinction that matters for build-vs-buy on evals. Not whether you can build it, but what you get when you do, and what gaps remain. Those gaps are not all the same kind. Some are engineering problems you could solve with more time: where to store the scores, how to compare runs, basic aggregation and analysis. Others are research problems that take weeks of calibration work to get right, and where a wrong answer is worse than no answer, because it gives you false confidence.

The Case Study

The example app is a simple RAG agent that answers questions about the Scorable documentation. It:

  1. Takes a user question
  2. Creates an embedding and retrieves the top 8 matching doc sections from pgvector
  3. Passes those sections to an OpenAI model to generate an answer
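The three steps can be sketched in miniature. This is an illustrative stand-in, not the case study's code: `DocSection`, `retrieve`, and the injected `embed`/`generate` callables are hypothetical, and the real app uses pgvector and an OpenAI model rather than an in-memory list.

```python
import math
from dataclasses import dataclass

@dataclass
class DocSection:
    title: str
    text: str
    embedding: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question_emb: list[float], sections: list[DocSection], k: int = 8) -> list[DocSection]:
    # In-memory stand-in for pgvector's "ORDER BY embedding <=> query LIMIT k"
    return sorted(sections, key=lambda s: cosine(question_emb, s.embedding), reverse=True)[:k]

def answer(question: str, embed, sections: list[DocSection], generate) -> str:
    # 1. embed the question  2. retrieve top-k sections  3. generate a grounded answer
    ctx = retrieve(embed(question), sections)
    prompt = "Answer using only this context:\n" + "\n---\n".join(s.text for s in ctx)
    return generate(prompt, question)
```

Everything downstream, including the evals, hangs off the `(question, retrieved context, answer)` triple this pipeline produces.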

Granted, this is yet another toy example, much like the ones you can find in "awesome-llm-apps"-style GitHub repos. The point is that it's exactly the kind of thing that ends up in production and then gets "improved" until something quietly breaks.

What We'll Do

Two passes at adding evals to this app:

Pass 1: Barebones eval system. Use Claude Code to write LLM-as-judge scorers from scratch. No eval framework, just prompts and Python. This is a deliberately naive baseline - not what a serious team would ship - but it makes the individual components visible.

Pass 2: Platform-backed eval system. Add Scorable with pre-built evaluators. Same questions, same agent, different infrastructure.

What's actually being compared

The surface API for both approaches looks similar: call something with a question, context, and answer; get back a score and a justification. What differs is evaluator reliability (is the score actually measuring what the name implies?), what the platform knows about your system, and what you can ask the system after the scores come in.


Chapter 2: Pass 1, Barebones Eval System

What We Built

A pytest file with three LLM-as-judge scorers, each returning a structured score object.

Three metrics, chosen because they cover the two failure modes a RAG system has: bad generation and bad retrieval.

| Metric | What it catches |
| --- | --- |
| Faithfulness | Answer invents facts not present in the retrieved context |
| Answer relevance | Answer is correct but off-topic, or ignores the question |
| Tool call quality | Retrieval wasn't triggered, or the search query was poor |

Each scorer is a Pydantic AI agent with a structured output type and a system prompt describing the rubric:
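A minimal sketch of what one such scorer looks like. The rubric wording is illustrative, not the case study's, and the actual Pydantic AI agent run is abstracted behind an injected `call_llm` callable so the deterministic parts stay testable without an API key.

```python
import json
from dataclasses import dataclass

# Illustrative rubric; a real one would enumerate partial-credit cases.
FAITHFULNESS_RUBRIC = """You are a strict grader. Given CONTEXT and ANSWER, reply with JSON:
{"score": <float 0.0-1.0>, "justification": "<one sentence>"}
Score 1.0 only if every claim in ANSWER is supported by CONTEXT."""

@dataclass
class Score:
    score: float
    justification: str

def parse_score(raw: str) -> Score:
    # Validate the judge's structured output; clamp out-of-range scores
    # rather than letting a chatty model break the test run.
    data = json.loads(raw)
    return Score(score=min(1.0, max(0.0, float(data["score"]))),
                 justification=str(data["justification"]))

def judge_faithfulness(context: str, answer: str, call_llm) -> Score:
    # call_llm(system_prompt, user_prompt) stands in for the agent run
    user = f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    return parse_score(call_llm(FAITHFULNESS_RUBRIC, user))
```

With Pydantic AI, the parse-and-validate step is what the structured output type buys you; the sketch just makes that step explicit.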

Running the agent on a test question and passing the result to a scorer gets back a score and a one-sentence justification. Extracting the retrieved context requires walking the agent's message history and pulling the tool-return content, workable but not obvious.

How It Runs

9 test cases (3 questions × 3 metrics). Each asserts that the score clears a fixed threshold. The justification is included in the assertion message so failures are self-explaining.
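The assertion pattern is worth showing, since it's what makes failures self-explaining. `Result` and the 0.7 threshold here are illustrative, not the case study's values.

```python
from typing import NamedTuple

class Result(NamedTuple):
    score: float
    justification: str

def assert_score(metric: str, result: Result, threshold: float = 0.7) -> None:
    # Put the judge's justification in the assertion message, so a failing
    # pytest run explains *why* without anyone re-running the agent.
    assert result.score >= threshold, (
        f"{metric} scored {result.score:.2f} < {threshold}: {result.justification}"
    )
```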

What It Took

Honestly? About 15 minutes of prompting. The scorers are just Pydantic AI agents, which are essentially opinionated wrappers around LLM calls. I had to iterate a few times and tell the coding agent to check the latest documentation with Context7.

What's Already Missing

After writing these eval tests, a few obvious gaps remain:

  • No history. You get a pass/fail today. You have no idea if scores improved or regressed since last week.
  • No aggregates. Average faithfulness across the dataset? Requires more code.
  • Flaky by nature. LLM-as-judge scores vary run to run, so any hard threshold will flap. Managing that requires more code. What you actually want is a confidence interval on the score, not a single sample.
  • Cost is invisible. Each test run makes 6 LLM calls per question: the 3 test functions each re-run the agent and then call one scorer. No tracking.
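One cheap way to manage the flakiness, assuming you can afford repeated judge calls per case, is to gate on a confidence interval rather than a single sample. A normal-approximation sketch:

```python
import math
import statistics

def score_interval(scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    # Run the judge N times and report (mean, lo, hi) for a ~95% normal-
    # approximation confidence interval; gate tests on `lo`, not one sample.
    mean = statistics.fmean(scores)
    if len(scores) < 2:
        return mean, mean, mean
    half = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - half, mean + half
```

Gating on `lo >= threshold` trades extra judge-call cost for far fewer flapping tests; the interval also tells you when your judge is too noisy to threshold at all.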

But there's a subtler problem that doesn't show up until later: you don't know if these scorers are any good. A faithfulness scorer that returns 0.8 looks fine. What you can't easily tell from a prompt-engineered scorer: does it catch partial hallucinations or only obvious fabrications? Does it agree with how a human expert would score the same response? Is it stable across different model versions, or does the score shift when you upgrade the underlying judge? These aren't prompt-writing questions; they're calibration questions, and answering them requires sustained empirical work against labeled data.

A note on the baseline: this example is intentionally naive. It's the quickest possible thing that produces a number. In practice, no serious team would start here. They'd reach for the open source ecosystem and look for tools like RAGAS or OpenEvals for pre-built metrics, DSPy or Pydantic evals for structured scoring and confidence estimation, and MLflow for run tracking and experiment comparison. That stack is genuinely competitive on the evaluator and run-tracking dimensions, and for many teams it's the right call. The fair comparison for a platform is against that, not the barebones version above. The point of doing it from scratch here is just to make the individual components visible before evaluating what a platform adds on top.


Chapter 3: Pass 2, Platform-Backed Eval System

What Changed

Same pytest structure, but the scorer agents are gone. Instead of prompting an LLM ourselves, we call Scorable's pre-built context-aware evaluators:

Two evaluators used:

| Evaluator | What it measures |
| --- | --- |
| Faithfulness | Is the answer grounded in the retrieved context? |
| Context Recall | Does the retrieved context contain enough information to produce the correct answer? |

Context Recall requires an expected output (a ground truth answer), which forced us to actually write down what correct answers look like, something the vibe-coded version skipped entirely. That forcing function is part of the value: a well-designed evaluator tells you what you need to know before you can use it correctly.

What the Built-In Evaluators Actually Are

The API call looks simple, but what's behind it is not a system prompt you could reproduce in 10 minutes. The difference shows up when you ask: how do you know if this evaluator is any good?

Scorable evaluators have calibration sets attached: collections of scored examples (a request, a response, and a known correct score) that serve as ground truth. When you run calibration, you get back a total deviance score — the average error between what the evaluator predicted and what the ground truth says it should be. That tells you concretely how reliable the evaluator is, and which specific samples it gets wrong. See the full calibration docs for how to interpret deviance and improve from there.
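Conceptually, the deviance computation is mean absolute error against the calibration labels. A sketch with hypothetical names, to make the idea concrete (the platform computes this for you):

```python
def total_deviance(predicted: dict[str, float], labels: dict[str, float]) -> float:
    # Mean absolute error between evaluator scores and ground-truth labels;
    # 0.0 means the evaluator reproduces the calibration set exactly.
    errors = [abs(predicted[k] - labels[k]) for k in labels]
    return sum(errors) / len(errors)

def worst_samples(predicted: dict[str, float], labels: dict[str, float], n: int = 3) -> list[str]:
    # The samples the evaluator gets most wrong are the ones to inspect first.
    return sorted(labels, key=lambda k: abs(predicted[k] - labels[k]), reverse=True)[:n]
```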

With a vibecoded scorer, you have no equivalent. You write a rubric, run it, get numbers back. But you have no external anchor for whether those numbers mean anything until something goes wrong in production.

Custom Evaluators

The Scorable built-in evaluators cover common RAG cases, but you can also define domain-specific ones. The same calibration mechanism applies: you attach scored examples, run calibration, and see the deviance. You can also pull samples from production execution logs into your calibration set as edge cases surface.

One practical problem with calibration sets is that building one from scratch is tedious. You need scored examples spread across the 0.0–1.0 range, and hand-crafting 10+ examples that meaningfully cover the spectrum takes time. Scorable has a ladder algorithm that handles this: given a predicate (say, "the response is grounded in the retrieved context") and one or two anchor examples, it generates synthetic examples at the missing score levels — filling left, right, or mid gaps in the range — and validates their consistency using a separate LLM check. It's the kind of tooling that only makes sense to build if you're doing calibration seriously, and it's what makes the calibration workflow practical rather than theoretical. See the custom evaluator docs for the full workflow.
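To make the gap-filling idea concrete, here's an illustrative sketch of just the rung-finding step over a coarse five-level ladder. The real algorithm generates the missing examples with an LLM and validates them with a separate check, both elided here; the rung values and tolerance are assumptions for illustration.

```python
RUNGS = (0.0, 0.25, 0.5, 0.75, 1.0)

def missing_rungs(anchor_scores: list[float], tol: float = 0.125) -> list[float]:
    # Score levels with no nearby anchor example: these are the left, right,
    # or mid gaps a generator would fill with synthetic calibration samples.
    return [r for r in RUNGS if all(abs(r - a) > tol for a in anchor_scores)]
```

Starting from a single 1.0 anchor, the whole lower range is missing; that's the tedium the ladder algorithm automates away.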

A note on ownership: Scorable does not own your evaluation logic. Your rubrics, datasets, and custom evaluator definitions are yours: exportable, versionable, and reimplementable elsewhere if needed.


Chapter 4: What You Get vs. What You Build

Let's be concrete about what each approach actually delivers.

A: Running evals and getting scores. Both approaches do this. You can have an LLM-as-judge scorer returning numbers within an afternoon. This is not a differentiator.

B: History, persistence, dashboards. The DIY version has none. You can add this with open-source run tracking, but it's plumbing you build and maintain rather than use. The platform ships it.

C: Evaluator quality and reliability. A well-calibrated evaluator is one that agrees with human judgment, handles edge cases correctly, and gives stable scores across model versions. Most developers building evals for the first time don't know what they don't know here. They'll write a rubric, see scores come back, and assume the scorer is working. Often it is. Sometimes it isn't, and the failure mode is that your evals pass while production degrades.

More practically: with a vibecoded scorer you have no way to measure this. You get a number; you don't know if the number is right. Scorable's calibration tooling gives you a concrete deviance score against known ground truth, and you can feed in production samples over time as your system encounters edge cases. Algorithms like GEPA can then optimize your evaluator prompts automatically against that labeled data. You can build this yourself, but you'd be building the measurement infrastructure before you can even start improving the evaluator.

D: Production instrumentation, alerting, and insights at scale. Getting scores from offline test runs is different from getting scores from live production traffic. Wiring that up requires instrumenting your agent, keeping dev and prod runs from polluting each other, building alerting that doesn't fire on every score fluctuation, and surfacing patterns across thousands of runs (which questions regress, which user segments score low, which prompt changes moved what). It's all solvable, and you can reuse existing infrastructure such as your OTel sink (and you likely should), but generic telemetry lacks LLM-specific features.

What "Doing It Yourself" Actually Means

A serious DIY stack, structured scoring, run tracking, calibrated evaluators and all, is achievable. It's also a significant ongoing investment. The evaluators alone require empirical validation against labeled data to be trustworthy. The infrastructure requires building the production instrumentation layer from scratch. Both require maintenance as your system evolves.

That's a legitimate choice if eval infrastructure is central to what you're building. For most product teams, the question is whether that's where their data science and engineering capacity should go, or whether it's overhead on the thing the eval loop exists to protect: the product.

The Score in Context

A score of 0.73 on faithfulness means almost nothing by itself. What you actually want to know:

  • Is 0.73 better or worse than last week?
  • Which questions score consistently low, and why?
  • Did the score drop after you changed the system prompt on Tuesday?
  • Is faithfulness lower for users on the free tier vs. paid?
  • When a user complains, can you pull up the exact run, its retrieved context, and its scores in one place?

None of these questions are answerable from a pytest output. They require both the infrastructure layer (B, D) and evaluators you can actually trust (C).

Conclusion

The eval loop that actually protects your product has three components: metrics worth tracking, evaluators that reliably measure them, and infrastructure that surfaces patterns over time.

Vibecoding gets you the first quickly. It can get you a version of the second, but with no external calibration you won't know how good it is, and finding out usually involves something going wrong in production. The third takes real engineering time regardless of which tools you use.

Owning your evaluation strategy (policies, datasets, your domain expertise) is the part that's genuinely yours and genuinely valuable. It encodes your understanding of your users and your failure modes, and no platform can replicate it.

For the evaluators themselves, the concrete question is: can you measure whether yours is reliable? Building a calibration set, running deviance checks, and improving the evaluator against ground truth is the actual work. A platform like Scorable gives you the tooling to do that. Without it, you're left inferring evaluator quality from production behavior, which is a slow and expensive feedback loop.