Evaluation Pipeline Design: What CI Evals Miss and How to Cover It
CI evals catch regressions in code. They don't catch production drift, prompt sensitivity, or behavioral changes in upstream models. Covering both requires two systems with different architectures.
The evaluation problem for ML systems has two distinct components that get conflated. CI evaluation — running a test suite before deployment — is one component. Production evaluation — measuring model quality against real traffic — is another. Most teams have the former; few have the latter. The systems have different architectures, different failure modes, and different operational requirements.
This piece covers both, with emphasis on the production evaluation system that teams consistently underinvest in.
CI evaluation: what it covers and what it doesn’t
A CI eval suite runs on a frozen dataset, catches code regressions, and gates deployment. What it cannot catch:
- Changes in production traffic distribution (your eval set is from last quarter)
- Upstream model behavior changes (the provider updated the model; your frozen inputs produce subtly different outputs)
- Prompt sensitivity under inputs that weren’t in your eval set
- Composition effects (a retrieval change that looks neutral in isolation but interacts badly with the model)
A CI eval suite is a regression test. It’s valuable, but it doesn’t tell you whether the model is working for today’s users.
The production eval architecture
Production evaluation requires a different system:
Online sampling: A fraction of production requests are routed to an evaluation pathway. The fraction depends on your traffic volume — for high-volume systems, 1% is sufficient. For low-volume systems, you may need 100% sampling with async processing.
Judge model scoring: A judge LLM scores each sampled response on the dimensions you care about. Common dimensions: relevance, accuracy, groundedness (for RAG systems), tone compliance, refusal appropriateness.
Stratification by feature/intent: Aggregate scores by the feature or user intent category. A single aggregate quality score hides the reality that quality for Feature A might be excellent while Feature B is degrading.
Anomaly detection on rolling aggregates: Alert when a dimension’s rolling 7-day average drops more than N standard deviations below the 30-day baseline. Do not alert on individual scores — too noisy.
Human review pipeline for judge-flagged samples: When the judge flags a response as “poor,” it should route to a human review queue. This closes the loop: judge findings drive ground-truth labeling, which improves future training.
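The anomaly-detection rule above reduces to a few lines. A minimal sketch, assuming the metrics store can hand back one mean score per day for a given feature and dimension (the function name and window handling are illustrative, not a production implementation):

    from statistics import mean, stdev

    def should_alert(daily_means: list[float], n_sigma: float = 1.5) -> bool:
        """daily_means: one mean judge score per day, oldest first, for a
        single (feature, dimension) pair."""
        if len(daily_means) < 37:        # need a 30-day baseline plus 7 recent days
            return False
        baseline = daily_means[-37:-7]   # the 30 days preceding the recent window
        recent = mean(daily_means[-7:])  # rolling 7-day average
        mu, sigma = mean(baseline), stdev(baseline)
        # Alert only when the 7-day average sits more than n_sigma standard
        # deviations below the baseline mean; never alert on single scores.
        return recent < mu - n_sigma * sigma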
Implementation sketch
import asyncio
import random
from datetime import datetime, timezone

class ProductionEvalPipeline:
    def __init__(self, judge_client, metrics_store, human_queue):
        self.judge = judge_client
        self.metrics = metrics_store
        self.human_queue = human_queue  # review queue for judge-flagged responses
        self.sample_rate = 0.05  # 5% of production traffic

    def maybe_eval(self, request, response, context):
        if random.random() > self.sample_rate:
            return
        # Fire-and-forget: schedule the coroutine on the running event loop
        # instead of awaiting it, so the request path never blocks on the judge.
        asyncio.create_task(self._eval_async(request, response, context))

    async def _eval_async(self, request, response, context):
        scores = await self.judge.score(
            prompt=request.prompt,
            response=response.text,
            context=context.retrieved_documents,
            dimensions=["relevance", "groundedness", "completeness"],
        )
        # Record with enough dimensions to stratify later.
        self.metrics.record(
            feature=request.feature_name,
            user_segment=request.user_segment,
            model_version=response.model_version,
            scores=scores,
            timestamp=datetime.now(timezone.utc),  # timezone-aware UTC timestamp
        )
        # Route poor responses to human review (any dimension below 2.0).
        if any(s < 2.0 for s in scores.values()):
            await self.human_queue.enqueue(request, response, scores)
The key design choices:
- Async: production request latency is not affected
- Stratified: scores are recorded with enough dimensions (feature, segment, model version) to diagnose where a regression lives
- Feedback loop: poor responses reach human reviewers
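Wiring this into a request handler is what keeps latency flat. A sketch under the assumption that the service runs on asyncio and `serve` is your existing handler (both names are hypothetical):

    pipeline = ProductionEvalPipeline(judge_client, metrics_store, human_queue)

    async def handle_request(request):
        response, context = await serve(request)        # normal production path
        pipeline.maybe_eval(request, response, context)  # schedules a task, returns instantly
        return response                                  # the user never waits on the judge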
Judge model selection
The judge model choice is consequential. Principles:
Use a different model family than production. If production is Claude, judge with GPT-4 class. If production is GPT-4, judge with Claude. Same-family judges have systematic agreement biases — when the production model makes a class of error, the same-family judge tends to make the same error.
Calibrate the judge. Run the judge against a human-labeled dataset and measure agreement. Track the calibration over time — models change, and judge accuracy drifts too.
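"Measure agreement" can start as simply as a within-tolerance match rate against the human labels; Cohen's kappa is a stricter follow-up. A sketch, assuming judge and human scores use the same integer rubric:

    def judge_agreement(judge_scores: list[int], human_scores: list[int],
                        tolerance: int = 1) -> float:
        """Fraction of samples where the judge lands within `tolerance`
        points of the human label. Track this over time: a drop means
        the judge needs re-prompting or replacement."""
        assert len(judge_scores) == len(human_scores)
        hits = sum(abs(j - h) <= tolerance
                   for j, h in zip(judge_scores, human_scores))
        return hits / len(judge_scores)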
Limit dimensions. Scoring on 10 dimensions produces noisy, uncalibrated results. Score on 3-4 dimensions with clear rubrics. Fewer dimensions, better calibration.
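For illustration, one way a clear rubric can look inside the judge prompt; the anchor wording below is an assumption, not a canonical rubric:

    GROUNDEDNESS_RUBRIC = """Score groundedness from 1 to 5.
    5: every factual claim is directly supported by the retrieved documents.
    3: claims are mostly supported, with minor unsupported details.
    1: central claims are absent from or contradicted by the documents.
    Return only the integer score."""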
Don’t use the judge for individual decisions. Judge scores are noisy at the individual response level. They’re signal at the aggregate level. Use them to compute rolling averages and trend lines, not to make decisions about individual responses.
Connecting CI and production evaluation
The two systems should share the eval dataset infrastructure but differ in what triggers them and what they measure.
CI evaluation:
- Triggered by code change
- Runs on frozen golden dataset
- Measures: did behavior regress from baseline?
- Blocking gate
Production evaluation:
- Continuous, triggered by traffic
- Runs on live requests
- Measures: is behavior acceptable for current users?
- Alerting system, not gate
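The blocking-gate half can be a single pytest check against the frozen baseline. A sketch where `run_eval_suite` is a hypothetical helper returning mean scores per dimension, and the 0.05 tolerance is illustrative:

    import json

    def test_no_regression():
        with open("evals/baseline_scores.json") as f:
            baseline = json.load(f)                     # frozen per-dimension means
        current = run_eval_suite("evals/golden.jsonl")  # hypothetical helper
        for dim, base in baseline.items():
            # Block the deploy if any dimension regresses beyond tolerance.
            assert current[dim] >= base - 0.05, f"{dim} regressed: {current[dim]:.2f} < {base:.2f}"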
The golden dataset in CI evaluation should be periodically updated from the human-reviewed production samples. This is the connection: production evaluation generates labeled examples that improve the CI eval set’s representativeness.
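A minimal sketch of that promotion step, assuming reviewed samples carry a human label and the golden set is a deduplicated JSONL file (the file layout and field names are assumptions):

    import json
    from pathlib import Path

    def promote_reviewed_samples(reviewed: list[dict], golden_path: Path) -> int:
        """Append human-reviewed production samples to the CI golden set,
        skipping prompts already present so the set stays deduplicated."""
        existing = {json.loads(line)["prompt"]
                    for line in golden_path.read_text().splitlines() if line}
        added = 0
        with golden_path.open("a") as f:
            for sample in reviewed:
                if sample["prompt"] in existing:
                    continue
                f.write(json.dumps({
                    "prompt": sample["prompt"],
                    "expected": sample["human_label"],  # ground truth from review
                    "source": "production",             # provenance for audits
                }) + "\n")
                added += 1
        return added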
What good looks like operationally
An operational production eval system:
- A dashboard per major feature showing 7-day rolling quality scores per dimension
- Anomaly alerts that trigger when a score drops 1.5 standard deviations below the rolling mean
- A human review queue with a defined SLA (all flagged samples reviewed within 48 hours)
- A monthly review meeting where eval trends are discussed and the golden dataset is updated
The last item is the hardest. Tooling covers the first three; the discipline of the monthly review is what distinguishes teams that maintain eval quality from teams whose eval set is 18 months stale.
For production monitoring of the infrastructure layer alongside quality metrics, MLOps Platforms tracks how the platforms differ in their eval integration story: some embed it, some expect you to bring your own.