Evaluation Pipeline Design: What CI Evals Miss and How to Cover It
CI evals catch regressions in code. They don't catch production drift, prompt sensitivity, or behavioral changes in upstream models. Covering both requires two systems with different architectures.
The evaluation problem for ML systems has two distinct components that get conflated. CI evaluation — running a test suite before deployment — is one component. Production evaluation — measuring model quality against real traffic — is another. Most teams have the former; few have the latter. The systems have different architectures, different failure modes, and different operational requirements.
This piece covers both, with emphasis on the production evaluation system that teams consistently underinvest in.
CI evaluation: what it covers and what it doesn’t
A CI eval suite runs on a frozen dataset, catches code regressions, and gates deployment. What it cannot catch:
- Changes in production traffic distribution (your eval set is from last quarter)
- Upstream model behavior changes (the provider updated the model; your frozen inputs produce subtly different outputs)
- Prompt sensitivity under inputs that weren’t in your eval set
- Composition effects (a retrieval change that looks neutral in isolation but interacts badly with the model)
A CI eval suite is a regression test. It’s valuable, but it doesn’t tell you whether the model is working for today’s users.
The production eval architecture
Production evaluation requires a different system:
Online sampling: A fraction of production requests are routed to an evaluation pathway. The fraction depends on your traffic volume — for high-volume systems, 1% is sufficient. For low-volume systems, you may need 100% sampling with async processing.
Judge model scoring: A judge LLM scores each sampled response on the dimensions you care about. Common dimensions: relevance, accuracy, groundedness (for RAG systems), tone compliance, refusal appropriateness.
Stratification by feature/intent: Aggregate scores by the feature or user intent category. A single aggregate quality score hides the reality that quality for Feature A might be excellent while Feature B is degrading.
Anomaly detection on rolling aggregates: Alert when a dimension’s rolling 7-day average drops more than N standard deviations below the 30-day baseline. Do not alert on individual scores — too noisy.
Human review pipeline for judge-flagged samples: When the judge flags a response as “poor,” it should route to a human review queue. This closes the loop: judge findings drive ground-truth labeling, which improves future training.
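The anomaly-detection rule above reduces to a few lines. A minimal sketch, assuming the metrics store can hand back one mean score per day for a given feature and dimension (the function name and window handling are illustrative, not a production implementation):

    from statistics import mean, stdev

    def should_alert(daily_means: list[float], n_sigma: float = 1.5) -> bool:
        """daily_means: one mean judge score per day, oldest first, for a
        single (feature, dimension) pair."""
        if len(daily_means) < 37:        # need a 30-day baseline plus 7 recent days
            return False
        baseline = daily_means[-37:-7]   # the 30 days preceding the recent window
        recent = mean(daily_means[-7:])  # rolling 7-day average
        mu, sigma = mean(baseline), stdev(baseline)
        # Alert only when the 7-day average sits more than n_sigma standard
        # deviations below the baseline mean; never alert on single scores.
        return recent < mu - n_sigma * sigma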
Implementation sketch
import asyncio
import random
from datetime import datetime, timezone

class ProductionEvalPipeline:
    def __init__(self, judge_client, metrics_store, human_queue):
        self.judge = judge_client
        self.metrics = metrics_store
        self.human_queue = human_queue  # review queue for judge-flagged responses
        self.sample_rate = 0.05  # 5% of production traffic

    def maybe_eval(self, request, response, context):
        if random.random() > self.sample_rate:
            return
        # Fire-and-forget: schedule the coroutine on the running event loop
        # instead of awaiting it, so the request path never blocks on the judge.
        asyncio.create_task(self._eval_async(request, response, context))

    async def _eval_async(self, request, response, context):
        scores = await self.judge.score(
            prompt=request.prompt,
            response=response.text,
            context=context.retrieved_documents,
            dimensions=["relevance", "groundedness", "completeness"],
        )
        # Record with enough dimensions to stratify later.
        self.metrics.record(
            feature=request.feature_name,
            user_segment=request.user_segment,
            model_version=response.model_version,
            scores=scores,
            timestamp=datetime.now(timezone.utc),  # timezone-aware UTC timestamp
        )
        # Route poor responses to human review (any dimension below 2.0).
        if any(s < 2.0 for s in scores.values()):
            await self.human_queue.enqueue(request, response, scores)
The key design choices:
- Async: production request latency is not affected
- Stratified: scores are recorded with enough dimensions (feature, segment, model version) to diagnose where a regression lives
- Feedback loop: poor responses reach human reviewers
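Wiring this into a request handler is what keeps latency flat. A sketch under the assumption that the service runs on asyncio and `serve` is your existing handler (both names are hypothetical):

    pipeline = ProductionEvalPipeline(judge_client, metrics_store, human_queue)

    async def handle_request(request):
        response, context = await serve(request)        # normal production path
        pipeline.maybe_eval(request, response, context)  # schedules a task, returns instantly
        return response                                  # the user never waits on the judge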
Judge model selection
The judge model choice is consequential. Principles:
Use a different model family than production. If production is Claude, judge with GPT-4 class. If production is GPT-4, judge with Claude. Same-family judges have systematic agreement biases — when the production model makes a class of error, the same-family judge tends to make the same error.
Calibrate the judge. Run the judge against a human-labeled dataset and measure agreement. Track the calibration over time — models change, and judge accuracy drifts too.
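"Measure agreement" can start as simply as a within-tolerance match rate against the human labels; Cohen's kappa is a stricter follow-up. A sketch, assuming judge and human scores use the same integer rubric:

    def judge_agreement(judge_scores: list[int], human_scores: list[int],
                        tolerance: int = 1) -> float:
        """Fraction of samples where the judge lands within `tolerance`
        points of the human label. Track this over time: a drop means
        the judge needs re-prompting or replacement."""
        assert len(judge_scores) == len(human_scores)
        hits = sum(abs(j - h) <= tolerance
                   for j, h in zip(judge_scores, human_scores))
        return hits / len(judge_scores)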
Limit dimensions. Scoring on 10 dimensions produces noisy, uncalibrated results. Score on 3-4 dimensions with clear rubrics. Fewer dimensions, better calibration.
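For illustration, one way a clear rubric can look inside the judge prompt; the anchor wording below is an assumption, not a canonical rubric:

    GROUNDEDNESS_RUBRIC = """Score groundedness from 1 to 5.
    5: every factual claim is directly supported by the retrieved documents.
    3: claims are mostly supported, with minor unsupported details.
    1: central claims are absent from or contradicted by the documents.
    Return only the integer score."""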
Don’t use the judge for individual decisions. Judge scores are noisy at the individual response level. They’re signal at the aggregate level. Use them to compute rolling averages and trend lines, not to make decisions about individual responses.
Connecting CI and production evaluation
The two systems should share the eval dataset infrastructure but differ in what triggers them and what they measure.
CI evaluation:
- Triggered by code change
- Runs on frozen golden dataset
- Measures: did behavior regress from baseline?
- Blocking gate
Production evaluation:
- Continuous, triggered by traffic
- Runs on live requests
- Measures: is behavior acceptable for current users?
- Alerting system, not gate
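The blocking-gate half can be a single pytest check against the frozen baseline. A sketch where `run_eval_suite` is a hypothetical helper returning mean scores per dimension, and the 0.05 tolerance is illustrative:

    import json

    def test_no_regression():
        with open("evals/baseline_scores.json") as f:
            baseline = json.load(f)                     # frozen per-dimension means
        current = run_eval_suite("evals/golden.jsonl")  # hypothetical helper
        for dim, base in baseline.items():
            # Block the deploy if any dimension regresses beyond tolerance.
            assert current[dim] >= base - 0.05, f"{dim} regressed: {current[dim]:.2f} < {base:.2f}"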
The golden dataset in CI evaluation should be periodically updated from the human-reviewed production samples. This is the connection: production evaluation generates labeled examples that improve the CI eval set’s representativeness.
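A minimal sketch of that promotion step, assuming reviewed samples carry a human label and the golden set is a deduplicated JSONL file (the file layout and field names are assumptions):

    import json
    from pathlib import Path

    def promote_reviewed_samples(reviewed: list[dict], golden_path: Path) -> int:
        """Append human-reviewed production samples to the CI golden set,
        skipping prompts already present so the set stays deduplicated."""
        existing = {json.loads(line)["prompt"]
                    for line in golden_path.read_text().splitlines() if line}
        added = 0
        with golden_path.open("a") as f:
            for sample in reviewed:
                if sample["prompt"] in existing:
                    continue
                f.write(json.dumps({
                    "prompt": sample["prompt"],
                    "expected": sample["human_label"],  # ground truth from review
                    "source": "production",             # provenance for audits
                }) + "\n")
                added += 1
        return added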
What good looks like operationally
An operational production eval system:
- A dashboard per major feature showing 7-day rolling quality scores per dimension
- Anomaly alerts that trigger when a score drops 1.5 standard deviations below the rolling mean
- A human review queue with a defined SLA (all flagged samples reviewed within 48 hours)
- A monthly review meeting where eval trends are discussed and the golden dataset is updated
The last item is the hardest. Tooling covers the first three; the discipline of the monthly review is what distinguishes teams that maintain eval quality from teams whose eval set is 18 months stale.
For production monitoring of the infrastructure layer alongside quality metrics, MLOps Platforms tracks how the platforms differ in their eval integration story: some embed it, some expect you to bring your own.