
Evaluation Pipeline Design: What CI Evals Miss and How to Cover It

CI evals catch regressions in code. They don't catch production drift, prompt sensitivity, or behavioral changes in upstream models. Building an eval system that covers both requires a different architecture.

By Priya Anand · 8 min read

The evaluation problem for ML systems has two distinct components that get conflated. CI evaluation — running a test suite before deployment — is one component. Production evaluation — measuring model quality against real traffic — is another. Most teams have the former; few have the latter. The systems have different architectures, different failure modes, and different operational requirements.

This article covers both, with emphasis on the production evaluation system that teams consistently underinvest in.

CI evaluation: what it covers and what it doesn’t

A CI eval suite runs on a frozen dataset, catches code regressions, and gates deployment. What it cannot catch:

Production drift: real traffic shifts away from the frozen dataset, so a green CI run says little about today’s queries.

Prompt sensitivity: a prompt edit can pass the suite while degrading behavior on inputs the suite doesn’t cover.

Upstream model changes: a hosted model’s behavior can shift without any deploy on your side, so CI never even runs.

A CI eval suite is a regression test. It’s valuable, but it doesn’t tell you whether the model is working for today’s users.

The production eval architecture

Production evaluation requires a different system:

Online sampling: A fraction of production requests are routed to an evaluation pathway. The fraction depends on your traffic volume — for high-volume systems, 1% is sufficient. For low-volume systems, you may need 100% sampling with async processing.

Judge model scoring: A judge LLM scores each sampled response on the dimensions you care about. Common dimensions: relevance, accuracy, groundedness (for RAG systems), tone compliance, refusal appropriateness.

Stratification by feature/intent: Aggregate scores by the feature or user intent category. A single aggregate quality score hides the reality that quality for Feature A might be excellent while Feature B is degrading.

Anomaly detection on rolling aggregates: Alert when a dimension’s rolling 7-day average drops more than N standard deviations below the 30-day baseline. Do not alert on individual scores — too noisy.

Human review pipeline for judge-flagged samples: When the judge flags a response as “poor,” it should route to a human review queue. This closes the loop: judge findings drive ground-truth labeling, which improves future training.
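The alerting rule above can be sketched directly. A minimal version, assuming daily mean scores per dimension are stored oldest-first (the function name and the 3-sigma default are illustrative, not part of any particular platform):

```python
from statistics import mean, stdev

def should_alert(daily_means, n_sigma=3.0):
    """Alert when the 7-day rolling average drops more than n_sigma
    standard deviations below a 30-day baseline.

    daily_means: daily mean scores for one dimension, oldest first.
    Needs 37 days of history: 30 for the baseline plus the current week.
    """
    if len(daily_means) < 37:
        return False  # not enough history to form a baseline
    baseline_window = daily_means[-37:-7]  # the 30 days before this week
    baseline = mean(baseline_window)
    sigma = stdev(baseline_window)
    rolling_7d = mean(daily_means[-7:])
    return rolling_7d < baseline - n_sigma * sigma
```

Excluding the current week from the baseline window keeps a fresh regression from dragging the baseline down along with the rolling average.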

Implementation sketch

import asyncio
import random
from datetime import datetime, timezone

class ProductionEvalPipeline:
    def __init__(self, judge_client, metrics_store, human_queue):
        self.judge = judge_client
        self.metrics = metrics_store
        self.human_queue = human_queue
        self.sample_rate = 0.05  # 5% of production traffic

    def maybe_eval(self, request, response, context):
        if random.random() > self.sample_rate:
            return
        # Fire and forget: evaluation must never block the request path.
        asyncio.create_task(self._eval_async(request, response, context))

    async def _eval_async(self, request, response, context):
        scores = await self.judge.score(
            prompt=request.prompt,
            response=response.text,
            context=context.retrieved_documents,
            dimensions=["relevance", "groundedness", "completeness"],
        )

        self.metrics.record(
            feature=request.feature_name,
            user_segment=request.user_segment,
            model_version=response.model_version,
            scores=scores,
            timestamp=datetime.now(timezone.utc),
        )

        # Route poor responses to human review
        if any(s < 2.0 for s in scores.values()):
            await self.human_queue.enqueue(request, response, scores)

The key design choices:

Evaluation off the hot path: maybe_eval returns immediately for unsampled requests and schedules the judge call asynchronously, so evaluation never adds user-facing latency.

Stratification at write time: feature, user segment, and model version are attached to every score, so aggregates can be sliced later without re-scoring.

A review threshold, not a verdict: a low score on any single dimension routes the sample to the human queue; the judge nominates candidates, humans decide.

Judge model selection

The judge model choice is consequential. Principles:

Use a different model family than production. If production is Claude, judge with GPT-4 class. If production is GPT-4, judge with Claude. Same-family judges have systematic agreement biases — when the production model makes a class of error, the same-family judge tends to make the same error.

Calibrate the judge. Run the judge against a human-labeled dataset and measure agreement. Track the calibration over time — models change, and judge accuracy drifts too.

Limit dimensions. Scoring on 10 dimensions produces noisy, uncalibrated results. Score on 3-4 dimensions with clear rubrics. Fewer dimensions, better calibration.

Don’t use the judge for individual decisions. Judge scores are noisy at the individual response level. They’re signal at the aggregate level. Use them to compute rolling averages and trend lines, not to make decisions about individual responses.
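Calibration can be as simple as agreement statistics against that human-labeled set. A sketch (function names are illustrative) computing raw agreement and chance-corrected Cohen's kappa:

```python
from collections import Counter

def agreement_rate(judge_labels, human_labels):
    """Fraction of samples where the judge's label matches the human's."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

def cohens_kappa(judge_labels, human_labels):
    """Agreement corrected for chance: 1.0 is perfect, 0.0 is chance-level."""
    n = len(judge_labels)
    observed = agreement_rate(judge_labels, human_labels)
    j_freq = Counter(judge_labels)
    h_freq = Counter(human_labels)
    labels = set(judge_labels) | set(human_labels)
    expected = sum(j_freq[l] * h_freq[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters always give one label
    return (observed - expected) / (1 - expected)
```

Tracking kappa rather than raw agreement matters when one label dominates: a judge that calls everything "good" can score high raw agreement while carrying no signal.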

Connecting CI and production evaluation

The two systems should share the eval dataset infrastructure but differ in what triggers them and what they measure.

CI evaluation:

Triggered by a pull request or deploy candidate, runs against the frozen golden dataset, and produces a pass/fail gate on the release.

Production evaluation:

Triggered continuously by sampled live traffic, scores whatever users actually send, and produces rolling aggregates and drift alerts rather than a gate.

The golden dataset in CI evaluation should be periodically updated from the human-reviewed production samples. This is the connection: production evaluation generates labeled examples that improve the CI eval set’s representativeness.
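A sketch of that promotion step, assuming golden-set entries and reviewed samples are dicts keyed by prompt (all names and shapes here are hypothetical):

```python
def refresh_golden_set(golden, reviewed_samples, max_new=50):
    """Fold human-reviewed production samples into the CI golden set.

    Skips prompts already in the golden set and caps the number added
    per review cycle so the set grows deliberately, not unboundedly.
    Returns the samples that were added.
    """
    seen = {entry["prompt"] for entry in golden}
    added = []
    for sample in reviewed_samples:
        if sample["prompt"] in seen:
            continue
        golden.append({"prompt": sample["prompt"],
                       "expected": sample["human_label"]})
        seen.add(sample["prompt"])
        added.append(sample)
        if len(added) >= max_new:
            break
    return added
```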

What good looks like operationally

An operational production eval system:

Alerts on quality drift within days, not when users start complaining.

Breaks scores out by feature, user segment, and model version, so a regression in a single slice is visible.

Keeps the human review queue short enough that flagged samples actually get labeled.

Runs a monthly review that folds human-reviewed production samples back into the golden dataset.

The last item is the hardest. The tooling is available. The discipline of the monthly review is what distinguishes teams that maintain eval quality from teams whose eval set is 18 months stale.

For production monitoring of the infrastructure layer alongside quality metrics, MLOps Platforms tracks how the platforms differ in their eval integration story — some embed it, some expect you to bring your own.

#evaluation #evals #llm-testing #mlops #ci-cd #production-ml
