MLOps Platforms

Online Inference Latency: Where the Budget Actually Goes

P99 latency is a product problem as much as an engineering one. Breaking down the inference budget — model compute, preprocessing, retrieval, postprocessing — is the prerequisite for fixing it.

By Priya Anand · 8 min read

Most latency debugging conversations start in the wrong place. The team runs a profiler, sees that model forward-pass time is the dominant component, and concludes that hardware is the problem. Buy bigger GPUs. Problem solved.

This is sometimes correct and often wrong. The model compute is visible because it’s the thing teams measure. The overhead before and after the model — the preprocessing, feature hydration, postprocessing, serialization, network, and queueing — is often larger, and almost always more fixable.

Where the latency actually goes

A production inference pipeline has more stages than a benchmark. From request receipt to response delivery:

  1. Ingress / load balancer (0.5–2ms for well-configured systems; can spike to 50ms+ under misconfigured timeouts)
  2. Request parsing and validation (0.1–1ms for structured inputs; higher for large payloads)
  3. Feature hydration (1–50ms+, depending on whether features are cached, in Redis, or require synchronous computation)
  4. Model preprocessing (tokenization for NLP, image resize/normalize for vision, embedding lookup — 1–20ms)
  5. Model forward pass (this is the one people measure)
  6. Postprocessing (output parsing, threshold application, calibration — often neglected)
  7. Downstream calls (if the inference result triggers another API, add its latency)
  8. Response serialization and egress (rarely significant, unless you’re returning large embeddings or multi-modal outputs)

The typical pattern we've observed: the forward pass is the most carefully measured stage, but feature hydration and downstream calls are the ones that actually dominate the tail.

Instrumenting to find the real bottleneck

Effective latency debugging requires end-to-end tracing, not just model timing. The minimum instrumentation:

import time
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def run_inference(request):
    with tracer.start_as_current_span("inference_pipeline"):
        
        # Feature hydration: the span must wrap the call, not follow it,
        # otherwise the trace records a zero-length stage.
        t0 = time.perf_counter()
        with tracer.start_as_current_span("feature_hydration") as span:
            features = hydrate_features(request.user_id, request.context)
            span.set_attribute("cache_hit", features.from_cache)
        t1 = time.perf_counter()
        
        with tracer.start_as_current_span("preprocessing"):
            model_input = preprocess(features)
        t2 = time.perf_counter()
        
        with tracer.start_as_current_span("model_forward"):
            raw_output = model(model_input)
        t3 = time.perf_counter()
        
        with tracer.start_as_current_span("postprocessing"):
            result = postprocess(raw_output)
        t4 = time.perf_counter()
        
    return result, {
        "feature_hydration_ms": (t1 - t0) * 1000,
        "preprocessing_ms": (t2 - t1) * 1000,
        "model_forward_ms": (t3 - t2) * 1000,
        "postprocessing_ms": (t4 - t3) * 1000,
    }

This gives you a breakdown per request. Aggregate by percentile (P50, P95, P99) per stage. You will find surprises.
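
As a minimal sketch of that aggregation step, assuming the per-request timing dicts are collected into a list (numpy does the percentile math; the stage names match the keys returned by run_inference):

import numpy as np

STAGES = ["feature_hydration_ms", "preprocessing_ms", "model_forward_ms", "postprocessing_ms"]

def stage_percentiles(timings, percentiles=(50, 95, 99)):
    # `timings` is a list of the per-request dicts returned by run_inference.
    breakdown = {}
    for stage in STAGES:
        values = [t[stage] for t in timings]
        breakdown[stage] = {f"p{p}": float(np.percentile(values, p)) for p in percentiles}
    return breakdown

In production you would more likely emit these as histogram metrics and let the monitoring backend compute the percentiles, but the per-stage breakdown is the same.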

Hardware is the last resort, not the first

The common mistake is throwing hardware at a software problem: a faster GPU shaves the forward pass, but if the tail lives in feature hydration, queueing, or a downstream call, P99 barely moves.
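
If the trace shows feature hydration dominating, the fix is usually a cache in front of the feature store, not faster compute. A minimal read-through sketch, assuming redis-py and a hypothetical compute_features() slow path (names are illustrative; the from_cache flag feeds the trace attribute used earlier):

import json
from dataclasses import dataclass

import redis

r = redis.Redis(host="localhost", port=6379, socket_timeout=0.01)  # ~10ms cache budget

@dataclass
class Features:
    values: dict
    from_cache: bool

def hydrate_features(user_id, context):
    key = f"features:{user_id}"
    try:
        cached = r.get(key)
        if cached is not None:
            return Features(json.loads(cached), from_cache=True)
    except redis.exceptions.RedisError:
        pass  # treat cache errors as misses; a flaky cache should not add to P99
    values = compute_features(user_id, context)  # hypothetical synchronous slow path
    try:
        r.set(key, json.dumps(values), ex=300)  # 5-minute TTL, illustrative
    except redis.exceptions.RedisError:
        pass
    return Features(values, from_cache=False)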

When compute genuinely is the bottleneck, confirmed by the per-stage breakdown rather than assumed, model- and hardware-level optimization is the right lever.

The P99 target question

What’s the right latency budget? The correct answer is product-dependent, not benchmarked against the industry.

Useful questions: is the prediction blocking a user-facing interaction, or feeding an asynchronous job? At what latency does user behavior or downstream business logic actually start to degrade?

For synchronous user-facing features, 200ms P99 is a reasonable ceiling before UX research starts showing measurable drop-off. For background jobs, latency budgets can be orders of magnitude looser.

Set the target from product requirements, not from benchmarks. Then instrument, find the bottleneck, and fix it in the right layer. That sequence applies to monitoring platforms too — ML Monitoring Report covers the tooling for tracking latency alongside quality metrics in the same dashboard.

Deployment infrastructure choices

Ray Serve: Good for Python-native ML stacks. Handles dynamic batching, has a coherent deployment config, and integrates with the Ray ecosystem. Reasonable operational overhead.
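
As a rough sketch of the shape of a deployment (Ray Serve 2.x-style API; the model loading and batch handling here are illustrative assumptions, not a drop-in config):

from ray import serve

@serve.deployment(num_replicas=2)
class Predictor:
    def __init__(self):
        self.model = load_model()  # hypothetical loader for your framework

    # Dynamic batching: individual requests are grouped into lists of up to 16,
    # waiting at most 10ms for the batch to fill.
    @serve.batch(max_batch_size=16, batch_wait_timeout_s=0.01)
    async def predict(self, inputs):
        return [self.model(x) for x in inputs]

    async def __call__(self, request):
        payload = await request.json()
        return await self.predict(payload)

app = Predictor.bind()  # started with serve.run(app) or the `serve run` CLI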

Triton Inference Server: Best-in-class GPU utilization for GPU-resident models. Handles model ensembles (preprocessing model + main model + postprocessing model) as a single pipeline. Overkill for CPU or small-scale deployments.

TorchServe: Natural fit if your models are PyTorch-native and your team knows PyTorch. The handler pattern is predictable; the operational overhead is moderate.
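
The pattern, roughly, is a handler class built on TorchServe's BaseHandler (a sketch; the payload parsing and tensor shapes are assumptions for illustration):

import torch
from ts.torch_handler.base_handler import BaseHandler

class MyHandler(BaseHandler):
    # TorchServe calls preprocess -> inference -> postprocess per batch;
    # self.model is loaded by BaseHandler.initialize() from the model archive.

    def preprocess(self, data):
        # `data` is a list of request payloads; turn each into a tensor (illustrative parsing).
        rows = [torch.as_tensor(item.get("data") or item.get("body")) for item in data]
        return torch.stack(rows)

    def inference(self, inputs):
        with torch.no_grad():
            return self.model(inputs)

    def postprocess(self, outputs):
        # Return one entry per request in the batch.
        return outputs.tolist()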

Serverless (AWS Lambda, Cloud Run): Correct for sparse traffic patterns where cold-start is acceptable. Wrong for latency-sensitive production workloads with predictable traffic.

The right choice depends on traffic shape, model size, and team familiarity. There is no universally correct answer.

Sources

  1. Ray Serve Documentation
  2. Triton Inference Server
  3. vLLM Documentation
#inference #latency #mlops #production-ml #performance #serving