Online Inference Latency: Where the Budget Actually Goes
P99 latency is a product problem as much as an engineering one. Breaking down the inference budget — model compute, preprocessing, retrieval, postprocessing — is the prerequisite for fixing it.
Most latency debugging conversations start in the wrong place. The team runs a profiler, sees that model forward-pass time is the dominant component, and concludes that hardware is the problem. Buy bigger GPUs. Problem solved.
This is sometimes correct and often wrong. The model compute is visible because it’s the thing teams measure. The overhead before and after the model — the preprocessing, feature hydration, postprocessing, serialization, network, and queueing — is often larger, and almost always more fixable.
Where the latency actually goes
A production inference pipeline has more stages than a benchmark. From request receipt to response delivery:
- Ingress / load balancer (0.5–2ms for well-configured systems; can spike to 50ms+ under misconfigured timeouts)
- Request parsing and validation (0.1–1ms for structured inputs; higher for large payloads)
- Feature hydration (1–50ms+, depending on whether features are cached, in Redis, or require synchronous computation)
- Model preprocessing (tokenization for NLP, image resize/normalize for vision, embedding lookup — 1–20ms)
- Model forward pass (this is the one people measure)
- Postprocessing (output parsing, threshold application, calibration — often neglected)
- Downstream calls (if the inference result triggers another API, add its latency)
- Response serialization and egress (rarely significant, unless you’re returning large embeddings or multi-modal outputs)
The typical patterns we’ve observed:
- Compute-bound workloads: Large language models with long context windows, large transformer classifiers on CPU. Model compute is genuinely the bottleneck. Hardware upgrades or quantization help.
- Feature-bound workloads: Models that require real-time feature hydration — joining user features from a Redis cluster, computing recency signals. The model is fast; the feature lookup is slow. Better cache design, denormalization, or async prefetch are the solutions (see the sketch after this list).
- Queue-bound workloads: Bursty traffic patterns where requests pile up faster than the server can drain them. The model is fast at steady state; the burst handling is broken. Autoscaling configuration and queue depth alerting are the fixes.
- Network-bound workloads: Models deployed in the wrong region relative to the request source, or serving large uncompressed payloads. Regional co-location and payload compression are the fixes.
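For the feature-bound case specifically, the win usually comes from overlapping the feature lookup with other request work rather than touching the model. A minimal sketch of async prefetch using redis-py's asyncio client; the key layout and the validate_and_parse / run_model helpers are illustrative assumptions, not a prescribed design:

import asyncio

import redis.asyncio as redis

cache = redis.Redis(host="feature-cache", port=6379)  # placeholder host

async def prefetch_user_features(user_id):
    # One MGET round trip instead of one GET per feature.
    # Hypothetical key layout: "user:{id}:{feature_name}".
    names = ("recency", "frequency", "segment")
    values = await cache.mget([f"user:{user_id}:{n}" for n in names])
    return dict(zip(names, values))

async def handle_request(request):
    # Kick off the feature fetch immediately, do parsing/validation while it
    # is in flight, and await the result only when it is actually needed.
    feature_task = asyncio.create_task(prefetch_user_features(request.user_id))
    parsed = validate_and_parse(request)   # hypothetical helper
    features = await feature_task
    return run_model(parsed, features)     # hypothetical helper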
Instrumenting to find the real bottleneck
Effective latency debugging requires end-to-end tracing, not just model timing. The minimum instrumentation:
import time

from opentelemetry import trace

tracer = trace.get_tracer(__name__)


def run_inference(request):
    # One parent span for the whole pipeline, one child span per stage,
    # plus wall-clock timestamps so the breakdown can be returned per request.
    with tracer.start_as_current_span("inference_pipeline"):
        t0 = time.perf_counter()
        with tracer.start_as_current_span("feature_hydration") as span:
            features = hydrate_features(request.user_id, request.context)
            span.set_attribute("cache_hit", features.from_cache)
        t1 = time.perf_counter()

        with tracer.start_as_current_span("preprocessing"):
            model_input = preprocess(features)
        t2 = time.perf_counter()

        with tracer.start_as_current_span("model_forward"):
            raw_output = model(model_input)
        t3 = time.perf_counter()

        with tracer.start_as_current_span("postprocessing"):
            result = postprocess(raw_output)
        t4 = time.perf_counter()

        return result, {
            "feature_hydration_ms": (t1 - t0) * 1000,
            "preprocessing_ms": (t2 - t1) * 1000,
            "model_forward_ms": (t3 - t2) * 1000,
            "postprocessing_ms": (t4 - t3) * 1000,
        }
This gives you a breakdown per request. Aggregate by percentile (P50, P95, P99) per stage. You will find surprises.
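One way to do that aggregation, assuming the per-request timing dicts returned by run_inference have been collected into a list (numpy is the only dependency here):

import numpy as np

def stage_percentiles(timings, percentiles=(50, 95, 99)):
    # timings: non-empty list of per-request dicts like run_inference returns.
    stages = timings[0].keys()
    return {
        stage: {
            f"p{p}": float(np.percentile([t[stage] for t in timings], p))
            for p in percentiles
        }
        for stage in stages
    }

# e.g. stage_percentiles(collected)["model_forward_ms"]["p99"]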
Hardware is the last resort, not the first
The common mistake is throwing hardware at a software problem:
- Feature cache miss rate is 40%: the reflex is to buy more Redis memory. Wrong answer: the issue is more likely a cold-start pattern or a key-expiry policy. Fix the policy, not the hardware.
- P99 is 300ms, P50 is 20ms: The tail latency distribution looks like a garbage collection pause or a lock contention issue. More GPU doesn’t help with GC pauses.
- Throughput is fine, latency spikes on burst: Autoscaling is reactive. The burst hits before new instances are healthy. Solution: better scale-up triggers (custom metrics or scheduled pre-warming), not more baseline capacity. A pre-warming sketch follows this list.
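Scheduled pre-warming does not need to be elaborate: a script fired by cron (or whatever scheduler your platform provides) a few minutes before a known traffic spike, sending synthetic requests so instances are scaled up and caches are hot when real traffic lands. A minimal sketch, with the endpoint and payload as placeholders:

import time

import requests

WARMUP_URL = "https://inference.internal/predict"          # placeholder endpoint
SYNTHETIC_PAYLOAD = {"user_id": "warmup", "context": {}}   # placeholder payload

def prewarm(n_requests=50, pause_s=0.1):
    # Best-effort synthetic burst: enough load to trip autoscaling and fill
    # model/feature caches before the real spike arrives.
    for _ in range(n_requests):
        try:
            requests.post(WARMUP_URL, json=SYNTHETIC_PAYLOAD, timeout=2)
        except requests.RequestException:
            pass
        time.sleep(pause_s)

if __name__ == "__main__":
    prewarm()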
When compute genuinely is the bottleneck:
- Quantization (INT8, INT4): For many transformer models, 4-bit quantization reduces memory by 4x and increases throughput substantially with minimal accuracy loss. vLLM handles this for LLMs. Test accuracy on your eval set before deploying.
- Batching: Dynamic batching groups multiple requests into a single forward pass. For GPU-resident models with small individual inputs, this can improve throughput by roughly 10x. Ray Serve and Triton both support configurable batch sizes; a Ray Serve sketch follows this list.
- Model distillation: A smaller model that meets your accuracy bar is always worth evaluating. Distillation produces a student model that approximates the teacher at a fraction of the size.
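On batching, Ray Serve exposes dynamic batching through its serve.batch decorator, as referenced above. A minimal sketch; the model loading and the batch parameters are placeholders to tune against your own traffic:

from ray import serve

@serve.deployment
class Predictor:
    def __init__(self):
        self.model = load_model()  # placeholder: load your model here

    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.01)
    async def predict_batch(self, inputs):
        # Ray Serve collects individual calls into `inputs` (a list) and
        # expects one output per input, in the same order.
        outputs = self.model(inputs)
        return list(outputs)

    async def __call__(self, request):
        payload = await request.json()
        return await self.predict_batch(payload)

app = Predictor.bind()
# serve.run(app)  # start serving; each HTTP request is batched transparently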
The P99 target question
What’s the right latency budget? The correct answer is product-dependent, not benchmarked against the industry.
Useful questions:
- What’s the user-facing impact of a 200ms response vs. a 500ms response for this specific feature?
- Is this synchronous (user is waiting) or asynchronous (background job)?
- What does the cost curve look like? Is there a budget target per inference that constrains the hardware? A rough cost sketch follows below.
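On that cost question, the number that matters is cost per inference at realistic utilization, not the instance's list price. A back-of-the-envelope sketch where every input is an illustrative assumption to replace with your own:

# All inputs below are illustrative assumptions, not real prices or SLOs.
instance_cost_per_hour = 1.20     # $/hour for the serving instance
sustained_throughput_rps = 40     # requests/second the instance can serve
average_utilization = 0.5         # fraction of that capacity actually used

requests_per_hour = sustained_throughput_rps * average_utilization * 3600
cost_per_1k = instance_cost_per_hour / requests_per_hour * 1000
print(f"${cost_per_1k:.4f} per 1,000 inferences")  # about $0.0167 with these inputs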
For synchronous user-facing features, 200ms P99 is a reasonable ceiling before UX research starts showing measurable drop-off. For background jobs, latency budgets can be orders of magnitude looser.
Set the target from product requirements, not from benchmarks. Then instrument, find the bottleneck, and fix it in the right layer. That sequence applies to monitoring platforms too — ML Monitoring Report covers the tooling for tracking latency alongside quality metrics in the same dashboard.
Deployment infrastructure choices
Ray Serve: Good for Python-native ML stacks. Handles dynamic batching, has a coherent deployment config, and integrates with the Ray ecosystem. Reasonable operational overhead.
Triton Inference Server: Best-in-class GPU utilization for GPU-resident models. Handles model ensembles (preprocessing model + main model + postprocessing model) as a single pipeline. Overkill for CPU or small-scale deployments.
TorchServe: Natural fit if your models are PyTorch-native and your team knows PyTorch. The handler pattern is predictable; the operational overhead is moderate.
Serverless (AWS Lambda, Cloud Run): Correct for sparse traffic patterns where cold-start is acceptable. Wrong for latency-sensitive production workloads with predictable traffic.
The right choice depends on traffic shape, model size, and team familiarity. There is no universally correct answer.