Online Inference Latency: Where the Budget Actually Goes
P99 latency is a product problem as much as an engineering one. Breaking down the inference budget — model compute, preprocessing, retrieval, postprocessing — is the prerequisite for fixing it.
Most latency debugging conversations start in the wrong place. The team runs a profiler, sees that model forward-pass time is the dominant component, and concludes that hardware is the problem. Buy bigger GPUs. Problem solved.
This is sometimes correct and often wrong. The model compute is visible because it’s the thing teams measure. The overhead before and after the model — the preprocessing, feature hydration, postprocessing, serialization, network, and queueing — is often larger, and almost always more fixable.
Where the latency actually goes
A production inference pipeline has more stages than a benchmark. From request receipt to response delivery:
- Ingress / load balancer (0.5–2ms for well-configured systems; can spike to 50ms+ under misconfigured timeouts)
- Request parsing and validation (0.1–1ms for structured inputs; higher for large payloads)
- Feature hydration (1–50ms+, depending on whether features are cached, in Redis, or require synchronous computation)
- Model preprocessing (tokenization for NLP, image resize/normalize for vision, embedding lookup — 1–20ms)
- Model forward pass (this is the one people measure)
- Postprocessing (output parsing, threshold application, calibration — often neglected)
- Downstream calls (if the inference result triggers another API, add its latency)
- Response serialization and egress (rarely significant, unless you’re returning large embeddings or multi-modal outputs)
The typical patterns we’ve observed:
- Compute-bound workloads: Large language models with long context windows, large transformer classifiers on CPU. Model compute is genuinely the bottleneck. Hardware upgrades or quantization help.
- Feature-bound workloads: Models that require real-time feature hydration — joining user features from a Redis cluster, computing recency signals. The model is fast; the feature lookup is slow. Better cache design, denormalization, or async prefetch are the solutions (see the sketch after this list).
- Queue-bound workloads: Bursty traffic patterns where requests pile up faster than the server can drain them. The model is fast at steady state; the burst handling is broken. Autoscaling configuration and queue depth alerting are the fixes.
- Network-bound workloads: Models deployed in the wrong region relative to the request source, or serving large uncompressed payloads. Regional co-location and payload compression are the fixes.
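For the feature-bound case specifically, the win usually comes from overlapping the feature lookup with other request work rather than touching the model. A minimal sketch of async prefetch using redis-py's asyncio client; the key layout and the validate_and_parse / run_model helpers are illustrative assumptions, not a prescribed design:

import asyncio

import redis.asyncio as redis

cache = redis.Redis(host="feature-cache", port=6379)  # placeholder host

async def prefetch_user_features(user_id):
    # One MGET round trip instead of one GET per feature.
    # Hypothetical key layout: "user:{id}:{feature_name}".
    names = ("recency", "frequency", "segment")
    values = await cache.mget([f"user:{user_id}:{n}" for n in names])
    return dict(zip(names, values))

async def handle_request(request):
    # Kick off the feature fetch immediately, do parsing/validation while it
    # is in flight, and await the result only when it is actually needed.
    feature_task = asyncio.create_task(prefetch_user_features(request.user_id))
    parsed = validate_and_parse(request)   # hypothetical helper
    features = await feature_task
    return run_model(parsed, features)     # hypothetical helper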
Instrumenting to find the real bottleneck
Effective latency debugging requires end-to-end tracing, not just model timing. The minimum instrumentation:
import time

from opentelemetry import trace

tracer = trace.get_tracer(__name__)


def run_inference(request):
    # One parent span for the whole pipeline, one child span per stage,
    # plus wall-clock timestamps so the breakdown can be returned per request.
    with tracer.start_as_current_span("inference_pipeline"):
        t0 = time.perf_counter()
        with tracer.start_as_current_span("feature_hydration") as span:
            features = hydrate_features(request.user_id, request.context)
            span.set_attribute("cache_hit", features.from_cache)
        t1 = time.perf_counter()

        with tracer.start_as_current_span("preprocessing"):
            model_input = preprocess(features)
        t2 = time.perf_counter()

        with tracer.start_as_current_span("model_forward"):
            raw_output = model(model_input)
        t3 = time.perf_counter()

        with tracer.start_as_current_span("postprocessing"):
            result = postprocess(raw_output)
        t4 = time.perf_counter()

        return result, {
            "feature_hydration_ms": (t1 - t0) * 1000,
            "preprocessing_ms": (t2 - t1) * 1000,
            "model_forward_ms": (t3 - t2) * 1000,
            "postprocessing_ms": (t4 - t3) * 1000,
        }
This gives you a breakdown per request. Aggregate by percentile (P50, P95, P99) per stage. You will find surprises.
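One way to do that aggregation, assuming the per-request timing dicts returned by run_inference have been collected into a list (numpy is the only dependency here):

import numpy as np

def stage_percentiles(timings, percentiles=(50, 95, 99)):
    # timings: non-empty list of per-request dicts like run_inference returns.
    stages = timings[0].keys()
    return {
        stage: {
            f"p{p}": float(np.percentile([t[stage] for t in timings], p))
            for p in percentiles
        }
        for stage in stages
    }

# e.g. stage_percentiles(collected)["model_forward_ms"]["p99"]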
Hardware is the last resort, not the first
The common mistake is throwing hardware at a software problem:
- Feature cache miss rate is 40%: the reflex is to buy more Redis memory. Wrong answer: the issue is more likely a cold-start pattern or a key-expiry policy. Fix the policy, not the hardware.
- P99 is 300ms, P50 is 20ms: The tail latency distribution looks like a garbage collection pause or a lock contention issue. More GPU doesn’t help with GC pauses.
- Throughput is fine, latency spikes on burst: Autoscaling is reactive. The burst hits before new instances are healthy. Solution: better scale-up triggers (custom metrics or scheduled pre-warming), not more baseline capacity. A pre-warming sketch follows this list.
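Scheduled pre-warming does not need to be elaborate: a script fired by cron (or whatever scheduler your platform provides) a few minutes before a known traffic spike, sending synthetic requests so instances are scaled up and caches are hot when real traffic lands. A minimal sketch, with the endpoint and payload as placeholders:

import time

import requests

WARMUP_URL = "https://inference.internal/predict"          # placeholder endpoint
SYNTHETIC_PAYLOAD = {"user_id": "warmup", "context": {}}   # placeholder payload

def prewarm(n_requests=50, pause_s=0.1):
    # Best-effort synthetic burst: enough load to trip autoscaling and fill
    # model/feature caches before the real spike arrives.
    for _ in range(n_requests):
        try:
            requests.post(WARMUP_URL, json=SYNTHETIC_PAYLOAD, timeout=2)
        except requests.RequestException:
            pass
        time.sleep(pause_s)

if __name__ == "__main__":
    prewarm()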
When compute genuinely is the bottleneck:
- Quantization (INT8, INT4): For many transformer models, 4-bit quantization reduces memory by 4x and increases throughput substantially with minimal accuracy loss. vLLM handles this for LLMs. Test accuracy on your eval set before deploying.
- Batching: Dynamic batching groups multiple requests into a single forward pass. For GPU-resident models with small individual inputs, this can improve throughput by roughly 10x. Ray Serve and Triton both support configurable batch sizes; a Ray Serve sketch follows this list.
- Model distillation: A smaller model that meets your accuracy bar is always worth evaluating. Distillation produces a student model that approximates the teacher at a fraction of the size.
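On batching, Ray Serve exposes dynamic batching through its serve.batch decorator, as referenced above. A minimal sketch; the model loading and the batch parameters are placeholders to tune against your own traffic:

from ray import serve

@serve.deployment
class Predictor:
    def __init__(self):
        self.model = load_model()  # placeholder: load your model here

    @serve.batch(max_batch_size=32, batch_wait_timeout_s=0.01)
    async def predict_batch(self, inputs):
        # Ray Serve collects individual calls into `inputs` (a list) and
        # expects one output per input, in the same order.
        outputs = self.model(inputs)
        return list(outputs)

    async def __call__(self, request):
        payload = await request.json()
        return await self.predict_batch(payload)

app = Predictor.bind()
# serve.run(app)  # start serving; each HTTP request is batched transparently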
The P99 target question
What’s the right latency budget? The correct answer is product-dependent, not benchmarked against the industry.
Useful questions:
- What’s the user-facing impact of a 200ms response vs. a 500ms response for this specific feature?
- Is this synchronous (user is waiting) or asynchronous (background job)?
- What does the cost curve look like? Is there a budget target per inference that constrains the hardware? A rough cost sketch follows below.
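On that cost question, the number that matters is cost per inference at realistic utilization, not the instance's list price. A back-of-the-envelope sketch where every input is an illustrative assumption to replace with your own:

# All inputs below are illustrative assumptions, not real prices or SLOs.
instance_cost_per_hour = 1.20     # $/hour for the serving instance
sustained_throughput_rps = 40     # requests/second the instance can serve
average_utilization = 0.5         # fraction of that capacity actually used

requests_per_hour = sustained_throughput_rps * average_utilization * 3600
cost_per_1k = instance_cost_per_hour / requests_per_hour * 1000
print(f"${cost_per_1k:.4f} per 1,000 inferences")  # about $0.0167 with these inputs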
For synchronous user-facing features, 200ms P99 is a reasonable ceiling before UX research starts showing measurable drop-off. For background jobs, latency budgets can be orders of magnitude looser.
Set the target from product requirements, not from benchmarks. Then instrument, find the bottleneck, and fix it in the right layer. That sequence applies to monitoring platforms too — ML Monitoring Report covers the tooling for tracking latency alongside quality metrics in the same dashboard.
Deployment infrastructure choices
Ray Serve: Good for Python-native ML stacks. Handles dynamic batching, has a coherent deployment config, and integrates with the Ray ecosystem. Reasonable operational overhead.
Triton Inference Server: Best-in-class GPU utilization for GPU-resident models. Handles model ensembles (preprocessing model + main model + postprocessing model) as a single pipeline. Overkill for CPU or small-scale deployments.
TorchServe: Natural fit if your models are PyTorch-native and your team knows PyTorch. The handler pattern is predictable; the operational overhead is moderate.
Serverless (AWS Lambda, Cloud Run): Correct for sparse traffic patterns where cold-start is acceptable. Wrong for latency-sensitive production workloads with predictable traffic.
The right choice depends on traffic shape, model size, and team familiarity. There is no universally correct answer.