Inference Cost Optimization: Autoscaling, Batching, Spot
Inference cost is dominated by idle capacity and underused accelerators, not by the per-request price.
Most inference cost conversations start at the wrong number. The team looks at the per-request price or the per-hour instance rate, decides it’s reasonable, and moves on. Then the monthly bill arrives and it’s three times the back-of-envelope estimate. The gap is almost never the per-request price. It’s the capacity you paid for and didn’t use: accelerators sitting at 30% utilization, endpoints running at 3am with no traffic, instances over-provisioned because the autoscaler reacts to the wrong signal.
Inference cost optimization is the discipline of attacking that waste. The three highest-leverage levers — autoscaling on the right metric, dynamic batching, and disciplined spot usage — each target a different source of it. This is how to apply them without breaking the latency budget you also have to hit.
Where inference money actually goes
Three buckets dominate inference spend, and they’re rarely the ones teams instrument first:
Idle capacity. An endpoint provisioned for peak that runs at baseline most of the day is burning money on every idle instance-hour. For spiky or intermittent traffic, idle capacity can be the single largest line item.
Underused accelerators. A GPU processing one request at a time delivers a fraction of the throughput you are paying for. GPU economics reward keeping the device busy; a model serving small individual requests without batching leaves most of the silicon idle on every forward pass.
Mis-tuned autoscaling. An autoscaler that reacts late forces you to over-provision as a safety margin, and one that scales on the wrong metric (raw GPU utilization, when queue depth is what actually predicts load) either wastes capacity or fails to add it in time.
Fix these three and the per-request price stops mattering, because you’ve stopped paying for capacity you don’t use.
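The arithmetic behind that claim is short enough to sketch. The function below is illustrative only: its name, the $4.00 hourly rate, and the 50 requests-per-second capacity are hypothetical figures, not numbers from any provider.

```python
def cost_per_1k_requests(hourly_rate: float, peak_rps: float, utilization: float) -> float:
    """Effective cost per 1,000 requests served by one instance.

    hourly_rate: instance price per hour (hypothetical figure)
    peak_rps:    requests/second the instance serves when fully busy
    utilization: fraction of that capacity actually used, in (0, 1]
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    served_per_hour = peak_rps * utilization * 3600
    return hourly_rate / served_per_hour * 1000


# Same instance, same list price, different utilization.
for u in (1.0, 0.6, 0.3):
    price = cost_per_1k_requests(hourly_rate=4.0, peak_rps=50, utilization=u)
    print(f"utilization {u:.0%}: ${price:.4f} per 1k requests")
```

The hourly rate never changes, but at 30% utilization each request effectively costs about 3.3 times what it would on a fully busy instance, which is the kind of gap that shows up between an estimate and a bill.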
Lever 1: Autoscaling on the metric that predicts load
Autoscaling is the default cost lever, and the default configuration is usually wrong. The standard mistake is scaling on GPU or CPU utilization alone. Utilization is a lagging, noisy signal for inference: it tells you the device is busy now, not that a backlog is forming. Scaling on utilization alone tends to over-provision to stay safe.
The better signals are demand-shaped: requests-per-second, queue depth, and batch size align capacity with the actual load arriving at the endpoint. A request-count or queue-depth trigger reacts to the work waiting, not the work in flight. Managed platforms support this to varying degrees — SageMaker can scale on invocation-based metrics rather than only resource utilization, which is closer to the right signal out of the box.
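A demand-shaped trigger can be sketched as a small sizing function. This is a minimal illustration, not any platform's scaling algorithm: the function name, the 30-second drain window, and the per-replica throughput are assumptions, and managed platforms typically express the same idea as a scaling policy rather than code you write yourself.

```python
import math


def desired_replicas(
    arrival_rps: float,
    queue_depth: int,
    per_replica_rps: float,
    drain_seconds: float = 30.0,  # assumed: how quickly the backlog should clear
    min_replicas: int = 1,
    max_replicas: int = 20,
) -> int:
    """Size the fleet from work arriving plus work already waiting."""
    # Extra throughput needed to clear the current backlog within the window.
    backlog_rps = queue_depth / drain_seconds
    needed = math.ceil((arrival_rps + backlog_rps) / per_replica_rps)
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, needed))


# A backlog of 600 requests pushes the target up before utilization would show it.
print(desired_replicas(arrival_rps=120, queue_depth=0, per_replica_rps=40))
print(desired_replicas(arrival_rps=120, queue_depth=600, per_replica_rps=40))
```

With an empty queue the target is three replicas; with 600 requests waiting it rises to four, because the backlog itself counts as demand.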
The most impactful autoscaling improvement is at the bottom of the curve, not the top: scale to zero. For intermittent workloads, an endpoint that drops to zero instances during inactivity removes idle cost entirely. SageMaker added scale-down-to-zero for inference endpoints (announced at re:Invent 2024), and serverless serving designs achieve the same effect by construction. The trade is cold-start latency on the first request after idle — acceptable for internal or asynchronous workloads, often not for synchronous user-facing ones. Match the pattern to the traffic.
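To see what scale-to-zero buys on an intermittent workload, the sketch below counts billed instance-hours over a day of traffic. It is a deliberately coarse model under stated assumptions: a single replica, hourly billing granularity, and an idle timeout measured in whole hours. The function name and the traffic profile are hypothetical.

```python
def simulate_day(hourly_requests, idle_timeout_hours=1, scale_to_zero=True):
    """Return (billed_hours, cold_starts) for one replica over a traffic profile."""
    if not scale_to_zero:
        # Always-on endpoint: every hour is billed, no cold starts.
        return len(hourly_requests), 0

    billed = cold_starts = 0
    up = False
    idle_streak = 0
    for count in hourly_requests:
        if count > 0:
            if not up:
                cold_starts += 1  # first request after idle pays the cold start
                up = True
            idle_streak = 0
            billed += 1
        elif up:
            idle_streak += 1
            if idle_streak > idle_timeout_hours:
                up = False  # timeout expired: scale down, stop billing
            else:
                billed += 1  # still up, waiting out the idle timeout
    return billed, cold_starts


# Hypothetical internal tool: traffic only between 08:00 and 18:00.
profile = [0] * 8 + [120] * 10 + [0] * 6
print("always on:    ", simulate_day(profile, scale_to_zero=False))
print("scale to zero:", simulate_day(profile))
```

On this profile the always-on endpoint bills 24 hours while the scale-to-zero one bills 11 and pays a single cold start, which is the trade described above in miniature.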
A practical autoscaling checklist:
- Scale on a demand signal (requests, queue depth), not raw utilization alone.
- Set the minimum replica count from your actual baseline traffic, not from fear.
- Use scale-to-zero for intermittent workloads where cold start is tolerable.
- Pre-warm ahead of known traffic spikes rather than relying on reactive scaling to catch a burst — by the time the autoscaler reacts, the burst has already hit.
Lever 2: Dynamic batching for accelerator efficiency
Batching is the single most effective way to raise GPU utilization. A GPU running one small request at a time wastes most of its throughput; combining multiple requests into one forward pass amortizes the fixed overhead across them. Dynamic batching does this server-side: it combines one or more inference requests into a single batch to maximize throughput, with a configurable delay window so the scheduler can collect a few more requests before dispatching.
NVIDIA Triton Inference Server implements this directly: dynamic batching is configured per model in the model’s config.pbtxt, with parameters such as preferred_batch_size and max_queue_delay_microseconds, the latter bounding how long the batcher waits to aggregate requests. Triton can also run multiple instances of the same model concurrently on the same GPU (via instance_group), packing more work onto each device. Ray Serve and other serving layers offer comparable dynamic-batching mechanisms.
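The mechanism itself is small. The following is a generic sketch of a dynamic batcher in plain asyncio, not Triton's or Ray Serve's implementation; the class name, its parameters, and the stand-in model function are all hypothetical.

```python
import asyncio


class DynamicBatcher:
    """Collect requests until the batch is full or the delay window closes."""

    def __init__(self, run_batch, max_batch_size=8, max_queue_delay_s=0.005):
        self.run_batch = run_batch            # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_queue_delay_s = max_queue_delay_s
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one request and wait for its individual result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def worker(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block for the first request, then start the delay window.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_queue_delay_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break  # window closed: dispatch what we have
            try:
                # Synchronous call for brevity; a real model call would be offloaded.
                outputs = self.run_batch([item for item, _ in batch])
            except Exception as exc:
                for _, fut in batch:
                    fut.set_exception(exc)
                continue
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def main():
    batch_sizes = []

    def fake_model(inputs):  # stand-in for one forward pass over a batch
        batch_sizes.append(len(inputs))
        return [x * 2 for x in inputs]

    batcher = DynamicBatcher(fake_model, max_batch_size=4, max_queue_delay_s=0.05)
    worker = asyncio.create_task(batcher.worker())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results, batch_sizes


results, batch_sizes = asyncio.run(main())
print(results)
print(batch_sizes)
```

Ten concurrent requests are served in a handful of forward passes instead of ten, and the last, partially filled batch is the one that waits out the delay window. That wait is the latency cost the delay parameter controls.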
The trade-off is explicit and tunable: batching raises throughput at the cost of a small latency increase, because requests wait briefly to be grouped. The max-queue-delay parameter is the control knob. As a general rule, batching is the most beneficial single change for GPU utilization — but you set the delay against your latency budget, not against a benchmark. Industry reports commonly cite double-digit-percentage throughput-per-accelerator improvements from batching and mixed-precision inference; treat your own measured numbers as authoritative.
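Setting the delay against the budget can be written down as a subtraction. This sketch assumes the worst case for a request is waiting out the full delay and then the full batch compute time; the function name and all millisecond figures are hypothetical.

```python
def max_queue_delay_ms(p99_budget_ms: float, batch_compute_ms: float, overhead_ms: float = 0.0) -> float:
    """Upper bound on the batching delay that still fits the latency budget.

    p99_budget_ms:    end-to-end latency target for the endpoint
    batch_compute_ms: measured forward-pass time at the intended batch size
    overhead_ms:      network, preprocessing, and postprocessing time
    """
    slack = p99_budget_ms - batch_compute_ms - overhead_ms
    return max(0.0, slack)  # no slack means no room to wait for a batch


# 100 ms budget, 60 ms batched forward pass, 25 ms of other overhead.
print(max_queue_delay_ms(100, 60, 25))
```

Here the ceiling is 15 ms. If the result is zero, the budget cannot absorb any batching delay at that batch size, and the options are a smaller batch or a faster model.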
When batching is combined with autoscaling, the two interact: a higher effective batch size means each instance handles more load, which lets the autoscaler hold fewer instances. Tune them together, not in isolation.
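The interaction is easy to quantify once you have measured batch latencies. A minimal sketch, assuming one batch in flight per instance; the function name and the latency figures (20 ms at batch size 1, 40 ms at batch size 8) are hypothetical, so substitute your own measurements.

```python
import math


def instances_needed(peak_rps: float, batch_size: int, batch_latency_ms: float) -> int:
    """Instances required at peak, assuming one batch in flight per instance."""
    per_instance_rps = batch_size * 1000 / batch_latency_ms
    return math.ceil(peak_rps / per_instance_rps)


# Hypothetical measurements: a batch of 8 takes twice as long as a single request
# but carries eight times the work.
print(instances_needed(peak_rps=900, batch_size=1, batch_latency_ms=20))
print(instances_needed(peak_rps=900, batch_size=8, batch_latency_ms=40))
```

Under these assumed numbers the fleet at peak drops from 18 instances to 5, which also changes what the autoscaler's minimum and maximum replica counts should be.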
Lever 3: Spot capacity, used with discipline
Spot instances (AWS) and Spot VMs (GCP) offer steep discounts — commonly cited in the range of 50–80% off on-demand — in exchange for the provider’s right to reclaim the capacity. The discipline is in knowing where they fit.
Where spot works: fault-tolerant and interruption-tolerant workloads. Batch inference, asynchronous inference, and offline scoring jobs that can checkpoint and resume are ideal — an interruption costs you a restart, not a failed user request. Vertex AI documents Spot VM support for inference, and the same logic applies across providers.
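Checkpoint-and-resume for an offline scoring job can be sketched with nothing but the standard library. The helper names, the JSON checkpoint layout, and the simulated reclaim are all assumptions for illustration; a real job would write to durable storage rather than local disk.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return saved progress, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "out_bytes": 0}


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a reclaim never leaves a half-written file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def score(records, predict, ckpt_path, out_path, chunk=100, interrupt_at=None):
    """Score records in chunks; interrupt_at simulates a spot reclaim."""
    state = load_checkpoint(ckpt_path)
    with open(out_path, "ab") as out:
        # Drop any output written after the last checkpoint to avoid duplicates.
        out.truncate(state["out_bytes"])
        i = state["next_index"]
        while i < len(records):
            if interrupt_at is not None and i >= interrupt_at:
                raise RuntimeError("simulated spot reclaim")
            batch = records[i:i + chunk]
            data = "".join(f"{predict(r)}\n" for r in batch).encode()
            out.write(data)
            out.flush()
            i += len(batch)
            state = {"next_index": i, "out_bytes": state["out_bytes"] + len(data)}
            save_checkpoint(ckpt_path, state)


with tempfile.TemporaryDirectory() as d:
    ckpt, out_file = os.path.join(d, "ckpt.json"), os.path.join(d, "scores.txt")
    records = list(range(250))
    try:
        score(records, lambda r: r * 2, ckpt, out_file, interrupt_at=150)
    except RuntimeError:
        pass  # the instance was reclaimed; a replacement picks up below
    resumed_from = load_checkpoint(ckpt)["next_index"]
    score(records, lambda r: r * 2, ckpt, out_file)
    with open(out_file) as f:
        lines = f.read().split()

print(resumed_from, len(lines))
```

The second run starts from the last checkpointed record rather than from zero, so the interruption costs at most one chunk of repeated work.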
Where spot does not work: latency-sensitive real-time inference behind a synchronous user request. A reclaim event mid-traffic is an outage. Real-time endpoints generally should not run on spot capacity unless you’ve engineered a robust on-demand fallback that absorbs reclaims without dropping traffic — at which point you’re managing meaningful complexity for the discount.
The honest framing: spot is a large discount on the interruptible part of your inference workload and a liability on the synchronous part. Split your workload by interruption tolerance and apply spot only to the part that can absorb it.
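The split also determines how much of the headline discount reaches the bill. A small sketch under simplifying assumptions: it ignores restart overhead and any on-demand fallback cost, and the function name and dollar figures are hypothetical.

```python
def blended_monthly_cost(on_demand_cost: float, interruptible_share: float, spot_discount: float) -> float:
    """Monthly cost when only the interruption-tolerant share moves to spot.

    on_demand_cost:      current bill with everything on demand
    interruptible_share: fraction of spend that is batch/async/offline, in [0, 1]
    spot_discount:       fractional discount on that share, in [0, 1]
    """
    interruptible = on_demand_cost * interruptible_share
    synchronous = on_demand_cost - interruptible  # stays on demand
    return synchronous + interruptible * (1 - spot_discount)


# $10,000/month, 40% of it interruptible, 60% spot discount.
before = 10_000
after = blended_monthly_cost(before, interruptible_share=0.4, spot_discount=0.6)
print(f"${after:,.0f} per month, {1 - after / before:.0%} saved overall")
```

A 60% discount applied to 40% of the workload is a 24% saving on the total, so the size of the interruptible share matters as much as the discount itself.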
Putting the levers together
These levers compound, and the order matters:
1. Right-size first. Before optimizing, know your actual baseline and peak traffic. Most over-provisioning is a sizing problem, not an autoscaling problem.
2. Batch to raise per-accelerator throughput. This reduces how many instances you need at any given load.
3. Autoscale on a demand signal, including scale-to-zero where the traffic pattern allows it. This removes idle capacity.
4. Apply spot to interruption-tolerant inference only. This discounts the part of the bill that can safely take it.
Done in that order, the levers reinforce each other: batching shrinks the instance count, autoscaling trims the idle hours, and spot discounts what remains of the interruptible workload. Real-world case studies consistently report substantial inference-cost reductions from combining automatic scaling, batching, and model-level optimizations like quantization — though the magnitude depends entirely on your starting waste.
The cost-latency tension is the whole game
Every lever here trades against latency. Batching adds queue delay. Scale-to-zero adds cold starts. Aggressive scale-down adds the risk of a burst arriving before capacity does. Cost optimization is hard not because the techniques are obscure but because each one spends from a latency budget you also have to defend. Get the latency budget breakdown right first so you know how much slack you have to trade.
The same discipline applies on the training side, where idle GPUs and abandoned experiments dominate the bill; we covered that in training infrastructure cost control. And none of this is safe to tune blind: you instrument utilization, queue depth, and per-stage latency, then optimize against the metrics rather than against intuition. Pair the cost work with continuous monitoring of latency alongside quality so a cost optimization that quietly degraded the model gets caught before your users find it.
Inference cost is a capacity-utilization problem wearing a per-request-price costume. Attack the utilization and the price takes care of itself.