LLM Inference TCO Calculator v2.4

Model & Scenario

Reset
Advanced Throughput (per-GPU)

Scenario presets fill in the Decode TPS per GPU. Prefill TPS per GPU is set by you and is typically 5–20× higher than decode (due to better parallelism).

How much prefill and decode overlap, i.e. run simultaneously on the same GPU.

Workload Inputs

Examples (requests per user per minute): chat 0.1–0.5; knowledge search 0.02–0.1; load test/automation 1–2.

Duty cycle is the fraction of time your system runs at peak traffic.
Sizing uses peak (with headroom). Unit costs use average (avg users if set, otherwise peak × duty cycle).
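The peak-vs-average split above can be sketched as follows. This is an illustrative reconstruction, not the calculator's actual code; the names (`peak_users`, `duty_cycle`, `avg_users`, `headroom`) are assumptions.

```python
def sizing_and_cost_basis(peak_users, duty_cycle, avg_users=None, headroom=0.3):
    """Return (user count used for GPU sizing, user count used for unit costs).

    Sizing is driven by peak traffic plus a headroom cushion; unit costs are
    driven by average traffic (explicit avg_users if given, otherwise
    peak × duty cycle).
    """
    sizing_users = peak_users * (1 + headroom)
    cost_users = avg_users if avg_users is not None else peak_users * duty_cycle
    return sizing_users, cost_users

# Example: 1000 peak users, 40% duty cycle, 30% headroom
# → size for 1300 users, cost against 400 average users.
```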

The input prompt: prefill is less work for the model per token, so prompt length has less impact on the results than output length.

The output: typically between 100 and 250 tokens, depending on how "chatty" the model is configured to be.

Prefill and decode batching are separate. Decode batching is clamped to 2.0 to respect typical latency SLOs.
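The clamp on decode batching can be sketched as below; this is a hypothetical illustration of the rule stated above (the clamp value and function name are assumptions), under the assumption that the batching factor multiplies per-GPU decode throughput.

```python
DECODE_BATCH_CLAMP = 2.0  # cap on the decode batching factor (per the note above)

def effective_decode_batch(requested_batch: float) -> float:
    """Clamp the decode batching factor so per-token latency stays within
    typical SLOs; prefill batching is configured separately and unclamped."""
    return min(requested_batch, DECODE_BATCH_CLAMP)
```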

Overhead for surge capacity (and failover/SLO cushion/forecasting errors).

Tuning & Guardrails (Advanced)

All knobs are live: changing these updates calculations immediately.

Reliability, Topology & RAG (Optional)

Turn on only what you use. Tail factor inflates decode demand. PCIe penalty applies only when GPUs/replica > 1. RAG adds latency and per-request cost.
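The guardrails above compose into a fleet-sizing calculation roughly like the following. This is a sketch under stated assumptions, not the calculator's implementation: the formula, names, and default values (`tail_factor`, `pcie_penalty`, `headroom`) are illustrative.

```python
import math

def gpus_needed(users, req_per_min, out_tokens, decode_tps_per_gpu,
                tail_factor=1.2, headroom=0.3,
                gpus_per_replica=1, pcie_penalty=0.9):
    """Estimate decode-bound GPU count for a fleet.

    - tail_factor inflates decode demand (tail latency cushion)
    - headroom adds surge/failover capacity on top of peak
    - pcie_penalty reduces per-GPU throughput only when a replica
      spans more than one GPU (inter-GPU traffic over PCIe)
    """
    demand_tps = users * (req_per_min / 60.0) * out_tokens * tail_factor
    per_gpu = decode_tps_per_gpu * (pcie_penalty if gpus_per_replica > 1 else 1.0)
    raw_gpus = demand_tps * (1 + headroom) / per_gpu
    # Round up to a whole number of replicas.
    return math.ceil(raw_gpus / gpus_per_replica) * gpus_per_replica

# Example: 1000 users, 0.3 req/min each, 150 output tokens,
# 900 decode tok/s per GPU → 2 GPUs.
```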

Infrastructure & Cost

Results

Total GPUs (fleet)
Total Servers
CapEx (CAD)
Annual OpEx (CAD)
Total Annual Cost (CAD)

Assumptions & Notes

Biggest “unknowns” on performance

Modelling limitations (Important)

Recent Benchmarks (Illustrative)

Performance varies widely by model, precision, batch, and inference stack; values are illustrative only.

| Model (size) | Precision / Setup | H100 Config | Tokens/sec (what) | Source |
| --- | --- | --- | --- | --- |
| Qwen3-32B | BF16 via vLLM benchmark_serving | 1× H100 NVL (~95 GB) | 653.61 output tok/s (total 1362.23 tok/s) | vLLM GitHub #17788 |
| Qwen3-32B | FP8 via vLLM benchmark_serving | 1× H100 NVL (~95 GB) | 879.25 output tok/s (total 1829.67 tok/s) | vLLM GitHub #17788 |
| DeepSeek-R1-Distill-Qwen-32B | FP16 via vLLM (online serving; 100-in / 600-out, 300 req) | 1× H100 80 GB | 1214.19 output tok/s (total 1481.62 tok/s) | DatabaseMart H100 vLLM |

Use your own benchmarks for sizing; the calculator defaults are conservative and may be off by an order of magnitude depending on stack and SLO.