LLM Inference TCO Calculator v2.4

Model & Scenario

Reset
Advanced Throughput (per-GPU)

Scenario presets fill in the Decode TPS per GPU. Prefill TPS per GPU is set by you and is typically 5–20× higher than decode (due to better parallelism).

How much prefill and decode overlap, i.e. run simultaneously on the same GPU.

Workload Inputs

Examples (requests per user per minute): chat 0.1–0.5; knowledge search 0.02–0.1; load test/automation 1–2.

Duty cycle is the fraction of time your system runs at peak traffic.
Sizing uses peak (with headroom). Unit costs use average (avg users if set, otherwise peak × duty cycle).
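The peak-vs-average split above can be sketched as follows. This is an illustrative reconstruction, not the calculator's actual code; the names (`peak_users`, `duty_cycle`, `avg_users`, `headroom`) are assumptions.

```python
def sizing_and_cost_basis(peak_users, duty_cycle, avg_users=None, headroom=0.3):
    """Return (user count used for GPU sizing, user count used for unit costs).

    Sizing is driven by peak traffic plus a headroom cushion; unit costs are
    driven by average traffic (explicit avg_users if given, otherwise
    peak × duty cycle).
    """
    sizing_users = peak_users * (1 + headroom)
    cost_users = avg_users if avg_users is not None else peak_users * duty_cycle
    return sizing_users, cost_users

# Example: 1000 peak users, 40% duty cycle, 30% headroom
# → size for 1300 users, cost against 400 average users.
```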

The input prompt: prefill is less work for the model per token, so prompt length has less impact on the results than output length.

The output: typically between 100 and 250 tokens, depending on how "chatty" the model is configured to be.

Prefill and decode batching are separate. Decode batching is clamped to 2.0 to respect typical latency SLOs.
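The clamp on decode batching can be sketched as below; this is a hypothetical illustration of the rule stated above (the clamp value and function name are assumptions), under the assumption that the batching factor multiplies per-GPU decode throughput.

```python
DECODE_BATCH_CLAMP = 2.0  # cap on the decode batching factor (per the note above)

def effective_decode_batch(requested_batch: float) -> float:
    """Clamp the decode batching factor so per-token latency stays within
    typical SLOs; prefill batching is configured separately and unclamped."""
    return min(requested_batch, DECODE_BATCH_CLAMP)
```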

Overhead for surge capacity (and failover/SLO cushion/forecasting errors).

Tuning & Guardrails (Advanced)

All knobs are live: changing these updates calculations immediately.

Reliability, Topology & RAG (Optional)

Turn on only what you use. Tail factor inflates decode demand. PCIe penalty applies only when GPUs/replica > 1. RAG adds latency and per-request cost.
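The guardrails above compose into a fleet-sizing calculation roughly like the following. This is a sketch under stated assumptions, not the calculator's implementation: the formula, names, and default values (`tail_factor`, `pcie_penalty`, `headroom`) are illustrative.

```python
import math

def gpus_needed(users, req_per_min, out_tokens, decode_tps_per_gpu,
                tail_factor=1.2, headroom=0.3,
                gpus_per_replica=1, pcie_penalty=0.9):
    """Estimate decode-bound GPU count for a fleet.

    - tail_factor inflates decode demand (tail latency cushion)
    - headroom adds surge/failover capacity on top of peak
    - pcie_penalty reduces per-GPU throughput only when a replica
      spans more than one GPU (inter-GPU traffic over PCIe)
    """
    demand_tps = users * (req_per_min / 60.0) * out_tokens * tail_factor
    per_gpu = decode_tps_per_gpu * (pcie_penalty if gpus_per_replica > 1 else 1.0)
    raw_gpus = demand_tps * (1 + headroom) / per_gpu
    # Round up to a whole number of replicas.
    return math.ceil(raw_gpus / gpus_per_replica) * gpus_per_replica

# Example: 1000 users, 0.3 req/min each, 150 output tokens,
# 900 decode tok/s per GPU → 2 GPUs.
```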

Infrastructure & Cost

Results

Total GPUs (fleet)
Total Servers
CapEx (CAD)
Annual OpEx (CAD)
Total Annual Cost (CAD)

Assumptions & Notes

Biggest “unknowns” on performance

Modelling limitations (Important)

Recent Benchmarks (Illustrative)

Performance varies widely by model, precision, batch, and inference stack; values are illustrative only.

| Model (size) | Precision / Setup | H100 Config | Tokens/sec (what) | Source |
| --- | --- | --- | --- | --- |
| Qwen3-32B | BF16 via vLLM benchmark_serving | 1× H100 NVL (~95 GB) | 653.61 output tok/s (total 1362.23 tok/s) | vLLM GitHub #17788 |
| Qwen3-32B | FP8 via vLLM benchmark_serving | 1× H100 NVL (~95 GB) | 879.25 output tok/s (total 1829.67 tok/s) | vLLM GitHub #17788 |
| DeepSeek-R1-Distill-Qwen-32B | FP16 via vLLM (online serving; 100-in / 600-out, 300 req) | 1× H100 80 GB | 1214.19 output tok/s (total 1481.62 tok/s) | DatabaseMart H100 vLLM |

Use your own benchmarks for sizing; the calculator defaults are conservative and may be off by an order of magnitude depending on stack and SLO.