Not All GPUs Are Equal
The GPU market spans from $1,500 consumer cards to $40,000 data center accelerators. Marketing claims about TOPS and TFLOPS obscure what matters for your workload: how many tokens per second, at what latency, for which models, at what cost. We benchmark GPUs against your actual inference requirements to recommend the option that maximizes performance per dollar, not the one with the most impressive spec sheet.
NVIDIA H100
80 GB HBM3, 3.35 TB/s bandwidth, FP8 Transformer Engine. The highest-throughput GPU for LLM inference. NVLink 4.0 at 900 GB/s. Best for 70B+ models at high concurrency. $25,000-35,000 per GPU.
NVIDIA A100
80 GB HBM2e, 2 TB/s bandwidth. Proven and widely deployed. Available at significant discounts on the secondary market. NVLink 3.0 at 600 GB/s. Best cost-per-token for medium concurrency workloads. $10,000-15,000 per GPU.
NVIDIA L40S
48 GB GDDR6, 864 GB/s bandwidth. PCIe form factor fits standard servers. No NVLink but strong single-GPU inference performance. Best for organizations adding AI to existing server infrastructure. $7,000-10,000 per GPU.
Consumer GPUs (RTX 4090/5090)
24 GB GDDR6X (RTX 4090) or 32 GB GDDR7 (RTX 5090). Excellent for development, testing, and low-volume inference. No ECC memory, no enterprise support. Not recommended for production. $1,500-2,000 per GPU, roughly a tenth the cost of data center options.
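Before full benchmarking, memory bandwidth per dollar is a useful first-pass screen, since token generation (decode) is typically bandwidth-bound. A minimal sketch using the midpoints of the price ranges above; the RTX 4090's 1,008 GB/s bandwidth is an assumed figure not quoted above:

```python
# Rough bandwidth-per-dollar screen. Prices are midpoints of the ranges
# quoted above; the RTX 4090 bandwidth (1008 GB/s) is an assumed spec.
GPUS = {
    "H100":     {"bandwidth_gbs": 3350, "price_usd": 30000},
    "A100":     {"bandwidth_gbs": 2000, "price_usd": 12500},
    "L40S":     {"bandwidth_gbs": 864,  "price_usd": 8500},
    "RTX 4090": {"bandwidth_gbs": 1008, "price_usd": 1750},
}

def bandwidth_per_kilodollar(spec):
    """GB/s of memory bandwidth per $1,000 of purchase price."""
    return spec["bandwidth_gbs"] / (spec["price_usd"] / 1000)

for name, spec in sorted(GPUS.items(),
                         key=lambda kv: -bandwidth_per_kilodollar(kv[1])):
    print(f"{name:8s} {bandwidth_per_kilodollar(spec):6.1f} GB/s per $1k")
```

On this crude metric the consumer card dominates, which is exactly why the ECC, support, and memory-capacity caveats above matter: bandwidth per dollar alone would always point at consumer hardware.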
GPU Selection Process
1. Profile: define workload requirements
2. Benchmark: test candidates with your real workload
3. Analyze: cost-performance comparison
4. Procure: vendor selection and ordering
Benchmarking Methodology
We do not rely on vendor benchmarks or generic leaderboards. We benchmark GPUs against your specific models, quantization levels, batch sizes, and latency requirements.
Tokens per second per dollar. The primary metric for GPU selection. We measure output token throughput at your target latency SLA and divide by the annualized cost of the GPU (including hosting costs). An A100 at $12,000 generating 50 tokens/second may deliver better value than an H100 at $30,000 generating 100 tokens/second if your concurrency requirements are modest.
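The metric can be sketched as follows. The four-year depreciation window and $2,000/year hosting figure are illustrative assumptions, not fixed parameters of the methodology:

```python
def tokens_per_second_per_dollar(gpu_price_usd, tokens_per_sec,
                                 lifetime_years=4, hosting_per_year_usd=2000):
    """Output tokens/sec divided by annualized total cost of ownership.

    lifetime_years and hosting_per_year_usd are illustrative assumptions;
    substitute your own depreciation schedule and colo/power costs.
    """
    annual_cost = gpu_price_usd / lifetime_years + hosting_per_year_usd
    return tokens_per_sec / annual_cost

# The A100-vs-H100 comparison from the text:
a100 = tokens_per_second_per_dollar(12_000, 50)   # 50 tok/s at $12k
h100 = tokens_per_second_per_dollar(30_000, 100)  # 100 tok/s at $30k
```

With these particular assumptions the two cards come out nearly identical, which is the point: the winner flips with hosting costs, depreciation window, and achieved throughput, so the inputs must come from measurement, not spec sheets.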
Time-to-first-token. For interactive applications, the time between sending a prompt and receiving the first output token determines perceived responsiveness. H100 with FP8 significantly reduces prefill latency for long-context queries. If your p95 TTFT target is under 500ms, this metric heavily influences GPU selection.
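A minimal TTFT measurement harness, assuming a streaming client that yields output tokens as they arrive; the dummy stream below just sleeps to stand in for real prefill latency:

```python
import time

def p95_ttft(stream_fn, n_trials=20):
    """Measure p95 time-to-first-token over n_trials streaming requests.

    stream_fn is assumed to return an iterator that yields output tokens;
    only the delay before the first token is timed.
    """
    samples = []
    for _ in range(n_trials):
        start = time.perf_counter()
        next(iter(stream_fn()))          # block until the first token
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[int(0.95 * len(samples)) - 1]

def dummy_stream(prefill_s=0.01):
    """Stand-in for a real inference client: prefill delay, then tokens."""
    time.sleep(prefill_s)                # simulated prefill latency
    yield from ["Hello", ",", " world"]
```

In practice `stream_fn` would wrap a streaming API call; the harness only assumes it can pull the first token from an iterator.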
Maximum concurrent users. The combination of GPU memory (for KV-cache), memory bandwidth (for token generation), and compute (for prefill) determines how many simultaneous conversations a single GPU can handle. We model this for your expected session length and typing patterns.
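The memory side of that model can be sketched as below. The example figures (a 13B-parameter model: 40 layers, 40 KV heads, head dimension 128, FP16 weights and cache, 2,048-token average context) are illustrative assumptions, and real sessions also contend for bandwidth and compute, so this is an upper bound from memory alone:

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV-cache per token: a K and a V vector for every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_sessions_by_memory(gpu_mem_gib, weights_gib, overhead_gib,
                           kv_per_token, avg_context_tokens):
    """Upper bound on concurrent sessions from KV-cache memory alone."""
    free_bytes = (gpu_mem_gib - weights_gib - overhead_gib) * 1024**3
    return int(free_bytes // (kv_per_token * avg_context_tokens))

# Illustrative 13B FP16 model on an 80 GB GPU (all figures assumed):
kv = kv_cache_bytes_per_token(n_layers=40, n_kv_heads=40, head_dim=128)
sessions = max_sessions_by_memory(gpu_mem_gib=80, weights_gib=26,
                                  overhead_gib=4, kv_per_token=kv,
                                  avg_context_tokens=2048)
```

Grouped-query attention (fewer KV heads) or a quantized KV-cache shrinks the per-token figure and raises the session ceiling accordingly, which is why these parameters must come from your actual model, not a generic estimate.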
Procurement Strategy
Where and how you buy GPUs significantly affects cost and lead time.
New from OEM. Dell, Supermicro, HPE, and Lenovo sell complete GPU servers with warranty and support. Longest lead time (4-16 weeks) but full vendor backing. Best for production deployments where support contracts are required.
Secondary market. Used A100 servers are available at 40-60% of new pricing from brokers and cloud provider hardware liquidation. Shorter lead time (1-2 weeks). No manufacturer warranty but third-party maintenance contracts are available. Best for cost-sensitive deployments where A100 performance is sufficient.
Cloud reserved instances. Zero lead time, no capital expenditure, but 1-3 year commitment. Best for organizations that want to start immediately while evaluating on-prem procurement in parallel.
Who This Is For
GPU selection consulting is for organizations making their first GPU hardware purchase or evaluating an upgrade from A100 to H100 generation. The right GPU choice saves tens of thousands of dollars over the hardware lifetime. The wrong choice either wastes budget on unnecessary capability or creates a bottleneck that limits AI adoption.
Contact us at ben@oakenai.tech
