Dedicated GPU Servers for AI Inference in 2026: VRAM, Throughput, and Pricing Across 10 Providers

Liutauras Morkaitis 2026-06-02

Serving a quantized 8B parameter model to a production API takes under 8 GB of VRAM. A single RTX A4000 handles it for EUR 360 a month. Serving a 70B model with 128K context at production concurrency takes over 100 GB, and the KV cache alone can exceed the model weights at that context length. These are different hardware problems with different economics, and treating them as one GPU shopping decision leads to either overspending on infrastructure or underprovisioning for the workload.

Most GPU server guides evaluate hardware for training. They rank by NVLink bandwidth, InfiniBand fabric, and multi-node scaling, dimensions that rarely matter for inference. For inference, memory bandwidth determines tokens per second more than raw compute TFLOPS. A single GPU handles the majority of production workloads up to 70B parameters at INT4 quantization. The cost metric is tokens per dollar rather than GPU-hours per training run, and the pricing model (fixed monthly, per-GPU-hour, or per-token API) matters as much as the GPU itself.

This guide evaluates ten providers specifically for inference workloads, grouped by infrastructure tier from EUR 184/month bare-metal entry to $88/hour hyperscaler MI300X nodes. Hostline is one of the ten providers compared; methodology and disclosure are documented below. The framework sections cover why inference hardware is fundamentally different from training hardware, how VRAM sizing works when KV cache enters the equation, what quantization changes for GPU selection, and when dedicated hardware beats per-token APIs.


How This Comparison Was Built

Providers were evaluated across six inference-specific dimensions. VRAM capacity and memory bandwidth, because inference throughput is memory-bandwidth-bound and a GPU with more bandwidth generates tokens faster at the same VRAM capacity. Quantization support (FP8, INT8, INT4), because quantized models reduce VRAM requirements by 2x to 4x and change which hardware can serve which model. Inference software stack compatibility (vLLM, TensorRT-LLM, Triton, llama.cpp, Neuron SDK), because framework support determines whether a GPU can actually run production serving efficiently. Networking and latency for API serving. Pricing model (fixed monthly, per-GPU-hour, per-token). Compliance certifications.

Providers are grouped by infrastructure tier: EU bare-metal servers, EU sovereign cloud, specialized AI clouds, and hyperscaler inference platforms. Within each tier, ordering broadly follows VRAM capacity from lowest to highest. No provider is ranked #1 overall. Hostline is the publisher and appears at position 2. Where Hostline’s hardware falls short of competitors on a specific dimension, this is stated directly.

All specifications were verified against vendor documentation and NVIDIA/AMD datasheets in May 2026. Pricing was cross-referenced against third-party trackers (Vantage, GPUPerHour, Spheron, CheckThat.ai). Vendor-published benchmarks are flagged as such.

Why Inference Hardware Is Not Training Hardware

Three facts shape every recommendation in this guide, and understanding them prevents the most expensive mistake in GPU procurement for inference: buying training-grade hardware for an inference workload.

First, inference is memory-bandwidth-bound, not compute-bound. Every token generated during LLM inference requires reading the entire model’s weights and the accumulated KV cache from VRAM into the tensor cores. The speed of that read, measured in TB/s of memory bandwidth, determines tokens per second far more than the GPU’s peak TFLOPS. This is why the H200, with 4.8 TB/s HBM3e bandwidth, delivers up to 1.9x the Llama 2 70B inference throughput of the H100, which has 3.35 TB/s HBM3, despite identical compute architecture. The NVIDIA H200 product brief documents this directly.

Second, single-GPU inference handles most production workloads. A 70B model at INT4 quantization fits in 35 to 46 GB of VRAM, which means a single L40S (48 GB), H100 (80 GB), or GEX131 (96 GB) can serve it without sharding across multiple GPUs. Training the same model requires NVLink, NVSwitch, and often InfiniBand across multiple nodes. For inference, NVLink matters only when the model exceeds a single GPU’s VRAM, and InfiniBand matters only when you shard across nodes. Both are necessary for serving 405B-class models at production concurrency, but the majority of production inference deployments (models in the 7B to 70B range at INT4) run on a single GPU without either.

Third, KV cache grows linearly with context length and batch size, and at long contexts it dominates VRAM usage. Serving Llama 3.1 70B at 128K context generates a KV cache that can exceed the model weights themselves. This is why GPUs with 141 to 192 GB of VRAM (H200, MI300X) are disproportionately represented in long-context production serving, even when the model weights alone would fit in 80 GB.

The practical implication: a EUR 360/month RTX A4000 with 16 GB VRAM is genuinely production-grade for serving quantized 7B to 8B models, and a EUR 903/month dual RTX A5000 handles parallel inference replicas of those models cheaper than any cloud alternative at sustained utilization. The hardware that makes sense for inference is frequently different from, and often cheaper than, the hardware optimized for training.
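The bandwidth-bound argument can be turned into a back-of-the-envelope ceiling: if every generated token requires reading the weights and KV cache once, single-stream decode speed cannot exceed memory bandwidth divided by the bytes read per token. The sketch below is a simplification that ignores batching, kernel efficiency, and compute limits. The function name and the 40 GB footprint (a 70B model at INT4) are illustrative assumptions; the bandwidth figures are the ones quoted above.

```python
def decode_ceiling_tokens_per_s(bandwidth_tb_s: float, weights_gb: float,
                                kv_cache_gb: float = 0.0) -> float:
    """Upper bound on single-stream decode speed when generation is bandwidth-bound.

    Assumes each token requires one full read of weights plus KV cache.
    """
    bytes_read_per_token = (weights_gb + kv_cache_gb) * 1e9
    return bandwidth_tb_s * 1e12 / bytes_read_per_token


# Assumed footprint: ~40 GB of weights (70B at INT4), KV cache ignored.
h100 = decode_ceiling_tokens_per_s(3.35, weights_gb=40)
h200 = decode_ceiling_tokens_per_s(4.8, weights_gb=40)
print(f"H100 ceiling: {h100:.0f} tok/s")
print(f"H200 ceiling: {h200:.0f} tok/s")
print(f"Bandwidth-only ratio: {h200 / h100:.2f}x")
```

Under these assumptions bandwidth alone accounts for a gap of roughly 1.4x. The up-to-1.9x figure in NVIDIA's brief likely also reflects the larger batches that 141 GB of VRAM allows, which this single-stream estimate does not model.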

The VRAM Equation for Inference: Weights Plus KV Cache Plus Overhead

The VRAM required for inference is the sum of three components: model weights, KV cache, and framework overhead. The weights are fixed for a given model and precision. The KV cache depends on context length, batch size, and model architecture. The framework overhead (vLLM typically adds 1 to 3 GB for PagedAttention structures) is roughly constant.

Model weights scale directly with parameter count and precision. A 7B parameter model occupies roughly 14 GB at FP16, 7 GB at INT8, and 4 to 5 GB at INT4 (AWQ or GPTQ). A 70B model occupies roughly 140 GB at FP16, 70 GB at INT8, and 35 to 46 GB at INT4. A 405B model exceeds 800 GB at FP16 and still requires roughly 200 GB at INT4, meaning it needs multi-GPU serving on any current hardware.

Model	FP16	INT8	INT4	Fits on (single GPU)
Llama 3.1 8B / Mistral 7B	~16 GB	~8 GB	~5 GB	A4000 (16 GB), A5000 (24 GB), GEX44 (20 GB)
Phi-4 14B / Qwen 3 14B	~28 GB	~14 GB	~8 GB	A5000 (24 GB), L40S (48 GB) and prototyping
Qwen 3 32B / Gemma 2 27B	~64 GB	~32 GB	~18 GB	L40S (48 GB), GEX131 (96 GB)
Llama 3.3 70B	~140 GB	~70 GB	~35-46 GB	GEX131 (96 GB), H100 (80 GB), H200 (141 GB)
Llama 3.1 405B	~810 GB	~405 GB	~200 GB	Multi-GPU: 3x H100, 2x MI300X
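
The weight columns in the table follow from parameter count multiplied by bytes per parameter. A minimal sketch of that arithmetic is shown here; the dictionary and function names are illustrative, and the result is an idealized lower bound.

```python
# Bytes per parameter at each precision (idealized, no quantization metadata).
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Approximate weight footprint in GB: parameters times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


for name, params in [("8B", 8), ("70B", 70), ("405B", 405)]:
    row = ", ".join(f"{p} ~{weights_gb(params, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{name}: {row}")
```

Real INT4 checkpoints come out somewhat larger than the 0.5 bytes per parameter used here, because formats such as AWQ and GPTQ store scales and usually keep some layers at higher precision. That is why the table lists ~5 GB rather than 4 GB for an 8B model and a 35 to 46 GB range for 70B.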

The KV cache adds to these figures and grows with context length. At 4K context on an 8B model, the KV cache is negligible (under 1 GB). At 128K context on a 70B model, the KV cache can exceed 40 GB. The rule of thumb for production serving: leave 15 to 20 percent of available VRAM as headroom beyond the weights for KV cache and batched requests. A GPU that fits the model weights exactly will run out of memory under production concurrency.
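
The KV cache figures can be estimated from the model architecture: two tensors (keys and values) per layer, per KV head, per token. The sketch below assumes FP16 cache values and the commonly published Llama 3.1 configurations (80 layers for 70B, 32 layers for 8B, both with 8 KV heads and a head dimension of 128); treat those numbers, the function names, and the 2 GB overhead default as assumptions to check against the model you actually deploy.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, context_tokens: int,
                batch_size: int = 1, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: K and V tensors for every layer, KV head, and token."""
    total_bytes = (2 * layers * kv_heads * head_dim * bytes_per_value
                   * context_tokens * batch_size)
    return total_bytes / 1e9


def total_vram_gb(weights_gb: float, kv_gb: float, overhead_gb: float = 2.0) -> float:
    """Weights plus KV cache plus framework overhead (overhead is an assumed default)."""
    return weights_gb + kv_gb + overhead_gb


# Assumed 70B config: 80 layers, 8 KV heads, head_dim 128, one 128K-token sequence.
kv_70b_128k = kv_cache_gb(80, 8, 128, context_tokens=128 * 1024)
# Assumed 8B config: 32 layers, 8 KV heads, head_dim 128, one 4K-token sequence.
kv_8b_4k = kv_cache_gb(32, 8, 128, context_tokens=4096)

print(f"70B @ 128K: {kv_70b_128k:.1f} GB of KV cache per sequence")
print(f"8B @ 4K:    {kv_8b_4k:.2f} GB of KV cache per sequence")
print(f"70B INT4 total (40 GB weights): {total_vram_gb(40, kv_70b_128k):.1f} GB")
```

Because the cache scales with batch size, a handful of concurrent long-context requests multiplies the per-sequence figure, which is how the cache comes to exceed the weights.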

The A5000’s 24 GB VRAM illustrates how this math plays out in practice. DatabaseMart’s published vLLM benchmark shows the A5000 sustaining 2,714 tokens/s on Qwen2.5-3B-Instruct and 3,935 tokens/s on DeepSeek-R1-Distill-Qwen-1.5B at high concurrency with PagedAttention and continuous batching. For models in the 3B to 8B range, the A5000 is not a compromise option. It is a production-capable inference GPU with documented throughput data.

Quantization: How INT4 and FP8 Change What Hardware You Need

Quantization compresses model weights from FP16 (2 bytes per parameter) to INT8 (1 byte) or INT4 (0.5 bytes), reducing VRAM requirements proportionally. AWQ, GPTQ, and GGUF are the dominant INT4 formats in 2026, supported by vLLM, TensorRT-LLM, llama.cpp, and Ollama. The quality impact on chat and instruction-following tasks is typically small enough that most production deployments use INT4 or INT8 unless the application requires maximum accuracy.

FP8 is a hardware-accelerated format available on Hopper (H100, H200), Blackwell (B200, RTX PRO 6000), and Ada Lovelace (L4, L40S, RTX 4000 SFF Ada) GPUs. It provides roughly 2x throughput versus FP16 with smaller quality degradation than INT4. Ampere-generation GPUs, including the RTX A4000 and RTX A5000 in Hostline’s lineup, do not have native FP8 tensor core support. They serve INT4 and INT8 quantized models through framework kernels (vLLM, llama.cpp), which for weight-only formats such as AWQ and GPTQ typically dequantize weights to FP16 for the matrix math, rather than through a hardware FP8 path. This works, and the throughput is production-viable as the A5000 benchmarks above demonstrate, but the per-watt efficiency is lower than on GPUs with native FP8 support.

The practical impact: a 70B model that would require two H100s at FP16 (140 GB total) fits on a single 48 GB L40S at INT4 (35 to 46 GB) or a single 96 GB GEX131 at INT8 (70 GB). Quantization does not just save money on VRAM. It eliminates the need for multi-GPU inference entirely for most models up to 70B parameters, which removes NVLink and tensor parallelism from the equation.
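
The fit check described above can be sketched as a small helper that picks the highest precision whose weights fit inside a GPU's VRAM after reserving headroom. The function name and the 20 percent default are illustrative assumptions, and the 0.5 bytes per parameter used for INT4 is an idealized lower bound.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def highest_precision_that_fits(params_billion: float, vram_gb: float,
                                headroom: float = 0.20):
    """Return the highest precision whose weights fit in VRAM minus headroom, else None."""
    usable_gb = vram_gb * (1 - headroom)
    for precision in ("FP16", "INT8", "INT4"):  # ordered from highest to lowest precision
        if params_billion * BYTES_PER_PARAM[precision] <= usable_gb:
            return precision
    return None


for gpu, vram in [("A5000", 24), ("L40S", 48), ("GEX131", 96)]:
    print(f"70B on {gpu} ({vram} GB): {highest_precision_that_fits(70, vram)}")
```

With these assumptions a 70B model lands at INT4 on the 48 GB card and INT8 on the 96 GB card, and does not fit a single 24 GB card at any of the three precisions.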

Dedicated Hardware vs Cloud GPU-Hours vs Per-Token APIs

Three pricing models compete for inference workloads, and the right choice depends on utilization pattern more than GPU generation.

Fixed monthly dedicated hardware (Hostline at EUR 360 to 1,220/month, Hetzner at EUR 184 to 889/month) charges a flat rate regardless of utilization. At 24/7 operation, Hostline’s dual A5000 at EUR 903/month works out to EUR 1.24/hr. Per-GPU-hour cloud (Lambda at $1.48 to $6.08/GPU-hr, RunPod at $0.34 to $5.98/GPU-hr, Nebius at $0.74 to $3.50/GPU-hr) charges only when the GPU is running. Per-token APIs (Lambda Inference API at $0.02 to $0.90/Mtok, RunPod Serverless) charge only for tokens generated, with zero cost when idle.

The break-even math starts from concrete numbers. Hostline’s dual A5000 at EUR 903/month equals roughly EUR 1.24/hr at 24/7 utilization. Lambda’s A100 at $1.48/GPU-hr equals roughly $1,080/month at 24/7. As a rule of thumb, dedicated bare metal is cheaper above approximately 60 percent utilization, and per-token APIs win below approximately 40 percent because idle time costs nothing. The middle ground (40 to 60 percent utilization) is where per-GPU-hour cloud pricing with per-second billing is most competitive.

A chatbot that handles 50x daily traffic variance wastes 98 percent of a dedicated GPU’s capacity during off-peak hours. RunPod Serverless or Lambda’s Inference API charges only for active requests. A 24/7 internal API serving 8B inference at constant load is the ideal dedicated hardware workload: fixed EUR 360/month for a workload that would cost $1,080+ per month on cloud.
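
The break-even point for any specific pair of offers is a one-line calculation. The sketch below uses the prices quoted in this section and assumes a 730-hour month; it ignores EUR/USD conversion and differences in GPU capability, so the results are indicative only. The function names are illustrative.

```python
HOURS_PER_MONTH = 730  # assumed average month length in hours


def effective_hourly(monthly_price: float) -> float:
    """Hourly cost of a flat monthly plan when it runs 24/7."""
    return monthly_price / HOURS_PER_MONTH


def break_even_utilization(monthly_price: float, cloud_hourly: float) -> float:
    """Fraction of the month a metered GPU must run before it costs more than the flat plan."""
    return monthly_price / (cloud_hourly * HOURS_PER_MONTH)


print(f"Dual A5000 effective rate: {effective_hourly(903):.2f} per hour")
print(f"EUR 903 plan vs 1.48/hr cloud GPU: {break_even_utilization(903, 1.48):.0%}")
print(f"EUR 360 plan vs 1.48/hr cloud GPU: {break_even_utilization(360, 1.48):.0%}")
```

The crossover moves considerably depending on which dedicated plan and which cloud rate are paired, so it is worth running the calculation with your own shortlist rather than relying on a single threshold.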

Best Dedicated GPU Server Providers for AI Inference

Provider	Category	Top Inference GPU	VRAM/GPU	Mem BW	FP8	Pricing	Egress
Hetzner GEX44	EU bare-metal entry	RTX 4000 SFF Ada	20 GB GDDR6	320 GB/s	Yes	EUR 184/mo	None
Hostline	EU bare-metal multi-GPU	RTX A5000	24 GB GDDR6	768 GB/s	No	EUR 360-1,220/mo	None
Hetzner GEX131	EU bare-metal high-VRAM	RTX PRO 6000 Blackwell	96 GB GDDR7	1.79 TB/s	Yes	EUR 889/mo	None
OVHcloud	EU sovereign cloud	L40S	48 GB GDDR6	864 GB/s	Yes	~EUR 1.06-2.58/hr	None
Lambda	GPU cloud + inference API	H100 / B200	80-192 GB	3.35-8 TB/s	Yes	$1.48-6.08/GPU-hr	None
RunPod	Serverless + GPU pods	H100 / B200	80-192 GB	3.35-8 TB/s	Yes	$0.34-5.98/GPU-hr	None
Nebius	EU-sovereign AI cloud	H200 / L40S	48-141 GB	0.86-4.8 TB/s	Yes	$0.74-3.50/GPU-hr	None
CoreWeave	Blackwell-scale inference	B200 / GB200	192-288 GB	8+ TB/s	Yes	$6.16-10.50/GPU-hr	None
AWS Inf2 + G6	Purpose-built inference	Inferentia2 / L4	32-384 GB	varies	Neuron SDK	$0.758-12.98/hr	Per-GB
Azure ND MI300X	High-VRAM AMD inference	MI300X	192 GB HBM3	5.3 TB/s	Via ROCm	~$88/hr (8-GPU)	Per-GB

EU Bare-Metal Entry: Hetzner GEX44

Infrastructure tier: Dedicated single-GPU bare metal, Falkenstein/Nuremberg, Germany.
Operator: Hetzner Online GmbH. ISO 27001, GDPR.

Hetzner’s GEX44 is the cheapest dedicated inference box in this comparison. It ships an NVIDIA RTX 4000 SFF Ada Generation with 20 GB GDDR6 ECC and native FP8 tensor core support, paired with an Intel Core i5-13500, 64 GB DDR4, and dual 1.92 TB NVMe Gen3 SSDs in software RAID 1. Pricing is EUR 184/month (rising to roughly EUR 212/month from April 2026) with 1 Gbps networking and unlimited traffic. At 20 GB VRAM with FP8, the GEX44 comfortably serves Llama 3.1 8B at FP16, Mistral 7B, Phi-4 14B at INT4, and Qwen 3 32B at aggressive INT4 quantization.

Strengths

Lowest monthly cost in this comparison at EUR 184/month; native FP8 tensor core support (Ada Lovelace); 20 GB GDDR6 ECC VRAM; NVMe Gen3 storage; ISO 27001 and GDPR; unlimited traffic on 1 Gbps.

Limitations

Single GPU only with no multi-GPU configurations; 20 GB VRAM caps out at roughly 8B at FP16, 14B at INT8, or 32B at aggressive INT4; DDR4 platform; 1 Gbps networking; memory bandwidth is 320 GB/s (the SFF form factor uses a 160-bit memory bus, less than half the 768 GB/s of the A5000’s 384-bit bus); setup takes 1 to 3 business days.

EU Bare-Metal Multi-GPU: Hostline

Infrastructure tier: Dedicated bare-metal GPU servers with published EUR monthly pricing.
Operator: Hostline (hostline.io), Vilnius, Lithuania (EU). Operating since 2011. GDPR.

Hostline provides three dedicated GPU server configurations from its Vilnius data center, each designed for sustained inference workloads where fixed monthly billing eliminates variable cloud costs. The entry plan pairs a single RTX A4000 (16 GB GDDR6 ECC, Ampere) with an Intel Xeon Gold 6130, 64 GB DDR4 ECC RAM, and dual 960 GB SATA SSDs for EUR 360/month. The mid-tier plan provides two RTX A5000 GPUs (24 GB GDDR6 ECC each, 768 GB/s memory bandwidth per GPU, 48 GB aggregate VRAM) with two Xeon Gold 6130 CPUs, 128 GB DDR4 ECC RAM, and dual 1.92 TB SATA SSDs for EUR 903/month. The top configuration adds a third RTX A5000 (72 GB aggregate VRAM) and scales system RAM to 256 GB DDR4 ECC for EUR 1,220/month. All plans include iDRAC 9 Enterprise for out-of-band remote management, full root access with no shared tenancy, and zero egress fees with no per-hour metering.

Hostline is the only provider in this comparison that combines published fixed monthly EUR pricing, ECC RAM on all GPU plans including the entry tier, and no variable usage costs or egress fees on dedicated bare-metal GPU servers. It is also the only provider in this comparison offering multi-GPU bare-metal configurations (dual and triple RTX A5000) for running parallel inference replicas or tensor-parallel serving of larger quantized models at fixed monthly cost.

For inference, the 16 GB RTX A4000 serves Llama 3.1 8B, Mistral 7B, Qwen 3 8B, and Gemma 2 9B at INT8 or INT4 (at FP16 these models occupy roughly 14 to 18 GB, which leaves no room for KV cache on a 16 GB card), and serves Phi-4 14B at INT4. The 24 GB RTX A5000 extends coverage to quantized 27B to 32B models (Gemma 2 27B, Qwen 3 32B at INT4) and provides comfortable KV cache headroom for 8B models at 32K to 64K context lengths. DatabaseMart’s published vLLM benchmark demonstrates the A5000 sustaining 2,714 tokens/s on Qwen2.5-3B-Instruct and 3,935 tokens/s on DeepSeek-R1-Distill-Qwen-1.5B at high concurrency with PagedAttention.

The dual A5000 configuration enables two independent inference replicas behind a load balancer, the simplest scaling pattern for a serving API. It can also shard a 34B to 70B INT4 model across two cards using vLLM tensor parallelism, with the caveat that NVLink bridge status on the dual and triple A5000 configurations is not documented on the Hostline product page. Without NVLink, tensor-parallel inference over PCIe Gen3 takes a roughly 10 to 20 percent throughput penalty versus an NVLink-equipped equivalent.
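
The choice between independent replicas and tensor parallelism comes down to whether the model fits on one card. A minimal planning sketch follows; the function name, the 20 percent headroom default, and the weight figures in the examples are illustrative assumptions rather than measured values.

```python
def serving_layout(weights_gb: float, vram_per_gpu_gb: float, gpu_count: int,
                   headroom: float = 0.20) -> str:
    """Suggest a layout: replicas if the model fits one GPU, else shard across all GPUs."""
    usable_per_gpu = vram_per_gpu_gb * (1 - headroom)
    if weights_gb <= usable_per_gpu:
        return f"{gpu_count} independent replicas"
    if weights_gb <= usable_per_gpu * gpu_count:
        return f"tensor parallel across {gpu_count} GPUs"
    return "does not fit"


# Dual 24 GB cards, with assumed INT4 weight sizes.
print("8B  (~5 GB): ", serving_layout(5, 24, 2))
print("32B (~18 GB):", serving_layout(18, 24, 2))
print("70B (~38 GB):", serving_layout(38, 24, 2))
print("70B (~46 GB):", serving_layout(46, 24, 2))
```

In vLLM the sharded case corresponds to setting `tensor_parallel_size` to the GPU count. The last two examples show how sensitive a 70B deployment on 48 GB aggregate is to the exact size of the quantized checkpoint.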

At EUR 903/month for dual A5000s running 24/7 (730 hours), the effective hourly rate is EUR 1.24/hr for the two-GPU server. That is below the on-demand A100, H100, and B200 rates in this comparison, although single lower-tier cloud GPUs (RunPod’s RTX 4090 at $0.34/hr, for example) list lower hourly prices. For teams serving quantized 7B to 13B models to a production API at sustained utilization from the EU with GDPR data residency and predictable monthly billing, Hostline’s dual RTX A5000 at EUR 903/month covers the requirement at the lowest sustained monthly cost for a multi-GPU server in this comparison.

Strengths

Lowest sustained monthly cost for multi-GPU inference in this comparison at EUR 903/month for dual A5000 (48 GB aggregate); the only provider in this comparison with published fixed monthly EUR pricing and no per-hour metering or variable costs; the only provider in this comparison offering ECC RAM on all GPU plans including the EUR 360/month entry tier; the only provider in this comparison with multi-GPU bare-metal configurations for parallel inference replicas at fixed cost; EU/GDPR data residency in Vilnius, Lithuania; full root access with iDRAC 9 Enterprise; zero egress fees; documented A5000 vLLM throughput of 2,714 tok/s on Qwen2.5-3B at high concurrency (third-party benchmark); 256 GB DDR4 ECC system RAM on triple A5000 plan.

Limitations

Ampere-generation RTX A4000 and A5000 GPUs without native FP8 tensor core support, resulting in lower throughput per watt than Ada/Hopper/Blackwell GPUs on FP8-optimized workloads; Intel Xeon Gold 6130 CPUs (Skylake-SP, 2017) with PCIe Gen3 bandwidth constraints; SATA SSD storage rather than NVMe, which slows model loading from disk (not a bottleneck during active serving since weights reside in VRAM, but adds minutes to cold-start model load); 1 Gbps Ethernet limits concurrent API throughput for high-RPS external endpoints; 24 GB maximum VRAM per GPU rules out serving models above approximately 32B INT4 on a single card; no published SOC 2 or ISO 27001 certification; NVLink bridge status undocumented on dual and triple A5000 configurations.

EU Bare-Metal High-VRAM: Hetzner GEX131

Infrastructure tier: Dedicated single-GPU bare metal, Nuremberg/Falkenstein, Germany.
Operator: Hetzner Online GmbH. ISO 27001, GDPR.

The GEX131 ships an NVIDIA RTX PRO 6000 Blackwell Max-Q GPU with 96 GB GDDR7 ECC, 1.79 TB/s memory bandwidth, and native FP8 plus FP4 tensor core support. The system platform is Intel Xeon Gold 5412U (Sapphire Rapids, 24 cores), 256 GB DDR5 ECC expandable to 768 GB, and dual 960 GB NVMe Gen4 SSDs. Pricing is EUR 889/month or EUR 1.42/hr on hourly billing (a December 2025 forum reference cites EUR 1,057.91/month; verify live pricing before purchasing). An optional 10 Gbps uplink is available.

The 96 GB VRAM is the defining feature for inference. Llama 3.3 70B at INT4 (35 to 46 GB) fits with substantial KV cache headroom. Llama 3.3 70B at INT8 (roughly 70 GB) fits with moderate headroom. A 30B to 40B model at FP16 fits comfortably. No other single bare-metal GPU in this comparison offers this VRAM capacity. The Blackwell Max-Q variant runs roughly 5 to 14 percent slower than the 600W Workstation Edition per Puget Systems testing, but for steady inference loads the power efficiency trade-off is favorable.

Strengths

96 GB VRAM in a single bare-metal GPU, the highest among the bare-metal servers in this comparison; NVMe Gen4 storage for fast model loading; Blackwell-generation FP8 and FP4 tensor core support; 1.79 TB/s memory bandwidth; ISO 27001 and GDPR; optional 10 Gbps uplink.

Limitations

Single GPU only; GDDR7 bandwidth (1.79 TB/s) is fast for workstation memory but below HBM3/HBM3e on data center GPUs (H100 at 3.35 TB/s, H200 at 4.8 TB/s); 1 Gbps default networking; no published SLA percentage; pricing discrepancy between press release and forum reports.

EU Sovereign Cloud: OVHcloud

Infrastructure tier: Public cloud GPU instances and bare-metal HGR-AI. EU data centers.
Operator: OVHcloud. ISO 27001, SOC, HDS (French health data hosting).

OVHcloud positions its L40S instances (48 GB GDDR6, 864 GB/s, native FP8) as the inference sweet spot for models up to roughly 20B at FP16 or larger with quantization. The L4 line (24 GB GDDR6) targets compact models up to 7B and multimedia inference workloads. H100 and H200 bare-metal options are available for larger workloads through OVHcloud’s HGR-AI series.

The compliance portfolio is what sets OVHcloud apart from the specialized AI clouds in this comparison. OVHcloud holds HDS certification for French health data hosting, a requirement for any organization processing patient records or clinical data under French law. Neither RunPod, Lambda, nor Nebius offers HDS. OVHcloud claims a 64 percent cost reduction versus hyperscalers on Llama 3.1 inference (vendor-published figure, not independently verified). The published SLA is 99.99 percent monthly on GPU instances, the highest stated SLA among the EU-based providers in this comparison. Data centers span France, Germany, Poland, and other EU locations with full GDPR coverage. ISO 27001 and SOC certified.

Strengths

EU sovereignty with GDPR and HDS compliance; L40S inference sweet spot with 48 GB and native FP8; 99.99 percent SLA; ISO 27001 and SOC certified.

Limitations

Blackwell-generation GPUs not available as of May 2026; lineup anchored on Ada Lovelace and Hopper; EU-centric presence limits options for global latency optimization; no published per-token inference API.

Specialized AI Cloud: Lambda

Infrastructure tier: On-demand GPU instances plus per-token serverless inference API.
Operator: Lambda, Inc. SOC 2 Type II.

Lambda runs two inference models in parallel. The GPU cloud offers A100 from $1.48/GPU-hr, H100 SXM from $2.99 to $3.99/GPU-hr, and B200 SXM6 from $4.99 to $6.99/GPU-hr, with per-minute billing, zero egress, and the pre-configured Lambda Stack (PyTorch, CUDA, vLLM). The separate Inference API charges per token: $0.02/Mtok for Llama 3.2-3B-Instruct, $0.20/Mtok for Llama 3.3 70B Instruct at FP8, and $0.90/Mtok for Llama 3.1 405B Instruct, with OpenAI-compatible endpoints and no rate limits. Lambda claims this is “the lowest-priced serverless AI inference available anywhere” (vendor claim per the Lambda blog and VentureBeat coverage).

The dual model is the differentiator for inference. Teams can prototype on rented GPUs, then shift production serving to the per-token API when utilization patterns favor it, or vice versa, without changing vendors.

Strengths

Dual pricing model (GPU-hours and per-token API) from a single vendor; most transparent published GPU pricing among specialized AI clouds; per-token rates among the lowest documented for open-weight models; zero egress; per-minute billing; SOC 2 Type II.

Limitations

H200 available only in cluster configurations with no on-demand hourly rate; no EU data center regions; per-token API supports only a curated set of open-weight models; capacity constraints on H100 and B200 reported during peak demand.

Specialized AI Cloud: RunPod

Infrastructure tier: Serverless inference, GPU Pods, Secure Cloud and Community Cloud.
Operator: RunPod, Inc. SOC 2 Type II (Secure Cloud, since October 2025).

RunPod’s inference positioning centers on serverless endpoints with scale-to-zero. The platform ships a production-ready vLLM worker image (runpod/worker-vllm:latest) with an OpenAI-compatible wrapper, sub-200ms cold starts via FlashBoot, and per-second billing. RunPod’s 2026 State of AI Report notes that “vLLM has become the de facto standard for LLM serving, powering 40% of all LLM endpoints on the platform” (vendor-published via PRNewswire, March 2026). GPU Pod rates run from $0.34/hr for RTX 4090 in Community Cloud to $2.69/hr for H100 SXM to $5.98/hr for B200 Secure Cloud. A $25 free credit and Startup Program with up to 1,000 free H100 hours lower the trial barrier.

The scale-to-zero capability is the key inference feature. A chatbot that handles 1,000 concurrent users at 9 AM and 20 at 3 AM pays only for active compute. On dedicated hardware sized for the peak, roughly 98 percent of GPU capacity sits idle during the overnight trough.
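
The idle-capacity argument is simple arithmetic on the traffic profile. The sketch below computes the idle share at the trough and across a whole day for hardware sized to the peak; the 24-hour profile is a hypothetical example, not measured traffic, and the function names are illustrative.

```python
def trough_idle(peak_load: float, trough_load: float) -> float:
    """Share of peak-sized capacity that sits idle at the quietest point."""
    return 1 - trough_load / peak_load


def idle_fraction(load_by_hour: list) -> float:
    """Share of peak-sized capacity left unused across the whole profile."""
    peak = max(load_by_hour)
    used = sum(load_by_hour) / (peak * len(load_by_hour))
    return 1 - used


# Hypothetical day: 8 quiet hours, 8 moderate hours, 8 peak hours (concurrent users).
hypothetical_profile = [20] * 8 + [500] * 8 + [1000] * 8

print(f"Idle at trough (1,000 vs 20 users): {trough_idle(1000, 20):.0%}")
print(f"Idle across the hypothetical day:   {idle_fraction(hypothetical_profile):.0%}")
```

The daily figure is what determines whether metered or serverless pricing comes out ahead, since the trough alone overstates how much of the month is wasted.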

Strengths

Per-second billing with no minimums; serverless inference with scale-to-zero and sub-200ms cold starts; native vLLM worker images for rapid deployment; among the lowest on-demand H100 and B200 rates among specialized AI clouds; $25 free credit and startup program; zero egress.

Limitations

Community Cloud has no platform SLA on uptime; InfiniBand not guaranteed across clusters; networking specifications less documented than CoreWeave or Lambda; default 5 concurrent serverless workers per endpoint requires higher account balance to lift.

EU-Sovereign AI Cloud: Nebius

Infrastructure tier: Cloud GPU instances with EU data centers. Helsinki, Finland headquarters.
Operator: Nebius B.V. NVIDIA Exemplar Status.

Nebius combines competitive GPU pricing with EU data sovereignty. For inference specifically, the L40S at $0.74 to $1.82/GPU-hr (committed to on-demand) provides 48 GB GDDR6 with native FP8, and the RTX PRO 6000 at $0.95 to $1.80/GPU-hr offers 96 GB GDDR7 with Blackwell-generation tensor cores. The H200 at $1.45 to $3.50/GPU-hr delivers 141 GB HBM3e at 4.8 TB/s for bandwidth-intensive long-context serving. Nebius’s case study with Brave Search documents production AI summaries for over 11 million queries daily on Nebius infrastructure. InfiniBand at 3,200 Gbps on HGX configurations. Managed Kubernetes and Slurm available.

Strengths

Competitive L40S pricing at $0.74/GPU-hr committed; EU sovereignty with Helsinki headquarters; RTX PRO 6000 at $0.95/GPU-hr committed (96 GB GDDR7 Blackwell); H200 for bandwidth-intensive inference; Brave Search production case study; NVIDIA Exemplar Status.

Limitations

GB200 and GB300 not yet GA as of May 2026; quota workflow documented as a friction point for new accounts; compliance narrower than AWS or Azure; smaller global footprint for latency-optimized serving outside EU.

Blackwell-Scale Inference: CoreWeave

Infrastructure tier: Managed AI cloud with inference-specific pricing tier.
Operator: CoreWeave, Inc. (Nasdaq: CRWV). SOC 2, ISO 27001, HIPAA, FedRAMP.

CoreWeave operates at the performance ceiling of this comparison. On May 11, 2026, CoreWeave announced it ranked #1 across 11 inference providers in Artificial Analysis’s independent Kimi K2.6 benchmark, delivering 205 tokens/s at $0.70/Mtok blended cost (vendor claim per BusinessWire). The GPU lineup spans H100 SXM at $6.16/GPU-hr, H200, B200 at $8.60/GPU-hr, and GB200/GB300 NVL72. An inference-specific single-GPU pricing tier is published at $6.16/hr for H100 and $8.60/hr for B200. Networking runs 400 Gb/s NDR InfiniBand on Hopper and 800 Gb/s Quantum-X800 on Blackwell. Zero egress fees. Reserved discounts up to 60 percent.

For inference at scale, CoreWeave’s B200 with 192 GB HBM3e and 8+ TB/s bandwidth is the single strongest GPU in this comparison for serving 70B to 405B models at production throughput.

Strengths

#1 inference speed in Artificial Analysis independent benchmark (vendor claim); widest Blackwell GPU selection among specialized AI clouds; 800 Gb/s InfiniBand on Blackwell; inference-specific pricing tier; zero egress; broadest compliance among specialized AI clouds (SOC 2, ISO 27001, HIPAA, FedRAMP); reserved discounts to 60 percent.

Limitations

On-demand B200 at $8.60/GPU-hr is among the highest specialized cloud rates; minimum commitments on reserved capacity; Kubernetes-native platform requires container expertise; no spot or interruptible tier; economics favor cluster-scale customers over single-endpoint inference.

Purpose-Built Inference Accelerator: AWS Inf2 and G6

Infrastructure tier: EC2 instances and SageMaker endpoints.
Operator: Amazon Web Services. SOC 2, ISO 27001, HIPAA, FedRAMP, ITAR.

AWS is the only provider in this comparison with a purpose-built inference accelerator. Inferentia2 is positioned by AWS as “purpose built for deep learning inference” with the “highest performance at the lowest cost in Amazon EC2 for generative AI models.” The inf2.xlarge (1 Inferentia2 chip, 32 GB accelerator memory) runs at $0.758/hr, compared to $1.006/hr for the comparable g5.xlarge (A10G). Independent analyses document 25 to 40 percent lower cost per inference on supported transformer models. The inf2.48xlarge (12 chips, 384 GB via NeuronLink) supports models up to 175B parameters at $12.981/hr.

The catch is the Neuron SDK. Model compilation takes 10 to 45 minutes. Standard transformer operations work, but custom CUDA kernels do not. vLLM has a Neuron backend, and Hugging Face Optimum-Neuron supports Llama, Mistral, and Gemma families. For teams already in the AWS ecosystem running supported models at high volume, the 25 to 40 percent cost savings on inference are substantial. For teams with custom architectures or CUDA-specific code, NVIDIA GPUs on the G6 line (L4, L40S) provide the standard inference path within AWS.

Strengths

Purpose-built inference chip with 25-40 percent cost-per-inference savings on supported models (vendor-published); inf2.xlarge at $0.758/hr is the lowest per-instance hourly rate in this comparison for a dedicated accelerator; SageMaker integration; broadest compliance; 384 GB aggregate memory on inf2.48xlarge.

Limitations

Requires Neuron SDK with 10-45 minute model compilation; custom CUDA kernels not supported; framework lock-in; per-GB egress adds to data-intensive workloads; Inferentia cost advantage applies only to models that compile cleanly on Neuron.

High-VRAM AMD Inference: Azure ND MI300X

Infrastructure tier: ND-series HPC VMs.
Operator: Microsoft Azure. SOC 2, ISO 27001, HIPAA, FedRAMP.

Azure’s ND MI300X v5 provides 8x AMD Instinct MI300X GPUs with 192 GB HBM3 and 5.3 TB/s memory bandwidth per GPU, totaling 1,536 GB aggregate HBM3 in a single node. For serving 405B-class models, this eliminates cross-node sharding entirely. Third-party trackers report approximately $88.49/hr on-demand starting price, though pricing varies by region and should be verified in the Azure Pricing Calculator. Quantum-2 InfiniBand at 3,200 Gbps with GPUDirect RDMA.

ROCm support for inference has matured significantly. AMD’s ROCm blog documents that “ROCm Becomes a First-Class Platform in the vLLM Ecosystem” with vLLM v0.14.0. SGLang also supports MI300X natively. For standard PyTorch inference code, MI300X works without modification. At approximately $88/hr for the 8-GPU node (roughly $11/GPU-hr on-demand), the VRAM-per-dollar ratio (192 GB per GPU) is competitive with H100 pricing from specialized clouds while providing more than double the VRAM per GPU.
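
The VRAM-per-dollar comparison can be checked directly from the prices quoted in this guide. The sketch below divides VRAM per GPU by the hourly rate per GPU; the function name is illustrative, the Azure figure is the third-party node price divided by eight, and all rates should be re-verified before use.

```python
def vram_gb_per_dollar_hour(vram_gb: float, hourly_price: float) -> float:
    """GB of VRAM obtained per dollar of hourly spend."""
    return vram_gb / hourly_price


# (VRAM per GPU in GB, on-demand price per GPU-hour in USD) as quoted in this guide.
offers = {
    "Azure MI300X (node price / 8)": (192, 88.49 / 8),
    "CoreWeave H100": (80, 6.16),
    "RunPod H100 SXM": (80, 2.69),
}

ratios = {name: vram_gb_per_dollar_hour(vram, price) for name, (vram, price) in offers.items()}
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.1f} GB per $/hr")
```

How the MI300X compares depends on which H100 rate it is set against, so the ratio is best read alongside the per-GPU capacity: 192 GB on one device avoids sharding that an 80 GB card would require.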

Strengths

192 GB HBM3 per GPU, the highest VRAM-per-GPU in this comparison; 5.3 TB/s memory bandwidth; first-class vLLM support via ROCm; 3,200 Gbps InfiniBand; full compliance (SOC 2, ISO 27001, HIPAA, FedRAMP); eliminates cross-node sharding for 405B-class models.

Limitations

MI300X requires ROCm with potential porting effort for CUDA-only codebases; on-demand pricing approximately $88/hr for the 8-GPU node; per-GB egress; pricing varies by region and is not published in a static table by Microsoft.

Honorable Mentions

Together AI operates a per-token inference API with Llama 3.3 70B at $0.88/Mtok and dedicated H100 endpoints at $6.49/hr. Their FlashAttention-3 implementation achieves 840 TFLOPS BF16 on H100, 85 percent utilization (vendor claim). For teams that want managed open-weight serving without infrastructure, Together is the most polished managed inference API in the market alongside Lambda.

Vast.ai’s GPU marketplace offers the lowest floor pricing across all providers evaluated with H100 PCIe from $1.97/hr and RTX 4090 from $0.31/hr. Reliability varies by host, networking is undocumented, and there is no platform SLA. For fault-tolerant batch inference on cost-sensitive workloads, Vast.ai fills a niche that managed providers do not serve.

Oracle OCI provides bare-metal H100 and H200 inference at $10/GPU-hr with up to 61.4 TB local NVMe per node, zero hypervisor overhead, and MI300X at $6/GPU-hr. Vultr publishes 36-month L40S at $0.848/GPU-hr and a free Serverless Inference product. Cherry Servers offers single-tenant A100/A40/A10 bare metal from $0.30/hr with 100 TB/month included egress.

Common Mistakes When Choosing GPU Servers for Inference

The most expensive mistake is buying H100-class hardware for an 8B model. Llama 3.1 8B at INT4 takes under 8 GB of VRAM. Running that on a $6.16/hr H100 with 80 GB wastes over 90 percent of the available memory and over 90 percent of the budget. Hostline’s RTX A4000 at EUR 360/month or Hetzner’s GEX44 at EUR 184/month handles the same workload.

The second mistake is sizing VRAM for model weights alone and forgetting the KV cache. A 70B model at INT4 occupies roughly 40 GB, which looks like it fits on an L40S (48 GB). Under production concurrency with 16K+ context, the KV cache pushes total usage above 48 GB and the server runs out of memory. Size for weights plus KV cache plus 15 to 20 percent headroom.

The third mistake is choosing by GPU-hr price without checking memory bandwidth. An A5000 at 768 GB/s and a GEX44 at 320 GB/s run the same model at different throughput even when both have sufficient VRAM. For latency-sensitive serving, bandwidth per dollar matters as much as VRAM per dollar.
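The bandwidth effect can be approximated with a back-of-envelope ceiling: at batch size 1, each generated token reads every weight once, so decode speed cannot exceed memory bandwidth divided by the size of the weights. The sketch below is a simplification that ignores KV cache reads, kernel efficiency, and batching. The 8B model at 0.5 bytes per parameter (INT4) is an assumed workload, and the function name is illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billion: float,
                         bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model."""
    weight_gb = params_billion * bytes_per_param  # GB read per generated token
    return bandwidth_gb_s / weight_gb


# Bandwidth figures as quoted in this guide; the 8B INT4 workload is an assumption.
for name, bandwidth in {"RTX A5000": 768.0, "GEX44": 320.0}.items():
    ceiling = decode_ceiling_tok_s(bandwidth, params_billion=8, bytes_per_param=0.5)
    print(f"{name}: ~{ceiling:.0f} tok/s theoretical ceiling")
```

Real throughput lands well below these ceilings, but the ratio between the two cards (2.4x here) is the part that tends to carry over to measured single-stream decode speed.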

The fourth mistake is running a dedicated GPU 24/7 for a workload that needs four hours of inference per day. At 17 percent utilization, a $0.20/Mtok per-token API almost always beats a dedicated EUR 360/month server. Measure your actual utilization before committing to hardware.
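The 17 percent figure is simply four hours out of twenty-four. A related number worth computing is what a fixed-price server costs per hour of actual use. The sketch below assumes a 30-day month and uses the EUR 360/month A4000 price quoted in this guide; both function names are illustrative.

```python
def utilization(active_hours_per_day: float) -> float:
    """Fraction of the day the GPU is actually serving requests."""
    return active_hours_per_day / 24.0


def cost_per_active_hour(monthly_price: float, active_hours_per_day: float,
                         days: int = 30) -> float:
    """Effective price of each hour the server is really used."""
    return monthly_price / (active_hours_per_day * days)


print(f"utilization: {utilization(4):.0%}")
print(f"effective cost per active hour: {cost_per_active_hour(360, 4):.2f}")
```

At four hours per day, the fixed-price server works out to 3.00 per active hour, which is the figure to compare against per-GPU-hour and per-token alternatives.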

The fifth mistake is assuming Ampere GPUs “cannot do inference” because they lack FP8. The A5000 benchmarks in this guide demonstrate production-viable throughput. FP8 improves throughput per watt, but INT4/INT8 quantization through vLLM runs on any NVIDIA GPU including Ampere.

Use Case Routing

Teams running steady-state inference on models up to 13B parameters have the clearest hardware decision. If the workload runs more than 12 to 16 hours per day from the EU with GDPR requirements and predictable billing, Hostline’s dual RTX A5000 at EUR 903/month provides 48 GB aggregate VRAM, ECC RAM, iDRAC 9 remote management, and zero egress at the lowest sustained monthly cost in this comparison. Hetzner GEX44 at EUR 184/month is the entry point for development and low-concurrency production endpoints where 20 GB VRAM and FP8 support matter more than multi-GPU scaling. For models in the 30B to 70B range that need 96 GB on a single GPU, Hetzner GEX131 at EUR 889/month with Blackwell-generation FP8/FP4 is the strongest bare-metal option.

Variable and spiky inference traffic changes the math entirely. RunPod Serverless with the vLLM worker image provides scale-to-zero at per-second billing, which means a chatbot that goes from 1,000 concurrent users to 20 overnight pays proportionally. Lambda’s Inference API at $0.02 to $0.90/Mtok serves the same pattern without infrastructure management. OVHcloud L40S instances provide an EU-sovereign middle ground with 48 GB GDDR6 and HDS compliance for health data workloads that neither RunPod nor Lambda addresses.

Large-model serving at 70B+ parameters with long context lengths narrows the set to providers with high-bandwidth HBM memory. CoreWeave’s B200 at $8.60/GPU-hr with 8+ TB/s bandwidth delivers the strongest documented inference throughput per the Artificial Analysis benchmark. Nebius H200 at $1.45 to $3.50/GPU-hr provides 141 GB HBM3e at 4.8 TB/s with EU sovereignty. AWS Inf2 offers the lowest per-instance cost at $0.758/hr if the model compiles on Neuron SDK. Azure MI300X provides 192 GB HBM3 per GPU at roughly $11/GPU-hr on-demand for teams open to AMD hardware, making it the only option in this comparison that can hold a 405B-class model on as few as two GPUs within a single node, with no cross-node sharding.

Three Findings

Inference infrastructure is a different purchasing decision from training infrastructure, and treating them as one decision is the root cause of most GPU overspending in production AI deployments.

First, memory bandwidth determines inference throughput more than raw compute. The H200 delivers up to 1.9x H100 inference throughput on Llama 2 70B at identical compute because of 4.8 TB/s versus 3.35 TB/s bandwidth. For large-model serving, selecting by memory bandwidth is more predictive of production throughput than selecting by TFLOPS or price-per-GPU-hour.
Second, quantization has lowered the hardware floor so far that a EUR 360/month RTX A4000 is production-grade for 8B model serving, and a EUR 903/month dual A5000 handles parallel 8B inference replicas at a monthly cost lower than any cloud alternative at sustained utilization. The models that most production teams actually serve (7B to 13B parameter chat, instruction, and RAG workloads) do not require data center GPUs. They require VRAM, a working vLLM installation, and uptime.
Third, the right pricing model depends on the utilization curve, not the GPU model. Dedicated bare metal for steady loads above 60 percent utilization. Per-token APIs for spiky traffic below 40 percent. Cloud GPU-hours for the middle ground. Hostline leads in fixed monthly bare-metal pricing for the EU. Lambda and RunPod lead in per-token and serverless inference. CoreWeave and Azure MI300X lead in large-model serving throughput. The decision framework, not any single provider, is what the reader should take away.
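The third finding reduces to a small routing rule. The sketch below encodes the 60 and 40 percent thresholds stated above. The function name and return labels are illustrative, and real decisions would also weigh compliance, latency, and model size.

```python
def pricing_model_for(utilization: float) -> str:
    """Map a measured utilization fraction (0.0 to 1.0) to a pricing model."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if utilization > 0.60:
        return "dedicated bare metal"   # steady load: fixed monthly cost wins
    if utilization < 0.40:
        return "per-token API"          # spiky load: pay only for tokens served
    return "cloud GPU-hours"            # middle ground


for u in (0.17, 0.50, 0.85):
    print(f"{u:.0%} utilization -> {pricing_model_for(u)}")
```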

FAQ

Which GPU server in this comparison can run Llama 3.1 8B inference for under EUR 400/month?

Hostline’s single RTX A4000 at EUR 360/month serves Llama 3.1 8B at INT4 (under 8 GB, leaving over 8 GB for KV cache and batched requests). FP16 weights alone are roughly 16 GB, which fills the A4000’s 16 GB VRAM and leaves no room for KV cache, so a quantized build is the practical configuration on this card. The plan includes 64 GB DDR4 ECC RAM, iDRAC 9 Enterprise remote management, full root access, and zero egress fees. Hetzner GEX44 at EUR 184/month provides 20 GB VRAM with native FP8 support and NVMe storage, but does not include ECC RAM.

How much VRAM do I need to serve a 70B parameter model?

At INT4 quantization (AWQ or GPTQ), Llama 3.3 70B occupies roughly 35 to 46 GB for weights. Add KV cache at your target context length and batch size, plus 2 to 3 GB framework overhead. At 4K context with moderate batching, total usage is roughly 40 to 55 GB, fitting on a single L40S (48 GB with tight headroom), H100 (80 GB), GEX131 (96 GB), or H200 (141 GB). At 128K context, KV cache can exceed 40 GB, pushing total above 80 GB. That is more than a single 80 GB H100 can hold, so it requires a GEX131 (96 GB), an H200 (141 GB), or multiple GPUs.
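The sizing arithmetic can be sketched as follows. KV cache per token is 2 (keys and values) times layers times KV heads times head dimension times bytes per element. The shape values below (80 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Llama 3 70B configuration, the cache is assumed to be FP16, and the 40 GB weight figure, 3 GB overhead, and 15 percent headroom are illustrative picks within the ranges given in this guide.

```python
from dataclasses import dataclass


@dataclass
class ModelShape:
    layers: int
    kv_heads: int
    head_dim: int


# Assumption: Llama 3 70B-class shape with grouped-query attention.
LLAMA_70B = ModelShape(layers=80, kv_heads=8, head_dim=128)


def kv_cache_gb(shape: ModelShape, context_tokens: int,
                concurrent_seqs: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB for the given context length and concurrency."""
    per_token = 2 * shape.layers * shape.kv_heads * shape.head_dim * bytes_per_elem
    return per_token * context_tokens * concurrent_seqs / 1e9


def total_vram_gb(weights_gb: float, kv_gb: float,
                  overhead_gb: float = 3.0, headroom: float = 0.15) -> float:
    """Weights + KV cache + framework overhead, padded by a headroom fraction."""
    return (weights_gb + kv_gb + overhead_gb) * (1 + headroom)


for context, seqs in ((4_096, 8), (131_072, 1)):
    kv = kv_cache_gb(LLAMA_70B, context, seqs)
    total = total_vram_gb(40.0, kv)
    print(f"context={context:>7}, sequences={seqs}: KV {kv:.1f} GB, total {total:.1f} GB")
```

A single 128K-token sequence alone accounts for roughly 43 GB of KV cache under these assumptions, which is why long-context serving outgrows an 80 GB card even at INT4.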

Is a dedicated GPU server cheaper than a per-token API for LLM inference?

The answer depends on model size as well as utilization. For a small model, the break-even between Hostline A4000 (EUR 360/month) and Lambda’s $0.02/Mtok API for Llama 3.2-3B is roughly 18 billion tokens per month (treating EUR and USD as roughly equal), which is substantial production volume. For a 70B model, the break-even between Hostline dual A5000 (EUR 903/month) and Lambda’s $0.20/Mtok for Llama 3.3 70B is roughly 4.5 billion tokens per month, a threshold many production chatbots exceed. The non-obvious factor is model loading time: per-token APIs have zero cold start for the user, while a dedicated server on SATA SSD may take 2 to 5 minutes to load a 13B model from disk after a restart. For workloads that need instant failover, the API’s always-warm advantage can matter more than the per-token cost.
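The break-even figures come from dividing the fixed monthly price by the per-million-token API price. The sketch below reproduces them under two assumptions: EUR and USD are treated at parity, and a month is 30 days. It also converts the break-even volume into the sustained tokens per second the dedicated server would have to deliver, which is worth comparing against measured throughput. Function names are illustrative.

```python
def break_even_tokens_per_month(monthly_server_cost: float,
                                api_price_per_mtok: float) -> float:
    """Monthly token volume at which the fixed server cost equals the API bill."""
    return monthly_server_cost / api_price_per_mtok * 1_000_000


def required_tokens_per_second(tokens_per_month: float, days: int = 30) -> float:
    """Average sustained throughput needed to reach that volume."""
    return tokens_per_month / (days * 24 * 3600)


# Prices as quoted in this guide; EUR/USD parity is an assumption.
for label, server_cost, api_price in (
    ("A4000 vs $0.02/Mtok", 360.0, 0.02),
    ("dual A5000 vs $0.20/Mtok", 903.0, 0.20),
):
    tokens = break_even_tokens_per_month(server_cost, api_price)
    rate = required_tokens_per_second(tokens)
    print(f"{label}: {tokens / 1e9:.1f} billion tokens/month, ~{rate:,.0f} tok/s sustained")
```

If the sustained rate exceeds what the server achieves in a load test, the break-even volume is not reachable on that hardware and the API remains cheaper at any volume the server can actually serve.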

Can I run production LLM inference on an RTX A5000?

Yes. The A5000’s 24 GB GDDR6 and 768 GB/s memory bandwidth handle models up to 8B at FP16 and up to 32B at INT4 with production-level latency. The throughput gap versus an FP8-capable GPU (Ada Lovelace or Hopper) is roughly 1.5 to 2x on the same model at the same batch size, because FP8 halves the bytes read per parameter relative to FP16. For applications where time-to-first-token under 500 ms matters more than peak throughput, the A5000 is competitive. For high-concurrency batch serving where tokens-per-second-per-dollar is the primary metric, a Hetzner GEX44 (20 GB, native FP8, EUR 184/month) may deliver better throughput despite lower memory bandwidth, because FP8 acceleration reduces the bandwidth bottleneck.

What is the difference between Inferentia2 and NVIDIA GPUs for inference?

AWS Inferentia2 is a purpose-built inference accelerator using the Neuron SDK rather than CUDA. Model compilation takes 10 to 45 minutes and standard transformer architectures (Llama, Mistral, Gemma) compile cleanly. Custom CUDA kernels do not work on Inferentia. AWS documents 25 to 40 percent lower cost per inference versus comparable NVIDIA instances on supported models. For teams already in the AWS ecosystem running standard architectures at high volume, Inferentia is cost-competitive. For teams with custom CUDA code, NVIDIA GPUs remain the practical choice.

How do I measure inference throughput before committing to a GPU server?

Provision the target GPU configuration (RunPod offers $25 free credit, Lambda bills per minute) and load your actual model at your target quantization. Run a load test at your expected peak concurrency using a tool like locust or k6 against the vLLM OpenAI-compatible endpoint. Measure three numbers: tokens per second per user at target batch size, time-to-first-token at the 95th percentile, and sustained GPU memory utilization under load. If GPU memory utilization stays below 80 percent at peak, you have headroom. If it exceeds 95 percent, the next request spike is likely to hit an out-of-memory error.
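For a first pass before setting up locust or k6, the same measurements can be sketched with the Python standard library. The example below assumes a vLLM OpenAI-compatible server at http://localhost:8000 serving the streaming /v1/completions route. The ENDPOINT and MODEL values are placeholders, and streamed chunks are counted as an approximation of tokens, since one chunk is not guaranteed to carry exactly one token.

```python
import json
import math
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8000/v1/completions"  # assumption: local vLLM server
MODEL = "my-model"  # hypothetical: the model name the server was launched with


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def timed_request(prompt, max_tokens=128):
    """Stream one completion; return (time_to_first_chunk_s, chunks_per_s)."""
    body = json.dumps({"model": MODEL, "prompt": prompt,
                       "max_tokens": max_tokens, "stream": True}).encode()
    request = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    first_chunk_at = None
    chunks = 0
    with urllib.request.urlopen(request) as response:
        for raw in response:  # server-sent events arrive line by line
            line = raw.decode().strip()
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter() - start
            chunks += 1
    return first_chunk_at, chunks / (time.perf_counter() - start)


def run_load_test(prompts, concurrency):
    """Send all prompts with a fixed number of concurrent streams."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_request, prompts))
    ttfts = [ttft for ttft, _ in results if ttft is not None]
    rates = [rate for _, rate in results]
    return {"p95_ttft_s": percentile(ttfts, 95),
            "mean_chunks_per_s_per_user": sum(rates) / len(rates)}


# Example (requires a running server):
# print(run_load_test(["Explain KV cache in one paragraph."] * 64, concurrency=16))
```

GPU memory utilization is not visible from the client side and has to be read on the server while the test runs, for example with nvidia-smi.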

References

NVIDIA Corporation. H100 SXM5 datasheet: 80 GB HBM3, 3.35 TB/s memory bandwidth, 989 TFLOPS FP16. H200 product brief: 141 GB HBM3e, 4.8 TB/s, documented up to 1.9x inference throughput improvement over H100 on Llama 2 70B at identical compute architecture. B200 SXM6 datasheet: 192 GB HBM3e, 8 TB/s. RTX A5000 specifications: 24 GB GDDR6, 768 GB/s, Ampere architecture, 27.8 TFLOPS FP32. RTX PRO 6000 Blackwell Max-Q specifications: 96 GB GDDR7, 1.79 TB/s, 5th-generation Tensor Cores with FP4 and FP8 support. All datasheets accessed from nvidia.com/en-us/data-center/ and nvidia.com/en-us/design-visualization/, May 2026.
AMD. Instinct MI300X product documentation: 192 GB HBM3, 5.3 TB/s memory bandwidth, 1,307 TFLOPS FP16. ROCm blog post: “ROCm Becomes a First-Class Platform in the vLLM Ecosystem,” documenting first-class status with vLLM v0.14.0 and native SGLang support. Accessed from rocm.blogs.amd.com and amd.com/en/products/accelerators/instinct/mi300x.html, May 2026.
Hostline (hostline.io). GPU dedicated server pricing page: single RTX A4000 at EUR 360/month (16 GB GDDR6 ECC, 1x Xeon Gold 6130, 64 GB DDR4 ECC, 2x 960 GB SSD, 1 Gbps, iDRAC 9 Enterprise); dual RTX A5000 at EUR 903/month (2x 24 GB GDDR6 ECC, 2x Xeon Gold 6130, 128 GB DDR4 ECC, 2x 1.92 TB SSD, 1 Gbps); triple RTX A5000 at EUR 1,220/month (3x 24 GB, 256 GB DDR4 ECC). Vilnius data center. No per-hour metering, no egress fees, no commitment minimum. GPU VPS product referenced as quote-based with variable SKU availability. hostline.io/dedicated-servers/gpu-servers/
Hetzner Online GmbH. GEX44 product page: RTX 4000 SFF Ada Generation, 20 GB GDDR6 ECC, Intel Core i5-13500, 64 GB DDR4, 2x 1.92 TB NVMe Gen3 RAID 1, EUR 184/month, EUR 79 setup fee on monthly. GEX131 press release: RTX PRO 6000 Blackwell Max-Q, 96 GB GDDR7, Intel Xeon Gold 5412U, 256 GB DDR5 ECC, 2x 960 GB NVMe Gen4, EUR 889/month or EUR 1.42/hr. December 2025 Hetzner Community forum reference citing EUR 1,057.91/month pricing discrepancy. Optional 10 Gbps uplink. ISO 27001 certified. hetzner.com/dedicated-rootserver/matrix-gpu/
DatabaseMart. “A5000 vLLM Benchmark: Performance Testing for Hugging Face LLMs.” Published benchmark documenting RTX A5000 inference throughput under vLLM with PagedAttention and continuous batching: Qwen2.5-3B-Instruct at 2,714.88 tokens/s, DeepSeek-R1-Distill-Qwen-1.5B at 3,935.99 tokens/s at high concurrency. Test conditions include PagedAttention memory management and continuous batching at varying concurrent request levels. databasemart.com/blog/vllm-gpu-benchmark-a5000
Lambda, Inc. GPU cloud pricing page (lambda.ai/pricing): A100 80 GB at $1.48/GPU-hr, H100 SXM at $2.99-$3.99/GPU-hr, B200 SXM6 at $4.99-$6.99/GPU-hr. Per-minute billing, zero egress, pre-configured Lambda Stack (PyTorch, CUDA, vLLM). Inference API pricing: $0.02/Mtok (Llama 3.2-3B-Instruct), $0.20/Mtok (Llama 3.3 70B Instruct FP8), $0.90/Mtok (Llama 3.1 405B Instruct). Lambda blog: “Introducing the Lambda Inference API: Lowest-Cost Inference Anywhere.” VentureBeat coverage: “Lambda launches inference-as-a-service API claiming lowest costs in AI industry.” SOC 2 Type II.
RunPod, Inc. GPU Pod pricing: RTX 4090 from $0.34/hr Community Cloud, A100 80 GB from $1.89/hr Secure Cloud, H100 SXM from $2.69/hr, B200 from $4.99-$5.98/hr. Serverless inference with scale-to-zero, sub-200ms FlashBoot cold starts, native vLLM worker image (runpod/worker-vllm:latest). “2026 State of AI Report” (PRNewswire, March 12, 2026): “vLLM has become the de facto standard for LLM serving, powering 40% of all LLM endpoints on the platform.” $25 free credit, Startup Program with up to 1,000 H100 hours. SOC 2 Type II since October 2025. runpod.io/pricing
CoreWeave, Inc. (Nasdaq: CRWV). Cloud pricing page: HGX H100 at $49.24/hr ($6.16/GPU-hr), HGX H200 at $50.44/hr, HGX B200 at $68.80/hr ($8.60/GPU-hr). Inference-specific single-GPU pricing tier published. 400 Gb/s NDR InfiniBand on Hopper, 800 Gb/s Quantum-X800 on Blackwell. BusinessWire (May 11, 2026): “CoreWeave Achieves #1 Ranking for Inference Speed and Price-Performance for Moonshot AI’s Kimi K2.6 Model in Independent Benchmark,” reporting 205 tok/s at $0.70/Mtok blended cost. Reserved discounts up to 60 percent. Zero egress. SOC 2, ISO 27001, HIPAA, FedRAMP. coreweave.com/pricing
Nebius B.V. (Helsinki, Finland). GPU pricing page (nebius.com/prices): L40S from $0.74-$1.82/GPU-hr, RTX PRO 6000 from $0.95-$1.80/GPU-hr, HGX H100 from $1.25-$2.95/GPU-hr, HGX H200 from $1.45-$3.50/GPU-hr. 3,200 Gbps InfiniBand on HGX configurations. Brave Search case study: production AI summaries for 11+ million daily queries. NVIDIA Exemplar Status. Managed Kubernetes and Slurm.
OVHcloud. L40S Cloud GPU instance documentation: 48 GB GDDR6, 864 GB/s, native FP8. L4 Cloud GPU instance: 24 GB GDDR6 for compact inference. HDS (French health data hosting) certified. ISO 27001, SOC. 99.99 percent monthly SLA on GPU instances. Vendor-published 64 percent cost reduction versus hyperscalers on Llama 3.1 inference (not independently verified). ovhcloud.com/en/public-cloud/gpu/
Amazon Web Services. Inferentia2 product page and Inf2 instance pricing: inf2.xlarge (1 chip, 32 GB) at $0.758/hr, inf2.48xlarge (12 chips, 384 GB via NeuronLink) at $12.981/hr. Neuron SDK documentation: model compilation workflow, supported architectures (Llama, Mistral, Gemma via Hugging Face Optimum-Neuron). AWS-published 25-40 percent cost-per-inference savings versus comparable NVIDIA instances on supported transformer models. G6/G6e instances (L4/L40S) for CUDA-based inference. Per-GB egress pricing. SOC 2, ISO 27001, HIPAA, FedRAMP, ITAR.
Microsoft Azure. ND MI300X v5 documentation: 8x AMD Instinct MI300X, 192 GB HBM3 per GPU (1,536 GB aggregate), 5.3 TB/s per GPU, Quantum-2 InfiniBand at 3,200 Gbps. On-demand pricing approximately $88.49/hr starting per Sparecores and CloudPrice third-party trackers (pricing varies by region; verify via Azure Pricing Calculator). SOC 2, ISO 27001, HIPAA, FedRAMP.
Puget Systems. RTX PRO 6000 Blackwell Max-Q versus Workstation Edition performance comparison: Max-Q variant (300W) runs approximately 5-14 percent slower than the 600W Workstation Edition across tested workloads. Referenced for the GEX131’s inference throughput expectations.
Spheron Network. “GPU Cloud Pricing 2026: H100 from $1.03/hr, B200 from $2.12/hr (15+ providers).” Cross-reference pricing tracker for Lambda, RunPod, Nebius, CoreWeave, and Vast.ai rates. spheron.network/blog/gpu-cloud-pricing-comparison-2026/
Together AI. Serverless inference pricing: Llama 3.3 70B at $0.88/Mtok input and output. Dedicated H100 80 GB endpoints at $6.49/hr. FlashAttention-3 achieving 840 TFLOPS BF16 on H100, 85 percent utilization (vendor-published). together.ai/pricing
Vast.ai. GPU marketplace pricing: H100 PCIe from $1.97/hr, H100 NVL from $1.76/hr, RTX 4090 from $0.31/hr. Per-second billing. No platform SLA. Networking varies by host. vast.ai

Editorial Note

This article is published on hostline.io by Hostline. Hostline is one of the ten providers compared and is positioned for EU bare-metal inference workloads at fixed monthly cost, appearing at position 2 in the provider list. The same evaluation template applies to every provider. Providers are grouped by infrastructure tier. No provider is ranked #1 or described as “best.”

Where competitors outperform Hostline on specific dimensions: Hetzner GEX44 offers lower entry pricing (EUR 184 vs EUR 360) with native FP8 and NVMe storage. Hetzner GEX131 offers 4x the VRAM per GPU (96 GB vs 24 GB) with Blackwell-generation tensor cores and NVMe. Lambda offers both on-demand GPU rentals and per-token inference API from a single vendor. RunPod offers serverless scale-to-zero that eliminates idle GPU cost. Nebius offers EU-sovereign H200 at $1.45 to $3.50/GPU-hr with 4.8 TB/s HBM3e bandwidth. CoreWeave ranked #1 in an independent inference benchmark. AWS Inf2 offers a purpose-built inference chip with 25-40 percent cost savings on supported models. Azure MI300X offers 192 GB HBM3 per GPU, the most VRAM per GPU for serving 405B-class models without cross-node sharding.