GPU Servers for LLM Training in 2026: VRAM, Interconnect, and Pricing Across 13 Providers

Agneta Venckutė 2026-06-02

Training a 7B parameter model on a single GPU is a solved problem. A 24 GB card, a monthly server rental, and a LoRA configuration file get the job done for under EUR 400 a month. Training a 70B model is a different category of problem entirely. The model’s weights alone exceed 140 GB at BF16 precision before optimizer states, activations, and gradient buffers are added. No single GPU holds that. The workload splits across multiple GPUs, and the moment it splits, the interconnect between those GPUs becomes the performance bottleneck rather than the GPUs themselves. Scaling further to pretraining at frontier scale introduces a third constraint: the network fabric connecting multiple servers, where the gap between a budget bare-metal server’s 1 Gbps Ethernet and a hyperscaler’s 3,200 Gbps InfiniBand RDMA spans three orders of magnitude.

Most comparison guides treat this as a single shopping decision. They list GPU cloud providers, rank them by price, and place themselves at position one. The reality is that the market contains at least four distinct infrastructure tiers, each serving a different stage of the LLM development lifecycle, and a provider that excels at one tier is often irrelevant to another. A team LoRA-tuning a 7B model on a single GPU does not need InfiniBand. A team pretraining a 70B model across 64 GPUs does not benefit from knowing which bare-metal server has the lowest monthly price.

This guide is built around the hardware constraints that determine which tier a workload belongs to before any provider comparison becomes useful. The sections below cover VRAM sizing math, the interconnect hierarchy from PCIe through InfiniBand, and the checkpoint I/O bottleneck that most buyers overlook. The thirteen provider evaluations that follow are grouped by infrastructure tier rather than ranked in a single hierarchy.

Across the thirteen providers evaluated, the spread in pricing and capability is enormous. At the entry tier, Hostline offers a dedicated GPU server for EUR 360/month with 16 GB GDDR6 in Vilnius, Lithuania. At the top end, AWS P6 charges approximately $14.24/GPU-hr for an on-demand B200.

In the specialized AI cloud tier, pricing has compressed significantly through 2025 and 2026. Together AI’s reserved H100 at $1.76/GPU-hr is the lowest documented rate from a managed provider. Nebius offers EU-sovereign H100 and H200 with committed pricing from $2.30/GPU-hr. Lambda publishes real-time pricing with on-demand B200 from $4.99/GPU-hr. RunPod charges per second starting at $2.69/GPU-hr for H100 SXM. CoreWeave offers the widest Blackwell GPU selection, including GB200 and GB300 NVL72.

Among the hyperscalers, Azure stands out by offering the MI300X at roughly $6/GPU-hr with 192 GB HBM3 per GPU, a rate Oracle OCI matches. Google Cloud’s A3 Ultra delivers the highest documented fabric bandwidth at 3,600 Gbps. Oracle OCI provides bare-metal GPU instances with no hypervisor and up to 61.4 TB of local NVMe. FluidStack builds enterprise-scale custom clusters and has been selected by Anthropic for dedicated builds. Vast.ai offers the lowest marketplace floor with H100 NVL from $1.76/GPU-hr. Hetzner GEX131 provides 96 GB of Blackwell-generation VRAM in a single bare-metal GPU for EUR 889/month.


Methodology

GPU specifications are verified against NVIDIA datasheets (H100 SXM5, H200, B200 SXM6, GB200 NVL72, GB300 NVL72) and AMD MI300X product documentation. Provider pricing was captured from public pricing pages and cross-referenced against third-party trackers (Vantage, GPUPerHour, Spheron, CheckThat.ai) during May 2026. Where pricing varies by instance size, region, or commitment term, ranges are cited. Compliance certifications are verified from provider trust pages. Where a provider does not publicly disclose a specification (CPU model, storage type, interconnect topology), this is noted rather than inferred. Where performance claims originate from a provider’s own benchmarks, commercial interest is flagged explicitly.

Providers are grouped by infrastructure tier: EU bare-metal servers, specialized AI clouds, hyperscalers, enterprise custom builders, and GPU marketplaces. Within each tier, ordering follows VRAM capacity per GPU from highest to lowest. No provider is ranked #1 overall.

Why VRAM Determines What You Can Train

VRAM is the first constraint that filters the entire provider set, because it is binary: if the model does not fit, the GPU cannot run the workload regardless of clock speed, memory bandwidth, or price.

The arithmetic is straightforward but frequently misunderstood. Mixed-precision training with AdamW requires approximately 16 to 18 bytes per parameter in aggregate: 2 bytes for the BF16 model weights, 2 bytes for BF16 gradients, 4 bytes for an FP32 master copy of the weights, 4 bytes for FP32 momentum (Adam’s first moment), and 4 bytes for FP32 variance (Adam’s second moment). The ZeRO paper (Rajbhandari et al., SC 2020) documents this as K=12 for optimizer states alone, totaling 16 bytes per parameter. HuggingFace’s training documentation cites 18 bytes per parameter including activation overhead.

For a 7B parameter model, this translates to roughly 112 to 126 GB of VRAM for the full training working set before activations. Adding activation memory for a reasonable batch size pushes the total above 130 GB. That exceeds the 80 GB on a single A100 or H100, which is why full fine-tuning a 7B model requires either multiple GPUs with ZeRO-3 sharding the weights, gradients, and optimizer states across them, or memory-saving techniques such as optimizer offloading or 8-bit optimizers to fit on one 80 GB card. LoRA sidesteps this by freezing the base weights and training only small adapter matrices, which drops the working set to 10 to 16 GB and makes a 24 GB GPU viable.

A 70B model at BF16 occupies roughly 140 GB for weights alone. At 16 bytes per parameter, the full training working set (weights, gradients, optimizer states) reaches approximately 1.12 TB before activations. No single GPU or even a pair of GPUs holds this. A full fine-tune of a 70B model requires at minimum a complete 8-GPU HGX node (640 GB on H100 or 1,128 GB on H200) with ZeRO-3 sharding optimizer states, gradients, and parameters across all eight GPUs, plus activation checkpointing to trade compute for memory. NVIDIA’s own DGX Cloud benchmarking recipes for Llama 2 70B specify “at least 64 GPUs with at least 80 GB memory each.” In practice, a single 8-GPU H200 node can fit a 70B full fine-tune with aggressive memory optimization (ZeRO stage 3, activation checkpointing, gradient accumulation at small micro-batch sizes), but production training typically uses 8 to 64 GPUs depending on target throughput. QLoRA at 4-bit quantization reduces the trainable working set to 30 to 50 GB, fitting on a single 80 GB or 96 GB GPU, but the resulting adapter is substantially smaller in capacity than a full fine-tune.
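The byte accounting above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the decimal-gigabyte convention (1 GB = 10^9 bytes) are assumptions, and the result excludes activations, framework overhead, and memory fragmentation.

```python
# Bytes per parameter for mixed-precision AdamW, as itemized in the text.
BYTES_PER_PARAM = {
    "bf16_weights": 2,
    "bf16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def bf16_weights_gb(num_params: float) -> float:
    """Size of the BF16 weights alone, in decimal GB."""
    return num_params * BYTES_PER_PARAM["bf16_weights"] / 1e9


def training_working_set_gb(num_params: float) -> float:
    """Weights + gradients + optimizer states in decimal GB, before activations."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


if __name__ == "__main__":
    for name, n in [("7B", 7e9), ("70B", 70e9)]:
        print(
            f"{name}: weights {bf16_weights_gb(n):.0f} GB, "
            f"working set {training_working_set_gb(n):.0f} GB before activations"
        )
```

Comparing the printed working set against the per-GPU or per-node VRAM of a candidate server gives a first-pass answer on whether sharding or an adapter-based method is needed.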

Pretraining a frontier model from scratch (100B+ parameters) requires distributed memory pools measured in terabytes. An 8-GPU H100 node provides 640 GB total VRAM. With ZeRO-3 sharding optimizer states, gradients, and parameters across all eight GPUs, this is sufficient for full fine-tuning of models up to approximately 70B parameters on a single node, though activation memory and batch size constraints may require activation checkpointing. Pretraining at that scale or above requires multiple nodes, and the network between them becomes the bottleneck, which is where the interconnect hierarchy takes over as the binding constraint.

The sizing table below maps common workloads to minimum VRAM requirements. These figures assume standard training configurations without aggressive memory optimization techniques like activation checkpointing, which trade compute for memory and can reduce VRAM requirements by 30 to 50 percent at a throughput cost.

Workload	Min VRAM/GPU	Min GPUs	Key Constraint	Example Providers
7B LoRA fine-tune	16-24 GB	1	VRAM capacity	Hostline A4000/A5000, Hetzner GEX44
7B-13B full fine-tune	48-80 GB	1-2	VRAM + optimizer states (16 bytes/param)	Hetzner GEX131 (96 GB), RunPod A100 80 GB
30B-70B full fine-tune	80+ GB/GPU	8-64	NVLink + ZeRO-3 required	Lambda H100 SXM, RunPod H200, CoreWeave B200
70B+ pretraining	80+ GB/GPU	8-64+	InfiniBand required	CoreWeave, Together AI, Nebius, Lambda clusters
Frontier pretraining (200B+)	141+ GB/GPU	256+	3,200+ Gbps fabric	AWS, Azure, GCP, Oracle, CoreWeave

The Interconnect Hierarchy: PCIe, NVLink, and InfiniBand

The GPU server market contains three interconnect tiers, each separated by roughly an order of magnitude in bandwidth, and each enabling a different class of training workload.

PCIe Gen3 provides roughly 16 GB/s per direction (32 GB/s bidirectional) between CPU and GPU. This is what Hostline’s Xeon Gold 6130 servers provide with their RTX A5000 GPUs. At this bandwidth, a single GPU operates at full efficiency for workloads that fit in its VRAM. Two or three GPUs sharing a PCIe bus can run data-parallel training with periodic gradient synchronization, but the synchronization overhead limits scaling efficiency to roughly 70 to 85 percent on communication-heavy workloads like transformer training. Tensor parallelism, which splits individual layers across GPUs and requires constant inter-GPU communication during every forward and backward pass, is not viable over PCIe because the bandwidth is 28x to 56x below what NVLink provides (900 GB/s on Hopper, 1.8 TB/s on Blackwell).

NVLink 4 (Hopper) provides 900 GB/s per GPU bidirectional through an NVSwitch fabric on HGX H100 and H200 nodes. NVLink 5 (Blackwell) doubles this to 1.8 TB/s on HGX B200 nodes. At these bandwidths, tensor parallelism works efficiently within a single node, and an 8-GPU node behaves almost like a single large GPU for training purposes. This is why the distinction between H100 PCIe ($1.97/hr on Vast.ai, no NVLink) and H100 SXM ($2.69/hr on RunPod, with NVSwitch) matters far more than the price difference suggests. Multi-GPU training throughput can differ by 2 to 3x between PCIe and NVLink configurations running the same model at the same batch size.

InfiniBand and RDMA fabrics (400 to 3,600 Gbps depending on generation and provider) connect nodes to each other. Once a training job exceeds the VRAM of a single 8-GPU node, the inter-node fabric becomes the bottleneck. The spread is extreme: Hostline and Hetzner operate at 1 Gbps Ethernet (adequate for single-node workloads but unusable for multi-node training), while CoreWeave offers 800 Gb/s Quantum-X800 InfiniBand on Blackwell nodes, Lambda and Nebius provide 3,200 Gbps Quantum-2 InfiniBand, and Google Cloud’s A3 Ultra delivers 3,600 Gbps GPUDirect RDMA. A team that outgrows its single 8-GPU node cannot solve the problem by buying a second bare-metal server and connecting them over Ethernet. The fabric tier must change, which usually means migrating from a bare-metal provider to a specialized AI cloud or hyperscaler.
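To put the three tiers on one scale, the sketch below estimates how long a single gradient synchronization would take over each link. It is a rough lower bound under stated assumptions, not a benchmark: it assumes a ring all-reduce (each worker transfers 2(N-1)/N of the payload), ignores latency, protocol overhead, and overlap with compute, takes NVLink 4 per-direction bandwidth as half of the 900 GB/s bidirectional figure, and treats the 3,200 Gbps fabric as a single 400 GB/s link. The function names and the 14 GB payload (BF16 gradients of a 7B model) are illustrative.

```python
def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


def ring_allreduce_seconds(payload_gb: float, workers: int, link_gb_per_s: float) -> float:
    """Idealized ring all-reduce time: each worker moves 2*(N-1)/N of the payload."""
    if workers < 2:
        return 0.0  # nothing to synchronize with a single worker
    return 2 * (workers - 1) / workers * payload_gb / link_gb_per_s


# Per-direction bandwidth assumptions in GB/s (see the lead-in for caveats).
LINKS_GB_PER_S = {
    "PCIe Gen3 x16": 16.0,
    "NVLink 4 (Hopper)": 450.0,
    "1 Gbps Ethernet": gbps_to_gb_per_s(1),
    "3,200 Gbps fabric": gbps_to_gb_per_s(3200),
}

if __name__ == "__main__":
    gradients_gb = 14.0  # 7B parameters x 2 bytes (BF16)
    for link, bandwidth in LINKS_GB_PER_S.items():
        seconds = ring_allreduce_seconds(gradients_gb, workers=2, link_gb_per_s=bandwidth)
        print(f"{link}: {seconds:.3f} s per synchronization")
```

Even in this idealized model, the 1 Gbps Ethernet case lands at roughly two minutes per synchronization, which is the arithmetic behind calling it unusable for multi-node training.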

Precision Formats and What They Change for Training Cost

The practical impact of FP8 and FP4 precision formats is often described abstractly. The concrete effect is measurable. Together AI’s vendor-published benchmark (not independently replicated) documents 15,264 tokens/s/GPU on HGX B200 versus 8,080 tokens/s/GPU on HGX H100 for Llama 70B BF16 training with TKC and TorchTitan. That is a 90 percent throughput improvement, which means a training run that takes 1,000 H100 GPU-hours could complete in roughly 530 B200 GPU-hours. Even though B200 costs 1.5 to 2x more per GPU-hour than H100 across comparable providers, the net cost per completed training run can be lower on B200 than on H100.
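The cost argument reduces to comparing the throughput ratio with the price ratio. The sketch below works through it using the two vendor-published throughput figures quoted above; the $3.00/GPU-hr H100 price is a hypothetical placeholder, and the function names are illustrative.

```python
# Vendor-published throughput figures quoted in the text (tokens/s/GPU).
H100_TOKENS_PER_S = 8_080
B200_TOKENS_PER_S = 15_264


def speedup() -> float:
    """B200 throughput relative to H100; also the break-even price ratio."""
    return B200_TOKENS_PER_S / H100_TOKENS_PER_S


def equivalent_b200_hours(h100_gpu_hours: float) -> float:
    """GPU-hours needed on B200 to process the same number of tokens."""
    return h100_gpu_hours / speedup()


def run_cost(gpu_hours: float, price_per_gpu_hour: float) -> float:
    return gpu_hours * price_per_gpu_hour


if __name__ == "__main__":
    h100_hours = 1_000
    h100_price = 3.00  # hypothetical $/GPU-hr, for illustration only
    b200_hours = equivalent_b200_hours(h100_hours)
    print(f"Speedup: {speedup():.2f}x, B200 hours: {b200_hours:.0f}")
    print(f"H100 run cost: ${run_cost(h100_hours, h100_price):,.0f}")
    for price_ratio in (1.5, 2.0):
        cost = run_cost(b200_hours, h100_price * price_ratio)
        print(f"B200 at {price_ratio}x the H100 price: ${cost:,.0f}")
```

B200 is cheaper per completed run whenever its price premium over H100 stays below the speedup, which is why the 1.5x end of the range favors B200 and the 2x end does not.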

Ampere-generation GPUs (including the RTX A4000 and RTX A5000 in Hostline’s lineup) do not support FP8 or FP4 training. They operate at FP32, TF32, and FP16/BF16. For LoRA fine-tuning workloads where total GPU-hours per run are measured in single digits rather than thousands, the precision format gap does not materially change the economics. For full training runs lasting days or weeks, it does.

Storage I/O: The Checkpoint Bottleneck Most Buyers Overlook

A 70B model checkpoint containing only the BF16 weights is roughly 140 GB, and it grows further if optimizer states are saved alongside. On SATA SSD with roughly 500 MB/s sequential write throughput, saving that checkpoint takes approximately five minutes. On NVMe Gen4 with 3+ GB/s sequential write, the same checkpoint saves in under a minute. Over a multi-day training run with hourly checkpointing, SATA storage adds hours of cumulative idle GPU time, with GPUs sitting at zero utilization while the checkpoint writes complete.
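The checkpoint figures follow from dividing checkpoint size by sustained write throughput. The sketch below is a minimal estimate that assumes checkpointing is synchronous (training pauses until the write finishes) and that the drive sustains its sequential write rate for the whole save; the function names and the three-day run length are illustrative.

```python
def checkpoint_write_seconds(checkpoint_gb: float, write_gb_per_s: float) -> float:
    """Time to write one checkpoint at a sustained sequential write rate."""
    return checkpoint_gb / write_gb_per_s


def cumulative_idle_hours(run_days: float, checkpoints_per_day: int, seconds_each: float) -> float:
    """Total GPU idle time over a run if every checkpoint blocks training."""
    return run_days * checkpoints_per_day * seconds_each / 3600


if __name__ == "__main__":
    checkpoint_gb = 140.0  # 70B parameters x 2 bytes (BF16 weights only)
    for name, rate in [("SATA SSD", 0.5), ("NVMe Gen4", 3.0)]:
        seconds = checkpoint_write_seconds(checkpoint_gb, rate)
        idle = cumulative_idle_hours(run_days=3, checkpoints_per_day=24, seconds_each=seconds)
        print(f"{name}: {seconds:.0f} s per checkpoint, {idle:.1f} idle hours over 3 days")
```

Asynchronous checkpointing or writing to host RAM first would reduce the idle time, so the estimate is best read as the worst case for a blocking save.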

This matters concretely for two providers in this guide. Hostline’s current GPU server plans use SATA SSD storage, which is documented as a constraint for models above approximately 13B parameters where checkpoint sizes become large enough to create material idle time. Hetzner’s GEX131, by contrast, includes NVMe Gen4 (2x 960 GB) at a comparable price tier. For workloads in the 7B to 13B LoRA fine-tuning range, checkpoint sizes are small enough (typically 1 to 5 GB for adapter weights) that SATA I/O does not create a meaningful bottleneck. The distinction matters for teams scaling beyond that range.

Provider	Category	Top GPU	VRAM/GPU	Scale-Up	Scale-Out Fabric	Pricing	Egress
Hetzner GEX	EU bare-metal	RTX PRO 6000	96 GB GDDR7	PCIe	1 Gbps Ethernet	EUR 889/mo	None
Hostline	EU bare-metal	RTX A5000	24 GB GDDR6	PCIe Gen3	1 Gbps Ethernet	EUR 360-1,220/mo	None
CoreWeave	Blackwell AI cloud	GB300 NVL72	288 GB HBM3e	NVLink 5	800 Gbps IB	$6.16-$10.50/GPU-hr	None
Lambda Labs	Transparent AI cloud	B200 SXM6	192 GB HBM3e	NVLink 5	3,200 Gbps IB	$1.29-$6.99/GPU-hr	None
Nebius	EU-sovereign AI cloud	B200 HGX	192 GB HBM3e	NVLink 5	3,200 Gbps IB	$2.30-$5.50/GPU-hr	None
RunPod	Cost-efficient AI cloud	B200 SXM	192 GB HBM3e	NVLink 5	Varies	$2.69-$5.98/GPU-hr	None
Together AI	Optimized training	GB300 NVL72	288 GB HBM3e	NVLink 5	InfiniBand	$1.76-$7.49/GPU-hr	None
AWS EC2 P5/P6	Hyperscaler	B200 / GB200 NVL72	192-288 GB HBM3e	NVLink 5	3,200 Gbps EFA	$6.88-$14.24/GPU-hr	Per-GB
Azure ND-series	Hyperscaler (MI300X)	GB200 NVL72	192 GB	NVLink 5	3,200 Gbps IB	$6.00-$12.29/GPU-hr	Per-GB
Google Cloud A3/A4	Hyperscaler	B200 (A4)	192 GB HBM3e	NVLink 5	3,600 Gbps RDMA	~$10.60/GPU-hr	Per-GB
Oracle OCI	Hyperscaler bare-metal	GB200 NVL72	192 GB HBM3e	NVLink 5	RoCE v2 RDMA	$6-$10/GPU-hr	10 TB/mo free
FluidStack	Enterprise clusters	GB200 NVL72	192 GB HBM3e	NVLink 5	InfiniBand	Quote-based	None
Vast.ai	GPU marketplace	H100 NVL / B200	Up to 192 GB	Varies	Varies	From $1.76/GPU-hr	None

The Providers, Evaluated Against the Framework

EU Bare-Metal: Hetzner GEX

source: hetzner.com

Hetzner is a German hosting company that has been operating data centers since 1997. Its GPU server lineup entered the LLM-relevant conversation with the GEX131, which ships an NVIDIA RTX PRO 6000 Blackwell Max-Q GPU with 96 GB GDDR7 at EUR 889/month (or EUR 1.42/hr on hourly billing) per Hetzner’s press release. A December 2025 forum reference cites a higher rate of EUR 1,057.91/month, so the live storefront price should be verified before purchasing. The GEX44, with an RTX 4000 SFF Ada (20 GB GDDR6), starts at EUR 184/month.

The GEX131’s 96 GB VRAM is notable in context. It is the most VRAM available in a single bare-metal GPU in this comparison, and it is enough to run QLoRA fine-tuning of a 70B model, which no other bare-metal single-GPU here can hold. Storage is 2x 960 GB NVMe Gen4 SSD, a concrete advantage over SATA for checkpoint-heavy workloads. ISO 27001 certified. EU/GDPR data residency in Germany.

The constraint is isolation. A single GPU means no multi-GPU parallelism for workloads that exceed 96 GB VRAM. The 1 Gbps Ethernet means no distributed training across multiple nodes. GDDR7 bandwidth at roughly 1.8 TB/s is fast for workstation memory but significantly slower than the 3.35 to 8 TB/s HBM3/HBM3e on data center GPUs like the H100 and B200, which means bandwidth-bound layers in large transformer models run slower per FLOP than on HBM-equipped hardware. For teams whose workload fits within 96 GB on a single GPU, Hetzner is the strongest bare-metal option in this comparison. For teams that will outgrow that ceiling, the migration path leads to cloud.

Strengths

96 GB VRAM in a single bare-metal GPU at EUR 889/month; NVMe Gen4 storage; Blackwell-generation FP4 and FP8 support; EU/GDPR data residency in Germany; ISO 27001 certified.

Limitations

Single GPU only with no multi-GPU configurations; 1 Gbps Ethernet; GDDR7 bandwidth (roughly 1.8 TB/s) significantly lower than HBM3e; no NVLink; no published SLA percentage; pricing discrepancy between press release and forum reports requires verification.

EU Bare-Metal: Hostline

source: hostline.io

Hostline is a hosting infrastructure provider operating from Vilnius, Lithuania since 2011, serving the EU market with dedicated bare-metal servers, VPS, and colocation. Its GPU server lineup targets the entry tier of the LLM training market: teams running LoRA fine-tuning and adapter training on models that fit within 24 GB VRAM per GPU, with a preference for fixed monthly billing in EUR and no exposure to variable cloud costs.

Hostline provides three dedicated GPU server configurations from its Vilnius data center, all with full root access, iDRAC 9 Enterprise remote management, DDR4 ECC RAM, and zero egress fees.

The entry plan pairs a single RTX A4000 (16 GB GDDR6) with an Intel Xeon Gold 6130, 64 GB ECC RAM, and dual 960 GB SATA SSDs for EUR 360/month. The mid-tier plan doubles up with two RTX A5000 GPUs (24 GB GDDR6 each), two Xeon Gold 6130 CPUs, 128 GB ECC RAM, and dual 1.92 TB SATA SSDs for EUR 903/month. The top configuration adds a third RTX A5000 and scales system RAM to 256 GB ECC for EUR 1,220/month. All plans connect over 1 Gbps Ethernet with no per-hour metering, no commitment minimums, and DDoS protection included.

At EUR 1,220/month for three RTX A5000 GPUs running 24/7, the effective hourly rate works out to EUR 1.67/hr for the whole server, or about EUR 0.56 per GPU-hour. Even the whole-server rate is lower than the cheapest on-demand cloud H100 in this comparison ($2.69/GPU-hr on RunPod), despite the Ampere-generation hardware. Measured against that single H100 rate, the math flips at around 60 percent utilization: above that threshold, fixed monthly pricing beats hourly cloud rates. Below it, per-second cloud billing avoids paying for idle GPUs. The 256 GB DDR4 ECC system RAM on the triple A5000 plan leaves room for dataset preprocessing alongside GPU training without memory pressure on the host.
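The utilization threshold can be reproduced directly. The sketch below assumes a 730-hour month (8,760 hours divided by 12) and, like the comparison in the text, sets the EUR monthly price against the USD hourly rate without currency conversion; both are simplifying assumptions, and the function names are illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12


def effective_hourly_rate(monthly_price: float, gpus: int = 1) -> float:
    """Hourly cost of a fixed-price server at 24/7 use, optionally per GPU."""
    return monthly_price / HOURS_PER_MONTH / gpus


def breakeven_utilization(monthly_price: float, cloud_hourly_rate: float) -> float:
    """Fraction of the month at which hourly billing costs as much as the fixed price."""
    return monthly_price / (cloud_hourly_rate * HOURS_PER_MONTH)


if __name__ == "__main__":
    monthly = 1220.0   # triple RTX A5000 plan, EUR/month
    cloud_rate = 2.69  # RunPod H100 SXM, $/GPU-hr, treated at currency parity
    print(f"Whole server: {effective_hourly_rate(monthly):.2f}/hr")
    print(f"Per GPU: {effective_hourly_rate(monthly, gpus=3):.2f}/hr")
    print(f"Break-even utilization: {breakeven_utilization(monthly, cloud_rate):.0%}")
```

The comparison is between three Ampere GPUs and one H100, so the threshold says nothing about relative throughput; it only marks where the two bills are equal.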

The tradeoffs are concrete and documented. Ampere-generation RTX A5000 GPUs lack FP8 and FP4 precision formats, which means training throughput on Hostline hardware cannot benefit from the mixed-precision acceleration available on Hopper and Blackwell GPUs. The Intel Xeon Gold 6130 CPUs are Skylake-SP vintage (2017) providing PCIe Gen3, which constrains CPU-to-GPU data transfer bandwidth compared to Gen4 or Gen5 systems. Storage is SATA SSD rather than NVMe Gen4, creating a checkpoint I/O ceiling that adds minutes per save on models above approximately 13B parameters as documented in the storage section above. The 1 Gbps Ethernet makes multi-node distributed training impractical. Maximum VRAM of 24 GB per GPU rules out full fine-tuning of models above roughly 13B parameters without aggressive quantization. No published SOC 2 or ISO 27001 certification is available (the data center is described as “Tier III certified” on hostline.io but no Uptime Institute certificate is publicly linked). NVLink bridge status on dual and triple A5000 configurations is not documented on the product page despite the RTX A5000 hardware supporting NVLink bridges per NVIDIA specifications.

For teams operating from the EU with GDPR data residency requirements, LoRA or adapter-based fine-tuning workloads that fit within 24 GB VRAM per GPU, and a preference for predictable monthly billing in EUR without variable cloud costs, Hostline’s triple RTX A5000 at EUR 1,220/month covers the requirement at the lowest sustained monthly cost in this comparison.

Strengths

Lowest entry price in this comparison at EUR 360/month for a dedicated GPU server with 16 GB VRAM; the only provider in this comparison with published fixed monthly EUR pricing and no per-hour metering or variable costs; the only provider in this comparison offering ECC RAM (DDR4 ECC) on all GPU plans including the entry tier; EU/GDPR data residency in Vilnius, Lithuania; dedicated bare metal with full root access, no shared tenancy, and iDRAC 9 Enterprise remote management on all plans; zero egress fees; 256 GB DDR4 ECC system RAM on the triple A5000 plan; DDoS protection included; effective hourly rate of EUR 1.67/hr at 24/7 utilization on the triple A5000.

Limitations

Ampere-generation RTX A5000 GPUs without FP8 or FP4 precision support; Intel Xeon Gold 6130 CPUs (Skylake-SP, 2017) with PCIe Gen3 bandwidth constraints; SATA SSD storage rather than NVMe Gen4, adding minutes per checkpoint save on models above 13B parameters; 1 Gbps Ethernet with no multi-node distributed training capability; 24 GB maximum VRAM per GPU, ruling out full fine-tuning above 13B parameters without quantization; no published SOC 2 or ISO 27001 certification; NVLink bridge status undocumented on dual and triple A5000 configurations; GPU VPS product has non-deterministic SKU availability.

Specialized AI Cloud: CoreWeave

source: coreweave.com

CoreWeave went public on the Nasdaq in 2025 and has built its business around being first to market with NVIDIA’s newest GPU architectures. The company announced deployment of the GB300 NVL72 platform on July 3, 2025, with cloud instances in select regions as of August 19, 2025 per its docs changelog. The GPU lineup spans HGX H100 ($49.24/hr for an 8-GPU node, roughly $6.16/GPU-hr), HGX H200 ($50.44/hr), HGX B200 ($68.80/hr), GB200 NVL72 at roughly $10.50/GPU-hr, and GB300 NVL72 instances. Networking runs at 400 Gb/s NDR InfiniBand on Hopper nodes and 800 Gb/s Quantum-X800 on Blackwell. Zero egress fees. Reserved pricing discounts reach up to 60 percent versus on-demand.

The platform is Kubernetes-native, which is an advantage for teams already running containerized training pipelines and a barrier for teams accustomed to SSH-based VM workflows. The compliance portfolio is the broadest among specialized AI clouds: SOC 2, ISO 27001, HIPAA, PCI, GDPR, FedRAMP, and CSA STAR Level 1.

Strengths

Widest Blackwell GPU selection among the providers evaluated including GB200 and GB300 NVL72; 800 Gb/s InfiniBand on Blackwell; zero egress; broadest compliance among specialized AI clouds; reserved discounts to 60 percent.

Limitations

On-demand single-B200 at $8.60/GPU-hr is among the highest in the specialized tier; minimum commitments on reserved capacity; Kubernetes-native platform requires container expertise; no spot or interruptible tier.

Specialized AI Cloud: Lambda Labs

source: lambda.ai

Lambda has carved out a position around pricing transparency. The pricing page at lambda.ai/pricing updates in real time and lists every GPU SKU with its on-demand rate. Third-party trackers in May 2026 cite H100 SXM at $2.99 to $3.99/GPU-hr depending on configuration, B200 SXM6 at $4.99 to $6.99/GPU-hr, A100 80 GB at $2.49, and A100 40 GB at $1.29. These rates may have changed since capture; verify at lambda.ai/pricing. 1-Click Clusters for HGX B200 start at $9.86/GPU-hr for 16 GPUs and drop to $8.87 at 256+ GPUs, with a two-week minimum and Lambda approval required before provisioning. Cluster networking uses Quantum-2 InfiniBand at 3,200 Gbps. Zero egress. Per-minute billing. Pre-configured Lambda Stack with PyTorch, CUDA, cuDNN.

The H200 is available in cluster configurations but does not have a published on-demand hourly rate as of May 2026, which is a gap for teams that want to evaluate H200 performance without a two-week cluster commitment.

Strengths

Most transparent published pricing among the providers evaluated; on-demand B200 without sales conversations; 3,200 Gbps InfiniBand on clusters; zero egress; per-minute billing; SOC 2 Type II.

Limitations

H200 cluster-only with no on-demand rate; no EU data center regions; 1-Click Clusters require two-week minimums and approval; capacity constraints reported during peak demand.

Specialized AI Cloud: Nebius

source: nebius.com

Nebius is headquartered in Amsterdam, the Netherlands, operates a flagship data center in Finland, and positions itself explicitly as an EU-sovereign alternative to US-headquartered AI clouds. For organizations subject to data residency requirements that prevent the use of US-based cloud providers, this positioning narrows the competitive set significantly. On-demand GPU pricing starts at $2.95/GPU-hr for HGX H100, with committed plans as low as $2.30/GPU-hr, HGX H200 at $3.50/GPU-hr, and HGX B200 at $5.50/GPU-hr. GB200 and GB300 NVL72 are in pre-order as of May 2026. InfiniBand at 3,200 Gbps.

Nebius holds NVIDIA Exemplar Status and has published MLPerf Inference v5.1 results on B200: 1,660 tokens/s offline and 1,280 tokens/s server for Llama 3.1 405B (vendor-published on the Nebius blog, representing Nebius’s own hardware in its own testing environment).

Strengths

Competitive H100 committed pricing at $2.30/GPU-hr; EU sovereignty and data residency; MLPerf-validated B200 performance (vendor-published); 3,200 Gbps InfiniBand; NVIDIA Exemplar Status.

Limitations

GB200 and GB300 not yet GA as of May 2026; smaller global footprint than hyperscalers; compliance narrower than AWS or Azure; shorter operational track record.

Specialized AI Cloud: RunPod

source: runpod.io

RunPod operates a two-tier model. Community Cloud is a shared environment where GPU availability and performance consistency vary by region and time. Secure Cloud is single-tenant, SOC 2 Type II certified since October 2025, with dedicated infrastructure. The pricing model charges per second with no minimum commitment, which is the lowest friction entry point in this comparison for teams that want to test a GPU configuration before committing to anything.

H100 SXM runs at $2.69/GPU-hr in Community Cloud. H200 SXM at $3.59. B200 SXM at $4.99 Community or $5.98 Secure Cloud. The platform offers a $25 free credit for new accounts and a Startup Program with up to 1,000 free H100 hours, which is enough to run a meaningful training evaluation before any spend. Zero egress fees.

Strengths

Per-second billing with no minimums; among the lowest on-demand H100 and B200 rates here; $25 free credit and startup program; Secure Cloud SOC 2 Type II; zero egress.

Limitations

Community Cloud has no platform SLA on uptime or performance consistency; InfiniBand availability not guaranteed; networking less documented than CoreWeave or Lambda.

Specialized AI Cloud: Together AI

source: together.ai

Together AI’s Chief Scientist is Tri Dao, the creator of FlashAttention, who joined in July 2023. The company’s differentiation is software rather than hardware: the Together Kernel Collection (TKC) optimizes training throughput at the kernel level, and the vendor-published benchmark shows 15,264 tokens/s/GPU on HGX B200 versus 8,080 on HGX H100 for Llama 70B BF16 training, a 90 percent improvement. This benchmark has not been independently replicated, but the directional finding aligns with the generational gains expected from Blackwell over Hopper, and the magnitude is plausible given the architectural differences.

Pricing spans HGX H100 reserved at $1.76/GPU-hr (the lowest documented H100 rate from a managed provider in this comparison), HGX B200 on-demand at $5.50/GPU-hr and reserved from $4.00, and H200, GB200, and GB300 configurations. InfiniBand across all cluster tiers. The software optimization means teams willing to adopt TKC get documented throughput gains that reduce total GPU-hours per run, potentially offsetting a higher per-hour rate.

Strengths

Lowest H100 reserved pricing among all providers evaluated at $1.76/GPU-hr; FlashAttention-native training optimization through TKC; documented 90 percent B200 throughput gain (vendor benchmark); GB300 available; 25+ city presence.

Limitations

90 percent speedup is vendor-published on a specific model and framework; reserved pricing requires commitment; platform optimized for Together’s stack; SOC 2 only without HIPAA or FedRAMP.

Hyperscaler: AWS EC2 P5 and P6

source: aws.amazon.com

AWS operates the largest GPU fleet globally and offers the broadest instance family for LLM training. The P5 (8x H100 SXM) runs at approximately $55.04/hr on-demand per Vantage pricing data in May 2026. The P5en (8x H200) at approximately $63.30/hr. The P6 includes 8x B200 at approximately $113.93/hr. GB200 NVL72 UltraServer configurations are available. EFAv4 networking at 3,200 Gbps. UltraClusters scale to 20,000+ GPUs.

AWS announced pricing reductions of up to 45 percent in June 2025, but Network World reported H200 Capacity Block rates subsequently increased by roughly 15 percent in late 2025. Pricing should be verified at time of purchase. The compliance portfolio is the broadest in this comparison: SOC 2, ISO 27001, HIPAA, FedRAMP, and ITAR.

Strengths

Largest GPU fleet globally; UltraClusters at 20,000+ GPUs; 3,200 Gbps EFA; broadest compliance portfolio; mature ecosystem including SageMaker and ParallelCluster.

Limitations

Highest on-demand list prices across all providers evaluated; per-GB egress fees; pricing volatility documented in 2025 and 2026; minimum 8-GPU granularity on P5/P6.

Hyperscaler: Azure ND-Series

source: learn.microsoft.com

Azure’s ND MI300X v5 deserves specific attention. Starting from approximately $6/GPU-hr (the rate varies by region; verify in the Azure Pricing Calculator) with 192 GB HBM3 per GPU, the MI300X offers among the best VRAM-per-dollar ratios in any major cloud. ROCm and PyTorch support have matured substantially through 2025 and 2026, and standard distributed training workflows run without modification. The limitation is in ecosystem breadth: custom CUDA kernels, specialized libraries like Apex, and some less common frameworks may require porting effort. The ND H100 v5 runs at roughly $12.29/GPU-hr; the ND H200 v5 and the GB200 NVL72-based ND v6 are also available. All ND v5 configurations connect through Quantum-2 InfiniBand at 3,200 Gbps.

Strengths

MI300X at roughly $6/GPU-hr with 192 GB HBM3 per GPU; 3,200 Gbps InfiniBand; full compliance; Azure ML integration; GB200 on ND v6.

Limitations

MI300X requires ROCm; H100 at $12.29/GPU-hr is the highest hyperscaler H100 rate here; per-GB egress.

Hyperscaler: Google Cloud A3 and A4

source: cloud.google.com

Google Cloud’s strength in this comparison is network fabric. The A3 Ultra (8x H200) and A4 (8x B200) deliver 3,600 Gbps GPUDirect RDMA, which is the highest documented inter-node bandwidth of any provider in this comparison. For training jobs where gradient synchronization between nodes is the bottleneck rather than single-GPU compute, that bandwidth advantage translates directly into scaling efficiency. The A3 High (8x H100) is the more accessible entry point at 600 to 800 Gbps, and it is one of the few hyperscaler instances that offers sub-8-GPU granularity, making it usable for smaller multi-GPU workloads without renting an entire 8-GPU node. On-demand pricing runs at roughly $10.60/GPU-hr for A3 Ultra and A4, though on-demand availability varies by region and reservation or Spot may be required. Google’s TPU lineup provides a parallel path for teams building on JAX, though mixing GPU and TPU strategies adds platform complexity. Compliance includes SOC 2, ISO 27001, HIPAA, and FedRAMP.

Strengths

Highest fabric bandwidth in this guide at 3,600 Gbps; sub-8-GPU options on A3 High; TPU alternatives; full compliance.

Limitations

A3 Ultra and A4 require reservation or Spot; fewer GPU SKU options; per-GB egress.

Hyperscaler: Oracle OCI

source: oracle.com

Oracle OCI takes a different approach from the other hyperscalers in this comparison. Every GPU instance is bare-metal with no hypervisor layer, which eliminates the virtualization overhead that can cost 2 to 5 percent of GPU throughput on other platforms. H100 and H200 8-GPU nodes run at $10/GPU-hr ($80/hr per node), and the local NVMe storage scales up to 61.4 TB per node, which is an order of magnitude more local storage than any other provider here offers. For training workflows with large datasets that need to be loaded from local disk rather than streamed over the network, that storage capacity matters. Superclusters scale to 131,072 B200 GPUs or 65,536 H200s. MI300X is available at $6/GPU-hr, matching Azure’s pricing. OCI uses RoCE v2 RDMA networking rather than InfiniBand, and provides 10 TB/month free egress. SOC 2 and ISO 27001 certified.

Strengths

True bare-metal with no hypervisor; massive local NVMe at 61.4 TB; generous free egress; Supercluster scale; MI300X at $6/GPU-hr.

Limitations

RoCE v2 less widely adopted for ML training than InfiniBand; OCI Data Science less mature than SageMaker or Vertex AI; smaller ML community.

Enterprise Custom: FluidStack

source: fluidstack.io

FluidStack occupies a different position from every other provider in this comparison. Rather than selling GPU hours through a self-serve interface, the company builds custom single-tenant clusters for AI labs and enterprises that need dedicated infrastructure at scale. As an NVIDIA Cloud Partner with over 100,000 GPUs under management, FluidStack offers H100, H200, B200, and GB200 NVL72 configurations with single-tenant InfiniBand fabrics. The scale of the operation is substantial: TeraWulf’s August 2025 press release documents $6.7 billion in contracted revenue through two FluidStack hosting agreements. Google provides a $3.2 billion financial backstop in that arrangement, though this is a Google-TeraWulf relationship rather than a direct Google-FluidStack service contract. Anthropic has selected FluidStack for custom data center builds in New York and Texas. Pricing is entirely quote-based with no published rates. SOC 2 Type 2, ISO 27001, and GDPR certified.

Strengths

Enterprise-scale custom builds; single-tenant InfiniBand; NVIDIA Cloud Partner; SOC 2 Type 2 and ISO 27001; selected by Anthropic.

Limitations

No published pricing; sales-engagement only; longer provisioning timelines; no self-serve access.

GPU Marketplace: Vast.ai

source: vast.ai

Vast.ai aggregates GPU capacity from independent hosts into a peer-to-peer marketplace with bid-based pricing. H100 PCIe starts at $1.97/hr, H100 NVL (94 GB HBM3) at $1.76/hr, and RTX 4090 at $0.31/hr, with B200 available at variable rates. Billing is per-second with no minimums. The marketplace model means pricing is lower than any managed provider, but reliability varies by host, networking specifications are often undocumented, and hosts may terminate instances with limited notice. There is no platform-level SLA.

Strengths

Lowest GPU-hr rates across all providers evaluated; per-second billing; no minimums; broad GPU selection including consumer cards.

Limitations

No platform SLA; networking varies by host; hosts may terminate instances; limited compliance; not suitable for production runs requiring guaranteed completion.

Common Mistakes When Choosing GPU Servers for LLM Training

The most common mistake is choosing by GPU-hr price alone without checking the interconnect. An H100 PCIe at $1.97/hr on Vast.ai has no NVLink. An H100 SXM with NVSwitch at $2.69/hr on RunPod does. For multi-GPU training, the SXM delivers 2 to 3x higher throughput because gradient synchronization runs over 900 GB/s NVLink instead of 32 GB/s PCIe. The cheaper GPU produces a more expensive training run measured in total GPU-hours to completion.
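The trade-off in the paragraph above reduces to dividing the hourly rate by relative throughput. The sketch below uses the two rates and the 2 to 3x range quoted in the text; the function name and the choice of the PCIe card as the 1.0x baseline are illustrative assumptions.

```python
def cost_per_unit_work(rate_per_gpu_hr: float, relative_throughput: float) -> float:
    """Hourly rate divided by throughput relative to a baseline GPU."""
    if relative_throughput <= 0:
        raise ValueError("relative_throughput must be positive")
    return rate_per_gpu_hr / relative_throughput


PCIE_RATE = 1.97  # H100 PCIe on Vast.ai, $/hr (from the text)
SXM_RATE = 2.69   # H100 SXM on RunPod, $/GPU-hr (from the text)

pcie = cost_per_unit_work(PCIE_RATE, 1.0)  # PCIe is the baseline (assumption)
for speedup in (2.0, 3.0):                 # the 2 to 3x range quoted in the text
    sxm = cost_per_unit_work(SXM_RATE, speedup)
    print(f"SXM at {speedup:.0f}x: ${sxm:.2f} vs PCIe ${pcie:.2f} per baseline GPU-hour of work")

# Multi-GPU speedup at which both options cost the same per unit of work.
break_even_speedup = SXM_RATE / PCIE_RATE
print(f"SXM is cheaper per unit of work above {break_even_speedup:.2f}x")
```

On these two list prices the SXM option needs only a modest multi-GPU speedup to come out ahead, well below the 2 to 3x range quoted.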

The second mistake is assuming “H100” means “H100 SXM.” Some providers list both PCIe and SXM variants under the same name without distinguishing them. The differences are substantial: the PCIe card has no NVLink and lower memory bandwidth (2.0 TB/s versus 3.35 TB/s on SXM), and the NVL variant carries 94 GB of memory versus 80 GB.

The third mistake is overlooking egress fees. AWS, Azure, and Google Cloud charge per GB for data leaving their network. For training workflows with frequent checkpoint uploads, model exports, or multi-cloud data movement, egress can add 10 to 20 percent to the effective cost. CoreWeave, Lambda, RunPod, Together AI, Hostline, Hetzner, and Vast.ai charge zero egress, and Oracle includes 10 TB/month free.

The fourth mistake is buying more VRAM than the workload needs. LoRA fine-tuning a 7B model uses 10 to 16 GB of VRAM. Running that workload on a $4.99/hr B200 with 192 GB leaves over 90 percent of the available memory unused. Hostline’s dual RTX A5000 at EUR 903/month or a Lambda A100 40 GB at $1.29/hr handles the same job at a fraction of the cost.

The fifth mistake is confusing data center physical tier with information security certification. “Tier III certified” describes physical infrastructure redundancy (power, cooling, maintenance). SOC 2 and ISO 27001 describe information security controls, audit processes, and data handling procedures. A provider may have one without the other, and procurement teams should verify which certifications are relevant to their compliance requirements.

Use Case Routing

The framework and provider sections above established what each workload tier demands and what each provider delivers. This section maps specific workload profiles to specific providers so readers can route their decision without rebuilding the analysis from scratch.

Teams running LoRA fine-tuning of 7B to 13B parameter models from the EU with GDPR requirements and a preference for predictable monthly costs have the narrowest set of suitable options. Hostline’s triple RTX A5000 at EUR 1,220/month provides 24 GB VRAM per GPU, 256 GB DDR4 ECC RAM, iDRAC 9 Enterprise remote management, and zero egress fees from Vilnius, Lithuania. Hostline is the only provider in this comparison that combines fixed EUR monthly pricing, ECC RAM at every tier, and no variable costs. If the fine-tuning target is a larger model (30B to 70B with quantization) that needs to fit on a single GPU, Hetzner GEX131 at EUR 889/month delivers 96 GB GDDR7 with NVMe Gen4 storage, ISO 27001 certification, and data residency in Germany.

Teams stepping up to multi-GPU fine-tuning with NVLink have several strong options. RunPod’s H100 SXM at $2.69/GPU-hr provides the lowest on-demand entry point with per-second billing and zero egress. Lambda at $2.99 to $3.99/GPU-hr adds pre-configured ML tooling, per-minute billing, and InfiniBand on cluster configurations. Nebius H200 at $3.50/GPU-hr on-demand (or $2.30 committed) combines competitive pricing with EU data sovereignty.

Multi-node pretraining at 70B+ parameters with InfiniBand narrows the field further. Together AI’s H100 at $1.76/GPU-hr reserved is the lowest committed rate in this comparison, and TKC training optimization can reduce total GPU-hours per run. CoreWeave provides the widest Blackwell selection including GB200 and GB300 NVL72 with 800 Gbps InfiniBand and broad compliance (SOC 2, ISO 27001, HIPAA, FedRAMP). Nebius serves EU-sovereign multi-node pretraining from Helsinki with 3,200 Gbps InfiniBand.

Frontier-scale pretraining at 1,000+ GPUs with regulatory requirements points to AWS UltraClusters (20,000+ GPUs, 3,200 Gbps EFA, FedRAMP, HIPAA, ITAR) or Azure for teams open to MI300X at $6/GPU-hr with 192 GB HBM3 per GPU. FluidStack serves the custom-build requirement for AI labs and enterprises willing to engage through a sales process. Vast.ai’s marketplace model at H100 NVL from $1.76/GPU-hr serves cost-sensitive experimentation where training code is fault-tolerant enough to handle interruptions.

Three Findings

There is no single answer to “which GPU server is best for LLM training” because the question contains at least three distinct infrastructure decisions that must be made in sequence.

First, VRAM determines the workload tier. LoRA fine-tuning fits in 16 to 24 GB on a single GPU. Full fine-tuning of a 70B model requires a minimum of 8 GPUs with 80+ GB each (a full HGX node with ZeRO-3 sharding), and production training typically uses 64 GPUs. Pretraining at frontier scale requires hundreds of GPUs with InfiniBand. These tiers do not overlap, and a provider that serves one tier is often irrelevant to another. Hostline’s EUR 360/month RTX A4000 and CoreWeave’s $68.80/hr HGX B200 serve workloads separated by three orders of magnitude in compute requirement. Comparing them on price alone would be misleading.
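The tier boundaries above follow from simple sizing arithmetic. The sketch below assumes the 16 bytes per parameter that the ZeRO paper documents for mixed-precision AdamW and assumes ZeRO-3 shards all model states evenly across GPUs. Activations, buffers, and fragmentation are not modeled, and the function names are illustrative.

```python
BYTES_PER_PARAM = 16  # 2 + 2 + 4 + 4 + 4 bytes, mixed-precision AdamW per the ZeRO paper


def model_state_gb(params_billions: float) -> float:
    """Weights + gradients + optimizer states in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM


def per_gpu_gb_zero3(params_billions: float, gpus: int) -> float:
    """ZeRO-3 shards all model states evenly across data-parallel ranks."""
    return model_state_gb(params_billions) / gpus


for gpus in (8, 64):
    share = per_gpu_gb_zero3(70, gpus)
    print(f"70B on {gpus} GPUs: {share:.1f} GB of model states per GPU")
```

Under this assumption the 8-GPU case works out to 140 GB of model states per GPU before activations, so on 80 GB cards it would depend on CPU offloading or a lighter optimizer (neither modeled here), while the 64-GPU case drops to 17.5 GB per GPU and leaves most of the memory for activations.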

Second, pricing transparency is itself a decision-relevant variable. Lambda, RunPod, Nebius, Together AI, and CoreWeave publish per-GPU-hour rates. FluidStack and hyperscaler reserved capacity require sales conversations. For teams that need cost models before budget approval, published pricing is a prerequisite. For teams deploying 1,000+ GPUs with custom SLA requirements, the sales model is expected.

Third, for EU-based teams with GDPR requirements and fine-tuning workloads within 24 to 96 GB VRAM, bare-metal providers offer cost structures that cloud cannot match at sustained utilization above 60 percent. Hostline provides the lowest monthly entry point at EUR 360/month with ECC RAM and fixed EUR pricing. Hetzner provides the highest single-GPU VRAM at 96 GB with NVMe and ISO 27001. Nebius bridges the gap for EU teams that need data center GPUs with InfiniBand at cloud scale. CoreWeave leads in Blackwell breadth. Together AI leads in training-stack optimization. AWS leads in scale and compliance. The decision framework, not any single provider’s positioning, is what the reader should carry away.

FAQ

GPU Servers for AI Training

Which GPU server in this comparison is cheapest for LoRA fine-tuning with EU data residency?

Hostline’s single RTX A4000 at EUR 360/month is the lowest-priced dedicated GPU server with ECC RAM and EU data residency in this comparison. The RTX A4000 provides 16 GB GDDR6 VRAM, sufficient for LoRA fine-tuning of 7B parameter models. For 13B models requiring 24 GB, the dual RTX A5000 at EUR 903/month is the next step. Both include ECC RAM, full root access, and zero egress. Hetzner GEX44 at EUR 184/month is lower in absolute price with 20 GB VRAM but does not include ECC RAM.

What is the real cost difference between H100 and B200 for the same training job?

Per GPU-hour, B200 costs 1.5 to 2x as much as H100. Together AI’s benchmark documents a roughly 90 percent throughput improvement on B200 (15,264 versus 8,080 tokens per second per GPU on Llama 70B at BF16), which can reduce total GPU-hours per run by nearly half. The net cost per completed run on B200 can therefore be lower despite the higher hourly rate, depending on workload, framework, and precision format. This benchmark is vendor-published and has not been independently replicated.
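To see where the crossover sits, the sketch below converts the two vendor-published throughput figures into a cost per billion tokens. The $3.00 H100 rate is a hypothetical placeholder, the B200 rates are simply 1.5x and 2x that placeholder, and the function name is illustrative.

```python
H100_TOK_S = 8_080   # tok/s/GPU, Together AI vendor-published figure
B200_TOK_S = 15_264  # tok/s/GPU, same vendor-published benchmark


def cost_per_billion_tokens(rate_per_gpu_hr: float, tok_per_s: float) -> float:
    """Dollars of GPU time needed to process one billion tokens on one GPU."""
    gpu_hours = 1e9 / tok_per_s / 3600
    return rate_per_gpu_hr * gpu_hours


speedup = B200_TOK_S / H100_TOK_S  # throughput ratio, B200 over H100
gpu_hour_saving = 1 - 1 / speedup  # fraction of GPU-hours saved per run

h100_rate = 3.00                   # hypothetical H100 rate, $/GPU-hr
h100_cost = cost_per_billion_tokens(h100_rate, H100_TOK_S)
for multiple in (1.5, 2.0):        # the price range quoted in the text
    b200_cost = cost_per_billion_tokens(h100_rate * multiple, B200_TOK_S)
    print(f"B200 at {multiple}x price: ${b200_cost:.2f} vs H100 ${h100_cost:.2f} per 1B tokens")

print(f"Throughput ratio {speedup:.2f}x, GPU-hours saved {gpu_hour_saving:.0%}")
```

If the benchmark holds for a given workload, B200 is cheaper per completed run whenever its price multiple over H100 stays below the throughput ratio, which puts the break-even inside the 1.5 to 2x price range rather than outside it.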

Does InfiniBand matter for single-node training?

No. NVLink and NVSwitch handle all intra-node GPU communication at 900 GB/s (Hopper) or 1.8 TB/s (Blackwell). InfiniBand connects nodes to each other. For training that fits on a single node, the scale-out fabric does not affect performance. Hostline and Hetzner bare-metal servers with 1 Gbps Ethernet are not disadvantaged for single-node workloads.

Is AMD MI300X viable for LLM training?

Yes, with caveats. Azure offers MI300X at roughly $6/GPU-hr with 192 GB HBM3, the best VRAM-per-dollar in any major cloud. PyTorch via ROCm handles standard distributed training without modification. Custom CUDA kernels may require porting. Oracle OCI also offers MI300X at $6/GPU-hr.

When does bare metal become cheaper than cloud?

Hostline’s triple RTX A5000 at EUR 1,220/month costs EUR 1.67/hr at 24/7 utilization. Three A100 40 GB GPUs on Lambda at $1.29/GPU-hr each cost approximately $2,825/month at the same utilization (Lambda pricing is dynamic; verify at lambda.ai/pricing). Above roughly 60 percent sustained utilization, bare metal is cheaper. Below 60 percent, cloud with per-second billing avoids paying for idle GPUs.
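The arithmetic behind this comparison is short enough to check directly. The sketch below assumes 730 hours per month, treats 1 EUR as 1 USD for simplicity, and applies no performance normalization between an RTX A5000 and an A100; all three are simplifying assumptions, and the function name is illustrative.

```python
HOURS_PER_MONTH = 730  # 8,760 hours / 12, an averaging assumption


def breakeven_utilization(fixed_monthly: float, hourly_rate: float, gpus: int) -> float:
    """Fraction of the month at which hourly billing equals the fixed price.

    Both prices must be in the same currency; no performance
    normalization between GPU models is applied (assumption).
    """
    full_month_cloud = hourly_rate * gpus * HOURS_PER_MONTH
    return fixed_monthly / full_month_cloud


bare_metal_hourly = 1220 / HOURS_PER_MONTH      # EUR/hr at 24/7 use
cloud_full_month = 1.29 * 3 * HOURS_PER_MONTH   # USD/month at 24/7 use
nominal = breakeven_utilization(1220, 1.29, 3)  # treats 1 EUR = 1 USD (assumption)

print(f"Bare metal: {bare_metal_hourly:.2f}/hr, cloud: {cloud_full_month:,.0f}/month")
print(f"Nominal break-even utilization: {nominal:.0%}")
```

On these two list prices alone the nominal crossover lands in the low-40s percent range. The roughly 60 percent threshold quoted above is more conservative than this simplified figure, which ignores the exchange rate and the performance gap between the two GPU models.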

How do I validate a provider’s actual performance before committing to a contract?

Run your actual training script on the provider’s infrastructure for at least 48 hours before signing anything. RunPod offers $25 free credit. Lambda bills per minute with no commitment. Most specialized AI clouds provide trial access. Measure three things: tokens per second at your target batch size and precision, GPU utilization percentage (sustained utilization below 80 percent usually points to a data loading or storage bottleneck rather than a GPU problem), and checkpoint write time (if a single checkpoint takes more than two minutes, storage I/O is a constraint). Compare measured throughput to the provider’s published specs. If measured performance falls significantly short, the gap is almost always attributable to storage, networking, or framework configuration rather than the GPU itself.
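The three checks above can be captured in a small helper so that trial runs on different providers are judged the same way. This is a minimal sketch: the `TrialRun` fields, the `evaluate` function, and the sample numbers are hypothetical, and collecting the inputs (for example, averaging periodic `nvidia-smi` samples) is left to the reader's own tooling.

```python
from dataclasses import dataclass


@dataclass
class TrialRun:
    tokens_processed: int      # total tokens over the measured window
    wall_seconds: float        # length of the measured window
    mean_gpu_util_pct: float   # averaged GPU utilization over the window
    checkpoint_seconds: float  # time to write one full checkpoint


def evaluate(run: TrialRun, published_tok_per_s: float) -> dict:
    """Apply the three checks from the text to one trial run."""
    tok_per_s = run.tokens_processed / run.wall_seconds
    return {
        "tokens_per_second": tok_per_s,
        "fraction_of_published": tok_per_s / published_tok_per_s,
        "likely_input_bottleneck": run.mean_gpu_util_pct < 80.0,  # 80 percent rule
        "storage_io_constrained": run.checkpoint_seconds > 120.0,  # two-minute rule
    }


# Made-up numbers for a 48-hour window, purely for illustration.
trial = TrialRun(tokens_processed=1_200_000_000, wall_seconds=172_800,
                 mean_gpu_util_pct=74.0, checkpoint_seconds=95.0)
report = evaluate(trial, published_tok_per_s=8_080)
for key, value in report.items():
    print(f"{key}: {value}")
```

Recording the same four fields on every provider under evaluation keeps the comparison on measured throughput rather than on published specifications.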

References

NVIDIA Corporation. H100 SXM5 datasheet, H200 datasheet, B200 SXM6 datasheet, GB200 NVL72 platform brief, GB300 NVL72 specifications. Accessed May 2026.
AMD. Instinct MI300X product documentation. 192 GB HBM3, 5.3 TB/s. Accessed May 2026.
Rajbhandari et al. “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models.” SC 2020. Documents K=12 optimizer state memory multiplier for mixed-precision AdamW, yielding 16 bytes per parameter total (2 BF16 weights + 2 BF16 gradients + 4 FP32 master weights + 4 FP32 momentum + 4 FP32 variance). For a 7.5B parameter model with DP=64: 16 x 7.5B = 120 GB total model states. arXiv:1910.02054.
HuggingFace. “Efficient Training on a Single GPU.” Training documentation citing 18 bytes per model parameter for mixed-precision AdamW plus activation memory. huggingface.co/docs/transformers/perf_train_gpu_one.
NVIDIA DGX Cloud. Llama 2 70B benchmarking recipes specifying “at least 64 GPUs with at least 80 GB memory each” for full fine-tuning.
Hostline (hostline.io). GPU dedicated server pricing: RTX A4000, dual RTX A5000, triple RTX A5000. Vilnius data center. Accessed May 2026.
Hetzner Online GmbH. GEX131 press release: RTX PRO 6000 Blackwell Max-Q, 96 GB GDDR7, EUR 889/month. December 2025 forum reference: EUR 1,057.91/month.
CoreWeave. GB300 NVL72 deployment press release (July 3, 2025). GA instances changelog (August 19, 2025). Instance pricing, InfiniBand specifications.
Lambda, Inc. Pricing page (lambda.ai/pricing): rates update in real time. H100 SXM at $2.99/GPU-hr (SynpixCloud April 2026), B200 SXM6 at $4.99-$6.99/GPU-hr (GPUPerHour May 2026), A100 80 GB at $2.49/GPU-hr, A100 40 GB at $1.29/GPU-hr. Cross-referenced with GPUPerHour, CheckThat.ai, SynpixCloud. Lambda pricing is dynamic; verify at lambda.ai/pricing before procurement.
Together AI. TKC benchmark: 15,264 tok/s/GPU B200 vs 8,080 H100, Llama 70B BF16. Vendor-published. H100 reserved $1.76/GPU-hr.
Nebius B.V. MLPerf Inference v5.1: 1,660 tok/s offline, 1,280 tok/s server, Llama 3.1 405B on B200. Vendor-published.
RunPod, Inc. H100 SXM $2.69/GPU-hr Community, B200 $4.99/$5.98. SOC 2 Type II since October 2025.
Amazon Web Services. EC2 P5, P5en, P6, P6e instance pages. Vantage: P5.48xlarge $55.04/hr (May 2026). Network World: H200 Capacity Block increases late 2025.
Microsoft Azure. ND H100 v5, ND MI300X v5, ND GB200 v6. MI300X from approximately $6/GPU-hr in lowest-cost region per CloudPrice (cloudprice.net); rates vary by region, verify via Azure Pricing Calculator.
Google Cloud. A3 High, A3 Ultra, A4. 3,600 Gbps GPUDirect RDMA.
Oracle Corporation. OCI GPU bare-metal. RoCE v2. 61.4 TB local NVMe. 10 TB/month free egress.
TeraWulf Inc. August 18, 2025 press release: FluidStack hosting agreements, $6.7 billion contracted revenue.
Vast.ai. GPU marketplace: H100 PCIe, H100 NVL pricing. Accessed May 2026.

Editorial Note

This article is published on hostline.io by Hostline. Hostline is one of the thirteen providers compared and is positioned for EU bare-metal fine-tuning workloads at fixed monthly cost. The same evaluation template applies to every provider. Providers are grouped by infrastructure tier with ordering by VRAM capacity within each tier. No provider is ranked #1 or described as “best.”

Where competitors outperform Hostline on specific dimensions, this article states so directly. Hetzner GEX131 offers more VRAM per GPU (96 GB versus 24 GB) with NVMe Gen4 storage and ISO 27001 certification that Hostline does not match. CoreWeave offers the widest Blackwell GPU selection. Together AI offers the lowest H100 reserved pricing at $1.76/GPU-hr. Lambda offers the most transparent real-time pricing. Nebius offers EU-sovereign Blackwell infrastructure with committed pricing from $2.30/GPU-hr. AWS offers the broadest compliance portfolio and largest GPU fleet. Azure offers MI300X from roughly $6/GPU-hr, a rate Oracle matches. Oracle also offers bare-metal cloud without virtualization. Hostline’s documented limitations (eight) roughly equal its documented strengths (nine).

Vendor-published benchmarks and claims (Together AI TKC throughput, Nebius MLPerf Inference v5.1, CoreWeave deployment claims, FluidStack contracted revenue per TeraWulf’s press release) are cited with commercial interest noted. All pricing verified May 2026. This article does not constitute procurement advice.