Nexus MI300X Training Pod — Multi-Node Cluster (8-Rail 400G Fabric)

Rack-scale MI300X cluster engineered for distributed model training.

Overview

The Nexus MI300X Training Pod links multiple 8-GPU MI300X nodes over an 8-rail 400G fabric with shared parallel storage and ROCm-based orchestration, delivered as one engineered system rather than a parts list. Nexus Compute designs, sources, stages, and tests the full pod across compute, fabric, and storage, then delivers it warranty-backed through authorized channels.

Specifications

Compute Nodes	Multiple 8x MI300X servers (192GB HBM3 per GPU)
Intra-Node Interconnect	AMD Infinity Fabric, all-to-all per node
Cluster Fabric	8-rail fat tree, 400G RoCEv2 or InfiniBand NDR (1:1 GPU:NIC)
Shared Storage	High-throughput parallel filesystem
Orchestration	Slurm or Kubernetes with ROCm
Scale	16 to 64+ MI300X GPUs (configurable)
Monitoring	GPU, fabric, and job health monitoring
Deployment	Design, sourcing, staging, and commissioning support
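
Sizing sketch

The figures above can be turned into rough aggregate numbers for a given pod size. The Python sketch below is illustrative only: it uses the 8 GPUs per node, 192GB per GPU, and one 400G NIC per GPU stated in the table, and it reports nominal line rate per direction, not measured throughput. The function and field names are hypothetical, and it assumes pods are built from whole 8-GPU nodes.

```python
# Back-of-envelope sizing from the specification table.
# All results are nominal: line rate, not achieved training throughput.
GPUS_PER_NODE = 8        # 8x MI300X per server
HBM_GB_PER_GPU = 192     # HBM3 capacity per GPU
RAIL_GBIT_PER_S = 400    # one 400G NIC per GPU (1:1 GPU:NIC)


def pod_summary(total_gpus: int) -> dict:
    """Return aggregate node count, HBM capacity, and nominal fabric bandwidth."""
    if total_gpus <= 0 or total_gpus % GPUS_PER_NODE != 0:
        # Assumption: pods are built from whole 8-GPU nodes.
        raise ValueError("total_gpus must be a positive multiple of 8")
    nodes = total_gpus // GPUS_PER_NODE
    node_gbit = GPUS_PER_NODE * RAIL_GBIT_PER_S
    return {
        "nodes": nodes,
        "total_hbm_gb": total_gpus * HBM_GB_PER_GPU,
        "node_fabric_gbit_per_s": node_gbit,
        "node_fabric_gbyte_per_s": node_gbit // 8,  # 8 bits per byte
        "pod_fabric_gbit_per_s": total_gpus * RAIL_GBIT_PER_S,
    }


if __name__ == "__main__":
    for gpus in (16, 32, 64):
        print(gpus, pod_summary(gpus))
```

For the smallest listed configuration (16 GPUs, two nodes), this gives 3,072GB of aggregate HBM3 and a nominal 3,200 Gbit/s of fabric bandwidth per node in each direction. Actual figures depend on the configuration agreed for each engagement.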

Typical Use Cases

·Distributed training (FSDP, Megatron, DeepSpeed) on ROCm
·Foundation and large custom model training
·Shared multi-team research compute
·Building an owned AMD AI training platform
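
Rank placement sketch

For distributed training jobs, it can help to see how process ranks might land on the 8-rail fabric. The Python sketch below is a minimal illustration under two assumptions that are not stated in this datasheet: ranks are assigned node by node (ranks 0 to 7 on the first node, 8 to 15 on the second, and so on), and GPU i in every node is cabled to rail i. The names Placement, place_rank, and same_rail are hypothetical; the actual wiring and rank layout are set per engagement and by the job launcher.

```python
from typing import NamedTuple

GPUS_PER_NODE = 8  # 8x MI300X per node, one NIC (rail) per GPU


class Placement(NamedTuple):
    node: int       # zero-based node index
    local_gpu: int  # GPU index within the node, 0..7
    rail: int       # fabric rail the GPU's NIC attaches to


def place_rank(global_rank: int) -> Placement:
    """Map a global rank to its node, local GPU, and rail.

    Assumes node-major rank ordering and GPU i wired to rail i.
    """
    if global_rank < 0:
        raise ValueError("global_rank must be non-negative")
    node, local_gpu = divmod(global_rank, GPUS_PER_NODE)
    return Placement(node=node, local_gpu=local_gpu, rail=local_gpu)


def same_rail(rank_a: int, rank_b: int) -> bool:
    """True when both ranks attach to the same rail under the assumptions above."""
    return place_rank(rank_a).rail == place_rank(rank_b).rail


if __name__ == "__main__":
    for rank in (0, 7, 8, 15):
        print(rank, place_rank(rank))
```

Under these assumptions, ranks 3 and 11 are the fourth GPU on two different nodes and share rail 3, while ranks 3 and 4 sit on the same node and communicate over the intra-node Infinity Fabric instead.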

Industries

·AI & Machine Learning
·Higher Education & Research
·Government & Defense
·HPC
·Financial Services

Warranty & Support

Supplied through authorized channels with full manufacturer warranty. On-site, next-business-day support options available. Every system is configured, tested, and documented before delivery, with asset and warranty records provided for enterprise audit requirements.

Request a tailored quote

Configurations are tailored per engagement — contact us for pricing and lead times.

sales@nexus-compute.com

+1 737 276 1016

nexus-compute.com

Specifications are indicative and configured to each engagement. All product names, logos, and trademarks are the property of their respective owners. Nexus Compute is an independent enterprise hardware supplier.