NVLink and Storage: Benchmark Suite to Measure GPU-to-Storage Bottlenecks for RISC-V Platforms


cloudstorage
2026-04-15

Open benchmark suite to expose GPU-to-storage throughput and latency on NVLink-fused RISC-V platforms—practical tests, profiling steps, and fixes.

Hook: Why GPU-to-Storage Bottlenecks Threaten RISC-V AI Platforms Today

Architects building NVLink-fused RISC-V systems for AI pipelines face a familiar, expensive problem: GPUs can be starved by storage, and traditional benchmarks miss the bottleneck. As SiFive and partners began shipping NVLink Fusion-enabled RISC-V silicon in late 2025 and early 2026, teams discovered that existing storage tests (fio, iperf, generic I/O suites) rarely measure the full GPU->storage path across NVLink fabrics. The result: unexpected tail latencies, suboptimal throughput, and wasted GPU cycles in production models.

Executive summary — what this open benchmark gives you

This article introduces a practical, open benchmark suite purpose-built to surface GPU-to-storage throughput and latency problems on NVLink-fused RISC-V platforms. It is written for system architects and platform engineers who must validate end-to-end AI pipeline performance. You’ll get:

  • A clear benchmark design that isolates NVLink/GPU-IO paths (not just CPU->storage)
  • Recommended hardware/software stack for reproducible tests on RISC-V NVLink platforms
  • Measurement and profiling recipes using modern tooling (Nsight, perf/eBPF, Prometheus) and RISC-V-specific considerations
  • Actionable optimization playbook to eliminate common bottlenecks
  • A template reporting format (throughput, P99/P999, GPU idle-time, CPU overhead)

Why this matters now (2026 context)

In 2025–2026 the industry moved fast: NVLink Fusion started appearing on alternative ISAs, notably RISC-V vendor integrations announced in early 2026. That opened new heterogeneous platform choices for AI inference and training outside the x86 ecosystem. But these platforms introduce novel I/O topologies — NVLink P2P, NVLink fabrics, DPU/NVMe-oF frontends — where standard storage benchmarks miss GPU-side serialization, kernel copy overheads, or DPU misconfigurations.

Key 2026 trends that increase the stakes:

  • NVLink Fusion adoption on RISC-V creates direct GPU-RISC-V links and new DMA paths.
  • Wider GPUDirect and vendor GDS-like APIs are available for non-x86 stacks, but configuration and driver maturity vary.
  • Composable infrastructure and DPUs shift storage logic off host CPUs — which can both help and mask problems.
  • Stricter SLAs for AI model latency force attention to tail behavior (P99/P999).

Benchmark goals and success metrics

The benchmark suite is designed to answer three engineering questions:

  1. Can a GPU saturate the storage path available to it? (Maximum sustainable throughput per GPU, in MB/s or GB/s)
  2. What are the latency characteristics for AI-friendly IO patterns? (small random reads, batched sequential reads)
  3. Where do losses occur? (CPU copy, kernel queues, NVLink lane congestion, device-side queuing)

Primary metrics you must report:

  • Throughput (MB/s or GB/s per GPU and aggregate)
  • Latency distribution (median, P95, P99, P999 and full CDF)
  • GPU utilization and idle time while waiting on IO
  • CPU overhead and kernel-space copy counts
  • NVLink transport metrics (bandwidth per link, link utilization)
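The latency metrics above can be computed from raw per-request samples. The sketch below is a minimal illustration (not part of any harness); it uses a nearest-rank percentile, which is a common but not universal convention, so swap in your preferred interpolation if needed.

```python
# Sketch: compute the latency-distribution metrics listed above from raw
# per-request samples (microseconds). Function name is illustrative.
import statistics

def latency_summary(samples_us):
    """Return median/P95/P99/P999 from a list of per-request latencies."""
    s = sorted(samples_us)

    def pct(p):
        # Nearest-rank percentile: pick the sample at the p-th rank.
        idx = min(len(s) - 1, int(p / 100.0 * len(s)))
        return s[idx]

    return {
        "median_us": statistics.median(s),
        "p95_us": pct(95),
        "p99_us": pct(99),
        "p999_us": pct(99.9),
        "samples": len(s),
    }

# A single 4 ms outlier dominates the tail of an otherwise ~125 us workload.
summary = latency_summary([120, 130, 125, 4000, 128, 122, 131, 127, 126, 129])
```

Reporting the full CDF alongside these points is still recommended, since a single P99 number can hide bimodal behavior.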

Required hardware and software stack

To run the suite reproducibly, standardize on these components. The stack assumes an NVLink Fusion-capable RISC-V SoC with NVIDIA GPUs and modern storage (local NVMe, NVMe-oF, or DPU-managed targets).

  • RISC-V host/SoC with NVLink Fusion bridge (SiFive or OEM boards shipping in 2026)
  • NVIDIA GPUs that support GPUDirect/GPUDirect Storage on NVLink (2024+ GPU families)
  • NVMe SSDs (local) and an NVMe-oF target (remote) to exercise both local and networked stacks
  • Optional DPU/SmartNIC configured as NVMe-oF front-end

Software

  • Linux kernel with up-to-date RISC-V NVLink and GDS drivers (2025-2026 driver tree)
  • GPUDirect Storage or an equivalent vendor library for RISC-V (cuFile-like API)
  • fio with a GPU-aware ioengine plugin (or the benchmark's custom I/O harness)
  • Nsight Systems / Nsight Compute or comparable GPU tracing tool with RISC-V support
  • Prometheus + Grafana for metrics collection; perf/eBPF for kernel-level traces
  • nvme-cli, smartctl, and vendor DPU management utilities

Designing the test matrix

A meaningful matrix tests block sizes, access patterns, concurrency and topologies.

Access patterns

  • Large sequential reads (1MB–16MB) to measure sustained throughput for model checkpoints
  • Small random reads (4KB–64KB) to characterize input tile fetches for inference
  • Batched scatter-gather reads (variable offsets per batch) to mimic sharded dataset reads
  • Write-back patterns (for checkpoints) and mixed read/write to exercise queueing

Concurrency and queue depth

  • Single-stream vs multi-stream GPU access (1, 4, 8, 16 concurrent CUDA streams or equivalents)
  • IO queue depths 1–256 to capture different driver/device bottlenecks

Topologies

  • GPU -> local NVMe over NVLink (direct path)
  • GPU -> host CPU -> local NVMe (disable GPUDirect to show worst-case)
  • GPU -> NVMe-oF target across DPU (network path)
  • Cross-GPU (GPU1 reads from storage while GPU2 is loaded) to measure fabric contention
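The dimensions above multiply quickly, so generating the matrix programmatically keeps runs complete and reproducible. This sketch enumerates one illustrative cut of the matrix; the topology and pattern labels are harness-internal names, not vendor identifiers.

```python
# Sketch: enumerate the test matrix described above (block size x access
# pattern x queue depth x GPU streams x topology). Labels are illustrative.
from itertools import product

BLOCK_SIZES = ["4k", "64k", "1m", "16m"]
PATTERNS = ["seq-read", "rand-read", "scatter-gather", "mixed-rw"]
QUEUE_DEPTHS = [1, 8, 32, 128, 256]
STREAMS = [1, 4, 8, 16]
TOPOLOGIES = ["gpu-nvlink-nvme", "gpu-host-nvme", "gpu-nvmeof-dpu", "cross-gpu"]

def build_matrix():
    """Yield one test-case dict per combination; the harness runs each in turn."""
    for bs, pattern, qd, streams, topo in product(
            BLOCK_SIZES, PATTERNS, QUEUE_DEPTHS, STREAMS, TOPOLOGIES):
        yield {"bs": bs, "pattern": pattern, "iodepth": qd,
               "streams": streams, "topology": topo}

cases = list(build_matrix())  # 4 * 4 * 5 * 4 * 4 = 1280 combinations
```

In practice you would prune combinations that make no sense together (e.g. 16 MB blocks at queue depth 256), but starting from the full cross-product makes the pruning decisions explicit.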

Measurement & profiling recipe

Collecting the right signals is critical — you must observe both ends of the path: the GPU side and the storage side.

GPU-side traces

  • Nsight Systems timeline to see when GPU kernels block on I/O operations (IO events, memcpy, cuFile calls)
  • GPU utilization and memory copy counters (SM utilization, memory throughput)
  • Record the time between GPU request submit and completion to get GPU-observed latency

Host/storage-side traces

  • fio or the GDS-aware harness telemetry for per-request latency
  • nvme-cli for controller bandwidth and queue statistics
  • iostat/dstat and perf/eBPF to measure CPU cycles spent in kernel copy and IO processing
  • Prometheus exporters for NVLink and DPU if available (2026 drivers often export link utilization)

Distributed telemetry

Use synchronized clocks (PTP/NTP) across host, DPU, and storage to correlate events. Capture traces for each run and store them in an artifact bucket for comparison.
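Once clocks are synchronized, correlating a GPU-side submit with its storage-side completion reduces to shifting remote timestamps by the measured offset and matching request IDs. This is a simplified sketch; the event shapes, offset value, and matching window are assumptions for illustration.

```python
# Sketch: align DPU/storage-side event timestamps to the host clock using a
# measured PTP offset, then pair each GPU submit with its completion.

def align_timestamps(events, offset_ns):
    """Shift remote (DPU/storage) timestamps into the host clock domain."""
    return [{**e, "ts_ns": e["ts_ns"] + offset_ns} for e in events]

def correlate(gpu_events, storage_events, window_ns=1_000_000):
    """Match each GPU submit to the nearest later completion for the same request."""
    pairs = []
    for g in gpu_events:
        candidates = [s for s in storage_events
                      if s["req_id"] == g["req_id"]
                      and 0 <= s["ts_ns"] - g["ts_ns"] <= window_ns]
        if candidates:
            s = min(candidates, key=lambda s: s["ts_ns"])
            pairs.append((g["req_id"], s["ts_ns"] - g["ts_ns"]))  # latency in ns
    return pairs

# Example: a storage completion captured on a DPU clock running 500 ns behind.
storage = align_timestamps([{"req_id": 1, "ts_ns": 1_000}], offset_ns=500)
pairs = correlate([{"req_id": 1, "ts_ns": 1_200}], storage)
```

The window bound matters: without it, a dropped completion would silently match the next request with the same recycled ID.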

Step-by-step: a reproducible run

  1. Baseline: ensure stable system temperature and no background workloads. Reboot if necessary.
  2. Enable GPUDirect/GDS and confirm driver readiness. Confirm NVLink links up with vendor diagnostics.
  3. Start Prometheus collectors and Nsight tracing (lightweight sampling initially).
  4. Run microbenchmarks: large sequential read at QD=32 (GPU-initiated). Capture throughput and GPU wait times.
  5. Run small-random IOs at QD=8 and 16 with many concurrent GPU streams to measure tail latency.
  6. Compare runs with GPUDirect enabled vs disabled to quantify CPU-copy overhead.
  7. Run topology-switch tests: local NVMe vs NVMe-oF vs DPU-managed target.
  8. Collect artifacts: fio logs, Nsight trace files, perf/eBPF traces, and Prometheus metrics snapshots.
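Steps 4-7 above lend themselves to a scripted plan so every run uses identical parameters. The sketch below only builds the command lines (a dry run); the harness binary name `gpu_io_bench` and all flags are assumptions, so substitute your platform's GPU-aware engine before executing.

```python
# Sketch: build command lines for the microbenchmark steps as a dry run.
# "gpu_io_bench" and its flags are hypothetical placeholders.
import shlex

def bench_cmd(bs, iodepth, rw="read", streams=1, gds=True,
              target="/dev/nvme0n1", runtime=120):
    """Return an argv list for one microbenchmark invocation."""
    return ["gpu_io_bench",
            f"--rw={rw}", f"--bs={bs}", f"--iodepth={iodepth}",
            f"--streams={streams}", f"--gds={'on' if gds else 'off'}",
            f"--runtime={runtime}", f"--target={target}"]

plan = [
    bench_cmd("1m", 32),              # step 4: large sequential read, QD=32
    bench_cmd("4k", 8, streams=16),   # step 5: small random, many GPU streams
    bench_cmd("4k", 16, streams=16),
    bench_cmd("1m", 32, gds=False),   # step 6: GPUDirect disabled baseline
]
for argv in plan:
    print(shlex.join(argv))
```

Emitting the plan before running it also gives you a free artifact: the exact invocations belong in the run's provenance record next to the driver and firmware hashes.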

Example fio-style job (conceptual)

Use a GPU-aware ioengine or the benchmark harness. The job below is conceptual to show parameters to exercise.

# gpu_read.job (conceptual)
[gpu_read]
rw=read
ioengine=gds        ; or a platform GPU-aware engine
bs=64k
iodepth=32
numjobs=8
direct=1
runtime=120
filename=/dev/nvme0n1

Note: Replace ioengine with the platform-supported GPU ioengine (GPUDirect/cuFile bindings) for RISC-V driver stacks.

Interpreting results — common bottlenecks and signatures

When you run the matrix, watch for these patterns:

  • Low throughput but low CPU load: possible NVLink link under-provisioning or NVMe controller-side bottleneck. Check link utilization and per-lane stats.
  • High CPU load and high latencies with GPUDirect disabled: kernel memcpy paths are throttling the pipeline — enabling GPUDirect should reduce CPU cycles and latency.
  • P99/P999 spikes with small IOs: queuing delays at NVMe controller or DPU; use smaller batch sizes or reconfigure NVMe QoS/IO scheduling.
  • One GPU idle while another saturates: NVLink fabric routing asymmetry or per-GPU queue limits. Rebalance workloads across NVLink islands.
  • DPU shows low CPU but high latency: DPU may be underpowered for storage front-end processing — increase DPU resources or move queue handling to host temporarily.
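These signatures can be encoded as a small first-match rule set over a run's summary metrics, which makes triage repeatable across runs. The thresholds below are illustrative only; calibrate them against your own baselines.

```python
# Sketch: classify the most likely bottleneck from a metrics snapshot,
# following the signatures described above. Thresholds are illustrative.

def classify(m):
    """Map a run's summary metrics to the most likely bottleneck class."""
    if m["throughput_frac"] < 0.6 and m["cpu_util"] < 0.2:
        return "link-or-controller"   # low throughput, low CPU load
    if m["cpu_util"] > 0.8 and not m["gpudirect"]:
        return "kernel-memcpy"        # host copy path throttling the pipeline
    if m["p999_us"] > 10 * m["p50_us"]:
        return "queuing-tail"         # controller/DPU queuing delays
    if m["gpu_util_spread"] > 0.3:
        return "fabric-asymmetry"     # one GPU idle while another saturates
    return "none-detected"

verdict = classify({"throughput_frac": 0.5, "cpu_util": 0.1,
                    "gpudirect": True, "p999_us": 900, "p50_us": 120,
                    "gpu_util_spread": 0.05})
```

A rule engine like this never replaces trace inspection; it just tells you which trace to open first.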

Optimization playbook (actionable steps)

For each detected bottleneck, apply these prioritized fixes and re-run the tests:

  1. Enable GPUDirect/GDS and verify direct DMA paths from SSD to GPU memory. Most gains are here for large sequential workloads.
  2. Increase IO queue depth and use multiple GPU streams to keep NVLink lanes fed — but watch for increased tail latencies on small IOs.
  3. Use hugepages and pre-registered buffers to reduce pin/unpin overhead during GPUDirect transfers.
  4. Topology-aware scheduling: place data on NVMe devices with the best NVLink proximity to the serving GPU.
  5. Offload metadata and small-IO handling to host CPU if the DPU struggles with thousands of tiny requests; alternatively use a smarter DPU configuration.
  6. NVMe controller tuning: enable write-back caches for checkpoint writes, adjust scheduler, and optimize NVMe namespaces for low latency.
  7. Monitor link health and firmware updates — NVLink Fusion is rapidly iterating in 2025–2026 drivers and microcode.

Hypothetical case study: diagnosing a 30% GPU starvation issue

Situation: A RISC-V cluster with NVLink-fused GPUs showed 30% lower throughput in a retrieval-augmented generation (RAG) inference pipeline compared to lab estimates.

What the benchmark revealed:

  • With GPUDirect enabled, throughput jumped but P99 latencies remained high for 4KB reads.
  • Nsight traces showed GPU waiting on cuFile reads; perf traces showed sporadic kernel dispatch delays caused by a misconfigured NVMe QoS policy on the DPU.
  • NVLink telemetry exposed that one fabric lane had 60% higher utilization, because the topology routed multiple GPUs through the same bridge.

Fixes applied:

  • Rebalanced device namespace placement to local NVLink islands
  • Tuned DPU request handling and raised completion queue sizes
  • Upgraded the NVLink microcode (a 2025 fix backported into the 2026 driver tree) that fixed a small serialization edge case

Result: sustained throughput improved to the target and P99 latency fell by 4x on small reads.

Reporting format — what to publish after each run

KPI dashboard for each test run should include:

  • Aggregate throughput (GB/s) and per-GPU breakdown
  • Latency CDF and P99/P999 figures
  • GPU idle-time while I/O outstanding
  • CPU cycles spent in memcpy vs kernel IO processing
  • NVLink link utilization heatmap
  • Versioned environment capture (kernel, driver, firmware, cuFile/GDS, benchmark version)
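The dashboard fields above can be serialized as a versioned JSON report so artifacts from different driver or firmware builds can be diffed mechanically. The field names below are a suggested schema, not a standard; the driver and firmware strings are placeholders to be filled from vendor tooling.

```python
# Sketch: the versioned run report described above, serialized as JSON.
# Schema is a suggestion; "unknown" fields get filled from vendor tools.
import json
import platform

def make_report(throughput_gbps, per_gpu, latency, env):
    return {
        "throughput_gbps_aggregate": throughput_gbps,
        "throughput_gbps_per_gpu": per_gpu,
        "latency_us": latency,        # median/P95/P99/P999 figures
        "environment": env,           # kernel, driver, firmware, versions
    }

env = {
    "kernel": platform.release(),
    "gpu_driver": "unknown",          # fill from the vendor driver query tool
    "nvme_firmware": "unknown",       # fill from nvme-cli id-ctrl output
    "benchmark_version": "0.1.0",
}
report = make_report(42.0, [10.5, 10.5, 10.5, 10.5],
                     {"median": 130, "p99": 410, "p999": 2200}, env)
print(json.dumps(report, indent=2))
```

Storing one such file per run next to the raw traces makes regressions bisectable when drivers or microcode change.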

Reproducibility and open-source best practices

To make benchmark results trusted and actionable:

  • Openly publish job files, harness code, and metric collectors
  • Include hardware inventory and firmware/driver hashes
  • Use containers for the user-space stack and supply kernel/driver provisioning scripts where possible
  • Provide an automated CI path (simulated or real hardware runners) to validate regressions when drivers or microcode change

Outlook through 2027

Expect these shifts, each of which will affect GPU-IO benchmarking:

  • Faster driver churn: NVLink Fusion and GPUDirect variants across ISAs will iterate rapidly; keep benchmarks under version control.
  • RISC-V ecosystem tool maturity: more mature vendor-provided telemetry and kernel modules with expanded perf/eBPF hooks will arrive in 2026.
  • Composable disaggregation: storage disaggregation over fabrics will become the norm, making remote NVMe-oF tests more critical.
  • ML models with tighter I/O loops: retrieval-heavy inference will put more emphasis on small-read tail latency optimizations.

Actionable takeaways

  • Benchmark the full GPU->storage path, not just CPU->storage. Use GPUDirect-aware engines to avoid false confidence.
  • Measure tail latencies (P99/P999) for small-block IOs — these kill inference SLAs.
  • Be topology-aware: NVLink islands and link saturation are common root causes of imbalance.
  • Automate artifact capture (traces, firmware versions, driver hashes) to enable reproducible diagnosis.

Call to action

We built the initial open benchmark harness and reporting templates to help RISC-V NVLink platform teams get started. Clone the repo, run the suite on your hardware, and open issues with trace artifacts — the goal is community-driven improvement of GPU-IO diagnostics for the emerging RISC-V + NVLink ecosystem.

Get involved: adopt the benchmark, contribute topology tests for your boards, and share results so the community can converge on best practices that keep GPUs busy and storage predictable.


Related Topics

#benchmarking #gpu #performance
