Benchmark Test Plan for NVLink-Connected Systems: Measuring Storage Throughput and Latency

cloudstorage
2026-05-21

Reproducible benchmark plan to measure throughput and latency on NVLink Fusion systems pairing RISC‑V SoCs and GPUs — tools, tests, and analysis for 2026.

As teams deploy heterogeneous clusters that pair RISC‑V SoCs with NVIDIA GPUs using NVLink Fusion, one problem becomes immediate and painful: how do you reliably quantify storage throughput and latency when both the CPU and the GPU can originate I/O across the interconnect? If you’re responsible for architecture, SRE or procurement, you need a reproducible, defensible benchmark plan that isolates the storage subsystem while exercising GPU‑initiated paths such as GPUDirect Storage over NVLink Fusion. This article gives you that plan (test matrix, tools, commands and metrics) in a format you can run in your lab or CI pipeline in 2026.

In late 2025 and early 2026 the industry accelerated integrations connecting CPU vendors (including RISC‑V silicon IP stacks) with NVIDIA’s NVLink Fusion fabric. These integrations unlock cache‑coherent, direct GPU access to host memory and, increasingly, storage stacks. For architectural teams this changes the measurable surface: I/O can be initiated by the SoC, the GPU, or both; kernel bypass paths and DMA registration matter more than ever; and tail latencies from storage can limit model training and inference throughput.

SiFive and other RISC‑V platform partners announced NVLink Fusion integrations in late 2025 — creating new performance considerations for storage I/O between RISC‑V SoCs and GPUs (industry announcements, 2025–2026).

Goals of this benchmark plan

  • Reproducibility: provide reproducible commands, fio jobs and analysis steps.
  • Coverage: measure GPU‑initiated IO, CPU‑initiated IO, and mixed workloads across block sizes and QD.
  • Actionable metrics: capture throughput, IOPS, latency percentiles, interconnect utilization and system overhead.
  • Comparability: enable baseline comparisons (NVMe local, NVMe‑oF, GDS/NVLink Fusion).

Testbed: hardware and software baseline

Use a repeatable lab configuration. Record every firmware/driver/kernel version — this matters more in 2026 as NVLink Fusion support is still evolving across kernels and vendor stacks.
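
One way to make version capture routine is a small helper that runs before every benchmark. The sketch below is illustrative only: the function names `capture` and `collect_metadata` are made up for this example, and it assumes `fio` and `nvidia-smi` are on the PATH (missing tools are recorded as null rather than failing the run).

```python
import json
import platform
import shutil
import subprocess


def capture(cmd):
    """Run a command and return its stdout, or None if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def collect_metadata():
    """Gather kernel, fio and GPU driver versions into one dictionary."""
    return {
        "kernel": platform.release(),
        "machine": platform.machine(),
        "fio": capture(["fio", "--version"]),
        "gpu": capture([
            "nvidia-smi",
            "--query-gpu=name,driver_version,vbios_version",
            "--format=csv,noheader",
        ]),
    }


print(json.dumps(collect_metadata(), indent=2))
```

Storing this JSON next to each result file keeps the run auditable even when the firmware or driver changes between runs.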

Minimum hardware

  • RISC‑V SoC development board or custom board with NVLink Fusion endpoint support.
  • NVIDIA GPU(s) with NVLink Fusion support (vendor‑recommended models, 2025/2026-series).
  • Storage media: high‑end NVMe SSD(s) (client or datacenter grade), and optionally NVMe‑oF target for networked tests.
  • Switching fabric and cables per vendor recommendation (if external NVLink Fabric devices used).

Minimum software stack

  • Linux kernel with NVLink Fusion and GPUDirect Storage support (record exact git tag or vendor kernel version).
  • NVIDIA drivers and CUDA toolkit (2026 release or latest stable supporting Fusion/GDS).
  • fio (latest stable) with libaio, io_uring support, and optional GDS plugin.
  • SPDK (optional) for kernel‑bypass tests.
  • Nsight Systems / nvidia‑smi for GPU metrics; perf, bpftrace for CPU/RISC‑V side.

Key metrics (what to collect and why)

Collect both traditional storage metrics and interconnect/GPU‑centric metrics so you can correlate I/O behavior with GPU/Link utilization.

  • Throughput (GB/s) — sustained read/write bandwidth over the test window.
  • IOPS — useful for small random workloads (4KB/8KB).
  • Latency percentiles — p50, p90, p99, p99.9, p99.99. Tail latency is critical for online inference.
  • Queue depth (QD) scaling — how throughput/latency scale with client QD.
  • NVLink / fabric utilization — bytes/s per NVLink link and percent utilization if vendor exposes counters.
  • GPU memory bandwidth & occupancy — to see whether kernel scheduling or PCIe/NVLink is saturated.
  • CPU/RISC‑V load — user/system/irq time to detect driver overheads.
  • Jitter & variance — standard deviation of throughput and latency across windows.
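
fio reports latency percentiles itself, but if you log raw per‑I/O latencies you may want to compute the same percentiles independently. The following is a minimal sketch using the nearest‑rank definition; the function names are illustrative, and other percentile definitions (for example linear interpolation) will give slightly different values on small samples.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list; pct is in (0, 100]."""
    if not sorted_values:
        raise ValueError("no samples")
    # round() guards against floating-point noise before taking the ceiling
    rank = math.ceil(round(pct * len(sorted_values) / 100.0, 9))
    return sorted_values[max(rank, 1) - 1]


def latency_summary(samples_us, pcts=(50, 90, 99, 99.9, 99.99)):
    """Map each requested percentile to a latency value in microseconds."""
    ordered = sorted(samples_us)
    return {f"p{p}": percentile(ordered, p) for p in pcts}
```

Note that p99.99 only becomes meaningful with at least 10,000 samples; with fewer, it collapses to the maximum observed latency.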

Designing the test matrix

The matrix must exercise realistic modes that matter in production. At a minimum include these axes:

  • Initiator: CPU (SoC) direct vs GPU direct (GPUDirect/NVLink Fusion).
  • Workload type: sequential read/write, random read/write, mixed read/write.
  • Block sizes: 4KB, 16KB, 64KB, 256KB, 1MB (and 4MB for very large streaming reads).
  • Queue depths: 1, 4, 16, 32, 128 (match application expected QD).
  • Concurrency model: single process vs multi‑process vs multi‑GPU.
  • Kernel bypass: kernel path vs SPDK vs GDS.
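
To see how quickly the matrix grows, it can help to enumerate it in code. This sketch covers only four of the axes above (initiator, workload, block size, queue depth); the axis labels are illustrative, and the pruning rule follows the note that 4MB is intended for large streaming reads.

```python
import itertools

INITIATORS = ["cpu", "gpu"]
WORKLOADS = ["read", "write", "randread", "randwrite", "randrw"]
BLOCK_SIZES = ["4k", "16k", "64k", "256k", "1m", "4m"]
QUEUE_DEPTHS = [1, 4, 16, 32, 128]


def build_matrix():
    """Return one dict per combination of the four axes."""
    axes = itertools.product(INITIATORS, WORKLOADS, BLOCK_SIZES, QUEUE_DEPTHS)
    return [
        {"initiator": i, "rw": rw, "bs": bs, "iodepth": qd}
        for i, rw, bs, qd in axes
    ]


def prune(matrix):
    """Keep the 4m block size only for sequential reads."""
    return [c for c in matrix if c["bs"] != "4m" or c["rw"] == "read"]


cells = prune(build_matrix())
print(len(build_matrix()), len(cells))  # 300 260
```

At 240s per run and three repetitions, 260 cells come to about 52 hours of measurement time before warm‑ups, so most teams will want to prune further to the combinations their applications actually use.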

Measurement methodology — step by step

  1. Baseline validation: boot the system, apply latest microcode/firmware, ensure GPUs are visible and NVLink Fusion is enabled. Record nvidia‑smi output and kernel messages (dmesg) to capture link initialization.
  2. Device preparation: wipe test SSD(s), create dedicated partitions or raw block devices for testing. Disable background tasks (cron, logs) and ensure CPU frequency scaling is set to performance.
  3. Warm up: conduct a warm‑up run for each device/workload (e.g., 60s) to stabilize flash behavior and caches. Discard warm‑up results.
  4. Run the matrix: for each axis combination execute the workload for a controlled duration (recommended 180–300s runs) and repeat at least 3 times to compute variance.
  5. Collect system counters: concurrently record nvidia‑smi (or vendor counters), perf or bpftrace traces for kernel functions, and iostat or blktrace for device behavior.
  6. Post processing: aggregate results into CSVs, compute percentiles, normalized throughput, utilization ratios, and failure cases.
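
Step 4 asks for at least three repetitions so that variance can be computed. A minimal sketch of that check follows; the 5% coefficient‑of‑variation threshold is an assumption chosen for illustration, not a recommendation from any vendor.

```python
import statistics


def run_stability(throughputs_gbps, max_cv=0.05):
    """Summarise repeated runs; max_cv is a hypothetical acceptance threshold."""
    if len(throughputs_gbps) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(throughputs_gbps)
    stdev = statistics.stdev(throughputs_gbps)  # sample standard deviation
    cv = stdev / mean if mean else float("inf")
    return {"mean": mean, "stdev": stdev, "cv": cv, "stable": cv <= max_cv}
```

If a configuration is flagged as unstable, rerunning it with a longer warm‑up is usually more informative than averaging the noisy runs.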

Practical fio jobs and commands (reproducible)

Below are example fio jobs you can use as a starting point. Adjust filenames/device paths and plugin options per your environment. Each job runs for 240s; run each one three times (for example from a wrapper script) so you can compute variance across runs.

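Example: sequential read (1MB blocks), host‑initiated baseline for the GPU‑initiated test

[global]
# libaio (or io_uring) issues I/O from the host CPU; use it as the baseline
# and switch to a GDS-capable engine for the GPU-initiated variant
ioengine=libaio
direct=1
rw=read
bs=1M
# the default iodepth of 1 rarely saturates an NVMe device
iodepth=32
runtime=240
time_based
numjobs=1
group_reporting

[seq_read_1m]
filename=/dev/nvme0n1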
  

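For GPU‑initiated I/O with GPUDirect Storage (GDS), fio must be built with GDS (libcufile) support and launched with additional engine options. The sketch below assumes fio's libcufile engine and a test file on a GDS‑supported filesystem; the path and size are placeholders, and you should confirm the option names for your build with fio --enghelp=libcufile:

fio --name=gds_read --filename=/mnt/gds/testfile --size=64G --rw=read --ioengine=libcufile --cuda_io=cufile --gpu_dev_ids=0 --direct=1 --bs=1M --runtime=240 --time_based --numjobs=1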
  

Example: small random reads (4KB), QD sweep

[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
runtime=240
time_based
numjobs=1
group_reporting

[rand4k_qd1]
# stonewall runs the sections one after another instead of concurrently
stonewall
iodepth=1
filename=/dev/nvme0n1

[rand4k_qd16]
stonewall
iodepth=16
filename=/dev/nvme0n1
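
The job file above covers two queue depths; the full sweep from the test matrix can be generated rather than written by hand. This is a sketch under the assumption that each queue depth should run in isolation, which is why every section carries fio's stonewall option. The function name and section naming scheme are illustrative.

```python
def qd_sweep_job(device, depths, bs="4k", runtime_s=240, engine="libaio"):
    """Return fio job-file text with one stonewalled section per queue depth."""
    lines = [
        "[global]",
        f"ioengine={engine}",
        "direct=1",
        "rw=randread",
        f"bs={bs}",
        f"runtime={runtime_s}",
        "time_based",
        "numjobs=1",
        "group_reporting",
        f"filename={device}",
    ]
    for qd in depths:
        # stonewall makes each section wait for the previous one to finish
        lines += ["", f"[rand{bs}_qd{qd}]", "stonewall", f"iodepth={qd}"]
    return "\n".join(lines) + "\n"


print(qd_sweep_job("/dev/nvme0n1", [1, 4, 16, 32, 128]))
```

Writing the returned text to a file and passing it to fio keeps the sweep definition in version control alongside the results.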
  

GPU side metrics and tooling

Capture GPU counters while tests run. Use these tools:

  • nvidia‑smi dmon or nvidia‑smi --query-gpu=utilization.gpu,utilization.memory,power.draw --format=csv -l 1 for basic utilization, memory utilization and power.
  • Nsight Systems for timeline traces (kernel launches, DMA activity, memcpy operations across NVLink).
  • NVLink/Fusion specific counters — vendor tooling or kernel debugfs nodes may expose link bytes/s. Export these to CSV at 1s intervals.
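
Once the counters are exported, they need to be parsed into numbers before they can be correlated with fio output. The sketch below assumes samples were collected with nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,power.draw --format=csv,noheader,nounits -l 1; the function names and output keys are illustrative, and fields reported as [N/A] become None.

```python
import csv
import io

FIELDS = ["timestamp", "utilization.gpu", "utilization.memory", "power.draw"]


def to_float(text):
    """Convert a counter to float, returning None for values such as [N/A]."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_samples(text):
    """Parse csv,noheader,nounits output into a list of dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(rec) != len(FIELDS):
            continue  # skip blank or malformed lines
        ts, gpu, mem, power = (field.strip() for field in rec)
        rows.append({
            "timestamp": ts,
            "gpu_util_pct": to_float(gpu),
            "mem_util_pct": to_float(mem),
            "power_w": to_float(power),
        })
    return rows
```

NVLink‑specific counters will need their own parser, since their format depends on the vendor tooling that exposes them.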

To determine whether NVLink or the storage device is the limiting factor, use a three‑step isolation approach:

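  1. Measure raw SSD streaming bandwidth with CPU‑initiated direct I/O on an otherwise idle host, keeping the GPU and NVLink out of the data path (use SPDK if you also want to exclude the kernel NVMe driver). This gives device_max_bw.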
  2. Measure GPU memory copy bandwidth across NVLink with synthetic memcpy kernels to get interconnect_max_bw.
  3. Run GPU‑initiated storage read and compare observed throughput to min(device_max_bw, interconnect_max_bw). If observed << min, investigate driver/CPU overhead, registration costs or contention.

Interpreting results — what failures look like

Here are common failure modes and how to identify them:

  • Low throughput but low GPU/memory utilization: indicates storage device or NVMe driver bottleneck. Inspect SSD SMART stats, queue saturation and block layer latencies.
  • High NVLink utilization with tail latency spikes: suggests congestion on the interconnect or inefficient batching. Try increasing request coalescing or adjusting block sizes.
  • High CPU system time during GPU direct tests: buffer registration or invalidation costs may dominate. Evaluate SPDK or long‑lived (persistent) buffer registration strategies.
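
These rules can be encoded as a first‑pass triage step in an analyzer. The sketch below is a rough illustration: every threshold is an assumption picked for the example and should be tuned against your own baselines, and all inputs are expected as fractions between 0 and 1 except the tail ratio.

```python
# All thresholds below are illustrative assumptions, not vendor guidance.
LOW_THROUGHPUT = 0.8    # observed throughput / expected ceiling
LOW_GPU_UTIL = 0.3
HIGH_LINK_UTIL = 0.9
HIGH_TAIL_RATIO = 10.0  # p99.9 latency divided by p50 latency
HIGH_CPU_SYS = 0.3


def diagnose(throughput_ratio, gpu_util, link_util, tail_ratio, cpu_sys):
    """Return the failure modes whose rule matches the given counters."""
    findings = []
    if throughput_ratio < LOW_THROUGHPUT and gpu_util < LOW_GPU_UTIL:
        findings.append("storage device or NVMe driver bottleneck")
    if link_util > HIGH_LINK_UTIL and tail_ratio > HIGH_TAIL_RATIO:
        findings.append("interconnect congestion or inefficient batching")
    if cpu_sys > HIGH_CPU_SYS:
        findings.append("registration or invalidation overhead")
    return findings
```

An empty list means no rule matched, not that the run is healthy; treat the output as a pointer for where to look first.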

Reporting format — make results comparable

For each test run produce a small, consistent result bundle containing:

  • Test metadata: hardware, software versions, BIOS/firmware hashes, date/time.
  • Raw fio output (JSON) and parsed CSVs for throughput, IOPS and latency percentiles.
  • NVLink/GPU counters sampled at 1s resolution.
  • CPU/system counters: top processes, perf summary or bpftrace snapshots.
  • A short conclusion block: expected bottleneck vs observed, and one suggested mitigation per identified issue.
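
Turning raw fio JSON into a flat CSV is the step most worth standardizing. The sketch below reads the fields as they appear in fio 3.x JSON output to the best of my knowledge (jobs, jobname, bw_bytes, iops, clat_ns.percentile); verify the key names against your fio version. The function names and column names are illustrative.

```python
import csv
import io
import json


def _us(pcts, key):
    """Convert a nanosecond percentile entry to microseconds, if present."""
    value = pcts.get(key)
    return None if value is None else value / 1000.0


def fio_rows(fio_json_text, direction="read"):
    """Yield one flat dict per fio job from JSON-formatted fio output."""
    doc = json.loads(fio_json_text)
    for job in doc.get("jobs", []):
        stats = job[direction]
        pcts = stats.get("clat_ns", {}).get("percentile", {})
        yield {
            "job": job["jobname"],
            "gb_per_s": stats["bw_bytes"] / 1e9,
            "iops": stats["iops"],
            "p50_us": _us(pcts, "50.000000"),
            "p99_us": _us(pcts, "99.000000"),
            "p99.9_us": _us(pcts, "99.900000"),
        }


def to_csv(rows):
    """Render the row dicts as CSV text with a header line."""
    rows = list(rows)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Keeping the raw JSON in the bundle as well means the CSV can always be regenerated if the column set changes later.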

Example analysis (interpreting a hypothetical run)

Suppose a 1MB sequential GPU‑initiated read yields 8 GB/s sustained. device_max_bw, measured separately, is 10 GB/s, and NVLink memcpy synthetic tests show 16 GB/s of interconnect capacity. The ceiling is therefore min(10, 16) = 10 GB/s, set by the device, and the run falls 2 GB/s (20%) short of it: investigate driver overheads or queue depth. If CPU system time is high during the run, focus on memory registration costs or small DMA chunks. If NVLink counters show intermittent saturation spikes, try increasing request coalescing on the GPU side or batching reads in larger strides.
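
The arithmetic in this example is simple enough to automate for every run. A minimal sketch, using the hypothetical figures from the paragraph above (the function name and result keys are illustrative):

```python
def ceiling_gap(observed_gbps, device_max_gbps, interconnect_max_gbps):
    """Compare observed throughput with the lower of the two measured ceilings."""
    ceiling = min(device_max_gbps, interconnect_max_gbps)
    limiter = "device" if device_max_gbps <= interconnect_max_gbps else "interconnect"
    return {
        "ceiling_gbps": ceiling,
        "limiter": limiter,
        "gap_gbps": ceiling - observed_gbps,
        "efficiency": observed_gbps / ceiling,
    }


result = ceiling_gap(observed_gbps=8.0, device_max_gbps=10.0, interconnect_max_gbps=16.0)
print(result)
# {'ceiling_gbps': 10.0, 'limiter': 'device', 'gap_gbps': 2.0, 'efficiency': 0.8}
```

An efficiency of 0.8 against a device‑limited ceiling says the interconnect has headroom, so the remaining 20% is most likely lost in software on the path between the device and the GPU.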

Advanced strategies for more accurate or lower‑latency results

  • Use SPDK or userland drivers to eliminate kernel block layer jitter when measuring best‑case maximum bandwidth.
  • Experiment with io_uring and submission batching on the RISC‑V side to lower CPU syscall overhead when the SoC initiates I/O.
  • Long‑lived buffer registration and reuse: avoid repeatedly registering GPU buffers with storage drivers; register once and reuse the buffers to minimize overhead.
  • NUMA and CPU/GPU affinity: pin processes and DMA to the same domain to reduce cross‑domain latency (especially important on multi‑socket host designs or complex SoCs).

Reproducible automation checklist

Put your tests into CI by codifying steps above. At minimum: a small shell script that sets up performance governor, clears caches, runs a warm‑up, executes fio with JSON output, and archives GPU/Link counters. Store results in an artifact store and apply a simple analyzer script to generate comparison dashboards.

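#!/bin/bash
set -euo pipefail
DEV=/dev/nvme0n1
# record environment
uname -a > run_meta.txt
fio --version >> run_meta.txt
nvidia-smi --query-gpu=name,driver_version,vbios_version --format=csv > gpu_meta.csv
# prepare host and device (requires root; use the sysfs governor files if cpupower is unavailable)
cpupower frequency-set -g performance
sync
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs "$DEV"
# warm-up (results are discarded)
fio warmup_job.fio --output-format=json --output=warmup.json
# capture GPU counters in the background while the test runs
# (dmon writes space-separated columns, one sample per second)
nvidia-smi dmon -s u -d 1 -f nvidia_dmon.csv &
DMON_PID=$!
trap 'kill "$DMON_PID" 2>/dev/null || true' EXIT
# run test
fio gpu_seq_read.fio --output-format=json --output=run.json
# stop the sampler before analysis
kill "$DMON_PID" 2>/dev/null || true
wait "$DMON_PID" 2>/dev/null || true
# analyze
python3 analyze_results.py run.json nvidia_dmon.csv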
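
For the comparison step, a CI job typically needs a pass/fail signal rather than a dashboard. The sketch below shows one possible check against a stored baseline; the metric names, the 5% tolerance and the function name are assumptions for illustration, not part of any existing analyzer.

```python
def compare_to_baseline(current, baseline, tolerance=0.05):
    """Return metrics that regressed by more than `tolerance` (a hypothetical 5%)."""
    higher_is_better = {"gb_per_s", "iops"}
    regressions = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue
        change = (current[name] - base) / base
        # throughput regresses when it drops; latency regresses when it rises
        worse = -change if name in higher_is_better else change
        if worse > tolerance:
            regressions[name] = round(change, 4)
    return regressions


baseline = {"gb_per_s": 8.0, "p99_us": 200.0}
current = {"gb_per_s": 7.0, "p99_us": 205.0}
print(compare_to_baseline(current, baseline))  # {'gb_per_s': -0.125}
```

A non‑empty result can fail the pipeline, while the archived raw outputs remain available for manual inspection.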
  

Future predictions and architecture best practices for 2026+

As NVLink Fusion adoption grows across RISC‑V and other CPU platforms, expect these trends:

  • Standardization of GPU‑initiated storage APIs and expanded support for fio and SPDK plugins to make GPU direct tests easier to run in CI.
  • Tighter co‑design between storage vendors and GPU vendors to expose link and DMA telemetry that makes root cause analysis deterministic.
  • More workloads streaming data directly into GPU memory for model training and inference, raising the bar for tail‑latency SLAs.
  • Commoditization of persistent memory registration pools and driver support for long‑lived registrations to reduce CPU load.

Checklist: what to include in a release‑ready benchmark report

  • Clear, versioned environment metadata.
  • Test matrix and raw outputs (JSON/CSV).
  • Analysis scripts and visualization artifacts (throughput vs time, percentile curves, NVLink utilization graphs).
  • Hypotheses for observed bottlenecks and at least one remediation test per hypothesis.

Final actionable takeaways

  • Always run both CPU‑initiated and GPU‑initiated paths — they surface different bottlenecks.
  • Measure tail percentiles (p99.9+) as they matter for inference SLAs; average throughput is insufficient.
  • Automate environment capture (drivers/kernel/firmware) so benchmark runs are auditable and comparable.
  • Use kernel bypass (SPDK) or persistent registrations to isolate device capacity from software overheads.
  • Compare observed throughput to the measured ceiling min(device_bw, interconnect_bw, memory_bw); this gives an immediate direction for root cause work.

Call to action

Ready to run this plan in your lab? Start by capturing your environment metadata, installing fio and Nsight Systems, and running the two baseline tests (device_max_bw and interconnect memcpy). If you need a turnkey test harness or a scripted analyzer that formats reports into dashboards, reach out for a reproducible test harness we’ve used with RISC‑V + NVLink Fusion prototypes in late 2025. Share your anonymized results and we’ll help interpret bottlenecks and suggest mitigations tailored to your stack.
