A single-host fio wrapper that measures the storage visible at one point in a stack and emits structured JSON. It is deliberately narrow: it measures and reports, nothing else.
It is: a measurement agent. Runs a curated set of workload profiles, captures full latency distributions, emits one JSON document.
It is not: topology-aware, multi-node-aware, or a report generator. It doesn't know what layer it's running on or what other nodes exist. Those concerns belong to a wrapper (Slurm launcher) and to later analysis tooling. This separation is intentional - the agent stays small and stable while everything around it evolves.
chmod +x bench_agent.py
# Smoke test (~10 min) - verify it runs and produces data
sudo ./bench_agent.py -d /mnt/target --mode quick -o test.json
# Default run (~75 min)
sudo ./bench_agent.py -d /mnt/target --mode operating -o host1.json
# Thorough (~2.5 hr)
sudo ./bench_agent.py -d /mnt/target --mode limit -o host1.json

Run with sudo where possible: dropping the page cache between runs requires root and
gives cleaner numbers. Without it the agent still runs but warns that results may include
cache effects.
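
For reference, dropping the page cache on Linux can be sketched as below. This is a minimal illustration and not the agent's actual code: the function name drop_page_cache and its path parameter are assumptions made for the example. The /proc/sys/vm/drop_caches file is the standard Linux interface.

```python
import os

DROP_CACHES = "/proc/sys/vm/drop_caches"  # standard Linux sysctl file


def drop_page_cache(path: str = DROP_CACHES) -> bool:
    """Flush dirty pages, then ask the kernel to drop clean caches.

    Returns True on success and False when the write is not permitted
    (for example when not root), so the caller can warn that results
    may include cache effects. The name and signature are hypothetical.
    """
    os.sync()  # write back dirty pages first; only clean pages can be dropped
    try:
        with open(path, "w") as f:
            f.write("3\n")  # 3 = page cache plus dentries and inodes
        return True
    except OSError:
        return False
```

A caller would log a warning and continue when this returns False, which matches the non-root behaviour described above.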
Modes are organized by intent - the question a run answers - not just duration.
| Mode | Wall time | Repeats | Concurrency sweep | Question it answers |
|---|---|---|---|---|
| quick | ~10 min | 1 | jobs {1,4} x qd {1,8} | "does it work, rough numbers?" |
| operating | ~75 min | 3 | jobs {1,4} x qd {1,8,32} | "what's the usable performance?" (default) |
| limit | ~2.5 hr | 2 | jobs {1,8,32} x qd {1,16,64} | "where does it fall over?" |
operating sweeps the usable range up to and just past the knee - the performance you can
plan around. limit deliberately pushes into saturation to find the breaking point (expect
huge tail latencies at the top; that's the data). Most of the runtime is the durability
doubling: write profiles run twice (scratch + sync).
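
To see how the sweep and the durability doubling multiply out, here is a small sketch. The values are copied from the mode table above; the MODES dict layout and the function names are assumptions for illustration, not the agent's internal structure.

```python
from itertools import product

# Values copied from the mode table; the dict layout itself is hypothetical.
MODES = {
    "quick":     {"repeats": 1, "jobs": (1, 4),     "qd": (1, 8)},
    "operating": {"repeats": 3, "jobs": (1, 4),     "qd": (1, 8, 32)},
    "limit":     {"repeats": 2, "jobs": (1, 8, 32), "qd": (1, 16, 64)},
}


def sweep_points(mode: str) -> list[tuple[int, int]]:
    """Return every (numjobs, iodepth) pair the mode sweeps."""
    cfg = MODES[mode]
    return list(product(cfg["jobs"], cfg["qd"]))


def runs_per_profile(mode: str, durabilities: int = 1) -> int:
    """Count fio invocations for one profile: points x durabilities x repeats."""
    return len(sweep_points(mode)) * durabilities * MODES[mode]["repeats"]
```

Under these assumptions, runs_per_profile("operating") gives the count for a read profile, and passing durabilities=2 gives the count for a write profile that runs both scratch and sync.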
Per-test runtime is overridable independent of mode with --runtime SECONDS. Use longer
runs (120+) to defeat burst-credit storage and smooth variance:
sudo ./bench_agent.py -d /mnt/target --mode operating --runtime 120

Warm-up exclusion with --ramp-time SECONDS (default 0): fio runs that long before
it starts measuring, discarding cache warm-up from the results. On cached layers - GPFS
pagepool, NFS client/server caches - a short run otherwise averages warm-up together with
steady state, producing bimodal latencies (fast cache hits mixed with slow misses). A ramp
gives a cleaner steady-state number:
sudo ./bench_agent.py -d /gpfs --no-direct --ramp-time 10 --mode operating

The ramp time used is recorded per result (ramp_time_s), so ramped and unramped runs stay
distinguishable.
Repeats > 1 report each run separately so analysis can take medians and see the spread - important in virtualized and shared environments where a single run is noisy.
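
A minimal sketch of what "median and spread" could look like in later analysis, assuming the repeated values for one test point have already been collected into a list. The function name and the returned keys are hypothetical.

```python
from statistics import median


def summarize_repeats(values: list[float]) -> dict[str, float]:
    """Reduce the per-repeat values of one test point to median and spread."""
    if not values:
        raise ValueError("need at least one repeat")
    mid = median(values)
    lo, hi = min(values), max(values)
    # Relative spread: (max - min) / median; 0.0 for a zero median.
    rel = (hi - lo) / mid if mid else 0.0
    return {"median": mid, "min": lo, "max": hi, "rel_spread": rel}
```

A large rel_spread on a shared or virtualized host is a hint that the median, not any single run, is the number to trust.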
Curated patterns that bracket the IO space. Most real workloads resemble one of these, or a blend of them:
| Profile | Pattern | Models | Durability |
|---|---|---|---|
| seq_read | 1M sequential read | dataset load, streaming input | scratch |
| seq_write | 1M sequential write | checkpoint/restart | scratch + sync |
| rand_read | 4K random read | metadata-like, small-object reads | scratch |
| rand_write | 4K random write | small random writes | scratch + sync |
| mixed | 16K 70/30 read/write | typical application I/O | scratch |
| lat_floor | 4K QD1 single-job read | honest per-layer round-trip cost | scratch |
Run a subset with --profiles seq_write,rand_write.
Each profile has a fixed block size suited to its pattern (sequential 1m, random
4k, mixed 16k). --block-size overrides all selected profiles to one size,
passed straight to fio. This matters for filesystems whose block size differs
from the defaults: a parallel filesystem like GPFS with a 4m block performs
terribly on the default 4k random profiles (a 1024x size mismatch forcing
worst-case behaviour), but characterizes properly when tested at its block size:
# See GPFS at its native block size, not its worst case
./bench_agent.py -d /gpfs --block-size 4m
# Scope to specific profiles
./bench_agent.py -d /gpfs --block-size 1m --profiles rand_read,rand_write

The block size used is recorded per result, so runs at different sizes stay distinguishable in the CSV. A key methodology lesson: small-block random tests are only meaningful when the block size is sensible for the target filesystem - 4k on a 4k-formatted local disk is real; 4k on a 4m GPFS is a pathology, not a characterization.
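
The 1024x figure is simple arithmetic on the two block sizes. The sketch below checks it, assuming binary units (1k = 1024 bytes), which is fio's default interpretation. The helper names are hypothetical and not part of the agent.

```python
# Binary multipliers, matching fio's default size interpretation.
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3}


def parse_size(text: str) -> int:
    """Convert a size such as '4k' or '4m' to bytes."""
    text = text.strip().lower()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)  # plain byte count


def mismatch_ratio(fs_block: str, io_block: str) -> float:
    """How many times larger the filesystem block is than the I/O size."""
    return parse_size(fs_block) / parse_size(io_block)
```

Here mismatch_ratio("4m", "4k") evaluates to 1024.0, and a ratio of 1.0 means the test runs at the filesystem's native block size.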
Write profiles run twice - once each way - because they answer different questions:
- scratch: writes return when the stack accepts them (throughput view). Relevant for disposable/intermediate data where durability is handled elsewhere.
- sync: fdatasync after each write, forcing data toward stable storage (durable view). Relevant for checkpoints you can't afford to lose. On NFS layers this interacts with the sync/async export option - worth being aware of when interpreting results.
The gap between the two is often large (and on cached/NFS layers, dramatic). Seeing both side by side at each layer is a primary goal of this exercise.
The agent measures one host. Slurm launches one agent per node; the shared --run-id
(the Slurm job ID) ties the per-node JSON files together for later analysis.
sbatch --nodes=4 slurm_bench.sh /mnt/shared_target operating
sbatch --nodes=16 slurm_bench.sh /scratch limit

Each job produces one bench_<jobid>_<nodename>.json per node. Concurrency in the HPC sense
(many nodes hitting shared storage at once) comes from running across nodes - the agent's
own jobs/iodepth sweep is per-host parallelism, a different and complementary axis.
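
Collecting the per-node files of one Slurm job for later analysis might look like the sketch below. It assumes only the bench_<jobid>_<nodename>.json naming described above; the function name load_run and the idea of keying by node name are assumptions for illustration.

```python
import glob
import json
import os


def load_run(directory: str, job_id: str) -> dict[str, dict]:
    """Map node name -> parsed JSON for each bench_<jobid>_<nodename>.json."""
    prefix = f"bench_{job_id}_"
    docs = {}
    for path in sorted(glob.glob(os.path.join(directory, prefix + "*.json"))):
        # Strip the prefix and the .json suffix to recover the node name.
        node = os.path.basename(path)[len(prefix):-len(".json")]
        with open(path) as f:
            docs[node] = json.load(f)
    return docs
```

Files from other job IDs in the same directory are ignored because the prefix includes the trailing underscore.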
fio parameters live in one place, separated by kind: FIXED_FIO_ARGS (harness flags,
identical every run), per-test workload params (built by build_fio_params from the
profile + sweep point), and environment params (filename, engine, direct, size, runtime -
merged in at run time by resolve_fio_params).
Because parameters are data, the agent can print the exact fio command for every test without running anything:
./bench_agent.py -d /mnt/target --mode quick --dry-run
./bench_agent.py -d /mnt/target --mode operating --profiles seq_write --dry-run

Use this to verify the workload definitions, confirm --fdatasync=1 appears only on sync
variants, and check engine selection (psync at QD1, async otherwise) before committing to
a long run.
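
The "parameters are data" idea can be sketched as follows. The names FIXED_FIO_ARGS, build_fio_params and resolve_fio_params come from the text above, but their bodies, signatures and values here are assumptions, and to_command is a hypothetical helper standing in for the dry-run output. The fio options used (rw, bs, numjobs, iodepth, fdatasync, ioengine, and so on) are real fio options.

```python
import shlex

# Harness flags, identical on every run (illustrative values only).
FIXED_FIO_ARGS = {"output-format": "json", "time_based": 1, "group_reporting": 1}


def build_fio_params(profile: dict, numjobs: int, iodepth: int, durability: str) -> dict:
    """Per-test workload params from a profile plus one sweep point."""
    params = {"name": profile["name"], "rw": profile["rw"], "bs": profile["bs"],
              "numjobs": numjobs, "iodepth": iodepth}
    if durability == "sync":
        params["fdatasync"] = 1  # only the sync variant syncs after each write
    return params


def resolve_fio_params(workload: dict, env: dict, async_engine: str = "libaio") -> dict:
    """Merge fixed, workload and environment params; pick the engine by queue depth."""
    engine = "psync" if workload["iodepth"] == 1 else async_engine
    return {**FIXED_FIO_ARGS, **workload, **env, "ioengine": engine}


def to_command(params: dict) -> str:
    """Render the fio command line without running it (hypothetical helper)."""
    return shlex.join(["fio"] + [f"--{k}={v}" for k, v in params.items()])
```

Because every test is a plain dict until the last step, printing the command and executing it share the same code path, which is what makes a dry run trustworthy.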
The agent emits JSON; bench_to_csv.py flattens it to CSV for spreadsheet analysis.
Measurement and presentation are kept separate - the agent only produces JSON, the converter
only consumes it.
./bench_to_csv.py results.json # CSV to stdout
./bench_to_csv.py results.json > out.csv # redirect to a file
./bench_to_csv.py results.json -o out.csv # write to a file explicitly
cat results.json | ./bench_to_csv.py - # read from stdin

One row per record. The full percentile array is reduced to four key columns (p50_us,
p99_us, p99_9_us, p99_99_us); the complete array stays in the JSON. To surface
different percentiles, edit the PERCENTILES map at the top of the converter. Metadata
fields (layer, appliance_id, config_version) fall back to the document-level metadata
block when not set per-record, so a wrapper can stamp them either way.
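
The flattening and the metadata fallback can be sketched like this. It assumes the percentile array is stored under clat_pct_us as a mapping keyed by strings such as 99.900000; the exact keys, the METADATA_FIELDS tuple and the flatten function are assumptions for illustration, not the converter's real code.

```python
# CSV column -> key in the record's percentile mapping (key spelling assumed).
PERCENTILES = {"p50_us": "50.000000", "p99_us": "99.000000",
               "p99_9_us": "99.900000", "p99_99_us": "99.990000"}
METADATA_FIELDS = ("layer", "appliance_id", "config_version")


def flatten(record: dict, doc_metadata: dict) -> dict:
    """Turn one JSON record into one flat CSV row."""
    # Copy scalar fields; the full percentile mapping stays in the JSON only.
    row = {k: v for k, v in record.items() if k != "clat_pct_us"}
    for column, key in PERCENTILES.items():
        row[column] = record.get("clat_pct_us", {}).get(key)
    for field in METADATA_FIELDS:
        # A per-record value wins; otherwise use the document-level block.
        row[field] = record.get(field) or doc_metadata.get(field)
    return row
```

Adding a percentile column under this sketch means adding one entry to PERCENTILES, which mirrors the edit described above.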
One JSON document. Top level carries run-wide info plus a metadata block (blank for now -
a wrapper fills in layer, appliance_id, config_version later). records is a flat
list, one entry per (profile x concurrency point x durability x direction x repeat).
Each record includes test identity, environment, and metrics: bandwidth, IOPS, mean latency, and the full completion-latency percentile array (1st through 99.99th, in us). The complete percentile array is captured because tail latency is where layered/virtualized storage hurts, and re-running to recover a missing percentile later is expensive.
Environment and cache-context fields per record: engine, direct, runtime_s,
ramp_time_s, file_size, mem_total_gb, file_exceeds_ram, caches_dropped. These
make every record self-describing about the conditions it was measured under - so you can
later filter, for example, for only runs where the working set exceeded RAM
(file_exceeds_ram=true) or where a warm-up ramp was applied (ramp_time_s > 0).
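
Such a filter is a one-liner once the records are loaded. The sketch below uses the field names listed above; the function name is hypothetical.

```python
def steady_state_records(records: list[dict]) -> list[dict]:
    """Keep records whose working set exceeded RAM and that used a warm-up ramp."""
    return [
        r for r in records
        if r.get("file_exceeds_ram") is True and r.get("ramp_time_s", 0) > 0
    ]
```

Records missing either field are treated as not qualifying, which errs on the side of excluding runs whose conditions are unknown.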
When comparing hosts, layers, or configs, the dimensions to group/filter on are:
run_id, hostname, layer, appliance_id, config_version, profile, block_size,
durability, numjobs, iodepth, direction. The metrics to compare are bw_mbs,
iops, lat_mean_us, and the clat_pct_us tail (especially 99.900000 and 99.990000).
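
Grouping on those dimensions and taking a median per group might look like the sketch below. The dimension and metric names come from the list above; the function, its defaults and the subset of dimensions chosen are assumptions, and this is not the planned analysis tool.

```python
from collections import defaultdict
from statistics import median

# A subset of the grouping dimensions named above (choice is illustrative).
DIMENSIONS = ("hostname", "profile", "block_size", "durability",
              "numjobs", "iodepth", "direction")


def median_by_group(records, metric="bw_mbs", dims=DIMENSIONS):
    """Group records by the given dimensions and take the median of one metric."""
    groups = defaultdict(list)
    for r in records:
        # Missing dimensions become None so incomplete records still group.
        groups[tuple(r.get(d) for d in dims)].append(r[metric])
    return {key: median(vals) for key, vals in groups.items()}
```

Repeats of the same test point fall into one group, so the result has one median per point rather than one value per run.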
- File size should exceed every cache in the path. Default 4g; raise with -s 8g on hosts with large RAM or where NFS client caching is generous.
- O_DIRECT is attempted by default and silently falls back to buffered where unsupported (NFS, tmpfs, overlayfs) - the record's direct field tells you which actually happened.
- Engine is the best available async engine (io_uring, else libaio) for queue-depth tests, psync for QD1 latency-floor tests; recorded per-record. io_uring is probed at startup on the target filesystem and the agent falls back to libaio if it's unusable (not compiled in, kernel too old, or blocked by kernel.io_uring_disabled), with a warning.
- Caches dropped between runs when root.
- Metadata wrapper: populate layer/appliance_id/config_version per host.
- Analysis tool: read many JSON files, group by dimensions, compute medians/spreads, compare across hosts/layers/configs, surface tail-latency outliers.
- Per-layer tuning checklist.
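
As a companion to the O_DIRECT note earlier, one way to probe whether a target directory accepts O_DIRECT is sketched below. This is an assumption about how such a probe could work, not a description of what the agent does; the function name is hypothetical.

```python
import os
import tempfile


def supports_o_direct(directory: str) -> bool:
    """Return True if a file in `directory` can be opened with O_DIRECT."""
    flag = getattr(os, "O_DIRECT", None)
    if flag is None:  # platform without O_DIRECT
        return False
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        os.close(os.open(path, os.O_RDWR | flag))
        return True
    except OSError:
        return False  # typically EINVAL where the filesystem rejects O_DIRECT
    finally:
        os.unlink(path)  # never leave the probe file behind
```

A False result would correspond to the buffered fallback, with the outcome recorded in the direct field of each record.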