
aabaris/storage-benchmarks

Storage Benchmark Agent

A single-host fio wrapper that measures the storage visible at one point in a stack and emits structured JSON. It is deliberately narrow: it measures and reports, nothing else.

What it is and isn't

It is: a measurement agent. Runs a curated set of workload profiles, captures full latency distributions, emits one JSON document.

It is not: topology-aware, multi-node-aware, or a report generator. It doesn't know what layer it's running on or what other nodes exist. Those concerns belong to a wrapper (Slurm launcher) and to later analysis tooling. This separation is intentional - the agent stays small and stable while everything around it evolves.

Quick start

chmod +x bench_agent.py

# Smoke test (~10 min) - verify it runs and produces data
sudo ./bench_agent.py -d /mnt/target --mode quick -o test.json

# Default run (~75 min)
sudo ./bench_agent.py -d /mnt/target --mode operating -o host1.json

# Thorough (~2.5 hr)
sudo ./bench_agent.py -d /mnt/target --mode limit -o host1.json

Run with sudo where possible: dropping the page cache between runs requires root and gives cleaner numbers. Without it the agent still runs but warns that results may include cache effects.

Modes

Modes are organized by intent - the question a run answers - not just duration.

| Mode | Wall time | Repeats | Concurrency sweep | Question it answers |
|---|---|---|---|---|
| quick | ~10 min | 1 | jobs {1,4} x qd {1,8} | "does it work, rough numbers?" |
| operating | ~75 min | 3 | jobs {1,4} x qd {1,8,32} | "what's the usable performance?" (default) |
| limit | ~70 min | 2 | jobs {1,8,32} x qd {1,16,64} | "where does it fall over?" |

operating sweeps the usable range up to and just past the knee - the performance you can plan around. limit deliberately pushes into saturation to find the breaking point (expect huge tail latencies at the top; that's the data). Most of the runtime is the durability doubling: write profiles run twice (scratch + sync).
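To make the sweep sizes concrete, here is a minimal sketch that enumerates the concurrency points for each mode. The jobs, queue-depth, and repeat values are copied from the Modes table; the MODES dictionary layout and the function names are illustrative and are not taken from bench_agent.py.

```python
from itertools import product

# Values copied from the Modes table; the dict layout is an assumption.
MODES = {
    "quick":     {"repeats": 1, "jobs": (1, 4),     "qd": (1, 8)},
    "operating": {"repeats": 3, "jobs": (1, 4),     "qd": (1, 8, 32)},
    "limit":     {"repeats": 2, "jobs": (1, 8, 32), "qd": (1, 16, 64)},
}

def sweep_points(mode):
    """Return every (numjobs, iodepth) pair a mode covers."""
    cfg = MODES[mode]
    return list(product(cfg["jobs"], cfg["qd"]))

def runs_per_profile_variant(mode):
    """Concurrency points multiplied by repeats, for one profile variant."""
    return len(sweep_points(mode)) * MODES[mode]["repeats"]

for name in MODES:
    print(name, len(sweep_points(name)), runs_per_profile_variant(name))
```

Each write profile contributes two variants (scratch and sync), so its share of the run is twice the per-variant count shown here.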

Per-test runtime can be overridden independently of mode with --runtime SECONDS. Use longer runs (120 seconds or more) to outlast burst credits on burst-credit storage and to smooth variance:

sudo ./bench_agent.py -d /mnt/target --mode operating --runtime 120

Warm-up exclusion with --ramp-time SECONDS (default 0): fio runs that long before it starts measuring, discarding cache warm-up from the results. On cached layers - GPFS pagepool, NFS client/server caches - a short run otherwise averages warm-up together with steady state, producing bimodal latencies (fast cache hits mixed with slow misses). A ramp gives a cleaner steady-state number:

sudo ./bench_agent.py -d /gpfs --no-direct --ramp-time 10 --mode operating

The ramp time used is recorded per result (ramp_time_s), so ramped and unramped runs stay distinguishable.

Repeats > 1 report each run separately so analysis can take medians and see the spread - important in virtualized and shared environments where a single run is noisy.
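The analysis tool is not built yet, but the median-and-spread step could look roughly like the sketch below. It assumes the records list has already been loaded from the JSON document and that each record carries the field names listed under "Key fields for comparison"; the function name and the returned layout are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Fields that identify one test point; repeats share the same values.
KEY_FIELDS = ("profile", "block_size", "durability",
              "numjobs", "iodepth", "direction")

def summarize_repeats(records, metric="bw_mbs"):
    """Group repeat runs of the same test point and report median and range."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(field) for field in KEY_FIELDS)
        groups[key].append(rec[metric])
    return {
        key: {
            "n": len(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in groups.items()
    }
```

A wide gap between min and max at the same test point suggests a noisy environment, in which case more repeats or a longer --runtime may be worth the time.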

Profiles

Curated patterns that bracket the IO space. Most real workloads resemble one of these, or a blend of them:

| Profile | Pattern | Models | Durability |
|---|---|---|---|
| seq_read | 1M sequential read | dataset load, streaming input | scratch |
| seq_write | 1M sequential write | checkpoint/restart | scratch + sync |
| rand_read | 4K random read | metadata-like, small-object reads | scratch |
| rand_write | 4K random write | small random writes | scratch + sync |
| mixed | 16K 70/30 read/write | typical application I/O | scratch |
| lat_floor | 4K QD1 single-job read | honest per-layer round-trip cost | scratch |

Run a subset with --profiles seq_write,rand_write.

Overriding block size (--block-size)

Each profile has a fixed block size suited to its pattern (sequential 1m, random 4k, mixed 16k). --block-size overrides all selected profiles to one size, passed straight to fio. This matters for filesystems whose block size differs from the defaults: a parallel filesystem like GPFS with a 4m block performs terribly on the default 4k random profiles (a 1024x size mismatch forcing worst-case behaviour), but characterizes properly when tested at its block size:

# See GPFS at its native block size, not its worst case
./bench_agent.py -d /gpfs --block-size 4m

# Scope to specific profiles
./bench_agent.py -d /gpfs --block-size 1m --profiles rand_read,rand_write

The block size used is recorded per result, so runs at different sizes stay distinguishable in the CSV. A key methodology lesson: small-block random tests are only meaningful when the block size is sensible for the target filesystem - 4k on a 4k-formatted local disk is real; 4k on a 4m GPFS is a pathology, not a characterization.
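The 1024x figure is simple arithmetic on the two block sizes. The sketch below shows the calculation, assuming 1024-based size suffixes; parse_size and mismatch_factor are hypothetical helpers, not part of the agent.

```python
def parse_size(text):
    """Convert a size such as '4k' or '4m' to bytes, using 1024-based units."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)

def mismatch_factor(fs_block, io_block):
    """How many times larger the filesystem block is than the test IO size."""
    return parse_size(fs_block) / parse_size(io_block)

print(mismatch_factor("4m", "4k"))
```

A factor near 1 means the test IO size matches the filesystem block; a large factor is a hint that the small-block random results describe a mismatch rather than the storage itself.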

Durability: scratch vs sync

Write profiles run twice - once each way - because they answer different questions:

  • scratch: writes return when the stack accepts them (throughput view). Relevant for disposable/intermediate data where durability is handled elsewhere.
  • sync: fdatasync after each write, forcing data toward stable storage (durable view). Relevant for checkpoints you can't afford to lose. On NFS layers this interacts with the sync/async export option - worth being aware of when interpreting results.

The gap between the two is often large (and on cached/NFS layers, dramatic). Seeing both side by side at each layer is a primary goal of this exercise.
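One way to put a number on that gap is the ratio of scratch to sync throughput at each test point. The sketch below assumes the durability field holds the strings "scratch" and "sync" and that repeats have already been reduced to one record per point (for example by taking medians); the function name is hypothetical.

```python
def durability_gap(records, metric="bw_mbs"):
    """Pair scratch and sync records per test point; return scratch/sync ratios."""
    by_point = {}
    for rec in records:
        point = (rec["profile"], rec["numjobs"], rec["iodepth"], rec["direction"])
        # Assumes one record per (point, durability); later records overwrite earlier ones.
        by_point.setdefault(point, {})[rec["durability"]] = rec[metric]

    gaps = {}
    for point, values in by_point.items():
        # Only points measured both ways, with a non-zero sync value, get a ratio.
        if "scratch" in values and "sync" in values and values["sync"] > 0:
            gaps[point] = values["scratch"] / values["sync"]
    return gaps
```

A ratio of 4.0 would mean the scratch variant moved four times the data per second of the sync variant at the same concurrency.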

Multi-node (Slurm)

The agent measures one host. Slurm launches one agent per node; the shared --run-id (the Slurm job ID) ties the per-node JSON files together for later analysis.

sbatch --nodes=4  slurm_bench.sh /mnt/shared_target operating
sbatch --nodes=16 slurm_bench.sh /scratch limit

Produces bench_<jobid>_<nodename>.json per node. Concurrency in the HPC sense (many nodes hitting shared storage at once) comes from running across nodes - the agent's own jobs/iodepth sweep is per-host parallelism, a different and complementary axis.
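Since the analysis tooling does not exist yet, the following is only a sketch of how the per-node files might be gathered and combined. It relies on the bench_<jobid>_<nodename>.json naming and the records list described in this README; load_run and aggregate_bandwidth are hypothetical names, and averaging repeats per node before summing across nodes is one possible choice among several.

```python
import glob
import json
import os
from statistics import mean

def load_run(directory, job_id):
    """Load every per-node JSON document for one Slurm job, keyed by node name."""
    prefix = f"bench_{job_id}_"
    docs = {}
    for path in sorted(glob.glob(os.path.join(directory, prefix + "*.json"))):
        node = os.path.basename(path)[len(prefix):-len(".json")]
        with open(path) as fh:
            docs[node] = json.load(fh)
    return docs

def aggregate_bandwidth(docs, profile, numjobs, iodepth, durability="scratch"):
    """Sum each node's mean bandwidth (over repeats) for one test point."""
    wanted = (profile, numjobs, iodepth, durability)
    total = 0.0
    for doc in docs.values():
        values = [
            rec["bw_mbs"]
            for rec in doc.get("records", [])
            if (rec.get("profile"), rec.get("numjobs"),
                rec.get("iodepth"), rec.get("durability")) == wanted
        ]
        if values:
            total += mean(values)
    return total
```

The sum is only meaningful if the nodes ran the same test point at roughly the same time, which the agent itself does not coordinate.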

Seeing what fio will run

fio parameters live in one place, separated by kind: FIXED_FIO_ARGS (harness flags, identical every run), per-test workload params (built by build_fio_params from the profile + sweep point), and environment params (filename, engine, direct, size, runtime - merged in at run time by resolve_fio_params).
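The three-layer split can be pictured as dictionaries merged in order and then rendered as flags. The sketch below is not the agent's code: SKETCH_FIXED_ARGS, sketch_workload_params, sketch_resolve, and to_command are stand-ins invented for illustration, and the real FIXED_FIO_ARGS, build_fio_params, and resolve_fio_params may differ in signature and content. The fio option names used are standard fio options.

```python
# Hypothetical stand-in for the harness flags that never change between runs.
SKETCH_FIXED_ARGS = {"output-format": "json", "group_reporting": None}

def sketch_workload_params(name, rw, bs, numjobs, iodepth, sync=False):
    """Per-test workload parameters derived from a profile and a sweep point."""
    params = {"name": name, "rw": rw, "bs": bs,
              "numjobs": numjobs, "iodepth": iodepth}
    if sync:
        params["fdatasync"] = 1  # only the sync variant asks for fdatasync
    return params

def sketch_resolve(workload, env):
    """Merge fixed, workload, and environment parameters (later layers win)."""
    merged = dict(SKETCH_FIXED_ARGS)
    merged.update(workload)
    merged.update(env)
    return merged

def to_command(params):
    """Render a parameter dict as an fio argument list; None means a bare flag."""
    args = ["fio"]
    for key, value in params.items():
        args.append(f"--{key}" if value is None else f"--{key}={value}")
    return args

env = {"filename": "/mnt/target/bench.dat", "ioengine": "libaio",
       "direct": 1, "size": "4g", "runtime": 60, "time_based": None}
workload = sketch_workload_params("seq_write", "write", "1m", 4, 8, sync=True)
print(" ".join(to_command(sketch_resolve(workload, env))))
```

Keeping parameters as plain data until the last step is what makes a dry run cheap: the same dictionary can be printed or executed.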

Because parameters are data, the agent can print the exact fio command for every test without running anything:

./bench_agent.py -d /mnt/target --mode quick --dry-run
./bench_agent.py -d /mnt/target --mode operating --profiles seq_write --dry-run

Use this to verify the workload definitions, confirm --fdatasync=1 appears only on sync variants, and check engine selection (psync at QD1, async otherwise) before committing to a long run.

Converting JSON to CSV

The agent emits JSON; bench_to_csv.py flattens it to CSV for spreadsheet analysis. Measurement and presentation are kept separate - the agent only produces JSON, the converter only consumes it.

./bench_to_csv.py results.json              # CSV to stdout
./bench_to_csv.py results.json > out.csv     # redirect to a file
./bench_to_csv.py results.json -o out.csv    # write to a file explicitly
cat results.json | ./bench_to_csv.py -       # read from stdin

One row per record. The full percentile array is reduced to four key columns (p50_us, p99_us, p99_9_us, p99_99_us); the complete array stays in the JSON. To surface different percentiles, edit the PERCENTILES map at the top of the converter. Metadata fields (layer, appliance_id, config_version) fall back to the document-level metadata block when not set per-record, so a wrapper can stamp them either way.
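The flattening described above might look like the sketch below. It assumes clat_pct_us is a mapping keyed by fio-style percentile strings such as "99.900000"; the contents of PERCENTILES shown here are a guess at the converter's map, and flatten_record is a hypothetical name.

```python
# Assumed mapping from CSV column name to fio percentile key.
PERCENTILES = {
    "p50_us": "50.000000",
    "p99_us": "99.000000",
    "p99_9_us": "99.900000",
    "p99_99_us": "99.990000",
}
METADATA_FIELDS = ("layer", "appliance_id", "config_version")

def flatten_record(record, doc_metadata):
    """Turn one JSON record into a flat CSV row."""
    # Copy every scalar field, leaving the full percentile mapping behind.
    row = {k: v for k, v in record.items() if k != "clat_pct_us"}
    pct = record.get("clat_pct_us", {})
    for column, key in PERCENTILES.items():
        row[column] = pct.get(key)
    # Per-record metadata wins; otherwise fall back to the document-level block.
    for field in METADATA_FIELDS:
        if not row.get(field):
            row[field] = doc_metadata.get(field)
    return row
```

Rows produced this way can be written with csv.DictWriter from the standard library.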

Output schema

One JSON document. Top level carries run-wide info plus a metadata block (blank for now - a wrapper fills in layer, appliance_id, config_version later). records is a flat list, one entry per (profile x concurrency point x durability x direction x repeat).

Each record includes test identity, environment, and metrics: bandwidth, IOPS, mean latency, and the full completion-latency percentile array (1st through 99.99th, in us). The complete percentile array is captured because tail latency is where layered/virtualized storage hurts, and re-running to recover a missing percentile later is expensive.

Environment and cache-context fields per record: engine, direct, runtime_s, ramp_time_s, file_size, mem_total_gb, file_exceeds_ram, caches_dropped. These make every record self-describing about the conditions it was measured under - so you can later filter, for example, for only runs where the working set exceeded RAM (file_exceeds_ram=true) or where a warm-up ramp was applied (ramp_time_s > 0).
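As a small illustration of that kind of filtering, the sketch below keeps only records measured with a working set larger than RAM and a warm-up ramp. The field names come from this README; the function name is hypothetical, and treating a missing ramp_time_s as 0 is an assumption.

```python
def uncached_steady_state(records):
    """Keep records where the file exceeded RAM and a warm-up ramp was applied."""
    return [
        rec for rec in records
        if rec.get("file_exceeds_ram") is True
        and rec.get("ramp_time_s", 0) > 0
    ]
```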

Key fields for comparison

When comparing hosts, layers, or configs, the dimensions to group/filter on are: run_id, hostname, layer, appliance_id, config_version, profile, block_size, durability, numjobs, iodepth, direction. The metrics to compare are bw_mbs, iops, lat_mean_us, and the clat_pct_us tail (especially 99.900000 and 99.990000).

Methodology notes baked in

  • File size should exceed every cache in the path. Default 4g; raise with -s 8g on hosts with large RAM or where NFS client caching is generous.
  • O_DIRECT is attempted by default and silently falls back to buffered I/O where the target does not support it (for example on some NFS, tmpfs, or overlayfs mounts) - the record's direct field tells you which actually happened.
  • Engine is the best available async engine (io_uring, else libaio) for queue-depth tests, psync for QD1 latency-floor tests; recorded per-record. io_uring is probed at startup on the target filesystem and the agent falls back to libaio if it's unusable (not compiled in, kernel too old, or blocked by kernel.io_uring_disabled), with a warning.
  • Caches dropped between runs when root.
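The file-size rule can be checked against the host's memory. The sketch below shows one way a flag like file_exceeds_ram could be derived from the MemTotal line of /proc/meminfo on Linux; it is an assumption about the approach, not the agent's actual implementation, and both function names are hypothetical.

```python
def mem_total_bytes(meminfo_text):
    """Extract MemTotal from /proc/meminfo content; the kernel reports it in kB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")

def file_exceeds_ram(file_size_bytes, meminfo_text):
    """True when the test file is larger than total RAM."""
    return file_size_bytes > mem_total_bytes(meminfo_text)
```

On a Linux host the text would come from reading /proc/meminfo. Exceeding local RAM says nothing about caches elsewhere in the path, such as an NFS server's memory, so the flag is a lower bound on how cache-free a run was.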

Next steps (not yet built)

  • Metadata wrapper: populate layer/appliance_id/config_version per host.
  • Analysis tool: read many JSON files, group by dimensions, compute medians/spreads, compare across hosts/layers/configs, surface tail-latency outliers.
  • Per-layer tuning checklist.
