parallel-walk-plus-process: the pattern behind bc-hash and bc-duplicate

find /data -type f -print0 | xargs -0 -P$(nproc) sha256sum is the reflex command when you need to hash a file tree. It works. For a long time, it works very well. Until the day the corpus grows to several hundred thousand files and the number stops scaling with cores.

On 653,591 files / 19 GB, warm cache, Ryzen 7 5700G, find | xargs -P16 sha256sum finishes in 11.97 s. On the same corpus, bc-hash finishes in 7.31 s — 1.64× faster. Not a magic number: three architectural decisions, measured, and a pattern that reappears identically in bc-duplicate.

This pattern — call it parallel-walk-plus-process — is not an optimization. It’s a recognition: walking and processing are two distinct problems, with different memory-access profiles, different parallelization constraints, and different equilibrium points. Treating them as a single pipelined loop costs more than separating them strictly.

What `find | xargs -P` does, and what breaks at scale

Let me restate the pipeline. find walks the tree depth-first, prints paths to stdout. xargs -P N reads that output, batches paths into chunks, and forks N sha256sum processes in parallel on those chunks. The result is efficient parallelization of the hash — but only the hash.

Three limits become measurable at scale.

1. find is single-threaded. The recursive walk — opendir, readdir, stat, recursion — runs on one core. On 650k files, that walk is no longer negligible: profiling a single-threaded full hash on a corpus that size shows the walk at roughly 63 % of wall time. Parallelizing only the downstream half of a pipeline whose upstream half is two thirds of the cost leaves performance on the table by construction.

2. Each xargs batch is a fork + exec. The new sha256sum process knows nothing: no reusable io_uring ring, no measured-and-cached throughput, no shared VFS cache warmup across batches. On small files it’s acceptable — sha256sum is short and a fork + exec of a small binary costs a few hundred microseconds on Linux. But we still pay several thousand forks for a corpus of several hundred thousand files, and each process relearns what the others already knew.

3. No adaptive dispatch. xargs -P8 on a directory with 3 files will spawn 8 processes for nothing. The fixed cost of fork + exec dominates. Conversely, on a multi-million-file corpus, -P8 can prove insufficient and you need -P16. The right number depends on the workload — no tool in the pipeline computes it.

These three limits have no elegant fix inside the find | xargs pipeline. They require a single tool that does both phases in the same process, but separates them cleanly internally.

Three phases and a strict barrier

The parallel-walk-plus-process architecture boils down to this diagram:

 Phase A (parallel)         Phase B (main)         Phase C (parallel)
 ──────────────────         ──────────────         ────────────────────
 N workers walk the FS      main merges the        N workers process
 via MPMC queue + per-      per-worker vectors     entries via
 worker vectors             → global list          per-worker io_uring
         │                          │                        │
         └── dispatch_and_wait ─────┴── dispatch_and_wait ───┘

The two barriers are deliberate. The natural temptation is to avoid the A↔C barrier by pipelining: as soon as a worker finds a file, it pushes it onto an MPSC queue, a consumer hashes it. Overlap of walk and hash, theoretical gain. Rejected, for three reasons.

First, complexity. A correct lock-free MPSC queue with backpressure, clean termination, and dynamic balancing between producer and consumer workers is a project in itself. The barrier version is written with two primitives: a bounded MPMC queue and an atomic counter. Maintainable by one person.

Second, measurement. On target workloads, T_A_parallel is already smaller than T_C_parallel. Overlapping two phases where one lasts 1.5 s and the other 5 s would gain at most 1.5 s — assuming perfect overlap, which never happens. The measured gain from overlap would be under 10 % in the best case. The barrier costs its real cost: the dispatch_and_wait itself, a few dozen microseconds.

Third, Phase B. The merge between A and C is sequential — a main thread that iterates per-worker slots and copies entries into a global vector. On 650k entries of 48 bytes, that’s a few dozen milliseconds. Less than 1 % of wall. Nothing to gain by parallelizing it.

The rule to keep: the barrier is not a beginner’s compromise, it’s a measured decision. It gets reconsidered only if max(T_A, T_C) dominates the wall and the two phases have comparable durations. Neither holds in the workloads bc-hash and bc-duplicate target.

Phase A: parallel walk with no shared allocation

The most technical phase. Two primitives carry the load, both exposed by bc-concurrency.

A bounded lock-free MPMC queue (Vyukov’s algorithm) shuttles directory paths between workers. Capacity a power of 2, each slot cache-line aligned (64 bytes), because an atomic_uint64_t sequence and the payload not aligned on a cache line pay 3× to 5× on contention via false sharing. Non-blocking push and pop; termination handled by the caller via an atomic counter.

A per-worker slot mechanism — each worker thread gets a private space in which it allocates its accumulators (entry vectors, error collectors) from its own memory context. Never from a shared pool.

That rule is hard. The pool allocators in bc-allocators are not thread-safe. Two workers allocating concurrently from the same pool silently corrupt the structure — no immediate crash, just incoherent pointers discovered three phases later in a sanitizer. Zero-share isn’t a stylistic choice, it’s a correctness invariant.

The corollary is that there’s no shared inode set to prevent cycles either, unlike what a defensive sequential walk would have. On Linux, it’s not necessary: O_NOFOLLOW on every openat prevents following symlinks, and POSIX forbids hardlinks on directories (only . and .. are exceptions, filtered at readdir). A strict O_NOFOLLOW walk on Linux cannot enter a cycle. Adding an inode set would impose a mutex that would violate zero-share, for zero gain.

That leaves termination. The classic trap: the queue can temporarily go empty while a worker is in the middle of adding new subdirectories. A worker that saw queue_pop fail and exited immediately would leave work unprocessed. The correct pattern uses an atomic counter:

atomic int outstanding_directory_count = 0;

// Main, before dispatch:
for (root in roots) {
    atomic_fetch_add(&outstanding, 1);
    queue_push(queue, root);
}

// Worker:
while (true) {
    if (queue_pop(queue, &dir)) {
        process_dir(dir); // during process, for each sub-dir found:
                          //   atomic_fetch_add BEFORE queue_push
        atomic_fetch_sub(&outstanding, 1);
        continue;
    }
    if (atomic_load(&outstanding) == 0) break;
    cpu_relax();
}

The invariant is that outstanding ≥ items_in_queue + items_in_flight. Never below real work. Increment before push, decrement after process: that’s fundamental. Reversing the order opens a window where one worker can observe outstanding == 0 while another has just popped an item and hasn’t yet pushed its subdirectories.

On a reference corpus of 954k files / 16 GB, parallelizing the walk takes full SHA-256 from 4.67 s single-threaded to 2.72 s on 8 cores, warm cache — a 42 % gain — and 20.65 s → 10.33 s cold. The walk itself is no longer a bottleneck.

Phase B: merge without realloc or double-free

Once every worker is done, the main thread iterates slots via bc_concurrency_foreach_slot, reads each worker’s vector (read-only, no new allocation), and copies the entries into a global vector allocated from the main memory context. The original pointers remain valid: worker memory contexts aren’t destroyed until bc_concurrency_destroy() at program end.

One trap to flag. Once entries are copied into the global vector, the temptation is to free the individual worker allocations from main. Don’t. free on a pointer coming from another thread’s pool produces silent corruption. Cleanup happens in bulk at memory-context destruction, not entry by entry.

On the 650k-file corpus, this merge takes under 100 ms. It includes a sort — by descending size in both tools ((size desc, dev, ino) for bc-duplicate, size desc for bc-hash), so that Phase C can round-robin the largest files first and naturally balance workers — and flushing the error collectors to stderr. Phase B is not a bottleneck as long as the global fits comfortably in RAM: beyond a few million entries, a streaming architecture becomes necessary, but that’s a different conversation.

Phase C: processing, per-worker io_uring, adaptive dispatch

Phase C reuses the worker pool and hands it entries to process via bc_concurrency_for — each worker takes an independent batch. For bc-hash, each worker has its own io_uring (32 direct file-descriptor slots × 128 KB buffers; bc-duplicate uses 256 KB per slot for its full-hash pass on larger files). The openat_direct → read → close_direct sequence is chained in the same submit; the CPU only returns to userland for the final digest.

One point to remember if you ever implement this: do not combine IOSQE_CQE_SKIP_SUCCESS on close_direct with reuse of fd slots across batches. The close’s CQE is never awaited in userland, and the next batch can reuse the slot before the previous close has completed kernel-side. Result: sporadic EBADF on the read linked to the new openat, not reproducible on tmpfs but triggered about 5 in 8 runs on a real SSD corpus at scale. Simple fix: drop CQE_SKIP_SUCCESS, await all three CQEs per item. Performance identical if the workload is read-dominated.

At startup, the first run measures the machine’s hardware constants and caches them as text in $XDG_CACHE_HOME/bc-hash/throughput.txt:

xxh3 = 18.5 GB/s   xxh128 = 18.0 GB/s   sha256 = 1.7 GB/s
mem_bw = 34.7 GB/s  parallel_startup = 38 µs  per_file = 3.6 µs

On each invocation, the tool computes two predicted times: t_mono = N × per_file + bytes / throughput, and t_multi = parallel_startup + t_mono / workers. If t_multi ≥ t_mono, the tool stays mono. Typical break-even on Zen 3: around 90 files or 1 MB. Below that, the parallel startup pipeline costs more than it saves. A find | xargs -P dispatch cannot make this decision since it knows neither the machine’s throughput nor the corpus’s total size before having read the entire find output.

The startup cost deserves a nuance. The 38 µs is a warm value — hot, with the thread pool already spawned. Cold, the first dispatch includes spawning 7 pthreads, about 800 µs on x86_64. For a one-shot tool like bc-hash that dispatches once, cold is what counts. Measuring only warm gives a systematically too-aggressive mono/multi decision for small corpora.

The LPT trap: optimizing the wrong criterion

One last lesson, a negative one, and worth mentioning because it’s expensive when learned through measurement rather than theory.

LPT scheduling (Longest Processing Time) is a classic: sort tasks by decreasing duration, assign each task to the least-loaded worker. Graham’s bound (1969): LPT makespan ≤ (4/3 − 1/(3m)) × OPT on m identical machines — in other words, at worst 33 % above optimal, excellent for a polynomial heuristic. Tempting to apply here: sort files by decreasing size, assign each to the bucket with the fewest bytes.

Measured on the same 954k-file / 16 GB corpus: regression of −28 % / −17 % depending on the variant. Not an improvement — a clear regression.

The cause is simple and lies in the cost formulation. LPT balances the criterion you give it: bytes. But on this workload — 67 % of files under 4 KB — the dominant cost is not the byte, it’s the per-file syscall. Measured: about 120 µs of setup openat/read/close vs 5 µs of SHA-NI hash on 100 bytes. Ratio 24:1. LPT-by-bytes optimizes the criterion that represents 5 % of the cost, and leaves unbalanced the criterion that represents 95 %: the number of files per worker.

Round-robin striping (worker i takes files i, i+N, i+2N, ...) gives by construction N/K files per bucket — natural balance of the real bottleneck.

LPT becomes relevant again under a precise condition: a workload with high size variance (a few GB-scale files plus many small ones), and with a weighted cost model cost(f) = per_file + size(f) / throughput instead of cost(f) = size(f). The weighted cost model needs exactly the throughput constants we already measure for the mono/multi decision. It’s coherent. But it must be implemented consciously, not inherited from a paper without measuring the workload first.

What the pattern buys

The numbers that validate the architecture, under reproducible conditions (Ryzen 7 5700G, DDR4-3200, NVMe, boost disabled, performance governor, ASLR disabled, warm cache).

Large corpus — 653,591 files / 19 GB, 10 runs:

Algorithm	bc-hash	`find \| xargs -P16`	Speedup
sha256	7.31 s (σ 0.29)	`sha256sum` 11.97 s (σ 0.02)	1.64×
xxh3	6.77 s (σ 0.29)	`xxhsum -H3` 10.72 s (σ 0.05)	1.58×
xxh128	6.90 s (σ 0.24)	`xxhsum -H128` 10.73 s (σ 0.03)	1.56×

System corpus — 148,370 files / 3.2 GB, 5 runs:

Algorithm	bc-hash	`find \| xargs -P8`	Ratio
sha256	0.82 s	`sha256sum` 0.69 s	0.84×
xxh3	0.49 s	`xxhsum -H3` 0.34 s	0.69×

On the system corpus, xargs wins. That’s not a measurement error — it’s the direct consequence of break-even: at 148k files averaging ~22 KB, io_uring’s fixed cost and the MPMC coordination don’t yet amortize. Mentioning that result is honest — hiding an unfavorable number is the best way to make the other numbers suspect.

The same pattern in bc-duplicate on 767,312 files / 19 GB: 3.70 s, vs 3.93 s for fclones (1.06×) and 41.51 s for jdupes (11×). On a denser corpus — 724,414 files / 13 GB — the gap with fclones grows to 1.41×.

When to use the pattern, and when not

Three signals that justify it:

Non-trivial walk: at least 10k files, or per-file work above 5 µs. Below that, the single-threaded walk doesn’t dominate and the machinery doesn’t amortize.
Per-file parallel-safe processing: hash, parse, bounded I/O, isolated compute. Not for global aggregation that imposes an order.
At least 4 physical cores: below that, the gain-to-complexity ratio collapses.

Counter-indication: a workload where a single file dominates the wall (a 100 GB dump). In that case, the relevant parallelism is intra-file — split into chunks, hash the chunks in parallel, combine. A different pattern, with its own traps, out of scope for bc-hash.

Summary

The pattern holds in three hard rules:

Zero shared allocation between workers. Private memory contexts, post-barrier merge, no cross-thread free at cleanup.
Strict barrier between walk and process. No MPSC overlap as long as the two phases are not measured to be comparable.
Adaptive dispatch based on measured hardware constants. The pipeline is not allowed to presume that parallel is always better.

find | xargs -P implicitly encodes the opposite decisions: downstream-only parallelization, no inter-process memory, fixed dispatch. That’s enough for small to medium workloads, and it’ll always be there as an ad-hoc pipeline tool. But for a tool that needs to hold at scale, the three opposite decisions give a measurable, stable, reproducible gain — and critically, correct behavior on the small corpora where xargs -P oversizes.

The CLI tools are GPLv3: bc-hash, bc-duplicate. Reproduce the numbers: scripts/bench.sh <target> in each repo. The primitives (MPMC queue, per-worker slots, termination counter) live in bc-concurrency under LGPLv3.

Further reading: bc-duplicate: three passes to find duplicates without reading 99 % of the files — the same pattern applied to duplicate detection, with the extra fast-hash prefix-XOR-suffix pass that eliminates almost all candidates before the expensive one.

What find | xargs -P does, and what breaks at scale