bc-duplicate: three passes to find duplicates without reading 99 % of the files

1,069,096 files. 19 GB of data. 5 runs on a Ryzen 7 5700G, warm cache. jdupes takes 41 seconds. fclones takes 4 seconds. bc-duplicate finishes in 3.7 seconds.

The point isn’t the absolute number — it’s what we avoid reading. On a corpus this size, reading every file means 19 GB of I/O. A modern NVMe SSD peaks at ~3–5 GB/s on sequential reads. That’s a hard ceiling. The only way to go faster is to read less.

That’s what bc-duplicate does, via a three-pass pipeline that eliminates candidates at every step before reaching the most expensive one.

Pass 1: group by size

Two files of different sizes cannot be identical. That’s obvious. But it’s the most profitable pass in the pipeline: it reads no content at all — only the metadata returned by statx.

bc-duplicate walks the tree in parallel via a bounded, lock-free MPMC (multi-producer, multi-consumer) queue: each worker pops a directory from the queue, reads its content with getdents64, pushes the subdirectories back onto the queue, and accumulates file entries in a local vector. No shared allocation between threads during the walk — the shortest path to core saturation.

At the end of the walk, the local vectors are merged and sorted by (size desc, dev, ino). Hardlinks (files sharing a (dev, ino) pair) are collapsed at this stage — a single physical file, no need to re-hash it just to convince yourself it equals itself.

Result: every file with a unique size is eliminated in a single pass, without reading any content.

Pass 2: fast-hash on prefix XOR suffix

Among files of the same size, most differ in the first few bytes — EXIF headers for images, PDF signatures, timestamps, etc. Reading 4 KB is almost always enough to tell them apart. But not in every case: some formats have identical prefixes and diverge near the end (truncated tar archives, rotated logs, sequential dumps). To cover both, bc-duplicate combines a 4 KB pread at the start and, if the file is at least 8 KB, a 4 KB pread at the end. Both blocks are hashed with xxh3, then XORed into a 64-bit integer.

Why xxh3? Because it does roughly 18 GB/s on Zen 3 — an order of magnitude faster than SHA-256, for what is a non-critical filter here (no cryptographic guarantee is needed at this step; collisions will be resolved in the next pass).

This pass is parallelized via the bc_concurrency_for primitive on the worker pool. Work is distributed per pread, each worker handling a batch of files independently.

Pass 3: full-hash on the survivors

Files that passed the first two passes — same size, same fast-hash — are very likely real duplicates. But “very likely” isn’t enough: we then compute a full hash over the entire content.

The default hash is xxh3 (or xxh128 with --algorithm=xxh128). For cases that need a cryptographic guarantee — ruling out any theoretical collision — --algorithm=sha256 is opt-in.

This is where io_uring comes in. Each worker has its own io_uring queue with 32 direct file-descriptor slots and 256 KB buffers. The flow is chained: openat_direct → read → close_direct, without a user/kernel transition between steps. For files larger than one slot, a streaming synchronous-read fallback takes over without breaking the pipeline.

Why io_uring over mmap? Because mmap shines when you access pages multiple times, not for one-shot streaming. On files read only once to compute a hash, batched io_uring beats mmap + classic read. Measurements at 1M files back this up.

Adaptive dispatch: knowing when not to parallelize

Parallelization has a startup cost — on Zen 3, about 38 µs to wake up a worker pool. On a small corpus, that cost dwarfs the gain. bc-duplicate solves this with an adaptive dispatch: on the first run, the tool measures the machine’s hardware constants (xxh3 throughput, memory throughput, parallel startup cost, warm per-file cost) and writes them to $XDG_CACHE_HOME/bc-duplicate/throughput.txt. On subsequent runs, the cache is reused without remeasuring.

On every run, the tool computes the predicted mono-thread time and compares it to the predicted multi-thread time (including startup cost). If multi isn’t a net gain, the tool stays mono. It’s invisible to the user, and it prevents small directories from paying the parallelism tax.

The hash choice matters less than you’d think

The pipeline above is what bc-duplicate does. fclones does something similar in spirit, but with a different default hash (metrohash128) and a completely independent implementation (Rust + Rayon).

To isolate the role of the hash, I measured fclones on the same corpus (/var/benchmarks, 19 GB / 1.07M files) with each of its hash options:

Tool / hash	Wall time	CPU user	Ratio vs bc-duplicate
bc-duplicate xxh3	3.779 s	2.01 s	—
fclones metro	3.898 s	18.28 s	1.03×
fclones xxhash	3.981 s	18.54 s	1.05×
fclones blake3	4.109 s	23.53 s	1.09×
fclones default	4.124 s	18.75 s	1.09×

Two observations.

First: changing the hash changes little. Between the fastest (metro) and slowest (blake3), there’s a 5 % gap. The bottleneck isn’t compute, it’s I/O and coordination. BLAKE3 is often pitched as fast thanks to intra-file parallelization (Merkle tree) — which is true on a single large file. On a corpus of a million files already parallelized across workers, that internal parallelism brings nothing and even doubles CPU user time (23.5 s vs 18.5 s) without reducing wall time.

Second: at equal algorithm (xxh3 vs xxhash), bc-duplicate is 1.05× faster than fclones. It’s not the hash that makes the difference — it’s the architecture: io_uring direct-fd, the three-pass pipeline on size then fast-hash, and the MPMC walk via local vectors.

Correctness verified

Benchmarks are worthless if the results are wrong. Two checks are in place:

SHA-256 spot-check on a sample of groups: for each detected duplicate group, every path is re-hashed with standard sha256sum and must return the same digest. On the five groups sampled in the latest /usr/share run, five OK, zero divergence.

3-way correctness via scripts/correctness-3way.sh: bc-duplicate, fclones, and jdupes run on the same corpus, their outputs are normalized into TSV and compared line by line. To make this work, you must force all three tools to match the same scope — bc-duplicate scan --hidden, fclones group --hidden --no-ignore, jdupes -r — otherwise fclones applies its .gitignore filter and reports a subset of the groups. With the scope aligned, the three tools report exactly the same groups on the three tested corpora (7k / 18k / 724k files): diff 0.

Summary

Corpus	Size	Files	bc-duplicate xxh3	fclones	jdupes
`/usr/share`	3.2 GB	148,373	0.428 s	0.485 s	1.900 s
`/var/benchmarks`	19 GB	1,069,096	3.779 s	4.124 s	41.383 s

On a typical system corpus, the gain vs fclones is marginal (~10 %) and vs jdupes is a factor of 4. On a larger corpus, lock-free coordination and io_uring take over: 11× faster than jdupes, still ~10 % ahead of fclones.

The code is GPLv3-licensed: bc-duplicate. Reproduce with: bc-build bc-duplicate release && scripts/bench.sh <target>.