bcduplicate on 16 SMT threads: 758k files/s, 12.4 GB/s. rmlint on the same hardware, same thread count: 42k files/s, 0.7 GB/s. Factor of 18×.
That’s the headline, but the more interesting question is what accounts for it — because the answer is architectural, not algorithmic. The same thread count, the same kernel, the same SSD. The gap comes from decisions made before a single file is opened.
What bit-crafts is
bit-crafts is a C11 monorepo containing six libraries and three CLI tools for fast file inspection on Linux. Everything targets x86-64-v3 baseline (AVX2 + BMI2 + FMA + LZCNT + MOVBE) plus SHA-NI for hashing, and builds with Meson against a single dependency tree. No external runtime, no vendored C++ framework.
The three tools:
- bchash: recursive directory hashing, sha256sum-compatible output. Supports SHA-1/256/512, MD5, BLAKE2b, xxh64, xxh3-128. At 16 SMT threads: 6.9 GB/s, 421k files/s on SHA-256.
- bcduplicate: duplicate finder via a size → fast-hash → full-hash funnel. 758k files/s at peak. Outputs jdupes-compatible groups, JSON, or actionable shell scripts.
- bcintegrity: directory tree manifests in JSONL (path, digest, size, mode, ownership, mtime, inode, link count). Verify and diff two manifests for change detection.
The six libraries underneath:
- bc-core: CPU primitives: hardware hashing (SHA-NI / AVX2), SIMD memory ops, checked arithmetic.
- bc-allocators: pool, bump-pointer arena, slab, and context allocators. No malloc in the hot path.
- bc-containers: vector, hashmap, set, ring buffer, bitset.
- bc-concurrency: worker pool, lock-free MPMC queue (Vyukov’s algorithm), per-worker slots, signal-safe shutdown.
- bc-io: filesystem helpers, mmap, buffered streams, io_uring wrappers.
- bc-runtime: application lifecycle, layered config, structured logging, metrics, CLI framework.
Each library ships its own LICENSE file so it can be redistributed independently of the others.
The monorepo structure means the tools and libraries share a single Meson build graph. scripts/bx wraps it into named variants — debug, release, asan, tsan, ubsan, bench, coverage — and a matrix command iterates over a default set. Every commit runs the full test suite (~80 cases via cmocka) plus ASan, TSan, and UBSan in CI.
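For readers who have not met the Vyukov bounded MPMC queue mentioned in the bc-concurrency entry, here is a minimal C11 sketch of the algorithm. It is not the bc-concurrency API: every name below (mpmc_queue, mpmc_push, mpmc_pop, and so on) is hypothetical, and the real library may differ in layout, padding, and memory ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names; a sketch of the algorithm, not the bc-concurrency API. */
typedef struct {
    atomic_size_t seq;  /* sequence ticket: says whose turn this cell is */
    void *item;
} mpmc_cell;

typedef struct {
    mpmc_cell *cells;
    size_t mask;         /* capacity - 1; capacity must be a power of two */
    atomic_size_t head;  /* next enqueue position */
    atomic_size_t tail;  /* next dequeue position */
} mpmc_queue;

static bool mpmc_init(mpmc_queue *q, size_t capacity)
{
    assert(capacity >= 2 && (capacity & (capacity - 1)) == 0);
    q->cells = malloc(capacity * sizeof *q->cells);
    if (!q->cells) return false;
    for (size_t i = 0; i < capacity; i++)
        atomic_init(&q->cells[i].seq, i);
    q->mask = capacity - 1;
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
    return true;
}

static void mpmc_destroy(mpmc_queue *q)
{
    free(q->cells);
    q->cells = NULL;
}

/* Returns false when the queue is full. */
static bool mpmc_push(mpmc_queue *q, void *item)
{
    size_t pos = atomic_load_explicit(&q->head, memory_order_relaxed);
    for (;;) {
        mpmc_cell *c = &q->cells[pos & q->mask];
        size_t seq = atomic_load_explicit(&c->seq, memory_order_acquire);
        ptrdiff_t diff = (ptrdiff_t)seq - (ptrdiff_t)pos;
        if (diff == 0) {
            /* Cell is free for this position: try to claim it. */
            if (atomic_compare_exchange_weak_explicit(&q->head, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                c->item = item;
                atomic_store_explicit(&c->seq, pos + 1, memory_order_release);
                return true;
            }
            /* A failed CAS reloaded pos; loop and retry. */
        } else if (diff < 0) {
            return false;  /* cell still holds an unconsumed item */
        } else {
            pos = atomic_load_explicit(&q->head, memory_order_relaxed);
        }
    }
}

/* Returns false when the queue is empty. */
static bool mpmc_pop(mpmc_queue *q, void **out)
{
    size_t pos = atomic_load_explicit(&q->tail, memory_order_relaxed);
    for (;;) {
        mpmc_cell *c = &q->cells[pos & q->mask];
        size_t seq = atomic_load_explicit(&c->seq, memory_order_acquire);
        ptrdiff_t diff = (ptrdiff_t)seq - (ptrdiff_t)(pos + 1);
        if (diff == 0) {
            if (atomic_compare_exchange_weak_explicit(&q->tail, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                *out = c->item;
                /* Hand the cell back to producers one lap later. */
                atomic_store_explicit(&c->seq, pos + q->mask + 1,
                                      memory_order_release);
                return true;
            }
        } else if (diff < 0) {
            return false;  /* nothing published at this position yet */
        } else {
            pos = atomic_load_explicit(&q->tail, memory_order_relaxed);
        }
    }
}
```

The per-cell sequence number is what lets producers and consumers coordinate without a lock: a cell is writable when its ticket equals the enqueue position and readable when it equals that position plus one. A production version would typically also pad head, tail, and the cells to separate cache lines.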
The numbers
Benchmark machine: AMD Ryzen 7 5700G (8 physical cores / 16 SMT threads), Ubuntu 24.04, performance governor, ASLR disabled, page cache warm. Dataset: Linux 6.12 kernel + Node.js v22 sources (2.2 GB, 113,825 files). Method: hyperfine, 2 warmup + 3 timing runs.
SHA-256, single thread:
| Tool | Wall | MB/s | files/s |
|---|---|---|---|
| bchash --threads=mono | 2.17 s | 860 | 52,453 |
| openssl dgst -sha256 | 2.75 s | 679 | 41,390 |
| sha256sum | 7.01 s | 266 | 16,237 |
| hashdeep -j 1 | 10.08 s | 185 | 11,292 |
SHA-256, 16 SMT threads:
| Tool | Wall | MB/s | files/s |
|---|---|---|---|
| bchash --threads=io | 0.27 s | 6,916 | 421,574 |
| sha256sum (P16) | 1.42 s | 1,315 | 80,158 |
| hashdeep -j 16 | 1.56 s | 1,197 | 72,964 |
Dedup, 16 SMT threads:
| Tool | Wall | MB/s | files/s |
|---|---|---|---|
| bcduplicate (io) | 0.15 s | 12,449 | 758,833 |
| czkawka_cli 10.0.0 (-m 1, T=16) | 0.33 s | — | 345,973 |
| jdupes -r | 1.49 s | 1,253 | 76,392 |
| rmlint (auto MT) | 2.67 s | 699 | 42,631 |
czkawka_cli is the modern Rust competitor in this space and is a stronger baseline than jdupes or rmlint. The -m 1 flag forces it to scan every file (its default is -m 8192, which skips files smaller than 8 KiB and finds roughly 30 % of the duplicates bcduplicate reports). On equal terms — same machine, same thread count, same minimum file size — bcduplicate finishes 2.11× faster and finds more duplicate files (6,169 vs 4,804 on the same corpus). Hyperfine, 2 warmup + 3 timing runs, warm cache.
Integrity manifest, 16 SMT threads:
| Tool | Wall | MB/s | files/s |
|---|---|---|---|
| bcintegrity (io) | 0.45 s | 4,150 | 252,944 |
| hashdeep -j 16 -l | 1.57 s | 1,189 | 72,500 |
| mtree -c -K sha256digest | 7.86 s | 238 | 14,481 |
One note on the IPC numbers: bchash runs at IPC ≈ 0.7 while the slower hashdeep reaches IPC ≈ 3.0. That’s not a pathology — SHA-NI retires 32 bytes/cycle but the SHA256RNDS2 instructions have ~6 cycle latency, so per-instruction throughput appears low. Net wall time is 4-5× faster. Higher IPC does not mean faster wall time.
Why it’s faster
Five decisions compound:
Parallel walk + strict barrier + parallel process. find | xargs -P parallelizes only the downstream half of the pipeline. On a 650k-file corpus, profiling shows the walk alone at ~63% of single-thread wall time. bit-crafts splits the work into three explicit phases: N workers walk via a bounded lock-free MPMC queue (Phase A), the main thread merges and sorts per-worker vectors (Phase B, a few dozen ms), then N workers process via per-worker io_uring (Phase C). The barrier between A and C is deliberate — pipelining walk and hash sounds appealing until you measure that on target workloads, T_A_parallel is already smaller than T_C_parallel, so overlap buys less than 10% in the best case and adds significant complexity. A deeper treatment of this decision is in parallel-walk-plus-process.
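The three-phase shape is easier to see in code than in prose. The sketch below is a toy under stated assumptions, not the bit-crafts implementation: the "walk" is simulated by filtering even numbers out of an array, the "hash" by summing, the barrier is a plain pthread_join, and there is no MPMC queue or io_uring. All names (worker_ctx, run_phase, three_phase) are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { WORKERS = 4 };

typedef struct { uint64_t *data; size_t len, cap; } vec_u64;

typedef struct {
    size_t id;
    const uint64_t *input;  size_t input_len;   /* Phase A input, read-only */
    vec_u64 found;                              /* Phase A output, private */
    const uint64_t *sorted; size_t sorted_len;  /* Phase C input, read-only */
    uint64_t sum;                               /* Phase C output, private */
    int failed;
} worker_ctx;

static int vec_push(vec_u64 *v, uint64_t x)
{
    if (v->len == v->cap) {
        size_t cap = v->cap ? v->cap * 2 : 16;
        uint64_t *p = realloc(v->data, cap * sizeof *p);
        if (!p) return -1;
        v->data = p;
        v->cap = cap;
    }
    v->data[v->len++] = x;
    return 0;
}

/* Phase A stand-in for the walk: keep even values from a strided share. */
static void *phase_a(void *arg)
{
    worker_ctx *w = arg;
    for (size_t i = w->id; i < w->input_len; i += WORKERS)
        if (w->input[i] % 2 == 0 && vec_push(&w->found, w->input[i]) != 0)
            w->failed = 1;
    return NULL;
}

/* Phase C stand-in for hashing: sum one contiguous slice of the merged list. */
static void *phase_c(void *arg)
{
    worker_ctx *w = arg;
    size_t chunk = (w->sorted_len + WORKERS - 1) / WORKERS;
    size_t lo = w->id * chunk;
    size_t hi = lo + chunk < w->sorted_len ? lo + chunk : w->sorted_len;
    for (size_t i = lo; i < hi; i++)
        w->sum += w->sorted[i];
    return NULL;
}

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Start WORKERS threads and join them all: the join is the barrier. */
static int run_phase(worker_ctx *ctx, void *(*fn)(void *))
{
    pthread_t t[WORKERS];
    size_t started = 0;
    for (; started < WORKERS; started++)
        if (pthread_create(&t[started], NULL, fn, &ctx[started]) != 0) break;
    for (size_t i = 0; i < started; i++) pthread_join(t[i], NULL);
    return started == WORKERS ? 0 : -1;
}

static int three_phase(const uint64_t *input, size_t n, uint64_t *sum, size_t *count)
{
    assert(input != NULL || n == 0);
    worker_ctx ctx[WORKERS];
    memset(ctx, 0, sizeof ctx);
    for (size_t i = 0; i < WORKERS; i++) {
        ctx[i].id = i; ctx[i].input = input; ctx[i].input_len = n;
    }
    int rc = run_phase(ctx, phase_a);                    /* Phase A */

    size_t total = 0;                                    /* Phase B: main thread only */
    for (size_t i = 0; i < WORKERS; i++) {
        total += ctx[i].found.len;
        if (ctx[i].failed) rc = -1;
    }
    uint64_t *merged = malloc((total ? total : 1) * sizeof *merged);
    if (!merged) rc = -1;
    if (rc == 0) {
        size_t off = 0;
        for (size_t i = 0; i < WORKERS; i++) {
            if (ctx[i].found.len)
                memcpy(merged + off, ctx[i].found.data, ctx[i].found.len * sizeof *merged);
            off += ctx[i].found.len;
        }
        qsort(merged, total, sizeof *merged, cmp_u64);
        for (size_t i = 0; i < WORKERS; i++) { ctx[i].sorted = merged; ctx[i].sorted_len = total; }
        rc = run_phase(ctx, phase_c);                    /* Phase C */
        *sum = 0;
        for (size_t i = 0; i < WORKERS; i++) *sum += ctx[i].sum;
        *count = total;
    }
    for (size_t i = 0; i < WORKERS; i++) free(ctx[i].found.data);
    free(merged);
    return rc;
}
```

Calling three_phase(values, n, &sum, &count) runs all three phases and returns 0 on success. The point of the structure is that no worker ever touches another worker's vector: sharing only happens through the merged array, which is written by the main thread between the two joins and read-only afterwards.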
io_uring batch reads. Each worker owns its own io_uring queue with 32 direct file-descriptor slots. The openat_direct → read → close_direct sequence is chained in a single submit via IOSQE_IO_LINK + IOSQE_FIXED_FILE — no user/kernel transition between steps, no fd table contention with the rest of the process.
Hardware-accelerated hashing. SHA-NI delivers SHA-256 at 860 MB/s on a single thread without AVX-level software pipelining. xxh3-128 at 16 threads reaches 10.4 GB/s. The dedup pipeline uses xxh3 as a fast-hash filter (18 GB/s on Zen 3) for the prefix+suffix pass — 4 KB pread at the start XORed with 4 KB at the end — then falls back to a full hash only on the small fraction of files that survive. See bc-duplicate: three passes for the full filter analysis.
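To make the prefix+suffix pass concrete, here is a rough sketch of an edge fingerprint: read up to 4 KB from each end of the file with pread, hash both, XOR the results. Two things are assumptions of mine rather than facts about bcduplicate. First, FNV-1a stands in for xxh3 so the example needs no external library. Second, the two hashes use different seeds so that files shorter than 4 KB, where head and tail are the same bytes, do not all collapse to zero; how bcduplicate handles that case is not stated above. The function name edge_fingerprint is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

enum { PROBE_BYTES = 4096 };

/* FNV-1a 64-bit, used here only as a dependency-free stand-in for xxh3. */
static uint64_t fnv1a64(const unsigned char *p, size_t n, uint64_t seed)
{
    uint64_t h = 0xcbf29ce484222325ULL ^ seed;
    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Hypothetical helper: fingerprint the first and last PROBE_BYTES of a
 * regular file. Returns 0 on success, -1 on any error or short read. */
static int edge_fingerprint(const char *path, uint64_t *out)
{
    assert(path != NULL && out != NULL);
    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) {
        close(fd);
        return -1;
    }

    unsigned char head[PROBE_BYTES], tail[PROBE_BYTES];
    size_t want = st.st_size < PROBE_BYTES ? (size_t)st.st_size : PROBE_BYTES;
    ssize_t nh = pread(fd, head, want, 0);
    ssize_t nt = pread(fd, tail, want, st.st_size - (off_t)want);
    close(fd);
    if (nh != (ssize_t)want || nt != (ssize_t)want) return -1;

    /* Different seeds keep head == tail (small files) from cancelling out. */
    *out = fnv1a64(head, want, 0) ^ fnv1a64(tail, want, 1);
    return 0;
}
```

Two files with different fingerprints cannot be duplicates, so they drop out after at most 8 KB of I/O. Equal fingerprints prove nothing about the bytes in between, which is why a full hash still has to run on the survivors.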
Per-thread arenas, zero shared allocation in hot paths. The bc-allocators pool and arena allocators are not thread-safe by design: two workers sharing a pool would silently corrupt it. Instead, each worker has a private memory context allocated from its own per-worker slot. No mutex on the allocation path, no false sharing. The merge step copies entries from per-worker vectors into a global vector post-barrier — read-only from workers’ side, no cross-thread free. This is the correctness invariant that makes lock-free coordination tractable. More on the allocator design in The 4 allocators every C developer should know.
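A bump-pointer arena with one instance per worker is small enough to show in full. This is a sketch of the general technique, not the bc-allocators code: the names (arena, arena_alloc, worker_slot) are hypothetical, and the 64-byte alignment is an assumed cache-line size.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bump-pointer arena. Deliberately not thread-safe:
 * exactly one worker owns each instance. */
typedef struct {
    unsigned char *base;
    size_t cap;
    size_t used;
} arena;

static int arena_init(arena *a, size_t cap)
{
    a->base = malloc(cap);
    a->cap = a->base ? cap : 0;
    a->used = 0;
    return a->base ? 0 : -1;
}

/* Bump the pointer; align must be a power of two. Returns NULL when full. */
static void *arena_alloc(arena *a, size_t size, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t cur = (uintptr_t)a->base + a->used;
    uintptr_t aligned = (cur + (align - 1)) & ~(uintptr_t)(align - 1);
    size_t pad = (size_t)(aligned - cur);
    size_t left = a->cap - a->used;
    if (pad > left || size > left - pad) return NULL;
    a->used += pad + size;
    return (void *)aligned;
}

/* Free everything at once by rewinding; there is no per-object free. */
static void arena_reset(arena *a) { a->used = 0; }

static void arena_destroy(arena *a)
{
    free(a->base);
    a->base = NULL;
    a->cap = a->used = 0;
}

/* One slot per worker, aligned to an assumed 64-byte cache line so that
 * neighbouring workers' bookkeeping never shares a line. */
typedef struct {
    alignas(64) arena mem;
} worker_slot;
```

Each worker indexes an array of worker_slot by its own id and only ever calls arena_alloc on its own slot. Since nothing is shared, the allocation path needs no mutex and no atomics; the cost is that memory can only be released by rewinding or destroying the whole arena.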
Adaptive dispatch based on measured hardware constants. On the first run, each tool measures xxh3 throughput, memory bandwidth, parallel startup cost, and warm per-file cost, then writes them to $XDG_CACHE_HOME/bc-<tool>/throughput.txt. On every subsequent invocation, the tool computes t_mono = N × per_file + bytes / throughput vs t_multi = parallel_startup + t_mono / workers. If multi isn’t a net gain, the tool stays single-threaded. The break-even on Zen 3 is roughly 90 files or 1 MB — below that, the parallel startup overhead dominates. find | xargs -P cannot make this decision because it doesn’t know the total corpus size until it has already read all of find’s output. The cold startup cost (spawning 7 pthreads for the first time, ~800 µs on x86_64) is measured separately from the warm cost (~38 µs) — using only warm as the dispatch criterion would systematically choose multi-thread for small corpora where the first dispatch is the expensive one.
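The dispatch rule is just the two formulas above compared against each other. A minimal sketch, with a hypothetical struct and function names, and with the caller responsible for passing the cold or warm startup cost as appropriate:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the measured constants; field names are
 * illustrative, not the layout of throughput.txt. */
typedef struct {
    double per_file_s;          /* per-file cost, seconds */
    double throughput_bps;      /* hashing throughput, bytes per second */
    double parallel_startup_s;  /* cost of starting the workers, seconds */
} hw_constants;

/* t_mono = N * per_file + bytes / throughput */
static double estimate_mono(const hw_constants *hw, size_t files, double bytes)
{
    return (double)files * hw->per_file_s + bytes / hw->throughput_bps;
}

/* t_multi = parallel_startup + t_mono / workers */
static double estimate_multi(const hw_constants *hw, size_t files,
                             double bytes, unsigned workers)
{
    return hw->parallel_startup_s + estimate_mono(hw, files, bytes) / (double)workers;
}

/* Go parallel only when the estimate says it is a net gain. */
static bool should_go_parallel(const hw_constants *hw, size_t files,
                               double bytes, unsigned workers)
{
    assert(hw != NULL && workers >= 1 && hw->throughput_bps > 0.0);
    return estimate_multi(hw, files, bytes, workers)
         < estimate_mono(hw, files, bytes);
}
```

Setting the two estimates equal gives the break-even point: t_mono = parallel_startup × workers / (workers − 1). With one worker the inequality can never hold, so the tool stays single-threaded, as expected.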
What’s in the box
Three tools with a shared interface: --threads=mono|compute|io, --memory-budget=N, --describe JSON for shell completion, standard exit codes.
bchash hash --type=sha256 /path/to/dir > manifest.sha256
bchash check manifest.sha256
bchash diff old.sha256 new.sha256
bcduplicate scan /path/to/dir
bcduplicate prune --action=hardlink /path/to/dir
bcintegrity manifest --output=tree.jsonl /path/to/dir
bcintegrity verify /path/to/dir tree.jsonl
bcintegrity diff old.jsonl new.jsonl
Correctness is verified, not assumed. scripts/bench.sh correctness cross-validates the bc-tools against reference utilities on /usr/include: 22/22 checks pass — every digest, group, and manifest matches bit-for-bit against sha256sum, sha1sum, md5sum, b2sum, b3sum, jdupes, fdupes, rdfind, dupd, and hashdeep. Where divergences exist with hashdeep -l (on symlink-heavy trees like the kernel source), they reflect an intentional design difference: bcintegrity opens with O_NOFOLLOW by default.
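The O_NOFOLLOW difference is easy to reproduce in isolation. The helper below is a hypothetical wrapper, not bcintegrity code; it only demonstrates the documented Linux behaviour of open(2) when the final path component is a symbolic link.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical wrapper: open read-only without following a symlink in the
 * final path component. On Linux, opening a symlink this way fails with
 * errno set to ELOOP. */
static int open_nofollow(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_NOFOLLOW | O_CLOEXEC);
}
```

A tool that opens files this way reports or skips symlinks instead of hashing their targets, which is enough to explain a different entry count on a symlink-heavy tree.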
Six libraries as independent Meson subprojects, each distributable separately. The concurrency primitives — bounded MPMC queue, per-worker slot mechanism, atomic-counter termination — are in bc-concurrency. The io_uring helpers are in bc-io. All of bc-allocators is documented in the allocators post; the underlying syscall mechanics behind why custom allocators beat malloc at scale are in brk, sbrk, mmap: what malloc hides.
What it isn’t
The README says this plainly and the article should too.
- Not audited for production use. Bugs exist.
- Not a project with an SLA, a stability guarantee, or a security response process.
- Not C written by someone with 20 years of production C experience. The line-by-line code is AI-generated (Claude Code). I owned architecture, API shape, and every design decision; I reviewed every commit — but a seasoned C engineer reviewing this codebase would find things to push back on.
- Not optimized for portability. The baseline is x86-64-v3, Linux, kernel 6.x with io_uring support. The SHA-NI path requires Zen 2+ or equivalent.
- Not a replacement for hashdeep or rmlint in environments where stability and a track record matter more than throughput.
This started as personal R&D — a way to revisit modern Linux systems programming with an AI pair-programmer. It runs on my own machine every day. That’s the correct framing.
Reproducing the numbers
# Install dependencies
scripts/install-deps.sh build # compiler, meson, cmocka
scripts/install-deps.sh bench # comparator binaries: fdupes, jdupes, rmlint, ...
scripts/install-deps.sh perf # hyperfine, perf, sysstat
# Build release
scripts/bx build release
# Quiet the kernel
sudo scripts/bench.sh perf-mode apply
# Run comparisons
scripts/bench.sh datasets fetch # pull the kernel + Node.js source dataset
scripts/bench.sh compare all # bc-tools vs sha256sum, fdupes, rmlint, ...
scripts/bench.sh correctness # cross-validate digests and groups
sudo scripts/bench.sh perf-mode restore
Every run is logged under benchmarks/<subcommand>-YYYY-MM-DDTHH-MM-SS.txt. The project keeps a historical performance record so regressions don’t go unnoticed between commits.
The correctness dataset is /usr/include — 117 MB, 6,594 files — stable on any Debian/Ubuntu host without fetching anything. CI runs correctness checks on every push.
Where to find it
Source, issues, and PRs: github.com/Unmanaged-Bytes/bit-crafts. Tools under GPLv3, libraries under LGPLv3. Each library and tool ships its own LICENSE for independent redistribution.