Measuring CI minutes saved, honestly

Neul Labs · June 4, 2026

benchmarks · ci · methodology

Almost every faster-pytest tool advertises a speed multiplier. The multiplier on rpytest’s README is 1.9x wall clock for one specific 500-test synthetic suite on one specific machine. It is intentionally smaller and more boring than what we could have written. This post is about why we picked that framing, what the common mistakes are when measuring CI savings, and how to run the measurement against your own suite without lying to yourself.

The trap with “we are Nx faster”

If you run pytest ten times in a row and average the result, you will get a number. If you publish that number, it is technically accurate. It is also useless to almost everyone reading it, because:

Your suite is different from the one that was benchmarked.
Your hardware is different.
Your Python version, plugin set, and conftest are different.
Your invocation pattern (one big run vs many small re-runs) is different.

So the published multiplier is best read as “this is the order of magnitude” rather than “this is what you’ll see.” rpytest’s README says “1.9x wall clock” for that synthetic suite, and the comparison page on this site repeats the BENCHMARK.md numbers explicitly so you can see the methodology. The honest claim is structural: the daemon model removes a class of per-invocation overhead. The numeric size of that gain depends on you.

The four mistakes I’ve watched teams make

These are the methodology mistakes that turn a real gain into an unreproducible one.

1. Comparing first run to subsequent run

The daemon model has a cold first run (the daemon has to start) and warm subsequent runs. If you compare pytest (which is always “cold”) to rpytest (which is warm after the first invocation), you are comparing a run that pays its startup cost to one that has already paid it. That flatters rpytest by exactly the startup cost the measurement left out.

The fix: compare cold-to-cold (stop the daemon with rpytest --daemon-stop, then time rpytest against a fresh pytest run). A true warm-to-warm comparison doesn’t exist, because pytest has no warm state: every invocation is cold. The defensible move is to publish both the cold and the warm rpytest numbers and let the reader pick the one that matches their workflow. That’s what BENCHMARK.md does: “wall clock with startup” is the cold number; “execution time” is closer to warm.
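A minimal sketch of collecting both numbers, assuming you drive the measurement from Python. The helper names (wall_clock, cold_and_warm) and the warm_runs default are illustrative, not part of rpytest; the only rpytest flag used is the --daemon-stop mentioned above.

```python
import statistics
import subprocess
import time


def wall_clock(cmd):
    """Run cmd (a list of arguments) and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    # Exit status is ignored on purpose: a failing test run still costs time.
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def cold_and_warm(cmd, reset_cmd=None, warm_runs=5):
    """Return (cold_seconds, median_warm_seconds) for cmd.

    reset_cmd, if given, runs untimed before the first measurement so that
    the first timed run is genuinely cold.
    """
    if reset_cmd is not None:
        subprocess.run(reset_cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    cold = wall_clock(cmd)
    warm = [wall_clock(cmd) for _ in range(warm_runs)]
    return cold, statistics.median(warm)


# Example calls (not executed here):
#   cold_and_warm(["rpytest"], reset_cmd=["rpytest", "--daemon-stop"])
#   cold_and_warm(["pytest"])  # no daemon; "warm" only reflects OS-level caches
```

For pytest the two numbers should land close together, which is itself worth showing next to the rpytest pair.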

2. Measuring a single invocation when the use case is many

CI is not one pytest invocation. It is the test step plus retries plus the last-failed loop plus the sharded matrix. If you measure one invocation, you measure the worst case for any daemon-based tool, because you’ve paid the daemon’s startup cost without getting any of the amortization back.

The fix: measure the workflow you actually have. If your CI runs pytest, then on failure pytest --lf, then on failure again pytest tests/specific_module.py, measure the sum of those three. That’s the workflow rpytest is built for. A single-invocation benchmark is the wrong unit.
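One way to time that three-step workflow as a single unit is sketched below. It assumes each later step only runs when the previous one exits non-zero; time_workflow is a hypothetical helper name, and the commands in the trailing comment are the ones from the paragraph above.

```python
import subprocess
import time


def time_workflow(steps):
    """Run steps in order, stopping at the first one that exits 0.

    Returns (total_seconds, per_step_seconds). Each later step models a
    retry that only happens because the previous step failed.
    """
    per_step = []
    for cmd in steps:
        start = time.perf_counter()
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        per_step.append(time.perf_counter() - start)
        if result.returncode == 0:
            break
    return sum(per_step), per_step


# Example workflow (not executed here):
#   time_workflow([
#       ["pytest"],
#       ["pytest", "--lf"],
#       ["pytest", "tests/specific_module.py"],
#   ])
```

Run the same list with rpytest substituted for pytest and compare the totals, not the individual steps.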

3. Forgetting that wall clock includes the runner

CI minutes are billed in wall clock. A heavyweight daemon that takes 30 seconds to cold-start is tolerable on a dev laptop and ruinous on a CI bill, where a fresh runner pays that cost on every job. We watch this carefully: rpytest’s daemon cold-start on the BENCHMARK.md hardware is fast because the daemon doesn’t pre-import the world. It imports your suite once and gets out of the way. But the only way to be confident this is true for your CI is to put the cold-start in the measurement.

Practical version: take the measurement from time bash -c 'rpytest ...', not from rpytest’s internal timing. The internal timing is the right number for “how long did the work take.” The wrapping wall clock is the right number for “how much CI did this consume.”

4. Excluding variance

Three runs is not enough to publish a delta with two significant figures. The BENCHMARK.md table is honest about this: the per-configuration timings vary by 10-30% across three runs. If your reported delta is smaller than your measurement noise, you don’t have a delta.

The fix: report a median over at least five runs, and quote the range. If the ranges overlap, the headline number isn’t real.
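The overlap check is simple enough to write down. In the sketch below the function names are illustrative and the timings are invented for the example; they are not measurements from BENCHMARK.md.

```python
import statistics


def summarize(samples):
    """Return (min, median, max) for a list of timings in seconds."""
    return min(samples), statistics.median(samples), max(samples)


def ranges_overlap(a, b):
    """True if the [min, max] ranges of two sample lists intersect."""
    return min(a) <= max(b) and min(b) <= max(a)


# Hypothetical timings, invented purely for illustration.
baseline = [0.66, 0.61, 0.63, 0.70, 0.62]
candidate = [0.33, 0.31, 0.32, 0.36, 0.30]

print("baseline  (min, median, max):", summarize(baseline))
print("candidate (min, median, max):", summarize(candidate))
print("ranges overlap:", ranges_overlap(baseline, candidate))
```

Comparing min/max ranges is a deliberately blunt test. It is not a significance test, but it is hard to fool yourself with.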

A methodology you can use

Here is the methodology we used for the BENCHMARK.md numbers, lightly generalized so you can apply it to your own suite.

Pick a real suite, not a synthetic one. A synthetic suite where every test is assert True is the best case for any test runner. Pick something representative — your real suite, or a real subset of it.
Pin the environment. Lock the Python version, pytest version, plugin versions, and Python implementation (CPython vs PyPy etc.). Note the hardware. Note whether you are on macOS, Linux, or in Docker.
Measure cold-start separately. Run with the daemon explicitly stopped: rpytest --daemon-stop && time rpytest. This is the worst case for the daemon model and the most defensible comparison to a fresh pytest invocation.
Measure the hot loop. Repeat the same invocation six times back-to-back. Drop the first run (cold) and take the median of the remaining five. This is what TDD or --lf would actually feel like.
Measure the parallel case. Run rpytest -n auto and pytest -n auto (with xdist installed). Note that on small suites, xdist often underperforms sequential pytest because of worker startup; this is the BENCHMARK.md finding too.
Report ranges, not just medians. Five runs gives you min/median/max. Publish all three. If min and max are far apart, your suite is noisy and your delta is suspect.
Convert to CI cost, not just seconds. Most CI providers bill in minute increments. A test step that drops from 0.63 s to 0.32 s on a developer machine saves zero CI minutes if CI is billed per minute. The honest delta is “wall clock saved per build × build count per day × CI rate.” If that number is small, the gain is real but not financial.
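The last step is plain arithmetic, sketched here under a simplifying assumption: the step is billed on its own and rounded up to whole minutes. Real providers usually round per job rather than per step. The build count and per-minute rate are hypothetical placeholders, not real prices.

```python
import math


def billed_minutes(step_seconds):
    """Minutes billed for one step, assuming rounding up to whole minutes."""
    return math.ceil(step_seconds / 60)


def daily_saving(before_s, after_s, builds_per_day, rate_per_minute):
    """Billed cost saved per day under the per-minute rounding assumption."""
    saved_minutes = billed_minutes(before_s) - billed_minutes(after_s)
    return saved_minutes * builds_per_day * rate_per_minute


# Hypothetical inputs: 200 builds per day at 0.008 currency units per minute.
# The 0.63 s -> 0.32 s pair is the one quoted in the step above.
print(daily_saving(0.63, 0.32, builds_per_day=200, rate_per_minute=0.008))
# A larger, invented suite: 130 s -> 55 s crosses two minute boundaries.
print(daily_saving(130, 55, builds_per_day=200, rate_per_minute=0.008))
```

The first call saves nothing, because both timings round up to the same billed minute. Only the second crosses a billing boundary.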

What the BENCHMARK.md numbers actually say

The published rpytest numbers, restated honestly:

On a 500-test synthetic suite, on a Ryzen 7 5700U running Linux with Python 3.12.3 and pytest 9.0.2, wall clock including startup drops from 0.63 s (pytest) to 0.32 s (rpytest with warm daemon).
The execution-only numbers are much closer together: 0.30 s for pytest, 0.25 s for rpytest. Execution itself isn’t where the gain comes from. Interpreter and collection startup is.
The parallel comparison is the largest delta: pytest -n 4 (xdist) takes 0.87 s, rpytest -n 4 takes 0.25 s. This delta is xdist’s per-worker startup. On a much bigger suite, that startup amortizes and the gap closes.
CLI memory: 39.4 MB for pytest, 5.9 MB for rpytest’s Rust CLI. The daemon adds ~80 MB shared. If you care about RAM more than CPU, this is the more interesting line of the table.
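For readers who want the ratios behind those timings, the arithmetic is below. It uses only the figures quoted in the list above; the dictionary labels are ours.

```python
# (pytest seconds, rpytest seconds), as quoted from BENCHMARK.md above.
timings = {
    "wall clock incl. startup": (0.63, 0.32),
    "execution only": (0.30, 0.25),
    "parallel, -n 4": (0.87, 0.25),
}

# Speedup ratio: pytest time divided by rpytest time.
ratios = {label: pytest_s / rpytest_s for label, (pytest_s, rpytest_s) in timings.items()}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}x")
```

The three ratios differ widely for the same suite on the same machine, which is the argument for leading with the structural claim rather than any one multiplier.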

The numbers are real, the methodology is documented, and the structural argument they support is what we’d lead with: the daemon model removes per-invocation overhead. The headline multiplier is incidental.

What to do next

If you want a number for your suite, run the methodology above on your repo. The seven-step list takes about an hour. You’ll come out the other side with a number that’s worth defending to your team, regardless of whether you ship the swap.

We’re not going to publish your number on this site. But we’d rather you measure your own than read ours.