Plugin compatibility: where drop-in claims usually break

Neul Labs
pytest, plugins, compatibility

If you’ve shopped around for a faster pytest, you’ve seen the phrase “drop-in compatible” a lot. It is the single most important claim a faster-pytest tool can make, and it is also the easiest one to gloss past in a marketing page. The interesting question is not whether something says it’s drop-in. It’s whether your plugins, your conftest, and your fixtures still behave when you run it.

This post is about where the actual breakage tends to happen, what design decisions we made in rpytest to keep most of it from happening, and what --verify-dropin actually checks for you when it doesn’t.

What “drop-in” should mean

A defensible definition: you change the command (pytest → rpytest) and the test session produces the same observable result. That breaks down into:

  1. Same collected node IDs. The set of tests is identical.
  2. Same pass/fail/skip outcomes. No test that passed under pytest fails under rpytest, and vice versa.
  3. Same exit code. Tooling that branches on $? keeps working.
  4. Compatible artefacts. JUnit XML, coverage, screenshots, whatever your suite produces — still produced, still parseable by the same downstream.
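
To make the definition concrete, here is a minimal sketch of the four criteria as a single check. It is an illustration only, not how rpytest represents results: SessionResult, its fields, and drop_in_violations are hypothetical names, and the artefact check only tests for presence, not for whether a downstream tool can still parse the file.

```python
from dataclasses import dataclass, field


@dataclass
class SessionResult:
    """Observable result of one test session (hypothetical structure)."""
    outcomes: dict[str, str]          # node ID -> "passed" | "failed" | "skipped"
    exit_code: int
    artefacts: set[str] = field(default_factory=set)  # e.g. {"junit.xml"}


def drop_in_violations(baseline: SessionResult, candidate: SessionResult) -> list[str]:
    """Return a list of reasons the candidate run is not drop-in; empty means it is."""
    problems = []
    base_ids, cand_ids = set(baseline.outcomes), set(candidate.outcomes)

    # 1. Same collected node IDs.
    for node_id in sorted(base_ids - cand_ids):
        problems.append(f"collection: missing {node_id}")
    for node_id in sorted(cand_ids - base_ids):
        problems.append(f"collection: extra {node_id}")

    # 2. Same outcome for every test both runs collected.
    for node_id in sorted(base_ids & cand_ids):
        before, after = baseline.outcomes[node_id], candidate.outcomes[node_id]
        if before != after:
            problems.append(f"outcome: {node_id} {before} -> {after}")

    # 3. Same exit code.
    if baseline.exit_code != candidate.exit_code:
        problems.append(f"exit code: {baseline.exit_code} -> {candidate.exit_code}")

    # 4. Artefacts the baseline produced are still produced (presence only).
    for name in sorted(baseline.artefacts - candidate.artefacts):
        problems.append(f"artefact: missing {name}")

    return problems
```

Note that stdout does not appear anywhere in the structure, which matches the point below: terminal output is deliberately not part of the contract.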

What “drop-in” should not mean is byte-identical stdout. Pytest’s terminal output has years of incidental formatting decisions baked in, and matching all of it byte-for-byte without rewriting the renderer is a fool’s errand. If your tooling parses pytest’s stdout — and there are real teams that do — that’s worth knowing before you swap.

Where drop-in claims usually break

In our experience writing and using pytest plugins, breakage almost always shows up in one of five places.

1. Plugins that assume process-per-invocation

A plugin written for pytest can implicitly assume “this Python process exists for exactly one test session.” Such plugins might:

  • Cache state in module-level globals that they never reset between sessions.
  • Open a network connection or file handle in pytest_configure and never close it, trusting process exit to clean up.
  • Re-seed RNG state with time.time() in module init.

In a tool that hosts pytest across multiple sessions in one process, these blow up. The cache leaks. The connection accumulates. The RNG seed is now stable across sessions, which is great for reproducibility and terrible for whatever bug was hiding behind random ordering.

2. Plugins that monkey-patch import machinery

A small set of plugins reach into sys.meta_path, sys.path, or importlib to do something clever at collection time. Coverage tools sometimes do this. Hot-reload plugins sometimes do this. Anything that wants to instrument imports has to hook in before pytest’s own imports have finished, which means the plugin’s assumptions about when its hooks run are tightly bound to “fresh interpreter, fresh import state.”

When a daemon holds the interpreter across sessions, the plugin’s invariants can no longer hold the second time around.
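
A plugin can at least make its own installation idempotent and reversible. The sketch below assumes a hypothetical RecordingFinder that only records module names; the install and uninstall logic around sys.meta_path is the part that matters.

```python
import sys
from importlib.abc import MetaPathFinder


class RecordingFinder(MetaPathFinder):
    """Hypothetical instrumenting finder: records names, never loads anything."""

    def __init__(self):
        self.seen = []

    def find_spec(self, fullname, path, target=None):
        self.seen.append(fullname)
        return None  # defer to the next finder on sys.meta_path


_finder = None


def pytest_configure(config):
    """Install the finder once, even if configure runs again in the same process."""
    global _finder
    if _finder is None:
        _finder = RecordingFinder()
        sys.meta_path.insert(0, _finder)


def pytest_unconfigure(config):
    """Remove the finder so the next session starts from a clean sys.meta_path."""
    global _finder
    if _finder is not None:
        if _finder in sys.meta_path:
            sys.meta_path.remove(_finder)
        _finder = None
```

This handles the plugin's own footprint, but not the deeper problem: modules already present in sys.modules are not passed through finders again, so in a second session the finder sees far fewer imports than it did in the first.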

3. Plugins with their own concurrency models

pytest-xdist is the obvious one: it spawns worker processes and orchestrates them. Distributing a session under xdist while also running inside a long-lived host requires either rpytest’s own parallel implementation (which is what we did) or careful negotiation about who owns process startup. Anything that does cross-process IPC for its own reasons sits in the same category.

4. Plugins that depend on argv shape

Some plugins parse sys.argv themselves rather than going through pytest’s option registration. When the CLI is Rust and the argv visible inside the daemon is reconstructed, the shape can differ in small ways: leading program name, -- separators, options pytest didn’t have to forward. Plugins that read raw argv hit a different shape.
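
The robust alternative is to register options with pytest and read them back from the config object, which does not depend on what sys.argv looks like. A minimal sketch: the --region option and the selected_region helper are hypothetical, while parser.addoption and config.getoption are the standard pytest calls.

```python
def pytest_addoption(parser):
    """Register the option with pytest instead of scanning sys.argv."""
    parser.addoption(
        "--region",
        action="store",
        default="local",
        help="hypothetical option used only for this example",
    )


def selected_region(config):
    """Read the parsed value from pytest's config object.

    The fragile version would be something like
    sys.argv[sys.argv.index("--region") + 1], which depends on the exact
    argv shape and breaks when the host process rebuilds argv.
    """
    return config.getoption("--region")
```

Any hook or fixture that receives config (or request.config) can call selected_region without caring how the session was launched.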

5. Plugins that fork

A small number of plugins fork the test process themselves. Forking out of a daemon is a bad idea for unrelated reasons (file descriptors, threads), and these plugins are best treated as incompatible until rewritten.

What we did in rpytest

The design choice that prevents most of this: host real pytest, don’t reimplement it.

rpytest is two processes: a Rust CLI and a Python daemon. The daemon is a real Python interpreter that imports pytest the normal way and runs sessions inside the same in-process pytest each time. Plugins load via the standard entry-point mechanism. conftest.py files are discovered by pytest. Fixtures are resolved by pytest. We don’t intercept any of that, because intercepting is exactly how a drop-in claim usually breaks.

What we do intercept is the dispatch boundary. The CLI parses flags, applies selectors, decides what to run, and sends an RPC to the daemon. The daemon translates that RPC into the pytest internal API and runs a session. Results stream back over the same channel.
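
The shape of that boundary can be sketched in a few lines. This is an illustration under stated assumptions, not rpytest's wire format or daemon code: the JSON request fields (paths, keyword, marker) and the function names are hypothetical, and run_session stands in for whatever in-process entry point the daemon calls. pytest.main, which accepts a list of arguments and returns an exit code, would be one candidate.

```python
import json


def request_to_args(request):
    """Translate a (hypothetical) RPC request into a pytest argument list."""
    args = list(request.get("paths", []))
    if request.get("keyword"):
        args += ["-k", request["keyword"]]   # pytest's keyword selector
    if request.get("marker"):
        args += ["-m", request["marker"]]    # pytest's marker selector
    return args


def handle_request(raw, run_session):
    """Decode one request, run a session in-process, and encode the reply.

    run_session is injected so the daemon side stays a thin translation
    layer: it takes a list of arguments and returns an exit code.
    """
    request = json.loads(raw)
    exit_code = run_session(request_to_args(request))
    return json.dumps({"exit_code": int(exit_code)})
```

The point of keeping this layer thin is that everything after run_session is ordinary pytest: collection, plugins, and fixtures are untouched by the translation.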

We’ve also made specific choices about session cleanup. Module-level state in a plugin can be a real problem if the plugin author assumes process exit. So between sessions, the daemon does an explicit teardown of the session-scoped state pytest knows about (sessions, items, fixtures), and it logs a warning if it sees plugin-side global state accumulating. We don’t claim to fix buggy plugins, but we try not to silently let them rot.

The parallel surface (-n N, -n auto) is implemented natively in the rpytest CLI rather than going through xdist. That choice is what lets us beat pytest -n 4 (xdist) on the BENCHMARK.md suite by ~3.5x: we don’t pay xdist’s per-worker startup. If you depend on xdist-specific behavior (--dist loadfile, loadgroup), the comparison page is honest about it: that’s a case where xdist still wins.

What --verify-dropin checks

The verifier exists because we can’t be everywhere your plugins are. Run:

rpytest --verify-dropin

and it does, in this order:

  1. Runs pytest --collect-only -q and captures the set of collected node IDs.
  2. Runs rpytest --collect-only -q and captures the same.
  3. Diffs the two sets. Any missing or extra IDs are flagged with the file they came from.
  4. Runs pytest to completion and captures pass/fail/skip per node ID and the final exit code.
  5. Runs rpytest the same way.
  6. Diffs outcomes per node ID. Reports any mismatches.
  7. Compares exit codes.
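
Steps 1 to 3 are simple enough to approximate by hand if you want to see what the collection diff looks like. The following is a rough sketch, not the verifier's implementation: the function names are hypothetical, and the "::" filter is a heuristic for picking node IDs out of the quiet collection output.

```python
import subprocess
from collections import defaultdict


def parse_node_ids(output):
    """Keep lines that look like node IDs; skip blanks and the summary line."""
    return {line.strip() for line in output.splitlines() if "::" in line}


def collect(command):
    """Run e.g. ["pytest"] or ["rpytest"] in collect-only mode and return node IDs."""
    proc = subprocess.run(
        command + ["--collect-only", "-q"],
        capture_output=True,
        text=True,
    )
    return parse_node_ids(proc.stdout)


def diff_by_file(baseline, candidate):
    """Group missing and extra node IDs by the file they came from."""
    report = defaultdict(lambda: {"missing": [], "extra": []})
    for node_id in sorted(baseline - candidate):
        report[node_id.split("::", 1)[0]]["missing"].append(node_id)
    for node_id in sorted(candidate - baseline):
        report[node_id.split("::", 1)[0]]["extra"].append(node_id)
    return dict(report)
```

Calling diff_by_file(collect(["pytest"]), collect(["rpytest"])) returns an empty dict when both tools collect the same tests.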

The output is a report that says “your suite is verified drop-in” or “here are the N tests that disagree, and here is whether collection or execution caused the disagreement.” That second answer is what tells you whether you have a plugin issue (usually shows up at collection) or a behavioral issue (usually shows up at execution).

We chose to make this part of rpytest itself, not a separate tool, because the question “is this safe to swap” should be one command, not a research project. The verifier is also what we use during development on real-world suites — pull requests that regress drop-in behavior fail this check.

What to do before you switch

The pragmatic path is short:

  1. pip install rpytest.
  2. Run rpytest --verify-dropin against the suite you care about. Don’t start with a CI swap; start with a developer machine.
  3. If the report is clean, swap on a feature branch first. Watch one CI run.
  4. If the report flags anything, look at whether it’s a known-incompatible category (process-per-invocation, monkey-patched imports, xdist-specific). The categories in this post will cover most of what you find.

A drop-in claim should be falsifiable. The verifier is how we make ours.