cheat sheet

Testing Strategies

A practical guide to test design — the pyramid (unit/integration/e2e), fixture and mock patterns, property-based testing, snapshot tests, and CI strategies that scale.

updated 05-25-2026

Testing Strategies — Pyramid, Fixtures, Mocks, Property-Based, and CI Patterns

What it is

A testing strategy is the set of decisions that answers: at what level do we test each thing, how do we set up the world for each test, where do we use real dependencies versus fakes, what counts as "good enough" coverage, and which tests run on every push versus once a night. Good strategies make defects cheap to find (the test fails before the code ships), tests fast enough to run constantly (under 30 seconds for the unit layer), and refactors safe (a green suite means behaviour is intact, not just implementation).

The single most useful mental model is the test pyramid: many fast unit tests at the bottom, fewer integration tests in the middle, a small number of end-to-end (e2e) tests at the top. The inversion of this — a top-heavy pyramid full of slow brittle e2e tests with almost no unit tests — is the "ice cream cone" anti-pattern that haunts most large legacy codebases. The pyramid is not a religious doctrine; the shapes that work in practice vary (integration tests up to ~50% in tightly integrated services). The principle is constant: push tests as low as they can go and still catch the bug.

The pyramid in one picture

text

       /\         e2e (real browser, real DB)   — slow, brittle, few
      /  \        ~5%
     /----\
    /      \      integration (multiple modules, real DB or fakes)
   /        \     ~15-25%
  /----------\
 /            \   unit (one module, no I/O)
/______________\  ~70-80%
                  fast, isolated, many

Three layers, three goals. Unit tests answer "did the function compute the right value?"; integration tests answer "do these pieces talk to each other?"; e2e tests answer "does the whole system, including network and UI, satisfy the user's flow?"

Layer	What it tests	Speed	Use real I/O?
Unit	One function/class in isolation	< 1 ms	No
Integration	Several modules + maybe a real DB	10-100 ms	Sometimes
Contract	One service's API against a spec	10-100 ms	Stub the other side
e2e (UI / API)	Full system, real network	1-10 s	Yes
Smoke	Critical-path verification on prod	1-10 s	Yes (against prod)
Property-based	Many random inputs satisfying invariants	varies	No

Unit tests

A unit test exercises a single function or class with no I/O, no real network, no real database — anything that touches the outside world is replaced with a fake. The goal is to verify behaviour (inputs → outputs, side effects on collaborators) quickly enough that you run the whole suite on every save without breaking flow.

Three rules:

Arrange-Act-Assert. Three blocks, clearly separated. One assertion (or one logical group of assertions) per test.
Test names describe behaviour. test_returns_zero_for_empty_cart, not test_total_1.
Test through the public API. If you find yourself peeking at private state, the design is probably the problem.
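
As an illustration of the three rules, here is a minimal sketch of one test laid out in Arrange-Act-Assert blocks. The `Cart` class and its `total` method are hypothetical names invented for this example; only the layout and the naming style are the point.

```python
from dataclasses import dataclass, field


@dataclass
class Cart:
    """Hypothetical cart used only for this illustration."""
    prices: list[float] = field(default_factory=list)

    def add(self, price: float) -> None:
        self.prices.append(price)

    def total(self, discount_percent: float = 0.0) -> float:
        subtotal = sum(self.prices)
        return subtotal * (1 - discount_percent / 100)


def test_applies_percentage_discount_to_total():
    # Arrange: build the world the test needs
    cart = Cart()
    cart.add(40.0)
    cart.add(60.0)

    # Act: one call through the public API
    total = cart.total(discount_percent=10)

    # Assert: one logical check on the observable result
    assert abs(total - 90.0) < 1e-9
```

The test never reads `cart.prices` directly; it only observes what `total` returns, so the internal representation can change without touching the test.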

python

# src/cart.py
from dataclasses import dataclass

@dataclass
class Item:
    price: float
    qty: int

def cart_total(items: list[Item]) -> float:
    return sum(i.price * i.qty for i in items)

python

# tests/test_cart.py
import pytest
from src.cart import Item, cart_total

def test_returns_zero_for_empty_cart():
    assert cart_total([]) == 0

def test_sums_a_single_line_item():
    assert cart_total([Item(price=10.00, qty=3)]) == 30.00

def test_sums_multiple_line_items():
    items = [Item(price=10, qty=2), Item(price=5, qty=4)]
    assert cart_total(items) == 40

@pytest.mark.parametrize("price,qty,expected", [
    (10.00, 1, 10.00),
    (10.00, 0, 0.00),
    (0.99,  3, 2.97),
])
def test_line_item_pricing(price, qty, expected):
    assert cart_total([Item(price=price, qty=qty)]) == pytest.approx(expected)

bash

pytest tests/test_cart.py -v

Output:

text

tests/test_cart.py::test_returns_zero_for_empty_cart PASSED
tests/test_cart.py::test_sums_a_single_line_item PASSED
tests/test_cart.py::test_sums_multiple_line_items PASSED
tests/test_cart.py::test_line_item_pricing[10.0-1-10.0] PASSED
tests/test_cart.py::test_line_item_pricing[10.0-0-0.0] PASSED
tests/test_cart.py::test_line_item_pricing[0.99-3-2.97] PASSED
6 passed in 0.03s

One-shot tests for one-shot bugs are fine. But when you've written the third test that varies only the input, reach for parametrize (Python) / test.each (Vitest) / table-driven tests (Go). One test reading three lines of data beats three near-identical tests.

Integration tests

An integration test exercises more than one unit, plus the wiring between them. The typical shape is "a real database, a real cache, real HTTP between two services, fakes for anything else". Use them where wiring is non-trivial (ORMs, message queues, auth middleware) — places where unit tests can pass while the system is broken.

python

# tests/test_user_repository.py — uses a real Postgres via testcontainers
import pytest
from testcontainers.postgres import PostgresContainer
from src.repo import UserRepository, User

@pytest.fixture(scope="session")
def db_url():
    with PostgresContainer("postgres:16-alpine") as pg:
        yield pg.get_connection_url()

@pytest.fixture
def repo(db_url):
    repo = UserRepository(db_url)
    repo.create_schema()
    yield repo
    repo.truncate_all()

def test_save_and_load_roundtrip(repo):
    user = User(id=1, email="alice@example.com", name="Alice Dev")
    repo.save(user)
    loaded = repo.find_by_id(1)
    assert loaded == user

def test_returns_none_for_missing_user(repo):
    assert repo.find_by_id(404) is None

bash

pytest tests/test_user_repository.py -v

Output:

text

tests/test_user_repository.py::test_save_and_load_roundtrip PASSED
tests/test_user_repository.py::test_returns_none_for_missing_user PASSED
2 passed in 1.42s

The fixture spins up a real Postgres container once per test session and truncates between tests. This is integration-grade isolation: the SQL, schema, and driver are all exercised, but the test still finishes in seconds.

The integration layer is where dialect-specific bugs hide. Unit tests against a stubbed repository pass with any query; integration tests against the real database catch the join you forgot.
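
A minimal sketch of that failure mode, using the standard library's `sqlite3` with an in-memory database as a stand-in for the real Postgres. The schema and the `orders_with_email` function are hypothetical. A stubbed repository would return whatever the test told it to; running the query against a real engine is what shows whether the join is right.

```python
import sqlite3


def orders_with_email(conn: sqlite3.Connection) -> list[tuple[int, str]]:
    """Hypothetical query under test: each order id with its owner's email."""
    rows = conn.execute(
        """
        SELECT o.id, u.email
        FROM orders AS o
        JOIN users AS u ON u.id = o.user_id
        ORDER BY o.id
        """
    )
    return rows.fetchall()


def make_db() -> sqlite3.Connection:
    # A real SQL engine, in memory: schema, constraints and joins are all exercised.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
        INSERT INTO users VALUES (1, 'alice@example.com'), (2, 'bob@example.com');
        INSERT INTO orders VALUES (10, 1), (11, 2), (12, 1);
        """
    )
    return conn


def test_each_order_is_paired_with_its_owner():
    conn = make_db()
    try:
        assert orders_with_email(conn) == [
            (10, "alice@example.com"),
            (11, "bob@example.com"),
            (12, "alice@example.com"),
        ]
    finally:
        conn.close()
```

If the `ON` clause were dropped, the query would return six rows (every order paired with every user) and this test would fail, while a test against a stub would keep passing.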

End-to-end (e2e) tests

An e2e test exercises the whole system, from the user's edge (browser, CLI, API client) through every layer down to the persistence and back. They are the slowest and most brittle — a network hiccup or a slow render can fail a test that finds no bug — so the rule is few, focused, and on the critical paths.

typescript

// tests/e2e/checkout.spec.ts — Playwright
import { test, expect } from "@playwright/test";

test("user can checkout a single item", async ({ page }) => {
  await page.goto("/products/widget-1");
  await page.getByRole("button", { name: "Add to cart" }).click();
  await page.getByRole("link", { name: /cart/i }).click();

  await expect(page.getByText("Widget 1")).toBeVisible();

  await page.getByRole("button", { name: "Checkout" }).click();
  await page.getByLabel("Email").fill("alice@example.com");
  await page.getByLabel("Card").fill("4242 4242 4242 4242");
  await page.getByRole("button", { name: "Pay" }).click();

  await expect(page).toHaveURL(/\/order\/[0-9]+$/);
  await expect(page.getByRole("heading", { name: /thank you/i })).toBeVisible();
});

bash

npx playwright test tests/e2e/checkout.spec.ts

Output:

text

Running 1 test using 1 worker
  ✓ tests/e2e/checkout.spec.ts:4:1 › user can checkout a single item (4.5s)

  1 passed (5s)

A good e2e set covers ~5-10 critical user journeys (sign up, log in, checkout, password reset) and runs on every deploy — not on every push. The full suite stays in pre-deploy CI or nightly.

The most common e2e failure mode is flakiness. Network jitter, async UI updates, and CI noise produce false positives that erode trust in the suite. Quarantine flaky tests aggressively — disable them, file a bug, fix root cause. A flaky test that "usually passes" is worse than no test at all.

Test doubles — mocks, stubs, fakes, spies

A test double is any object that stands in for a real collaborator. The taxonomy (Gerard Meszaros, xUnit Test Patterns) is precise — and most code confuses them, with concrete consequences for fragility.

Double	What it does	Use when
Dummy	Object that is passed but never used	Filling out parameter lists
Stub	Returns canned answers	You need a specific response from a collaborator
Fake	A working but simplified implementation (in-memory DB)	You need the behaviour without the cost
Spy	A stub that records how it was called	You want to assert on the interaction
Mock	A spy that fails the test if it isn't called as expected	You're verifying a protocol — the order and arguments of calls matter
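
The five doubles in the table can be sketched side by side with nothing but the standard library. Everything here (`StubRates`, `FakeLedger`, `SpyNotifier`, `convert_and_record`) is a hypothetical name made up for the illustration; `unittest.mock.Mock` plays the role of the mock.

```python
from unittest.mock import Mock


class StubRates:
    """Stub: returns a canned answer, records nothing."""
    def rate(self, currency: str) -> float:
        return 2.0


class FakeLedger:
    """Fake: a working in-memory implementation."""
    def __init__(self) -> None:
        self.entries: list[float] = []

    def record(self, amount: float) -> None:
        self.entries.append(amount)

    def balance(self) -> float:
        return sum(self.entries)


class SpyNotifier:
    """Spy: does nothing useful, but remembers how it was called."""
    def __init__(self) -> None:
        self.calls: list[str] = []

    def notify(self, message: str) -> None:
        self.calls.append(message)


def convert_and_record(amount, currency, rates, ledger, notifier, audit=None):
    """Hypothetical function under test; `audit` is never touched here."""
    converted = amount * rates.rate(currency)
    ledger.record(converted)
    notifier.notify(f"recorded {converted:.2f}")
    return converted


def test_doubles_side_by_side():
    dummy_audit = object()            # Dummy: fills a parameter, never used
    ledger, spy = FakeLedger(), SpyNotifier()

    convert_and_record(5.0, "EUR", StubRates(), ledger, spy, audit=dummy_audit)

    assert ledger.balance() == 10.0           # state check via the fake
    assert spy.calls == ["recorded 10.00"]    # interaction check via the spy


def test_mock_verifies_the_call_itself():
    mock_notifier = Mock()            # Mock: the expected call is part of the test
    convert_and_record(5.0, "EUR", StubRates(), FakeLedger(), mock_notifier)
    mock_notifier.notify.assert_called_once_with("recorded 10.00")
```

The first test would survive a change in how the ledger is called as long as the balance ends up right; the second pins the exact call, which is the trade-off the table describes.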

Stub vs mock — the practical difference

python

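# tests/test_password_reset.py (Python)
from unittest.mock import Mock

# PasswordResetService, StubUserRepo and User are assumed to be imported
# from the project under test; they are not defined in this snippet.

# STUB: canned behaviour; the test does not care whether it is called
class StubMailer:
    def send(self, to, subject, body): pass

# MOCK: assertion-grade; the test cares HOW it is called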

def test_password_reset_sends_email_with_token():
    mailer = Mock()
    repo = StubUserRepo({"alice@example.com": User(id=1, email="alice@example.com")})
    svc = PasswordResetService(repo, mailer, token_generator=lambda: "TOKEN123")

    svc.request_reset("alice@example.com")

    mailer.send.assert_called_once_with(
        to="alice@example.com",
        subject="Reset your password",
        body="Use code TOKEN123 to reset your password.",
    )

bash

pytest tests/test_password_reset.py

Output:

text

tests/test_password_reset.py::test_password_reset_sends_email_with_token PASSED
1 passed in 0.04s

The mock is what makes this test express "the service must call the mailer with the right token". A stub would not enforce it.

When to prefer fakes over mocks

A Mock couples the test to the implementation: exactly which methods the service calls on its collaborator, and with which arguments. Change how the service talks to the collaborator and many tests break, even when behaviour is unchanged. A Fake couples the test to the behaviour: it does the same thing as the real implementation, in memory, so refactoring the service does not break the test.

python

# Prefer this: in-memory fake
class InMemoryUserRepo:
    def __init__(self): self.users: dict[int, User] = {}
    def save(self, u: User): self.users[u.id] = u
    def find_by_id(self, id: int) -> User | None: return self.users.get(id)

# Over this: mock that asserts on save() calls in detail
mock_repo = Mock()
mock_repo.find_by_id.return_value = User(id=1, email="alice@example.com")
# ...
mock_repo.save.assert_called_once_with(User(id=1, email="alice@example.com", name="Alice Dev"))

Use mocks sparingly and at boundaries — third-party APIs, mailers, external services. For your own classes, fakes scale better.

The mock-vs-fake rule: if it has a single side effect to the outside world (send email, post webhook), mock it. If it's storage with semantics you care about (a repository, a cache), fake it.

Fixture strategies

A fixture is the test's Arrange — the world the test exercises. Three patterns dominate:

1. Factories (preferred)

A factory builds a valid domain object with sensible defaults. Tests override only the fields that matter to the case.

python

# tests/conftest.py
import pytest
from factory import Factory, Faker
from src.user import User

class UserFactory(Factory):
    class Meta: model = User
    id = Faker("random_int", min=1, max=10000)
    email = Faker("email")
    name = Faker("name")
    is_active = True

@pytest.fixture
def user_factory():
    return UserFactory

python

def test_inactive_user_cannot_login(user_factory):
    u = user_factory(is_active=False)
    assert not can_login(u, password="anything")

The test reads as a sentence and depends on only what it cares about (is_active=False). Adding a new required field to User does not break it — the factory fills in a default.

2. Fixtures (pytest)

python

@pytest.fixture
def empty_cart(): return Cart(items=[])

@pytest.fixture
def cart_with_one_item(empty_cart):
    empty_cart.add(Item(price=10, qty=1))
    return empty_cart

Fixtures compose — cart_with_one_item takes empty_cart as input and extends it. This expresses test setup as a dependency graph.

3. Test data builders (TypeScript)

typescript

// tests/builders.ts
class UserBuilder {
  private user: User = { id: 1, email: "alice@example.com", name: "Alice Dev", isActive: true };
  withId(id: number)   { this.user.id = id;   return this; }
  withEmail(e: string) { this.user.email = e; return this; }
  inactive()           { this.user.isActive = false; return this; }
  build(): User        { return { ...this.user }; }
}

export const aUser = () => new UserBuilder();

typescript

test("inactive user cannot log in", () => {
  const u = aUser().inactive().build();
  expect(canLogin(u, "anything")).toBe(false);
});

The fluent API reads aloud — a user, inactive. The default user is a complete valid one.

Fixture scopes

Scope	When the fixture is built	Use for
function (default)	Once per test	Cheap, isolated data
class	Once per test class	Class-shared setup
module	Once per test file	Expensive shared resources
session	Once per `pytest` run	Test database, Docker container

Shared-state fixtures (scope="session") are dangerous. A test mutating the shared DB affects every later test. Use them only for resources that are immutable or that the fixture cleans up after each test.

Property-based testing

Example-based tests assert behaviour on one input. Property-based tests assert invariants — properties that should hold for any input — and the framework generates hundreds of random examples to try to break them. Tools: Hypothesis (Python), fast-check (JS/TS), QuickCheck (Haskell, the original), PropEr (Erlang).

python

# pip install hypothesis
from hypothesis import given, strategies as st

def reverse(xs: list[int]) -> list[int]:
    return xs[::-1]

@given(st.lists(st.integers()))
def test_reverse_twice_is_identity(xs):
    assert reverse(reverse(xs)) == xs

@given(st.lists(st.integers(), min_size=1))
def test_reverse_first_equals_original_last(xs):
    assert reverse(xs)[0] == xs[-1]

@given(st.lists(st.integers()))
def test_sorted_is_idempotent(xs):
    assert sorted(sorted(xs)) == sorted(xs)

bash

pytest tests/test_properties.py -v

Output:

text

tests/test_properties.py::test_reverse_twice_is_identity PASSED
tests/test_properties.py::test_reverse_first_equals_original_last PASSED
tests/test_properties.py::test_sorted_is_idempotent PASSED
3 passed in 0.21s

Hypothesis ran 100 random inputs per property. When a property fails, it shrinks the input — finds the smallest example that still fails, so you debug a 3-element list instead of a 47-element one.

Useful properties:

Property	Pattern
Round-trip	`decode(encode(x)) == x`
Idempotence	`f(f(x)) == f(x)`
Commutativity	`f(a, b) == f(b, a)`
Associativity	`f(f(a, b), c) == f(a, f(b, c))`
Inverse	`decompress(compress(x)) == x`
Oracle	`mine(x) == reference_implementation(x)`
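
To make the patterns concrete without any dependency, here is a hand-rolled sketch that checks four of them with the standard library's `random` module. It is not Hypothesis: there is no shrinking and the generator is deliberately naive. The helper names (`random_int_list`, `check_property`) are made up for this example.

```python
import json
import random


def random_int_list(rng: random.Random, max_len: int = 20) -> list[int]:
    """Tiny hand-rolled generator; Hypothesis strategies do this far better."""
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(0, max_len))]


def check_property(prop, runs: int = 200, seed: int = 0) -> None:
    """Run `prop` on `runs` random lists; a fixed seed keeps failures reproducible."""
    rng = random.Random(seed)
    for _ in range(runs):
        xs = random_int_list(rng)
        assert prop(xs), f"property failed for input {xs!r}"


def round_trip(xs: list[int]) -> bool:
    return json.loads(json.dumps(xs)) == xs          # decode(encode(x)) == x


def idempotent_sort(xs: list[int]) -> bool:
    return sorted(sorted(xs)) == sorted(xs)          # f(f(x)) == f(x)


def addition_commutes(xs: list[int]) -> bool:
    a, b = sum(xs), len(xs)
    return a + b == b + a                            # f(a, b) == f(b, a)


def matches_oracle(xs: list[int]) -> bool:
    mine = max(xs) if xs else None                   # implementation under test
    reference = sorted(xs)[-1] if xs else None       # slow but obviously right
    return mine == reference                         # mine(x) == reference(x)


def test_properties_hold():
    for prop in (round_trip, idempotent_sort, addition_commutes, matches_oracle):
        check_property(prop)
```

In a real suite the generator and the loop would be replaced by a property-testing library, which also shrinks any failing input to a minimal counterexample.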

Property-based tests are particularly potent at finding edge cases your imagination would never list: empty input, single-element input, the largest representable integer, NaN, Unicode surrogate pairs, zero-width characters, very long lists of identical values.

Add property-based tests at boundaries — parsers, serializers, encoders/decoders, math kernels. The combination "round-trip property" + "random inputs" routinely uncovers bugs in code that has 100% line coverage.

Snapshot tests

A snapshot test captures the output of code (a string, JSON, HTML, image) the first time it runs, then asserts every subsequent run produces the same output. It is fast to write — no manual assertions — but easy to abuse.

typescript

// tests/render.test.tsx (Vitest; the .tsx extension is needed because the file contains JSX)
import { expect, test } from "vitest";
import { renderToString } from "react-dom/server";
import { ProductCard } from "../src/ProductCard";

test("ProductCard renders consistently", () => {
  const html = renderToString(
    <ProductCard name="Widget" price={9.99} inStock={true} />
  );
  expect(html).toMatchSnapshot();
});

bash

npx vitest run tests/render.test.tsx

Output:

text

 ✓ tests/render.test.tsx (1)
   ✓ ProductCard renders consistently

  Snapshots  1 written
  Test Files  1 passed (1)
       Tests  1 passed (1)

Snapshot tests are good for:

HTML/component output that changes rarely.
Generated configs (Terraform plans, codegen output).
Schema migrations.
API response shapes.

Snapshot tests are bad for:

Anything that changes per test run (timestamps, UUIDs).
Highly volatile output (UI under active redesign).
Logic verification — the snapshot says what came out, not why.
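
When output contains run-specific values but is otherwise worth snapshotting, one common workaround is to scrub those values before comparing. The sketch below assumes UUIDs and ISO 8601 timestamps are the only volatile parts; the `scrub` helper and both regular expressions are illustrative, not taken from any snapshot library.

```python
import re

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.IGNORECASE
)
ISO_TS_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)


def scrub(text: str) -> str:
    """Replace run-specific values with stable placeholders before snapshotting."""
    text = UUID_RE.sub("<uuid>", text)
    return ISO_TS_RE.sub("<timestamp>", text)


def test_scrubbed_output_is_stable_across_runs():
    run_1 = '{"id": "3f2b8c1e-9d4a-4e6f-8a1b-0c2d3e4f5a6b", "at": "2026-05-25T10:00:00Z"}'
    run_2 = '{"id": "a1b2c3d4-e5f6-4a7b-8c9d-0e1f2a3b4c5d", "at": "2026-05-26T11:30:45.123+02:00"}'

    # Two different runs collapse to the same snapshot-friendly text.
    assert scrub(run_1) == scrub(run_2) == '{"id": "<uuid>", "at": "<timestamp>"}'
```

Scrubbing hides the volatile values entirely, so anything that matters about them (for example, that the timestamp is in the future) still needs an explicit assertion.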

The snapshot anti-pattern: a PR fails its snapshot tests, the developer reruns with the update flag (`-u` in Jest and Vitest, `--update-snapshots` in Playwright), and the diff goes unread. Always read a snapshot diff; it is the whole assertion. If you cannot summarise why the snapshot changed, do not update it.

Test-driven development (TDD)

TDD is a discipline, not a tool: write a failing test first, write the minimum code that makes it pass, then refactor with the test as your safety net. The rhythm is red → green → refactor.

text

1. RED:    write a failing test (it fails because the code doesn't exist yet)
2. GREEN:  write the smallest amount of code to make it pass
3. REFACTOR: clean up — duplication, naming, structure — with all tests green

TDD is most useful when the design is uncertain — writing the test first forces you to decide what the API is before the implementation locks it in. It is less useful (and arguably wasteful) for code with no behavioural questions, like simple data classes or trivial getters.

Variants:

Variant	Description
Classic / inside-out	Start with the innermost unit and grow outward
London-school / outside-in	Start at the boundary, mock collaborators, work inward
BDD (Cucumber/Gherkin)	Same loop, with `Given/When/Then` natural-language tests

Coverage — what it tells you and what it doesn't

Coverage measures which lines of code were executed during the test run. It does not measure whether they were tested correctly.

Metric	What it counts
Line coverage	Did this line run?
Branch coverage	Did both sides of this `if` run?
Statement coverage	Same as line in most languages
Path coverage	Did every possible execution path run? Rarely used — exponential.
Mutation coverage	If we mutate the code, do tests fail? The best signal.

bash

pytest --cov=src --cov-report=term-missing

Output:

text

---------- coverage: platform linux, python 3.12.3 -----------
Name              Stmts   Miss  Cover   Missing
-----------------------------------------------------
src/cart.py          12      0   100%
src/checkout.py      45      8    82%   23-25, 47, 52-55
src/payments.py      30      3    90%   18-20
-----------------------------------------------------
TOTAL                87     11    87%

Targets:

80% line coverage is a reasonable floor for most codebases.
100% coverage rarely makes the suite better; the marginal tests are usually tautological.
Holding critical paths (payments, auth) at 100% is much more valuable than 100% everywhere.
Mutation coverage (mutmut, Stryker) catches "tests that pass without asserting" — the deepest coverage gap.

Coverage is a necessary but not sufficient signal. A test that calls a function and asserts nothing pushes coverage up without testing anything. Use mutation testing periodically to find these.
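
A small sketch of the gap between coverage and mutation testing. The mutant is written by hand here to show the idea; tools such as mutmut or Stryker generate mutants automatically. All function names are hypothetical.

```python
def is_adult(age: int) -> bool:
    """Code under test."""
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    """Hand-made mutant: `>=` flipped to `>`, as a mutation tool might do."""
    return age > 18


def weak_test(fn) -> bool:
    """Executes every line (100% coverage) but asserts nothing."""
    fn(18)
    return True


def strong_test(fn) -> bool:
    """Pins the boundary, so a changed comparison is detected."""
    return fn(18) is True and fn(17) is False


def mutant_survives(test) -> bool:
    """A mutant survives when the test passes on both the original and the mutant."""
    return test(is_adult) and test(is_adult_mutant)
```

Here `mutant_survives(weak_test)` is true and `mutant_survives(strong_test)` is false: both tests give the same line coverage, but only the second one notices when the code changes.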

Flakiness — root causes and fixes

A flaky test passes sometimes and fails sometimes for the same code. Each cause has a fix:

Cause	Fix
Time / dates / sleeps	Freeze time (`freezegun`, `vi.useFakeTimers`)
Random IDs / UUIDs	Inject a seedable RNG
Async race conditions	`await` the right promise; use proper test waits
Shared global state	Reset between tests; prefer DI
Network / external APIs	Mock at the HTTP layer (`responses`, `nock`, `msw`)
Test order dependence	Randomise test order to detect (`pytest-randomly`, `vitest --sequence.shuffle`); fix by isolating
File system / temp dirs	`tmp_path` fixture (pytest) / `os.tmpdir()` (Node)
Database leakage	Truncate or transaction-wrap each test

python

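# Freezing time (issue_token and is_valid stand for the code under test)
from datetime import timedelta
from freezegun import freeze_time

def test_token_expiry():
    with freeze_time("2026-05-25 10:00:00") as clock:
        token = issue_token(ttl_seconds=60)
        assert is_valid(token)
        clock.tick(timedelta(seconds=61))  # jump past the TTL without sleeping
        assert not is_valid(token)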
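
The "inject a seedable RNG" and "prefer DI" rows of the table can be sketched without any library: pass the clock and the random source in, and the test controls both. `TokenService` and `FakeClock` are hypothetical names for this illustration.

```python
import random
from datetime import datetime, timedelta, timezone
from typing import Callable


class TokenService:
    """Hypothetical service: time and randomness are injected, not global."""

    def __init__(self, now: Callable[[], datetime], rng: random.Random) -> None:
        self._now = now
        self._rng = rng

    def issue(self, ttl_seconds: int) -> dict:
        return {
            "id": f"{self._rng.getrandbits(32):08x}",
            "expires_at": self._now() + timedelta(seconds=ttl_seconds),
        }

    def is_valid(self, token: dict) -> bool:
        return self._now() < token["expires_at"]


class FakeClock:
    """Test clock that only moves when the test tells it to."""

    def __init__(self, start: datetime) -> None:
        self.current = start

    def __call__(self) -> datetime:
        return self.current

    def advance(self, seconds: int) -> None:
        self.current += timedelta(seconds=seconds)


def test_token_expires_after_ttl():
    clock = FakeClock(datetime(2026, 5, 25, 10, 0, tzinfo=timezone.utc))
    svc = TokenService(now=clock, rng=random.Random(42))

    token = svc.issue(ttl_seconds=60)
    assert svc.is_valid(token)

    clock.advance(61)                 # no sleep, no wall clock
    assert not svc.is_valid(token)


def test_ids_are_reproducible_with_a_fixed_seed():
    clock = FakeClock(datetime(2026, 5, 25, tzinfo=timezone.utc))
    first = TokenService(clock, random.Random(7)).issue(60)["id"]
    second = TokenService(clock, random.Random(7)).issue(60)["id"]
    assert first == second
```

In production the same class would be constructed with a real clock such as `lambda: datetime.now(timezone.utc)` and an unseeded `random.Random()`.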

CI patterns

A good CI pipeline runs the right tests at the right cadence — fast feedback on every push, broader checks on merges, deepest checks before deploy.

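yaml

# .github/workflows/ci.yml
on:
  push:        # every push
  pull_request:

jobs:
  lint-and-types:
    # < 1 minute: block bad code fast
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck

  unit:
    # < 3 minutes
    needs: lint-and-types
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test -- --coverage

  integration:
    # < 10 minutes, only on PRs
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres   # the image refuses to start without one
        ports: ["5432:5432"]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:integration

  e2e:
    # < 30 minutes, only on main
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npm run test:e2e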

Parallel and matrix strategies

yaml

strategy:
  matrix:
    node: [20, 22]
    os:   [ubuntu-latest, macos-latest]

For e2e tests, shard the suite across runners:

bash

npx playwright test --shard=1/4   # runner 1 of 4
npx playwright test --shard=2/4

Output: Playwright's usual report, covering only that shard's tests; the command exits 0 when they all pass.

A 20-minute serial e2e run becomes a 5-minute parallel one.
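
To show what sharding means mechanically, here is a sketch of one way a runner could split a list of test files so that every shard computes the same partition independently. This is an illustration only and makes no claim about how Playwright assigns tests internally; the `shard` function and the file names are made up.

```python
def shard(test_files: list[str], index: int, total: int) -> list[str]:
    """Return the slice of `test_files` for shard `index` of `total` (1-based)."""
    if not 1 <= index <= total:
        raise ValueError("index must be between 1 and total")
    ordered = sorted(test_files)          # same order on every runner
    return ordered[index - 1 :: total]    # round-robin keeps shard sizes within 1


files = [f"tests/e2e/spec_{n:02d}.spec.ts" for n in range(10)]
shards = [shard(files, i, 4) for i in range(1, 5)]
```

Ten files over four shards gives sizes 3, 3, 2 and 2, with every file in exactly one shard. Splitting by count only balances wall time when the tests take roughly equal time, which is why the speed-up in practice is often a little under the ideal factor.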

Test selection on PRs

Run only tests affected by the change to keep PR feedback fast:

bash

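# Naive selection, assuming each src/<name>.py is covered by tests/test_<name>.py
git diff --name-only origin/main... -- 'src/*.py' \
  | sed -E 's#^src/(.*/)?([^/]+)\.py$#tests/test_\2.py#' \
  | sort -u \
  | while read -r t; do if [ -f "$t" ]; then echo "$t"; fi; done \
  | xargs -r pytest   # -r (GNU xargs): run nothing when no test file matched

Output: the normal pytest report for the selected files; nothing runs when no matching test file exists.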

Tools that automate this: pytest-testmon, Vitest --changed, nx affected, turborepo --filter.

Common pitfalls

Top-heavy pyramid (ice cream cone). Many e2e tests, few units. Suite is slow, brittle, and reveals little. Fix: push tests down. Most "integration bugs" are really unit-level invariants the unit test missed.
Mocking your own code. Mocks against your own classes couple tests to implementation. Use fakes (in-memory implementations) for your own boundaries, mocks for third-party ones.
Testing private methods. If a private method needs a test, it probably wants to be public on a smaller class. Test through the public API.
Asserting on entire blobs. assert response == big_json_blob breaks on every cosmetic change. Assert on the specific fields you care about.
Setup that diverges from prod. A test factory that always sets is_active=True hides bugs that only appear for inactive users, which production certainly has. Defaults should match real defaults.
time.sleep() in tests. Almost always a race-condition smell. Use polling helpers, fake clocks, or async awaits.
One huge test method. Long tests fail at one assertion, hiding bugs in the later ones. Split.
Tests that mirror code structure. test_compute_X for every function compute_X. Tests should mirror behaviour — names like test_charges_with_tax_when_state_requires_it.
Snapshot-by-default. Every test as a snapshot — none of them tell you why. Use snapshots sparingly; for logic, use explicit assertions.
Coverage as a goal. Chasing 95% coverage by adding tests for trivial getters wastes effort. Aim for behavioural coverage — every important branch exercised by an assertion.
Flaky tests left to rot. A flaky test in main poisons the whole suite — every red build is dismissed as "probably flaky". Quarantine and fix.
Same fixture for all tests. A single God fixture forces every test to set up everything. Factories let each test build only what it needs.
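
For the `time.sleep()` pitfall, a polling helper is often enough. The sketch below is one possible shape using only the standard library; `wait_until` is a hypothetical helper, and the `threading.Timer` merely stands in for real asynchronous work.

```python
import threading
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 2.0, interval: float = 0.01) -> bool:
    """Poll `condition` until it is true or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)          # short poll interval, not a fixed guess
    return condition()                # one last check at the deadline


def test_background_job_finishes():
    done = threading.Event()
    worker = threading.Timer(0.05, done.set)   # stands in for real async work
    worker.start()
    try:
        assert wait_until(done.is_set, timeout=2.0)
    finally:
        worker.cancel()
```

The test returns as soon as the condition holds, so the generous timeout costs nothing on a fast machine and avoids spurious failures on a slow CI runner.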

Real-world recipes

Test pyramid for a typical web app

text

unit tests           500-2,000 — every push, < 30s
integration tests       50-200 — every push, < 5 min
contract tests          10-50  — every PR, < 2 min
e2e (critical paths)     5-20  — every merge to main, < 10 min
smoke tests              3-5   — after every deploy, < 2 min

Pyramid for a Python data pipeline

text

unit tests         (pure functions)   100-500 — < 10s
integration tests  (read S3, write DB) 20-50  — < 2 min
contract tests     (schema versions)    5-10  — every PR
soak / property    (Hypothesis batches)  3-5  — nightly

Mock at the HTTP layer, not the client class

python

# DON'T — coupling tests to your wrapper's implementation
def test_fetch_user(monkeypatch):
    monkeypatch.setattr("requests.get", lambda *a, **kw: FakeResponse({"id": 1}))

# DO — use a library that intercepts at HTTP layer; survives refactoring the client
import responses

@responses.activate
def test_fetch_user():
    responses.add(responses.GET, "https://api.example.com/users/1",
                  json={"id": 1, "email": "alice@example.com"}, status=200)
    u = api.fetch_user(1)
    assert u.email == "alice@example.com"

For JS, msw (Mock Service Worker) does the same at fetch-level.

Spin up a real DB per test session

python

# conftest.py
import pytest
from testcontainers.postgres import PostgresContainer

@pytest.fixture(scope="session")
def postgres_url():
    with PostgresContainer("postgres:16-alpine") as pg:
        yield pg.get_connection_url()

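@pytest.fixture
def db(postgres_url):
    """Per-test connection, wrapped in a transaction that's rolled back."""
    import psycopg2
    # testcontainers returns a SQLAlchemy-style URL (postgresql+psycopg2://...);
    # psycopg2 itself only accepts the plain postgresql:// scheme.
    dsn = postgres_url.replace("postgresql+psycopg2://", "postgresql://", 1)
    conn = psycopg2.connect(dsn)
    conn.autocommit = False
    yield conn
    conn.rollback()
    conn.close()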

Each test sees a clean DB; container startup happens once.
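
The rollback idea can be tried without Docker. This sketch uses the standard library's `sqlite3` in memory as a stand-in for the shared Postgres; `run_isolated` and the two test bodies are hypothetical names that mimic what the fixture does around each test.

```python
import sqlite3


def make_shared_db() -> sqlite3.Connection:
    """Stands in for the session-scoped database: schema is created and committed once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
    conn.commit()
    return conn


def run_isolated(conn: sqlite3.Connection, test_body) -> None:
    """Run one test body, then roll back whatever it wrote."""
    try:
        test_body(conn)
    finally:
        conn.rollback()


def count_users(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


def inserts_a_user(conn: sqlite3.Connection) -> None:
    conn.execute("INSERT INTO users VALUES (1, 'alice@example.com')")
    assert count_users(conn) == 1     # visible inside the open transaction


def expects_empty_table(conn: sqlite3.Connection) -> None:
    assert count_users(conn) == 0     # the previous body's row was rolled back


shared = make_shared_db()
run_isolated(shared, inserts_a_user)
run_isolated(shared, expects_empty_table)
```

The isolation only holds while the code under test does not commit on its own; code that commits needs truncation between tests instead.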

Property-based test for a parser

python

from hypothesis import given, strategies as st
from src.parser import parse, serialize

@given(st.dictionaries(st.text(min_size=1), st.integers()))
def test_parse_serialize_roundtrip(d):
    assert parse(serialize(d)) == d

If your parser drops keys with whitespace, or serialize mishandles negative integers, Hypothesis will find a minimal counterexample in seconds.

Snapshot test for generated SQL

python

def test_generates_expected_sql(snapshot):
    sql = QueryBuilder().from_("users").where("active = true").build()
    snapshot.assert_match(sql, "active_users.sql")

SQL builders evolve slowly; a snapshot catches accidental output changes immediately.

Shard e2e across CI runners

yaml

# .github/workflows/e2e.yml
strategy:
  matrix:
    shard: [1, 2, 3, 4]
steps:
  - run: npx playwright test --shard=${{ matrix.shard }}/4

Four runners cut wall time by ~4×. Combine with retries on the slowest shard for resilience.

Detect order dependence

bash

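pip install pytest-randomly         # once installed, test order is shuffled on every run
pytest --randomly-seed=last         # replay the previous run's order to reproduce a failure
pytest -p no:randomly               # switch the plugin off while debugging

Output: the normal pytest report; pytest-randomly also prints the seed it used in the header.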

bash

# Vitest equivalent
npx vitest run --sequence.shuffle

Output: the normal Vitest report, with test order randomised.

If turning on shuffle reveals failures, you have hidden shared state — fix the offending test's cleanup.

Test selection on PR

bash

# pytest-testmon — runs only tests whose covered code changed
pip install pytest-testmon
pytest --testmon

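Output: the normal pytest report, limited to the tests testmon selected (the first run executes everything to build its dependency data).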

bash

# Vitest: runs only the test files that import the given source files
npx vitest related src/cart.ts --run

Output: the normal Vitest report for the related test files only.

PR feedback drops from minutes to seconds.

A typical CI matrix for a TypeScript lib

yaml

strategy:
  matrix:
    node: [18, 20, 22]
    os:   [ubuntu-latest, macos-latest, windows-latest]
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with:
      node-version: ${{ matrix.node }}
  - run: npm ci
  - run: npm run typecheck
  - run: npm test

Nine combinations catch most cross-platform / cross-version bugs at minimal cost.

Tips

A test suite nobody runs catches nothing, however fast it is. Optimise for runnable speed: under 30 seconds for the inner loop. If devs only run tests in CI, the feedback latency kills the practice.

Cross-link: pytest covers Python-specific fixtures and parametrize; Vitest for JS/TS unit tests; Playwright for browser e2e. See Code Review for what to look for in test PRs.

Tests are documentation that cannot lie. A green suite that names every behaviour clearly is the best onboarding document a codebase has. When you read a test name and think I have no idea what this is verifying, the test name needs more thought than the assertion.