cheat sheet
Testing Strategies
A practical guide to test design — the pyramid (unit/integration/e2e), fixture and mock patterns, property-based testing, snapshot tests, and CI strategies that scale.
Testing Strategies — Pyramid, Fixtures, Mocks, Property-Based, and CI Patterns
What it is
A testing strategy is the set of decisions that answers: at what level do we test each thing, how do we set up the world for each test, where do we use real dependencies versus fakes, what counts as "good enough" coverage, and which tests run on every push versus once a night. Good strategies make defects cheap to find (the test fails before the code ships), tests fast enough to run constantly (under 30 seconds for the unit layer), and refactors safe (a green suite means behaviour is intact, not just implementation).
The single most useful mental model is the test pyramid: many fast unit tests at the bottom, fewer integration tests in the middle, a small number of end-to-end (e2e) tests at the top. The inversion of this — a top-heavy pyramid full of slow brittle e2e tests with almost no unit tests — is the "ice cream cone" anti-pattern that haunts most large legacy codebases. The pyramid is not a religious doctrine; the shapes that work in practice vary (integration tests up to ~50% in tightly integrated services). The principle is constant: push tests as low as they can go and still catch the bug.
The pyramid in one picture
        /\             e2e (real browser, real DB): slow, brittle, few
       /  \            ~5%
      /----\
     /      \          integration (multiple modules, real DB or fakes)
    /        \         ~15-25%
   /----------\
  /            \       unit (one module, no I/O)
 /______________\      ~70-80%
                       fast, isolated, many
Three layers, three goals. Unit tests answer "did the function compute the right value?"; integration tests answer "do these pieces talk to each other?"; e2e tests answer "does the whole system, including network and UI, satisfy the user's flow?"
| Layer | What it tests | Speed | Use real I/O? |
|---|---|---|---|
| Unit | One function/class in isolation | < 1 ms | No |
| Integration | Several modules + maybe a real DB | 10-100 ms | Sometimes |
| Contract | One service's API against a spec | 10-100 ms | Stub the other side |
| e2e (UI / API) | Full system, real network | 1-10 s | Yes |
| Smoke | Critical-path verification on prod | 1-10 s | Yes (against prod) |
| Property-based | Many random inputs satisfying invariants | varies | No |
Unit tests
A unit test exercises a single function or class with no I/O, no real network, no real database — anything that touches the outside world is replaced with a fake. The goal is to verify behaviour (inputs → outputs, side effects on collaborators) quickly enough that you run the whole suite on every save without breaking flow.
Three rules:
- Arrange-Act-Assert. Three blocks, clearly separated. One assertion (or one logical group of assertions) per test.
- Test names describe behaviour. test_returns_zero_for_empty_cart, not test_total_1.
- Test through the public API. If you find yourself peeking at private state, the design is probably the problem.
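As a minimal sketch of the three rules together, the test below spells out the Arrange, Act, and Assert blocks explicitly. The function apply_discount is a hypothetical example, not part of the cart module used in this guide.

```python
def apply_discount(total: float, percent: float) -> float:
    """Hypothetical function under test: reduce a total by a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(total * (1 - percent / 100), 2)


def test_applies_percentage_discount_to_total():
    # Arrange: build the inputs the behaviour needs
    total, percent = 200.00, 15

    # Act: call the public API exactly once
    discounted = apply_discount(total, percent)

    # Assert: one logical check on the outcome
    assert discounted == 170.00
```

The name states the behaviour, and the test never reaches past the public function into implementation details.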
# src/cart.py
from dataclasses import dataclass

@dataclass
class Item:
    price: float
    qty: int

def cart_total(items: list[Item]) -> float:
    return sum(i.price * i.qty for i in items)

# tests/test_cart.py
import pytest
from src.cart import Item, cart_total

def test_returns_zero_for_empty_cart():
    assert cart_total([]) == 0

def test_sums_a_single_line_item():
    assert cart_total([Item(price=10.00, qty=3)]) == 30.00

def test_sums_multiple_line_items():
    items = [Item(price=10, qty=2), Item(price=5, qty=4)]
    assert cart_total(items) == 40

@pytest.mark.parametrize("price,qty,expected", [
    (10.00, 1, 10.00),
    (10.00, 0, 0.00),
    (0.99, 3, 2.97),
])
def test_line_item_pricing(price, qty, expected):
    assert cart_total([Item(price=price, qty=qty)]) == pytest.approx(expected)
pytest tests/test_cart.py -v
Output:
tests/test_cart.py::test_returns_zero_for_empty_cart PASSED
tests/test_cart.py::test_sums_a_single_line_item PASSED
tests/test_cart.py::test_sums_multiple_line_items PASSED
tests/test_cart.py::test_line_item_pricing[10.0-1-10.0] PASSED
tests/test_cart.py::test_line_item_pricing[10.0-0-0.0] PASSED
tests/test_cart.py::test_line_item_pricing[0.99-3-2.97] PASSED
6 passed in 0.03s
One-shot tests for one-shot bugs are fine. But when you've written the third test that varies only the input, reach for parametrize (Python), test.each (Vitest), or table-driven tests (Go). One test reading three lines of data beats three near-identical tests.
Integration tests
An integration test exercises more than one unit, plus the wiring between them. The typical shape is "a real database, a real cache, real HTTP between two services, fakes for anything else". Use them where wiring is non-trivial (ORMs, message queues, auth middleware) — places where unit tests can pass while the system is broken.
# tests/test_user_repository.py: uses a real Postgres via testcontainers
import pytest
from testcontainers.postgres import PostgresContainer
from src.repo import UserRepository, User

@pytest.fixture(scope="session")
def db_url():
    with PostgresContainer("postgres:16-alpine") as pg:
        yield pg.get_connection_url()

@pytest.fixture
def repo(db_url):
    repo = UserRepository(db_url)
    repo.create_schema()
    yield repo
    repo.truncate_all()

def test_save_and_load_roundtrip(repo):
    user = User(id=1, email="alice@example.com", name="Alice Dev")
    repo.save(user)
    loaded = repo.find_by_id(1)
    assert loaded == user

def test_returns_none_for_missing_user(repo):
    assert repo.find_by_id(404) is None
pytest tests/test_user_repository.py -v
Output:
tests/test_user_repository.py::test_save_and_load_roundtrip PASSED
tests/test_user_repository.py::test_returns_none_for_missing_user PASSED
2 passed in 1.42s
The fixture spins up a real Postgres container once per test session and truncates between tests. This is integration-grade isolation: the SQL, schema, and driver are all exercised, but the test still finishes in seconds.
The integration layer is where dialect-specific bugs hide. Unit tests against a stubbed repository pass with any query; integration tests against the real database catch the join you forgot.
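To illustrate the kind of bug only a real database catches, here is a self-contained sketch that uses the standard library's sqlite3 as a stand-in for Postgres, so it runs without Docker. The repository class and its methods are hypothetical. The point is that the UNIQUE constraint lives in the schema, so a stubbed repository would happily accept the duplicate.

```python
import sqlite3
from typing import Optional


class UserRepository:
    """Hypothetical repository that sends real SQL to the connection it is given."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def create_schema(self) -> None:
        self.conn.execute(
            "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL)"
        )

    def save(self, user_id: int, email: str) -> None:
        self.conn.execute(
            "INSERT INTO users (id, email) VALUES (?, ?)", (user_id, email)
        )

    def find_email(self, user_id: int) -> Optional[str]:
        row = self.conn.execute(
            "SELECT email FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None


def make_repo() -> UserRepository:
    # A fresh in-memory database per test gives full isolation
    repo = UserRepository(sqlite3.connect(":memory:"))
    repo.create_schema()
    return repo


def test_duplicate_email_is_rejected_by_the_database():
    repo = make_repo()
    repo.save(1, "alice@example.com")
    try:
        repo.save(2, "alice@example.com")
    except sqlite3.IntegrityError:
        rejected = True
    else:
        rejected = False
    assert rejected
```

SQLite and Postgres differ in dialect, so this only demonstrates the idea. For production confidence, the container-based fixture above remains the better choice.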
End-to-end (e2e) tests
An e2e test exercises the whole system, from the user's edge (browser, CLI, API client) through every layer down to the persistence and back. They are the slowest and most brittle — a network hiccup or a slow render can fail a test that finds no bug — so the rule is few, focused, and on the critical paths.
// tests/e2e/checkout.spec.ts (Playwright)
import { test, expect } from "@playwright/test";

test("user can checkout a single item", async ({ page }) => {
  await page.goto("/products/widget-1");
  await page.getByRole("button", { name: "Add to cart" }).click();
  await page.getByRole("link", { name: /cart/i }).click();
  await expect(page.getByText("Widget 1")).toBeVisible();
  await page.getByRole("button", { name: "Checkout" }).click();
  await page.getByLabel("Email").fill("alice@example.com");
  await page.getByLabel("Card").fill("4242 4242 4242 4242");
  await page.getByRole("button", { name: "Pay" }).click();
  await expect(page).toHaveURL(/\/order\/[0-9]+$/);
  await expect(page.getByRole("heading", { name: /thank you/i })).toBeVisible();
});
npx playwright test tests/e2e/checkout.spec.ts
Output:
Running 1 test using 1 worker
✓ tests/e2e/checkout.spec.ts:3:1 › user can checkout a single item (4.5s)
1 passed (5s)
A good e2e set covers ~5-10 critical user journeys (sign up, log in, checkout, password reset) and runs on every deploy — not on every push. The full suite stays in pre-deploy CI or nightly.
The most common e2e failure mode is flakiness. Network jitter, async UI updates, and CI noise produce false positives that erode trust in the suite. Quarantine flaky tests aggressively — disable them, file a bug, fix root cause. A flaky test that "usually passes" is worse than no test at all.
Test doubles — mocks, stubs, fakes, spies
A test double is any object that stands in for a real collaborator. The taxonomy (Gerard Meszaros, xUnit Test Patterns) is precise — and most code confuses them, with concrete consequences for fragility.
| Double | What it does | Use when |
|---|---|---|
| Dummy | Object that is passed but never used | Filling out parameter lists |
| Stub | Returns canned answers | You need a specific response from a collaborator |
| Fake | A working but simplified implementation (in-memory DB) | You need the behaviour without the cost |
| Spy | A stub that records how it was called | You want to assert on the interaction |
| Mock | A spy that fails the test if it isn't called as expected | You're verifying a protocol — the order and arguments of calls matter |
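The difference between a stub and a spy is easiest to see in a hand-rolled version. The sketch below is illustrative only: SpyMailer and notify_signup are hypothetical names. The spy does nothing useful when called, like a stub, but keeps a record that the test inspects afterwards.

```python
class SpyMailer:
    """Spy: accepts calls like a stub, and records each one for later inspection."""

    def __init__(self):
        self.sent = []

    def send(self, to, subject, body):
        self.sent.append({"to": to, "subject": subject, "body": body})


def notify_signup(mailer, email):
    """Hypothetical code under test."""
    mailer.send(to=email, subject="Welcome", body="Thanks for signing up.")


def test_signup_sends_exactly_one_welcome_email():
    mailer = SpyMailer()

    notify_signup(mailer, "alice@example.com")

    # The assertions live in the test, not in the double
    assert len(mailer.sent) == 1
    assert mailer.sent[0]["to"] == "alice@example.com"
    assert mailer.sent[0]["subject"] == "Welcome"
```

A mock in the strict sense would carry these expectations itself and fail the test when they are not met.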
Stub vs mock — the practical difference
# tests/test_password_reset.py (Python)
# PasswordResetService, StubUserRepo, and User are assumed to be imported
# from the application code and its test helpers.
from unittest.mock import Mock

# STUB: canned behaviour; we don't care whether it's called or not
class StubMailer:
    def send(self, to, subject, body):
        pass

# MOCK: assertion-grade; we care HOW it's called
def test_password_reset_sends_email_with_token():
    mailer = Mock()
    repo = StubUserRepo({"alice@example.com": User(id=1, email="alice@example.com")})
    svc = PasswordResetService(repo, mailer, token_generator=lambda: "TOKEN123")
    svc.request_reset("alice@example.com")
    mailer.send.assert_called_once_with(
        to="alice@example.com",
        subject="Reset your password",
        body="Use code TOKEN123 to reset your password.",
    )
pytest tests/test_password_reset.py
Output:
tests/test_password_reset.py::test_password_reset_sends_email_with_token PASSED
1 passed in 0.04s
The mock is what makes this test express "the service must call the mailer with the right token". A stub would not enforce it.
When to prefer fakes over mocks
A Mock couples the test to the implementation: exactly how the service talks to its collaborator. Change how the service calls the collaborator and many tests break, even when behaviour is unchanged. A Fake couples the test to the behaviour: it does the same thing as the real implementation, in memory. Refactoring the service does not break tests built on the fake.
# Prefer this: in-memory fake
class InMemoryUserRepo:
    def __init__(self): self.users: dict[int, User] = {}
    def save(self, u: User): self.users[u.id] = u
    def find_by_id(self, id: int) -> User | None: return self.users.get(id)

# Over this: mock that asserts on save() calls in detail
mock_repo = Mock()
mock_repo.find_by_id.return_value = User(id=1, email="alice@example.com")
# ...
mock_repo.save.assert_called_once_with(User(id=1, email="alice@example.com", name="Alice Dev"))
Use mocks sparingly and at boundaries — third-party APIs, mailers, external services. For your own classes, fakes scale better.
The mock-vs-fake rule: if it has a single side effect to the outside world (send email, post webhook), mock it. If it's storage with semantics you care about (a repository, a cache), fake it.
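A small experiment makes the coupling concrete. In this sketch, two hypothetical versions of an import function have the same observable behaviour but call the repository differently. The check built on a fake accepts both; the check built on a Mock rejects the refactored one. All names here are illustrative.

```python
from unittest.mock import Mock


class InMemoryUserRepo:
    """Fake: a working in-memory implementation of the repository."""

    def __init__(self):
        self.users = {}

    def save(self, user_id, email):
        self.users[user_id] = email

    def save_many(self, pairs):
        for user_id, email in pairs:
            self.save(user_id, email)

    def find_by_id(self, user_id):
        return self.users.get(user_id)


def import_users_v1(repo, pairs):
    for user_id, email in pairs:
        repo.save(user_id, email)


def import_users_v2(repo, pairs):
    # Refactored: same outcome, but one bulk call instead of many single calls
    repo.save_many(pairs)


PAIRS = [(1, "alice@example.com"), (2, "bob@example.com")]


def check_with_fake(import_users) -> bool:
    """Behaviour-level check: were the users actually stored?"""
    repo = InMemoryUserRepo()
    import_users(repo, PAIRS)
    return (repo.find_by_id(1) == "alice@example.com"
            and repo.find_by_id(2) == "bob@example.com")


def check_with_mock(import_users) -> bool:
    """Interaction-level check: was save() called with these arguments?"""
    repo = Mock()
    import_users(repo, PAIRS)
    try:
        repo.save.assert_any_call(1, "alice@example.com")
    except AssertionError:
        return False
    return True
```

The mock-based check fails on import_users_v2 even though nothing a user could observe has changed, which is the fragility described above.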
Fixture strategies
A fixture is the test's Arrange — the world the test exercises. Three patterns dominate:
1. Factories (preferred)
A factory builds a valid domain object with sensible defaults. Tests override only the fields that matter to the case.
# tests/conftest.py
import pytest
from factory import Factory, Faker
from src.user import User

class UserFactory(Factory):
    class Meta:
        model = User

    id = Faker("random_int", min=1, max=10000)
    email = Faker("email")
    name = Faker("name")
    is_active = True

@pytest.fixture
def user_factory():
    return UserFactory

def test_inactive_user_cannot_login(user_factory):
    u = user_factory(is_active=False)
    assert not can_login(u, password="anything")
The test reads as a sentence and depends on only what it cares about (is_active=False). Adding a new required field to User does not break it — the factory fills in a default.
2. Fixtures (pytest)
@pytest.fixture
def empty_cart():
    return Cart(items=[])

@pytest.fixture
def cart_with_one_item(empty_cart):
    empty_cart.add(Item(price=10, qty=1))
    return empty_cart
Fixtures compose — cart_with_one_item takes empty_cart as input and extends it. This expresses test setup as a dependency graph.
3. Test data builders (TypeScript)
// tests/builders.ts
class UserBuilder {
  private user: User = { id: 1, email: "alice@example.com", name: "Alice Dev", isActive: true };
  withId(id: number) { this.user.id = id; return this; }
  withEmail(e: string) { this.user.email = e; return this; }
  inactive() { this.user.isActive = false; return this; }
  build(): User { return { ...this.user }; }
}
export const aUser = () => new UserBuilder();

test("inactive user cannot log in", () => {
  const u = aUser().inactive().build();
  expect(canLogin(u, "anything")).toBe(false);
});
The fluent API reads aloud — a user, inactive. The default user is a complete valid one.
Fixture scopes
| Scope | When the fixture is built | Use for |
|---|---|---|
| function (default) | Once per test | Cheap, isolated data |
| class | Once per test class | Class-shared setup |
| module | Once per test file | Expensive shared resources |
| session | Once per pytest run | Test database, Docker container |
Shared-state fixtures (scope="session") are dangerous. A test mutating the shared DB affects every later test. Use them only for resources that are immutable or that a narrower-scoped fixture cleans up after each test.
Property-based testing
Example-based tests assert behaviour on one input. Property-based tests assert invariants — properties that should hold for any input — and the framework generates hundreds of random examples to try to break them. Tools: Hypothesis (Python), fast-check (JS/TS), QuickCheck (Haskell, the original), PropEr (Erlang).
# pip install hypothesis
from hypothesis import given, strategies as st

def reverse(xs: list[int]) -> list[int]:
    return xs[::-1]

@given(st.lists(st.integers()))
def test_reverse_twice_is_identity(xs):
    assert reverse(reverse(xs)) == xs

@given(st.lists(st.integers(), min_size=1))
def test_reverse_first_equals_original_last(xs):
    assert reverse(xs)[0] == xs[-1]

@given(st.lists(st.integers()))
def test_sorted_is_idempotent(xs):
    assert sorted(sorted(xs)) == sorted(xs)
pytest tests/test_properties.py -v
Output:
tests/test_properties.py::test_reverse_twice_is_identity PASSED
tests/test_properties.py::test_reverse_first_equals_original_last PASSED
tests/test_properties.py::test_sorted_is_idempotent PASSED
3 passed in 0.21s
By default, Hypothesis generates 100 examples per property. When a property fails, it shrinks the input: it searches for the smallest example that still fails, so you debug a 3-element list instead of a 47-element one.
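To give a feel for what shrinking does, here is a toy greedy shrinker for lists of integers. It is a teaching sketch under simple assumptions, not Hypothesis's actual algorithm: it repeatedly tries dropping an element or moving a value toward zero, and keeps any change for which the property still fails. The names shrink_list and fails are hypothetical.

```python
def shrink_list(xs, fails):
    """Greedy toy shrinker: simplify xs while fails(xs) stays true."""
    current = list(xs)
    improved = True
    while improved:
        improved = False

        # Step 1: try removing one element at a time
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if fails(candidate):
                current = candidate
                improved = True
                break
        if improved:
            continue

        # Step 2: try moving each element toward zero
        for i, x in enumerate(current):
            for smaller in (0, x // 2):
                if smaller != x and abs(smaller) < abs(x):
                    candidate = current[:i] + [smaller] + current[i + 1:]
                    if fails(candidate):
                        current = candidate
                        improved = True
                        break
            if improved:
                break
    return current


# A deliberately false property: "the sum of any list is below 100"
def property_fails(xs):
    return sum(xs) >= 100


original = [47, 3, 250, 12, 99]
minimal = shrink_list(original, property_fails)
print(minimal)  # a much shorter list that still breaks the property
```

The result is a local minimum rather than a guaranteed global one; real shrinkers use more passes and smarter strategies.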
Useful properties:
| Property | Pattern |
|---|---|
| Round-trip | decode(encode(x)) == x |
| Idempotence | f(f(x)) == f(x) |
| Commutativity | f(a, b) == f(b, a) |
| Associativity | f(f(a, b), c) == f(a, f(b, c)) |
| Inverse | decompress(compress(x)) == x |
| Oracle | mine(x) == reference_implementation(x) |
Property-based tests are particularly potent at finding edge cases your imagination would never list: empty input, single-element input, max int, NaN, Unicode surrogate pairs, empty strings, zero-width characters, and very long lists.
Add property-based tests at boundaries — parsers, serializers, encoders/decoders, math kernels. The combination "round-trip property" + "random inputs" routinely uncovers bugs in code that has 100% line coverage.
Snapshot tests
A snapshot test captures the output of code (a string, JSON, HTML, image) the first time it runs, then asserts every subsequent run produces the same output. It is fast to write — no manual assertions — but easy to abuse.
// tests/render.test.tsx (Vitest; the .tsx extension is needed because the test contains JSX)
import { expect, test } from "vitest";
import { renderToString } from "react-dom/server";
import { ProductCard } from "../src/ProductCard";

test("ProductCard renders consistently", () => {
  const html = renderToString(
    <ProductCard name="Widget" price={9.99} inStock={true} />
  );
  expect(html).toMatchSnapshot();
});
npx vitest run tests/render.test.tsx
Output:
✓ tests/render.test.tsx (1)
✓ ProductCard renders consistently
Snapshots 1 written
Test Files 1 passed (1)
Tests 1 passed (1)
Snapshot tests are good for:
- HTML/component output that changes rarely.
- Generated configs (Terraform plans, codegen output).
- Schema migrations.
- API response shapes.
Snapshot tests are bad for:
- Anything that changes per test run (timestamps, UUIDs).
- Highly volatile output (UI under active redesign).
- Logic verification — the snapshot says what came out, not why.
The snapshot anti-pattern: a PR fails its snapshot tests, the developer reruns with the update flag (-u in Vitest, --update-snapshots in Playwright), and the diff goes unread. Always read a snapshot diff; that's the whole assertion. If you cannot summarise why the snapshot changed, do not update it.
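One way to keep per-run values out of a snapshot is to normalise them before comparing. The sketch below is an illustration under assumptions: the helper name normalise, the placeholder strings, and the payload shape are all hypothetical. It serialises a payload deterministically and masks UUIDs and ISO-8601 timestamps.

```python
import json
import re

UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.IGNORECASE
)
ISO_TS_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:\d{2})?"
)


def normalise(payload: dict) -> str:
    """Return a stable text form of payload, suitable for snapshot comparison."""
    # sort_keys makes the output independent of dict insertion order
    text = json.dumps(payload, indent=2, sort_keys=True)
    text = UUID_RE.sub("<uuid>", text)
    return ISO_TS_RE.sub("<timestamp>", text)


first_run = {
    "id": "3f2b8c1e-9d4a-4c6b-8e2f-1a2b3c4d5e6f",
    "created_at": "2026-05-25T10:00:00Z",
    "status": "paid",
}
second_run = {
    "status": "paid",
    "created_at": "2026-05-26T08:30:12.123456+00:00",
    "id": "aaaaaaaa-bbbb-4ccc-8ddd-eeeeeeeeeeee",
}

# Both runs reduce to the same snapshot text despite different IDs and times
assert normalise(first_run) == normalise(second_run)
```

The trade-off is that masked fields are no longer checked at all, so anything that matters about them needs its own explicit assertion.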
Test-driven development (TDD)
TDD is a discipline, not a tool: write a failing test first, write the minimum code that makes it pass, then refactor with the test as your safety net. The rhythm is red → green → refactor.
1. RED: write a failing test (it fails because the code doesn't exist yet)
2. GREEN: write the smallest amount of code to make it pass
3. REFACTOR: clean up — duplication, naming, structure — with all tests green
TDD is most useful when the design is uncertain — writing the test first forces you to decide what the API is before the implementation locks it in. It is less useful (and arguably wasteful) for code with no behavioural questions, like simple data classes or trivial getters.
Variants:
| Variant | Description |
|---|---|
| Classic / inside-out | Start with the innermost unit and grow outward |
| London-school / outside-in | Start at the boundary, mock collaborators, work inward |
| BDD (Cucumber/Gherkin) | Same loop, with Given/When/Then natural-language tests |
Coverage — what it tells you and what it doesn't
Coverage measures which lines of code were executed during the test run. It does not measure whether they were tested correctly.
| Metric | What it counts |
|---|---|
| Line coverage | Did this line run? |
| Branch coverage | Did both sides of this if run? |
| Statement coverage | Same as line in most languages |
| Path coverage | Did every possible execution path run? Rarely used — exponential. |
| Mutation coverage | If we mutate the code, do tests fail? The best signal. |
pytest --cov=src --cov-report=term-missing
Output:
---------- coverage: platform linux, python 3.12.3 -----------
Name               Stmts   Miss  Cover   Missing
-----------------------------------------------------
src/cart.py           12      0   100%
src/checkout.py       45      8    82%   23-25, 47, 52-55
src/payments.py       30      3    90%   18-20
-----------------------------------------------------
TOTAL                 87     11    87%
Targets:
- 80% line coverage is a reasonable floor for most codebases.
- 100% coverage rarely makes the suite better; the marginal tests are usually tautological.
- Critical paths at 100% (payments, auth) is much more valuable than 100% everywhere.
- Mutation coverage (mutmut, Stryker) catches "tests that pass without asserting", the deepest coverage gap.
Coverage is a necessary but not sufficient signal. A test that calls a function and asserts nothing pushes coverage up without testing anything. Use mutation testing periodically to find these.
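The gap between "executed" and "tested" can be shown with a hand-made mutant. This is a toy illustration of the idea behind mutation testing, not how mutmut or Stryker work internally; every name in it is hypothetical.

```python
def is_adult(age: int) -> bool:
    return age >= 18


def is_adult_mutant(age: int) -> bool:
    # Mutation: ">=" replaced by ">"
    return age > 18


def weak_test(fn) -> bool:
    """Executes every line of fn (100% coverage) but asserts nothing."""
    fn(30)
    return True


def strong_test(fn) -> bool:
    """Checks the boundary, so an off-by-one mutation is detected."""
    return fn(18) is True and fn(17) is False and fn(30) is True


# The weak test passes for both versions: the mutant survives
assert weak_test(is_adult) and weak_test(is_adult_mutant)

# The strong test passes for the original and fails for the mutant: the mutant is killed
assert strong_test(is_adult)
assert not strong_test(is_adult_mutant)
```

Both tests yield the same line coverage; only the second one would notice the bug.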
Flakiness — root causes and fixes
A flaky test passes sometimes and fails sometimes for the same code. Each cause has a fix:
| Cause | Fix |
|---|---|
| Time / dates / sleeps | Freeze time (freezegun, vi.useFakeTimers) |
| Random IDs / UUIDs | Inject a seedable RNG |
| Async race conditions | await the right promise; use proper test waits |
| Shared global state | Reset between tests; prefer DI |
| Network / external APIs | Mock at the HTTP layer (responses, nock, msw) |
| Test order dependence | Run in random order to detect (pytest-randomly, vitest --sequence.shuffle); fix by isolating |
| File system / temp dirs | tmp_path fixture (pytest) / os.tmpdir() |
| Database leakage | Truncate or transaction-wrap each test |
# Freezing time
from freezegun import freeze_time

@freeze_time("2026-05-25 10:00:00")
def test_token_expiry():
    token = issue_token(ttl_seconds=60)
    assert is_valid(token)
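An alternative to patching the clock is to inject one. The sketch below assumes the token functions accept a clock argument, which differs from the signatures in the freezegun example above; FakeClock and these signatures are hypothetical. It has the advantage of testing both sides of the expiry boundary with no extra library.

```python
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Test clock whose time only moves when the test says so."""

    def __init__(self, start: datetime):
        self._now = start

    def now(self) -> datetime:
        return self._now

    def advance(self, seconds: float) -> None:
        self._now += timedelta(seconds=seconds)


def issue_token(clock, ttl_seconds: int) -> dict:
    # Hypothetical token: just an expiry timestamp
    return {"expires_at": clock.now() + timedelta(seconds=ttl_seconds)}


def is_valid(token: dict, clock) -> bool:
    return clock.now() < token["expires_at"]


def test_token_expires_after_ttl():
    clock = FakeClock(datetime(2026, 5, 25, 10, 0, 0, tzinfo=timezone.utc))
    token = issue_token(clock, ttl_seconds=60)

    assert is_valid(token, clock)      # still inside the TTL

    clock.advance(61)
    assert not is_valid(token, clock)  # one second past the TTL
```

In production code the same parameter would receive a thin wrapper around the system clock.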
CI patterns
A good CI pipeline runs the right tests at the right cadence — fast feedback on every push, broader checks on merges, deepest checks before deploy.
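# .github/workflows/ci.yml
on:
  push:            # every push
  pull_request:
jobs:
  lint-and-types:
    # < 1 minute: block bad code fast
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run lint
      - run: npm run typecheck
  unit:
    # < 3 minutes
    needs: lint-and-types
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm test -- --coverage
  integration:
    # < 10 minutes, only on PRs
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres   # placeholder; the image will not start without one
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run test:integration
  e2e:
    # < 30 minutes, only on main
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx playwright install --with-deps
      - run: npm run test:e2e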
Parallel and matrix strategies
strategy:
  matrix:
    node: [20, 22]
    os: [ubuntu-latest, macos-latest]
For e2e tests, shard the suite across runners:
npx playwright test --shard=1/4 # runner 1 of 4
npx playwright test --shard=2/4
A 20-minute serial e2e run becomes a 5-minute parallel one.
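For a sense of what sharding has to guarantee, the sketch below assigns each test to exactly one shard using a stable hash. It is an illustration of the general idea only, not Playwright's actual partitioning; shard_for and select_shard are hypothetical names.

```python
import hashlib


def shard_for(test_id: str, total: int) -> int:
    """Map a test ID to a shard number in 1..total, identically on every machine."""
    # hashlib is used because Python's built-in hash() is salted per process
    digest = hashlib.sha256(test_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % total + 1


def select_shard(test_ids, index: int, total: int):
    """Return the tests that runner number `index` of `total` should execute."""
    return [t for t in test_ids if shard_for(t, total) == index]


all_tests = [f"tests/e2e/flow_{i}.spec.ts" for i in range(40)]
shards = [select_shard(all_tests, k, 4) for k in range(1, 5)]

# Every test runs exactly once across the four runners
assert sorted(t for shard in shards for t in shard) == sorted(all_tests)
```

Hash-based assignment balances test counts, not durations; balancing by recorded runtime is a common refinement when a few tests dominate the wall time.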
Test selection on PRs
Run only tests affected by the change to keep PR feedback fast:
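# Naive sketch: run the test file that matches each source file changed in this PR.
# Assumes a flat src/<name>.py -> tests/test_<name>.py layout; xargs -r is the GNU flag
# that skips running pytest when nothing matched.
git diff --name-only origin/main... -- 'src/*.py' \
  | sed -E 's#^src/(.+)\.py$#tests/test_\1.py#' \
  | while read -r f; do [ -f "$f" ] && echo "$f"; done \
  | xargs -r pytest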
Tools that automate this: pytest-testmon, Vitest --changed, nx affected, turborepo --filter.
Common pitfalls
- Top-heavy pyramid (ice cream cone). Many e2e tests, few units. Suite is slow, brittle, and reveals little. Fix: push tests down. Most "integration bugs" are really unit-level invariants the unit test missed.
- Mocking your own code. Mocks against your own classes couple tests to implementation. Use fakes (in-memory implementations) for your own boundaries, mocks for third-party ones.
- Testing private methods. If a private method needs a test, it probably wants to be public on a smaller class. Test through the public API.
- Asserting on entire blobs. assert response == big_json_blob breaks on every cosmetic change. Assert on the specific fields you care about.
- Setup that diverges from prod. A test factory that always sets is_active=True hides bugs that only affect inactive users, which production also has. Defaults should match real defaults.
- time.sleep() in tests. Almost always a race-condition smell. Use polling helpers, fake clocks, or async awaits.
- One huge test method. Long tests fail at one assertion, hiding bugs in the later ones. Split.
- Tests that mirror code structure. test_compute_X for every function compute_X. Tests should mirror behaviour, with names like test_charges_with_tax_when_state_requires_it.
- Snapshot-by-default. Every test as a snapshot, and none of them tell you why. Use snapshots sparingly; for logic, use explicit assertions.
- Coverage as a goal. Chasing 95% coverage by adding tests for trivial getters wastes effort. Aim for behavioural coverage — every important branch exercised by an assertion.
- Flaky tests left to rot. A flaky test in main poisons the whole suite — every red build is dismissed as "probably flaky". Quarantine and fix.
- Same fixture for all tests. A single God fixture forces every test to set up everything. Factories let each test build only what it needs.
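For the time.sleep() pitfall, a polling helper is often enough. The sketch below is one minimal way to write it; wait_until and its parameters are hypothetical names. It still sleeps between checks, but it returns as soon as the condition holds and fails with a bounded timeout instead of guessing a fixed delay.

```python
import threading
import time


def wait_until(predicate, timeout: float = 2.0, interval: float = 0.01) -> bool:
    """Poll predicate until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def test_background_job_eventually_finishes():
    done = threading.Event()
    # Simulated background work that completes after roughly 50 ms
    threading.Timer(0.05, done.set).start()

    assert wait_until(done.is_set, timeout=2.0)
```

The test takes about as long as the work itself, and the generous timeout only matters when something is actually broken.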
Real-world recipes
Test pyramid for a typical web app
unit tests 500-2,000 — every push, < 30s
integration tests 50-200 — every push, < 5 min
contract tests 10-50 — every PR, < 2 min
e2e (critical paths) 5-20 — every merge to main, < 10 min
smoke tests 3-5 — after every deploy, < 2 min
Pyramid for a Python data pipeline
unit tests (pure functions) 100-500 — < 10s
integration tests (read S3, write DB) 20-50 — < 2 min
contract tests (schema versions) 5-10 — every PR
soak / property (Hypothesis batches) 3-5 — nightly
Mock at the HTTP layer, not the client class
# DON'T: coupling tests to your wrapper's implementation
def test_fetch_user(monkeypatch):
    monkeypatch.setattr("requests.get", lambda *a, **kw: FakeResponse({"id": 1}))

# DO: use a library that intercepts at the HTTP layer; survives refactoring the client
import responses

@responses.activate
def test_fetch_user():
    responses.add(responses.GET, "https://api.example.com/users/1",
                  json={"id": 1, "email": "alice@example.com"}, status=200)
    u = api.fetch_user(1)
    assert u.email == "alice@example.com"
For JS, msw (Mock Service Worker) does the same at fetch-level.
Spin up a real DB per test session
# conftest.py
import psycopg2
import pytest
from testcontainers.postgres import PostgresContainer

@pytest.fixture(scope="session")
def postgres_url():
    with PostgresContainer("postgres:16-alpine") as pg:
        # testcontainers may return a SQLAlchemy-style URL ("postgresql+psycopg2://...");
        # psycopg2.connect expects a plain "postgresql://" DSN, so strip the driver suffix.
        yield pg.get_connection_url().replace("postgresql+psycopg2://", "postgresql://")

@pytest.fixture
def db(postgres_url):
    """Per-test connection, wrapped in a transaction that's rolled back."""
    conn = psycopg2.connect(postgres_url)
    conn.autocommit = False
    yield conn
    conn.rollback()
    conn.close()
Each test sees a clean DB; container startup happens once.
Property-based test for a parser
from hypothesis import given, strategies as st
from src.parser import parse, serialize
@given(st.dictionaries(st.text(min_size=1), st.integers()))
def test_parse_serialize_roundtrip(d):
    assert parse(serialize(d)) == d
If your parser drops keys with whitespace, or serialize mishandles negative integers, Hypothesis will find a minimal counterexample in seconds.
Snapshot test for generated SQL
def test_generates_expected_sql(snapshot):
    sql = QueryBuilder().from_("users").where("active = true").build()
    snapshot.assert_match(sql, "active_users.sql")
SQL builders evolve slowly; a snapshot catches accidental output changes immediately.
Shard e2e across CI runners
# .github/workflows/e2e.yml
strategy:
matrix:
shard: [1, 2, 3, 4]
steps:
- run: npx playwright test --shard=${{ matrix.shard }}/4
Four runners cut wall time by ~4×. Combine with retries on the slowest shard for resilience.
Detect order dependence
# Requires the pytest-randomly plugin, which shuffles test order on every run
pytest --randomly-seed=last          # reproduce the previous run's seed
pytest --randomly-dont-reorganize    # disable shuffling while debugging
# Vitest equivalent
npx vitest run --sequence.shuffle
If turning on shuffle reveals failures, you have hidden shared state — fix the offending test's cleanup.
Test selection on PR
# pytest-testmon: runs only tests whose covered code changed
pip install pytest-testmon
pytest --testmon
# Vitest: runs only the tests that import the given source files
npx vitest related src/cart.ts --run
PR feedback drops from minutes to seconds.
A signature CI matrix for a TypeScript lib
strategy:
  matrix:
    node: [18, 20, 22]
    os: [ubuntu-latest, macos-latest, windows-latest]
runs-on: ${{ matrix.os }}
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with:
      node-version: ${{ matrix.node }}
  - run: npm ci
  - run: npm run typecheck
  - run: npm test
Nine combinations catch most cross-platform / cross-version bugs at minimal cost.
Tips
The most useless test suite is the one nobody runs. Optimise for runnable speed: under 30 seconds for the inner loop. If devs only run tests in CI, the feedback latency kills the practice.
Cross-link: pytest covers Python-specific fixtures and parametrize; Vitest for JS/TS unit tests; Playwright for browser e2e. See Code Review for what to look for in test PRs.
Tests are documentation that cannot lie. A green suite that names every behaviour clearly is the best onboarding document a codebase has. When you read a test name and think I have no idea what this is verifying, the test name needs more thought than the assertion.