DevOps

A culture, set of practices, and toolchain that shortens the loop between writing software and running it in production.

Definition

DevOps is the practice — equal parts cultural and technical — of collapsing the wall between the people who write software (Dev) and the people who run it in production (Ops), so that the same team owns a service end-to-end and ships changes in small, automated, low-risk increments. It is not a job title, not a tool, and not a stage in a pipeline; it is the operating model that produces continuous integration, continuous delivery, observability-driven incident response, and infrastructure-as-code. Reach for the DevOps framing whenever the bottleneck between "we have a code change" and "it's serving traffic" is human handoff rather than engineering complexity.

Why it matters

Before DevOps became the default, software organizations were structured around quarterly releases and a hard handoff between development and operations, a model that produced large, risky deployments, long lead times for trivial fixes, and a brittle understanding of how systems behaved in production. The evidence against that model is substantial: the annual State of DevOps Report from DORA (DevOps Research and Assessment, now part of Google Cloud) has repeatedly found that "elite" performers (those who deploy multiple times per day, with lead times under an hour, change-failure rates under 15%, and time-to-restore under an hour) also do better than "low" performers on the commercial and organizational outcomes the research tracks. Those four signals (lead time, deployment frequency, change-failure rate, time to restore) are now widely used engineering KPIs, and they exist precisely because they measure how well a team has adopted the DevOps loop.

The cost of not working this way compounds. Without trunk-based development and automated tests, every release is a research project. Without infrastructure-as-code, every environment is a snowflake and every incident is a forensics exercise. Without observability, your "is it broken?" answer comes from customer tickets. The DevOps model is what makes a small team able to operate a system that would have needed a war room ten years ago.

DevOps overlaps with — but is not the same as — three adjacent disciplines that frequently get conflated:

  • SRE (Site Reliability Engineering) is Google's prescriptive answer to "how do we operationalize the DevOps culture?": explicit error budgets, SLI/SLO contracts, a hard cap on operational toil, blameless postmortems, and defined on-call rotations. SRE is one implementation of DevOps; DevOps is the broader cultural envelope.
  • Platform Engineering is the practice of treating internal developer tools (CI templates, IaC modules, golden paths, internal developer platforms / IDPs) as products with their own users — application engineers — and their own roadmap. It is how a mature DevOps org scales beyond a single team.
  • DevSecOps is DevOps with security shifted left into the same pipeline: SAST, dependency scanning, secret scanning, policy-as-code, and supply-chain attestation as gates inside the build, not afterthoughts inside a quarterly review.

How it works

DevOps in practice is a loop — Plan → Code → Build → Test → Release → Deploy → Operate → Observe → back to Plan — implemented as a single automated pipeline that takes a commit and either lands it in production or rejects it with a clear failure. The pipeline is the artifact; the team's effectiveness is largely a function of how short, how fast, and how trustworthy that pipeline is.

The pipeline (CI/CD)

Continuous Integration (CI) is the practice of merging every developer's work into a shared trunk many times a day, with an automated build + test gate. The signal is "is trunk green?" — and the gate must be fast enough (typically <10 minutes) that engineers run it on every change. Continuous Delivery (CD) extends that gate to produce a deployable artifact at every green commit, ready to be released to production with one click. Continuous Deployment (the more aggressive variant) removes the click — every green commit lands in production automatically, gated only by progressive-rollout signals.
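The trunk gate described above can be sketched as an ordered list of checks that stops at the first failure. This is a minimal illustration, not any CI system's API: `Stage`, `run_gate`, and the lambda stand-ins for real jobs are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[], bool]  # returns True when the check passes


def run_gate(stages: List[Stage]) -> Tuple[bool, List[str], Optional[str]]:
    """Run checks in order and stop at the first failure.

    Returns (green, names of checks that passed, name of the failed check or None).
    """
    passed: List[str] = []
    for stage in stages:
        if not stage.run():
            return False, passed, stage.name
        passed.append(stage.name)
    return True, passed, None


# Hypothetical stand-ins; a real gate would invoke a linter, test runner, and build.
stages = [
    Stage("lint", lambda: True),
    Stage("unit-tests", lambda: False),  # simulated failing test run
    Stage("build", lambda: True),
]

green, passed, failed_at = run_gate(stages)
print("green" if green else f"red at {failed_at}; passed so far: {passed}")
```

Failing fast keeps the feedback loop short: a red lint step reports in seconds instead of waiting for the build to finish.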

A modern CI/CD pipeline has these stages, in roughly this order: source-control trigger → linters and formatters → unit tests → build (compile, container image) → integration tests → security scans (SAST, dependency-CVE, secret-leak) → artifact publish → infrastructure-as-code plan → staging deploy → smoke tests → progressive rollout to production (canary → percentage rollouts → 100%) → post-deploy monitoring and auto-rollback on SLO regression.
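The last two stages, progressive rollout and auto-rollback, can be sketched as a loop over traffic percentages that aborts when an error-rate check regresses. The step sizes, the `error_rate_at` callback, and the 1% threshold are illustrative assumptions; a real rollout would read these signals from the observability backend.

```python
from typing import Callable, List, Tuple


def progressive_rollout(
    steps: List[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[str, int]:
    """Walk traffic percentages in order; roll back on the first regression.

    Returns ("promoted", final percent) or ("rolled_back", percent that failed).
    """
    if not steps:
        raise ValueError("steps must contain at least one traffic percentage")
    for percent in steps:
        if error_rate_at(percent) > max_error_rate:
            return "rolled_back", percent
    return "promoted", steps[-1]


# Hypothetical observations: the new version degrades once it takes 50% of traffic.
observed = {1: 0.001, 10: 0.002, 50: 0.030, 100: 0.030}

bad_release = progressive_rollout([1, 10, 50, 100], lambda p: observed[p], 0.01)
good_release = progressive_rollout([1, 10, 50, 100], lambda p: 0.001, 0.01)
print(bad_release, good_release)
```

The canary step limits blast radius: in the simulated bad release, the regression is caught at 50% of traffic and never reaches 100%.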

Infrastructure as Code (IaC)

Production environments are no longer mutated by hand. The cluster, the database, the networking, the IAM policies — all declared in version-controlled source (Terraform, Bicep, CloudFormation, Pulumi, Kubernetes manifests), reviewed by PR, applied by the pipeline. The reasons are familiar by now: reproducibility, drift detection, audit trail, blast-radius limits via PR review, and the ability to rebuild a region from a tag.

Two patterns dominate. Declarative, pipeline-applied IaC (Terraform, Pulumi) describes desired state and lets the tool compute the diff against current state each time the pipeline runs it; this is the dominant pattern outside Kubernetes. GitOps (Argo CD, Flux) treats a Git repository as the source of truth for cluster state and runs a reconciler inside the cluster that continuously pulls and converges; this is the dominant pattern inside Kubernetes.
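Both patterns rest on the same core operation: compare desired state with current state and derive the actions that close the gap. The sketch below shows that idea with plain dictionaries. `plan` and `apply_plan` are hypothetical helper names, not Terraform or Argo CD APIs, and real tools also handle dependencies, ordering, and partial failure.

```python
from typing import Dict, List, Tuple

State = Dict[str, dict]  # resource name -> configuration


def plan(desired: State, current: State) -> List[Tuple[str, str]]:
    """Compute the actions needed to move current state to desired state."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_plan(desired: State, current: State) -> State:
    """Apply the planned actions and return the resulting state."""
    new_state = dict(current)
    for action, name in plan(desired, current):
        if action == "delete":
            del new_state[name]
        else:  # create and update both copy the desired configuration
            new_state[name] = desired[name]
    return new_state


desired = {"web": {"replicas": 3}, "db": {"size": "large"}}
current = {"web": {"replicas": 2}, "cache": {"size": "small"}}

first_plan = plan(desired, current)
converged = apply_plan(desired, current)
print(first_plan)
print(plan(desired, converged))  # empty once the states match
```

A pipeline-applied tool runs this comparison when triggered; a GitOps reconciler effectively runs it on a loop, so a manual change to `current` shows up as a non-empty plan on the next pass.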

Observability and incident response

You can't operate what you can't see. The three pillars are metrics (numeric time series such as request rate, error rate, and latency), logs (structured event records), and traces (causal chains across services). They are emitted by the application and infrastructure, ingested into a managed store (Datadog, Honeycomb, Grafana Cloud, Azure Monitor, CloudWatch), and surfaced through dashboards and alerts. Service Level Indicators (SLIs) are the specific signals that matter for users; Service Level Objectives (SLOs) are the targets ("99.9% of requests under 300 ms over a 30-day window"); the error budget is 1 - SLO, so a 99.9% SLO leaves a 0.1% budget, and it is the unit of currency between "ship faster" and "stabilize what we have." When a service has burned its error budget, the team's error-budget policy should throttle deploy velocity, ideally enforced by the pipeline rather than by negotiation, until the budget recovers.
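The error-budget arithmetic can be made concrete with a small sketch. It assumes a request-based SLI (good requests divided by total requests) and hypothetical traffic figures; the function names are illustrative, not from any monitoring product.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to miss the objective: 1 - SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def budget_remaining(slo: float, total_requests: int, bad_requests: int) -> float:
    """Share of the window's error budget still unspent (negative if overspent)."""
    if total_requests <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = error_budget(slo) * total_requests
    return 1.0 - bad_requests / allowed_bad


def deploys_allowed(slo: float, total_requests: int, bad_requests: int) -> bool:
    """Simple policy: feature deploys continue only while budget remains."""
    return budget_remaining(slo, total_requests, bad_requests) > 0.0


# Hypothetical 30-day window: 1,000,000 requests against a 99.9% SLO,
# which allows roughly 1,000 bad requests.
for bad in (400, 1200):
    remaining = budget_remaining(0.999, 1_000_000, bad)
    print(f"bad={bad} remaining={remaining:.0%} deploy={deploys_allowed(0.999, 1_000_000, bad)}")
```

With 400 bad requests about 60% of the budget is left and deploys continue; at 1,200 the budget is overspent by about 20% and the policy blocks further feature deploys.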

Incident response is blameless, follows a defined on-call rotation, and produces a post-incident review that focuses on systemic root cause rather than individual error. Toil (repetitive, manual, automatable operational work) is tracked and capped; the Google SRE Book puts the limit at 50% of an SRE's time. When toil exceeds the cap, engineering work to automate it takes priority over feature work.

The DORA four

The shared scoreboard much of the industry now uses, originating in DORA's research and now embedded in tools like GitHub Insights, Azure DevOps Analytics, and most APM vendors' "delivery" dashboards. The elite and low thresholds below are indicative, since DORA revises them between report years:

  1. Deployment Frequency — how often does code reach production? Elite: multiple per day. Low: less than once a month.
  2. Lead Time for Changes — commit-to-production wall-clock time. Elite: under one hour. Low: over six months.
  3. Change-Failure Rate — percent of deploys that cause a degraded service. Elite: under 15%. Low: over 45%.
  4. Time to Restore Service — wall-clock time from incident detection to recovery. Elite: under one hour. Low: more than six months.

Pair them: high deployment frequency with low change-failure rate is the only healthy combination. High frequency with high failure is reckless; low frequency with low failure is brittle (the rare bad release is catastrophic).
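The four metrics can be computed from a plain log of deployments. The sketch below is one way to do it, assuming each record carries commit, deploy, detection, and recovery timestamps; the `Deploy` record, the use of medians, and the sample data are illustrative choices, not DORA's survey methodology.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional


@dataclass
class Deploy:
    committed_at: datetime
    deployed_at: datetime
    detected_at: Optional[datetime] = None  # set when the deploy degraded service
    restored_at: Optional[datetime] = None  # set when service recovered

    @property
    def failed(self) -> bool:
        return self.detected_at is not None


def dora_four(deploys: List[Deploy], window_days: int) -> dict:
    """Summarize the four metrics for deploys inside one reporting window."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deploy and a positive window")
    failures = [d for d in deploys if d.failed]
    restores = [d.restored_at - d.detected_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "median_time_to_restore": median(restores) if restores else None,
    }


def minutes(n: int) -> timedelta:
    return timedelta(minutes=n)


# Hypothetical single day with four deploys, one of which degraded service.
t0 = datetime(2024, 1, 1, 9, 0)
deploys = [
    Deploy(t0, t0 + minutes(30)),
    Deploy(t0 + minutes(60), t0 + minutes(100),
           detected_at=t0 + minutes(110), restored_at=t0 + minutes(135)),
    Deploy(t0 + minutes(180), t0 + minutes(200)),
    Deploy(t0 + minutes(300), t0 + minutes(350)),
]

report = dora_four(deploys, window_days=1)
for name, value in report.items():
    print(f"{name}: {value}")
```

Reporting all four from the same record set makes the pairing advice easy to follow: frequency and failure rate come from the same denominator, so neither can be improved in isolation without the other showing it.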

Common pitfalls

  1. Calling a tool "DevOps" doesn't make a team DevOps. Adopting Jenkins/GitHub Actions/Azure Pipelines without changing how teams are structured produces "DevOps theater" — a pipeline, but no shared ownership of production.
  2. Hand-off cultures masquerading as DevOps. A separate "DevOps team" that owns the pipeline is just operations renamed; the model breaks because the dev team has no incentive to make their code operable. The fix is "you build it, you run it" — application teams carry their own pager.
  3. Long-running feature branches. Branches that live for weeks defeat CI's purpose; trunk goes stale, merges become research projects, and the change-failure rate climbs. Use feature flags + trunk-based development instead.
  4. Manual production changes. A kubectl apply -f from a laptop, an az command run by an SRE during an incident — any change not committed to source breaks drift detection and audit. Use break-glass procedures with explicit, time-bounded, audited access; don't normalize bypassing the pipeline.
  5. Coverage as a quality proxy. 95% line coverage with no assertions is worse than 60% coverage with meaningful tests. Track defect-escape rate (bugs found post-deploy / total bugs) instead.
  6. SLOs nobody acts on. An SLO that the team doesn't respect when it's burning — by halting feature work and prioritizing stability — is theater. Wire the SLO to deployment freezes automatically.
  7. Pipelines that run for an hour. A 60-minute CI loop kills the rapid-feedback property that makes CI worth doing. Aim for <10 minutes for the trunk-blocking path; push slower checks to nightly or pre-release.
  8. Security as a gate instead of a service. A "security review" that adds two weeks at the end of every change incentivizes hiding work from security. Shift left: scanners run on every PR, the security team owns the scanners, vulnerabilities flow as PRs into the dev team's backlog.
  9. Hero culture as on-call. A team where one person handles every page is one resignation away from a crisis. Spread on-call across a minimum-sized rotation; if you don't have one, fix the staffing, not the rotation.
  10. Ignoring lead-time for non-code changes. A pipeline that ships application code in 10 minutes but requires a ticket and a four-day wait for a new IAM role is not a 10-minute pipeline — it's a four-day pipeline. Bring the slow parts into the same model.
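Pitfall 5 suggests tracking defect-escape rate instead of coverage. A minimal sketch of that ratio follows; the two teams and their bug counts are invented purely to illustrate the comparison.

```python
def defect_escape_rate(found_post_deploy: int, found_pre_deploy: int) -> float:
    """Bugs found after deploy divided by all bugs found (0.0 when none were found)."""
    total = found_post_deploy + found_pre_deploy
    if total == 0:
        return 0.0
    return found_post_deploy / total


# Hypothetical teams: A has 95% line coverage, B has 60% with stronger assertions.
team_a = defect_escape_rate(found_post_deploy=12, found_pre_deploy=8)
team_b = defect_escape_rate(found_post_deploy=3, found_pre_deploy=17)
print(f"team A escape rate: {team_a:.0%}")
print(f"team B escape rate: {team_b:.0%}")
```

In this invented data the higher-coverage team lets far more bugs reach production, which is the situation a coverage target alone cannot reveal.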

Where to go next

Concrete cheat sheets for the toolchain DevOps practitioners reach for every day:

  • /sections/linux/gh — GitHub CLI; PR review, GitHub Actions workflow management, releases, and the supply-chain attestation surface.
  • /sections/linux/git — the version-control substrate every other DevOps practice sits on top of; trunk-based development requires fluent Git.
  • /sections/linux/az — Azure CLI; az repos, az pipelines, and az devops cover Azure DevOps end-to-end (PR policies, build-and-release pipelines, variable groups, service connections).
  • /sections/zos/zowe — Zowe; the same pipeline-driven mental model applied to z/OS, demonstrating that DevOps is platform-agnostic.

Concept neighbours worth reading alongside this one:

  • /concepts/cloud — DevOps practices were born in the cloud era and assume API-driven infrastructure; the two concepts co-evolved.
  • /concepts/api — every CI/CD pipeline is, mechanically, a sequence of API calls against source control, the cloud, registries, and the observability backend.

Sources

References consulted while writing this concept page.

  • DORA research / State of DevOps — Source for the four delivery metrics (deployment frequency, lead time, change-failure rate, time to restore) and the "elite vs. low performer" thresholds in The DORA four.
  • Google SRE Book — Authoritative reference for SLI/SLO/error-budget, blameless postmortems, the 50% toil cap, and the "SRE is one implementation of DevOps" framing.
  • Google SRE Workbook — How SRE relates to DevOps — Direct source for the SRE-prescribes-how / DevOps-describes-what distinction used in Why it matters.
  • Atlassian — SRE vs DevOps — Cross-vendor framing of the "culture vs engineering discipline" split between DevOps and SRE.
  • Atlassian — CI vs CD vs Continuous Deployment — Authoritative breakdown of the three terms used in The pipeline.
  • OpenGitOps principles — Reference for the GitOps pattern (Git as source of truth, in-cluster reconciler) distinguished from pipeline-applied IaC tools such as Terraform.