Cloud Computing

On-demand, network-accessible, metered access to a shared pool of configurable compute, storage, and platform services.

Definition

Cloud computing is a model for delivering compute, storage, and software services over a network, on demand, billed by metered usage rather than bought up front as capital equipment. The widely cited reference is NIST SP 800-145 (2011), which defines cloud computing in terms of five essential characteristics (on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service), three service models (IaaS, PaaS, SaaS), and four deployment models (private, community, public, hybrid). Reach for the concept whenever the question is "should we run this on someone else's infrastructure, on our own, or on both?", because the answer drives architecture, billing, compliance, and operational ownership in equal measure.

Why it matters

Cloud changed three things about how software is built and operated. First, provisioning latency collapsed from weeks to seconds or minutes: a virtual machine, a database, a queue, or a CDN edge node is now an API call away, which is why most modern architectures assume infrastructure can be created and destroyed by code (Terraform, Bicep, CloudFormation, Pulumi). Second, the economic model inverted: you no longer buy capacity for peak load and let it idle; you rent the capacity you need this hour and stop paying when the load falls. That makes elasticity a first-class architectural concern: an autoscaling group, a serverless function, or a queue-driven worker is often cheaper and more robust than a fixed-capacity fleet. Third, the platform layer absorbed huge swathes of undifferentiated heavy lifting (managed databases, object storage, identity, message brokers, container orchestration, observability), meaning a two-person team can plausibly ship a system that would have needed a far larger ops organization in 2010.
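The economic inversion can be made concrete with a little arithmetic. The sketch below compares paying for peak capacity around the clock with paying only for what each hour needs. The hourly rate and the demand curve are invented for illustration and are not any provider's real pricing.

```python
# Hypothetical numbers for illustration only; not real provider pricing.
HOURLY_RATE = 0.10  # assumed cost of one instance for one hour


def fixed_fleet_cost(hourly_demand, rate=HOURLY_RATE):
    """Provision for the peak hour and pay for that capacity every hour."""
    return max(hourly_demand) * len(hourly_demand) * rate


def elastic_cost(hourly_demand, rate=HOURLY_RATE):
    """Pay only for the instances actually needed in each hour."""
    return sum(hourly_demand) * rate


# One assumed day: 2 instances for 16 quiet hours, 10 for an 8-hour peak.
demand = [2] * 16 + [10] * 8

print(f"fixed fleet: ${fixed_fleet_cost(demand):.2f}")
print(f"elastic:     ${elastic_cost(demand):.2f}")
```

The gap between the two figures depends entirely on how spiky the demand is. A flat demand curve makes the two costs equal, which is one reason steady workloads benefit less from elasticity.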

The downsides are equally consequential, and a serious engineer needs to internalize them. Cloud bills are easy to grow and hard to shrink — egress charges, idle resources, and forgotten dev environments quietly compound; cost is now an architectural decision, not a finance one. Lock-in is real even when each individual service has an "open" equivalent — IAM models, identity flows, networking primitives, and managed-service quirks make portability expensive. And the shared-responsibility model is constantly misread: the provider secures the floor and the walls, but the door, the lock, and the contents of the safe are still yours. Most production cloud incidents are misconfiguration, not provider failure.

How it works

A cloud platform is, at the bottom, a fleet of physical machines in a small number of geographies, exposed to customers through a uniform control plane API (ARM for Azure, AWS API for AWS, Google Cloud API for GCP, …). Every action a user takes — create a VM, attach a disk, grant a role, query a metric — is an HTTPS call against that control plane that authenticates the caller, authorizes the operation, and dispatches the request to the appropriate data plane: a hypervisor for compute, a storage cluster for blobs, a managed database engine for SQL, an L7 load balancer for HTTP. The CLI, the web console, the Terraform provider, and the language SDKs are all thin clients over that same REST surface — they exist because typing JSON request bodies by hand is unergonomic, but the underlying contract is the API.
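The authenticate, authorize, dispatch sequence described above can be modelled in a few lines. This is a toy in-memory sketch, not any provider's real control plane: the tokens, principals, grants, and backend names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical bearer tokens mapped to principals.
TOKENS = {"tok-ci": "ci-pipeline", "tok-audit": "auditor"}

# Hypothetical grants: principal -> allowed (action, resource_type) pairs.
GRANTS = {
    "ci-pipeline": {("create", "vm"), ("create", "disk")},
    "auditor": {("read", "metric")},
}

# Stand-ins for data-plane backends (hypervisor, storage cluster, and so on).
DATA_PLANE = {
    "vm": lambda action, name: f"hypervisor: {action} vm {name}",
    "disk": lambda action, name: f"storage cluster: {action} disk {name}",
    "metric": lambda action, name: f"metrics store: {action} metric {name}",
}


@dataclass
class ApiRequest:
    token: Optional[str]
    action: str
    resource_type: str
    name: str


def handle(request):
    """Authenticate, authorize, then dispatch; returns (status, body)."""
    principal = TOKENS.get(request.token)
    if principal is None:
        return 401, "unauthenticated"
    if (request.action, request.resource_type) not in GRANTS.get(principal, set()):
        return 403, "forbidden"
    backend = DATA_PLANE.get(request.resource_type)
    if backend is None:
        return 404, "unknown resource type"
    return 200, backend(request.action, request.name)


print(handle(ApiRequest("tok-ci", "create", "vm", "web-1")))
print(handle(ApiRequest("tok-audit", "create", "vm", "web-2")))
print(handle(ApiRequest(None, "read", "metric", "cpu")))
```

Every client, whether CLI, console, or SDK, would end up building the equivalent of an ApiRequest, which is why behaviour is consistent across them.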

Three layers stack on top of each other:

  • Infrastructure as a Service (IaaS). Virtual machines, virtual networks, block storage, object storage, managed DNS — the primitives. Azure Virtual Machines, AWS EC2, GCP Compute Engine. The customer manages the OS, the application, and most of the security posture.
  • Platform as a Service (PaaS). Managed runtimes, databases, queues, container orchestration, identity providers — services that hide the OS but expose an API. Azure App Service, AWS RDS, GCP Cloud Run, Azure Kubernetes Service. The customer manages the application and the data.
  • Software as a Service (SaaS). Whole applications consumed over the network — Microsoft 365, GitHub, Salesforce, Datadog. The customer manages users, integrations, and the data they put in.

Five concerns cut across all three layers, and every cloud surface exposes them:

  1. Identity & access. Every API call needs a principal — a user, a service principal, a managed identity, a federated workload identity. Modern best practice avoids long-lived secrets and uses workload identity federation (OIDC trust between, say, GitHub Actions and Azure / AWS / GCP).
  2. Networking. Each provider has a virtual-network primitive (VNet on Azure, VPC on AWS/GCP) with subnets, route tables, NAT gateways, private endpoints, and peering. Egress crosses billing boundaries; ingress crosses security boundaries.
  3. Storage. Object (blob), block (disk), file (NFS/SMB), and table/document stores — each with its own consistency, durability, and access-pattern story. Pick the wrong tier and the bill explodes.
  4. Observability. Metrics, logs, distributed traces, and audit logs are emitted by every service, ingested into a managed store (Azure Monitor, CloudWatch, Cloud Logging), and queried through a vendor-specific language (KQL, the Kusto Query Language; CloudWatch Logs Insights; the Cloud Logging query language).
  5. Cost & quotas. Every resource has a billing meter and most have a soft quota; running out of quota is a far more common production failure than running out of capacity.
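Because quota exhaustion is the more common failure, it helps to classify usage before it reaches the limit. The sketch below is one possible shape for that check. The 70% and 85% thresholds are illustrative defaults, and the quota names and numbers in the snapshot are hypothetical rather than output from any real provider API.

```python
def quota_alert(used, limit, warn=0.70, critical=0.85):
    """Classify quota usage as 'ok', 'warn', or 'critical'."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    ratio = used / limit
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warn"
    return "ok"


# Hypothetical usage snapshot: quota name -> (used, limit).
snapshot = {
    "vcpus/region-a": (68, 100),
    "disks/region-a": (75, 100),
    "public-ips/region-a": (18, 20),
}

alerts = {name: quota_alert(used, limit) for name, (used, limit) in snapshot.items()}
for name, level in alerts.items():
    print(f"{name}: {level}")
```

In practice the snapshot would come from the provider's quota or usage API, and anything at "warn" or above would feed the same alerting channel as other production signals.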

The operational pattern for working with a cloud has converged across vendors and is worth naming explicitly: declare desired state in code (Terraform / Bicep / CloudFormation / Pulumi), apply it through a CI pipeline that authenticates via workload identity, observe drift through provider-side audit logs and policy-as-code scanners, and roll back by re-applying an earlier version of the same declarative source. Imperative one-off CLI commands are fine for exploration and break-glass; they are not how production infrastructure is shaped.
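The declare, apply, and roll-back loop can be sketched as a diff between desired and actual state. This is a deliberately simplified in-memory model, not how Terraform or any other tool is implemented, and the resource names and settings are hypothetical.

```python
def plan(desired, actual):
    """Return the ordered actions that move `actual` to `desired`.

    Both arguments map resource name -> configuration dict.
    """
    actions = []
    for name in sorted(desired.keys() - actual.keys()):
        actions.append(("create", name))
    for name in sorted(desired.keys() & actual.keys()):
        if desired[name] != actual[name]:
            actions.append(("update", name))
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))
    return actions


def apply_plan(desired, actual):
    """Apply the plan to a copy of `actual` and return the new state."""
    new_state = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del new_state[name]
        else:  # create and update both converge on the desired config
            new_state[name] = desired[name]
    return new_state


# Two hypothetical versions of the same declarative source.
v1 = {"web": {"size": "small"}, "db": {"tier": "basic"}}
v2 = {"web": {"size": "large"}, "queue": {"partitions": 4}}

print(plan(v2, v1))               # actions needed to roll forward
state = apply_plan(v2, v1)        # roll forward to v2
rolled_back = apply_plan(v1, state)  # roll back by re-applying v1
print(plan(v1, rolled_back))      # empty plan means no drift
```

Rollback here is just another apply with an older desired state, which is the property that makes declarative sources easier to reason about than a history of one-off commands.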

Multi-cloud and the shared abstractions

"Multi-cloud" rarely means "the same workload running symmetrically on two providers" — that's expensive and almost never load-bearing. In practice it means three separate things: (1) different workloads on different providers because of cost, regional availability, or a regulatory boundary; (2) a single workload anchored on one provider but using SaaS from a second; (3) a portable substrate (Kubernetes, OpenTelemetry, S3-compatible object storage) deployed on top of each provider's IaaS. Kubernetes is the most successful portability layer — every major cloud now sells a managed K8s service (AKS, EKS, GKE) — and the same kubectl apply runs against any of them. Below Kubernetes, portability degrades fast.

Common pitfalls

  1. Egress fees ambush data-heavy workloads. Outbound traffic from the cloud to the public internet (and across regions, and sometimes across AZs) is billed per GB and can easily become one of the largest lines on the invoice. Keep data-heavy processing next to the data so that most bytes flow in rather than out, and use private endpoints and VNet peering for in-cloud cross-service traffic.
  2. Forgotten dev/test resources outlive their usefulness. Every long-lived sandbox is a recurring cost. Tag everything with env= and owner=, then run a weekly sweeper that deletes any resource tagged env=sandbox older than N days.
  3. Default subscriptions / accounts mix dev and prod. One Azure subscription or AWS account per environment (or at minimum, per "blast radius") is the standard; sharing a single subscription between dev and prod inevitably leads to an IAM mistake that touches production data.
  4. Long-lived static credentials in CI. Stored secrets get leaked. Modern providers all support OIDC federation from GitHub Actions / GitLab CI / Buildkite — use a short-lived token issued at job start and bound to a specific repo/branch.
  5. Quotas are silent until they bite. A region or service hits its quota in the middle of a deploy and the failure looks like a generic 4xx. Set proactive alerts on quota usage at 70%/85% and request limit increases in advance for production regions.
  6. Multi-AZ is not multi-region. Managed databases such as RDS, Cosmos DB, and Cloud SQL can replicate across AZs within a region (often as an option you must enable), but replicating across regions requires separate, explicit setup. A regional outage takes every AZ in that region with it; design for it explicitly or accept the risk.
  7. "Serverless" is not free. Per-invocation pricing is great until something starts an infinite loop or a poison-message retry. Always set max-concurrency caps, dead-letter queues, and per-function budgets.
  8. Provider portals lie about cost. The "estimated bill" in the portal is delayed by hours-to-days and excludes some line items. The authoritative source is the billing export to BigQuery / Cost Management exports / AWS Cost & Usage Reports — query that, not the dashboard.
  9. "Region failover" requires more than a database replica. DNS TTLs, application-side connection caching, certificate pinning, and IAM-trust documents all need to be tested end-to-end before you can rely on a failover. Run a game day at least quarterly.
  10. Soft delete and lifecycle policies are not reliably on by default. Blob containers, key vaults, and disk snapshots can usually be configured with soft-delete and retention windows, but the defaults vary by provider, service, and creation path, and many are off. Verify and set them at provisioning time, not after the first incident.
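The sandbox sweeper from pitfall 2 can be sketched as a pure selection function. The resource records below are hypothetical dictionaries with invented dates; a real sweeper would list resources through the provider's SDK and should run in report-only mode before it deletes anything.

```python
from datetime import datetime, timedelta, timezone


def find_expired_sandboxes(resources, max_age_days, now=None):
    """Return ids of resources tagged env=sandbox older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        r["id"]
        for r in resources
        if r.get("tags", {}).get("env") == "sandbox" and r["created"] < cutoff
    ]


# Hypothetical inventory; ids, owners, and dates are invented.
utc = timezone.utc
inventory = [
    {"id": "vm-old-sandbox", "created": datetime(2024, 5, 1, tzinfo=utc),
     "tags": {"env": "sandbox", "owner": "alice"}},
    {"id": "vm-new-sandbox", "created": datetime(2024, 6, 25, tzinfo=utc),
     "tags": {"env": "sandbox", "owner": "bob"}},
    {"id": "db-prod", "created": datetime(2023, 1, 1, tzinfo=utc),
     "tags": {"env": "prod", "owner": "platform"}},
    {"id": "disk-untagged", "created": datetime(2023, 1, 1, tzinfo=utc)},
]

fixed_now = datetime(2024, 6, 30, tzinfo=utc)
expired = find_expired_sandboxes(inventory, max_age_days=14, now=fixed_now)
print(expired)
```

Untagged resources are deliberately left alone in this sketch. Reporting them separately to their likely owners is safer than guessing which environment they belong to.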

Where to go next

Concrete cheat sheets for the cloud command-line surfaces and SDKs the rest of the site covers:

  • /sections/linux/az — Microsoft Azure CLI: end-to-end management of resources, identity, ARM/Bicep deployments, and the Azure DevOps surface (az repos / az pipelines / az devops).
  • /sections/python/awscli — AWS CLI v2: the equivalent control-plane surface for AWS, with --query (JMESPath) filtering that closely mirrors az, plus named --profile credentials.
  • /sections/packages-pip/pip-boto3 — Boto3, the canonical Python SDK for AWS; how to authenticate, paginate, and instrument calls when a CLI invocation isn't enough.
  • /sections/linux/gh — GitHub CLI; the cloud SaaS that hosts most teams' source code, issues, and (increasingly) CI via GitHub Actions.

Concept neighbours worth reading alongside this one:

  • /concepts/api — every cloud surface is, at its core, an HTTPS API; understanding rate limits, idempotency, and versioning translates directly across providers.
  • /concepts/json — the wire format for almost every cloud control plane and the input format for IaC tooling.

Sources

References consulted while writing this concept page.

  • NIST SP 800-145 — The NIST Definition of Cloud Computing — Authoritative source for the five essential characteristics, three service models (IaaS/PaaS/SaaS), and four deployment models used throughout this page.
  • NIST publication page for SP 800-145 — Metadata, citation, and the canonical NIST landing page for the cloud-computing definition.
  • Microsoft Cloud Adoption Framework — Source for the "subscription per environment" and "tag everything with owner + env" operational patterns called out under Common pitfalls.
  • AWS Shared Responsibility Model — Canonical reference for the provider-vs-customer security split summarized in Why it matters.
  • Kubernetes — Overview — Supports the framing of Kubernetes as the dominant cross-cloud portability layer.