zsh — alex@blog
alex@blog:~$ ./welcome.sh
alex@blog:~$ ls -la
alex@blog:~/posts$ cat first-principles-devops.md

First Principles: DevOps

TL;DR

Most of what passes for “DevOps” is tool adoption dressed up as strategy. When you strip away inherited conventions and industry hype, only five truths survive: value dies in the gap between developer and user, feedback loops drive improvement, reproducibility reduces risk, handoffs cost superlinearly, and recovery beats prevention. Everything else, from Kubernetes to the “DevOps Engineer” job title, is implementation detail that should be proportional to your actual coordination problem.

This is the third post in my First Principles series, where I take a concept the industry treats as settled and strip it back to what’s actually provable. The first one tackled Software Observability, the second one Software Design. DevOps felt like the natural next target, because few topics have a wider gap between what people say and what they actually do.

The Conventional View

DevOps is the practice of unifying software development and IT operations through shared tooling, culture, and automation. Most organizations interpret this as: adopt CI/CD pipelines, containerize everything, use Infrastructure as Code, monitor with dashboards, and hire people with “DevOps Engineer” in their title.

The conversation typically centers on tools (Kubernetes, Terraform, Jenkins, GitHub Actions) rather than on the underlying problem being solved. This is the first sign that something has gone wrong: when the answer comes before the question.

I’ve seen this pattern up close. Teams that spend months building elaborate CI/CD pipelines before asking whether their actual bottleneck is deployment or something else entirely. A startup with two developers and three microservices running on Kubernetes because “that’s what you do”. An ops team automating a quarterly task into a script that took longer to debug than a year of doing it by hand. The tools aren’t the problem. The problem is reaching for them before understanding what you’re solving.

The Assumptions We Carry

Before deconstructing, it helps to name the conventional wisdom. These are the beliefs most teams treat as self-evident:

# Assumption Origin Verdict
1 DevOps requires specialized “DevOps Engineers” Job market evolution; companies needed a hiring label Baggage
2 CI/CD pipelines are essential Continuous delivery movement (Humble & Farley, 2010) Heuristic
3 Containers (Docker/Kubernetes) are a prerequisite Industry trend post-2014; conference culture Baggage
4 Infrastructure as Code is always better than manual config Repeatability arguments from config management era Heuristic
5 Dev and Ops must be merged into one team Reaction to siloed orgs; the original DevOps manifesto Heuristic
6 More automation is always better Efficiency narrative; “automate everything” mantra Baggage
7 You need a complex observability stack (metrics, logs, traces) Microservices explosion; vendor marketing Heuristic
8 Deployment frequency is a key measure of success DORA metrics research (Forsgren et al.) Heuristic
9 GitOps / declarative config is the correct paradigm Weaveworks marketing + Kubernetes ecosystem conventions Baggage
10 You should adopt the same practices as Google/Netflix/Meta Cargo-culting from tech conference talks Baggage

Three categories emerge:

  • Fundamental truth: survives all scrutiny, keep it
  • Useful heuristic: sometimes true, context-dependent, worth questioning
  • Inherited baggage: convention masquerading as truth, discard it

Four of ten assumptions are outright baggage. Five are heuristics that work in some contexts but get applied universally. None survived as fundamental truths, which means the bedrock of DevOps lies elsewhere.

What Survives Scrutiny

Strip away the heuristics and baggage, and five irreducible truths remain:

1. The gap between developer and user is where value is destroyed. Software has zero value until it’s running where users can reach it. Every delay, error, miscommunication, or manual step in that gap is pure waste.

2. Feedback loops are the mechanism by which systems improve. You cannot control what you cannot observe. Without knowing what broke, what’s slow, or what users actually do, improvement is impossible. This is an information-theoretic truth. (I explored this deeply in First Principles: Software Observability, where I stripped observability down to its own irreducible core.)

3. Reproducibility reduces risk. If you can recreate an environment or deployment exactly, you can reason about failures and recover from them. Identical inputs produce identical outputs in deterministic systems.

4. Coordination costs grow superlinearly with handoffs. Every time work crosses a team boundary, information is lost, delays are introduced, and accountability diffuses. This is Brooks’s Law and transaction cost economics in action.

5. Systems fail. The question is recovery speed, not failure prevention. Complex systems have emergent failure modes that cannot all be predicted. You cannot enumerate all states of a sufficiently complex system.

What Falls Away

With the bedrock established, it becomes clear what the industry has been carrying unnecessarily.

DevOps Engineer” as a role. The original insight was that the wall between Dev and Ops was the problem. Creating a third silo called “DevOps” that sits between them reintroduces the exact coordination cost the movement was trying to eliminate. This persists because it’s easier to hire a person than to change a culture.

Containers as prerequisite. Containers are one mechanism for achieving reproducibility and isolation. A well-configured VM, a properly managed bare-metal server, or even a PaaS like Heroku can deliver the same fundamental properties. Containers won because they’re composable and portable, but they’re a means, not a truth. Many teams adopted Kubernetes before they had the problem Kubernetes solves.

Automate everything”. Automation has a cost: it must be written, maintained, debugged, and understood. If a task is performed once a quarter, automating it may destroy more value than it creates (maintenance burden > time saved). The fundamental truth is reproducibility, not automation. A well-documented runbook can be more valuable than a brittle script.

GitOps / tool-specific paradigms. These are implementations of reproducibility and auditability, not truths in themselves. The assumption that declarative YAML in a Git repo is the correct interface for all infrastructure is a convention that spread through the Kubernetes ecosystem, not a provable law.

Copying Big Tech. Google built Kubernetes because they run millions of containers across a global fleet. Netflix built Chaos Monkey because they operate at a scale where component failure is a statistical certainty. If you’re running 3 services on 5 machines, adopting their toolchain adds complexity without solving a problem you have.

The Steel Case for Convention

First-principles thinking is powerful, but it’s also dangerous if applied lazily. Some of these conventions exist for good reasons that deserve honest engagement.

But DORA research does show deployment frequency correlates with performance”. It does. Forsgren et al. collected real data across thousands of organizations. The correlation between deployment frequency, lead time, MTTR, and change failure rate is solid. The mistake isn’t measuring these metrics. It’s Goodharting them: optimizing deployment frequency as a goal rather than treating it as a signal that your delivery gap is shrinking. A team deploying 50 times a day through an automated pipeline that nobody understands hasn’t improved, they’ve just moved the bottleneck.

Kubernetes complexity pays off at scale”. True, past a certain threshold. If you’re running dozens of services across multiple teams with different scaling characteristics, container orchestration solves real coordination problems. The line I’d draw: if you have a dedicated platform team and enough services that manual deployment tracking becomes error-prone, Kubernetes starts earning its keep. Below that, it’s overhead.

IaC prevents configuration drift”. It does, and configuration drift causes real outages. The first-principles argument isn’t that IaC is bad, it’s that IaC is one solution to the reproducibility problem, and sometimes a simpler one exists. A team running on a managed PaaS already has reproducibility without writing a single line of Terraform. The question is always: what’s the cheapest mechanism that gets you reliable reproducibility at your scale?

The DevOps Engineer role fills a real skills gap”. Fair. Most developers don’t know networking, and most ops people don’t write application code. The role exists because the skills gap is real. But filling a skills gap by creating a permanent intermediary team is different from filling it by cross-training. The first scales linearly; the second compounds.

The point isn’t that conventions are always wrong. It’s that they should be chosen, not inherited.

Rebuilt From Scratch

Starting from only the five truths, here’s what DevOps looks like designed from zero.

The core insight is that operational complexity should scale with your coordination problem, not with industry trends. Most teams are somewhere on this spectrum but reaching for tools from the wrong end:

quadrantChart
    title Automation Decision Matrix
    x-axis Low Frequency --> High Frequency
    y-axis Low Complexity --> High Complexity
    quadrant-1 Automate
    quadrant-2 Platform team territory
    quadrant-3 Just do it manually
    quadrant-4 Script it
    CI/CD pipelines: [0.9, 0.75]
    Deploys: [0.85, 0.5]
    Monitoring setup: [0.6, 0.8]
    DB migrations: [0.4, 0.7]
    Quarterly reports: [0.15, 0.3]
    SSL cert renewal: [0.2, 0.15]
    Log rotation: [0.7, 0.2]
    Dependency updates: [0.65, 0.35]

The bottom-left quadrant is where most wasted automation effort lives: tasks that happen rarely and aren’t complex, but somebody automated them anyway because “automate everything” sounded right.

1. Minimize the gap, not the tools. The goal is: code to user with minimum waste. For a solo developer, that might be git push triggering a deploy to a PaaS. For a 500-person org, it might involve staged rollouts and canary analysis. The mechanism should be proportional to the coordination problem. Start with the simplest thing that closes the gap, and add complexity only when you hit a real wall.

2. Make the people who build it responsible for running it. Not by creating a “DevOps team,” but by literally making developers own their service in production. This isn’t a cultural aspiration, it’s a structural decision. When the person who writes the code also gets paged at 2 AM, the feedback loop is zero-latency. The org chart should reflect this, not fight it.

3. Invest in observability proportional to your ignorance. You need enough visibility to answer: “Is it working? If not, where is it broken?” For a monolith with 100 users, application logs and an uptime check may suffice. For a distributed system with millions of requests, you need traces, metrics, and structured logging. The investment should match your system’s complexity, not an industry benchmark.

4. Reproducibility through the cheapest mechanism available. IaC is one option. So is imaging. So is a documented manual process with verification steps. So is a PaaS that abstracts infrastructure entirely. The question is: “Can I recreate this environment reliably?” The answer doesn’t require Terraform.

5. Optimize for recovery, not prevention. Instead of building elaborate approval gates and change advisory boards to prevent failures, invest in: fast rollbacks, blue-green deployments, feature flags, and the ability to detect problems within minutes. Accept that failures will happen and make them cheap.

6. Automate only what earns its maintenance cost. Before automating, ask: How often does this run? What’s the cost of doing it manually? What’s the cost of maintaining the automation when it breaks? Many organizations have graveyards of CI/CD pipelines that no one understands and everyone is afraid to touch. A process you run manually but understand is better than automation you can’t debug.

What Changes

Conventional Thinking First-Principles Thinking
Hire DevOps Engineers Make developers own production
Adopt Kubernetes Use the simplest deployment target that works
Automate everything Automate what earns its maintenance cost
Build a full observability stack Observe proportional to system complexity
Prevent failures with process Make failures cheap and recovery fast
Follow DORA metrics Measure the gap between “code done” and “user has it” directly
Adopt GitOps Achieve reproducibility by whatever means is cheapest
Copy Big Tech practices Solve your actual problems at your actual scale

The pattern is consistent: conventional DevOps optimizes for the wrong proxy. It measures deployment frequency instead of delivery gap. It adds tools instead of removing handoffs. It automates for automation’s sake instead of for value.

Right-Sizing Checklist

The hardest part of first-principles DevOps is knowing where you actually sit. Here’s a rough guide:

Solo developer or small team (1-5 people, < 5 services)

  • [ ] Deploy target: PaaS, single server, or managed container service. You don’t need Kubernetes.
  • [ ] CI/CD: A simple pipeline that runs tests and deploys on merge. One file, not a YAML empire.
  • [ ] Observability: Application logs + uptime monitoring + error alerting. That’s it.
  • [ ] IaC: Optional. If your infra fits in one cloud console page, a documented setup guide may be enough.
  • [ ] Automation: Automate the deploy. Script the rest only when it starts hurting.

Growing team (5-20 people, 5-20 services)

  • [ ] Deploy target: Containers start earning their keep here, especially if services have different scaling needs.
  • [ ] CI/CD: Standardized pipelines per service, but owned by the teams that use them, not a central DevOps team.
  • [ ] Observability: Structured logging + basic metrics + alerting on SLOs. Consider traces if debugging cross-service issues takes hours.
  • [ ] IaC: Worth it now. Configuration drift across 10+ services causes real outages.
  • [ ] Automation: Automate deployment, environment provisioning, and the tasks that run weekly or more. Runbooks for everything else.

Scaling org (20+ people, 20+ services)

  • [ ] Deploy target: Container orchestration (Kubernetes or equivalent) becomes justified. Consider a platform team.
  • [ ] CI/CD: Platform-provided golden paths that teams can customize, not enforce.
  • [ ] Observability: Full stack (metrics, logs, traces) with SLO-based alerting. The cost of not having this exceeds the cost of running it.
  • [ ] IaC: Non-negotiable. Everything in code, reviewed, versioned.
  • [ ] Automation: Invest in self-service infrastructure. The platform team’s job is to make the right thing the easy thing.

The boundaries are fuzzy. The principle is: match your operational investment to your actual coordination complexity.

What to Do About It

Audit your toolchain for cargo-culting. For each tool in your stack, ask: “What specific problem at our scale does this solve?” If the answer is “everyone uses it” or “we might need it someday,” it’s a candidate for removal. Simpler systems have fewer failure modes.

Kill the DevOps team (structurally, not the people). Redistribute operational ownership to the teams that build the software. Platform engineering, building internal tools that reduce friction, is valuable. A team that acts as a gateway between dev and prod is not.

Right-size your process to your coordination cost. A 5-person startup and a 500-person enterprise have fundamentally different coordination problems. The startup’s “DevOps” might be a Makefile and a deploy script. The enterprise’s might be a platform team and a service mesh. Neither is wrong; what’s wrong is applying one to the other.

Measure the end-to-end gap, not proxy metrics. Instead of tracking deployment frequency as an end in itself, measure: “How long from a developer finishing a feature to a user being able to use it?” Everything that extends that duration is waste to be examined, whether it’s a slow CI pipeline, a manual approval, or a 3-week release train.

Treat recovery speed as a first-class investment. Every hour spent making rollbacks faster, deploys more reversible, and failures more detectable pays dividends that prevention-oriented processes (change advisory boards, multi-stage approvals) rarely match.


References

  1. First-principles analysis generated in-session
  2. Humble, J. & Farley, D. (2010). Continuous Delivery. Addison-Wesley
  3. Forsgren, N., Humble, J., & Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps. IT Revolution Press
  4. First Principles: Software Observability - Related post applying the same first-principles method to observability
alex@blog:~/posts$ ls -1 | head