title: Site Reliability Engineer
slug: site-reliability-engineer
aliases:
  - SRE
  - Reliability Engineer
  - Production Engineer
category: Technology
tags:
  - reliability
  - operations
  - observability
  - automation
  - infrastructure
difficulty: advanced
summary: >-
  Treats operations as a software problem: budgets reliability against velocity
  with SLOs and error budgets, and engineers toil away so systems stay boring.
contributors:
  - soul-atlas
last_reviewed: null
provenance: ai-generated
created: '2026-06-26'
updated: '2026-06-26'
related:
  - slug: software-engineer
    type: adjacent
    note: shares code fluency aimed at survival rather than features
  - slug: devops-engineer
    type: related
    note: both automate the path to production; SRE is SLO-driven
  - slug: systems-administrator
    type: progression
    note: the operational ancestor before ops became software
  - slug: cloud-architect
    type: collaboration
    note: designs the redundant substrate SREs keep alive
  - slug: security-engineer
    type: adjacent
    note: shares incident-response discipline for a different threat
specializations:
  - Platform Reliability Engineer
  - Database Reliability Engineer
country_variants: []
sources:
  - title: Site Reliability Engineering (Google)
    url: https://sre.google/books/
    kind: book
  - title: Release It!
    kind: book
  - title: Designing Data-Intensive Applications
    kind: book
status: draft
reviewers: []
sections:
  - heading: Purpose
    markdown: >-
      A site reliability engineer exists to keep systems running well enough
      that the

      business and its users can trust them, while still letting those systems
      change

      fast. The discipline was born at Google from a refusal to choose between
      two bad

      options: a fragile system that ships features quickly, or a frozen system
      that

      never breaks because nothing ever moves. An SRE makes reliability a
      measurable,

      engineered property — something you budget, design for, and trade against

      velocity on purpose — rather than a thing you hope for and apologize for
      when it

      fails.
  - heading: Core Mission
    markdown: >-
      Run services at a deliberately chosen level of reliability — high enough
      that

      users don't notice, low enough that the team keeps shipping — by treating

      operations as a software problem.
  - heading: Primary Responsibilities
    markdown: >-
      The visible work is responding to pages, but the actual work is making
      pages

      rare. An SRE defines what "working" means in numbers (SLIs and SLOs),
      measures

      it, and uses the gap between target and reality — the error budget — to
      govern

      how aggressively the product team can ship. They build the automation that

      removes humans from the routine path: deploys, rollbacks, capacity
      changes,

      failovers. They design systems to degrade gracefully and recover
      automatically.

      They run incidents as incident commander, write blameless postmortems, and
      turn

      each outage into a permanent fix. They do capacity planning so the service

      doesn't fall over under entirely predictable load. And they hard-cap the
      time

      spent on toil so the rest can go to engineering that makes toil disappear.
  - heading: Guiding Principles
    markdown: >-
      - **Reliability is a feature with a budget, not an absolute.** 100% is the
      wrong
        target for almost everything; it's infinitely expensive and users can't tell
        past a point. Pick a number and defend it.
      - **Hope is not a strategy.** If you can't measure it, you can't promise
      it.
        Every claim must trace to an SLI.
      - **Toil is the enemy.** Manual, repetitive, automatable work that scales
      with
        the service is a tax on the future. Cap it, then engineer it away.
      - **Blameless or it's worthless.** People who fear punishment hide
      information,
        and hidden information is how the next outage gets worse.
      - **Make the system boring.** Excitement in operations means something is
      wrong.
        Aim for days where nothing happens.
      - **Automate yourself out of the loop, but keep the human override.** The
      robot
        handles the common case; the human handles what the robot didn't foresee.
      - **The error budget belongs to everyone.** When it's spent, you stop
      shipping
        risky changes — not as punishment, as physics.
  - heading: Mental Models
    markdown: >-
      - **SLI / SLO / SLA.** The indicator is what you measure (e.g., the
      fraction of
        requests served under 300 ms); the objective is your internal target; the
        agreement is the contractual promise with consequences. Engineer to the SLO,
        which sits comfortably tighter than the SLA.
      - **The error budget.** If your SLO is 99.9% availability, you have 0.1% to
        spend (about 43 minutes in a 30-day month) on deploys, experiments, and bad luck.
        Spend it deliberately; when it's gone, risky changes stop until it recovers.
      - **The four golden signals.** Latency, traffic, errors, saturation. Watch
      these
        and you catch most of what matters before users do.
      - **MTTR over MTBF.** You can't prevent every failure, so optimize how
      fast you
        recover. A system that fails often but heals in seconds beats one that fails
        rarely and stays down for hours.
      - **Cascading failure and the thundering herd.** Failures don't stay
      local;
        retries and reconnections amplify them. Model how a small fault becomes a total
        outage, and break the amplification with backoff, jitter, and load shedding.
      - **Defense in depth for availability.** Redundancy, isolation
      (cells/shards),
        graceful degradation, and circuit breakers stacked so no single failure is
        total.
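
      The error-budget arithmetic above can be checked with a few lines of Python. This is a minimal sketch, assuming an availability SLO expressed as a fraction and a 30-day window; the function names `error_budget_minutes` and `budget_remaining` are illustrative and not part of any standard tool.

        ```python
        def error_budget_minutes(slo: float, window_days: int = 30) -> float:
            """Allowed downtime, in minutes, for an availability SLO over a window."""
            if not 0.0 < slo < 1.0:
                raise ValueError("slo must be a fraction strictly between 0 and 1")
            window_minutes = window_days * 24 * 60
            # The budget is the complement of the SLO applied to the whole window.
            return (1.0 - slo) * window_minutes


        def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
            """Fraction of the budget still unspent; a negative value means overspent."""
            budget = error_budget_minutes(slo, window_days)
            return 1.0 - downtime_minutes / budget
        ```

      Under these assumptions, `error_budget_minutes(0.999)` comes to about 43.2 minutes, which matches the figure quoted above, while a 99.99% target leaves roughly 4.3 minutes.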
  - heading: First Principles
    markdown: >-
      - Everything fails eventually, including the thing you added to prevent
      failure.

      - A system you can't observe is a system you can't operate.

      - The cost of an outage is not linear; the first minute and the sixtieth
      are not
        worth the same.
      - Humans are the least reliable component and the most expensive to scale
      —
        design accordingly.
      - You will be paged for the failure mode you didn't imagine, so build for
      unknown
        failure, not just the catalog of known ones.
  - heading: Questions Experts Constantly Ask
    markdown: >-
      - What does "working" mean for this service, in a number a user would
      feel?

      - How much of our error budget is left, and what are we spending it on?

      - What happens when this dependency is slow rather than down? (Slow is
      worse.)

      - Is this alert actionable, or noise that trains people to ignore pages?

      - If this fails at 3 a.m., can the on-call fix it from a phone with a
      runbook?

      - What's the blast radius, and how do we shrink it?

      - Are we adding toil or removing it with this change?
  - heading: Decision Frameworks
    markdown: >-
      - **Error-budget policy.** Budget healthy → ship freely. Budget low → slow
      down,
        add testing and canaries. Budget exhausted → feature freeze until reliability
        is restored. Agreed in advance so it isn't relitigated mid-crisis.
      - **Alert triage by actionability.** Every alert must be urgent, actionable, and
        tied to user impact. If it fails any of those tests, it's a dashboard line, not a page.
      - **Build vs. adopt for tooling.** Prefer the platform's managed primitive
      over a
        bespoke control plane unless reliability is your actual product.
      - **Mitigate first, root-cause second.** During an incident, restoring
      service
        outranks understanding why. Roll back, fail over, shed load — then investigate.
      - **Risk-adjusted capacity.** Plan for peak demand plus a buffer for the
      failure
        of one redundant unit (N+1), not the average load.
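
      The N+1 rule in the last item reduces to a small calculation. A rough sketch, assuming identical units with a known per-unit capacity; `units_needed` and its parameters are hypothetical names chosen for illustration.

        ```python
        import math


        def units_needed(peak_load: float, unit_capacity: float, spare_units: int = 1) -> int:
            """Units required to carry peak load while `spare_units` may be lost."""
            if peak_load < 0 or unit_capacity <= 0:
                raise ValueError("peak_load must be >= 0 and unit_capacity > 0")
            # Size for the peak, not the average, then add the redundancy margin.
            return math.ceil(peak_load / unit_capacity) + spare_units
        ```

      For example, a peak of 4,500 requests per second on units that each handle 1,000 gives `units_needed(4500, 1000) == 6`: five to carry the peak plus one that can fail.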
  - heading: Workflow
    markdown: >-
      1. **Define.** Agree SLIs/SLOs with the product owner; write them down
      where the
         whole team sees them.
      2. **Instrument.** Make the service emit the golden signals; you can't
      manage
         what you can't see.
      3. **Onboard.** Before taking on-call, demand a runbook, a rollback path,
      and a
         dashboard. No observability, no on-call.
      4. **Operate.** Watch SLOs and budgets. Respond to pages with the incident
         process: declare, assign roles, mitigate, communicate.
      5. **Postmortem.** Within days, write the blameless analysis: timeline,
         contributing factors, action items with owners. Track them to done.
      6. **Engineer.** Spend the protected non-toil time killing the root cause:
         automation, better defaults, removed single points of failure.
      7. **Plan capacity.** Forecast growth, run load tests, provision ahead of
      need.

      8. **Review.** Periodically renegotiate SLOs as usage and expectations
      shift.
  - heading: Common Tradeoffs
    markdown: >-
      - **Reliability vs. feature velocity.** The point of the error budget is
      to make
        this trade explicit instead of a fight between teams.
      - **Cost vs. redundancy.** Each additional nine of availability multiplies the cost;
        buy the nines the business needs, not the ones that sound impressive.
      - **Automation vs. understanding.** Automation that nobody understands
      becomes a
        black box that fails mysteriously. Automate, but keep the mental model.
      - **Fast rollback vs. fast forward-fix.** Rolling back is usually safer,
      but some
        state changes can't be un-done; know which deploys are one-way doors.
      - **Centralized platform vs. team autonomy.** A golden path reduces
      variance but
        can become a bottleneck; balance paved road against off-road freedom.
  - heading: Rules of Thumb
    markdown: >-
      - If an alert isn't actionable, delete it.

      - Page on symptoms (users hurting), not causes (a disk filling) — until
      the
        cause itself becomes the symptom.
      - Slow is the new down; timeouts and degraded latency cause cascades.

      - Retries without backoff and jitter are how you DDoS yourself.

      - A runbook a tired human can't follow at 3 a.m. is fiction.

      - Test the failover, or assume it doesn't work.

      - The postmortem isn't done until an action item changes the system.
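

      The rule about retries can be made concrete. The sketch below shows one common approach, capped exponential backoff with full jitter, assuming the operation is safe to retry; `jittered_delay`, `call_with_retries`, and the default values are illustrative choices, not a prescribed implementation.

        ```python
        import random
        import time


        def jittered_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
            """Sleep time for a retry: uniform in [0, min(cap, base * 2**attempt)]."""
            ceiling = min(cap, base * (2 ** attempt))
            # Full jitter: randomizing the whole interval keeps clients from
            # retrying in lockstep against a struggling dependency.
            return random.uniform(0.0, ceiling)


        def call_with_retries(operation, max_attempts: int = 5, sleep=time.sleep):
            """Call operation(); on failure, back off with jitter and try again."""
            for attempt in range(max_attempts):
                try:
                    return operation()
                except Exception:  # a real caller would catch only retryable errors
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the last error
                    sleep(jittered_delay(attempt))
        ```

      The `sleep` parameter is injectable only so the behavior can be exercised in tests without real waiting.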
  - heading: Failure Modes
    markdown: >-
      - **Alert fatigue.** So many pages that the real one gets missed in the
      noise.

      - **Hero culture.** Rewarding the engineer who stays up all night instead
      of the
        fix that prevents the all-nighter.
      - **Automation without a kill switch.** A self-healing loop that
      confidently
        heals the wrong thing, at machine speed, across the whole fleet.
      - **SLOs nobody enforces.** Targets that exist on a wiki but never gate a
        release, so they teach the org that reliability is optional.
      - **Treating capacity as infinite.** Trusting autoscaling to save you from
      a load
        pattern your architecture can't actually handle.
      - **Postmortems as theater.** Documents written, action items never
      closed,
        same outage six months later.
  - heading: Anti-patterns
    markdown: >-
      - **Snowflake servers** — hand-tuned, undocumented, irreplaceable hosts.

      - **Pager as the monitoring strategy** — discovering problems only when
      humans
        are woken up.
      - **Big-bang deploys** — shipping to 100% with no canary or staged
      rollout.

      - **Blame in postmortems** — naming a person instead of fixing a system.

      - **Over-nine-ing** — chasing 99.999% when users can't tell it from 99.9%
        and it costs ten times as much.
  - heading: Vocabulary
    markdown: >-
      - **SLI / SLO / SLA** — service level indicator / objective / agreement.

      - **Error budget** — the allowed unreliability (1 − SLO) you may spend.

      - **Toil** — manual, repetitive, automatable, no-enduring-value
      operational work.

      - **MTTR / MTBF** — mean time to recovery / mean time between failures.

      - **Golden signals** — latency, traffic, errors, saturation.

      - **Canary** — a small slice of traffic sent to a new version to detect
      harm
        before rollout.
      - **Brownout** — degrading non-essential features to keep the core up.

      - **Blast radius** — how much breaks when one thing does.

      - **Cascading failure** — a local fault that amplifies into a system-wide
      outage.

      - **Load shedding** — dropping low-priority work to protect the system.
  - heading: Tools
    markdown: >-
      - **Observability stack** — Prometheus, Grafana, OpenTelemetry,
      distributed
        tracing; your senses in production.
      - **Incident tooling** — PagerDuty/Opsgenie for paging, status pages,
      on-call
        schedules.
      - **Infrastructure as code** — Terraform and config management so the
      fleet is
        reproducible cattle, not pets.
      - **Orchestration** — Kubernetes and the autoscalers, schedulers, and
      operators
        that move work around failure.
      - **Chaos engineering** — fault-injection tools to test failure before it
      tests
        you.
      - **SLO tooling** — error-budget burn-rate dashboards gating releases.
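

      Burn rate, the number such dashboards typically plot, is commonly defined as the observed error ratio divided by the error ratio the SLO allows. A minimal sketch under that definition; the function names and the 30-day window are assumptions made for illustration, and `hours_until_exhausted` further assumes a full budget at the start and a constant rate.

        ```python
        def burn_rate(bad_events: int, total_events: int, slo: float) -> float:
            """Observed error ratio divided by the error ratio the SLO allows."""
            if total_events <= 0:
                return 0.0  # no traffic, nothing burned
            allowed_ratio = 1.0 - slo
            return (bad_events / total_events) / allowed_ratio


        def hours_until_exhausted(rate: float, window_days: int = 30) -> float:
            """Hours a full budget lasts if the given burn rate stays constant."""
            if rate <= 0:
                return float("inf")
            return window_days * 24 / rate
        ```

      A burn rate of 1 spends the budget exactly over the window. With a 99.9% SLO, 50 failed requests out of 10,000 gives a burn rate of about 5, which would drain a 30-day budget in roughly 144 hours.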
  - heading: Collaboration
    markdown: >-
      SREs sit at the seam between development and operations, and the
      relationship is

      the job. With product and software engineers, the SRE negotiates SLOs and
      error

      budgets, pushes reliability requirements left into design, and reviews

      production-readiness before launch. With security engineers, they share
      the

      incident-response muscle; with platform and DevOps teams, the paved road.
      The

      recurring tension is ownership: SRE works best when development teams
      carry their

      own pager and SRE provides the platform, the standards, and the hard
      problems —

      not when SRE becomes a dumping ground for everyone else's operational
      debt. The

      healthiest arrangement is a consulting/embedding model with the right to
      hand a

      service back if it doesn't meet the reliability bar.
  - heading: Ethics
    markdown: >-
      SREs hold a quiet power: the on-call engineer's judgment at 3 a.m. can
      keep a

      hospital's records reachable or a payment system honest. The duties that
      follow:

      tell the truth in postmortems even when it implicates a process you built;

      refuse to paper over systemic risk with heroics that hide the real
      fragility;

      resist pressure to set SLOs the team knows it can't meet just to win a
      deal;

      protect the humans on the rotation from burnout, because a fried on-call
      is itself

      a reliability risk; and be honest about what redundancy buys versus what's

      theater. When the business wants to advertise availability it can't
      deliver, the

      SRE's job is to say so.
  - heading: Scenarios
    markdown: >-
      **Error budget exhausted mid-quarter.** A team has burned through its 0.1%

      availability budget by week three, mostly from rushed deploys. The product

      manager wants to ship a launch anyway. The expert SRE doesn't argue about
      this

      launch in isolation; they point to the error-budget policy agreed months
      ago: no

      risky changes until the budget recovers. The conversation shifts from
      "should we

      ship?" to "how do we restore reliability fastest?" — which surfaces that
      two

      outages came from a missing canary. The launch slips a week, the canary
      gets

      built, and the policy holds because it wasn't invented in the heat of the
      moment.


      **A latency-driven cascade.** A downstream database gets slow — not down,
      slow.

      Upstream services hit their timeouts, retry, and the retries pile onto the

      already-struggling database, driving it fully over. The on-call's instinct
      isn't

      to restart the database; it's to stop the amplification: load shedding, a
      circuit

      breaker so callers fail fast, jitter on the backoff. Service stabilizes
      within

      minutes at degraded capacity. The postmortem's real fix isn't "make the
      database

      faster" — it's that retries had no backoff, so the system attacked itself.
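

      A circuit breaker of the kind used in this scenario might look like the sketch below. It is deliberately minimal and rests on stated assumptions: a single-threaded caller, a breaker that opens after a fixed number of consecutive failures, and a fixed cooldown before one trial call is allowed. `CircuitBreaker`, `CircuitOpenError`, and the thresholds are hypothetical names and values.

        ```python
        import time


        class CircuitOpenError(RuntimeError):
            """Raised when the breaker refuses to call the dependency."""


        class CircuitBreaker:
            """Open after N consecutive failures; allow a trial call after a cooldown."""

            def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
                self.failure_threshold = failure_threshold
                self.cooldown_seconds = cooldown_seconds
                self.clock = clock
                self.consecutive_failures = 0
                self.opened_at = None  # None means the circuit is closed

            def call(self, operation):
                if self.opened_at is not None:
                    if self.clock() - self.opened_at < self.cooldown_seconds:
                        # Fail fast: do not add load to the struggling dependency.
                        raise CircuitOpenError("circuit open")
                    # Cooldown elapsed: fall through and allow one trial call.
                try:
                    result = operation()
                except Exception:
                    self.consecutive_failures += 1
                    if self.consecutive_failures >= self.failure_threshold:
                        self.opened_at = self.clock()  # open, or re-open after a failed trial
                    raise
                # Success closes the circuit and resets the failure count.
                self.consecutive_failures = 0
                self.opened_at = None
                return result
        ```

      While the circuit is open, callers get an immediate error instead of waiting on a timeout, which is what stops the retries from piling onto the database.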


      **Deciding an SLO for a new service.** Engineering proposes 99.99%. The
      SRE asks

      what users actually need: this is an internal batch-reporting service
      whose

      consumers read it once a day. Four nines would demand multi-region
      failover and

      on-call coverage that costs more than the service is worth. They set
      99.5%, write

      it down, and redirect the saved effort to a user-facing service where the
      nines

      are felt. Reliability spent where it matters.
  - heading: Related Occupations
    markdown: >-
      The site reliability engineer is a software engineer who optimizes for the

      system's survival over its features, sharing the same code fluency aimed
      at a

      different objective. DevOps engineers overlap heavily — both automate the
      path to

      production — but SRE is defined by SLO-driven operation rather than the
      delivery

      pipeline. Systems administrators are the operational ancestor, before
      operations

      became a software-engineering problem. Cloud architects design the
      redundant

      substrate SREs keep alive, and security engineers share the
      incident-response

      discipline applied to a different threat.
  - heading: References
    markdown: |-
      - *Site Reliability Engineering* — Beyer, Jones, Petoff, Murphy (Google)
      - *The Site Reliability Workbook* — Beyer et al.
      - *Designing Data-Intensive Applications* — Martin Kleppmann
      - *Release It!* — Michael Nygard
      - USENIX SREcon proceedings
