title: Systems Administrator
slug: systems-administrator
aliases:
  - Sysadmin
  - IT Administrator
  - Infrastructure Administrator
  - Systems Engineer (Ops)
category: Technology
tags:
  - infrastructure
  - operations
  - uptime
  - identity
  - backups
difficulty: intermediate
summary: >-
  Holds entropy at bay deliberately, keeping the systems people depend on
  available, secure, and recoverable so a single failure never becomes an outage
  or breach.
contributors:
  - soul-atlas
last_reviewed: null
provenance: ai-generated
created: '2026-06-26'
updated: '2026-06-26'
related:
  - slug: site-reliability-engineer
    type: progression
    note: >-
      applies software engineering to ops at scale; descends from the sysadmin
      role
  - slug: devops-engineer
    type: adjacent
    note: blurs dev and ops with pipelines and IaC; the modern sysadmin trends here
  - slug: network-engineer
    type: related
    note: owns the network plumbing the admin depends on
  - slug: database-administrator
    type: specialization
    note: deeper specialization in the data layer
  - slug: security-engineer
    type: collaboration
    note: closest ally on hardening, access, and incident response
  - slug: cloud-architect
    type: adjacent
    note: designs the cloud infrastructure the admin increasingly operates
specializations:
  - Windows/AD Administrator
  - Linux Administrator
  - Cloud/Infrastructure Administrator
  - Storage Administrator
country_variants: []
sources:
  - title: The Practice of System and Network Administration
    kind: book
  - title: UNIX and Linux System Administration Handbook
    kind: book
  - title: Google SRE Book
    url: https://sre.google/books/
    kind: book
status: draft
reviewers: []
sections:
  - heading: Purpose
    markdown: >-
      Organizations run on systems that must simply be there — the file share,
      email,

      the directory that says who you are, the database the business depends on
      — and

      notice them only when they break. A systems administrator's reason for
      being is

      to keep that infrastructure available, secure, recoverable, and quietly
      boring,

      so everyone else can work without thinking about it. The discipline exists

      because complex systems decay, drift, fill up, expire, and get attacked

      continuously, and someone has to hold entropy at bay deliberately.
  - heading: Core Mission
    markdown: >-
      Keep the systems people depend on available, secure, and recoverable — so
      no

      single failure, expired certificate, full disk, or lost laptop becomes an
      outage

      or a breach, and so anything important can be rebuilt from a known good
      state.
  - heading: Primary Responsibilities
    markdown: >-
      The visible work is "fixing computers"; the actual work is risk management
      over

      time. A sysadmin spends their days: provisioning and patching servers and

      endpoints; managing identity and access (Active Directory, LDAP, SSO,
      groups,

      permissions); running backups and — the part that matters — testing
      restores;

      monitoring everything that can fail and getting alerted before users do;
      planning

      capacity; automating the repetitive and dangerous so it's done the same
      way every

      time; managing change so a "quick fix" doesn't take down production; and
      carrying

      the pager. Underneath it all is documentation — the runbook that lets 3
      a.m. you

      recover without the one expert who's on vacation.
  - heading: Guiding Principles
    markdown: >-
      - **If it isn't monitored, it's already broken — you just don't know
      yet.** You
        cannot manage or secure what you can't see.
      - **A backup you haven't restored is a hope, not a backup.** Follow 3-2-1:
      three
        copies, two media types, one off-site (and immutable, against ransomware), then
        test the restore on a schedule — restores fail in ways backups don't.
      - **Least privilege, always.** Every account and service gets the minimum
      access
        to do its job. A compromise's blast radius is defined by the permissions you
        granted before it happened.
      - **Automate the repeatable; document the rest.** Anything done by hand
      twice will
        eventually be done wrong at the worst time. Codify it (Ansible, scripts).
      - **Change is the leading cause of outages.** Most incidents are
      self-inflicted.
        Change deliberately: maintenance windows, tickets, a tested rollback, one change
        at a time.
      - **Patch on a cadence, not a panic.** Unpatched systems are how breaches
      happen;
        reckless patching is how outages happen. Test, stage, then roll.
  - heading: Mental Models
    markdown: >-
      - **The CIA triad.** Confidentiality, Integrity, Availability — the three
      things
        you are protecting, often in tension. Name which you're trading.
      - **Defense in depth.** No single control is trusted. Layer firewall,
      network
        segmentation, host hardening, least privilege, and monitoring so one failure
        isn't a breach — assume each layer will eventually fail.
      - **The 3-2-1-1-0 backup rule.** Three copies, two media, one off-site,
      one
        immutable/offline, zero errors on restore — the anchor for "can I recover?"
      - **RTO and RPO.** Recovery Time Objective (how long can it be down?) and
      Recovery
        Point Objective (how much data can we lose?) drive every backup and HA decision.
        Define them per system *before* the disaster, with the business.
      - **MTBF vs. MTTR.** You can invest in failing less often or recovering
      faster;
        past a point, recovery is cheaper and more reliable than prevention. Make
        recovery cheaper than perfection.
      - **Cattle, not pets.** Treat servers as interchangeable and rebuildable
      from
        config, not hand-tuned snowflakes only one person understands.
      - **The SPOF hunt.** Trace every dependency — power, network, DNS, the one
      disk,
        the one admin who knows the password — and ask what happens when each dies.
  - heading: First Principles
    markdown: >-
      - Every system fills up, expires, drifts, or gets attacked given enough
      time;
        decay is the default state, uptime is maintained.
      - You cannot fix what you cannot reproduce, nor recover what you never
      backed up
        and tested.
      - Complexity you don't understand is risk you can't manage; prefer boring,
        documented, standard configurations.
  - heading: Questions Experts Constantly Ask
    markdown: |-
      - If this dies right now, how do I know, and how fast can I bring it back?
      - Have we actually tested the restore, or just confirmed the backup ran?
      - What's the blast radius if this account or server is compromised?
      - What changed? (Because something almost always did.)
      - Can I automate this so the next person doesn't do it wrong at 3 a.m.?
  - heading: Decision Frameworks
    markdown: >-
      - **Patch now vs. patch in the window.** Classify by severity and
      exposure: an
        actively-exploited internet-facing CVE patches now; a low-risk internal one
        waits for the tested maintenance window.
      - **Build HA vs. accept downtime.** Compare the cost of redundancy against
      the
        business cost of the RTO it buys. Not every system deserves a cluster; some
        deserve a documented restore and an honest SLA.
      - **Change risk triage.** Before any change: what breaks if this is wrong,
      who's
        affected, can I roll back, is now the right time? No rollback plan, no change.
  - heading: Workflow
    markdown: >-
      1. **Inventory and baseline.** You can't protect what you don't know you
      have.
         Maintain an accurate asset and config inventory; define "known good."
      2. **Instrument.** Stand up monitoring and alerting before you need it.
      Alert on
         symptoms users feel, not just raw metrics.
      3. **Harden and least-privilege.** Apply baselines (CIS benchmarks), close
      unused
         ports and services, scope access to groups, not individuals.
      4. **Back up and test restore.** Configure 3-2-1, then schedule actual
      restore
         drills. An untested backup is a liability dressed as safety.
      5. **Patch on cadence.** Test in staging, roll to a canary group, then
      fleet, with
         a rollback path.
      6. **Change with control.** Tickets, windows, one change at a time,
      rollback
         ready. Watch the dashboards after.
      7. **Respond.** When the pager fires: stabilize first, diagnose second.
         Communicate status. Capture the timeline as you go.
      8. **Post-incident.** Blameless review. Fix the systemic cause — the
      missing
         alert, the manual step, the SPOF — and update the runbook.
  - heading: Common Tradeoffs
    markdown: >-
      - **Security vs. usability.** Tighter controls (MFA, lockouts, least
      privilege)
        add friction. Too much and users route around you with shadow IT.
      - **Uptime vs. patching speed.** Every patch is a change that can break;
        deferring it leaves exposure. The window is the negotiated middle.
      - **Cost vs. resilience.** Redundancy, off-site backups, and HA cost money
      for an
        outcome you hope never to use; spend where the RTO/RPO justifies it.
  - heading: Rules of Thumb
    markdown: >-
      - The most likely cause of an outage is the last thing that changed.

      - A monitor with no alert and an alert with no runbook are both
      decorations.

      - Fill alerts at 80% disk, not 99%; capacity problems are slow until
      they're
        sudden.
      - Never test your backups for the first time during a disaster.

      - If only one person can do it or knows the password, that's a SPOF named
      after a
        human.
  - heading: Failure Modes
    markdown: >-
      - **The untested backup.** Backups "succeed" for months; the restore fails
      when
        it finally matters because nobody ever ran it.
      - **Alert fatigue.** So many noisy alerts that the real one is ignored. A
        monitoring system that cries wolf is worse than none.
      - **Privilege sprawl.** Access granted "temporarily" and never revoked,
      until
        everyone is an admin.
      - **No change control.** "Quick fixes" straight into production with no
      ticket, no
        window, no rollback — the most reliable way to cause an outage.
  - heading: Anti-patterns
    markdown: >-
      - **Shared admin accounts** with a password in a spreadsheet — no
      accountability,
        no rotation.
      - **Flat networks** where one compromised endpoint can reach everything.

      - **Disabling monitoring "temporarily" during maintenance** and forgetting
      to
        re-enable it.
      - **Storing backups on the same system, site, or domain** they're meant to
        protect — ransomware encrypts those too.
      - **Granting Domain Admin** to solve a permissions problem nobody wanted
      to scope.
  - heading: Vocabulary
    markdown: >-
      - **RTO / RPO** — how fast you must recover, and how much data you can
      afford to
        lose.
      - **3-2-1 rule** — three backup copies, two media types, one off-site.

      - **Active Directory / LDAP** — the directory of identities, groups, and
      policy.

      - **Snowflake server** — a unique, hand-configured, hard-to-reproduce
      machine.

      - **Runbook** — step-by-step recovery procedure for a known scenario.

      - **Idempotent** — a config/automation safe to apply repeatedly to the
      same end
        state (the core promise of Ansible).
      - **SPOF** — single point of failure; a dependency whose loss takes
      everything
        down.
  - heading: Tools
    markdown: >-
      - **Configuration management** — Ansible (agentless, idempotent), Puppet,
      Chef;
        config as reviewable, repeatable code.
      - **Directory services** — Active Directory, LDAP, SSO/SAML/OIDC for
      identity.

      - **Monitoring & logging** — Zabbix, Prometheus + Grafana, the ELK/Loki
      stack;
        dashboards and alerts are your senses.
      - **Backup software** — Veeam, Bacula, restic, plus immutable/off-site
      targets.

      - **Scripting & remote access** — PowerShell, Bash/Python; SSH, RDP,
      IPMI/iLO.
  - heading: Collaboration
    markdown: >-
      Sysadmins sit between the people who depend on systems and the systems

      themselves. They work with the help desk (the first line), developers and
      DevOps

      (who want to ship; the admin wants it stable), security (allies on
      hardening and

      incident response), networking, and management (who fund resilience they
      hope

      never to use). The hardest conversations are about change windows and
      access

      requests — saying "not in production at 4 p.m. on a Friday" without
      becoming the

      department of no. Good admins translate risk into business terms managers
      can act

      on, and partner with DevOps and SRE rather than treating "infrastructure
      as code"

      as a turf threat.
  - heading: Ethics
    markdown: >-
      Sysadmins hold the keys: root, Domain Admin, the backups, the logs, the
      ability

      to read anyone's mailbox. That power is held in trust. Core duties: access
      what

      you're authorized to and only for legitimate reasons; never snoop on user
      data,

      even when you can; protect the confidentiality and integrity of the
      systems and

      the people on them; and be honest about risk — don't let a known
      vulnerability or

      an untested backup ride quietly because surfacing it is inconvenient.
      Disclose

      breaches promptly. When asked to implement surveillance, weakened
      security, or

      data retention that harms users, name the conflict rather than silently
      complying.

      The power asymmetry between an administrator and an ordinary user is
      exactly why

      restraint, transparency, and least privilege are ethical obligations.
  - heading: Scenarios
    markdown: >-
      **The ransomware morning.** A user clicks a phishing link; by 9 a.m. file
      shares

      are encrypted and a ransom note is on every desktop. The expert doesn't
      pay and

      doesn't panic-restore. First, contain: isolate affected segments, disable
      the

      compromised account, pull the network to stop lateral spread. Then assess
      scope

      from logs and EDR. Recovery hinges on the one thing prepared months ago:

      immutable, off-site backups the malware couldn't reach. They restore from
      the

      last clean, tested restore point, accept the RPO data loss already agreed
      with

      the business, rebuild compromised hosts from baseline, rotate every
      credential,

      then reconnect. The postmortem fixes the systemic gaps: MFA everywhere,
      segmented

      network, faster restore drill. The backup that saved them existed because
      someone

      tested a restore when nothing was wrong.


      **The disk that almost filled.** Monitoring fires an 80% alert on a
      database

      volume at 2 p.m. — a warning, not an outage. The novice extends the disk
      and moves

      on. The expert asks *why*: a log rotation job silently failed three weeks
      ago.

      Extending the disk would only delay the real problem. They fix rotation,
      reclaim

      the space, and add an alert for "rotation didn't run" — so the slow leak
      never

      becomes the 3 a.m. outage it was heading toward.


      **The Friday change request.** A developer needs a "quick" config change
      pushed to

      the production auth server before the weekend to unblock a release. The
      expert

      declines the cowboy push without becoming the department of no: no change
      to a

      single-point-of-failure auth system goes out on a Friday afternoon with
      the team

      leaving and no one to watch it. They schedule it for the Tuesday window,
      require a

      tested rollback, and stage it first. The instinct that protected them:
      change is

      the leading cause of outages, and the worst time to discover a bad one is
      when no

      one is watching.
  - heading: Related Occupations
    markdown: >-
      A systems administrator shares ground with several roles but is defined by

      keeping running infrastructure available and recoverable over years. Site

      reliability engineers apply software engineering to operations at scale,
      treating

      servers as code and reliability as a measured budget — the SRE descends
      from the

      sysadmin but writes the automation as a product. DevOps engineers blur dev
      and

      ops with pipelines and IaC; the modern sysadmin is increasingly one.
      Network

      engineers own the plumbing the admin depends on; database administrators
      own the

      data layer; security engineers are the closest allies on hardening,
      access, and

      incident response.
  - heading: References
    markdown: >-
      - *The Practice of System and Network Administration* — Limoncelli, Hogan,
      Chalup

      - *UNIX and Linux System Administration Handbook* — Nemeth et al.

      - *Google SRE Book* — sre.google/books

      - *Time Management for System Administrators* — Tom Limoncelli

      - CIS Benchmarks — cisecurity.org
