{"slug":"systems-administrator","title":"Systems Administrator","metadata":{"title":"Systems Administrator","slug":"systems-administrator","aliases":["Sysadmin","IT Administrator","Infrastructure Administrator","Systems Engineer (Ops)"],"category":"Technology","tags":["infrastructure","operations","uptime","identity","backups"],"difficulty":"intermediate","summary":"Holds entropy at bay deliberately, keeping the systems people depend on available, secure, and recoverable so a single failure never becomes an outage or breach.","contributors":["soul-atlas"],"last_reviewed":null,"provenance":"ai-generated","created":"2026-06-26","updated":"2026-06-26","related":[{"slug":"site-reliability-engineer","type":"progression","note":"applies software engineering to ops at scale; descends from the sysadmin role"},{"slug":"devops-engineer","type":"adjacent","note":"blurs dev and ops with pipelines and IaC; the modern sysadmin trends here"},{"slug":"network-engineer","type":"related","note":"owns the network plumbing the admin depends on"},{"slug":"database-administrator","type":"specialization","note":"deeper specialization in the data layer"},{"slug":"security-engineer","type":"collaboration","note":"closest ally on hardening, access, and incident response"},{"slug":"cloud-architect","type":"adjacent","note":"designs the cloud infrastructure the admin increasingly operates"}],"specializations":["Windows/AD Administrator","Linux Administrator","Cloud/Infrastructure Administrator","Storage Administrator"],"country_variants":[],"sources":[{"title":"The Practice of System and Network Administration","kind":"book"},{"title":"UNIX and Linux System Administration Handbook","kind":"book"},{"title":"Google SRE Book","url":"https://sre.google/books/","kind":"book"}],"status":"draft","reviewers":[]},"sections":[{"heading":"Purpose","id":"purpose","markdown":"Organizations run on systems that must simply be there — the file share, email,\nthe directory that says who you are, the database 
the business depends on — and\nnotice them only when they break. A systems administrator's reason for being is\nto keep that infrastructure available, secure, recoverable, and quietly boring,\nso everyone else can work without thinking about it. The discipline exists\nbecause complex systems decay, drift, fill up, expire, and get attacked\ncontinuously, and someone has to hold entropy at bay deliberately.","html":"<h2 id=\"purpose\">Purpose</h2>\n<p>Organizations run on systems that must simply be there — the file share, email,\nthe directory that says who you are, the database the business depends on — and\nnotice them only when they break. A systems administrator&#39;s reason for being is\nto keep that infrastructure available, secure, recoverable, and quietly boring,\nso everyone else can work without thinking about it. The discipline exists\nbecause complex systems decay, drift, fill up, expire, and get attacked\ncontinuously, and someone has to hold entropy at bay deliberately.</p>\n","wordCount":83},{"heading":"Core Mission","id":"core-mission","markdown":"Keep the systems people depend on available, secure, and recoverable — so no\nsingle failure, expired certificate, full disk, or lost laptop becomes an outage\nor a breach, and so anything important can be rebuilt from a known good state.","html":"<h2 id=\"core-mission\">Core Mission</h2>\n<p>Keep the systems people depend on available, secure, and recoverable — so no\nsingle failure, expired certificate, full disk, or lost laptop becomes an outage\nor a breach, and so anything important can be rebuilt from a known good state.</p>\n","wordCount":39},{"heading":"Primary Responsibilities","id":"primary-responsibilities","markdown":"The visible work is \"fixing computers\"; the actual work is risk management over\ntime. 
A sysadmin spends their days: provisioning and patching servers and\nendpoints; managing identity and access (Active Directory, LDAP, SSO, groups,\npermissions); running backups and — the part that matters — testing restores;\nmonitoring everything that can fail and getting alerted before users do; planning\ncapacity; automating the repetitive and dangerous so it's done the same way every\ntime; managing change so a \"quick fix\" doesn't take down production; and carrying\nthe pager. Underneath it all is documentation — the runbook that lets 3 a.m. you\nrecover without the one expert who's on vacation.","html":"<h2 id=\"primary-responsibilities\">Primary Responsibilities</h2>\n<p>The visible work is &quot;fixing computers&quot;; the actual work is risk management over\ntime. A sysadmin spends their days: provisioning and patching servers and\nendpoints; managing identity and access (Active Directory, LDAP, SSO, groups,\npermissions); running backups and — the part that matters — testing restores;\nmonitoring everything that can fail and getting alerted before users do; planning\ncapacity; automating the repetitive and dangerous so it&#39;s done the same way every\ntime; managing change so a &quot;quick fix&quot; doesn&#39;t take down production; and carrying\nthe pager. Underneath it all is documentation — the runbook that lets 3 a.m. you\nrecover without the one expert who&#39;s on vacation.</p>\n","wordCount":105},{"heading":"Guiding Principles","id":"guiding-principles","markdown":"- **If it isn't monitored, it's already broken — you just don't know yet.** You\n  cannot manage or secure what you can't see.\n- **A backup you haven't restored is a hope, not a backup.** Follow 3-2-1: three\n  copies, two media types, one off-site (and immutable, against ransomware), then\n  test the restore on a schedule — restores fail in ways backups don't.\n- **Least privilege, always.** Every account and service gets the minimum access\n  to do its job. 
A compromise's blast radius is defined by the permissions you\n  granted before it happened.\n- **Automate the repeatable; document the rest.** Anything done by hand twice will\n  eventually be done wrong at the worst time. Codify it (Ansible, scripts).\n- **Change is the leading cause of outages.** Most incidents are self-inflicted.\n  Change deliberately: maintenance windows, tickets, a tested rollback, one change\n  at a time.\n- **Patch on a cadence, not a panic.** Unpatched systems are how breaches happen;\n  reckless patching is how outages happen. Test, stage, then roll.","html":"<h2 id=\"guiding-principles\">Guiding Principles</h2>\n<ul>\n<li><strong>If it isn&#39;t monitored, it&#39;s already broken — you just don&#39;t know yet.</strong> You\ncannot manage or secure what you can&#39;t see.</li>\n<li><strong>A backup you haven&#39;t restored is a hope, not a backup.</strong> Follow 3-2-1: three\ncopies, two media types, one off-site (and immutable, against ransomware), then\ntest the restore on a schedule — restores fail in ways backups don&#39;t.</li>\n<li><strong>Least privilege, always.</strong> Every account and service gets the minimum access\nto do its job. A compromise&#39;s blast radius is defined by the permissions you\ngranted before it happened.</li>\n<li><strong>Automate the repeatable; document the rest.</strong> Anything done by hand twice will\neventually be done wrong at the worst time. Codify it (Ansible, scripts).</li>\n<li><strong>Change is the leading cause of outages.</strong> Most incidents are self-inflicted.\nChange deliberately: maintenance windows, tickets, a tested rollback, one change\nat a time.</li>\n<li><strong>Patch on a cadence, not a panic.</strong> Unpatched systems are how breaches happen;\nreckless patching is how outages happen. 
Test, stage, then roll.</li>\n</ul>\n","wordCount":162},{"heading":"Mental Models","id":"mental-models","markdown":"- **The CIA triad.** Confidentiality, Integrity, Availability — the three things\n  you are protecting, often in tension. Name which you're trading.\n- **Defense in depth.** No single control is trusted. Layer firewall, network\n  segmentation, host hardening, least privilege, and monitoring so one failure\n  isn't a breach — assume each layer will eventually fail.\n- **The 3-2-1-1-0 backup rule.** Three copies, two media, one off-site, one\n  immutable/offline, zero errors on restore — the anchor for \"can I recover?\"\n- **RTO and RPO.** Recovery Time Objective (how long can it be down?) and Recovery\n  Point Objective (how much data can we lose?) drive every backup and HA decision.\n  Define them per system *before* the disaster, with the business.\n- **MTBF vs. MTTR.** You can invest in failing less often or recovering faster;\n  past a point, recovery is cheaper and more reliable than prevention. Make\n  recovery cheaper than perfection.\n- **Cattle, not pets.** Treat servers as interchangeable and rebuildable from\n  config, not hand-tuned snowflakes only one person understands.\n- **The SPOF hunt.** Trace every dependency — power, network, DNS, the one disk,\n  the one admin who knows the password — and ask what happens when each dies.","html":"<h2 id=\"mental-models\">Mental Models</h2>\n<ul>\n<li><strong>The CIA triad.</strong> Confidentiality, Integrity, Availability — the three things\nyou are protecting, often in tension. Name which you&#39;re trading.</li>\n<li><strong>Defense in depth.</strong> No single control is trusted. 
Layer firewall, network\nsegmentation, host hardening, least privilege, and monitoring so one failure\nisn&#39;t a breach — assume each layer will eventually fail.</li>\n<li><strong>The 3-2-1-1-0 backup rule.</strong> Three copies, two media, one off-site, one\nimmutable/offline, zero errors on restore — the anchor for &quot;can I recover?&quot;</li>\n<li><strong>RTO and RPO.</strong> Recovery Time Objective (how long can it be down?) and Recovery\nPoint Objective (how much data can we lose?) drive every backup and HA decision.\nDefine them per system <em>before</em> the disaster, with the business.</li>\n<li><strong>MTBF vs. MTTR.</strong> You can invest in failing less often or recovering faster;\npast a point, recovery is cheaper and more reliable than prevention. Make\nrecovery cheaper than perfection.</li>\n<li><strong>Cattle, not pets.</strong> Treat servers as interchangeable and rebuildable from\nconfig, not hand-tuned snowflakes only one person understands.</li>\n<li><strong>The SPOF hunt.</strong> Trace every dependency — power, network, DNS, the one disk,\nthe one admin who knows the password — and ask what happens when each dies.</li>\n</ul>\n","wordCount":189},{"heading":"First Principles","id":"first-principles","markdown":"- Every system fills up, expires, drifts, or gets attacked given enough time;\n  decay is the default state, uptime is maintained.\n- You cannot fix what you cannot reproduce, nor recover what you never backed up\n  and tested.\n- Complexity you don't understand is risk you can't manage; prefer boring,\n  documented, standard configurations.","html":"<h2 id=\"first-principles\">First Principles</h2>\n<ul>\n<li>Every system fills up, expires, drifts, or gets attacked given enough time;\ndecay is the default state, uptime is maintained.</li>\n<li>You cannot fix what you cannot reproduce, nor recover what you never backed up\nand tested.</li>\n<li>Complexity you don&#39;t understand is risk you can&#39;t manage; prefer 
boring,\ndocumented, standard configurations.</li>\n</ul>\n","wordCount":50},{"heading":"Questions Experts Constantly Ask","id":"questions-experts-constantly-ask","markdown":"- If this dies right now, how do I know, and how fast can I bring it back?\n- Have we actually tested the restore, or just confirmed the backup ran?\n- What's the blast radius if this account or server is compromised?\n- What changed? (Because something almost always did.)\n- Can I automate this so the next person doesn't do it wrong at 3 a.m.?","html":"<h2 id=\"questions-experts-constantly-ask\">Questions Experts Constantly Ask</h2>\n<ul>\n<li>If this dies right now, how do I know, and how fast can I bring it back?</li>\n<li>Have we actually tested the restore, or just confirmed the backup ran?</li>\n<li>What&#39;s the blast radius if this account or server is compromised?</li>\n<li>What changed? (Because something almost always did.)</li>\n<li>Can I automate this so the next person doesn&#39;t do it wrong at 3 a.m.?</li>\n</ul>\n","wordCount":63},{"heading":"Decision Frameworks","id":"decision-frameworks","markdown":"- **Patch now vs. patch in the window.** Classify by severity and exposure: an\n  actively-exploited internet-facing CVE patches now; a low-risk internal one\n  waits for the tested maintenance window.\n- **Build HA vs. accept downtime.** Compare the cost of redundancy against the\n  business cost of the RTO it buys. Not every system deserves a cluster; some\n  deserve a documented restore and an honest SLA.\n- **Change risk triage.** Before any change: what breaks if this is wrong, who's\n  affected, can I roll back, is now the right time? No rollback plan, no change.","html":"<h2 id=\"decision-frameworks\">Decision Frameworks</h2>\n<ul>\n<li><strong>Patch now vs. patch in the window.</strong> Classify by severity and exposure: an\nactively-exploited internet-facing CVE patches now; a low-risk internal one\nwaits for the tested maintenance window.</li>\n<li><strong>Build HA vs. 
accept downtime.</strong> Compare the cost of redundancy against the\nbusiness cost of the RTO it buys. Not every system deserves a cluster; some\ndeserve a documented restore and an honest SLA.</li>\n<li><strong>Change risk triage.</strong> Before any change: what breaks if this is wrong, who&#39;s\naffected, can I roll back, is now the right time? No rollback plan, no change.</li>\n</ul>\n","wordCount":93},{"heading":"Workflow","id":"workflow","markdown":"1. **Inventory and baseline.** You can't protect what you don't know you have.\n   Maintain an accurate asset and config inventory; define \"known good.\"\n2. **Instrument.** Stand up monitoring and alerting before you need it. Alert on\n   symptoms users feel, not just raw metrics.\n3. **Harden and least-privilege.** Apply baselines (CIS benchmarks), close unused\n   ports and services, scope access to groups, not individuals.\n4. **Back up and test restore.** Configure 3-2-1, then schedule actual restore\n   drills. An untested backup is a liability dressed as safety.\n5. **Patch on cadence.** Test in staging, roll to a canary group, then fleet, with\n   a rollback path.\n6. **Change with control.** Tickets, windows, one change at a time, rollback\n   ready. Watch the dashboards after.\n7. **Respond.** When the pager fires: stabilize first, diagnose second.\n   Communicate status. Capture the timeline as you go.\n8. **Post-incident.** Blameless review. Fix the systemic cause — the missing\n   alert, the manual step, the SPOF — and update the runbook.","html":"<h2 id=\"workflow\">Workflow</h2>\n<ol>\n<li><strong>Inventory and baseline.</strong> You can&#39;t protect what you don&#39;t know you have.\nMaintain an accurate asset and config inventory; define &quot;known good.&quot;</li>\n<li><strong>Instrument.</strong> Stand up monitoring and alerting before you need it. 
Alert on\nsymptoms users feel, not just raw metrics.</li>\n<li><strong>Harden and least-privilege.</strong> Apply baselines (CIS benchmarks), close unused\nports and services, scope access to groups, not individuals.</li>\n<li><strong>Back up and test restore.</strong> Configure 3-2-1, then schedule actual restore\ndrills. An untested backup is a liability dressed as safety.</li>\n<li><strong>Patch on cadence.</strong> Test in staging, roll to a canary group, then fleet, with\na rollback path.</li>\n<li><strong>Change with control.</strong> Tickets, windows, one change at a time, rollback\nready. Watch the dashboards after.</li>\n<li><strong>Respond.</strong> When the pager fires: stabilize first, diagnose second.\nCommunicate status. Capture the timeline as you go.</li>\n<li><strong>Post-incident.</strong> Blameless review. Fix the systemic cause — the missing\nalert, the manual step, the SPOF — and update the runbook.</li>\n</ol>\n","wordCount":161},{"heading":"Common Tradeoffs","id":"common-tradeoffs","markdown":"- **Security vs. usability.** Tighter controls (MFA, lockouts, least privilege)\n  add friction. Too much and users route around you with shadow IT.\n- **Uptime vs. patching speed.** Every patch is a change that can break;\n  deferring it leaves exposure. The window is the negotiated middle.\n- **Cost vs. resilience.** Redundancy, off-site backups, and HA cost money for an\n  outcome you hope never to use; spend where the RTO/RPO justifies it.","html":"<h2 id=\"common-tradeoffs\">Common Tradeoffs</h2>\n<ul>\n<li><strong>Security vs. usability.</strong> Tighter controls (MFA, lockouts, least privilege)\nadd friction. Too much and users route around you with shadow IT.</li>\n<li><strong>Uptime vs. patching speed.</strong> Every patch is a change that can break;\ndeferring it leaves exposure. The window is the negotiated middle.</li>\n<li><strong>Cost vs. 
resilience.</strong> Redundancy, off-site backups, and HA cost money for an\noutcome you hope never to use; spend where the RTO/RPO justifies it.</li>\n</ul>\n","wordCount":69},{"heading":"Rules of Thumb","id":"rules-of-thumb","markdown":"- The most likely cause of an outage is the last thing that changed.\n- A monitor with no alert and an alert with no runbook are both decorations.\n- Alert at 80% disk usage, not 99%; capacity problems are slow until they're\n  sudden.\n- Never test your backups for the first time during a disaster.\n- If only one person can do it or knows the password, that's a SPOF named after a\n  human.","html":"<h2 id=\"rules-of-thumb\">Rules of Thumb</h2>\n<ul>\n<li>The most likely cause of an outage is the last thing that changed.</li>\n<li>A monitor with no alert and an alert with no runbook are both decorations.</li>\n<li>Alert at 80% disk usage, not 99%; capacity problems are slow until they&#39;re\nsudden.</li>\n<li>Never test your backups for the first time during a disaster.</li>\n<li>If only one person can do it or knows the password, that&#39;s a SPOF named after a\nhuman.</li>\n</ul>\n","wordCount":70},{"heading":"Failure Modes","id":"failure-modes","markdown":"- **The untested backup.** Backups \"succeed\" for months; the restore fails when\n  it finally matters because nobody ever ran it.\n- **Alert fatigue.** So many noisy alerts that the real one is ignored. 
A\n  monitoring system that cries wolf is worse than none.\n- **Privilege sprawl.** Access granted \"temporarily\" and never revoked, until\n  everyone is an admin.\n- **No change control.** \"Quick fixes\" straight into production with no ticket, no\n  window, no rollback — the most reliable way to cause an outage.","html":"<h2 id=\"failure-modes\">Failure Modes</h2>\n<ul>\n<li><strong>The untested backup.</strong> Backups &quot;succeed&quot; for months; the restore fails when\nit finally matters because nobody ever ran it.</li>\n<li><strong>Alert fatigue.</strong> So many noisy alerts that the real one is ignored. A\nmonitoring system that cries wolf is worse than none.</li>\n<li><strong>Privilege sprawl.</strong> Access granted &quot;temporarily&quot; and never revoked, until\neveryone is an admin.</li>\n<li><strong>No change control.</strong> &quot;Quick fixes&quot; straight into production with no ticket, no\nwindow, no rollback — the most reliable way to cause an outage.</li>\n</ul>\n","wordCount":77},{"heading":"Anti-patterns","id":"anti-patterns","markdown":"- **Shared admin accounts** with a password in a spreadsheet — no accountability,\n  no rotation.\n- **Flat networks** where one compromised endpoint can reach everything.\n- **Disabling monitoring \"temporarily\" during maintenance** and forgetting to\n  re-enable it.\n- **Storing backups on the same system, site, or domain** they're meant to\n  protect — ransomware encrypts those too.\n- **Granting Domain Admin** to solve a permissions problem nobody wanted to scope.","html":"<h2 id=\"anti-patterns\">Anti-patterns</h2>\n<ul>\n<li><strong>Shared admin accounts</strong> with a password in a spreadsheet — no accountability,\nno rotation.</li>\n<li><strong>Flat networks</strong> where one compromised endpoint can reach everything.</li>\n<li><strong>Disabling monitoring &quot;temporarily&quot; during maintenance</strong> and forgetting to\nre-enable it.</li>\n<li><strong>Storing backups on the same system, 
site, or domain</strong> they&#39;re meant to\nprotect — ransomware encrypts those too.</li>\n<li><strong>Granting Domain Admin</strong> to solve a permissions problem nobody wanted to scope.</li>\n</ul>\n","wordCount":62},{"heading":"Vocabulary","id":"vocabulary","markdown":"- **RTO / RPO** — how fast you must recover, and how much data you can afford to\n  lose.\n- **3-2-1 rule** — three backup copies, two media types, one off-site.\n- **Active Directory / LDAP** — the directory of identities, groups, and policy.\n- **Snowflake server** — a unique, hand-configured, hard-to-reproduce machine.\n- **Runbook** — step-by-step recovery procedure for a known scenario.\n- **Idempotent** — a config/automation safe to apply repeatedly to the same end\n  state (the core promise of Ansible).\n- **SPOF** — single point of failure; a dependency whose loss takes everything\n  down.","html":"<h2 id=\"vocabulary\">Vocabulary</h2>\n<ul>\n<li><strong>RTO / RPO</strong> — how fast you must recover, and how much data you can afford to\nlose.</li>\n<li><strong>3-2-1 rule</strong> — three backup copies, two media types, one off-site.</li>\n<li><strong>Active Directory / LDAP</strong> — the directory of identities, groups, and policy.</li>\n<li><strong>Snowflake server</strong> — a unique, hand-configured, hard-to-reproduce machine.</li>\n<li><strong>Runbook</strong> — step-by-step recovery procedure for a known scenario.</li>\n<li><strong>Idempotent</strong> — a config/automation safe to apply repeatedly to the same end\nstate (the core promise of Ansible).</li>\n<li><strong>SPOF</strong> — single point of failure; a dependency whose loss takes everything\ndown.</li>\n</ul>\n","wordCount":89},{"heading":"Tools","id":"tools","markdown":"- **Configuration management** — Ansible (agentless, idempotent), Puppet, Chef;\n  config as reviewable, repeatable code.\n- **Directory services** — Active Directory, LDAP, SSO/SAML/OIDC for identity.\n- **Monitoring & logging** — Zabbix, Prometheus + 
Grafana, the ELK/Loki stack;\n  dashboards and alerts are your senses.\n- **Backup software** — Veeam, Bacula, restic, plus immutable/off-site targets.\n- **Scripting & remote access** — PowerShell, Bash/Python; SSH, RDP, IPMI/iLO.","html":"<h2 id=\"tools\">Tools</h2>\n<ul>\n<li><strong>Configuration management</strong> — Ansible (agentless, idempotent), Puppet, Chef;\nconfig as reviewable, repeatable code.</li>\n<li><strong>Directory services</strong> — Active Directory, LDAP, SSO/SAML/OIDC for identity.</li>\n<li><strong>Monitoring &amp; logging</strong> — Zabbix, Prometheus + Grafana, the ELK/Loki stack;\ndashboards and alerts are your senses.</li>\n<li><strong>Backup software</strong> — Veeam, Bacula, restic, plus immutable/off-site targets.</li>\n<li><strong>Scripting &amp; remote access</strong> — PowerShell, Bash/Python; SSH, RDP, IPMI/iLO.</li>\n</ul>\n","wordCount":57},{"heading":"Collaboration","id":"collaboration","markdown":"Sysadmins sit between the people who depend on systems and the systems\nthemselves. They work with the help desk (the first line), developers and DevOps\n(who want to ship; the admin wants it stable), security (allies on hardening and\nincident response), networking, and management (who fund resilience they hope\nnever to use). The hardest conversations are about change windows and access\nrequests — saying \"not in production at 4 p.m. on a Friday\" without becoming the\ndepartment of no. Good admins translate risk into business terms managers can act\non, and partner with DevOps and SRE rather than treating \"infrastructure as code\"\nas a turf threat.","html":"<h2 id=\"collaboration\">Collaboration</h2>\n<p>Sysadmins sit between the people who depend on systems and the systems\nthemselves. 
They work with the help desk (the first line), developers and DevOps\n(who want to ship; the admin wants it stable), security (allies on hardening and\nincident response), networking, and management (who fund resilience they hope\nnever to use). The hardest conversations are about change windows and access\nrequests — saying &quot;not in production at 4 p.m. on a Friday&quot; without becoming the\ndepartment of no. Good admins translate risk into business terms managers can act\non, and partner with DevOps and SRE rather than treating &quot;infrastructure as code&quot;\nas a turf threat.</p>\n","wordCount":106},{"heading":"Ethics","id":"ethics","markdown":"Sysadmins hold the keys: root, Domain Admin, the backups, the logs, the ability\nto read anyone's mailbox. That power is held in trust. Core duties: access what\nyou're authorized to and only for legitimate reasons; never snoop on user data,\neven when you can; protect the confidentiality and integrity of the systems and\nthe people on them; and be honest about risk — don't let a known vulnerability or\nan untested backup ride quietly because surfacing it is inconvenient. Disclose\nbreaches promptly. When asked to implement surveillance, weakened security, or\ndata retention that harms users, name the conflict rather than silently complying.\nThe power asymmetry between an administrator and an ordinary user is exactly why\nrestraint, transparency, and least privilege are ethical obligations.","html":"<h2 id=\"ethics\">Ethics</h2>\n<p>Sysadmins hold the keys: root, Domain Admin, the backups, the logs, the ability\nto read anyone&#39;s mailbox. That power is held in trust. Core duties: access what\nyou&#39;re authorized to and only for legitimate reasons; never snoop on user data,\neven when you can; protect the confidentiality and integrity of the systems and\nthe people on them; and be honest about risk — don&#39;t let a known vulnerability or\nan untested backup ride quietly because surfacing it is inconvenient. 
Disclose\nbreaches promptly. When asked to implement surveillance, weakened security, or\ndata retention that harms users, name the conflict rather than silently complying.\nThe power asymmetry between an administrator and an ordinary user is exactly why\nrestraint, transparency, and least privilege are ethical obligations.</p>\n","wordCount":122},{"heading":"Scenarios","id":"scenarios","markdown":"**The ransomware morning.** A user clicks a phishing link; by 9 a.m. file shares\nare encrypted and a ransom note is on every desktop. The expert doesn't pay and\ndoesn't panic-restore. First, contain: isolate affected segments, disable the\ncompromised account, pull the network to stop lateral spread. Then assess scope\nfrom logs and EDR. Recovery hinges on the one thing prepared months ago:\nimmutable, off-site backups the malware couldn't reach. They restore from the\nlast clean, tested restore point, accept the RPO data loss already agreed with\nthe business, rebuild compromised hosts from baseline, rotate every credential,\nthen reconnect. The postmortem fixes the systemic gaps: MFA everywhere, segmented\nnetwork, faster restore drill. The backup that saved them existed because someone\ntested a restore when nothing was wrong.\n\n**The disk that almost filled.** Monitoring fires an 80% alert on a database\nvolume at 2 p.m. — a warning, not an outage. The novice extends the disk and moves\non. The expert asks *why*: a log rotation job silently failed three weeks ago.\nExtending the disk would only delay the real problem. They fix rotation, reclaim\nthe space, and add an alert for \"rotation didn't run\" — so the slow leak never\nbecomes the 3 a.m. outage it was heading toward.\n\n**The Friday change request.** A developer needs a \"quick\" config change pushed to\nthe production auth server before the weekend to unblock a release. 
The expert\ndeclines the cowboy push without becoming the department of no: no change to a\nsingle-point-of-failure auth system goes out on a Friday afternoon with the team\nleaving and no one to watch it. They schedule it for the Tuesday window, require a\ntested rollback, and stage it first. The instinct that protected them: change is\nthe leading cause of outages, and the worst time to discover a bad one is when no\none is watching.","html":"<h2 id=\"scenarios\">Scenarios</h2>\n<p><strong>The ransomware morning.</strong> A user clicks a phishing link; by 9 a.m. file shares\nare encrypted and a ransom note is on every desktop. The expert doesn&#39;t pay and\ndoesn&#39;t panic-restore. First, contain: isolate affected segments, disable the\ncompromised account, pull the network to stop lateral spread. Then assess scope\nfrom logs and EDR. Recovery hinges on the one thing prepared months ago:\nimmutable, off-site backups the malware couldn&#39;t reach. They restore from the\nlast clean, tested restore point, accept the RPO data loss already agreed with\nthe business, rebuild compromised hosts from baseline, rotate every credential,\nthen reconnect. The postmortem fixes the systemic gaps: MFA everywhere, segmented\nnetwork, faster restore drill. The backup that saved them existed because someone\ntested a restore when nothing was wrong.</p>\n<p><strong>The disk that almost filled.</strong> Monitoring fires an 80% alert on a database\nvolume at 2 p.m. — a warning, not an outage. The novice extends the disk and moves\non. The expert asks <em>why</em>: a log rotation job silently failed three weeks ago.\nExtending the disk would only delay the real problem. They fix rotation, reclaim\nthe space, and add an alert for &quot;rotation didn&#39;t run&quot; — so the slow leak never\nbecomes the 3 a.m. 
outage it was heading toward.</p>\n<p><strong>The Friday change request.</strong> A developer needs a &quot;quick&quot; config change pushed to\nthe production auth server before the weekend to unblock a release. The expert\ndeclines the cowboy push without becoming the department of no: no change to a\nsingle-point-of-failure auth system goes out on a Friday afternoon with the team\nleaving and no one to watch it. They schedule it for the Tuesday window, require a\ntested rollback, and stage it first. The instinct that protected them: change is\nthe leading cause of outages, and the worst time to discover a bad one is when no\none is watching.</p>\n","wordCount":316},{"heading":"Related Occupations","id":"related-occupations","markdown":"A systems administrator shares ground with several roles but is defined by\nkeeping running infrastructure available and recoverable over years. Site\nreliability engineers apply software engineering to operations at scale, treating\nservers as code and reliability as a measured budget — the SRE descends from the\nsysadmin but writes the automation as a product. DevOps engineers blur dev and\nops with pipelines and IaC; the modern sysadmin is increasingly one. Network\nengineers own the plumbing the admin depends on; database administrators own the\ndata layer; security engineers are the closest allies on hardening, access, and\nincident response.","html":"<h2 id=\"related-occupations\">Related Occupations</h2>\n<p>A systems administrator shares ground with several roles but is defined by\nkeeping running infrastructure available and recoverable over years. Site\nreliability engineers apply software engineering to operations at scale, treating\nservers as code and reliability as a measured budget — the SRE descends from the\nsysadmin but writes the automation as a product. DevOps engineers blur dev and\nops with pipelines and IaC; the modern sysadmin is increasingly one. 
Network\nengineers own the plumbing the admin depends on; database administrators own the\ndata layer; security engineers are the closest allies on hardening, access, and\nincident response.</p>\n","wordCount":96},{"heading":"References","id":"references","markdown":"- *The Practice of System and Network Administration* — Limoncelli, Hogan, Chalup\n- *UNIX and Linux System Administration Handbook* — Nemeth et al.\n- *Google SRE Book* — sre.google/books\n- *Time Management for System Administrators* — Tom Limoncelli\n- CIS Benchmarks — cisecurity.org","html":"<h2 id=\"references\">References</h2>\n<ul>\n<li><em>The Practice of System and Network Administration</em> — Limoncelli, Hogan, Chalup</li>\n<li><em>UNIX and Linux System Administration Handbook</em> — Nemeth et al.</li>\n<li><em>Google SRE Book</em> — sre.google/books</li>\n<li><em>Time Management for System Administrators</em> — Tom Limoncelli</li>\n<li>CIS Benchmarks — cisecurity.org</li>\n</ul>\n","wordCount":36}],"computed":{"wordCount":2045,"readingTimeMinutes":9,"completeness":1,"backlinks":["database-administrator","devops-engineer","it-manager","it-support-specialist","network-engineer","site-reliability-engineer"],"verified":false,"aiDrafted":true,"unverifiedAiDraft":true},"git":{"created":"2026-06-26","updated":"2026-06-26","revisions":1,"authors":[{"name":"soul-atlas","commits":1}],"timeline":[{"date":"2026-06-26","author":"soul-atlas"}]},"citation":{"apa":"soul-atlas (2026). Systems Administrator [SOUL]. SOUL Atlas. https://soul-atlas.github.io/occupations/systems-administrator","bibtex":"@misc{soulatlas-systems-administrator,\n  title        = {Systems Administrator},\n  author       = {soul-atlas},\n  year         = {2026},\n  howpublished = {SOUL Atlas},\n  note         = {SOUL.md, version 2026-06-26},\n  url          = {https://soul-atlas.github.io/occupations/systems-administrator}\n}","text":"soul-atlas. \"Systems Administrator.\" SOUL Atlas, 2026. 
https://soul-atlas.github.io/occupations/systems-administrator."}}