{"slug":"database-administrator","title":"Database Administrator","metadata":{"title":"Database Administrator","slug":"database-administrator","aliases":["DBA","Database Engineer","Database Reliability Engineer"],"category":"Technology","tags":["databases","sql","reliability","performance","data-integrity"],"difficulty":"advanced","summary":"Guards irreplaceable state: makes data durable, correct, and recoverable against explicit RPO/RTO, reads query plans over queries, and treats every migration as a one-way door.","contributors":["soul-atlas"],"last_reviewed":null,"provenance":"ai-generated","created":"2026-06-26","updated":"2026-06-26","related":[{"slug":"site-reliability-engineer","type":"adjacent","note":"shares irreplaceable-state mindset; DBA specializes in the data tier"},{"slug":"backend-engineer","type":"collaboration","note":"writes the queries and schemas the DBA reviews and tunes"},{"slug":"data-engineer","type":"related","note":"takes the same data downstream into analytical pipelines"},{"slug":"security-engineer","type":"collaboration","note":"shares access control, encryption, and audit for sensitive data"},{"slug":"systems-administrator","type":"adjacent","note":"manages the hosts and storage the database depends on"}],"specializations":["PostgreSQL DBA","Database Reliability Engineer","Data Warehouse Administrator","Cloud Database Administrator"],"country_variants":[],"sources":[{"title":"Database Internals","kind":"book"},{"title":"Designing Data-Intensive Applications","kind":"book"},{"title":"SQL Performance Explained","kind":"book"}],"status":"draft","reviewers":[]},"sections":[{"heading":"Purpose","id":"purpose","markdown":"A database administrator exists because data is the one part of a system that\ncannot be regenerated. 
Code redeploys from git and servers reprovision, but the\ncustomer's order, patient's record, or ledger of who owes whom lives in the\ndatabase and nowhere else — if corrupted, lost, or silently wrong, no redeploy\nbrings it back. The DBA makes that state durable, correct, consistent, and fast\nenough to use.","html":"<h2 id=\"purpose\">Purpose</h2>\n<p>A database administrator exists because data is the one part of a system that\ncannot be regenerated. Code redeploys from git and servers reprovision, but the\ncustomer&#39;s order, patient&#39;s record, or ledger of who owes whom lives in the\ndatabase and nowhere else — if corrupted, lost, or silently wrong, no redeploy\nbrings it back. The DBA makes that state durable, correct, consistent, and fast\nenough to use.</p>\n","wordCount":67},{"heading":"Core Mission","id":"core-mission","markdown":"Keep the organization's data durable, correct, and recoverable while serving it\nfast enough that the application never waits — and prove, not hope, that last\nnight's catastrophe would cost no more than the agreed loss and downtime.","html":"<h2 id=\"core-mission\">Core Mission</h2>\n<p>Keep the organization&#39;s data durable, correct, and recoverable while serving it\nfast enough that the application never waits — and prove, not hope, that last\nnight&#39;s catastrophe would cost no more than the agreed loss and downtime.</p>\n","wordCount":36},{"heading":"Primary Responsibilities","id":"primary-responsibilities","markdown":"The visible work is running queries and granting access; the actual work is being\nthe last line of defense for state. 
A DBA designs schemas; builds indexes so the\nengine finds rows without scanning; reads and tunes execution plans; tests backup\nand recovery against RPO/RTO targets; manages replication and failover;\ncontrols concurrency, locking, and isolation so transactions don't corrupt each\nother; plans capacity; runs maintenance (vacuum, statistics, index rebuilds);\nsecures data with access control and encryption; and reviews every migration. Above\nall, the DBA tests restores.","html":"<h2 id=\"primary-responsibilities\">Primary Responsibilities</h2>\n<p>The visible work is running queries and granting access; the actual work is being\nthe last line of defense for state. A DBA designs schemas; builds indexes so the\nengine finds rows without scanning; reads and tunes execution plans; tests backup\nand recovery against RPO/RTO targets; manages replication and failover;\ncontrols concurrency, locking, and isolation so transactions don&#39;t corrupt each\nother; plans capacity; runs maintenance (vacuum, statistics, index rebuilds);\nsecures data with access control and encryption; and reviews every migration. Above\nall, the DBA tests restores.</p>\n","wordCount":87},{"heading":"Guiding Principles","id":"guiding-principles","markdown":"- **Durability is the prime directive.** A committed transaction must survive a\n  crash, full stop. 
Speed and convenience are negotiable; this is not.\n- **A backup you haven't restored does not exist.** Backups fail silently;\n  restores are the only proof.\n- **Correctness over speed, always.** The wrong rows fast is worse than the right\n  rows slowly.\n- **The query plan is the truth; the query is your intent.** Read the plan before\n  you tune; the optimizer obeys statistics, not intent.\n- **Migrations are one-way doors at scale.** Treat every change to a large, live\n  table as a potential outage; make it online, reversible, staged.\n- **Let the database do what it's good at.** Constraints, transactions, and joins\n  keep data from going wrong; integrity in app code orphans rows.\n- **Concurrency is where correctness goes to die.** Most data-corruption bugs are\n  two transactions racing; choose isolation levels carefully.","html":"<h2 id=\"guiding-principles\">Guiding Principles</h2>\n<ul>\n<li><strong>Durability is the prime directive.</strong> A committed transaction must survive a\ncrash, full stop. 
Speed and convenience are negotiable; this is not.</li>\n<li><strong>A backup you haven&#39;t restored does not exist.</strong> Backups fail silently;\nrestores are the only proof.</li>\n<li><strong>Correctness over speed, always.</strong> The wrong rows fast is worse than the right\nrows slowly.</li>\n<li><strong>The query plan is the truth; the query is your intent.</strong> Read the plan before\nyou tune; the optimizer obeys statistics, not intent.</li>\n<li><strong>Migrations are one-way doors at scale.</strong> Treat every change to a large, live\ntable as a potential outage; make it online, reversible, staged.</li>\n<li><strong>Let the database do what it&#39;s good at.</strong> Constraints, transactions, and joins\nkeep data from going wrong; integrity in app code orphans rows.</li>\n<li><strong>Concurrency is where correctness goes to die.</strong> Most data-corruption bugs are\ntwo transactions racing; choose isolation levels carefully.</li>\n</ul>\n","wordCount":142},{"heading":"Mental Models","id":"mental-models","markdown":"- **ACID.** Atomicity (all or nothing), Consistency (constraints hold), Isolation\n  (transactions don't see each other's mess), Durability (committed stays\n  committed). Every tradeoff is which of the four you relax.\n- **The index as a tradeoff, not a free lunch.** It turns an O(n) scan into an\n  O(log n) seek but taxes every write plus storage.\n- **The optimizer reasons over statistics.** It estimates row counts from\n  histograms to pick a plan; stale statistics feed it lies and a confident wrong\n  estimate produces a disastrous plan.\n- **MVCC vs. locking.** Multi-version concurrency control gives readers a snapshot\n  without blocking writers by keeping old row versions — why Postgres needs vacuum\n  to reclaim dead tuples.\n- **Normalization vs. 
denormalization.** Normal forms store each fact once to kill\n  update anomalies; denormalization duplicates to avoid joins.\n- **RPO and RTO.** Recovery point objective = data you can lose; recovery time\n  objective = downtime you can afford.\n- **The working set and the buffer cache.** Performance lives or dies on whether\n  the hot data fits in RAM; once it spills to disk, latency jumps an order of\n  magnitude.","html":"<h2 id=\"mental-models\">Mental Models</h2>\n<ul>\n<li><strong>ACID.</strong> Atomicity (all or nothing), Consistency (constraints hold), Isolation\n(transactions don&#39;t see each other&#39;s mess), Durability (committed stays\ncommitted). Every tradeoff is which of the four you relax.</li>\n<li><strong>The index as a tradeoff, not a free lunch.</strong> It turns an O(n) scan into an\nO(log n) seek but taxes every write plus storage.</li>\n<li><strong>The optimizer reasons over statistics.</strong> It estimates row counts from\nhistograms to pick a plan; stale statistics feed it lies and a confident wrong\nestimate produces a disastrous plan.</li>\n<li><strong>MVCC vs. locking.</strong> Multi-version concurrency control gives readers a snapshot\nwithout blocking writers by keeping old row versions — why Postgres needs vacuum\nto reclaim dead tuples.</li>\n<li><strong>Normalization vs. 
denormalization.</strong> Normal forms store each fact once to kill\nupdate anomalies; denormalization duplicates to avoid joins.</li>\n<li><strong>RPO and RTO.</strong> Recovery point objective = data you can lose; recovery time\nobjective = downtime you can afford.</li>\n<li><strong>The working set and the buffer cache.</strong> Performance lives or dies on whether\nthe hot data fits in RAM; once it spills to disk, latency jumps an order of\nmagnitude.</li>\n</ul>\n","wordCount":176},{"heading":"First Principles","id":"first-principles","markdown":"- Storage fails, and fails silently — bit rot, controller bugs, and human error\n  corrupt data without an error message.\n- A committed write not on durable media in at least two places is not yet safe.\n- The optimizer is only as good as its statistics; garbage estimates, garbage\n  plans.\n- Every lock held is someone else waiting; every long transaction taxes everyone\n  behind it.\n- You cannot scale writes the way you scale reads — sharding splits the data and\n  the transactions.","html":"<h2 id=\"first-principles\">First Principles</h2>\n<ul>\n<li>Storage fails, and fails silently — bit rot, controller bugs, and human error\ncorrupt data without an error message.</li>\n<li>A committed write not on durable media in at least two places is not yet safe.</li>\n<li>The optimizer is only as good as its statistics; garbage estimates, garbage\nplans.</li>\n<li>Every lock held is someone else waiting; every long transaction taxes everyone\nbehind it.</li>\n<li>You cannot scale writes the way you scale reads — sharding splits the data and\nthe transactions.</li>\n</ul>\n","wordCount":77},{"heading":"Questions Experts Constantly Ask","id":"questions-experts-constantly-ask","markdown":"- If this disk dies right now, how much data do we lose and how long are we down?\n- When did we last actually restore this backup, not just take it?\n- What does the query plan say — seek or scan, and why?\n- Are the statistics fresh, and 
does the optimizer's row estimate match reality?\n- What isolation level is this, and what anomaly does it permit?\n- Is this migration online and reversible, and what locks will it take?\n- What's the working set, does it fit in RAM, and when will it stop?","html":"<h2 id=\"questions-experts-constantly-ask\">Questions Experts Constantly Ask</h2>\n<ul>\n<li>If this disk dies right now, how much data do we lose and how long are we down?</li>\n<li>When did we last actually restore this backup, not just take it?</li>\n<li>What does the query plan say — seek or scan, and why?</li>\n<li>Are the statistics fresh, and does the optimizer&#39;s row estimate match reality?</li>\n<li>What isolation level is this, and what anomaly does it permit?</li>\n<li>Is this migration online and reversible, and what locks will it take?</li>\n<li>What&#39;s the working set, does it fit in RAM, and when will it stop?</li>\n</ul>\n","wordCount":90},{"heading":"Decision Frameworks","id":"decision-frameworks","markdown":"- **RPO/RTO-driven backup design.** Start from the business answer — \"we can lose\n  5 minutes, be down 1 hour\" — and work back: WAL archiving plus replicas for tight\n  RPO, nightly dumps for loose.\n- **Index decision: cost the writes, not just the reads.** Add an index when a\n  frequent slow query needs it and the write penalty is acceptable; drop indexes\n  absent from the plan cache. 
Composite order follows the predicates: equality\n  before range, most selective first.\n- **Normalize first, denormalize on evidence.** Design in third normal form;\n  denormalize only when a profiled read path proves the join is the bottleneck.\n- **Scale reads with replicas, writes with sharding — reluctantly.** Replicas are\n  cheap and safe; sharding fractures transactions and joins, so exhaust vertical\n  scaling first.\n- **Mitigate first during an incident.** Kill the runaway query, fail over,\n  throttle the offender — restore service before root-causing.","html":"<h2 id=\"decision-frameworks\">Decision Frameworks</h2>\n<ul>\n<li><strong>RPO/RTO-driven backup design.</strong> Start from the business answer — &quot;we can lose\n5 minutes, be down 1 hour&quot; — and work back: WAL archiving plus replicas for tight\nRPO, nightly dumps for loose.</li>\n<li><strong>Index decision: cost the writes, not just the reads.</strong> Add an index when a\nfrequent slow query needs it and the write penalty is acceptable; drop indexes\nabsent from the plan cache. Composite order follows the predicates: equality\nbefore range, most selective first.</li>\n<li><strong>Normalize first, denormalize on evidence.</strong> Design in third normal form;\ndenormalize only when a profiled read path proves the join is the bottleneck.</li>\n<li><strong>Scale reads with replicas, writes with sharding — reluctantly.</strong> Replicas are\ncheap and safe; sharding fractures transactions and joins, so exhaust vertical\nscaling first.</li>\n<li><strong>Mitigate first during an incident.</strong> Kill the runaway query, fail over,\nthrottle the offender — restore service before root-causing.</li>\n</ul>\n","wordCount":141},{"heading":"Workflow","id":"workflow","markdown":"1. **Model the data.** Design the schema, keys, and constraints from the entities\n   and relationships before code depends on them.\n2. 
**Design for recovery from day one.** Decide RPO/RTO, configure backups and\n   replication, write the restore runbook before going live.\n3. **Provision and baseline.** Size storage, memory, connections; capture baseline\n   metrics so you know what \"normal\" is.\n4. **Index and tune iteratively.** Run the real workload, read the plans, add and\n   remove indexes, refresh statistics, fix slow queries.\n5. **Review every migration.** Check locking, online-ability, and reversibility;\n   stage it; run it in a window with a rollback.\n6. **Monitor.** Replication lag, lock waits, cache hit ratio, slow-query logs,\n   disk and connection saturation, backup success.\n7. **Drill recovery.** Restore to a scratch instance on a schedule; time it\n   against RTO; fix what broke.\n8. **Maintain.** Vacuum/analyze, rebuild bloated indexes, archive cold data,\n   rotate credentials, patch the engine.","html":"<h2 id=\"workflow\">Workflow</h2>\n<ol>\n<li><strong>Model the data.</strong> Design the schema, keys, and constraints from the entities\nand relationships before code depends on them.</li>\n<li><strong>Design for recovery from day one.</strong> Decide RPO/RTO, configure backups and\nreplication, write the restore runbook before going live.</li>\n<li><strong>Provision and baseline.</strong> Size storage, memory, connections; capture baseline\nmetrics so you know what &quot;normal&quot; is.</li>\n<li><strong>Index and tune iteratively.</strong> Run the real workload, read the plans, add and\nremove indexes, refresh statistics, fix slow queries.</li>\n<li><strong>Review every migration.</strong> Check locking, online-ability, and reversibility;\nstage it; run it in a window with a rollback.</li>\n<li><strong>Monitor.</strong> Replication lag, lock waits, cache hit ratio, slow-query logs,\ndisk and connection saturation, backup success.</li>\n<li><strong>Drill recovery.</strong> Restore to a scratch instance on a schedule; time it\nagainst RTO; fix what 
broke.</li>\n<li><strong>Maintain.</strong> Vacuum/analyze, rebuild bloated indexes, archive cold data,\nrotate credentials, patch the engine.</li>\n</ol>\n","wordCount":150},{"heading":"Common Tradeoffs","id":"common-tradeoffs","markdown":"- **Consistency vs. availability and latency.** Synchronous replication gives zero\n  data loss but adds write latency and can stall on a slow replica; async is fast\n  but loses the tail on failover.\n- **Normalization vs. read performance.** Fewer joins read faster but duplicate\n  facts that drift.\n- **Indexes vs. write throughput.** Every index speeds some reads and slows every\n  write.\n- **Isolation strength vs. concurrency.** Serializable is correct but serializes\n  and aborts; read-committed is fast but permits anomalies. Pick the weakest still\n  correct.\n- **Vertical vs. horizontal scaling.** A bigger box keeps transactions intact;\n  sharding scales further but breaks cross-shard joins.\n- **Backup frequency vs. cost and load.** Tighter RPO means more backups, storage,\n  and I/O.","html":"<h2 id=\"common-tradeoffs\">Common Tradeoffs</h2>\n<ul>\n<li><strong>Consistency vs. availability and latency.</strong> Synchronous replication gives zero\ndata loss but adds write latency and can stall on a slow replica; async is fast\nbut loses the tail on failover.</li>\n<li><strong>Normalization vs. read performance.</strong> Fewer joins read faster but duplicate\nfacts that drift.</li>\n<li><strong>Indexes vs. write throughput.</strong> Every index speeds some reads and slows every\nwrite.</li>\n<li><strong>Isolation strength vs. concurrency.</strong> Serializable is correct but serializes\nand aborts; read-committed is fast but permits anomalies. Pick the weakest still\ncorrect.</li>\n<li><strong>Vertical vs. horizontal scaling.</strong> A bigger box keeps transactions intact;\nsharding scales further but breaks cross-shard joins.</li>\n<li><strong>Backup frequency vs. 
cost and load.</strong> Tighter RPO means more backups, storage,\nand I/O.</li>\n</ul>\n","wordCount":113},{"heading":"Rules of Thumb","id":"rules-of-thumb","markdown":"- Restore a backup before you trust it; an untested backup is Schrödinger's data.\n- If a query is slow, read the plan before you touch the query.\n- A full table scan on a large table in a hot path is a bug until proven otherwise.\n- Index the columns in your WHERE and JOIN clauses, not every column.\n- Long-running transactions are poison — they block vacuum, hold locks, bloat.\n- Never run a destructive migration without a tested rollback and fresh backup.\n- An OLTP cache hit ratio below ~99% usually means the working set outgrew RAM.\n- Replication lag is a silent RPO; alert before it becomes data loss.","html":"<h2 id=\"rules-of-thumb\">Rules of Thumb</h2>\n<ul>\n<li>Restore a backup before you trust it; an untested backup is Schrödinger&#39;s data.</li>\n<li>If a query is slow, read the plan before you touch the query.</li>\n<li>A full table scan on a large table in a hot path is a bug until proven otherwise.</li>\n<li>Index the columns in your WHERE and JOIN clauses, not every column.</li>\n<li>Long-running transactions are poison — they block vacuum, hold locks, bloat.</li>\n<li>Never run a destructive migration without a tested rollback and fresh backup.</li>\n<li>An OLTP cache hit ratio below ~99% usually means the working set outgrew RAM.</li>\n<li>Replication lag is a silent RPO; alert before it becomes data loss.</li>\n</ul>\n","wordCount":106},{"heading":"Failure Modes","id":"failure-modes","markdown":"- **The backup that never restored.** Backups ran green for years; the night they\n  were needed, the restore failed.\n- **The catastrophic plan flip.** Statistics went stale, the optimizer chose a\n  nested loop over a hash join, and a 50 ms query became 50 seconds.\n- **Lock contention cascade.** A long transaction holds a lock, queries queue,\n  connections exhaust, and the app falls over from one 
slow writer.\n- **Replication lag masquerading as loss.** An async replica falls hours behind,\n  failover promotes it, and the latest committed data is lost.\n- **Vacuum starvation / transaction-ID wraparound.** Dead tuples never reclaimed,\n  the table bloats, and Postgres forces a shutdown.\n- **The unindexed foreign key.** Deletes on the parent scan the child fully,\n  turning cleanup into an outage.\n- **Silent corruption.** A storage fault writes bad blocks; without checksums they\n  replicate into every backup.","html":"<h2 id=\"failure-modes\">Failure Modes</h2>\n<ul>\n<li><strong>The backup that never restored.</strong> Backups ran green for years; the night they\nwere needed, the restore failed.</li>\n<li><strong>The catastrophic plan flip.</strong> Statistics went stale, the optimizer chose a\nnested loop over a hash join, and a 50 ms query became 50 seconds.</li>\n<li><strong>Lock contention cascade.</strong> A long transaction holds a lock, queries queue,\nconnections exhaust, and the app falls over from one slow writer.</li>\n<li><strong>Replication lag masquerading as loss.</strong> An async replica falls hours behind,\nfailover promotes it, and the latest committed data is lost.</li>\n<li><strong>Vacuum starvation / transaction-ID wraparound.</strong> Dead tuples never reclaimed,\nthe table bloats, and Postgres forces a shutdown.</li>\n<li><strong>The unindexed foreign key.</strong> Deletes on the parent scan the child fully,\nturning cleanup into an outage.</li>\n<li><strong>Silent corruption.</strong> A storage fault writes bad blocks; without checksums they\nreplicate into every backup.</li>\n</ul>\n","wordCount":135},{"heading":"Anti-patterns","id":"anti-patterns","markdown":"- **SELECT \\*** in production — unused columns, breaking on schema change.\n- **N+1 queries** — the ORM firing one query per row instead of a join.\n- **Integrity enforced only in application code** — constraints the database\n  doesn't know about, so data drifts into 
impossible states.\n- **EAV for everything** — a database inside the database the optimizer can't\n  reason about.\n- **Indexing blindly** — an index per slow query without reading the plan.\n- **Shared admin credentials** — everyone superuser, so audit and least privilege\n  are fiction.\n- **DELETE/UPDATE without a WHERE check** — the unbounded statement that takes the\n  whole table.","html":"<h2 id=\"anti-patterns\">Anti-patterns</h2>\n<ul>\n<li><strong>SELECT *</strong> in production — unused columns, breaking on schema change.</li>\n<li><strong>N+1 queries</strong> — the ORM firing one query per row instead of a join.</li>\n<li><strong>Integrity enforced only in application code</strong> — constraints the database\ndoesn&#39;t know about, so data drifts into impossible states.</li>\n<li><strong>EAV for everything</strong> — a database inside the database the optimizer can&#39;t\nreason about.</li>\n<li><strong>Indexing blindly</strong> — an index per slow query without reading the plan.</li>\n<li><strong>Shared admin credentials</strong> — everyone superuser, so audit and least privilege\nare fiction.</li>\n<li><strong>DELETE/UPDATE without a WHERE check</strong> — the unbounded statement that takes the\nwhole table.</li>\n</ul>\n","wordCount":91},{"heading":"Vocabulary","id":"vocabulary","markdown":"- **ACID** — atomicity, consistency, isolation, durability.\n- **MVCC** — multi-version concurrency control; readers see a snapshot without\n  blocking writers.\n- **WAL / redo log** — write-ahead log; the durable record of changes for crash\n  recovery and replication.\n- **RPO / RTO** — recovery point objective (data loss tolerated) / recovery time\n  objective (downtime tolerated).\n- **Query plan** — the strategy the optimizer chose (scans, seeks, joins, sorts).\n- **Vacuum** — reclaiming dead row versions and refreshing statistics under MVCC.\n- **Sharding** — partitioning data across servers to scale writes.\n- **Isolation level** — how much concurrent 
transactions see of each other.\n- **Bloat** — wasted space from dead tuples or fragmented indexes.\n- **Failover** — promoting a replica to primary when the primary fails.","html":"<h2 id=\"vocabulary\">Vocabulary</h2>\n<ul>\n<li><strong>ACID</strong> — atomicity, consistency, isolation, durability.</li>\n<li><strong>MVCC</strong> — multi-version concurrency control; readers see a snapshot without\nblocking writers.</li>\n<li><strong>WAL / redo log</strong> — write-ahead log; the durable record of changes for crash\nrecovery and replication.</li>\n<li><strong>RPO / RTO</strong> — recovery point objective (data loss tolerated) / recovery time\nobjective (downtime tolerated).</li>\n<li><strong>Query plan</strong> — the strategy the optimizer chose (scans, seeks, joins, sorts).</li>\n<li><strong>Vacuum</strong> — reclaiming dead row versions and refreshing statistics under MVCC.</li>\n<li><strong>Sharding</strong> — partitioning data across servers to scale writes.</li>\n<li><strong>Isolation level</strong> — how much concurrent transactions see of each other.</li>\n<li><strong>Bloat</strong> — wasted space from dead tuples or fragmented indexes.</li>\n<li><strong>Failover</strong> — promoting a replica to primary when the primary fails.</li>\n</ul>\n","wordCount":104},{"heading":"Tools","id":"tools","markdown":"- **The engine and its CLI** — PostgreSQL (psql), MySQL/MariaDB, Oracle, SQL\n  Server (sqlcmd); deep knowledge of one engine's internals is the craft.\n- **Plan and profiling** — EXPLAIN/EXPLAIN ANALYZE, pg_stat_statements,\n  slow-query logs, wait-event views.\n- **Backup and recovery** — pg_dump/pg_basebackup, WAL archiving, Percona\n  XtraBackup, point-in-time recovery, the restore drill.\n- **Replication and HA** — streaming replication, Patroni, Galera, Always On,\n  read replicas, failover orchestration.\n- **Monitoring** — Prometheus exporters, pgAdmin, pganalyze, Datadog DBM for\n  cache ratios, lag, locks, saturation.\n- **Schema management** — 
Flyway, Liquibase, sqitch for versioned, reversible\n  migrations.","html":"<h2 id=\"tools\">Tools</h2>\n<ul>\n<li><strong>The engine and its CLI</strong> — PostgreSQL (psql), MySQL/MariaDB, Oracle, SQL\nServer (sqlcmd); deep knowledge of one engine&#39;s internals is the craft.</li>\n<li><strong>Plan and profiling</strong> — EXPLAIN/EXPLAIN ANALYZE, pg_stat_statements,\nslow-query logs, wait-event views.</li>\n<li><strong>Backup and recovery</strong> — pg_dump/pg_basebackup, WAL archiving, Percona\nXtraBackup, point-in-time recovery, the restore drill.</li>\n<li><strong>Replication and HA</strong> — streaming replication, Patroni, Galera, Always On,\nread replicas, failover orchestration.</li>\n<li><strong>Monitoring</strong> — Prometheus exporters, pgAdmin, pganalyze, Datadog DBM for\ncache ratios, lag, locks, saturation.</li>\n<li><strong>Schema management</strong> — Flyway, Liquibase, sqitch for versioned, reversible\nmigrations.</li>\n</ul>\n","wordCount":90},{"heading":"Collaboration","id":"collaboration","markdown":"A DBA sits between the application and its irreplaceable state, both guardian and,\nto impatient developers, gatekeeper. With backend engineers the work is schema and\nquery review: catching the N+1, the missing index, the migration that locks a\nbillion-row table. With SREs they share data-tier on-call and the RPO/RTO\nconversation; with security engineers, access control, encryption, and audit; with\ndata engineers, the OLTP-to-OLAP handoff. The recurring tension is velocity versus\nsafety; the DBA's job is to make the safe path the fast path: online migrations,\ngood defaults, guardrails.","html":"<h2 id=\"collaboration\">Collaboration</h2>\n<p>A DBA sits between the application and its irreplaceable state, both guardian and,\nto impatient developers, gatekeeper. With backend engineers the work is schema and\nquery review: catching the N+1, the missing index, the migration that locks a\nbillion-row table. 
With SREs they share data-tier on-call and the RPO/RTO\nconversation; with security engineers, access control, encryption, and audit; with\ndata engineers, the OLTP-to-OLAP handoff. The recurring tension is velocity versus\nsafety; the DBA&#39;s job is to make the safe path the fast path: online migrations,\ngood defaults, guardrails.</p>\n","wordCount":95},{"heading":"Ethics","id":"ethics","markdown":"The DBA holds the most sensitive thing an organization owns: the personal,\nfinancial, and medical records of people who never chose to trust them. The duties\nfollow: enforce least privilege so no one, the DBA included, has more access than\ntheir job needs; encrypt sensitive data at rest and in transit, keys separate;\nhonor retention and deletion law — when a person exercises the right to be\nforgotten, the data must actually be gone, including from backups; never run \"quick\nlookups\" on production personal data; tell the truth about a breach. The power to\nquery every row is the power to abuse it.","html":"<h2 id=\"ethics\">Ethics</h2>\n<p>The DBA holds the most sensitive thing an organization owns: the personal,\nfinancial, and medical records of people who never chose to trust them. The duties\nfollow: enforce least privilege so no one, the DBA included, has more access than\ntheir job needs; encrypt sensitive data at rest and in transit, keys separate;\nhonor retention and deletion law — when a person exercises the right to be\nforgotten, the data must actually be gone, including from backups; never run &quot;quick\nlookups&quot; on production personal data; tell the truth about a breach. The power to\nquery every row is the power to abuse it.</p>\n","wordCount":102},{"heading":"Scenarios","id":"scenarios","markdown":"**The query that was fast yesterday.** A reporting query that ran in 80 ms starts\ntaking 40 seconds across the fleet overnight; the app times out. The novice\nrewrites the SQL. 
The expert runs EXPLAIN ANALYZE and sees the plan flip from an\nindex seek to a full scan with a wildly wrong row estimate — the table grew past a\nthreshold and statistics went stale after a bulk load, so the optimizer picks the\nwrong join. The fix: ANALYZE the table, watch the plan revert, schedule a stats\nrefresh after bulk loads. The query was never the problem; the stats were.\n\n**A 3 a.m. primary failure.** The primary's storage fails. The on-call DBA's first\nmove is recovery, not diagnosis: fail over to the synchronous replica, confirm the\napp reconnects, verify the last committed transaction is present (synchronous means\nzero loss). Service is back in minutes; only then do they investigate the dead\nprimary. The real finding: monitoring hadn't alerted on the disk errors that\npreceded the failure by days — the fix is SMART/checksum alerting.\n\n**Adding a NOT NULL column to a billion-row table.** Run naively, the migration\nrewrites every row, holds an exclusive lock, and takes the app down for an hour.\nThe DBA does it online: add the column nullable (instant in modern Postgres),\nbackfill in throttled batches, then validate — zero downtime.","html":"<h2 id=\"scenarios\">Scenarios</h2>\n<p><strong>The query that was fast yesterday.</strong> A reporting query that ran in 80 ms starts\ntaking 40 seconds across the fleet overnight; the app times out. The novice\nrewrites the SQL. The expert runs EXPLAIN ANALYZE and sees the plan flip from an\nindex seek to a full scan with a wildly wrong row estimate — the table grew past a\nthreshold and statistics went stale after a bulk load, so the optimizer picks the\nwrong join. The fix: ANALYZE the table, watch the plan revert, schedule a stats\nrefresh after bulk loads. The query was never the problem; the stats were.</p>\n<p><strong>A 3 a.m. primary failure.</strong> The primary&#39;s storage fails. 
The on-call DBA&#39;s first\nmove is recovery, not diagnosis: fail over to the synchronous replica, confirm the\napp reconnects, verify the last committed transaction is present (synchronous means\nzero loss). Service is back in minutes; only then do they investigate the dead\nprimary. The real finding: monitoring hadn&#39;t alerted on the disk errors that\npreceded the failure by days — the fix is SMART/checksum alerting.</p>\n<p><strong>Adding a NOT NULL column to a billion-row table.</strong> Run naively, the migration\nrewrites every row, holds an exclusive lock, and takes the app down for an hour.\nThe DBA does it online: add the column nullable (instant in modern Postgres),\nbackfill in throttled batches, then validate — zero downtime.</p>\n","wordCount":227},{"heading":"Related Occupations","id":"related-occupations","markdown":"The database administrator shares the irreplaceable-state mindset of the site\nreliability engineer but specializes in the data tier — a specialization SREs\noften grow into as \"database reliability engineering.\" Backend engineers write the\nqueries and schemas the DBA reviews and tunes. Data engineers take the same data\ndownstream into analytical pipelines, reasoning in throughput rather than\ntransactional correctness. Security engineers share access control, encryption, and\naudit; systems administrators manage the hosts and storage.","html":"<h2 id=\"related-occupations\">Related Occupations</h2>\n<p>The database administrator shares the irreplaceable-state mindset of the site\nreliability engineer but specializes in the data tier — a specialization SREs\noften grow into as &quot;database reliability engineering.&quot; Backend engineers write the\nqueries and schemas the DBA reviews and tunes. Data engineers take the same data\ndownstream into analytical pipelines, reasoning in throughput rather than\ntransactional correctness. 
Security engineers share access control, encryption, and\naudit; systems administrators manage the hosts and storage.</p>\n","wordCount":73},{"heading":"References","id":"references","markdown":"- *Database Internals* — Alex Petrov\n- *Designing Data-Intensive Applications* — Martin Kleppmann\n- *SQL Performance Explained* — Markus Winand\n- *Database System Concepts* — Silberschatz, Korth, Sudarshan\n- *The Art of PostgreSQL* — Dimitri Fontaine","html":"<h2 id=\"references\">References</h2>\n<ul>\n<li><em>Database Internals</em> — Alex Petrov</li>\n<li><em>Designing Data-Intensive Applications</em> — Martin Kleppmann</li>\n<li><em>SQL Performance Explained</em> — Markus Winand</li>\n<li><em>Database System Concepts</em> — Silberschatz, Korth, Sudarshan</li>\n<li><em>The Art of PostgreSQL</em> — Dimitri Fontaine</li>\n</ul>\n","wordCount":27}],"computed":{"wordCount":2129,"readingTimeMinutes":9,"completeness":1,"backlinks":["backend-engineer","data-analyst","data-engineer","medical-records-technician","systems-administrator"],"verified":false,"aiDrafted":true,"unverifiedAiDraft":true},"git":{"created":"2026-06-26","updated":"2026-06-26","revisions":1,"authors":[{"name":"soul-atlas","commits":1}],"timeline":[{"date":"2026-06-26","author":"soul-atlas"}]},"citation":{"apa":"soul-atlas (2026). Database Administrator [SOUL]. SOUL Atlas. https://soul-atlas.github.io/occupations/database-administrator","bibtex":"@misc{soulatlas-database-administrator,\n  title        = {Database Administrator},\n  author       = {soul-atlas},\n  year         = {2026},\n  howpublished = {SOUL Atlas},\n  note         = {SOUL.md, version 2026-06-26},\n  url          = {https://soul-atlas.github.io/occupations/database-administrator}\n}","text":"soul-atlas. \"Database Administrator.\" SOUL Atlas, 2026. https://soul-atlas.github.io/occupations/database-administrator."}}