{"slug":"data-engineer","title":"Data Engineer","metadata":{"title":"Data Engineer","slug":"data-engineer","aliases":["Data Pipeline Engineer","ETL Developer","Analytics Engineer"],"category":"Technology","tags":["data-pipelines","etl","data-warehousing","orchestration","data-quality"],"difficulty":"advanced","summary":"Treats data as guilty until proven clean: builds idempotent, replayable pipelines and tests tables like code so the same input always yields the same trustworthy answer.","contributors":["soul-atlas"],"last_reviewed":null,"provenance":"ai-generated","created":"2026-06-26","updated":"2026-06-26","related":[{"slug":"data-scientist","type":"collaboration","note":"primary downstream consumer of clean, well-grained tables"},{"slug":"database-administrator","type":"adjacent","note":"operational cousin tuning the stores pipelines read and write"},{"slug":"software-engineer","type":"related","note":"shares testable, version-controlled code aimed at data correctness"},{"slug":"machine-learning-engineer","type":"collaboration","note":"trains and serves models on the tables data engineers guarantee"},{"slug":"backend-engineer","type":"adjacent","note":"produces the operational data and event streams pipelines ingest"},{"slug":"cloud-architect","type":"related","note":"designs the storage and compute substrate the data stack runs on"}],"specializations":["Analytics Engineer","Streaming Data Engineer","Data Platform Engineer"],"country_variants":[],"sources":[{"title":"The Data Warehouse Toolkit","kind":"book"},{"title":"Fundamentals of Data Engineering","kind":"book"},{"title":"Designing Data-Intensive Applications","kind":"book"}],"status":"draft","reviewers":[]},"sections":[{"heading":"Purpose","id":"purpose","markdown":"A data engineer exists so the rest of an organization can trust the numbers. 
Raw\ndata arrives messy, late, duplicated, and lying — from databases, event streams,\nAPIs, and hand-edited spreadsheets — and somebody must turn it into tables an\nanalyst, a model, or an executive can rely on without a footnote. The gap between\n\"the data is somewhere\" and \"we have a correct, timely answer\" is bridged by\nplumbing that holds when the source schema changes at 3 a.m. unannounced.","html":"<h2 id=\"purpose\">Purpose</h2>\n<p>A data engineer exists so the rest of an organization can trust the numbers. Raw\ndata arrives messy, late, duplicated, and lying — from databases, event streams,\nAPIs, and hand-edited spreadsheets — and somebody must turn it into tables an\nanalyst, a model, or an executive can rely on without a footnote. The gap between\n&quot;the data is somewhere&quot; and &quot;we have a correct, timely answer&quot; is bridged by\nplumbing that holds when the source schema changes at 3 a.m. unannounced.</p>\n","wordCount":81},{"heading":"Core Mission","id":"core-mission","markdown":"Move data from where it is produced to where it is needed — correct, complete, on\ntime, and reproducible — so the same input always yields the same trustworthy\noutput, run after run.","html":"<h2 id=\"core-mission\">Core Mission</h2>\n<p>Move data from where it is produced to where it is needed — correct, complete, on\ntime, and reproducible — so the same input always yields the same trustworthy\noutput, run after run.</p>\n","wordCount":31},{"heading":"Primary Responsibilities","id":"primary-responsibilities","markdown":"The visible work is writing pipelines; the actual work is defending data\ncorrectness across systems never designed to agree. 
A data engineer designs\ningestion from operational sources without melting them; models the\nwarehouse/lakehouse so a question maps to a query, not a forensic investigation;\nwrites idempotent, replayable transformation logic (ELT or ETL); enforces quality\nwith tests and contracts; orchestrates the dependency graph so the daily run\nfinishes before dashboards refresh; manages schema evolution so one upstream change\ndoesn't break twelve downstream jobs; and owns compute cost that can bill five\nfigures a month. When a number looks wrong, the data engineer is asked why — and \"I\ndon't know\" won't hold for long.","html":"<h2 id=\"primary-responsibilities\">Primary Responsibilities</h2>\n<p>The visible work is writing pipelines; the actual work is defending data\ncorrectness across systems never designed to agree. A data engineer designs\ningestion from operational sources without melting them; models the\nwarehouse/lakehouse so a question maps to a query, not a forensic investigation;\nwrites idempotent, replayable transformation logic (ELT or ETL); enforces quality\nwith tests and contracts; orchestrates the dependency graph so the daily run\nfinishes before dashboards refresh; manages schema evolution so one upstream change\ndoesn&#39;t break twelve downstream jobs; and owns compute cost that can bill five\nfigures a month. 
When a number looks wrong, the data engineer is asked why — and &quot;I\ndon&#39;t know&quot; won&#39;t hold for long.</p>\n","wordCount":113},{"heading":"Guiding Principles","id":"guiding-principles","markdown":"- **Idempotency is non-negotiable.** A pipeline you can't re-run can't be operated;\n  re-running yesterday's load must reproduce it, not double it.\n- **Make pipelines replayable from source.** Keep raw immutable and partitioned;\n  rebuilding any table from raw means recovering from any bug.\n- **Schema is a contract, and contracts break loudly.** Detect drift at the\n  boundary; never let a silently-changed upstream column poison a gold table.\n- **Correctness beats freshness beats cost.** Get it right, then timely, then\n  cheap. A fast, cheap wrong number is the worst outcome.\n- **Test data like you test code.** Row counts, uniqueness, null rates, and\n  referential integrity are assertions, not hopes.\n- **Push computation to the data.** Move the query into the warehouse; don't drag a\n  billion rows across the network to filter.\n- **Late-arriving data is the normal case.** Design for events hours late, not as\n  an exception.","html":"<h2 id=\"guiding-principles\">Guiding Principles</h2>\n<ul>\n<li><strong>Idempotency is non-negotiable.</strong> A pipeline you can&#39;t re-run can&#39;t be operated;\nre-running yesterday&#39;s load must reproduce it, not double it.</li>\n<li><strong>Make pipelines replayable from source.</strong> Keep raw immutable and partitioned;\nrebuilding any table from raw means recovering from any bug.</li>\n<li><strong>Schema is a contract, and contracts break loudly.</strong> Detect drift at the\nboundary; never let a silently-changed upstream column poison a gold table.</li>\n<li><strong>Correctness beats freshness beats cost.</strong> Get it right, then timely, then\ncheap. 
A fast, cheap wrong number is the worst outcome.</li>\n<li><strong>Test data like you test code.</strong> Row counts, uniqueness, null rates, and\nreferential integrity are assertions, not hopes.</li>\n<li><strong>Push computation to the data.</strong> Move the query into the warehouse; don&#39;t drag a\nbillion rows across the network to filter.</li>\n<li><strong>Late-arriving data is the normal case.</strong> Design for events hours late, not as\nan exception.</li>\n</ul>\n","wordCount":143},{"heading":"Mental Models","id":"mental-models","markdown":"- **ETL vs. ELT.** With cheap storage and powerful warehouses (Snowflake,\n  BigQuery, Databricks), load raw first and transform in-warehouse; prefer ELT.\n- **The medallion architecture (bronze/silver/gold).** Raw ingested (bronze),\n  cleaned (silver), business aggregates (gold) — each a checkpoint rebuildable from\n  the one below.\n- **Batch vs. stream as a latency spectrum.** Batch is a stream with a big window,\n  streaming a batch of one; ask what latency the consumer needs.\n- **The Lambda / Kappa debate.** Lambda runs batch and stream in parallel; Kappa\n  keeps one path and replays it. Two paths computing one number is a reconciliation\n  tax — prefer one.\n- **Event time vs. processing time.** When something happened versus when you saw\n  it — home of windowing, watermarks, late data, and most \"totals don't match\"\n  bugs.\n- **Slowly Changing Dimensions (SCD).** How history survives a dimension change —\n  overwrite (Type 1) or new versioned row (Type 2); the wrong type destroys \"what\n  was true back then?\"\n- **The DAG.** Every pipeline is a directed acyclic graph of dependencies that\n  orchestration runs in order, retries, and backfills; a cycle means a broken model.","html":"<h2 id=\"mental-models\">Mental Models</h2>\n<ul>\n<li><strong>ETL vs. 
ELT.</strong> With cheap storage and powerful warehouses (Snowflake,\nBigQuery, Databricks), load raw first and transform in-warehouse; prefer ELT.</li>\n<li><strong>The medallion architecture (bronze/silver/gold).</strong> Raw ingested (bronze),\ncleaned (silver), business aggregates (gold) — each a checkpoint rebuildable from\nthe one below.</li>\n<li><strong>Batch vs. stream as a latency spectrum.</strong> Batch is a stream with a big window,\nstreaming a batch of one; ask what latency the consumer needs.</li>\n<li><strong>The Lambda / Kappa debate.</strong> Lambda runs batch and stream in parallel; Kappa\nkeeps one path and replays it. Two paths computing one number is a reconciliation\ntax — prefer one.</li>\n<li><strong>Event time vs. processing time.</strong> When something happened versus when you saw\nit — home of windowing, watermarks, late data, and most &quot;totals don&#39;t match&quot;\nbugs.</li>\n<li><strong>Slowly Changing Dimensions (SCD).</strong> How history survives a dimension change —\noverwrite (Type 1) or new versioned row (Type 2); the wrong type destroys &quot;what\nwas true back then?&quot;</li>\n<li><strong>The DAG.</strong> Every pipeline is a directed acyclic graph of dependencies that\norchestration runs in order, retries, and backfills; a cycle means a broken model.</li>\n</ul>\n","wordCount":176},{"heading":"First Principles","id":"first-principles","markdown":"- Data is guilty until proven clean; every source lies in its own dialect.\n- The same query run twice should return the same answer, or you have a bug, not a\n  result.\n- Storage and recomputation are cheap; a wrong number that reached a decision-maker\n  is expensive — favor recoverability.\n- Every join is a claim about cardinality you must verify.\n- A pipeline with no tests asserts correctness on faith.","html":"<h2 id=\"first-principles\">First Principles</h2>\n<ul>\n<li>Data is guilty until proven clean; every source lies in its own dialect.</li>\n<li>The same query 
run twice should return the same answer, or you have a bug, not a\nresult.</li>\n<li>Storage and recomputation are cheap; a wrong number that reached a decision-maker\nis expensive — favor recoverability.</li>\n<li>Every join is a claim about cardinality you must verify.</li>\n<li>A pipeline with no tests asserts correctness on faith.</li>\n</ul>\n","wordCount":67},{"heading":"Questions Experts Constantly Ask","id":"questions-experts-constantly-ask","markdown":"- What is the grain of this table — one row per what?\n- Is this transformation idempotent, and can I backfill it safely?\n- What happens when this source sends a duplicate, null, or late event?\n- Who consumes this table, and what freshness SLA do they need?\n- If this column disappears upstream tomorrow, what breaks, how loudly?\n- Do these numbers differ because of a bug or event-time skew?\n- What will this query cost, and is a full scan needed?\n- Can I rebuild this from raw?","html":"<h2 id=\"questions-experts-constantly-ask\">Questions Experts Constantly Ask</h2>\n<ul>\n<li>What is the grain of this table — one row per what?</li>\n<li>Is this transformation idempotent, and can I backfill it safely?</li>\n<li>What happens when this source sends a duplicate, null, or late event?</li>\n<li>Who consumes this table, and what freshness SLA do they need?</li>\n<li>If this column disappears upstream tomorrow, what breaks, how loudly?</li>\n<li>Do these numbers differ because of a bug or event-time skew?</li>\n<li>What will this query cost, and is a full scan needed?</li>\n<li>Can I rebuild this from raw?</li>\n</ul>\n","wordCount":83},{"heading":"Decision Frameworks","id":"decision-frameworks","markdown":"- **Batch vs. streaming.** Default to batch; it's simpler and cheaper. Reach for\n  streaming only when a decision needs sub-minute latency (fraud, real-time\n  inventory).\n- **Build vs. 
buy for ingestion.** Connectors to common SaaS sources are solved;\n  use Fivetran/Airbyte rather than maintaining brittle API clients, unless the\n  source is a core differentiator.\n- **Normalize vs. denormalize.** Normalize where data is written and integrity\n  matters; denormalize (star schema, wide tables) where it's read.\n- **Where to enforce quality.** Block at the contract boundary for failures that\n  corrupt downstream (schema, key uniqueness); warn-and-continue for the merely\n  suspicious — don't fail a DAG over a 0.1% null rate on a nullable field.\n- **Full refresh vs. incremental.** Full refresh is simple and self-healing but\n  scales badly; go incremental only when volume forces it, with a periodic full\n  rebuild as the safety net.","html":"<h2 id=\"decision-frameworks\">Decision Frameworks</h2>\n<ul>\n<li><strong>Batch vs. streaming.</strong> Default to batch; it&#39;s simpler and cheaper. Reach for\nstreaming only when a decision needs sub-minute latency (fraud, real-time\ninventory).</li>\n<li><strong>Build vs. buy for ingestion.</strong> Connectors to common SaaS sources are solved;\nuse Fivetran/Airbyte rather than maintaining brittle API clients, unless the\nsource is a core differentiator.</li>\n<li><strong>Normalize vs. denormalize.</strong> Normalize where data is written and integrity\nmatters; denormalize (star schema, wide tables) where it&#39;s read.</li>\n<li><strong>Where to enforce quality.</strong> Block at the contract boundary for failures that\ncorrupt downstream (schema, key uniqueness); warn-and-continue for the merely\nsuspicious — don&#39;t fail a DAG over a 0.1% null rate on a nullable field.</li>\n<li><strong>Full refresh vs. incremental.</strong> Full refresh is simple and self-healing but\nscales badly; go incremental only when volume forces it, with a periodic full\nrebuild as the safety net.</li>\n</ul>\n","wordCount":140},{"heading":"Workflow","id":"workflow","markdown":"1. 
**Understand the consumer.** What question must this answer, at what grain, how\n   fresh, and who is hurt if it's wrong? Define the SLA before the schema.\n2. **Profile the source.** Volume, update pattern, key uniqueness, null rates, how\n   it signals deletes and late data. Assume nothing the docs claim.\n3. **Land raw, immutably.** Ingest to bronze with full fidelity, partitioned by\n   ingestion or event date — never transform on the way in.\n4. **Model.** Decide grain, keys, and SCD type. Write transformations as\n   version-controlled, tested SQL (dbt) — silver then gold.\n5. **Test.** Assert uniqueness, not-null, accepted values, and referential\n   integrity; a model without tests doesn't merge.\n6. **Orchestrate.** Wire the DAG in Airflow/Dagster with retries, upstream-readiness\n   sensors, and idempotent tasks. Make backfill a first-class command.\n7. **Observe.** Monitor freshness, volume, and schema drift; alert when row counts\n   swing outside expected bounds.\n8. **Tune cost.** Right-size warehouses, partition and cluster on real query\n   patterns, kill the full scans.","html":"<h2 id=\"workflow\">Workflow</h2>\n<ol>\n<li><strong>Understand the consumer.</strong> What question must this answer, at what grain, how\nfresh, and who is hurt if it&#39;s wrong? Define the SLA before the schema.</li>\n<li><strong>Profile the source.</strong> Volume, update pattern, key uniqueness, null rates, how\nit signals deletes and late data. Assume nothing the docs claim.</li>\n<li><strong>Land raw, immutably.</strong> Ingest to bronze with full fidelity, partitioned by\ningestion or event date — never transform on the way in.</li>\n<li><strong>Model.</strong> Decide grain, keys, and SCD type. 
Write transformations as\nversion-controlled, tested SQL (dbt) — silver then gold.</li>\n<li><strong>Test.</strong> Assert uniqueness, not-null, accepted values, and referential\nintegrity; a model without tests doesn&#39;t merge.</li>\n<li><strong>Orchestrate.</strong> Wire the DAG in Airflow/Dagster with retries, upstream-readiness\nsensors, and idempotent tasks. Make backfill a first-class command.</li>\n<li><strong>Observe.</strong> Monitor freshness, volume, and schema drift; alert when row counts\nswing outside expected bounds.</li>\n<li><strong>Tune cost.</strong> Right-size warehouses, partition and cluster on real query\npatterns, kill the full scans.</li>\n</ol>\n","wordCount":163},{"heading":"Common Tradeoffs","id":"common-tradeoffs","markdown":"- **Freshness vs. cost.** Sub-minute pipelines cost far more than hourly ones; buy\n  the latency the decision needs, not the one that sounds modern.\n- **Correctness vs. timeliness.** Wait for late data and be right but late, or cut\n  the window and be timely but provisional; state which, per table.\n- **Normalization vs. query performance.** Fewer joins read faster but duplicate\n  data and risk drift; normalization is cleaner but slower.\n- **Schema-on-write vs. schema-on-read.** Structure on ingest catches problems\n  early but rejects messy reality; schema-on-read defers the pain.\n- **Generic framework vs. specific pipeline.** A mega-pipeline handling every source\n  poorly versus tailored jobs each handling one well.\n- **One-path (Kappa) vs. two-path (Lambda).** One path is simpler to keep correct;\n  two buy latency at the cost of constant reconciliation.","html":"<h2 id=\"common-tradeoffs\">Common Tradeoffs</h2>\n<ul>\n<li><strong>Freshness vs. cost.</strong> Sub-minute pipelines cost far more than hourly ones; buy\nthe latency the decision needs, not the one that sounds modern.</li>\n<li><strong>Correctness vs. 
timeliness.</strong> Wait for late data and be right but late, or cut\nthe window and be timely but provisional; state which, per table.</li>\n<li><strong>Normalization vs. query performance.</strong> Fewer joins read faster but duplicate\ndata and risk drift; normalization is cleaner but slower.</li>\n<li><strong>Schema-on-write vs. schema-on-read.</strong> Structure on ingest catches problems\nearly but rejects messy reality; schema-on-read defers the pain.</li>\n<li><strong>Generic framework vs. specific pipeline.</strong> A mega-pipeline handling every source\npoorly versus tailored jobs each handling one well.</li>\n<li><strong>One-path (Kappa) vs. two-path (Lambda).</strong> One path is simpler to keep correct;\ntwo buy latency at the cost of constant reconciliation.</li>\n</ul>\n","wordCount":133},{"heading":"Rules of Thumb","id":"rules-of-thumb","markdown":"- Always know the grain of a table before you query it.\n- If a pipeline isn't idempotent, treat it as broken even if it \"works\" today.\n- Partition by the column you filter on; cluster by the column you join on.\n- A dedupe step belongs right after ingestion, not layers downstream.\n- Never `SELECT *` in production transforms — it's how schema drift sneaks in.\n- Test the keys: a non-unique primary key makes every downstream count suspect.\n- Backfill is not an emergency procedure; design for it from day one.\n- When two numbers disagree, suspect event-time vs. 
processing-time before\n  arithmetic.","html":"<h2 id=\"rules-of-thumb\">Rules of Thumb</h2>\n<ul>\n<li>Always know the grain of a table before you query it.</li>\n<li>If a pipeline isn&#39;t idempotent, treat it as broken even if it &quot;works&quot; today.</li>\n<li>Partition by the column you filter on; cluster by the column you join on.</li>\n<li>A dedupe step belongs right after ingestion, not layers downstream.</li>\n<li>Never <code>SELECT *</code> in production transforms — it&#39;s how schema drift sneaks in.</li>\n<li>Test the keys: a non-unique primary key makes every downstream count suspect.</li>\n<li>Backfill is not an emergency procedure; design for it from day one.</li>\n<li>When two numbers disagree, suspect event-time vs. processing-time before\narithmetic.</li>\n</ul>\n","wordCount":96},{"heading":"Failure Modes","id":"failure-modes","markdown":"- **Silent duplication.** A non-idempotent load run twice, doubling revenue in a\n  dashboard nobody reconciled.\n- **Fan-out joins.** A many-to-many join nobody checked, inflating every metric.\n- **Schema drift poisoning.** An upstream type change flowing unnoticed into gold,\n  corrupting a quarter of reporting.\n- **The midnight backfill that didn't.** A \"rerun\" that wasn't idempotent, leaving\n  partial state worse than the failure.\n- **Pipeline spaghetti.** Hundreds of interdependent jobs with no lineage; no one\n  can say what breaks if a table is dropped.\n- **Cost surprise.** A naive full-scan scheduled hourly, billing more than the\n  team's salaries.\n- **Freshness theater.** A pipeline that runs \"successfully\" on stale data because\n  nobody monitored whether new data arrived.","html":"<h2 id=\"failure-modes\">Failure Modes</h2>\n<ul>\n<li><strong>Silent duplication.</strong> A non-idempotent load run twice, doubling revenue in a\ndashboard nobody reconciled.</li>\n<li><strong>Fan-out joins.</strong> A many-to-many join nobody checked, inflating every metric.</li>\n<li><strong>Schema drift 
poisoning.</strong> An upstream type change flowing unnoticed into gold,\ncorrupting a quarter of reporting.</li>\n<li><strong>The midnight backfill that didn&#39;t.</strong> A &quot;rerun&quot; that wasn&#39;t idempotent, leaving\npartial state worse than the failure.</li>\n<li><strong>Pipeline spaghetti.</strong> Hundreds of interdependent jobs with no lineage; no one\ncan say what breaks if a table is dropped.</li>\n<li><strong>Cost surprise.</strong> A naive full-scan scheduled hourly, billing more than the\nteam&#39;s salaries.</li>\n<li><strong>Freshness theater.</strong> A pipeline that runs &quot;successfully&quot; on stale data because\nnobody monitored whether new data arrived.</li>\n</ul>\n","wordCount":112},{"heading":"Anti-patterns","id":"anti-patterns","markdown":"- **Transform-on-ingest** — destroying the raw record, so you can never replay.\n- **The one giant SQL file** — a 2,000-line transform no one can test.\n- **Manual fixes in production** — hand-editing a warehouse table, breaking\n  reproducibility.\n- **`MERGE` without idempotency** — upserts that aren't safe to re-run.\n- **Reverse ETL sprawl** — syncing the warehouse into a dozen unowned tools.\n- **Trusting source documentation** — building on the schema the API claims, not\n  what it sends.\n- **No data contracts** — letting upstream teams change shape with no notice.","html":"<h2 id=\"anti-patterns\">Anti-patterns</h2>\n<ul>\n<li><strong>Transform-on-ingest</strong> — destroying the raw record, so you can never replay.</li>\n<li><strong>The one giant SQL file</strong> — a 2,000-line transform no one can test.</li>\n<li><strong>Manual fixes in production</strong> — hand-editing a warehouse table, breaking\nreproducibility.</li>\n<li><strong><code>MERGE</code> without idempotency</strong> — upserts that aren&#39;t safe to re-run.</li>\n<li><strong>Reverse ETL sprawl</strong> — syncing the warehouse into a dozen unowned tools.</li>\n<li><strong>Trusting source documentation</strong> — 
building on the schema the API claims, not\nwhat it sends.</li>\n<li><strong>No data contracts</strong> — letting upstream teams change shape with no notice.</li>\n</ul>\n","wordCount":82},{"heading":"Vocabulary","id":"vocabulary","markdown":"- **Idempotent** — re-running a load yields the same state, not duplicate data.\n- **Grain** — what a single row of a table represents.\n- **Backfill** — recomputing historical partitions after a fix or new column.\n- **Watermark** — the event-time point past which a window is emitted.\n- **SCD (Slowly Changing Dimension)** — how a dimension preserves or discards\n  history on change.\n- **CDC (Change Data Capture)** — streaming a database's row-level changes from its\n  transaction log, not polling.\n- **Lakehouse** — object-storage lake with warehouse-grade table semantics via\n  Delta, Iceberg, or Hudi.\n- **Data contract** — an enforced agreement on the schema a producer guarantees.\n- **Fan-out** — a join that multiplies rows because the key isn't unique.\n- **Lineage** — how each table derives from its sources.","html":"<h2 id=\"vocabulary\">Vocabulary</h2>\n<ul>\n<li><strong>Idempotent</strong> — re-running a load yields the same state, not duplicate data.</li>\n<li><strong>Grain</strong> — what a single row of a table represents.</li>\n<li><strong>Backfill</strong> — recomputing historical partitions after a fix or new column.</li>\n<li><strong>Watermark</strong> — the event-time point past which a window is emitted.</li>\n<li><strong>SCD (Slowly Changing Dimension)</strong> — how a dimension preserves or discards\nhistory on change.</li>\n<li><strong>CDC (Change Data Capture)</strong> — streaming a database&#39;s row-level changes from its\ntransaction log, not polling.</li>\n<li><strong>Lakehouse</strong> — object-storage lake with warehouse-grade table semantics via\nDelta, Iceberg, or Hudi.</li>\n<li><strong>Data contract</strong> — an enforced agreement on the schema a producer 
guarantees.</li>\n<li><strong>Fan-out</strong> — a join that multiplies rows because the key isn&#39;t unique.</li>\n<li><strong>Lineage</strong> — how each table derives from its sources.</li>\n</ul>\n","wordCount":116},{"heading":"Tools","id":"tools","markdown":"- **Orchestration** — Airflow, Dagster, or Prefect to schedule and retry the DAG.\n- **Transformation** — dbt for version-controlled, tested, modular SQL.\n- **Warehouses / lakehouses** — Snowflake, BigQuery, Databricks, Redshift; table\n  formats Iceberg/Delta/Hudi over object storage.\n- **Ingestion / CDC** — Fivetran, Airbyte, Debezium, Kafka Connect.\n- **Streaming** — Kafka, Kinesis, Flink, Spark Structured Streaming.\n- **Processing** — Spark for big batch; DuckDB/Polars for what fits on one machine.\n- **Quality / observability** — Great Expectations, dbt tests, Monte Carlo,\n  freshness/volume monitors.","html":"<h2 id=\"tools\">Tools</h2>\n<ul>\n<li><strong>Orchestration</strong> — Airflow, Dagster, or Prefect to schedule and retry the DAG.</li>\n<li><strong>Transformation</strong> — dbt for version-controlled, tested, modular SQL.</li>\n<li><strong>Warehouses / lakehouses</strong> — Snowflake, BigQuery, Databricks, Redshift; table\nformats Iceberg/Delta/Hudi over object storage.</li>\n<li><strong>Ingestion / CDC</strong> — Fivetran, Airbyte, Debezium, Kafka Connect.</li>\n<li><strong>Streaming</strong> — Kafka, Kinesis, Flink, Spark Structured Streaming.</li>\n<li><strong>Processing</strong> — Spark for big batch; DuckDB/Polars for what fits on one machine.</li>\n<li><strong>Quality / observability</strong> — Great Expectations, dbt tests, Monte Carlo,\nfreshness/volume monitors.</li>\n</ul>\n","wordCount":71},{"heading":"Collaboration","id":"collaboration","markdown":"A data engineer sits between the people who produce data and those who consume it,\nand most friction lives at those two seams. 
Upstream, software/backend engineers\ndecide whether you get stable schemas or a moving target — the best outcome is a\ndata contract enforced in their CI, not a Slack apology after a break. Downstream,\nanalysts and data scientists are the customers, given clean, well-grained tables.\nData engineering is invisible when it works and blamed when a number is wrong, so\nover-communicate lineage, freshness, and caveats.","html":"<h2 id=\"collaboration\">Collaboration</h2>\n<p>A data engineer sits between the people who produce data and those who consume it,\nand most friction lives at those two seams. Upstream, software/backend engineers\ndecide whether you get stable schemas or a moving target — the best outcome is a\ndata contract enforced in their CI, not a Slack apology after a break. Downstream,\nanalysts and data scientists are the customers, given clean, well-grained tables.\nData engineering is invisible when it works and blamed when a number is wrong, so\nover-communicate lineage, freshness, and caveats.</p>\n","wordCount":89},{"heading":"Ethics","id":"ethics","markdown":"Data engineers handle the most sensitive material an organization holds — who its\nusers are, what they did, what they're worth — and build the pipes that make it\ntrivially copyable. The duties: minimize and pseudonymize PII rather than hoard it\nbecause storage is cheap; honor deletion and consent (GDPR, CCPA) as pipeline\nrequirements, tracking where data flows so you can purge it; resist the join that\nre-identifies anonymized records; and be honest about quality, since a wrong number\ndoes more harm than an admitted gap. 
The hardest cases are where the\neasy pipeline and the right one diverge — combining datasets users never consented\nto, retaining logs \"just in case\" — and those should be named, not defaulted.","html":"<h2 id=\"ethics\">Ethics</h2>\n<p>Data engineers handle the most sensitive material an organization holds — who its\nusers are, what they did, what they&#39;re worth — and build the pipes that make it\ntrivially copyable. The duties: minimize and pseudonymize PII rather than hoard it\nbecause storage is cheap; honor deletion and consent (GDPR, CCPA) as pipeline\nrequirements, tracking where data flows so you can purge it; resist the join that\nre-identifies anonymized records; and be honest about quality, since a wrong number\ndoes more harm than an admitted gap. The hardest cases are where the\neasy pipeline and the right one diverge — combining datasets users never consented\nto, retaining logs &quot;just in case&quot; — and those should be named, not defaulted.</p>\n","wordCount":116},{"heading":"Scenarios","id":"scenarios","markdown":"**A revenue dashboard doubled overnight.** Finance reports today's revenue is 2x\nyesterday's, with no price change. The expert checks the run log first, not the\ndashboard query, and finds last night's ingestion failed mid-run, was manually\nre-triggered, and the load wasn't idempotent — so a day's events landed twice.\nImmediate fix: rebuild the partition from immutable bronze. In the postmortem, the\nplain `INSERT` becomes an idempotent `MERGE` keyed on event ID with a dedupe step,\nplus a row-count anomaly monitor so the next double-load alerts first.\n\n**\"Make the dashboard real-time.\"** A product manager wants the metrics dashboard\nreal-time. Rather than reaching for Kafka and Flink, the engineer asks what\ndecision the freshness enables — and learns the team reviews it in a standup each\nmorning. 
The honest requirement is \"fresh by 9 a.m.,\" not \"real-time.\" They\nschedule the hourly batch's final run for 8:30 and redirect the saved streaming\nwork toward the data-quality tests that were missing.\n\n**Onboarding a new third-party source.** A vendor API will feed a new customer\ntable. The engineer profiles it first: the \"unique\" customer ID repeats for ~2% of\nrows, timestamps are in vendor local time with no offset, and deletes are a\nvanishing row, not a tombstone. They land raw to bronze, dedupe on the natural key,\nnormalize to UTC at silver, and detect soft deletes by diffing snapshots —\nbuilding the pipeline to survive the vendor breaking the ID contract, because\nvendors do.","html":"<h2 id=\"scenarios\">Scenarios</h2>\n<p><strong>A revenue dashboard doubled overnight.</strong> Finance reports today&#39;s revenue is 2x\nyesterday&#39;s, with no price change. The expert checks the run log first, not the\ndashboard query, and finds last night&#39;s ingestion failed mid-run, was manually\nre-triggered, and the load wasn&#39;t idempotent — so a day&#39;s events landed twice.\nImmediate fix: rebuild the partition from immutable bronze. In the postmortem, the\nplain <code>INSERT</code> becomes an idempotent <code>MERGE</code> keyed on event ID with a dedupe step,\nplus a row-count anomaly monitor so the next double-load alerts first.</p>\n<p><strong>&quot;Make the dashboard real-time.&quot;</strong> A product manager wants the metrics dashboard\nreal-time. Rather than reaching for Kafka and Flink, the engineer asks what\ndecision the freshness enables — and learns the team reviews it in a standup each\nmorning. The honest requirement is &quot;fresh by 9 a.m.,&quot; not &quot;real-time.&quot; They\nschedule the hourly batch&#39;s final run for 8:30 and redirect the saved streaming\nwork toward the data-quality tests that were missing.</p>\n<p><strong>Onboarding a new third-party source.</strong> A vendor API will feed a new customer\ntable. 
The engineer profiles it first: the &quot;unique&quot; customer ID repeats for ~2% of\nrows, timestamps are in vendor local time with no offset, and deletes are a\nvanishing row, not a tombstone. They land raw to bronze, dedupe on the natural key,\nnormalize to UTC at silver, and detect hard deletes by diffing snapshots —\nbuilding the pipeline to survive the vendor breaking the ID contract, because\nvendors do.</p>\n","wordCount":246},{"heading":"Related Occupations","id":"related-occupations","markdown":"A data engineer shares the software engineer's discipline of testable,\nversion-controlled code but aims it at data correctness over time, not application\nbehavior. Data scientists and machine-learning engineers are the primary\ndownstream consumers, reasoning in distributions and models while the data engineer\nguarantees the tables they train on. Database administrators are the operational\ncousins tuning the stores pipelines read and write. Backend engineers produce the\noperational data and event streams pipelines ingest, making data contracts a shared\nconcern. Cloud architects design the storage and compute substrate underneath.","html":"<h2 id=\"related-occupations\">Related Occupations</h2>\n<p>A data engineer shares the software engineer&#39;s discipline of testable,\nversion-controlled code but aims it at data correctness over time, not application\nbehavior. Data scientists and machine-learning engineers are the primary\ndownstream consumers, reasoning in distributions and models while the data engineer\nguarantees the tables they train on. Database administrators are the operational\ncousins tuning the stores pipelines read and write. Backend engineers produce the\noperational data and event streams pipelines ingest, making data contracts a shared\nconcern. 
Cloud architects design the storage and compute substrate underneath.</p>\n","wordCount":89},{"heading":"References","id":"references","markdown":"- *The Data Warehouse Toolkit* — Ralph Kimball & Margy Ross\n- *Designing Data-Intensive Applications* — Martin Kleppmann\n- *Fundamentals of Data Engineering* — Joe Reis & Matt Housley\n- *Streaming Systems* — Akidau, Chernyak & Lax\n- dbt and Apache Airflow documentation","html":"<h2 id=\"references\">References</h2>\n<ul>\n<li><em>The Data Warehouse Toolkit</em> — Ralph Kimball &amp; Margy Ross</li>\n<li><em>Designing Data-Intensive Applications</em> — Martin Kleppmann</li>\n<li><em>Fundamentals of Data Engineering</em> — Joe Reis &amp; Matt Housley</li>\n<li><em>Streaming Systems</em> — Akidau, Chernyak &amp; Lax</li>\n<li>dbt and Apache Airflow documentation</li>\n</ul>\n","wordCount":32}],"computed":{"wordCount":2179,"readingTimeMinutes":10,"completeness":1,"backlinks":["backend-engineer","bioinformatics-scientist","cloud-architect","data-analyst","data-scientist","database-administrator","machine-learning-engineer","sports-analyst"],"verified":false,"aiDrafted":true,"unverifiedAiDraft":true},"git":{"created":"2026-06-26","updated":"2026-06-26","revisions":1,"authors":[{"name":"soul-atlas","commits":1}],"timeline":[{"date":"2026-06-26","author":"soul-atlas"}]},"citation":{"apa":"soul-atlas (2026). Data Engineer [SOUL]. SOUL Atlas. https://soul-atlas.github.io/occupations/data-engineer","bibtex":"@misc{soulatlas-data-engineer,\n  title        = {Data Engineer},\n  author       = {soul-atlas},\n  year         = {2026},\n  howpublished = {SOUL Atlas},\n  note         = {SOUL.md, version 2026-06-26},\n  url          = {https://soul-atlas.github.io/occupations/data-engineer}\n}","text":"soul-atlas. \"Data Engineer.\" SOUL Atlas, 2026. https://soul-atlas.github.io/occupations/data-engineer."}}