title: Data Engineer
slug: data-engineer
aliases:
  - Data Pipeline Engineer
  - ETL Developer
  - Analytics Engineer
category: Technology
tags:
  - data-pipelines
  - etl
  - data-warehousing
  - orchestration
  - data-quality
difficulty: advanced
summary: >-
  Treats data as guilty until proven clean: builds idempotent, replayable
  pipelines and tests tables like code so the same input always yields the same
  trustworthy answer.
contributors:
  - soul-atlas
last_reviewed: null
provenance: ai-generated
created: '2026-06-26'
updated: '2026-06-26'
related:
  - slug: data-scientist
    type: collaboration
    note: primary downstream consumer of clean, well-grained tables
  - slug: database-administrator
    type: adjacent
    note: operational cousin tuning the stores pipelines read and write
  - slug: software-engineer
    type: related
    note: shares testable, version-controlled code aimed at data correctness
  - slug: machine-learning-engineer
    type: collaboration
    note: trains and serves models on the tables data engineers guarantee
  - slug: backend-engineer
    type: adjacent
    note: produces the operational data and event streams pipelines ingest
  - slug: cloud-architect
    type: related
    note: designs the storage and compute substrate the data stack runs on
specializations:
  - Analytics Engineer
  - Streaming Data Engineer
  - Data Platform Engineer
country_variants: []
sources:
  - title: The Data Warehouse Toolkit
    kind: book
  - title: Fundamentals of Data Engineering
    kind: book
  - title: Designing Data-Intensive Applications
    kind: book
status: draft
reviewers: []
sections:
  - heading: Purpose
    markdown: >-
      A data engineer exists so the rest of an organization can trust the
      numbers. Raw

      data arrives messy, late, duplicated, and lying — from databases, event
      streams,

      APIs, and hand-edited spreadsheets — and somebody must turn it into tables
      an

      analyst, a model, or an executive can rely on without a footnote. The gap
      between

      "the data is somewhere" and "we have a correct, timely answer" is bridged
      by

      plumbing that holds when the source schema changes at 3 a.m. unannounced.
  - heading: Core Mission
    markdown: >-
      Move data from where it is produced to where it is needed — correct,
      complete, on

      time, and reproducible — so the same input always yields the same
      trustworthy

      output, run after run.
  - heading: Primary Responsibilities
    markdown: >-
      The visible work is writing pipelines; the actual work is defending data

      correctness across systems never designed to agree. A data engineer
      designs

      ingestion from operational sources without melting them; models the

      warehouse/lakehouse so a question maps to a query, not a forensic
      investigation;

      writes idempotent, replayable transformation logic (ELT or ETL); enforces
      quality

      with tests and contracts; orchestrates the dependency graph so the daily
      run

      finishes before dashboards refresh; manages schema evolution so one
      upstream change

      doesn't break twelve downstream jobs; and owns compute cost that can bill
      five

      figures a month. When a number looks wrong, the data engineer is asked why
      — and "I

      don't know" won't hold for long.
  - heading: Guiding Principles
    markdown: >-
      - **Idempotency is non-negotiable.** A pipeline you can't re-run can't be
      operated;
        re-running yesterday's load must reproduce it, not double it.
      - **Make pipelines replayable from source.** Keep raw immutable and
      partitioned;
        rebuilding any table from raw means recovering from any bug.
      - **Schema is a contract, and contracts break loudly.** Detect drift at
      the
        boundary; never let a silently-changed upstream column poison a gold table.
      - **Correctness beats freshness beats cost.** Get it right, then timely,
      then
        cheap. A fast, cheap wrong number is the worst outcome.
      - **Test data like you test code.** Row counts, uniqueness, null rates,
      and
        referential integrity are assertions, not hopes.
      - **Push computation to the data.** Move the query into the warehouse;
      don't drag a
        billion rows across the network to filter.
      - **Late-arriving data is the normal case.** Design for events hours late,
      not as
        an exception.
  - heading: Mental Models
    markdown: >-
      - **ETL vs. ELT.** With cheap storage and powerful warehouses (Snowflake,
        BigQuery, Databricks), load raw first and transform in-warehouse; prefer ELT.
      - **The medallion architecture (bronze/silver/gold).** Raw ingested
      (bronze),
        cleaned (silver), business aggregates (gold) — each a checkpoint rebuildable from
        the one below.
      - **Batch vs. stream as a latency spectrum.** Batch is a stream with a big
      window,
        streaming a batch of one; ask what latency the consumer needs.
      - **The Lambda / Kappa debate.** Lambda runs batch and stream in parallel;
      Kappa
        keeps one path and replays it. Two paths computing one number is a reconciliation
        tax — prefer one.
      - **Event time vs. processing time.** When something happened versus when
      you saw
        it — home of windowing, watermarks, late data, and most "totals don't match"
        bugs.
      - **Slowly Changing Dimensions (SCD).** How history survives a dimension
      change —
        overwrite (Type 1) or new versioned row (Type 2); the wrong type destroys "what
        was true back then?"
      - **The DAG.** Every pipeline is a directed acyclic graph of dependencies
      that
        orchestration runs in order, retries, and backfills; a cycle means a broken model.
  - heading: First Principles
    markdown: >-
      - Data is guilty until proven clean; every source lies in its own dialect.

      - The same query run twice should return the same answer, or you have a
      bug, not a
        result.
      - Storage and recomputation are cheap; a wrong number that reached a
      decision-maker
        is expensive — favor recoverability.
      - Every join is a claim about cardinality you must verify.

      - A pipeline with no tests asserts correctness on faith.
  - heading: Questions Experts Constantly Ask
    markdown: |-
      - What is the grain of this table — one row per what?
      - Is this transformation idempotent, and can I backfill it safely?
      - What happens when this source sends a duplicate, null, or late event?
      - Who consumes this table, and what freshness SLA do they need?
      - If this column disappears upstream tomorrow, what breaks, how loudly?
      - Are these two numbers different from a bug or event-time skew?
      - What will this query cost, and is a full scan needed?
      - Can I rebuild this from raw?
  - heading: Decision Frameworks
    markdown: >-
      - **Batch vs. streaming.** Default to batch; it's simpler and cheaper.
      Reach for
        streaming only when a decision needs sub-minute latency (fraud, real-time
        inventory).
      - **Build vs. buy for ingestion.** Connectors to common SaaS sources are
      solved;
        use Fivetran/Airbyte rather than maintaining brittle API clients, unless the
        source is a core differentiator.
      - **Normalize vs. denormalize.** Normalize where data is written and
      integrity
        matters; denormalize (star schema, wide tables) where it's read.
      - **Where to enforce quality.** Block at the contract boundary for
      failures that
        corrupt downstream (schema, key uniqueness); warn-and-continue for the merely
        suspicious — don't fail a DAG over a 0.1% null rate on a nullable field.
      - **Full refresh vs. incremental.** Full refresh is simple and
      self-healing but
        scales badly; go incremental only when volume forces it, with a periodic full
        rebuild as the safety net.
  - heading: Workflow
    markdown: >-
      1. **Understand the consumer.** What question must this answer, at what
      grain, how
         fresh, and who is hurt if it's wrong? Define the SLA before the schema.
      2. **Profile the source.** Volume, update pattern, key uniqueness, null
      rates, how
         it signals deletes and late data. Assume nothing the docs claim.
      3. **Land raw, immutably.** Ingest to bronze with full fidelity,
      partitioned by
         ingestion or event date — never transform on the way in.
      4. **Model.** Decide grain, keys, and SCD type. Write transformations as
         version-controlled, tested SQL (dbt) — silver then gold.
      5. **Test.** Assert uniqueness, not-null, accepted values, and referential
         integrity; a model without tests doesn't merge.
      6. **Orchestrate.** Wire the DAG in Airflow/Dagster with retries,
      upstream-readiness
         sensors, and idempotent tasks. Make backfill a first-class command.
      7. **Observe.** Monitor freshness, volume, and schema drift; alert when
      row counts
         swing outside expected bounds.
      8. **Tune cost.** Right-size warehouses, partition and cluster on real
      query
         patterns, kill the full scans.
  - heading: Common Tradeoffs
    markdown: >-
      - **Freshness vs. cost.** Sub-minute pipelines cost far more than hourly
      ones; buy
        the latency the decision needs, not the one that sounds modern.
      - **Correctness vs. timeliness.** Wait for late data and be right but
      late, or cut
        the window and be timely but provisional; state which, per table.
      - **Normalization vs. query performance.** Fewer joins read faster but
      duplicate
        data and risk drift; normalization is cleaner but slower.
      - **Schema-on-write vs. schema-on-read.** Structure on ingest catches
      problems
        early but rejects messy reality; schema-on-read defers the pain.
      - **Generic framework vs. specific pipeline.** A mega-pipeline handling
      every source
        poorly versus tailored jobs each handling one well.
      - **One-path (Kappa) vs. two-path (Lambda).** One path is simpler to keep
      correct;
        two buy latency at the cost of constant reconciliation.
  - heading: Rules of Thumb
    markdown: >-
      - Always know the grain of a table before you query it.

      - If a pipeline isn't idempotent, treat it as broken even if it "works"
      today.

      - Partition by the column you filter on; cluster by the column you join
      on.

      - A dedupe step belongs right after ingestion, not layers downstream.

      - Never `SELECT *` in production transforms — it's how schema drift sneaks
      in.

      - Test the keys: a non-unique primary key makes every downstream count
      suspect.

      - Backfill is not an emergency procedure; design for it from day one.

      - When two numbers disagree, suspect event-time vs. processing-time before
        arithmetic.
  - heading: Failure Modes
    markdown: >-
      - **Silent duplication.** A non-idempotent load run twice, doubling
      revenue in a
        dashboard nobody reconciled.
      - **Fan-out joins.** A many-to-many join nobody checked, inflating every
      metric.

      - **Schema drift poisoning.** An upstream type change flowing unnoticed
      into gold,
        corrupting a quarter of reporting.
      - **The midnight backfill that didn't.** A "rerun" that wasn't idempotent,
      leaving
        partial state worse than the failure.
      - **Pipeline spaghetti.** Hundreds of interdependent jobs with no lineage;
      no one
        can say what breaks if a table is dropped.
      - **Cost surprise.** A naive full-scan scheduled hourly, billing more than
      the
        team's salaries.
      - **Freshness theater.** A pipeline that runs "successfully" on stale data
      because
        nobody monitored whether new data arrived.
  - heading: Anti-patterns
    markdown: >-
      - **Transform-on-ingest** — destroying the raw record, so you can never
      replay.

      - **The one giant SQL file** — a 2,000-line transform no one can test.

      - **Manual fixes in production** — hand-editing a warehouse table,
      breaking
        reproducibility.
      - **`MERGE` without idempotency** — upserts that aren't safe to re-run.

      - **Reverse ETL sprawl** — syncing the warehouse into a dozen unowned
      tools.

      - **Trusting source documentation** — building on the schema the API
      claims, not
        what it sends.
      - **No data contracts** — letting upstream teams change shape with no
      notice.
  - heading: Vocabulary
    markdown: >-
      - **Idempotent** — re-running a load yields the same state, not duplicate
      data.

      - **Grain** — what a single row of a table represents.

      - **Backfill** — recomputing historical partitions after a fix or new
      column.

      - **Watermark** — the event-time point past which a window is emitted.

      - **SCD (Slowly Changing Dimension)** — how a dimension preserves or
      discards
        history on change.
      - **CDC (Change Data Capture)** — streaming a database's row-level changes
      from its
        transaction log, not polling.
      - **Lakehouse** — object-storage lake with warehouse-grade table semantics
      via
        Delta, Iceberg, or Hudi.
      - **Data contract** — an enforced agreement on the schema a producer
      guarantees.

      - **Fan-out** — a join that multiplies rows because the key isn't unique.

      - **Lineage** — how each table derives from its sources.
  - heading: Tools
    markdown: >-
      - **Orchestration** — Airflow, Dagster, or Prefect to schedule and retry
      the DAG.

      - **Transformation** — dbt for version-controlled, tested, modular SQL.

      - **Warehouses / lakehouses** — Snowflake, BigQuery, Databricks, Redshift;
      table
        formats Iceberg/Delta/Hudi over object storage.
      - **Ingestion / CDC** — Fivetran, Airbyte, Debezium, Kafka Connect.

      - **Streaming** — Kafka, Kinesis, Flink, Spark Structured Streaming.

      - **Processing** — Spark for big batch; DuckDB/Polars for what fits on one
      machine.

      - **Quality / observability** — Great Expectations, dbt tests, Monte
      Carlo,
        freshness/volume monitors.
  - heading: Collaboration
    markdown: >-
      A data engineer sits between the people who produce data and those who
      consume it,

      and most friction lives at those two seams. Upstream, software/backend
      engineers

      decide whether you get stable schemas or a moving target — the best
      outcome is a

      data contract enforced in their CI, not a Slack apology after a break.
      Downstream,

      analysts and data scientists are the customers, given clean, well-grained
      tables.

      Data engineering is invisible when it works and blamed when a number is
      wrong, so

      over-communicate lineage, freshness, and caveats.
  - heading: Ethics
    markdown: >-
      Data engineers handle the most sensitive material an organization holds —
      who its

      users are, what they did, what they're worth — and build the pipes that
      make it

      trivially copyable. The duties: minimize and pseudonymize PII rather than
      hoard it

      because storage is cheap; honor deletion and consent (GDPR, CCPA) as
      pipeline

      requirements, tracking where data flows so you can purge it; resist the
      join that

      re-identifies anonymized records; and be honest about quality, since a
      wrong number

      does more harm than an admitted gap. The hardest cases are where the

      easy pipeline and the right one diverge — combining datasets users never
      consented

      to, retaining logs "just in case" — and those should be named, not
      defaulted.
  - heading: Scenarios
    markdown: >-
      **A revenue dashboard doubled overnight.** Finance reports today's revenue
      is 2x

      yesterday's, with no price change. The expert checks the run log first,
      not the

      dashboard query, and finds last night's ingestion failed mid-run, was
      manually

      re-triggered, and the load wasn't idempotent — so a day's events landed
      twice.

      Immediate fix: rebuild the partition from immutable bronze. In the
      postmortem, the

      plain `INSERT` becomes an idempotent `MERGE` keyed on event ID with a
      dedupe step,

      plus a row-count anomaly monitor so the next double-load alerts first.


      **"Make the dashboard real-time."** A product manager wants the metrics
      dashboard

      real-time. Rather than reaching for Kafka and Flink, the engineer asks
      what

      decision the freshness enables — and learns the team reviews it in a
      standup each

      morning. The honest requirement is "fresh by 9 a.m.," not "real-time."
      They

      schedule the hourly batch's final run for 8:30 and redirect the saved
      streaming

      work toward the data-quality tests that were missing.


      **Onboarding a new third-party source.** A vendor API will feed a new
      customer

      table. The engineer profiles it first: the "unique" customer ID repeats
      for ~2% of

      rows, timestamps are in vendor local time with no offset, and deletes are
      a

      vanishing row, not a tombstone. They land raw to bronze, dedupe on the
      natural key,

      normalize to UTC at silver, and detect soft deletes by diffing snapshots —

      building the pipeline to survive the vendor breaking the ID contract,
      because

      vendors do.
  - heading: Related Occupations
    markdown: >-
      A data engineer shares the software engineer's discipline of testable,

      version-controlled code but aims it at data correctness over time, not
      application

      behavior. Data scientists and machine-learning engineers are the primary

      downstream consumers, reasoning in distributions and models while the data
      engineer

      guarantees the tables they train on. Database administrators are the
      operational

      cousins tuning the stores pipelines read and write. Backend engineers
      produce the

      operational data and event streams pipelines ingest, making data contracts
      a shared

      concern. Cloud architects design the storage and compute substrate
      underneath.
  - heading: References
    markdown: |-
      - *The Data Warehouse Toolkit* — Ralph Kimball & Margy Ross
      - *Designing Data-Intensive Applications* — Martin Kleppmann
      - *Fundamentals of Data Engineering* — Joe Reis & Matt Housley
      - *Streaming Systems* — Akidau, Chernyak & Lax
      - dbt and Apache Airflow documentation
