title: Machine Learning Engineer
slug: machine-learning-engineer
aliases:
  - ML Engineer
  - MLOps Engineer
  - Applied ML Engineer
category: Technology
tags:
  - machine-learning
  - mlops
  - model-serving
  - pipelines
  - production
difficulty: advanced
summary: >-
  Treats a model as a perishable system, not a deliverable: versions everything,
  kills training-serving skew, and watches for the silently-wrong prediction the
  world drifted into.
contributors:
  - soul-atlas
last_reviewed: null
provenance: ai-generated
created: '2026-06-26'
updated: '2026-06-26'
related:
  - slug: data-scientist
    type: adjacent
    note: >-
      prototypes the model and proves the signal; ML engineer makes it a
      reliable system
  - slug: software-engineer
    type: prerequisite
    note: the engineering discipline applied to systems that learn and decay
  - slug: data-engineer
    type: collaboration
    note: builds the feature and data pipelines ML serving depends on
  - slug: site-reliability-engineer
    type: adjacent
    note: shares on-call and observability culture for the serving system
  - slug: ai-safety-researcher
    type: related
    note: works the same models from alignment and evaluation angles
  - slug: research-scientist
    type: related
    note: supplies the methods the field productionizes
specializations:
  - MLOps / ML Platform Engineer
  - LLM / Inference Optimization Engineer
  - Computer Vision Engineer
country_variants: []
sources:
  - title: Designing Machine Learning Systems
    kind: book
  - title: Machine Learning Design Patterns
    kind: book
  - title: Hidden Technical Debt in Machine Learning Systems
    kind: article
status: draft
reviewers: []
sections:
  - heading: Purpose
    markdown: >-
      A machine learning engineer exists to make models earn their keep in

      production — to turn a method that works in a notebook on last month's
      data into a

      system that serves predictions at scale, under latency budgets, without
      silently

      rotting. The gap between "scored well in an experiment" and "creates value
      in

      production" is where most ML projects die. The data scientist asks whether
      the

      signal is real; the ML engineer asks whether you can serve it 50,000 times
      a

      second, retrain it weekly, roll it back in one move, and know within
      minutes when

      it predicts garbage. The deliverable is a reliable ML *system*, not a
      model.
  - heading: Core Mission
    markdown: >-
      Build and operate machine learning systems that deliver accurate
      predictions

      reliably in production, retrain and improve safely over time, and fail
      loudly and

      recoverably, never silently.
  - heading: Primary Responsibilities
    markdown: >-
      The romantic image is training state-of-the-art models; the real work is

      plumbing, serving, and operations. An ML engineer builds training
      pipelines that

      turn raw data into a reproducible artifact, and feature pipelines that
      compute

      features identically across training and serving. They deploy models
      behind

      APIs or batch jobs within latency and cost budgets, and version everything
      —

      data, features, code, model, config — so "which model is in prod?" always
      has an

      answer. They monitor for the failure unique to ML: the model is up and fast
      but

      quietly wrong because the world drifted. And they own the retraining loop
      — when,

      on what data, with what guardrails against a worse model.
  - heading: Guiding Principles
    markdown: >-
      - **A model is a perishable asset.** Trained on a snapshot of a moving
      world —
        plan for drift, retraining, and decay.
      - **Training–serving skew is the silent killer.** Compute features
      differently in
        training and serving and your offline metrics are fiction. Share one code
        path or use a feature store.
      - **Reproducibility is non-negotiable.** Same data, code, and seed must
      yield the
        same model, or you can't debug, audit, or roll back.
      - **The model is the easy part.** The data pipeline, serving, and
      monitoring are
        most of the system and most of the failures; the ML code is a small fraction
        of a real system (Sculley et al., "Hidden Technical Debt in Machine Learning
        Systems").
      - **Ship the simplest model that clears the bar, then iterate.** A
      logistic
        regression in prod beats a transformer in a notebook.
      - **Offline metrics are a hypothesis; online metrics are the truth.** A
      model
        that improves AUC can still tank revenue.
      - **Make every model rollback-able.** A bad deploy is an incident; roll it
      back
        in one move.
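
      One way to read the skew principle is to keep a single feature function
      and call it from both the batch training job and the online request
      handler. The sketch below assumes a hypothetical `build_features` function
      and made-up fields (`amount`, `country`); it illustrates the idea and is
      not a prescribed design.

        ```python
        from typing import Optional

        # Hypothetical default; the point is that it lives in exactly one place.
        DEFAULT_COUNTRY = "unknown"


        def build_features(amount: Optional[float], country: Optional[str]) -> dict:
            """Single feature definition imported by both training and serving."""
            return {
                "amount": 0.0 if amount is None else float(amount),
                "is_large": 0.0 if amount is None else float(amount >= 1000.0),
                "country": (country or DEFAULT_COUNTRY).lower(),
            }


        def training_rows(raw_rows: list) -> list:
            """Batch path: apply the shared definition to historical records."""
            return [build_features(r.get("amount"), r.get("country")) for r in raw_rows]


        def serve_request(payload: dict) -> dict:
            """Online path: apply the same definition at request time."""
            return build_features(payload.get("amount"), payload.get("country"))


        record = {"amount": None, "country": "DE"}
        # Parity check: both paths must agree on the same raw input.
        assert training_rows([record])[0] == serve_request(record)
        ```

      Because the null default and the casing rule exist only inside
      `build_features`, neither path can quietly pick a different fill value.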
  - heading: Mental Models
    markdown: >-
      - **The ML system as a pipeline of stages.** Data ingestion → feature
        computation → training → evaluation → serving → monitoring → retraining. The
        system is only as reliable as its weakest stage — rarely the model.
      - **Training–serving skew.** Two code paths that must agree: the batch
      path that
        built training features and the online path that builds them at request time.
        Any divergence — a default, time-zone, fill value — silently degrades output.
      - **Data and concept drift.** *Data drift* is the input distribution
      moving;
        *concept drift* is the input-to-target relationship moving. Both decay a model
        but demand different responses: retrain vs. rethink.
      - **The feedback loop.** A deployed model changes the data it later trains
      on; a
        recommender teaches itself it was right.
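      - **Shadow and canary deployment.** Shadow runs the new model alongside the
      old, scoring
        the same traffic without acting on it; canary then serves a small slice of
        live traffic with the new model and compares online metrics before ramping.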
      - **The cost surface.** Inference cost ≈ compute per prediction × traffic ×
      hardware price. The win
        is often quantization, distillation, or caching — not a bigger architecture.
      - **Garbage in, model out.** The model encodes whatever the training data
        contains — errors and biases — at scale.
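
      Data drift can be turned into a number. One common choice is the
      population stability index (PSI), which compares binned frequencies of a
      feature between a reference sample and live traffic. The sketch below is a
      minimal pure-Python version; the function name `psi`, the bin count, and
      the smoothing constant are assumptions, and the often-quoted 0.2 alert
      level is a convention rather than a rule.

        ```python
        import math


        def psi(expected, actual, bins=10, eps=1e-6):
            """Population stability index between a reference and a live sample."""
            lo, hi = min(expected), max(expected)
            width = (hi - lo) / bins or 1.0  # guard against a constant reference

            def fractions(values):
                counts = [0] * bins
                for v in values:
                    # Clamp so live values outside the reference range land in edge bins.
                    idx = int((v - lo) / width)
                    counts[min(max(idx, 0), bins - 1)] += 1
                # eps keeps empty bins from producing log(0) or division by zero.
                return [max(c / len(values), eps) for c in counts]

            e, a = fractions(expected), fractions(actual)
            return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


        reference = [i / 100 for i in range(100)]      # training-time feature values
        shifted = [0.5 + i / 200 for i in range(100)]  # live values moved upward
        drift_alert = psi(reference, shifted) > 0.2
        ```

      PSI only sees the input distribution, so it speaks to data drift; concept
      drift still needs labels or a downstream business metric to detect.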
  - heading: First Principles
    markdown: >-
      - A model is a function fit to the past; the future is only sometimes like
      it,
        and the system must notice when it isn't.
      - Anything you don't version you can't reproduce; anything you can't
      reproduce you
        can't trust.
      - Offline accuracy and online value are different quantities that only
      correlate.

      - Every feature is a dependency on an upstream pipeline that can break.

      - The expensive ML failures are silent: the system stays green while
      predictions
        go wrong.
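
      Versioning everything can start with a deterministic fingerprint over the
      inputs of a training run. The sketch below hashes a hypothetical data
      snapshot id, code revision, and config with SHA-256; the identifiers are
      illustrative assumptions, and a real setup would usually delegate this to
      a model registry or a tool such as MLflow or DVC.

        ```python
        import hashlib
        import json


        def run_fingerprint(data_version: str, code_version: str, config: dict) -> str:
            """Stable ID for a training run: same inputs give the same fingerprint."""
            payload = json.dumps(
                {"data": data_version, "code": code_version, "config": config},
                sort_keys=True,  # key order must not change the hash
            )
            return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


        # Hypothetical identifiers, for illustration only.
        a = run_fingerprint("snapshot-001", "rev-abc", {"seed": 7, "lr": 0.1})
        b = run_fingerprint("snapshot-001", "rev-abc", {"lr": 0.1, "seed": 7})
        c = run_fingerprint("snapshot-002", "rev-abc", {"seed": 7, "lr": 0.1})
        assert a == b and a != c
        ```

      Storing the fingerprint next to the model artifact gives "which model, on
      which data?" a concrete answer.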
  - heading: Questions Experts Constantly Ask
    markdown: >-
      - Are features computed identically in training and serving? Prove it.

      - What's the latency and cost budget per prediction, and does this model
      fit it?

      - How will I know the model has drifted before users or revenue show it?

      - What's the rollback path if this deploy is bad?

      - What data is this trained on, and is any of it leaking the future?

      - What happens to predictions when an upstream feature pipeline returns
      nulls?

      - Is the offline metric I'm optimizing correlated with the business
      outcome?

      - Will deploying this model change the distribution of data it later sees?
  - heading: Decision Frameworks
    markdown: >-
      - **Build the baseline first.** Ship a heuristic or simple model
      end-to-end
        before touching quality — it proves the plumbing and sets the bar.
      - **Batch vs. real-time serving.** If predictions can be precomputed
      (daily
        recommendations), batch is cheaper. Pay for online serving only when freshness
        matters.
      - **Retrain-on-schedule vs. retrain-on-trigger.** Time-based is simple and
        predictable; drift-triggered is efficient but needs reliable detection. Choose
        by how fast the world moves.
      - **Buy vs. build the model.** For commodity tasks (OCR, transcription,
      general
        language), a hosted API or open foundation model usually beats training from
        scratch. Build when data, latency, or cost is the edge.
      - **Promote-on-evidence.** A new model reaches production only after
      offline
        eval, shadow scoring, then a canary with automatic rollback on regression.
  - heading: Workflow
    markdown: >-
      1. **Frame and baseline.** Define the prediction task, metric, and
      latency/cost
         budget; ship the dumbest model through the full pipeline first.
      2. **Build the data and feature pipelines.** Share feature computation
      across
         train and serve; version the data.
      3. **Train and evaluate offline.** Reproducible run, held-out evaluation,
         comparison against baseline and incumbent.
      4. **Package and optimize.** Containerize; quantize, distill, or batch to
      meet
         the budget.
      5. **Deploy progressively.** Shadow, then canary, then ramp — with
      automatic
         rollback wired to online metrics.
      6. **Monitor.** Track input and prediction distributions, latency, and the
         business metric; alert on drift, not just crashes.
      7. **Retrain.** On schedule or drift trigger, with a guardrail that
      refuses to
         promote a worse model.
      8. **Postmortem ML incidents** like any outage; the fix is usually a
      monitoring
         or pipeline gap, not the weights.
  - heading: Common Tradeoffs
    markdown: >-
      - **Model accuracy vs. inference cost and latency.** The bigger model wins
        offline and blows the latency budget; distillation and quantization trade
        accuracy for speed.
      - **Freshness vs. stability.** Retrain often and chase noise; retrain
      rarely and
        decay. Tune the cadence.
      - **Online learning vs. batch retraining.** Online adapts fast but is hard
      to
        debug and roll back; batch is slower but auditable.
      - **Feature richness vs. pipeline fragility.** Every feature is another
      upstream
        dependency that can break and another way the paths split.
      - **Automation vs. oversight in retraining.** A fully automated retrain
      loop ships
        a regression at machine speed unless guardrails gate it.
  - heading: Rules of Thumb
    markdown: >-
      - If the model is suspiciously good offline, look for leakage before you
      ship.

      - Log the model version with every prediction, or you'll never debug prod.

      - Compute features once, use them in both training and serving — or pay in
      skew.

      - A model with no monitoring is one you've already lost control of.

      - Quantize before you reach for bigger hardware.

      - The retrain pipeline must be able to refuse to promote a worse model.

      - Reproduce the training run from scratch before you trust the artifact.
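
      Logging the model version with every prediction can be as simple as
      emitting one structured record per request. The sketch below uses only the
      standard library; `MODEL_VERSION`, the record fields, and the stand-in
      `predict` function are hypothetical.

        ```python
        import json
        import logging
        import time

        logging.basicConfig(level=logging.INFO, format="%(message)s")
        log = logging.getLogger("predictions")

        MODEL_VERSION = "model-v0"  # hypothetical; normally read from the model registry


        def predict(features: dict) -> float:
            """Stand-in model: a fixed linear score, for illustration only."""
            return 0.1 + 0.5 * features.get("x", 0.0)


        def predict_and_log(request_id: str, features: dict) -> dict:
            """Score a request and log everything needed to debug it later."""
            record = {
                "request_id": request_id,
                "model_version": MODEL_VERSION,
                "features": features,
                "prediction": predict(features),
                "ts": time.time(),
            }
            log.info(json.dumps(record, sort_keys=True))
            return record


        result = predict_and_log("req-1", {"x": 1.0})
        ```

      With the version and the input features in the same record, a bad
      prediction can be traced to a specific model and replayed offline.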
  - heading: Failure Modes
    markdown: >-
      - **Training–serving skew.** Offline metrics look great, production is
      worse,
        because features differ between the two paths.
      - **Silent drift.** The system is up, fast, and increasingly wrong because
      the
        input distribution moved and nobody watched.
      - **Data leakage.** A feature encodes the label or the future; the model
      dazzles
        in eval and collapses in production.
      - **Pipeline jungle.** A tangle of glue scripts nobody can reproduce — the
        commonest form of ML technical debt.
      - **Undeclared consumers.** Other teams quietly depend on a model's
      outputs, so
        you can't change it without breaking theirs.
      - **Feedback loops gone feral.** Predictions reshape the training data
      until the
        model optimizes for its own past output.
      - **The notebook in production.** A model promoted from a notebook with no
        versioning, tests, or rollback.
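
      A frequent source of leakage is a feature that sees events from after the
      prediction time. One guard is to compute features strictly point-in-time.
      The sketch below assumes events are `(timestamp, amount)` pairs and uses a
      hypothetical `avg_amount_before` helper with made-up dates; whether the
      window should include the current event is a modeling decision, as long as
      training and serving make the same one.

        ```python
        from datetime import datetime, timedelta


        def avg_amount_before(events, as_of, window_days=30):
            """Mean amount over the window ending strictly before `as_of`."""
            start = as_of - timedelta(days=window_days)
            # Strict '<' keeps the current and any future events out of the feature.
            past = [amount for ts, amount in events if start <= ts < as_of]
            return sum(past) / len(past) if past else 0.0


        now = datetime(2024, 1, 31)
        events = [
            (datetime(2024, 1, 10), 20.0),
            (datetime(2024, 1, 20), 40.0),
            (now, 900.0),                  # the transaction being scored
            (datetime(2024, 2, 5), 60.0),  # future event; must be ignored
        ]
        feature = avg_amount_before(events, as_of=now)
        ```

      Running the same helper over historical rows with each row's own
      timestamp as `as_of` keeps the training features blind to the future.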
  - heading: Anti-patterns
    markdown: >-
      - **Train-once-deploy-forever** — shipped with no retraining or
      monitoring,
        left to decay.
      - **Two feature code paths** — separate logic for training and serving,
      drifting
        apart silently.
      - **Big-model-by-default** — the largest architecture when a smaller one
      clears
        the bar at a tenth the cost.
      - **Offline-only validation** — promoting on AUC with no shadow or canary.

      - **Unversioned everything** — no way to answer "which model, on which
      data?"

      - **Manual retraining heroics** — re-running a notebook each month, not a
      guarded
        pipeline.
  - heading: Vocabulary
    markdown: >-
      - **Training–serving skew** — divergence between how features are computed
      in
        training and at inference.
      - **Feature store** — serves consistent features to training and serving
      from one
        definition.
      - **Data / concept drift** — input distribution moving / input-to-target
        relationship moving.
      - **Model registry** — versioned catalog of model artifacts and lineage.

      - **Shadow deployment** — running a new model on live traffic without
      acting on
        it.
      - **Quantization / distillation** — shrinking a model by reducing
      numerical
        precision / training a smaller model to mimic a larger.
      - **MLOps** — operating ML systems reliably (CI/CD for models).

      - **Inference latency** — time to produce one prediction; the serving
      budget.

      - **Backfill** — recomputing historical features or predictions with new
      logic.

      - **Embedding** — a learned dense vector representing an input.
  - heading: Tools
    markdown: >-
      - **Frameworks** — PyTorch, TensorFlow, JAX; scikit-learn and XGBoost for
      tabular
        work.
      - **Serving** — TorchServe, Triton, BentoML, ONNX Runtime, vLLM for LLMs.

      - **Pipelines and orchestration** — Airflow, Kubeflow, Ray, Spark.

      - **Experiment and artifact tracking** — MLflow, Weights & Biases, DVC; a
      model
        registry for audit.
      - **Feature stores** — Feast, Tecton, to kill training–serving skew.

      - **Monitoring** — drift detectors, Evidently, plus Prometheus/Grafana for
        latency and throughput.
      - **Infrastructure** — Kubernetes, GPUs/TPUs, autoscaling, and cost
      accounting.
  - heading: Collaboration
    markdown: >-
      An ML engineer sits between research and operations and speaks both
      languages.

      With data scientists, they harden a prototype and push back when it can't
      be

      served within budget or reproduced. With data engineers, they share
      feature

      pipelines and feel every upstream schema change. With software engineers,
      they

      integrate the model behind an API under the same review discipline; with
      SREs,

      they share on-call and observability. The recurring friction is the
      handoff from

      data science: a research artifact lands needing the reproducibility,
      monitoring,

      and rollback it was never built with — good ML engineers push that rigor

      upstream.
  - heading: Ethics
    markdown: >-
      ML systems decide about people at a scale and speed no human reviews,
      which

      concentrates the cost of a quiet mistake. The duties: monitor for fairness
      and

      disparate impact in production, not just at training time, because a model
      can

      drift into discrimination as the world shifts; keep a human-meaningful

      explanation and appeal path for high-stakes decisions; refuse to deploy a
      model

      whose data you can't account for or whose failures you can't detect; treat
      a

      silently-wrong model as a safety incident. The hardest line is shipping
      speed

      versus the guardrails that catch the harm — and the engineer who skips
      them owns

      what the model does.
  - heading: Scenarios
    markdown: >-
      **The model that aced the lab and failed the field.** A fraud model scores
      0.94

      AUC offline but barely beats the old rules in production. Suspecting

      training–serving skew, the engineer compares feature values across both
      paths.

      The training pipeline computed "average transaction amount" over a 30-day
      window

      including the current transaction; the serving path excluded it — a leak

      inflating offline scores. Unifying the paths through one feature
      definition drops

      the offline number to a realistic 0.86, which production matches.


      **Silent drift after a holiday.** A demand-forecasting model that ran
      clean for

      months starts under-predicting badly in late November. No alert fired —
      latency

      and error rates were fine; the *system* was healthy, only the
      *predictions* were

      wrong. The fix is twofold: revert to a seasonal heuristic now, then add

      prediction- and input-distribution monitoring with drift alerts. ML
      monitoring

      must watch the numbers the model emits, not just whether the server is up.


      **The retrain that would have shipped a regression.** The weekly automated

      retrain scores higher on the rolling validation set. The promotion

      guardrail runs it against a fixed golden test set and a fairness slice,
      and

      catches that overall accuracy rose while accuracy for a minority segment
      dropped sharply:

      that week's training data was skewed by a logging bug. The pipeline
      refuses to

      promote and pages the engineer; the old model stays up, stopping harm at
      machine

      speed.
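
      The guardrail in the last scenario can be sketched as a gate that compares
      candidate and incumbent on a fixed golden set, overall and per slice. The
      function name `should_promote`, the slice labels, the toy data, and the
      tolerance are assumptions for illustration; real gates typically add
      statistical tests and more metrics.

        ```python
        def accuracy(rows, key):
            """Fraction of rows where the prediction under `key` matches the label."""
            return sum(r[key] == r["label"] for r in rows) / len(rows)


        def should_promote(golden, max_slice_drop=0.02):
            """Promote only if the candidate is no worse overall and on every slice."""
            if accuracy(golden, "candidate") < accuracy(golden, "incumbent"):
                return False, "overall regression"
            for name in sorted({r["slice"] for r in golden}):
                rows = [r for r in golden if r["slice"] == name]
                drop = accuracy(rows, "incumbent") - accuracy(rows, "candidate")
                if drop > max_slice_drop:
                    return False, f"regression on slice {name!r}"
            return True, "ok"


        # Toy golden set: the candidate gains on slice "a" but loses on slice "b".
        golden = (
            [{"slice": "a", "label": 1, "incumbent": 0, "candidate": 1}] * 6
            + [{"slice": "a", "label": 1, "incumbent": 1, "candidate": 1}] * 10
            + [{"slice": "b", "label": 1, "incumbent": 1, "candidate": 0}] * 2
            + [{"slice": "b", "label": 1, "incumbent": 1, "candidate": 1}] * 2
        )
        decision = should_promote(golden)
        ```

      Here the candidate is more accurate overall yet is refused because slice
      "b" regresses, which mirrors the scenario's outcome.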
  - heading: Related Occupations
    markdown: >-
      A machine learning engineer is a software engineer who specializes in the

      failure modes of systems that learn — versioning, testing, and operating
      code

      applied to artifacts that decay and drift. The data scientist is the
      closest

      neighbor and upstream partner: they prove the signal and prototype the
      model; the

      ML engineer makes it a reliable production system. Data engineers

      build the pipelines both depend on; SREs share the on-call culture; AI
      safety

      researchers and prompt engineers work the same models from alignment and

      interface angles; research scientists supply the methods the ML engineer
      productionizes.
  - heading: References
    markdown: >-
      - *Designing Machine Learning Systems* — Chip Huyen

      - *Machine Learning Design Patterns* — Lakshmanan, Robinson, Munn

      - *Reliable Machine Learning* — Chen et al. (Google)

      - "Hidden Technical Debt in Machine Learning Systems" — Sculley et al.,
      NeurIPS 2015

      - *Designing Data-Intensive Applications* — Martin Kleppmann