{"slug":"machine-learning-engineer","title":"Machine Learning Engineer","metadata":{"title":"Machine Learning Engineer","slug":"machine-learning-engineer","aliases":["ML Engineer","MLOps Engineer","Applied ML Engineer"],"category":"Technology","tags":["machine-learning","mlops","model-serving","pipelines","production"],"difficulty":"advanced","summary":"Treats a model as a perishable system, not a deliverable: versions everything, kills training-serving skew, and watches for the silently-wrong prediction the world drifted into.","contributors":["soul-atlas"],"last_reviewed":null,"provenance":"ai-generated","created":"2026-06-26","updated":"2026-06-26","related":[{"slug":"data-scientist","type":"adjacent","note":"prototypes the model and proves the signal; ML engineer makes it a reliable system"},{"slug":"software-engineer","type":"prerequisite","note":"the engineering discipline applied to systems that learn and decay"},{"slug":"data-engineer","type":"collaboration","note":"builds the feature and data pipelines ML serving depends on"},{"slug":"site-reliability-engineer","type":"adjacent","note":"shares on-call and observability culture for the serving system"},{"slug":"ai-safety-researcher","type":"related","note":"works the same models from alignment and evaluation angles"},{"slug":"research-scientist","type":"related","note":"supplies the methods the field productionizes"}],"specializations":["MLOps / ML Platform Engineer","LLM / Inference Optimization Engineer","Computer Vision Engineer"],"country_variants":[],"sources":[{"title":"Designing Machine Learning Systems","kind":"book"},{"title":"Machine Learning Design Patterns","kind":"book"},{"title":"Hidden Technical Debt in Machine Learning Systems","kind":"article"}],"status":"draft","reviewers":[]},"sections":[{"heading":"Purpose","id":"purpose","markdown":"A machine learning engineer exists to make models earn their keep in\nproduction — to turn a method that works in a notebook on last month's data into 
a\nsystem that serves predictions at scale, under latency budgets, without silently\nrotting. The gap between \"scored well in an experiment\" and \"creates value in\nproduction\" is where most ML projects die. The data scientist asks whether the\nsignal is real; the ML engineer asks whether you can serve it 50,000 times a\nsecond, retrain it weekly, roll it back in one move, and know within minutes when\nit predicts garbage. The deliverable is a reliable ML *system*, not a model.","html":"<h2 id=\"purpose\">Purpose</h2>\n<p>A machine learning engineer exists to make models earn their keep in\nproduction — to turn a method that works in a notebook on last month&#39;s data into a\nsystem that serves predictions at scale, under latency budgets, without silently\nrotting. The gap between &quot;scored well in an experiment&quot; and &quot;creates value in\nproduction&quot; is where most ML projects die. The data scientist asks whether the\nsignal is real; the ML engineer asks whether you can serve it 50,000 times a\nsecond, retrain it weekly, roll it back in one move, and know within minutes when\nit predicts garbage. The deliverable is a reliable ML <em>system</em>, not a model.</p>\n","wordCount":109},{"heading":"Core Mission","id":"core-mission","markdown":"Build and operate machine learning systems that deliver accurate predictions\nreliably in production, retrain and improve safely over time, and fail loudly and\nrecoverably, never silently.","html":"<h2 id=\"core-mission\">Core Mission</h2>\n<p>Build and operate machine learning systems that deliver accurate predictions\nreliably in production, retrain and improve safely over time, and fail loudly and\nrecoverably, never silently.</p>\n","wordCount":26},{"heading":"Primary Responsibilities","id":"primary-responsibilities","markdown":"The romantic image is training state-of-the-art models; the real work is\nplumbing, serving, and operations. 
An ML engineer builds training pipelines that\nturn raw data into a reproducible artifact, and feature pipelines that compute\nfeatures identically across training and serving. They deploy models behind\nAPIs or batch jobs within latency and cost budgets, and version everything —\ndata, features, code, model, config — so \"which model is in prod?\" always has an\nanswer. They monitor for the failure unique to ML: the model up and fast but\nquietly wrong because the world drifted. And they own the retraining loop — when,\non what data, with what guardrails against a worse model.","html":"<h2 id=\"primary-responsibilities\">Primary Responsibilities</h2>\n<p>The romantic image is training state-of-the-art models; the real work is\nplumbing, serving, and operations. An ML engineer builds training pipelines that\nturn raw data into a reproducible artifact, and feature pipelines that compute\nfeatures identically across training and serving. They deploy models behind\nAPIs or batch jobs within latency and cost budgets, and version everything —\ndata, features, code, model, config — so &quot;which model is in prod?&quot; always has an\nanswer. They monitor for the failure unique to ML: the model up and fast but\nquietly wrong because the world drifted. And they own the retraining loop — when,\non what data, with what guardrails against a worse model.</p>\n","wordCount":111},{"heading":"Guiding Principles","id":"guiding-principles","markdown":"- **A model is a perishable asset.** Trained on a snapshot of a moving world —\n  plan for drift, retraining, and decay.\n- **Training–serving skew is the silent killer.** Compute features differently in\n  training and serving and your offline metrics are fiction. 
Share the code path\n  or feature store.\n- **Reproducibility is non-negotiable.** Same data, code, and seed must yield the\n  same model, or you can't debug, audit, or roll back.\n- **The model is the easy part.** The data pipeline, serving, and monitoring are\n  most of the system and most of the failures (Sculley, \"Hidden Technical Debt in\n  ML Systems\").\n- **Ship the simplest model that clears the bar, then iterate.** A logistic\n  regression in prod beats a transformer in a notebook.\n- **Offline metrics are a hypothesis; online metrics are the truth.** A model\n  that improves AUC can still tank revenue.\n- **Make every model rollback-able.** A bad deploy is an incident; roll it back\n  in one move.","html":"<h2 id=\"guiding-principles\">Guiding Principles</h2>\n<ul>\n<li><strong>A model is a perishable asset.</strong> Trained on a snapshot of a moving world —\nplan for drift, retraining, and decay.</li>\n<li><strong>Training–serving skew is the silent killer.</strong> Compute features differently in\ntraining and serving and your offline metrics are fiction. 
Share the code path\nor feature store.</li>\n<li><strong>Reproducibility is non-negotiable.</strong> Same data, code, and seed must yield the\nsame model, or you can&#39;t debug, audit, or roll back.</li>\n<li><strong>The model is the easy part.</strong> The data pipeline, serving, and monitoring are\nmost of the system and most of the failures (Sculley, &quot;Hidden Technical Debt in\nML Systems&quot;).</li>\n<li><strong>Ship the simplest model that clears the bar, then iterate.</strong> A logistic\nregression in prod beats a transformer in a notebook.</li>\n<li><strong>Offline metrics are a hypothesis; online metrics are the truth.</strong> A model\nthat improves AUC can still tank revenue.</li>\n<li><strong>Make every model rollback-able.</strong> A bad deploy is an incident; roll it back\nin one move.</li>\n</ul>\n","wordCount":155},{"heading":"Mental Models","id":"mental-models","markdown":"- **The ML system as a pipeline of stages.** Data ingestion → feature\n  computation → training → evaluation → serving → monitoring → retraining. The\n  system is only as reliable as its weakest stage — rarely the model.\n- **Training–serving skew.** Two code paths that must agree: the batch path that\n  built training features and the online path that builds them at request time.\n  Any divergence — a default, time-zone, fill value — silently degrades output.\n- **Data and concept drift.** *Data drift* is the input distribution moving;\n  *concept drift* is the input-to-target relationship moving. Both decay a model\n  but demand different responses: retrain vs. rethink.\n- **The feedback loop.** A deployed model changes the data it later trains on; a\n  recommender teaches itself it was right.\n- **Shadow and canary deployment.** Shadow scores live traffic without acting\n  on it; canary serves a small slice before the ramp.\n- **The cost surface.** Inference cost = model size × traffic × hardware. 
The win\n  is often quantization, distillation, or caching — not a bigger architecture.\n- **Garbage in, model out.** The model encodes whatever the training data\n  contains — errors and biases — at scale.","html":"<h2 id=\"mental-models\">Mental Models</h2>\n<ul>\n<li><strong>The ML system as a pipeline of stages.</strong> Data ingestion → feature\ncomputation → training → evaluation → serving → monitoring → retraining. The\nsystem is only as reliable as its weakest stage — rarely the model.</li>\n<li><strong>Training–serving skew.</strong> Two code paths that must agree: the batch path that\nbuilt training features and the online path that builds them at request time.\nAny divergence — a default, time-zone, fill value — silently degrades output.</li>\n<li><strong>Data and concept drift.</strong> <em>Data drift</em> is the input distribution moving;\n<em>concept drift</em> is the input-to-target relationship moving. Both decay a model\nbut demand different responses: retrain vs. rethink.</li>\n<li><strong>The feedback loop.</strong> A deployed model changes the data it later trains on; a\nrecommender teaches itself it was right.</li>\n<li><strong>Shadow and canary deployment.</strong> Shadow scores live traffic without acting\non it; canary serves a small slice before the ramp.</li>\n<li><strong>The cost surface.</strong> Inference cost = model size × traffic × hardware. 
The win\nis often quantization, distillation, or caching — not a bigger architecture.</li>\n<li><strong>Garbage in, model out.</strong> The model encodes whatever the training data\ncontains — errors and biases — at scale.</li>\n</ul>\n","wordCount":175},{"heading":"First Principles","id":"first-principles","markdown":"- A model is a function fit to the past; the future is only sometimes like it,\n  and the system must notice when it isn't.\n- Anything you don't version you can't reproduce; anything you can't reproduce you\n  can't trust.\n- Offline accuracy and online value are different quantities that only correlate.\n- Every feature is a dependency on an upstream pipeline that can break.\n- The expensive ML failures are silent: the system stays green while predictions\n  go wrong.","html":"<h2 id=\"first-principles\">First Principles</h2>\n<ul>\n<li>A model is a function fit to the past; the future is only sometimes like it,\nand the system must notice when it isn&#39;t.</li>\n<li>Anything you don&#39;t version you can&#39;t reproduce; anything you can&#39;t reproduce you\ncan&#39;t trust.</li>\n<li>Offline accuracy and online value are different quantities that only correlate.</li>\n<li>Every feature is a dependency on an upstream pipeline that can break.</li>\n<li>The expensive ML failures are silent: the system stays green while predictions\ngo wrong.</li>\n</ul>\n","wordCount":75},{"heading":"Questions Experts Constantly Ask","id":"questions-experts-constantly-ask","markdown":"- Are features computed identically in training and serving? 
Prove it.\n- What's the latency and cost budget per prediction, and does this model fit it?\n- How will I know the model has drifted before a user or the revenue does?\n- What's the rollback path if this deploy is bad?\n- What data is this trained on, and is any of it leaking the future?\n- What happens to predictions when an upstream feature pipeline returns nulls?\n- Is the offline metric I'm optimizing correlated with the business outcome?\n- Will deploying this model change the distribution of data it later sees?","html":"<h2 id=\"questions-experts-constantly-ask\">Questions Experts Constantly Ask</h2>\n<ul>\n<li>Are features computed identically in training and serving? Prove it.</li>\n<li>What&#39;s the latency and cost budget per prediction, and does this model fit it?</li>\n<li>How will I know the model has drifted before a user or the revenue does?</li>\n<li>What&#39;s the rollback path if this deploy is bad?</li>\n<li>What data is this trained on, and is any of it leaking the future?</li>\n<li>What happens to predictions when an upstream feature pipeline returns nulls?</li>\n<li>Is the offline metric I&#39;m optimizing correlated with the business outcome?</li>\n<li>Will deploying this model change the distribution of data it later sees?</li>\n</ul>\n","wordCount":96},{"heading":"Decision Frameworks","id":"decision-frameworks","markdown":"- **Build the baseline first.** Ship a heuristic or simple model end-to-end\n  before touching quality — it proves the plumbing and sets the bar.\n- **Batch vs. real-time serving.** If predictions can be precomputed (daily\n  recommendations), batch is cheaper. Pay for online serving only when freshness\n  matters.\n- **Retrain-on-schedule vs. retrain-on-trigger.** Time-based is simple and\n  predictable; drift-triggered is efficient but needs reliable detection. Choose\n  by how fast the world moves.\n- **Buy vs. 
build the model.** For commodity tasks (OCR, transcription, general\n  language), a hosted API or open foundation model usually beats training from\n  scratch. Build when data, latency, or cost is the edge.\n- **Promote-on-evidence.** A new model reaches production only after offline\n  eval, shadow scoring, then a canary with automatic rollback on regression.","html":"<h2 id=\"decision-frameworks\">Decision Frameworks</h2>\n<ul>\n<li><strong>Build the baseline first.</strong> Ship a heuristic or simple model end-to-end\nbefore touching quality — it proves the plumbing and sets the bar.</li>\n<li><strong>Batch vs. real-time serving.</strong> If predictions can be precomputed (daily\nrecommendations), batch is cheaper. Pay for online serving only when freshness\nmatters.</li>\n<li><strong>Retrain-on-schedule vs. retrain-on-trigger.</strong> Time-based is simple and\npredictable; drift-triggered is efficient but needs reliable detection. Choose\nby how fast the world moves.</li>\n<li><strong>Buy vs. build the model.</strong> For commodity tasks (OCR, transcription, general\nlanguage), a hosted API or open foundation model usually beats training from\nscratch. Build when data, latency, or cost is the edge.</li>\n<li><strong>Promote-on-evidence.</strong> A new model reaches production only after offline\neval, shadow scoring, then a canary with automatic rollback on regression.</li>\n</ul>\n","wordCount":130},{"heading":"Workflow","id":"workflow","markdown":"1. **Frame and baseline.** Define the prediction task, metric, and latency/cost\n   budget; ship the dumbest model through the full pipeline first.\n2. **Build the data and feature pipelines.** Share feature computation across\n   train and serve; version the data.\n3. **Train and evaluate offline.** Reproducible run, held-out evaluation,\n   comparison against baseline and incumbent.\n4. **Package and optimize.** Containerize; quantize, distill, or batch to meet\n   the budget.\n5. 
**Deploy progressively.** Shadow, then canary, then ramp — with automatic\n   rollback wired to online metrics.\n6. **Monitor.** Track input and prediction distributions, latency, and the\n   business metric; alert on drift, not just crashes.\n7. **Retrain.** On schedule or drift trigger, with a guardrail that refuses to\n   promote a worse model.\n8. **Postmortem ML incidents** like any outage; the fix is usually a monitoring\n   or pipeline gap, not the weights.","html":"<h2 id=\"workflow\">Workflow</h2>\n<ol>\n<li><strong>Frame and baseline.</strong> Define the prediction task, metric, and latency/cost\nbudget; ship the dumbest model through the full pipeline first.</li>\n<li><strong>Build the data and feature pipelines.</strong> Share feature computation across\ntrain and serve; version the data.</li>\n<li><strong>Train and evaluate offline.</strong> Reproducible run, held-out evaluation,\ncomparison against baseline and incumbent.</li>\n<li><strong>Package and optimize.</strong> Containerize; quantize, distill, or batch to meet\nthe budget.</li>\n<li><strong>Deploy progressively.</strong> Shadow, then canary, then ramp — with automatic\nrollback wired to online metrics.</li>\n<li><strong>Monitor.</strong> Track input and prediction distributions, latency, and the\nbusiness metric; alert on drift, not just crashes.</li>\n<li><strong>Retrain.</strong> On schedule or drift trigger, with a guardrail that refuses to\npromote a worse model.</li>\n<li><strong>Postmortem ML incidents</strong> like any outage; the fix is usually a monitoring\nor pipeline gap, not the weights.</li>\n</ol>\n","wordCount":136},{"heading":"Common Tradeoffs","id":"common-tradeoffs","markdown":"- **Model accuracy vs. inference cost and latency.** The bigger model wins\n  offline and blows the latency budget; distillation and quantization trade\n  accuracy for speed.\n- **Freshness vs. stability.** Retrain often and chase noise; retrain rarely and\n  decay. 
Tune the cadence.\n- **Online learning vs. batch retraining.** Online adapts fast but is hard to\n  debug and roll back; batch is slower but auditable.\n- **Feature richness vs. pipeline fragility.** Every feature is another upstream\n  dependency that can break and another way the paths split.\n- **Automation vs. oversight in retraining.** A fully automated retrain loop ships\n  a regression at machine speed unless guardrails gate it.","html":"<h2 id=\"common-tradeoffs\">Common Tradeoffs</h2>\n<ul>\n<li><strong>Model accuracy vs. inference cost and latency.</strong> The bigger model wins\noffline and blows the latency budget; distillation and quantization trade\naccuracy for speed.</li>\n<li><strong>Freshness vs. stability.</strong> Retrain often and chase noise; retrain rarely and\ndecay. Tune the cadence.</li>\n<li><strong>Online learning vs. batch retraining.</strong> Online adapts fast but is hard to\ndebug and roll back; batch is slower but auditable.</li>\n<li><strong>Feature richness vs. pipeline fragility.</strong> Every feature is another upstream\ndependency that can break and another way the paths split.</li>\n<li><strong>Automation vs. 
oversight in retraining.</strong> A fully automated retrain loop ships\na regression at machine speed unless guardrails gate it.</li>\n</ul>\n","wordCount":100},{"heading":"Rules of Thumb","id":"rules-of-thumb","markdown":"- If the model is suspiciously good offline, look for leakage before you ship.\n- Log the model version with every prediction, or you'll never debug prod.\n- Compute features once, use them in both training and serving — or pay in skew.\n- A model with no monitoring is one you've already lost control of.\n- Quantize before you reach for bigger hardware.\n- The retrain pipeline must be able to refuse to promote a worse model.\n- Reproduce the training run from scratch before you trust the artifact.","html":"<h2 id=\"rules-of-thumb\">Rules of Thumb</h2>\n<ul>\n<li>If the model is suspiciously good offline, look for leakage before you ship.</li>\n<li>Log the model version with every prediction, or you&#39;ll never debug prod.</li>\n<li>Compute features once, use them in both training and serving — or pay in skew.</li>\n<li>A model with no monitoring is one you&#39;ve already lost control of.</li>\n<li>Quantize before you reach for bigger hardware.</li>\n<li>The retrain pipeline must be able to refuse to promote a worse model.</li>\n<li>Reproduce the training run from scratch before you trust the artifact.</li>\n</ul>\n","wordCount":82},{"heading":"Failure Modes","id":"failure-modes","markdown":"- **Training–serving skew.** Offline metrics look great, production is worse,\n  because features differ between the two paths.\n- **Silent drift.** The system is up, fast, and increasingly wrong because the\n  input distribution moved and nobody watched.\n- **Data leakage.** A feature encodes the label or the future; the model dazzles\n  in eval and collapses in production.\n- **Pipeline jungle.** A tangle of glue scripts nobody can reproduce — the\n  commonest form of ML technical debt.\n- **Undeclared consumers.** Other teams quietly depend on a model's 
outputs, so\n  you can't change it without breaking theirs.\n- **Feedback loops gone feral.** Predictions reshape the training data until the\n  model optimizes for its own past output.\n- **The notebook in production.** A model promoted from a notebook with no\n  versioning, tests, or rollback.","html":"<h2 id=\"failure-modes\">Failure Modes</h2>\n<ul>\n<li><strong>Training–serving skew.</strong> Offline metrics look great, production is worse,\nbecause features differ between the two paths.</li>\n<li><strong>Silent drift.</strong> The system is up, fast, and increasingly wrong because the\ninput distribution moved and nobody watched.</li>\n<li><strong>Data leakage.</strong> A feature encodes the label or the future; the model dazzles\nin eval and collapses in production.</li>\n<li><strong>Pipeline jungle.</strong> A tangle of glue scripts nobody can reproduce — the\ncommonest form of ML technical debt.</li>\n<li><strong>Undeclared consumers.</strong> Other teams quietly depend on a model&#39;s outputs, so\nyou can&#39;t change it without breaking theirs.</li>\n<li><strong>Feedback loops gone feral.</strong> Predictions reshape the training data until the\nmodel optimizes for its own past output.</li>\n<li><strong>The notebook in production.</strong> A model promoted from a notebook with no\nversioning, tests, or rollback.</li>\n</ul>\n","wordCount":123},{"heading":"Anti-patterns","id":"anti-patterns","markdown":"- **Train-once-deploy-forever** — shipped with no retraining or monitoring,\n  left to decay.\n- **Two feature code paths** — separate logic for training and serving, drifting\n  apart silently.\n- **Big-model-by-default** — the largest architecture when a smaller one clears\n  the bar at a tenth the cost.\n- **Offline-only validation** — promoting on AUC with no shadow or canary.\n- **Unversioned everything** — no way to answer \"which model, on which data?\"\n- **Manual retraining heroics** — re-running a notebook each month, not a guarded\n  
pipeline.","html":"<h2 id=\"anti-patterns\">Anti-patterns</h2>\n<ul>\n<li><strong>Train-once-deploy-forever</strong> — shipped with no retraining or monitoring,\nleft to decay.</li>\n<li><strong>Two feature code paths</strong> — separate logic for training and serving, drifting\napart silently.</li>\n<li><strong>Big-model-by-default</strong> — the largest architecture when a smaller one clears\nthe bar at a tenth the cost.</li>\n<li><strong>Offline-only validation</strong> — promoting on AUC with no shadow or canary.</li>\n<li><strong>Unversioned everything</strong> — no way to answer &quot;which model, on which data?&quot;</li>\n<li><strong>Manual retraining heroics</strong> — re-running a notebook each month, not a guarded\npipeline.</li>\n</ul>\n","wordCount":80},{"heading":"Vocabulary","id":"vocabulary","markdown":"- **Training–serving skew** — divergence between how features are computed in\n  training and at inference.\n- **Feature store** — serves consistent features to training and serving from one\n  definition.\n- **Data / concept drift** — input distribution moving / input-to-target\n  relationship moving.\n- **Model registry** — versioned catalog of model artifacts and lineage.\n- **Shadow deployment** — running a new model on live traffic without acting on\n  it.\n- **Quantization / distillation** — shrinking a model by reducing numerical\n  precision / training a smaller model to mimic a larger.\n- **MLOps** — operating ML systems reliably (CI/CD for models).\n- **Inference latency** — time to produce one prediction; the serving budget.\n- **Backfill** — recomputing historical features or predictions with new logic.\n- **Embedding** — a learned dense vector representing an input.","html":"<h2 id=\"vocabulary\">Vocabulary</h2>\n<ul>\n<li><strong>Training–serving skew</strong> — divergence between how features are computed in\ntraining and at inference.</li>\n<li><strong>Feature store</strong> — serves consistent features to training and serving from 
one\ndefinition.</li>\n<li><strong>Data / concept drift</strong> — input distribution moving / input-to-target\nrelationship moving.</li>\n<li><strong>Model registry</strong> — versioned catalog of model artifacts and lineage.</li>\n<li><strong>Shadow deployment</strong> — running a new model on live traffic without acting on\nit.</li>\n<li><strong>Quantization / distillation</strong> — shrinking a model by reducing numerical\nprecision / training a smaller model to mimic a larger.</li>\n<li><strong>MLOps</strong> — operating ML systems reliably (CI/CD for models).</li>\n<li><strong>Inference latency</strong> — time to produce one prediction; the serving budget.</li>\n<li><strong>Backfill</strong> — recomputing historical features or predictions with new logic.</li>\n<li><strong>Embedding</strong> — a learned dense vector representing an input.</li>\n</ul>\n","wordCount":112},{"heading":"Tools","id":"tools","markdown":"- **Frameworks** — PyTorch, TensorFlow, JAX; scikit-learn and XGBoost for tabular\n  work.\n- **Serving** — TorchServe, Triton, BentoML, ONNX Runtime, vLLM for LLMs.\n- **Pipelines and orchestration** — Airflow, Kubeflow, Ray, Spark.\n- **Experiment and artifact tracking** — MLflow, Weights & Biases, DVC; a model\n  registry for audit.\n- **Feature stores** — Feast, Tecton, to kill training–serving skew.\n- **Monitoring** — drift detectors, Evidently, plus Prometheus/Grafana for\n  latency and throughput.\n- **Infrastructure** — Kubernetes, GPUs/TPUs, autoscaling, and cost accounting.","html":"<h2 id=\"tools\">Tools</h2>\n<ul>\n<li><strong>Frameworks</strong> — PyTorch, TensorFlow, JAX; scikit-learn and XGBoost for tabular\nwork.</li>\n<li><strong>Serving</strong> — TorchServe, Triton, BentoML, ONNX Runtime, vLLM for LLMs.</li>\n<li><strong>Pipelines and orchestration</strong> — Airflow, Kubeflow, Ray, Spark.</li>\n<li><strong>Experiment and artifact tracking</strong> — MLflow, Weights &amp; Biases, DVC; a model\nregistry for 
audit.</li>\n<li><strong>Feature stores</strong> — Feast, Tecton, to kill training–serving skew.</li>\n<li><strong>Monitoring</strong> — drift detectors, Evidently, plus Prometheus/Grafana for\nlatency and throughput.</li>\n<li><strong>Infrastructure</strong> — Kubernetes, GPUs/TPUs, autoscaling, and cost accounting.</li>\n</ul>\n","wordCount":68},{"heading":"Collaboration","id":"collaboration","markdown":"An ML engineer sits between research and operations and speaks both languages.\nWith data scientists, they harden a prototype and push back when it can't be\nserved within budget or reproduced. With data engineers, they share feature\npipelines and feel every upstream schema change. With software engineers, they\nintegrate the model behind an API under the same review discipline; with SREs,\nthey share on-call and observability. The recurring friction is the handoff from\ndata science: a research artifact lands needing the reproducibility, monitoring,\nand rollback it was never built with — good ML engineers push that rigor\nupstream.","html":"<h2 id=\"collaboration\">Collaboration</h2>\n<p>An ML engineer sits between research and operations and speaks both languages.\nWith data scientists, they harden a prototype and push back when it can&#39;t be\nserved within budget or reproduced. With data engineers, they share feature\npipelines and feel every upstream schema change. With software engineers, they\nintegrate the model behind an API under the same review discipline; with SREs,\nthey share on-call and observability. The recurring friction is the handoff from\ndata science: a research artifact lands needing the reproducibility, monitoring,\nand rollback it was never built with — good ML engineers push that rigor\nupstream.</p>\n","wordCount":98},{"heading":"Ethics","id":"ethics","markdown":"ML systems decide about people at a scale and speed no human reviews, which\nconcentrates the cost of a quiet mistake. 
The duties: monitor for fairness and\ndisparate impact in production, not just at training time, because a model can\ndrift into discrimination as the world shifts; keep a human-meaningful\nexplanation and appeal path for high-stakes decisions; refuse to deploy a model\nwhose data you can't account for or whose failures you can't detect; treat a\nsilently-wrong model as a safety incident. The hardest line is shipping speed\nversus the guardrails that catch the harm — and the engineer who skips them owns\nwhat the model does.","html":"<h2 id=\"ethics\">Ethics</h2>\n<p>ML systems decide about people at a scale and speed no human reviews, which\nconcentrates the cost of a quiet mistake. The duties: monitor for fairness and\ndisparate impact in production, not just at training time, because a model can\ndrift into discrimination as the world shifts; keep a human-meaningful\nexplanation and appeal path for high-stakes decisions; refuse to deploy a model\nwhose data you can&#39;t account for or whose failures you can&#39;t detect; treat a\nsilently-wrong model as a safety incident. The hardest line is shipping speed\nversus the guardrails that catch the harm — and the engineer who skips them owns\nwhat the model does.</p>\n","wordCount":109},{"heading":"Scenarios","id":"scenarios","markdown":"**The model that aced the lab and failed the field.** A fraud model scores 0.94\nAUC offline but barely beats the old rules in production. Suspecting\ntraining–serving skew, the engineer compares feature values across both paths.\nThe training pipeline computed \"average transaction amount\" over a 30-day window\nincluding the current transaction; the serving path excluded it — a leak\ninflating offline scores. Unifying the paths through one feature definition drops\nthe offline number to a realistic 0.86, which production matches.\n\n**Silent drift after a holiday.** A demand-forecasting model that ran clean for\nmonths starts under-predicting badly in late November. 
No alert fired — latency\nand error rates were fine; the *system* was healthy, only the *predictions* were\nwrong. The fix is twofold: revert to a seasonal heuristic now, then add\nprediction- and input-distribution monitoring with drift alerts. ML monitoring\nmust watch the numbers the model emits, not just whether the server is up.\n\n**The retrain that would have shipped a regression.** The weekly automated\nretrain scores higher on the rolling validation set. The promotion\nguardrail runs it against a fixed golden test set and a fairness slice, and\ncatches that overall accuracy rose but dropped sharply for a minority segment —\nthat week's training data was skewed by a logging bug. The pipeline refuses to\npromote and pages the engineer; the old model stays up, stopping harm at machine\nspeed.","html":"<h2 id=\"scenarios\">Scenarios</h2>\n<p><strong>The model that aced the lab and failed the field.</strong> A fraud model scores 0.94\nAUC offline but barely beats the old rules in production. Suspecting\ntraining–serving skew, the engineer compares feature values across both paths.\nThe training pipeline computed &quot;average transaction amount&quot; over a 30-day window\nincluding the current transaction; the serving path excluded it — a leak\ninflating offline scores. Unifying the paths through one feature definition drops\nthe offline number to a realistic 0.86, which production matches.</p>\n<p><strong>Silent drift after a holiday.</strong> A demand-forecasting model that ran clean for\nmonths starts under-predicting badly in late November. No alert fired — latency\nand error rates were fine; the <em>system</em> was healthy, only the <em>predictions</em> were\nwrong. The fix is twofold: revert to a seasonal heuristic now, then add\nprediction- and input-distribution monitoring with drift alerts. 
ML monitoring\nmust watch the numbers the model emits, not just whether the server is up.</p>\n<p><strong>The retrain that would have shipped a regression.</strong> The weekly automated\nretrain scores higher on the rolling validation set. The promotion\nguardrail runs it against a fixed golden test set and a fairness slice, and\ncatches that overall accuracy rose but dropped sharply for a minority segment —\nthat week&#39;s training data was skewed by a logging bug. The pipeline refuses to\npromote and pages the engineer; the old model stays up, stopping harm at machine\nspeed.</p>\n","wordCount":234},{"heading":"Related Occupations","id":"related-occupations","markdown":"A machine learning engineer is a software engineer who specializes in the\nfailure modes of systems that learn — versioning, testing, and operating code\napplied to artifacts that decay and drift. The data scientist is the closest\nneighbor and upstream partner: they prove the signal and prototype the model; the\nML engineer makes it a reliable production system. Data engineers\nbuild the pipelines both depend on; SREs share the on-call culture; AI safety\nresearchers and prompt engineers work the same models from alignment and\ninterface angles; research scientists supply the methods it productionizes.","html":"<h2 id=\"related-occupations\">Related Occupations</h2>\n<p>A machine learning engineer is a software engineer who specializes in the\nfailure modes of systems that learn — versioning, testing, and operating code\napplied to artifacts that decay and drift. The data scientist is the closest\nneighbor and upstream partner: they prove the signal and prototype the model; the\nML engineer makes it a reliable production system. 
Data engineers\nbuild the pipelines both depend on; SREs share the on-call culture; AI safety\nresearchers and prompt engineers work the same models from alignment and\ninterface angles; research scientists supply the methods it productionizes.</p>\n","wordCount":93},{"heading":"References","id":"references","markdown":"- *Designing Machine Learning Systems* — Chip Huyen\n- *Machine Learning Design Patterns* — Lakshmanan, Robinson, Munn\n- *Reliable Machine Learning* — Chen et al. (Google)\n- \"Hidden Technical Debt in Machine Learning Systems\" — Sculley et al., NeurIPS 2015\n- *Designing Data-Intensive Applications* — Martin Kleppmann","html":"<h2 id=\"references\">References</h2>\n<ul>\n<li><em>Designing Machine Learning Systems</em> — Chip Huyen</li>\n<li><em>Machine Learning Design Patterns</em> — Lakshmanan, Robinson, Munn</li>\n<li><em>Reliable Machine Learning</em> — Chen et al. (Google)</li>\n<li>&quot;Hidden Technical Debt in Machine Learning Systems&quot; — Sculley et al., NeurIPS 2015</li>\n<li><em>Designing Data-Intensive Applications</em> — Martin Kleppmann</li>\n</ul>\n","wordCount":38}],"computed":{"wordCount":2150,"readingTimeMinutes":10,"completeness":1,"backlinks":["ai-safety-researcher","bioinformatics-scientist","data-engineer","data-scientist","neuroscientist","prompt-engineer","quantum-engineer","robotics-engineer","statistician"],"verified":false,"aiDrafted":true,"unverifiedAiDraft":true},"git":{"created":"2026-06-26","updated":"2026-06-26","revisions":1,"authors":[{"name":"soul-atlas","commits":1}],"timeline":[{"date":"2026-06-26","author":"soul-atlas"}]},"citation":{"apa":"soul-atlas (2026). Machine Learning Engineer [SOUL]. SOUL Atlas. 
https://soul-atlas.github.io/occupations/machine-learning-engineer","bibtex":"@misc{soulatlas-machine-learning-engineer,\n  title        = {Machine Learning Engineer},\n  author       = {soul-atlas},\n  year         = {2026},\n  howpublished = {SOUL Atlas},\n  note         = {SOUL.md, version 2026-06-26},\n  url          = {https://soul-atlas.github.io/occupations/machine-learning-engineer}\n}","text":"soul-atlas. \"Machine Learning Engineer.\" SOUL Atlas, 2026. https://soul-atlas.github.io/occupations/machine-learning-engineer."}}