{"slug":"data-scientist","title":"Data Scientist","metadata":{"title":"Data Scientist","slug":"data-scientist","aliases":["Applied Scientist","Quantitative Analyst","Data Analyst"],"category":"Technology","tags":["statistics","machine-learning","experimentation","inference","analytics"],"difficulty":"advanced","summary":"Reasons in distributions and uncertainty rather than correctness, quantifying how much to believe a pattern and refusing the causal claims data can't support.","contributors":["soul-atlas"],"last_reviewed":null,"provenance":"ai-generated","created":"2026-06-26","updated":"2026-06-26","related":[{"slug":"machine-learning-engineer","type":"adjacent","note":"productionizes the models a data scientist prototypes; optimizes systems over insight"},{"slug":"data-engineer","type":"prerequisite","note":"builds the clean pipelines a data scientist depends on"},{"slug":"research-scientist","type":"related","note":"shares the inferential and experimental discipline"},{"slug":"software-engineer","type":"adjacent","note":"reasons in correctness and state where the data scientist reasons in uncertainty"},{"slug":"product-manager","type":"collaboration","note":"supplies the questions and consumes the calibrated answers"},{"slug":"ai-safety-researcher","type":"adjacent","note":"shares concern for model bias, evaluation, and honest uncertainty"}],"specializations":["Experimentation / Causal Inference Scientist","NLP Data Scientist","Product Analytics Scientist"],"country_variants":[],"sources":[{"title":"The Elements of Statistical Learning","kind":"book"},{"title":"Statistical Rethinking","kind":"book"},{"title":"Trustworthy Online Controlled Experiments","kind":"book"}],"status":"draft","reviewers":[]},"sections":[{"heading":"Purpose","id":"purpose","markdown":"A data scientist exists to turn data into decisions better than the ones the\norganization would have made without it. 
The discipline sits at the join of\nstatistics, computing, and the messy domain where the data was generated. Its\nreason for being is that humans are confidently wrong about patterns: we see\nfaces in clouds, trends in noise, and causes where there are only correlations.\nA data scientist quantifies how much you should believe a claim, and is honest\nwhen the answer is \"the data can't tell us that.\" Without that discipline, the\nloudest opinion wins; with it, the evidence does.","html":"<h2 id=\"purpose\">Purpose</h2>\n<p>A data scientist exists to turn data into decisions better than the ones the\norganization would have made without it. The discipline sits at the join of\nstatistics, computing, and the messy domain where the data was generated. Its\nreason for being is that humans are confidently wrong about patterns: we see\nfaces in clouds, trends in noise, and causes where there are only correlations.\nA data scientist quantifies how much you should believe a claim, and is honest\nwhen the answer is &quot;the data can&#39;t tell us that.&quot; Without that discipline, the\nloudest opinion wins; with it, the evidence does.</p>\n","wordCount":101},{"heading":"Core Mission","id":"core-mission","markdown":"Extract reliable, decision-relevant signal from noisy data, quantify the\nuncertainty around it, and communicate both clearly enough that a non-expert acts\ncorrectly.","html":"<h2 id=\"core-mission\">Core Mission</h2>\n<p>Extract reliable, decision-relevant signal from noisy data, quantify the\nuncertainty around it, and communicate both clearly enough that a non-expert acts\ncorrectly.</p>\n","wordCount":24},{"heading":"Primary Responsibilities","id":"primary-responsibilities","markdown":"The glamorous image is building models; the actual work is upstream and\ndownstream of that. A data scientist frames a vague business question into one a\ndataset can answer (\"does this feature lift retention?\" becomes a defined\nmetric, population, and comparison). 
They find, clean, and interrogate the data,\nwhere most of the hours and the errors live. They choose the simplest method\nthat fits — often a well-specified regression or an A/B test, not a neural\nnetwork. They validate honestly, guarding against the ways an analysis can fool\nyou, and translate a coefficient or confidence interval into a sentence a product\nmanager can act on. And they own the consequences: deciding when the data does\nnot support the decision someone wants, and saying so out loud.","html":"<h2 id=\"primary-responsibilities\">Primary Responsibilities</h2>\n<p>The glamorous image is building models; the actual work is upstream and\ndownstream of that. A data scientist frames a vague business question into one a\ndataset can answer (&quot;does this feature lift retention?&quot; becomes a defined\nmetric, population, and comparison). They find, clean, and interrogate the data,\nwhere most of the hours and the errors live. They choose the simplest method\nthat fits — often a well-specified regression or an A/B test, not a neural\nnetwork. They validate honestly, guarding against the ways an analysis can fool\nyou, and translate a coefficient or confidence interval into a sentence a product\nmanager can act on. And they own the consequences: deciding when the data does\nnot support the decision someone wants, and saying so out loud.</p>\n","wordCount":127},{"heading":"Guiding Principles","id":"guiding-principles","markdown":"- **Garbage in, gospel out is the real danger.** A clean-looking number from\n  dirty data is more dangerous than no number, because people believe it. 
Most\n  trustworthiness is decided before any model runs.\n- **Correlation is not causation, and you will be asked to forget that.** The\n  business wants causal claims from observational data constantly; knowing when\n  you can and cannot make that leap is the core skill.\n- **The simplest model that answers the question wins.** Complexity buys\n  marginal accuracy at the cost of interpretability, fragility, and your ability\n  to debug it at 2 a.m.\n- **Quantify uncertainty or you haven't finished.** A point estimate without a\n  range is a guess wearing a lab coat.\n- **The question is harder than the math.** Framing it — with the right\n  population and metric — is where analyses succeed or fail.\n- **Be your own harshest skeptic.** Before someone else finds the confound, the\n  leakage, or the survivorship bias, find it yourself — and validate on data the\n  model has never seen, since training performance measures memory, not knowledge.","html":"<h2 id=\"guiding-principles\">Guiding Principles</h2>\n<ul>\n<li><strong>Garbage in, gospel out is the real danger.</strong> A clean-looking number from\ndirty data is more dangerous than no number, because people believe it. 
Most\ntrustworthiness is decided before any model runs.</li>\n<li><strong>Correlation is not causation, and you will be asked to forget that.</strong> The\nbusiness wants causal claims from observational data constantly; knowing when\nyou can and cannot make that leap is the core skill.</li>\n<li><strong>The simplest model that answers the question wins.</strong> Complexity buys\nmarginal accuracy at the cost of interpretability, fragility, and your ability\nto debug it at 2 a.m.</li>\n<li><strong>Quantify uncertainty or you haven&#39;t finished.</strong> A point estimate without a\nrange is a guess wearing a lab coat.</li>\n<li><strong>The question is harder than the math.</strong> Framing it — with the right\npopulation and metric — is where analyses succeed or fail.</li>\n<li><strong>Be your own harshest skeptic.</strong> Before someone else finds the confound, the\nleakage, or the survivorship bias, find it yourself — and validate on data the\nmodel has never seen, since training performance measures memory, not knowledge.</li>\n</ul>\n","wordCount":171},{"heading":"Mental Models","id":"mental-models","markdown":"- **The bias–variance tradeoff.** Error comes from being systematically wrong\n  (bias, too simple) or unstable across samples (variance, too complex). Every\n  modeling choice moves you along this curve; tune to where total error bottoms\n  out, not to either extreme.\n- **The data-generating process.** Behind every dataset is a real-world\n  mechanism that produced it. Model the process, not just the numbers — that's\n  what tells you which assumptions are safe and where the data lies to you.\n- **Sampling and selection bias.** Who is *missing* from the data? 
Survivorship\n  bias, non-response, and selection effects mean your sample answers a different\n  question than the one you asked.\n- **Regression to the mean.** Extreme observations are followed by less extreme\n  ones for purely statistical reasons; mistake this for a real effect and you\n  \"prove\" punishment works and praise doesn't.\n- **Simpson's paradox.** A trend that holds in every subgroup can reverse in the\n  aggregate; always ask what you're averaging over.\n- **Base rates and Bayes.** A 99%-accurate test for a 1-in-10,000 disease is\n  mostly false positives. Priors are not optional; ignoring base rates is the\n  most common reasoning error in the field.","html":"<h2 id=\"mental-models\">Mental Models</h2>\n<ul>\n<li><strong>The bias–variance tradeoff.</strong> Error comes from being systematically wrong\n(bias, too simple) or unstable across samples (variance, too complex). Every\nmodeling choice moves you along this curve; tune to where total error bottoms\nout, not to either extreme.</li>\n<li><strong>The data-generating process.</strong> Behind every dataset is a real-world\nmechanism that produced it. Model the process, not just the numbers — that&#39;s\nwhat tells you which assumptions are safe and where the data lies to you.</li>\n<li><strong>Sampling and selection bias.</strong> Who is <em>missing</em> from the data? 
Survivorship\nbias, non-response, and selection effects mean your sample answers a different\nquestion than the one you asked.</li>\n<li><strong>Regression to the mean.</strong> Extreme observations are followed by less extreme\nones for purely statistical reasons; mistake this for a real effect and you\n&quot;prove&quot; punishment works and praise doesn&#39;t.</li>\n<li><strong>Simpson&#39;s paradox.</strong> A trend that holds in every subgroup can reverse in the\naggregate; always ask what you&#39;re averaging over.</li>\n<li><strong>Base rates and Bayes.</strong> A 99%-accurate test for a 1-in-10,000 disease is\nmostly false positives. Priors are not optional; ignoring base rates is the\nmost common reasoning error in the field.</li>\n</ul>\n","wordCount":190},{"heading":"First Principles","id":"first-principles","markdown":"- The map is not the territory; every dataset is a lossy, biased projection of\n  reality.\n- You can always find a pattern in noise if you look hard enough — significance\n  testing exists to stop you.\n- An effect you can't reproduce on fresh data didn't happen.\n- The costs of a wrong \"yes\" and a wrong \"no\" are rarely equal; optimize for the\n  decision's loss function, not for accuracy.","html":"<h2 id=\"first-principles\">First Principles</h2>\n<ul>\n<li>The map is not the territory; every dataset is a lossy, biased projection of\nreality.</li>\n<li>You can always find a pattern in noise if you look hard enough — significance\ntesting exists to stop you.</li>\n<li>An effect you can&#39;t reproduce on fresh data didn&#39;t happen.</li>\n<li>The costs of a wrong &quot;yes&quot; and a wrong &quot;no&quot; are rarely equal; optimize for the\ndecision&#39;s loss function, not for accuracy.</li>\n</ul>\n","wordCount":66},{"heading":"Questions Experts Constantly Ask","id":"questions-experts-constantly-ask","markdown":"- What decision will this analysis actually change? 
If none, why are we doing it?\n- How was this data generated, and who or what is missing from it?\n- What would I expect to see if my hypothesis were *false*?\n- Is this difference real, or is it within the noise I'd see by chance?\n- What's the base rate, and have I accounted for it?\n- Could a confounder explain this entirely?\n- Am I testing on data the model has seen, or leaking from the future?","html":"<h2 id=\"questions-experts-constantly-ask\">Questions Experts Constantly Ask</h2>\n<ul>\n<li>What decision will this analysis actually change? If none, why are we doing it?</li>\n<li>How was this data generated, and who or what is missing from it?</li>\n<li>What would I expect to see if my hypothesis were <em>false</em>?</li>\n<li>Is this difference real, or is it within the noise I&#39;d see by chance?</li>\n<li>What&#39;s the base rate, and have I accounted for it?</li>\n<li>Could a confounder explain this entirely?</li>\n<li>Am I testing on data the model has seen, or leaking from the future?</li>\n</ul>\n","wordCount":82},{"heading":"Decision Frameworks","id":"decision-frameworks","markdown":"- **Experiment vs. observe.** If you can randomize, run an A/B test — the\n  cleanest path to causation. If you can't, reach for quasi-experimental tools\n  (difference-in-differences, regression discontinuity, instrumental variables)\n  and state the assumptions you're now relying on.\n- **Hypothesis-first, not data-dredging.** Decide the hypothesis and metric\n  before looking, or pre-register them. Exploratory findings are leads, not\n  conclusions, and must be confirmed on held-out data.\n- **The decision's loss function.** Pick the metric that maps to the business\n  cost — precision vs. recall, false positives vs. false negatives — by asking\n  which error hurts more and by how much.\n- **Statistical vs. practical significance.** A p-value tells you an effect is\n  detectable, not that it matters. 
Always report the effect size next to it, and\n  estimate the value of more accuracy before chasing it — the last two points of\n  AUC often cost more than the decision is worth.","html":"<h2 id=\"decision-frameworks\">Decision Frameworks</h2>\n<ul>\n<li><strong>Experiment vs. observe.</strong> If you can randomize, run an A/B test — the\ncleanest path to causation. If you can&#39;t, reach for quasi-experimental tools\n(difference-in-differences, regression discontinuity, instrumental variables)\nand state the assumptions you&#39;re now relying on.</li>\n<li><strong>Hypothesis-first, not data-dredging.</strong> Decide the hypothesis and metric\nbefore looking, or pre-register them. Exploratory findings are leads, not\nconclusions, and must be confirmed on held-out data.</li>\n<li><strong>The decision&#39;s loss function.</strong> Pick the metric that maps to the business\ncost — precision vs. recall, false positives vs. false negatives — by asking\nwhich error hurts more and by how much.</li>\n<li><strong>Statistical vs. practical significance.</strong> A p-value tells you an effect is\ndetectable, not that it matters. Always report the effect size next to it, and\nestimate the value of more accuracy before chasing it — the last two points of\nAUC often cost more than the decision is worth.</li>\n</ul>\n","wordCount":150},{"heading":"Workflow","id":"workflow","markdown":"1. **Frame.** Turn the stakeholder's question into a precise, falsifiable one\n   with a defined population, metric, and decision attached.\n2. **Get and inspect the data.** Profile it, plot it, count the nulls,\n   duplicates, and impossible values. Understand provenance before trusting it.\n3. **Explore.** EDA: distributions, correlations, anomalies. Form hypotheses;\n   resist concluding from this stage.\n4. **Design the test or model.** Choose the simplest method that answers the\n   question. Split into train/validation/test *before* you peek.\n5. 
**Build and validate.** Cross-validate; check residuals, calibration, and\n   performance on the untouched test set against a naive baseline.\n6. **Stress-test for the usual sins.** Leakage, confounders, selection bias,\n   multiple-comparisons inflation.\n7. **Communicate.** One clear recommendation, the uncertainty around it, and the\n   assumptions that would change it. Show the chart, not the matrix.\n8. **Monitor.** A shipped model drifts; watch its inputs and outputs over time.","html":"<h2 id=\"workflow\">Workflow</h2>\n<ol>\n<li><strong>Frame.</strong> Turn the stakeholder&#39;s question into a precise, falsifiable one\nwith a defined population, metric, and decision attached.</li>\n<li><strong>Get and inspect the data.</strong> Profile it, plot it, count the nulls,\nduplicates, and impossible values. Understand provenance before trusting it.</li>\n<li><strong>Explore.</strong> EDA: distributions, correlations, anomalies. Form hypotheses;\nresist concluding from this stage.</li>\n<li><strong>Design the test or model.</strong> Choose the simplest method that answers the\nquestion. Split into train/validation/test <em>before</em> you peek.</li>\n<li><strong>Build and validate.</strong> Cross-validate; check residuals, calibration, and\nperformance on the untouched test set against a naive baseline.</li>\n<li><strong>Stress-test for the usual sins.</strong> Leakage, confounders, selection bias,\nmultiple-comparisons inflation.</li>\n<li><strong>Communicate.</strong> One clear recommendation, the uncertainty around it, and the\nassumptions that would change it. Show the chart, not the matrix.</li>\n<li><strong>Monitor.</strong> A shipped model drifts; watch its inputs and outputs over time.</li>\n</ol>\n","wordCount":145},{"heading":"Common Tradeoffs","id":"common-tradeoffs","markdown":"- **Accuracy vs. 
interpretability.** A boosted-tree ensemble may beat logistic\n  regression by two points and lose the stakeholder's trust — and your ability\n  to explain a denied loan to a regulator.\n- **Bias vs. variance.** The central modeling dial; over-fit and under-fit are\n  the two cliffs.\n- **Speed vs. rigor.** The business wants an answer Friday; the clean answer\n  takes three weeks. Name the confidence the fast answer carries.\n- **More data vs. better data.** Ten times the rows of biased data just makes\n  you confidently wrong faster.","html":"<h2 id=\"common-tradeoffs\">Common Tradeoffs</h2>\n<ul>\n<li><strong>Accuracy vs. interpretability.</strong> A boosted-tree ensemble may beat logistic\nregression by two points and lose the stakeholder&#39;s trust — and your ability\nto explain a denied loan to a regulator.</li>\n<li><strong>Bias vs. variance.</strong> The central modeling dial; over-fit and under-fit are\nthe two cliffs.</li>\n<li><strong>Speed vs. rigor.</strong> The business wants an answer Friday; the clean answer\ntakes three weeks. Name the confidence the fast answer carries.</li>\n<li><strong>More data vs. 
better data.</strong> Ten times the rows of biased data just makes\nyou confidently wrong faster.</li>\n</ul>\n","wordCount":86},{"heading":"Rules of Thumb","id":"rules-of-thumb","markdown":"- Plot the data before you model it; the eye catches what the summary statistic\n  hides (see Anscombe's quartet).\n- If the result is surprising, suspect a bug before a discovery.\n- A p-value near 0.05 from one of twenty tests is noise wearing a hat.\n- When the model is too good, look for leakage — you're probably predicting the\n  answer from the answer.\n- Always compare against the dumbest baseline: predict the mean or the majority\n  class.\n- If you can't draw it on a whiteboard, the stakeholder won't trust it.","html":"<h2 id=\"rules-of-thumb\">Rules of Thumb</h2>\n<ul>\n<li>Plot the data before you model it; the eye catches what the summary statistic\nhides (see Anscombe&#39;s quartet).</li>\n<li>If the result is surprising, suspect a bug before a discovery.</li>\n<li>A p-value near 0.05 from one of twenty tests is noise wearing a hat.</li>\n<li>When the model is too good, look for leakage — you&#39;re probably predicting the\nanswer from the answer.</li>\n<li>Always compare against the dumbest baseline: predict the mean or the majority\nclass.</li>\n<li>If you can&#39;t draw it on a whiteboard, the stakeholder won&#39;t trust it.</li>\n</ul>\n","wordCount":88},{"heading":"Failure Modes","id":"failure-modes","markdown":"- **p-hacking / the garden of forking paths.** Trying enough analyses that\n  something crosses significance by chance, then reporting only that.\n- **Data leakage.** A feature that encodes the target — a future timestamp, an\n  ID that correlates with the label — yielding spectacular validation scores\n  that vanish in production.\n- **Confounding ignored.** Reporting that ice cream causes drowning because both\n  rise in summer.\n- **Overfitting to the test set.** Tuning against the \"held-out\" data so many\n  times it's no longer held out.\n- 
**Modeling theater.** A sophisticated model deployed where a simple rule would\n  do, impressive to peers and useless to the business.\n- **Confusing significance with importance.** A statistically real effect too\n  small to matter, sold as a finding.","html":"<h2 id=\"failure-modes\">Failure Modes</h2>\n<ul>\n<li><strong>p-hacking / the garden of forking paths.</strong> Trying enough analyses that\nsomething crosses significance by chance, then reporting only that.</li>\n<li><strong>Data leakage.</strong> A feature that encodes the target — a future timestamp, an\nID that correlates with the label — yielding spectacular validation scores\nthat vanish in production.</li>\n<li><strong>Confounding ignored.</strong> Reporting that ice cream causes drowning because both\nrise in summer.</li>\n<li><strong>Overfitting to the test set.</strong> Tuning against the &quot;held-out&quot; data so many\ntimes it&#39;s no longer held out.</li>\n<li><strong>Modeling theater.</strong> A sophisticated model deployed where a simple rule would\ndo, impressive to peers and useless to the business.</li>\n<li><strong>Confusing significance with importance.</strong> A statistically real effect too\nsmall to matter, sold as a finding.</li>\n</ul>\n","wordCount":114},{"heading":"Anti-patterns","id":"anti-patterns","markdown":"- **Boiling the ocean** — analyzing everything because the question wasn't framed.\n- **The vanity dashboard** — metrics nobody decides anything from.\n- **Black-box-by-default** — reaching for deep learning on tabular data a tree\n  would model better and explain.\n- **Mean-imputing your way to fiction** — filling missing values without asking\n  why they're missing.\n- **The accuracy trap** — optimizing accuracy on an imbalanced problem where\n  predicting \"no\" every time scores 99%.","html":"<h2 id=\"anti-patterns\">Anti-patterns</h2>\n<ul>\n<li><strong>Boiling the ocean</strong> — analyzing everything because the question wasn&#39;t 
framed.</li>\n<li><strong>The vanity dashboard</strong> — metrics nobody decides anything from.</li>\n<li><strong>Black-box-by-default</strong> — reaching for deep learning on tabular data a tree\nwould model better and explain.</li>\n<li><strong>Mean-imputing your way to fiction</strong> — filling missing values without asking\nwhy they&#39;re missing.</li>\n<li><strong>The accuracy trap</strong> — optimizing accuracy on an imbalanced problem where\npredicting &quot;no&quot; every time scores 99%.</li>\n</ul>\n","wordCount":66},{"heading":"Vocabulary","id":"vocabulary","markdown":"- **p-value** — probability of data this extreme if the null hypothesis were\n  true; not the probability the hypothesis is false.\n- **Confidence interval** — a range that, under repeated sampling, contains the\n  true value at the stated rate.\n- **Overfitting** — modeling the noise in the sample as if it were signal.\n- **Confounder** — a variable that influences both the supposed cause and effect.\n- **Cross-validation** — rotating which slice is held out to estimate error.\n- **Precision / recall** — of the predicted positives, how many were right /\n  of the actual positives, how many were caught.\n- **AUC / ROC** — a threshold-free measure of ranking quality.","html":"<h2 id=\"vocabulary\">Vocabulary</h2>\n<ul>\n<li><strong>p-value</strong> — probability of data this extreme if the null hypothesis were\ntrue; not the probability the hypothesis is false.</li>\n<li><strong>Confidence interval</strong> — a range that, under repeated sampling, contains the\ntrue value at the stated rate.</li>\n<li><strong>Overfitting</strong> — modeling the noise in the sample as if it were signal.</li>\n<li><strong>Confounder</strong> — a variable that influences both the supposed cause and effect.</li>\n<li><strong>Cross-validation</strong> — rotating which slice is held out to estimate error.</li>\n<li><strong>Precision / recall</strong> — of the predicted positives, how many were right /\nof the 
actual positives, how many were caught.</li>\n<li><strong>AUC / ROC</strong> — a threshold-free measure of ranking quality.</li>\n</ul>\n","wordCount":97},{"heading":"Tools","id":"tools","markdown":"- **Python / R** — pandas, NumPy, scikit-learn, statsmodels; tidyverse for the\n  R-minded.\n- **SQL** — the unavoidable first language of data; most analyses begin with a\n  query.\n- **Notebooks** — Jupyter for exploration, with the discipline to graduate the\n  good parts into versioned scripts.\n- **Visualization** — matplotlib, ggplot2, seaborn; the chart is the deliverable.\n- **Experimentation platforms** — for running and analyzing A/B tests at scale.\n- **Version control and pipelines** — git, DVC, and orchestration so an analysis\n  is reproducible, not a one-time miracle in someone's notebook.","html":"<h2 id=\"tools\">Tools</h2>\n<ul>\n<li><strong>Python / R</strong> — pandas, NumPy, scikit-learn, statsmodels; tidyverse for the\nR-minded.</li>\n<li><strong>SQL</strong> — the unavoidable first language of data; most analyses begin with a\nquery.</li>\n<li><strong>Notebooks</strong> — Jupyter for exploration, with the discipline to graduate the\ngood parts into versioned scripts.</li>\n<li><strong>Visualization</strong> — matplotlib, ggplot2, seaborn; the chart is the deliverable.</li>\n<li><strong>Experimentation platforms</strong> — for running and analyzing A/B tests at scale.</li>\n<li><strong>Version control and pipelines</strong> — git, DVC, and orchestration so an analysis\nis reproducible, not a one-time miracle in someone&#39;s notebook.</li>\n</ul>\n","wordCount":81},{"heading":"Collaboration","id":"collaboration","markdown":"A data scientist is a translator, and the job lives at the translation seams.\nWith product managers and executives, they convert business questions into\nanalyzable ones and results back into decisions, fighting the constant pull\ntoward the answer the stakeholder already wanted. 
With data engineers, they\ndepend on clean pipelines and feel every gap. With software and ML engineers,\nthey hand off models that must survive production. The recurring failure is the\ndata scientist working in isolation, producing a technically correct analysis of\nthe wrong question. Good ones spend as much effort on framing as on analysis.","html":"<h2 id=\"collaboration\">Collaboration</h2>\n<p>A data scientist is a translator, and the job lives at the translation seams.\nWith product managers and executives, they convert business questions into\nanalyzable ones and results back into decisions, fighting the constant pull\ntoward the answer the stakeholder already wanted. With data engineers, they\ndepend on clean pipelines and feel every gap. With software and ML engineers,\nthey hand off models that must survive production. The recurring failure is the\ndata scientist working in isolation, producing a technically correct analysis of\nthe wrong question. Good ones spend as much effort on framing as on analysis.</p>\n","wordCount":97},{"heading":"Ethics","id":"ethics","markdown":"Data is people, and models made from it allocate opportunity: who gets the loan,\nthe interview, the longer prison sentence. The duties that follow: audit for\ndisparate impact even when nobody asked, because a model trained on biased\nhistory reproduces that bias at scale; protect privacy and resist\nre-identification; refuse to torture data until it confesses the conclusion the\nclient paid for; and disclose uncertainty honestly rather than projecting false\nprecision to win the room. The quiet ethical line runs between analysis and\nadvocacy, and a data scientist's integrity is whether they hold it.","html":"<h2 id=\"ethics\">Ethics</h2>\n<p>Data is people, and models made from it allocate opportunity: who gets the loan,\nthe interview, the longer prison sentence. 
The duties that follow: audit for\ndisparate impact even when nobody asked, because a model trained on biased\nhistory reproduces that bias at scale; protect privacy and resist\nre-identification; refuse to torture data until it confesses the conclusion the\nclient paid for; and disclose uncertainty honestly rather than projecting false\nprecision to win the room. The quiet ethical line runs between analysis and\nadvocacy, and a data scientist&#39;s integrity is whether they hold it.</p>\n","wordCount":95},{"heading":"Scenarios","id":"scenarios","markdown":"**The miracle model.** A churn model lands at 0.99 AUC on validation — far\nbetter than anything the team has seen. The expert's reaction is suspicion, not\ncelebration. They check feature importance and find `account_closed_date` near\nthe top: a field populated only after a customer churns. The model is predicting\nthe future from the future. Remove the leaked feature, the AUC drops to a\nbelievable 0.78, and the model becomes one you can actually deploy. The lesson:\nwhen a model is too good, audit for leakage before you trust it.\n\n**The feature that \"increased\" retention.** Product reports that users of a new\nfeature retain far better and wants to roll it out hard. The data scientist\nrefuses to read causation into that: the users who *chose* the feature are the\nalready-engaged ones — a selection effect, not a treatment effect. A randomized\nrollout to half the users shows a real but much smaller lift, and only for new\nusers, so the rollout is targeted accordingly — saving the cost of a feature\nthat did nothing for most of its intended base.\n\n**The Friday answer.** An executive needs a number by end of day to decide on a\nmarket launch; the honest analysis needs two weeks. 
Rather than refuse or\noverstate, the data scientist gives the back-of-envelope estimate *with its\nrange* — \"between a 3% and 12% lift, most likely around 6%, and here are the\nassumptions that could move it\" — and flags which one the two-week study would\nnail down. The executive makes a calibrated bet instead of a blind one.","html":"<h2 id=\"scenarios\">Scenarios</h2>\n<p><strong>The miracle model.</strong> A churn model lands at 0.99 AUC on validation — far\nbetter than anything the team has seen. The expert&#39;s reaction is suspicion, not\ncelebration. They check feature importance and find <code>account_closed_date</code> near\nthe top: a field populated only after a customer churns. The model is predicting\nthe future from the future. Remove the leaked feature, the AUC drops to a\nbelievable 0.78, and the model becomes one you can actually deploy. The lesson:\nwhen a model is too good, audit for leakage before you trust it.</p>\n<p><strong>The feature that &quot;increased&quot; retention.</strong> Product reports that users of a new\nfeature retain far better and wants to roll it out hard. The data scientist\nrefuses to read causation into that: the users who <em>chose</em> the feature are the\nalready-engaged ones — a selection effect, not a treatment effect. A randomized\nrollout to half the users shows a real but much smaller lift, and only for new\nusers, so the rollout is targeted accordingly — saving the cost of a feature\nthat did nothing for most of its intended base.</p>\n<p><strong>The Friday answer.</strong> An executive needs a number by end of day to decide on a\nmarket launch; the honest analysis needs two weeks. Rather than refuse or\noverstate, the data scientist gives the back-of-envelope estimate <em>with its\nrange</em> — &quot;between a 3% and 12% lift, most likely around 6%, and here are the\nassumptions that could move it&quot; — and flags which one the two-week study would\nnail down. 
The executive makes a calibrated bet instead of a blind one.</p>\n","wordCount":262},{"heading":"Related Occupations","id":"related-occupations","markdown":"A data scientist shares the software engineer's need to ship reliable code but\nreasons in distributions and uncertainty where the engineer reasons in\ncorrectness and state. The machine learning engineer takes the models a data\nscientist prototypes and makes them run at scale in production — overlapping\ndeeply but optimizing for systems rather than insight. Data engineers build the\npipelines a data scientist depends on and complains about. Statisticians and\nresearch scientists are the methodological ancestors, holding the field to its\ninferential standards. Product managers supply the questions and consume the\nanswers.","html":"<h2 id=\"related-occupations\">Related Occupations</h2>\n<p>A data scientist shares the software engineer&#39;s need to ship reliable code but\nreasons in distributions and uncertainty where the engineer reasons in\ncorrectness and state. The machine learning engineer takes the models a data\nscientist prototypes and makes them run at scale in production — overlapping\ndeeply but optimizing for systems rather than insight. Data engineers build the\npipelines a data scientist depends on and complains about. Statisticians and\nresearch scientists are the methodological ancestors, holding the field to its\ninferential standards. 
Product managers supply the questions and consume the\nanswers.</p>\n","wordCount":91},{"heading":"References","id":"references","markdown":"- *The Elements of Statistical Learning* — Hastie, Tibshirani, Friedman\n- *Statistical Rethinking* — Richard McElreath\n- *Causal Inference: The Mixtape* — Scott Cunningham\n- *The Signal and the Noise* — Nate Silver\n- *Trustworthy Online Controlled Experiments* — Kohavi, Tang, Xu","html":"<h2 id=\"references\">References</h2>\n<ul>\n<li><em>The Elements of Statistical Learning</em> — Hastie, Tibshirani, Friedman</li>\n<li><em>Statistical Rethinking</em> — Richard McElreath</li>\n<li><em>Causal Inference: The Mixtape</em> — Scott Cunningham</li>\n<li><em>The Signal and the Noise</em> — Nate Silver</li>\n<li><em>Trustworthy Online Controlled Experiments</em> — Kohavi, Tang, Xu</li>\n</ul>\n","wordCount":32}],"computed":{"wordCount":2165,"readingTimeMinutes":10,"completeness":1,"backlinks":["actuary","ai-safety-researcher","astronomer","bioinformatics-scientist","cartographer","climate-scientist","data-analyst","data-engineer","economist","epidemiologist","machine-learning-engineer","marketing-manager","mathematician","military-intelligence-analyst","neuroscientist","operations-research-analyst","physicist","policy-analyst","prompt-engineer","research-scientist","software-engineer","sports-analyst","statistician","trader","ux-researcher"],"verified":false,"aiDrafted":true,"unverifiedAiDraft":true},"git":{"created":"2026-06-26","updated":"2026-06-26","revisions":2,"authors":[{"name":"soul-atlas","commits":2}],"timeline":[{"date":"2026-06-26","author":"soul-atlas"},{"date":"2026-06-26","author":"soul-atlas"}]},"citation":{"apa":"soul-atlas (2026). Data Scientist [SOUL]. SOUL Atlas. 
https://soul-atlas.github.io/occupations/data-scientist","bibtex":"@misc{soulatlas-data-scientist,\n  title        = {Data Scientist},\n  author       = {soul-atlas},\n  year         = {2026},\n  howpublished = {SOUL Atlas},\n  note         = {SOUL.md, version 2026-06-26},\n  url          = {https://soul-atlas.github.io/occupations/data-scientist}\n}","text":"soul-atlas. \"Data Scientist.\" SOUL Atlas, 2026. https://soul-atlas.github.io/occupations/data-scientist."}}