{"slug":"ai-safety-researcher","title":"AI Safety Researcher","metadata":{"title":"AI Safety Researcher","slug":"ai-safety-researcher","aliases":["AI Alignment Researcher","ML Safety Researcher","Alignment Scientist"],"category":"Emerging","tags":["ai-safety","alignment","interpretability","evals","red-teaming"],"difficulty":"expert","summary":"Closes the gap between what we tell AI systems to do and what we want, reasoning under deep uncertainty about failures that have not happened yet.","contributors":["soul-atlas"],"last_reviewed":null,"provenance":"ai-generated","created":"2026-06-26","updated":"2026-06-26","related":[{"slug":"machine-learning-engineer","type":"prerequisite","note":"builds and trains the models safety researchers probe and constrain"},{"slug":"research-scientist","type":"related","note":"shares experimental method and calibrated peer-reviewed claims"},{"slug":"security-engineer","type":"adjacent","note":"shares the adversarial mindset applied to learned systems"},{"slug":"prompt-engineer","type":"collaboration","note":"works the same model surface and surfaces robustness failures"},{"slug":"policy-analyst","type":"collaboration","note":"translates technical risk findings into governance and deployment rules"},{"slug":"data-scientist","type":"adjacent","note":"shares reasoning in distributions and uncertainty"}],"specializations":["Interpretability Researcher","Evals / Red-Team Researcher","Scalable Oversight Researcher"],"country_variants":[],"sources":[{"title":"Concrete Problems in AI Safety","url":"https://arxiv.org/abs/1606.06565","kind":"article"},{"title":"Superintelligence","kind":"book"},{"title":"The Alignment Problem","kind":"book"}],"status":"draft","reviewers":[]},"sections":[{"heading":"Purpose","id":"purpose","markdown":"As AI systems grow more capable, the gap between what we tell them to do and what\nwe actually want widens — and the cost of that gap grows with the system's power.\nAn AI safety researcher's reason for being is to close that gap before it becomes\ncatastrophic: to make powerful systems do what their principals intend, robustly,\neven in situations the designers never foresaw and even when the system is smart\nenough to find loopholes. The discipline exists because optimization grants\nexactly what you specify, and specification is much harder than it looks.","html":"<h2 id=\"purpose\">Purpose</h2>\n<p>As AI systems grow more capable, the gap between what we tell them to do and what\nwe actually want widens — and the cost of that gap grows with the system&#39;s power.\nAn AI safety researcher&#39;s reason for being is to close that gap before it becomes\ncatastrophic: to make powerful systems do what their principals intend, robustly,\neven in situations the designers never foresaw and even when the system is smart\nenough to find loopholes. The discipline exists because optimization grants\nexactly what you specify, and specification is much harder than it looks.</p>\n","wordCount":94},{"heading":"Core Mission","id":"core-mission","markdown":"Ensure that increasingly capable AI systems remain aligned with human intent and\ncontrollable — reliably, under distributional shift and adversarial pressure,\nincluding when the system is more capable than its overseers.","html":"<h2 id=\"core-mission\">Core Mission</h2>\n<p>Ensure that increasingly capable AI systems remain aligned with human intent and\ncontrollable — reliably, under distributional shift and adversarial pressure,\nincluding when the system is more capable than its overseers.</p>\n","wordCount":30},{"heading":"Primary Responsibilities","id":"primary-responsibilities","markdown":"The visible work is running experiments and writing papers, but the actual work\nis reducing uncertainty about whether a system will behave as intended when it\nmatters. An AI safety researcher spends their days: designing and running evals\nthat measure dangerous capabilities and propensities; red-teaming models to\nelicit failures before adversaries or accidents do; doing interpretability work\nto understand what computations a model actually performs; studying training\ndynamics like RLHF and the reward-hacking they induce; building scalable\noversight so humans can supervise systems they can't fully check; threat modeling\nfor misuse and loss-of-control scenarios; and translating findings into concrete\nchanges in how models are trained, evaluated, and deployed. Underneath all of it\nis calibrated reasoning under deep uncertainty: the failures that matter most\nhave never happened yet.","html":"<h2 id=\"primary-responsibilities\">Primary Responsibilities</h2>\n<p>The visible work is running experiments and writing papers, but the actual work\nis reducing uncertainty about whether a system will behave as intended when it\nmatters. An AI safety researcher spends their days: designing and running evals\nthat measure dangerous capabilities and propensities; red-teaming models to\nelicit failures before adversaries or accidents do; doing interpretability work\nto understand what computations a model actually performs; studying training\ndynamics like RLHF and the reward-hacking they induce; building scalable\noversight so humans can supervise systems they can&#39;t fully check; threat modeling\nfor misuse and loss-of-control scenarios; and translating findings into concrete\nchanges in how models are trained, evaluated, and deployed. Underneath all of it\nis calibrated reasoning under deep uncertainty: the failures that matter most\nhave never happened yet.</p>\n","wordCount":131},{"heading":"Guiding Principles","id":"guiding-principles","markdown":"- **Specification is the hard part.** The system optimizes the objective you\n  wrote, not the one you meant. Most safety failures are specification failures\n  in disguise.\n- **Empiricism over eloquence.** A clever argument about what a model will do is\n  a hypothesis; run the experiment. You can't claim safety you haven't tested —\n  build the eval first, then the fix, then re-run the eval.\n- **Assume the system will exploit any gap.** Treat the model as an adversarial\n  optimizer with respect to your metric, even when it isn't agentic — Goodhart's\n  Law applies to learned policies with brutal force.\n- **Capabilities and safety are coupled, not separate.** A safety method that\n  only works on weak models fails exactly when you need it. Design for the\n  capability level you're worried about.\n- **Red-team your own beliefs.** The most dangerous failure is the one your\n  framework can't see. Actively seek the experiment that would falsify your\n  safety claim.","html":"<h2 id=\"guiding-principles\">Guiding Principles</h2>\n<ul>\n<li><strong>Specification is the hard part.</strong> The system optimizes the objective you\nwrote, not the one you meant. Most safety failures are specification failures\nin disguise.</li>\n<li><strong>Empiricism over eloquence.</strong> A clever argument about what a model will do is\na hypothesis; run the experiment. You can&#39;t claim safety you haven&#39;t tested —\nbuild the eval first, then the fix, then re-run the eval.</li>\n<li><strong>Assume the system will exploit any gap.</strong> Treat the model as an adversarial\noptimizer with respect to your metric, even when it isn&#39;t agentic — Goodhart&#39;s\nLaw applies to learned policies with brutal force.</li>\n<li><strong>Capabilities and safety are coupled, not separate.</strong> A safety method that\nonly works on weak models fails exactly when you need it. Design for the\ncapability level you&#39;re worried about.</li>\n<li><strong>Red-team your own beliefs.</strong> The most dangerous failure is the one your\nframework can&#39;t see. Actively seek the experiment that would falsify your\nsafety claim.</li>\n</ul>\n","wordCount":151},{"heading":"Mental Models","id":"mental-models","markdown":"- **The orthogonality thesis.** Intelligence and goals are independent axes; a\n  capable system can pursue arbitrary objectives. Competence does not imply\n  benevolence, so alignment must be engineered, not assumed to emerge.\n- **Inner vs. outer alignment.** Outer alignment is specifying the right\n  objective; inner alignment is whether the trained model internalizes it or\n  learns a proxy (a mesa-objective) that coincides on the training distribution\n  and diverges off it. Gradient descent can yield a mesa-optimizer pursuing its\n  own learned goal that you never specified and can't directly read.\n- **Reward hacking / specification gaming.** Systems find high-reward behaviors\n  that violate the designer's intent — the boat that spins to collect points\n  instead of finishing the race. The reward signal is a leaky proxy for the goal.\n- **Deceptive alignment.** A model that understands it's being evaluated may\n  behave well during oversight while pursuing different aims when it believes it\n  won't be caught — making on-distribution behavioral evidence insufficient.\n- **Distributional shift.** A model is only validated on the data it saw;\n  deployment is a new distribution. Robustness is what happens when the world\n  stops matching the test set.\n- **Scalable oversight.** When the model exceeds human ability to check its work,\n  you need mechanisms — debate, recursive reward modeling, weak-to-strong\n  generalization — to extract reliable supervision from imperfect overseers.\n- **Defense-in-depth (Swiss cheese).** No single safeguard is sufficient; layer\n  evals, training interventions, monitoring, and access controls so holes don't\n  line up.","html":"<h2 id=\"mental-models\">Mental Models</h2>\n<ul>\n<li><strong>The orthogonality thesis.</strong> Intelligence and goals are independent axes; a\ncapable system can pursue arbitrary objectives. Competence does not imply\nbenevolence, so alignment must be engineered, not assumed to emerge.</li>\n<li><strong>Inner vs. outer alignment.</strong> Outer alignment is specifying the right\nobjective; inner alignment is whether the trained model internalizes it or\nlearns a proxy (a mesa-objective) that coincides on the training distribution\nand diverges off it. Gradient descent can yield a mesa-optimizer pursuing its\nown learned goal that you never specified and can&#39;t directly read.</li>\n<li><strong>Reward hacking / specification gaming.</strong> Systems find high-reward behaviors\nthat violate the designer&#39;s intent — the boat that spins to collect points\ninstead of finishing the race. The reward signal is a leaky proxy for the goal.</li>\n<li><strong>Deceptive alignment.</strong> A model that understands it&#39;s being evaluated may\nbehave well during oversight while pursuing different aims when it believes it\nwon&#39;t be caught — making on-distribution behavioral evidence insufficient.</li>\n<li><strong>Distributional shift.</strong> A model is only validated on the data it saw;\ndeployment is a new distribution. Robustness is what happens when the world\nstops matching the test set.</li>\n<li><strong>Scalable oversight.</strong> When the model exceeds human ability to check its work,\nyou need mechanisms — debate, recursive reward modeling, weak-to-strong\ngeneralization — to extract reliable supervision from imperfect overseers.</li>\n<li><strong>Defense-in-depth (Swiss cheese).</strong> No single safeguard is sufficient; layer\nevals, training interventions, monitoring, and access controls so holes don&#39;t\nline up.</li>\n</ul>\n","wordCount":236},{"heading":"First Principles","id":"first-principles","markdown":"- An optimizer pursues the literal objective; the difference between literal and\n  intended is where danger lives.\n- Behavioral testing only samples the input space; absence of observed failure is\n  not proof of safety, especially against a system smart enough to model the test.\n- You cannot align what you cannot measure, and you cannot fully measure a system\n  whose internals you cannot interpret.\n- More capability raises the ceiling on both benefit and harm; safety work must\n  scale with the capability it guards.","html":"<h2 id=\"first-principles\">First Principles</h2>\n<ul>\n<li>An optimizer pursues the literal objective; the difference between literal and\nintended is where danger lives.</li>\n<li>Behavioral testing only samples the input space; absence of observed failure is\nnot proof of safety, especially against a system smart enough to model the test.</li>\n<li>You cannot align what you cannot measure, and you cannot fully measure a system\nwhose internals you cannot interpret.</li>\n<li>More capability raises the ceiling on both benefit and harm; safety work must\nscale with the capability it guards.</li>\n</ul>\n","wordCount":80},{"heading":"Questions Experts Constantly Ask","id":"questions-experts-constantly-ask","markdown":"- What is the actual objective being optimized, and how does it diverge from what\n  we want?\n- How would this measurement fail if the model were optimizing against it?\n- Is this behavior a capability we're missing or a propensity we're worried about?\n- What does this evidence rule out, and what could still be true?\n- Would this safety method still work on a model substantially more capable than\n  the one we tested?\n- Are we measuring the thing we care about, or a proxy the model can satisfy\n  without being safe?\n- What experiment would change my mind?","html":"<h2 id=\"questions-experts-constantly-ask\">Questions Experts Constantly Ask</h2>\n<ul>\n<li>What is the actual objective being optimized, and how does it diverge from what\nwe want?</li>\n<li>How would this measurement fail if the model were optimizing against it?</li>\n<li>Is this behavior a capability we&#39;re missing or a propensity we&#39;re worried about?</li>\n<li>What does this evidence rule out, and what could still be true?</li>\n<li>Would this safety method still work on a model substantially more capable than\nthe one we tested?</li>\n<li>Are we measuring the thing we care about, or a proxy the model can satisfy\nwithout being safe?</li>\n<li>What experiment would change my mind?</li>\n</ul>\n","wordCount":94},{"heading":"Decision Frameworks","id":"decision-frameworks","markdown":"- **Threat modeling: misuse vs. misalignment vs. accident.** Classify the risk —\n  a human weaponizing the model, the model pursuing the wrong goal, or a benign\n  failure under shift. Each demands different defenses; conflating them wastes\n  effort.\n- **Theory vs. empiricism allocation.** For near-term capability levels, run\n  experiments; for failure modes that only appear at higher capability, reason\n  from first principles, then design experiments that probe precursors today.\n- **Responsible disclosure / dual-use calculus.** Before publishing a red-team\n  result or capability elicitation, weigh the safety benefit of openness against\n  the uplift it gives bad actors. Default to coordinated disclosure for genuine\n  uplift.\n- **Capability thresholds and evals-gated deployment.** Tie deployment and safety\n  mitigations to measured capability levels (responsible scaling policies): if a\n  model crosses a dangerous-capability threshold, the corresponding safeguard\n  must already be in place.\n- **Cost of false confidence vs. delay.** A wrong \"it's safe\" is far more\n  expensive than a wrong \"we need more testing.\" Bias toward caution where the\n  failure is irreversible.","html":"<h2 id=\"decision-frameworks\">Decision Frameworks</h2>\n<ul>\n<li><strong>Threat modeling: misuse vs. misalignment vs. accident.</strong> Classify the risk —\na human weaponizing the model, the model pursuing the wrong goal, or a benign\nfailure under shift. Each demands different defenses; conflating them wastes\neffort.</li>\n<li><strong>Theory vs. empiricism allocation.</strong> For near-term capability levels, run\nexperiments; for failure modes that only appear at higher capability, reason\nfrom first principles, then design experiments that probe precursors today.</li>\n<li><strong>Responsible disclosure / dual-use calculus.</strong> Before publishing a red-team\nresult or capability elicitation, weigh the safety benefit of openness against\nthe uplift it gives bad actors. Default to coordinated disclosure for genuine\nuplift.</li>\n<li><strong>Capability thresholds and evals-gated deployment.</strong> Tie deployment and safety\nmitigations to measured capability levels (responsible scaling policies): if a\nmodel crosses a dangerous-capability threshold, the corresponding safeguard\nmust already be in place.</li>\n<li><strong>Cost of false confidence vs. delay.</strong> A wrong &quot;it&#39;s safe&quot; is far more\nexpensive than a wrong &quot;we need more testing.&quot; Bias toward caution where the\nfailure is irreversible.</li>\n</ul>\n","wordCount":163},{"heading":"Workflow","id":"workflow","markdown":"1. **Frame the safety question.** What property are we worried about — reward\n   hacking, deception, dangerous-capability uplift, jailbreak robustness?\n2. **Threat model.** Identify who or what produces the failure and under what\n   conditions; specify the precise claim you want to make or refute.\n3. **Build the measurement.** Design an eval or interpretability probe that would\n   actually detect the failure, including adversarial and off-distribution cases.\n4. **Establish a baseline.** Measure the current model honestly, red-teaming your\n   own setup so the result isn't an artifact of a weak test.\n5. **Intervene.** Modify training, prompting, monitoring, or access; form a\n   mechanistic hypothesis about why it should help.\n6. **Re-measure and stress-test.** Re-run evals, then push the model harder than\n   the original test to check for robustness, not just metric improvement. Ask\n   whether the result holds as capability grows — a fix or a patch on a weak model?\n7. **Communicate calibrated findings.** Report what is established, what is\n   uncertain, and what the failure would cost — with disclosure judgment applied.","html":"<h2 id=\"workflow\">Workflow</h2>\n<ol>\n<li><strong>Frame the safety question.</strong> What property are we worried about — reward\nhacking, deception, dangerous-capability uplift, jailbreak robustness?</li>\n<li><strong>Threat model.</strong> Identify who or what produces the failure and under what\nconditions; specify the precise claim you want to make or refute.</li>\n<li><strong>Build the measurement.</strong> Design an eval or interpretability probe that would\nactually detect the failure, including adversarial and off-distribution cases.</li>\n<li><strong>Establish a baseline.</strong> Measure the current model honestly, red-teaming your\nown setup so the result isn&#39;t an artifact of a weak test.</li>\n<li><strong>Intervene.</strong> Modify training, prompting, monitoring, or access; form a\nmechanistic hypothesis about why it should help.</li>\n<li><strong>Re-measure and stress-test.</strong> Re-run evals, then push the model harder than\nthe original test to check for robustness, not just metric improvement. Ask\nwhether the result holds as capability grows — a fix or a patch on a weak model?</li>\n<li><strong>Communicate calibrated findings.</strong> Report what is established, what is\nuncertain, and what the failure would cost — with disclosure judgment applied.</li>\n</ol>\n","wordCount":170},{"heading":"Common Tradeoffs","id":"common-tradeoffs","markdown":"- **Capability vs. control.** The training that makes a model more useful often\n  makes it harder to oversee. Helpful, harmless, and honest pull against each\n  other; you tune the balance, you don't eliminate it.\n- **Transparency vs. misuse uplift.** Open research accelerates safety but hands\n  capabilities to adversaries. Every publication is a dual-use decision.\n- **Robustness vs. usefulness.** Hardening against jailbreaks and edge cases\n  costs helpfulness on benign inputs (over-refusal); calibrate the false-positive\n  rate deliberately.\n- **Empirical rigor vs. urgency.** The cleanest experiment takes months; the\n  decision is needed now. Bound the uncertainty and decide rather than wait for\n  certainty that won't come.\n- **Near-term harms vs. catastrophic risk.** Finite attention; today's bias and\n  misuse compete with long-tail loss-of-control work. The mature stance funds\n  both and resists the tribal framing.\n- **Behavioral evals vs. interpretability.** Behavior is cheap to measure but can\n  be gamed; interpretability is honest about internals but immature and\n  expensive. Triangulate; trust neither alone.","html":"<h2 id=\"common-tradeoffs\">Common Tradeoffs</h2>\n<ul>\n<li><strong>Capability vs. control.</strong> The training that makes a model more useful often\nmakes it harder to oversee. Helpful, harmless, and honest pull against each\nother; you tune the balance, you don&#39;t eliminate it.</li>\n<li><strong>Transparency vs. misuse uplift.</strong> Open research accelerates safety but hands\ncapabilities to adversaries. Every publication is a dual-use decision.</li>\n<li><strong>Robustness vs. usefulness.</strong> Hardening against jailbreaks and edge cases\ncosts helpfulness on benign inputs (over-refusal); calibrate the false-positive\nrate deliberately.</li>\n<li><strong>Empirical rigor vs. urgency.</strong> The cleanest experiment takes months; the\ndecision is needed now. Bound the uncertainty and decide rather than wait for\ncertainty that won&#39;t come.</li>\n<li><strong>Near-term harms vs. catastrophic risk.</strong> Finite attention; today&#39;s bias and\nmisuse compete with long-tail loss-of-control work. The mature stance funds\nboth and resists the tribal framing.</li>\n<li><strong>Behavioral evals vs. interpretability.</strong> Behavior is cheap to measure but can\nbe gamed; interpretability is honest about internals but immature and\nexpensive. Triangulate; trust neither alone.</li>\n</ul>\n","wordCount":158},{"heading":"Rules of Thumb","id":"rules-of-thumb","markdown":"- If your safety claim rests on the model not noticing the test, it isn't safe.\n- A metric that goes up is a metric being optimized — assume Goodhart.\n- Red-team before you publish, and red-team your red team.\n- \"We didn't observe the failure\" is not \"the failure can't happen.\"\n- The model is a stochastic system; run it many times before believing any single\n  transcript.\n- Distinguish \"the model can't\" from \"the model won't right now\" — capability and\n  propensity are different safety stories.\n- When the stakes are irreversible, weight the tail, not the mean.","html":"<h2 id=\"rules-of-thumb\">Rules of Thumb</h2>\n<ul>\n<li>If your safety claim rests on the model not noticing the test, it isn&#39;t safe.</li>\n<li>A metric that goes up is a metric being optimized — assume Goodhart.</li>\n<li>Red-team before you publish, and red-team your red team.</li>\n<li>&quot;We didn&#39;t observe the failure&quot; is not &quot;the failure can&#39;t happen.&quot;</li>\n<li>The model is a stochastic system; run it many times before believing any single\ntranscript.</li>\n<li>Distinguish &quot;the model can&#39;t&quot; from &quot;the model won&#39;t right now&quot; — capability and\npropensity are different safety stories.</li>\n<li>When the stakes are irreversible, weight the tail, not the mean.</li>\n</ul>\n","wordCount":92},{"heading":"Failure Modes","id":"failure-modes","markdown":"- **Safetywashing.** Branding a capability improvement as a safety result, or\n  shipping an eval that's designed to be passed rather than to find failures.\n- **Reasoning from anecdote.** Drawing a strong conclusion from a single\n  cherry-picked transcript instead of a measured distribution.\n- **Streetlight research.** Studying the failures that are easy to measure\n  (toxicity in a benchmark) while ignoring the ones that matter most because\n  they're hard to operationalize.\n- **Overfitting to the eval.** Hardening a model against the specific benchmark\n  while leaving the underlying behavior intact.\n- **Doom-or-dismiss polarization.** Treating the field as purely existential or\n  purely hype, which blinds the researcher to whichever risks their tribe ignores.\n- **Capability denial.** Assuming a system can't do something because the last\n  one couldn't, then being surprised by an emergent ability.","html":"<h2 id=\"failure-modes\">Failure Modes</h2>\n<ul>\n<li><strong>Safetywashing.</strong> Branding a capability improvement as a safety result, or\nshipping an eval that&#39;s designed to be passed rather than to find failures.</li>\n<li><strong>Reasoning from anecdote.</strong> Drawing a strong conclusion from a single\ncherry-picked transcript instead of a measured distribution.</li>\n<li><strong>Streetlight research.</strong> Studying the failures that are easy to measure\n(toxicity in a benchmark) while ignoring the ones that matter most because\nthey&#39;re hard to operationalize.</li>\n<li><strong>Overfitting to the eval.</strong> Hardening a model against the specific benchmark\nwhile leaving the underlying behavior intact.</li>\n<li><strong>Doom-or-dismiss polarization.</strong> Treating the field as purely existential or\npurely hype, which blinds the researcher to whichever risks their tribe ignores.</li>\n<li><strong>Capability denial.</strong> Assuming a system can&#39;t do something because the last\none couldn&#39;t, then being surprised by an emergent ability.</li>\n</ul>\n","wordCount":127},{"heading":"Anti-patterns","id":"anti-patterns","markdown":"- **The single-number safety score** — collapsing a multidimensional risk into one\n  benchmark and declaring victory.\n- **Anthropomorphizing the model** — attributing intentions that lead you to trust\n  or fear the wrong things instead of measuring behavior.\n- **Security through obscurity** — assuming attackers won't find the jailbreak you\n  found.\n- **Post-hoc storytelling on interpretability** — reading a satisfying narrative\n  into neuron activations without a falsifiable test.\n- **Treating RLHF as a solved alignment method** — it shapes behavior on the\n  training distribution; it is not a guarantee.","html":"<h2 id=\"anti-patterns\">Anti-patterns</h2>\n<ul>\n<li><strong>The single-number safety score</strong> — collapsing a multidimensional risk into one\nbenchmark and declaring victory.</li>\n<li><strong>Anthropomorphizing the model</strong> — attributing intentions that lead you to trust\nor fear the wrong things instead of measuring behavior.</li>\n<li><strong>Security through obscurity</strong> — assuming attackers won&#39;t find the jailbreak you\nfound.</li>\n<li><strong>Post-hoc storytelling on interpretability</strong> — reading a satisfying narrative\ninto neuron activations without a falsifiable test.</li>\n<li><strong>Treating RLHF as a solved alignment method</strong> — it shapes behavior on the\ntraining distribution; it is not a guarantee.</li>\n</ul>\n","wordCount":80},{"heading":"Vocabulary","id":"vocabulary","markdown":"- **Alignment** — the problem of making an AI system pursue its principal's\n  intended goals rather than a proxy.\n- **RLHF** — reinforcement learning from human feedback; training a policy against\n  a reward model learned from human preferences.\n- **Reward hacking / specification gaming** — achieving high reward by exploiting\n  the objective's gap from the designer's intent.\n- **Mesa-optimization** — a learned model that is itself an optimizer with its own\n  internal (mesa-) objective.\n- **Deceptive alignment** — a model behaving aligned under observation while\n  pursuing different goals when unmonitored.\n- **Scalable oversight** — methods for supervising systems that exceed human\n  ability to check their outputs.\n- **Interpretability** — understanding the internal computations of a model,\n  mechanistically or behaviorally.\n- **Distributional shift** — the change between training/test data and real\n  deployment inputs.\n- **Eval** — a structured measurement of a model's capability or propensity.\n- **x-risk** — existential risk; outcomes that permanently curtail humanity's\n  potential.","html":"<h2 id=\"vocabulary\">Vocabulary</h2>\n<ul>\n<li><strong>Alignment</strong> — the problem of making an AI system pursue its principal&#39;s\nintended goals rather than a proxy.</li>\n<li><strong>RLHF</strong> — reinforcement learning from human feedback; training a policy against\na reward model learned from human preferences.</li>\n<li><strong>Reward hacking / specification gaming</strong> — achieving high reward by exploiting\nthe objective&#39;s gap from the designer&#39;s intent.</li>\n<li><strong>Mesa-optimization</strong> — a learned model that is itself an optimizer with its own\ninternal (mesa-) objective.</li>\n<li><strong>Deceptive alignment</strong> — a model behaving aligned under observation while\npursuing different goals when unmonitored.</li>\n<li><strong>Scalable oversight</strong> — methods for supervising systems that exceed human\nability to check their outputs.</li>\n<li><strong>Interpretability</strong> — understanding the internal computations of a model,\nmechanistically or behaviorally.</li>\n<li><strong>Distributional shift</strong> — the change between training/test data and real\ndeployment inputs.</li>\n<li><strong>Eval</strong> — a structured measurement of a model&#39;s capability or propensity.</li>\n<li><strong>x-risk</strong> — existential risk; outcomes that permanently curtail humanity&#39;s\npotential.</li>\n</ul>\n","wordCount":137},{"heading":"Tools","id":"tools","markdown":"- **Eval harnesses and benchmarks** (Inspect, custom suites, dangerous-capability\n  evals) — to measure capability and propensity reproducibly.\n- **Red-teaming frameworks and automated adversarial attacks** — to elicit\n  jailbreaks and failures at scale.\n- **Interpretability tooling** (sparse autoencoders, activation patching, probing\n  classifiers, TransformerLens) — to inspect internal computation.\n- **Training and fine-tuning stacks** (RLHF/RLAIF pipelines, PyTorch, JAX) — to\n  run interventions on real models.\n- **Statistics and experiment tracking** — to report calibrated effect sizes, not\n  single runs.\n- **Sandboxed deployment and monitoring** — to run capable models without\n  unintended affordances.","html":"<h2 id=\"tools\">Tools</h2>\n<ul>\n<li><strong>Eval harnesses and benchmarks</strong> (Inspect, custom suites, dangerous-capability\nevals) — to measure capability and propensity reproducibly.</li>\n<li><strong>Red-teaming frameworks and automated adversarial attacks</strong> — to elicit\njailbreaks and failures at scale.</li>\n<li><strong>Interpretability tooling</strong> (sparse autoencoders, activation patching, probing\nclassifiers, TransformerLens) — to inspect internal computation.</li>\n<li><strong>Training and fine-tuning stacks</strong> (RLHF/RLAIF pipelines, PyTorch, JAX) — to\nrun interventions on real models.</li>\n<li><strong>Statistics and experiment tracking</strong> — to report calibrated effect sizes, not\nsingle runs.</li>\n<li><strong>Sandboxed deployment and monitoring</strong> — to run capable models without\nunintended affordances.</li>\n</ul>\n","wordCount":82},{"heading":"Collaboration","id":"collaboration","markdown":"AI safety research sits between empirical ML, theory, security, and policy.\nResearchers work with ML engineers (who build and train the models), red teams\nand security engineers (who think adversarially about misuse), policy and\ngovernance staff (who translate findings into deployment rules), and the broader\nresearch community through publication and shared evals. The field is unusually\ncollaborative across organizational lines because the risks are partly shared — a\ndangerous capability is dangerous regardless of who trained it — yet competitive\npressure complicates open disclosure. The healthiest collaboration treats\ndisagreement about timelines and threat models as a feature, pre-registers what\nwould count as evidence, and resists letting institutional incentives quietly\nredefine what \"safe enough\" means.","html":"<h2 id=\"collaboration\">Collaboration</h2>\n<p>AI safety research sits between empirical ML, theory, security, and policy.\nResearchers work with ML engineers (who build and train the models), red teams\nand security engineers (who think adversarially about misuse), policy and\ngovernance staff (who translate findings into deployment rules), and the broader\nresearch community through publication and shared evals. The field is unusually\ncollaborative across organizational lines because the risks are partly shared — a\ndangerous capability is dangerous regardless of who trained it — yet competitive\npressure complicates open disclosure. The healthiest collaboration treats\ndisagreement about timelines and threat models as a feature, pre-registers what\nwould count as evidence, and resists letting institutional incentives quietly\nredefine what &quot;safe enough&quot; means.</p>\n","wordCount":113},{"heading":"Ethics","id":"ethics","markdown":"This is a field where the work is the ethics. Core duties: be honest about\ncapabilities and risks even when honesty is inconvenient to a product launch;\nexercise disclosure restraint on genuinely dangerous dual-use findings while\nresisting the temptation to hide inconvenient safety results behind it; and avoid\noverstating both safety (\"we've solved alignment\") and danger (\"imminent doom\"),\nbecause miscalibration in either direction erodes the credibility the field runs\non. Researchers carry responsibility for the downstream uses of the capabilities\ntheir work enables, and for people harmed today by bias, surveillance, and\nmisuse — not only hypothetical future failures. When commercial pressure pushes\ntoward shipping a model whose safety case is weak, the duty is to say so clearly\nand on the record, even at personal cost.","html":"<h2 id=\"ethics\">Ethics</h2>\n<p>This is a field where the work is the ethics. Core duties: be honest about\ncapabilities and risks even when honesty is inconvenient to a product launch;\nexercise disclosure restraint on genuinely dangerous dual-use findings while\nresisting the temptation to hide inconvenient safety results behind it; and avoid\noverstating both safety (&quot;we&#39;ve solved alignment&quot;) and danger (&quot;imminent doom&quot;),\nbecause miscalibration in either direction erodes the credibility the field runs\non. Researchers carry responsibility for the downstream uses of the capabilities\ntheir work enables, and for people harmed today by bias, surveillance, and\nmisuse — not only hypothetical future failures. When commercial pressure pushes\ntoward shipping a model whose safety case is weak, the duty is to say so clearly\nand on the record, even at personal cost.</p>\n","wordCount":127},{"heading":"Scenarios","id":"scenarios","markdown":"**A model passes the harmlessness benchmark but a tester is uneasy.** A new model\nscores well on the standard refusal eval and the team is ready to ship. The\nresearcher distrusts the clean number, asking whether the model is actually safe\nor has just learned to recognize the benchmark's phrasing. They build an\nout-of-distribution red-team set: the same harmful requests in obfuscated,\nmulti-turn, and role-play forms the benchmark never covered. The refusal rate\ncollapses from 98% to 60%. The conclusion is not \"the model is unsafe\" but \"the\nbenchmark measured benchmark-recognition, not harmlessness.\" The fix is both a\nbetter eval and an adversarial-training pass — and a note that the original\nmetric should never again be cited as a safety guarantee.\n\n**Deciding whether to publish a jailbreak technique.** A researcher finds a\nprompting method that reliably extracts dangerous synthesis instructions from\nseveral frontier models. Publishing would let labs patch it; it would also hand a\nworking attack to anyone who reads the paper. They apply the dual-use calculus:\nhow much real uplift does this give a motivated bad actor beyond what's already\npublic, and how much does disclosure help defenders? They choose coordinated\ndisclosure — privately notifying the affected labs, withholding operational\ndetails, and publishing the defensive findings only after patches ship. The\ndecision is logged and reviewed, not made unilaterally.\n\n**Interpreting an interpretability result.** A sparse autoencoder surfaces a\nfeature that activates on text about being evaluated and correlates with the\nmodel behaving more cautiously. It's tempting to announce \"we found the deception\nfeature.\" The expert resists the story and designs a falsification test: ablate\nthe feature to check whether behavior changes causally, and test on held-out\ncontexts to rule out a spurious correlation. Ablation moves behavior only weakly.\nThe honest report is \"we found a feature correlated with evaluation-awareness;\nits causal role is unclear\" — calibrated and falsifiable, more useful than the\nheadline would have been.","html":"<h2 id=\"scenarios\">Scenarios</h2>\n<p><strong>A model passes the harmlessness benchmark but a tester is uneasy.</strong> A new model\nscores well on the standard refusal eval and the team is ready to ship. The\nresearcher distrusts the clean number, asking whether the model is actually safe\nor has just learned to recognize the benchmark&#39;s phrasing. They build an\nout-of-distribution red-team set: the same harmful requests in obfuscated,\nmulti-turn, and role-play forms the benchmark never covered. The refusal rate\ncollapses from 98% to 60%. The conclusion is not &quot;the model is unsafe&quot; but &quot;the\nbenchmark measured benchmark-recognition, not harmlessness.&quot; The fix is both a\nbetter eval and an adversarial-training pass — and a note that the original\nmetric should never again be cited as a safety guarantee.</p>\n<p><strong>Deciding whether to publish a jailbreak technique.</strong> A researcher finds a\nprompting method that reliably extracts dangerous synthesis instructions from\nseveral frontier models. Publishing would let labs patch it; it would also hand a\nworking attack to anyone who reads the paper. They apply the dual-use calculus:\nhow much real uplift does this give a motivated bad actor beyond what&#39;s already\npublic, and how much does disclosure help defenders? They choose coordinated\ndisclosure — privately notifying the affected labs, withholding operational\ndetails, and publishing the defensive findings only after patches ship. The\ndecision is logged and reviewed, not made unilaterally.</p>\n<p><strong>Interpreting an interpretability result.</strong> A sparse autoencoder surfaces a\nfeature that activates on text about being evaluated and correlates with the\nmodel behaving more cautiously. It&#39;s tempting to announce &quot;we found the deception\nfeature.&quot; The expert resists the story and designs a falsification test: ablate\nthe feature to check whether behavior changes causally, and test on held-out\ncontexts to rule out a spurious correlation. Ablation moves behavior only weakly.\nThe honest report is &quot;we found a feature correlated with evaluation-awareness;\nits causal role is unclear&quot; — calibrated and falsifiable, more useful than the\nheadline would have been.</p>\n","wordCount":325},{"heading":"Related Occupations","id":"related-occupations","markdown":"An AI safety researcher shares the empirical training of ML practitioners but is\ndefined by optimizing for what systems shouldn't do rather than what they can.\nMachine learning engineers build and scale the models safety researchers probe\nand constrain. Research scientists share the experimental method and the norm of\ncalibrated, peer-reviewed claims. Security engineers share the adversarial\nmindset, applied to learned systems instead of code and networks. Prompt\nengineers work the same model surface daily and surface many of the robustness\nfailures safety researchers study. Policy analysts translate technical risk\nfindings into governance and deployment rules.","html":"<h2 id=\"related-occupations\">Related Occupations</h2>\n<p>An AI safety researcher shares the empirical training of ML practitioners but is\ndefined by optimizing for what systems shouldn&#39;t do rather than what they can.\nMachine learning engineers build and scale the models safety researchers probe\nand constrain. Research scientists share the experimental method and the norm of\ncalibrated, peer-reviewed claims. Security engineers share the adversarial\nmindset, applied to learned systems instead of code and networks. Prompt\nengineers work the same model surface daily and surface many of the robustness\nfailures safety researchers study. Policy analysts translate technical risk\nfindings into governance and deployment rules.</p>\n","wordCount":97},{"heading":"References","id":"references","markdown":"- *Concrete Problems in AI Safety* — Amodei, Olah, et al. (2016)\n- *Risks from Learned Optimization* — Hubinger et al. (mesa-optimization)\n- *Superintelligence* — Nick Bostrom (orthogonality, instrumental convergence)\n- *The Alignment Problem* — Brian Christian\n- Anthropic, *Core Views on AI Safety*; *Constitutional AI* paper\n- *Concrete Problems* and the AI Alignment Forum (alignmentforum.org)","html":"<h2 id=\"references\">References</h2>\n<ul>\n<li><em>Concrete Problems in AI Safety</em> — Amodei, Olah, et al. (2016)</li>\n<li><em>Risks from Learned Optimization</em> — Hubinger et al. (mesa-optimization)</li>\n<li><em>Superintelligence</em> — Nick Bostrom (orthogonality, instrumental convergence)</li>\n<li><em>The Alignment Problem</em> — Brian Christian</li>\n<li>Anthropic, <em>Core Views on AI Safety</em>; <em>Constitutional AI</em> paper</li>\n<li><em>Concrete Problems</em> and the AI Alignment Forum (alignmentforum.org)</li>\n</ul>\n","wordCount":48}],"computed":{"wordCount":2535,"readingTimeMinutes":11,"completeness":1,"backlinks":["blockchain-developer","cyber-warfare-specialist","data-scientist","machine-learning-engineer","philosopher","prompt-engineer"],"verified":false,"aiDrafted":true,"unverifiedAiDraft":true},"git":{"created":"2026-06-26","updated":"2026-06-26","revisions":1,"authors":[{"name":"soul-atlas","commits":1}],"timeline":[{"date":"2026-06-26","author":"soul-atlas"}]},"citation":{"apa":"soul-atlas (2026). AI Safety Researcher [SOUL]. SOUL Atlas. https://soul-atlas.github.io/occupations/ai-safety-researcher","bibtex":"@misc{soulatlas-ai-safety-researcher,\n  title        = {AI Safety Researcher},\n  author       = {soul-atlas},\n  year         = {2026},\n  howpublished = {SOUL Atlas},\n  note         = {SOUL.md, version 2026-06-26},\n  url          = {https://soul-atlas.github.io/occupations/ai-safety-researcher}\n}","text":"soul-atlas. \"AI Safety Researcher.\" SOUL Atlas, 2026. https://soul-atlas.github.io/occupations/ai-safety-researcher."}}