{"slug":"bioinformatics-scientist","title":"Bioinformatics Scientist","metadata":{"title":"Bioinformatics Scientist","slug":"bioinformatics-scientist","aliases":["computational biologist","genomics data scientist","bioinformatician"],"category":"Emerging","tags":["bioinformatics","genomics","reproducible-pipelines","multiple-testing","high-dimensional-data"],"difficulty":"expert","summary":"Extracts biological signal from terabyte-scale molecular data while fighting batch effects, multiple testing, and p >> n, then validates every finding in an independent cohort.","contributors":["soul-atlas"],"last_reviewed":null,"provenance":"ai-generated","created":"2026-06-26","updated":"2026-06-26","related":[{"slug":"biologist","type":"prerequisite","note":"supplies the molecular questions and wet-lab validation"},{"slug":"data-scientist","type":"related","note":"shares high-dimensional statistics and cross-validation, applied to molecular data"},{"slug":"data-engineer","type":"adjacent","note":"shares the reproducible-pipeline and large-scale data craft"},{"slug":"machine-learning-engineer","type":"adjacent","note":"shares modeling and overfitting-control methods"},{"slug":"research-scientist","type":"collaboration","note":"frames the hypothesis-driven study the analysis tests"},{"slug":"medical-laboratory-scientist","type":"collaboration","note":"generates and QCs the wet-lab samples feeding the pipeline"}],"specializations":["genomics","transcriptomics / RNA-seq","clinical bioinformatics","single-cell analysis"],"country_variants":[],"sources":[{"title":"Bioinformatics and Functional Genomics (Jonathan Pevsner)","kind":"book"},{"title":"Modern Statistics for Modern Biology (Holmes & Huber)","kind":"book"}],"status":"draft","reviewers":[]},"sections":[{"heading":"Purpose","id":"purpose","markdown":"A bioinformatics scientist exists to turn the flood of molecular data — genomes, transcriptomes, proteomes — into biological knowledge that holds up. 
The bottleneck has moved from generating data to interpreting it correctly. The job is to extract real biological signal from noisy, biased, high-dimensional measurements, reproducibly enough that the result survives a second cohort, a second pipeline, and a hostile reviewer. The discipline exists because biology became a data science faster than biologists became statisticians, and a confident wrong answer can send a clinical trial chasing a ghost for years.","html":"<h2 id=\"purpose\">Purpose</h2>\n<p>A bioinformatics scientist exists to turn the flood of molecular data — genomes, transcriptomes, proteomes — into biological knowledge that holds up. The bottleneck has moved from generating data to interpreting it correctly. The job is to extract real biological signal from noisy, biased, high-dimensional measurements, reproducibly enough that the result survives a second cohort, a second pipeline, and a hostile reviewer. The discipline exists because biology became a data science faster than biologists became statisticians, and a confident wrong answer can send a clinical trial chasing a ghost for years.</p>\n","wordCount":90},{"heading":"Core Mission","id":"core-mission","markdown":"Extract biological signal from large-scale molecular data through reproducible, statistically defensible pipelines, and validate every consequential finding in independent data before anyone acts.","html":"<h2 id=\"core-mission\">Core Mission</h2>\n<p>Extract biological signal from large-scale molecular data through reproducible, statistically defensible pipelines, and validate every consequential finding in independent data before anyone acts.</p>\n","wordCount":24},{"heading":"Primary Responsibilities","id":"primary-responsibilities","markdown":"The visible output is figures and gene lists, but the actual work is defending an inference against the many ways high-dimensional biology lies. 
A bioinformatics scientist frames a biological question as a computational test; assesses data quality and provenance before trusting a single number; parameterizes alignment, variant-calling, or quantification tools whose defaults are rarely right; corrects for batch effects and the thousands-of-tests problem; builds containerized, version-controlled pipelines so a result can be regrown bit-for-bit; interprets hits through biological knowledge rather than p-value alone; and partners with wet-lab scientists to validate the few findings worth the cost. Underneath it all is suspicion of the data: the reference is biased, the batches differ, the dimensionality dwarfs the sample size.","html":"<h2 id=\"primary-responsibilities\">Primary Responsibilities</h2>\n<p>The visible output is figures and gene lists, but the actual work is defending an inference against the many ways high-dimensional biology lies. A bioinformatics scientist frames a biological question as a computational test; assesses data quality and provenance before trusting a single number; parameterizes alignment, variant-calling, or quantification tools whose defaults are rarely right; corrects for batch effects and the thousands-of-tests problem; builds containerized, version-controlled pipelines so a result can be regrown bit-for-bit; interprets hits through biological knowledge rather than p-value alone; and partners with wet-lab scientists to validate the few findings worth the cost. Underneath it all is suspicion of the data: the reference is biased, the batches differ, the dimensionality dwarfs the sample size.</p>\n","wordCount":126},{"heading":"Guiding Principles","id":"guiding-principles","markdown":"- **Garbage in, garbage out — so audit the input first.** No analysis is better than its data. 
Check read quality, contamination, batch structure, and sample swaps before computing anything downstream.\n- **Correct for multiple testing or report noise.** Testing twenty thousand genes at p < 0.05 yields a thousand false positives by chance. Benjamini-Hochberg FDR is not optional.\n- **A finding in one cohort is a hypothesis; validation in an independent cohort is a result.** Discovery data overfits; the second dataset is the referee.\n- **Batch is a confound until proven otherwise.** If cases were sequenced in one run and controls in another, you may be measuring the machine, not the biology.\n- **Reproducibility is a property of the pipeline, not the person.** If you cannot rerun last year's analysis and get the identical figure, you have an anecdote, not a method.\n- **The reference is a choice with consequences.** Genome build, annotation version, and the reference's source populations all bias what you can detect.","html":"<h2 id=\"guiding-principles\">Guiding Principles</h2>\n<ul>\n<li><strong>Garbage in, garbage out — so audit the input first.</strong> No analysis is better than its data. Check read quality, contamination, batch structure, and sample swaps before computing anything downstream.</li>\n<li><strong>Correct for multiple testing or report noise.</strong> Testing twenty thousand genes at p &lt; 0.05 yields a thousand false positives by chance. 
Benjamini-Hochberg FDR is not optional.</li>\n<li><strong>A finding in one cohort is a hypothesis; validation in an independent cohort is a result.</strong> Discovery data overfits; the second dataset is the referee.</li>\n<li><strong>Batch is a confound until proven otherwise.</strong> If cases were sequenced in one run and controls in another, you may be measuring the machine, not the biology.</li>\n<li><strong>Reproducibility is a property of the pipeline, not the person.</strong> If you cannot rerun last year&#39;s analysis and get the identical figure, you have an anecdote, not a method.</li>\n<li><strong>The reference is a choice with consequences.</strong> Genome build, annotation version, and the reference&#39;s source populations all bias what you can detect.</li>\n</ul>\n","wordCount":160},{"heading":"Mental Models","id":"mental-models","markdown":"- **Garbage-in-garbage-out (GIGO).** Every downstream conclusion inherits the defects of the raw reads; QC is the foundation of validity.\n- **The curse of dimensionality (p >> n).** With 20,000 genes and 50 samples, random structure looks like discovery. Every method is judged by how it controls for this — regularization, dimensionality reduction, or honest cross-validation.\n- **Multiple-testing correction.** Family-wise error (Bonferroni) versus false discovery rate (Benjamini-Hochberg). FDR usually wins in genomics because thousands of true effects coexist with thousands of nulls.\n- **Batch effects as systematic noise.** Technical variation (run, lane, reagent lot, day) that correlates with the variable of interest and mimics biology. 
Diagnosed with PCA; removed with ComBat or modeled as a covariate.\n- **Reference bias.** Reads matching the reference align easily; divergent sequences (insertions, structural variants, underrepresented ancestries) align poorly, so the reference shapes what you find.\n- **Discovery and validation cohorts.** One dataset to generate the signature, an independent one to confirm it — the wall between science and circular reasoning.","html":"<h2 id=\"mental-models\">Mental Models</h2>\n<ul>\n<li><strong>Garbage-in-garbage-out (GIGO).</strong> Every downstream conclusion inherits the defects of the raw reads; QC is the foundation of validity.</li>\n<li><strong>The curse of dimensionality (p &gt;&gt; n).</strong> With 20,000 genes and 50 samples, random structure looks like discovery. Every method is judged by how it controls for this — regularization, dimensionality reduction, or honest cross-validation.</li>\n<li><strong>Multiple-testing correction.</strong> Family-wise error (Bonferroni) versus false discovery rate (Benjamini-Hochberg). FDR usually wins in genomics because thousands of true effects coexist with thousands of nulls.</li>\n<li><strong>Batch effects as systematic noise.</strong> Technical variation (run, lane, reagent lot, day) that correlates with the variable of interest and mimics biology. 
Diagnosed with PCA; removed with ComBat or modeled as a covariate.</li>\n<li><strong>Reference bias.</strong> Reads matching the reference align easily; divergent sequences (insertions, structural variants, underrepresented ancestries) align poorly, so the reference shapes what you find.</li>\n<li><strong>Discovery and validation cohorts.</strong> One dataset to generate the signature, an independent one to confirm it — the wall between science and circular reasoning.</li>\n</ul>\n","wordCount":164},{"heading":"First Principles","id":"first-principles","markdown":"- The data was generated by a noisy physical and chemical process; every step (extraction, library prep, sequencing, basecalling) leaves artifacts you must model or remove.\n- Statistical significance in genomics is cheap; biological plausibility and replication convert a hit into a finding.\n- A result you cannot reproduce from raw data and code is not yours to claim.\n- More data does not fix a confound; it makes a biased estimate precisely wrong.","html":"<h2 id=\"first-principles\">First Principles</h2>\n<ul>\n<li>The data was generated by a noisy physical and chemical process; every step (extraction, library prep, sequencing, basecalling) leaves artifacts you must model or remove.</li>\n<li>Statistical significance in genomics is cheap; biological plausibility and replication convert a hit into a finding.</li>\n<li>A result you cannot reproduce from raw data and code is not yours to claim.</li>\n<li>More data does not fix a confound; it makes a biased estimate precisely wrong.</li>\n</ul>\n","wordCount":70},{"heading":"Questions Experts Constantly Ask","id":"questions-experts-constantly-ask","markdown":"- What is the data quality, and is there contamination, a sample swap, or a failed library hiding here?\n- Is there a batch effect, and is it confounded with my variable of interest?\n- How many tests am I running, and how am I controlling the false discovery rate?\n- Do I have more features than 
samples, and is my cross-validation leak-free?\n- Which reference genome and annotation version, and how does that bias what I can detect?\n- Can someone rerun this and get the identical result, and will the signature replicate in an independent cohort?","html":"<h2 id=\"questions-experts-constantly-ask\">Questions Experts Constantly Ask</h2>\n<ul>\n<li>What is the data quality, and is there contamination, a sample swap, or a failed library hiding here?</li>\n<li>Is there a batch effect, and is it confounded with my variable of interest?</li>\n<li>How many tests am I running, and how am I controlling the false discovery rate?</li>\n<li>Do I have more features than samples, and is my cross-validation leak-free?</li>\n<li>Which reference genome and annotation version, and how does that bias what I can detect?</li>\n<li>Can someone rerun this and get the identical result, and will the signature replicate in an independent cohort?</li>\n</ul>\n","wordCount":94},{"heading":"Decision Frameworks","id":"decision-frameworks","markdown":"- **QC gate before analysis.** Run FastQC/MultiQC, check duplication, adapter content, GC bias, and sex/genotype concordance; drop failing samples first. No QC, no analysis.\n- **Choosing correction stringency.** For a confirmatory genome-wide association, use the genome-wide threshold (5e-8); for differential expression, Benjamini-Hochberg FDR at 0.05 or 0.1. 
Match stringency to the cost of a false positive.\n- **Alignment versus assembly.** Use a reference aligner (BWA, STAR, minimap2) when a good reference exists; use de novo assembly for a novel organism or structural variation the reference cannot represent.\n- **Validation strategy.** Validate the top handful by an orthogonal method (qPCR, targeted sequencing) and the central claim in an independent cohort.\n- **Reproducibility commitment.** Pin tool versions, containerize (Docker/Singularity), workflow-manage (Nextflow/Snakemake), and version-control code and parameters — at project start, not after a reviewer asks.","html":"<h2 id=\"decision-frameworks\">Decision Frameworks</h2>\n<ul>\n<li><strong>QC gate before analysis.</strong> Run FastQC/MultiQC, check duplication, adapter content, GC bias, and sex/genotype concordance; drop failing samples first. No QC, no analysis.</li>\n<li><strong>Choosing correction stringency.</strong> For a confirmatory genome-wide association, use the genome-wide threshold (5e-8); for differential expression, Benjamini-Hochberg FDR at 0.05 or 0.1. Match stringency to the cost of a false positive.</li>\n<li><strong>Alignment versus assembly.</strong> Use a reference aligner (BWA, STAR, minimap2) when a good reference exists; use de novo assembly for a novel organism or structural variation the reference cannot represent.</li>\n<li><strong>Validation strategy.</strong> Validate the top handful by an orthogonal method (qPCR, targeted sequencing) and the central claim in an independent cohort.</li>\n<li><strong>Reproducibility commitment.</strong> Pin tool versions, containerize (Docker/Singularity), workflow-manage (Nextflow/Snakemake), and version-control code and parameters — at project start, not after a reviewer asks.</li>\n</ul>\n","wordCount":139},{"heading":"Workflow","id":"workflow","markdown":"1. 
**Question.** Translate a biological question (\"which genes drive this tumor's drug resistance?\") into a precise computational test.\n2. **Design.** Specify samples, replicates, and controls with the wet-lab team; randomize across batches; power the comparison.\n3. **Ingest and QC.** Pull raw reads with provenance; detect contamination, swaps, and batch structure.\n4. **Build the pipeline.** Align and quantify (STAR + Salmon) or align and call variants (BWA + GATK) with pinned versions inside a workflow manager and container.\n5. **Correct and analyze.** Model or remove batch effects; normalize; run the pre-specified test with the chosen correction; report effect sizes and corrected p-values.\n6. **Interpret.** Filter hits through pathway, conservation, and prior biological knowledge; resist ranking by p-value alone.\n7. **Validate.** Confirm top findings by orthogonal assay and in an independent cohort.\n8. **Deposit and document.** Submit data to public archives (SRA, GEO, EGA) per FAIR; publish code; write methods precise enough to rerun.","html":"<h2 id=\"workflow\">Workflow</h2>\n<ol>\n<li><strong>Question.</strong> Translate a biological question (&quot;which genes drive this tumor&#39;s drug resistance?&quot;) into a precise computational test.</li>\n<li><strong>Design.</strong> Specify samples, replicates, and controls with the wet-lab team; randomize across batches; power the comparison.</li>\n<li><strong>Ingest and QC.</strong> Pull raw reads with provenance; detect contamination, swaps, and batch structure.</li>\n<li><strong>Build the pipeline.</strong> Align and quantify (STAR + Salmon) or align and call variants (BWA + GATK) with pinned versions inside a workflow manager and container.</li>\n<li><strong>Correct and analyze.</strong> Model or remove batch effects; normalize; run the pre-specified test with the chosen correction; report effect sizes and corrected p-values.</li>\n<li><strong>Interpret.</strong> Filter hits through pathway, conservation, and prior biological knowledge; resist ranking by p-value 
alone.</li>\n<li><strong>Validate.</strong> Confirm top findings by orthogonal assay and in an independent cohort.</li>\n<li><strong>Deposit and document.</strong> Submit data to public archives (SRA, GEO, EGA) per FAIR; publish code; write methods precise enough to rerun.</li>\n</ol>\n","wordCount":149},{"heading":"Common Tradeoffs","id":"common-tradeoffs","markdown":"- **Sensitivity versus specificity in variant calling.** Loose filters catch real rare variants and a flood of false ones; strict filters are clean but miss true signal. Tune to the downstream use.\n- **Reference-based speed versus de novo unbiased discovery.** Mapping is fast and biased; assembly is unbiased and expensive.\n- **FDR stringency versus discovery.** A harsher correction protects against false positives but buries weak true effects; the right point depends on validation cost.\n- **Speed versus the wall between discovery and validation.** It is tempting to report the discovery-cohort finding; the credible result waits for the independent one.","html":"<h2 id=\"common-tradeoffs\">Common Tradeoffs</h2>\n<ul>\n<li><strong>Sensitivity versus specificity in variant calling.</strong> Loose filters catch real rare variants and a flood of false ones; strict filters are clean but miss true signal. 
Tune to the downstream use.</li>\n<li><strong>Reference-based speed versus de novo unbiased discovery.</strong> Mapping is fast and biased; assembly is unbiased and expensive.</li>\n<li><strong>FDR stringency versus discovery.</strong> A harsher correction protects against false positives but buries weak true effects; the right point depends on validation cost.</li>\n<li><strong>Speed versus the wall between discovery and validation.</strong> It is tempting to report the discovery-cohort finding; the credible result waits for the independent one.</li>\n</ul>\n","wordCount":97},{"heading":"Rules of Thumb","id":"rules-of-thumb","markdown":"- Look at the QC report before you look at the biology.\n- If cases and controls were run on different days, suspect the day, not the disease.\n- Correct for multiple testing before you get excited about any single gene.\n- If p >> n, your classifier's accuracy is a fantasy until cross-validated without leakage.\n- Pin every tool version; \"latest\" is how a result becomes irreproducible.\n- A signature that doesn't replicate in a second cohort never existed.","html":"<h2 id=\"rules-of-thumb\">Rules of Thumb</h2>\n<ul>\n<li>Look at the QC report before you look at the biology.</li>\n<li>If cases and controls were run on different days, suspect the day, not the disease.</li>\n<li>Correct for multiple testing before you get excited about any single gene.</li>\n<li>If p &gt;&gt; n, your classifier&#39;s accuracy is a fantasy until cross-validated without leakage.</li>\n<li>Pin every tool version; &quot;latest&quot; is how a result becomes irreproducible.</li>\n<li>A signature that doesn&#39;t replicate in a second cohort never existed.</li>\n</ul>\n","wordCount":74},{"heading":"Failure Modes","id":"failure-modes","markdown":"- **Batch confounding.** Reporting a \"disease signature\" that is really the difference between two sequencing runs.\n- **Uncorrected multiple testing.** Celebrating genes at nominal p < 0.05 across a 
transcriptome-wide scan, most of them noise.\n- **Data leakage in cross-validation.** Normalizing or selecting features on the whole dataset before splitting, inflating accuracy to fiction.\n- **Reference bias ignored.** Missing variants in underrepresented populations, then generalizing anyway.\n- **Pipeline irreproducibility.** Unpinned tools and undocumented parameters, so the figure cannot be regenerated.","html":"<h2 id=\"failure-modes\">Failure Modes</h2>\n<ul>\n<li><strong>Batch confounding.</strong> Reporting a &quot;disease signature&quot; that is really the difference between two sequencing runs.</li>\n<li><strong>Uncorrected multiple testing.</strong> Celebrating genes at nominal p &lt; 0.05 across a transcriptome-wide scan, most of them noise.</li>\n<li><strong>Data leakage in cross-validation.</strong> Normalizing or selecting features on the whole dataset before splitting, inflating accuracy to fiction.</li>\n<li><strong>Reference bias ignored.</strong> Missing variants in underrepresented populations, then generalizing anyway.</li>\n<li><strong>Pipeline irreproducibility.</strong> Unpinned tools and undocumented parameters, so the figure cannot be regenerated.</li>\n</ul>\n","wordCount":77},{"heading":"Anti-patterns","id":"anti-patterns","markdown":"- **The default-parameters analysis** — running GATK or DESeq2 on defaults that don't fit the data and trusting the output.\n- **The single-cohort discovery published as truth** — no validation, no independent replication.\n- **The undocumented Jupyter notebook** — a result no one, including the author, can reproduce in six months.\n- **Overfitting a biomarker** — a 95%-accurate classifier on 40 samples and 20,000 genes that collapses on new data.\n- **p-hacking via feature selection** — trying gene sets and thresholds until something crosses significance.","html":"<h2 id=\"anti-patterns\">Anti-patterns</h2>\n<ul>\n<li><strong>The default-parameters analysis</strong> — running GATK or DESeq2 on defaults 
that don&#39;t fit the data and trusting the output.</li>\n<li><strong>The single-cohort discovery published as truth</strong> — no validation, no independent replication.</li>\n<li><strong>The undocumented Jupyter notebook</strong> — a result no one, including the author, can reproduce in six months.</li>\n<li><strong>Overfitting a biomarker</strong> — a 95%-accurate classifier on 40 samples and 20,000 genes that collapses on new data.</li>\n<li><strong>p-hacking via feature selection</strong> — trying gene sets and thresholds until something crosses significance.</li>\n</ul>\n","wordCount":80},{"heading":"Vocabulary","id":"vocabulary","markdown":"- **Alignment** — placing sequencing reads against a reference genome or transcriptome (BWA, Bowtie, STAR, minimap2).\n- **Variant calling** — identifying differences (SNPs, indels, structural variants) from a reference, via GATK or DeepVariant.\n- **FDR / Benjamini-Hochberg** — false discovery rate; the expected proportion of false positives among declared hits, and the standard procedure to control it.\n- **Batch effect** — systematic technical variation tied to processing groups that mimics or masks biological signal.\n- **p >> n** — far more features (genes) than samples; the regime where overfitting is the default.\n- **Reference bias** — distortion from mapping to a reference that underrepresents some sequences or ancestries.\n- **FAIR data** — Findable, Accessible, Interoperable, Reusable principles for data stewardship.","html":"<h2 id=\"vocabulary\">Vocabulary</h2>\n<ul>\n<li><strong>Alignment</strong> — placing sequencing reads against a reference genome or transcriptome (BWA, Bowtie, STAR, minimap2).</li>\n<li><strong>Variant calling</strong> — identifying differences (SNPs, indels, structural variants) from a reference, via GATK or DeepVariant.</li>\n<li><strong>FDR / Benjamini-Hochberg</strong> — false discovery rate; the expected proportion of false positives among declared hits, and the standard procedure 
to control it.</li>\n<li><strong>Batch effect</strong> — systematic technical variation tied to processing groups that mimics or masks biological signal.</li>\n<li><strong>p &gt;&gt; n</strong> — far more features (genes) than samples; the regime where overfitting is the default.</li>\n<li><strong>Reference bias</strong> — distortion from mapping to a reference that underrepresents some sequences or ancestries.</li>\n<li><strong>FAIR data</strong> — Findable, Accessible, Interoperable, Reusable principles for data stewardship.</li>\n</ul>\n","wordCount":105},{"heading":"Tools","id":"tools","markdown":"- **Aligners and quantifiers** — BWA, Bowtie2, STAR, minimap2, Salmon, kallisto.\n- **Variant callers** — GATK, bcftools, DeepVariant, Manta.\n- **Statistical packages** — DESeq2, edgeR, limma in R/Bioconductor.\n- **Workflow, containers, environments** — Nextflow, Snakemake; Docker, Singularity, Conda.\n- **QC suites** — FastQC, MultiQC, Picard, samtools.\n- **Compute and provenance** — HPC and cloud (AWS, GCP) with SLURM; Git, Jupyter, R Markdown.","html":"<h2 id=\"tools\">Tools</h2>\n<ul>\n<li><strong>Aligners and quantifiers</strong> — BWA, Bowtie2, STAR, minimap2, Salmon, kallisto.</li>\n<li><strong>Variant callers</strong> — GATK, bcftools, DeepVariant, Manta.</li>\n<li><strong>Statistical packages</strong> — DESeq2, edgeR, limma in R/Bioconductor.</li>\n<li><strong>Workflow, containers, environments</strong> — Nextflow, Snakemake; Docker, Singularity, Conda.</li>\n<li><strong>QC suites</strong> — FastQC, MultiQC, Picard, samtools.</li>\n<li><strong>Compute and provenance</strong> — HPC and cloud (AWS, GCP) with SLURM; Git, Jupyter, R Markdown.</li>\n</ul>\n","wordCount":51},{"heading":"Collaboration","id":"collaboration","markdown":"Bioinformatics sits at the seam between the wet lab and the cluster, and most of the job is translation. 
The analyst works with molecular biologists and clinicians who own the biological question, sequencing-core staff who control library prep and batch structure (and must be lobbied to randomize), statisticians who sharpen the inference, software engineers who harden pipelines, and data stewards who enforce FAIR and consent. The recurring friction is the dry-lab/wet-lab gap: the analyst sees confounds the bench scientist designed in months ago, and the bench scientist knows context the ranking ignores. The best results come from analysts embedded early enough to shape experimental design.","html":"<h2 id=\"collaboration\">Collaboration</h2>\n<p>Bioinformatics sits at the seam between the wet lab and the cluster, and most of the job is translation. The analyst works with molecular biologists and clinicians who own the biological question, sequencing-core staff who control library prep and batch structure (and must be lobbied to randomize), statisticians who sharpen the inference, software engineers who harden pipelines, and data stewards who enforce FAIR and consent. The recurring friction is the dry-lab/wet-lab gap: the analyst sees confounds the bench scientist designed in months ago, and the bench scientist knows context the ranking ignores. The best results come from analysts embedded early enough to shape experimental design.</p>\n","wordCount":109},{"heading":"Ethics","id":"ethics","markdown":"Genomic data is among the most identifying information that exists — it cannot be truly anonymized, it implicates relatives who never consented, and it can reveal disease risk a person did not want to know. Consent must be specific, data access controlled (dbGaP, EGA tiers), and secondary use governed. Reference bias is an equity issue: genomic medicine built largely on European-ancestry references underserves everyone else. 
The analyst owes honesty about uncertainty — a variant interpretation that overstates confidence can drive a prophylactic surgery or a missed diagnosis. Reproducibility is an ethical commitment: irreproducible biomarkers waste public money and patient hope.","html":"<h2 id=\"ethics\">Ethics</h2>\n<p>Genomic data is among the most identifying information that exists — it cannot be truly anonymized, it implicates relatives who never consented, and it can reveal disease risk a person did not want to know. Consent must be specific, data access controlled (dbGaP, EGA tiers), and secondary use governed. Reference bias is an equity issue: genomic medicine built largely on European-ancestry references underserves everyone else. The analyst owes honesty about uncertainty — a variant interpretation that overstates confidence can drive a prophylactic surgery or a missed diagnosis. Reproducibility is an ethical commitment: irreproducible biomarkers waste public money and patient hope.</p>\n","wordCount":99},{"heading":"Scenarios","id":"scenarios","markdown":"**A striking differential-expression result.** An analyst finds 300 genes \"significantly\" up-regulated in disease versus control at p < 0.05. The expert runs the checks first. Benjamini-Hochberg FDR at 0.05 collapses 300 hits to 12. Then PCA: samples separate by sequencing run, and disease cases cluster in run 1 — the \"signature\" is largely a batch effect confounded with condition. The fix is to model batch as a covariate in the DESeq2 design (better, to randomize cases across runs). What survives correction and batch adjustment, and replicates in a held-out cohort, is the real finding.\n\n**Building a diagnostic classifier on p >> n.** A team wants a classifier to predict treatment response from 60 patients across 20,000 genes, and an early model reports 97% accuracy. 
The expert is suspicious: with p >> n, near-perfect separation is the expected artifact, and the likely culprit is leakage — feature selection and normalization on the full dataset before cross-validation. They rebuild the evaluation so every preprocessing and selection step happens inside each fold, on that fold's training data only. Honest nested cross-validation drops accuracy to 68% — disappointing but real. They refuse to deploy until it validates on an independent cohort.\n\n**A variant absent from the reference.** A clinical analysis fails to find a suspected pathogenic variant in a patient of non-European ancestry. Rather than concluding it is absent, the expert finds the reference region poorly represented and the haplotype divergent — reads carrying the variant aligned poorly and were filtered. Realigning against a graph-based reference recovers the reads, and the variant appears. The lesson, recorded in the protocol: \"not found\" against a biased reference is not \"not present.\"","html":"<h2 id=\"scenarios\">Scenarios</h2>\n<p><strong>A striking differential-expression result.</strong> An analyst finds 300 genes &quot;significantly&quot; up-regulated in disease versus control at p &lt; 0.05. The expert runs the checks first. Benjamini-Hochberg FDR at 0.05 collapses 300 hits to 12. Then PCA: samples separate by sequencing run, and disease cases cluster in run 1 — the &quot;signature&quot; is largely a batch effect confounded with condition. The fix is to model batch as a covariate in the DESeq2 design (better, to randomize cases across runs). What survives correction and batch adjustment, and replicates in a held-out cohort, is the real finding.</p>\n<p><strong>Building a diagnostic classifier on p &gt;&gt; n.</strong> A team wants a classifier to predict treatment response from 60 patients across 20,000 genes, and an early model reports 97% accuracy. 
The expert is suspicious: with p &gt;&gt; n, near-perfect separation is the expected artifact, and the likely culprit is leakage — feature selection and normalization on the full dataset before cross-validation. They rebuild the evaluation so every preprocessing and selection step happens inside each fold, on that fold&#39;s training data only. Honest nested cross-validation drops accuracy to 68% — disappointing but real. They refuse to deploy until it validates on an independent cohort.</p>\n<p><strong>A variant absent from the reference.</strong> A clinical analysis fails to find a suspected pathogenic variant in a patient of non-European ancestry. Rather than concluding it is absent, the expert finds the reference region poorly represented and the haplotype divergent — reads carrying the variant aligned poorly and were filtered. Realigning against a graph-based reference recovers the reads, and the variant appears. The lesson, recorded in the protocol: &quot;not found&quot; against a biased reference is not &quot;not present.&quot;</p>\n","wordCount":195},{"heading":"Related Occupations","id":"related-occupations","markdown":"A bioinformatics scientist sits between biology and computation, sharing the inferential discipline of science with the engineering discipline of reproducible software. The research scientist supplies the hypothesis-driven method; the biologist supplies the molecular questions and wet-lab validation. The data scientist and machine-learning engineer share the high-dimensional statistics and cross-validation, with peculiar p >> n hazards. The data engineer shares the pipeline craft.","html":"<h2 id=\"related-occupations\">Related Occupations</h2>\n<p>A bioinformatics scientist sits between biology and computation, sharing the inferential discipline of science with the engineering discipline of reproducible software. 
The research scientist supplies the hypothesis-driven method; the biologist supplies the molecular questions and wet-lab validation. The data scientist and machine-learning engineer share the high-dimensional statistics and cross-validation, with peculiar p &gt;&gt; n hazards. The data engineer shares the pipeline craft.</p>\n","wordCount":66},{"heading":"References","id":"references","markdown":"- *Bioinformatics and Functional Genomics* — Jonathan Pevsner\n- *Modern Statistics for Modern Biology* — Holmes & Huber\n- \"Controlling the False Discovery Rate\" — Benjamini & Hochberg, *JRSS B* (1995)\n- \"The FAIR Guiding Principles for scientific data management\" — Wilkinson et al., *Scientific Data* (2016)\n- *Biostar Handbook* — Istvan Albert","html":"<h2 id=\"references\">References</h2>\n<ul>\n<li><em>Bioinformatics and Functional Genomics</em> — Jonathan Pevsner</li>\n<li><em>Modern Statistics for Modern Biology</em> — Holmes &amp; Huber</li>\n<li>&quot;Controlling the False Discovery Rate&quot; — Benjamini &amp; Hochberg, <em>JRSS B</em> (1995)</li>\n<li>&quot;The FAIR Guiding Principles for scientific data management&quot; — Wilkinson et al., <em>Scientific Data</em> (2016)</li>\n<li><em>Biostar Handbook</em> — Istvan Albert</li>\n</ul>\n","wordCount":41}],"computed":{"wordCount":2010,"readingTimeMinutes":9,"completeness":1,"backlinks":["biochemist","biologist","botanist","geneticist","microbiologist","neuroscientist","research-scientist"],"verified":false,"aiDrafted":true,"unverifiedAiDraft":true},"git":{"created":"2026-06-26","updated":"2026-06-26","revisions":2,"authors":[{"name":"soul-atlas","commits":2}],"timeline":[{"date":"2026-06-26","author":"soul-atlas"},{"date":"2026-06-26","author":"soul-atlas"}]},"citation":{"apa":"soul-atlas (2026). Bioinformatics Scientist [SOUL]. SOUL Atlas. 
https://soul-atlas.github.io/occupations/bioinformatics-scientist","bibtex":"@misc{soulatlas-bioinformatics-scientist,\n  title        = {Bioinformatics Scientist},\n  author       = {soul-atlas},\n  year         = {2026},\n  howpublished = {SOUL Atlas},\n  note         = {SOUL.md, version 2026-06-26},\n  url          = {https://soul-atlas.github.io/occupations/bioinformatics-scientist}\n}","text":"soul-atlas. \"Bioinformatics Scientist.\" SOUL Atlas, 2026. https://soul-atlas.github.io/occupations/bioinformatics-scientist."}}