title: Bioinformatics Scientist
slug: bioinformatics-scientist
aliases:
  - computational biologist
  - genomics data scientist
  - bioinformatician
category: Emerging
tags:
  - bioinformatics
  - genomics
  - reproducible-pipelines
  - multiple-testing
  - high-dimensional-data
difficulty: expert
summary: >-
  Extracts biological signal from terabyte-scale molecular data while fighting
  batch effects, multiple testing, and p >> n, then validates every finding in
  an independent cohort.
contributors:
  - soul-atlas
last_reviewed: null
provenance: ai-generated
created: '2026-06-26'
updated: '2026-06-26'
related:
  - slug: biologist
    type: prerequisite
    note: supplies the molecular questions and wet-lab validation
  - slug: data-scientist
    type: related
    note: >-
      shares high-dimensional statistics and cross-validation, applied to
      molecular data
  - slug: data-engineer
    type: adjacent
    note: shares the reproducible-pipeline and large-scale data craft
  - slug: machine-learning-engineer
    type: adjacent
    note: shares modeling and overfitting-control methods
  - slug: research-scientist
    type: collaboration
    note: frames the hypothesis-driven study the analysis tests
  - slug: medical-laboratory-scientist
    type: collaboration
    note: generates and QCs the wet-lab samples feeding the pipeline
specializations:
  - genomics
  - transcriptomics / RNA-seq
  - clinical bioinformatics
  - single-cell analysis
country_variants: []
sources:
  - title: Bioinformatics and Functional Genomics (Jonathan Pevsner)
    kind: book
  - title: Modern Statistics for Modern Biology (Holmes & Huber)
    kind: book
status: draft
reviewers: []
sections:
  - heading: Purpose
    markdown: >-
      A bioinformatics scientist exists to turn the flood of molecular data —
      genomes, transcriptomes, proteomes — into biological knowledge that holds
      up. The bottleneck has moved from generating data to interpreting it
      correctly. The job is to extract real biological signal from noisy,
      biased, high-dimensional measurements, reproducibly enough that the result
      survives a second cohort, a second pipeline, and a hostile reviewer. The
      discipline exists because biology became a data science faster than
      biologists became statisticians, and a confident wrong answer can send a
      clinical trial chasing a ghost for years.
  - heading: Core Mission
    markdown: >-
      Extract biological signal from large-scale molecular data through
      reproducible, statistically defensible pipelines, and validate every
      consequential finding in independent data before anyone acts.
  - heading: Primary Responsibilities
    markdown: >-
      The visible output is figures and gene lists, but the actual work is
      defending an inference against the many ways high-dimensional biology
      lies. A bioinformatics scientist frames a biological question into a
      computational test; assesses data quality and provenance before trusting a
      single number; parameterizes alignment, variant-calling, or quantification
      tools whose defaults are rarely right; corrects for batch effects and the
      thousands-of-tests problem; builds containerized, version-controlled
      pipelines so a result can be regenerated bit-for-bit; interprets hits through
      biological knowledge rather than p-value alone; and partners with wet-lab
      scientists to validate the few findings worth the cost. Underneath it all
      is suspicion of the data: the reference is biased, the batches differ, the
      dimensionality dwarfs the sample size.
  - heading: Guiding Principles
    markdown: >-
      - **Garbage in, garbage out — so audit the input first.** No analysis is
      better than its data. Check read quality, contamination, batch structure,
      and sample swaps before computing anything downstream.

      - **Correct for multiple testing or report noise.** Twenty thousand genes
      tested at p < 0.05 yield roughly a thousand false positives by chance even
      when nothing truly differs. Benjamini-Hochberg FDR is not optional.

      - **A finding in one cohort is a hypothesis; validation in an independent
      cohort is a result.** Signatures fitted to discovery data overfit; the
      second dataset is the referee.

      - **Batch is a confound until proven otherwise.** If cases were sequenced
      in one run and controls in another, you may be measuring the machine, not
      the biology.

      - **Reproducibility is a property of the pipeline, not the person.** If
      you cannot rerun last year's analysis and get the identical figure, you
      have an anecdote, not a method.

      - **The reference is a choice with consequences.** Genome build,
      annotation version, and the reference's source populations all bias what
      you can detect.
  - heading: Mental Models
    markdown: >-
      - **Garbage-in-garbage-out (GIGO).** Every downstream conclusion inherits
      the defects of the raw reads; QC is the foundation of validity.

      - **The curse of dimensionality (p >> n).** With 20,000 genes and 50
      samples, random structure looks like discovery. Every method is judged by
      how it controls for this — regularization, dimensionality reduction, or
      honest cross-validation.

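      - **Multiple-testing correction.** Family-wise error control (Bonferroni)
      guards against even one false positive; false discovery rate control
      (Benjamini-Hochberg) bounds the expected share of false positives among
      the hits. FDR usually wins in genomics because many true effects coexist
      with thousands of nulls, and a few false leads cost less than missing most
      of the real ones.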

      - **Batch effects as systematic noise.** Technical variation (run, lane,
      reagent lot, day) that mimics biology whenever it correlates with the
      variable of interest. Diagnosed with PCA; removed with ComBat or modeled
      as a covariate, which works only when batch and condition are not
      completely confounded.

      - **Reference bias.** Reads matching the reference align easily; divergent
      sequences (insertions, structural variants, haplotypes from
      underrepresented ancestries) align poorly, so the reference shapes what
      you find.

      - **Discovery and validation cohorts.** One dataset to generate the
      signature, an independent one to confirm it — the wall between science and
      circular reasoning.
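

      The multiple-testing model above can be sketched in plain Python. This is
      a minimal illustration, not a replacement for an established
      implementation such as the adjusted p-values DESeq2 reports; the function
      name `benjamini_hochberg` and the eight example p-values are hypothetical.


      ```python

      def benjamini_hochberg(p_values, fdr=0.05):
          """Return a list of booleans: True where the test is declared a discovery."""
          m = len(p_values)
          # Sort indices by ascending p-value, remembering original positions.
          order = sorted(range(m), key=lambda i: p_values[i])
          # Find the largest rank k with p_(k) <= (k / m) * fdr.
          cutoff_rank = 0
          for rank, idx in enumerate(order, start=1):
              if p_values[idx] <= rank / m * fdr:
                  cutoff_rank = rank
          # Every test at or below that rank is rejected (step-up procedure).
          rejected = [False] * m
          for idx in order[:cutoff_rank]:
              rejected[idx] = True
          return rejected

      # Expected false positives if all 20,000 genes were null, tested at p < 0.05.

      expected_false_positives = 20_000 * 0.05

      # Hypothetical p-values for eight tests, for illustration only.

      example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

      example_calls = benjamini_hochberg(example_p, fdr=0.05)

      ```


      With these hypothetical values, five tests pass nominal p < 0.05 but only
      the two smallest survive Benjamini-Hochberg at an FDR of 0.05.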
  - heading: First Principles
    markdown: >-
      - The data was generated by a noisy physical and chemical process; every
      step (extraction, library prep, sequencing, basecalling) leaves artifacts
      you must model or remove.

      - Statistical significance in genomics is cheap; biological plausibility
      and replication convert a hit into a finding.

      - A result you cannot reproduce from raw data and code is not yours to
      claim.

      - More data does not fix a confound; it makes a biased estimate precisely
      wrong.
  - heading: Questions Experts Constantly Ask
    markdown: >-
      - What is the data quality, and is there contamination, a sample swap, or
      a failed library hiding here?

      - Is there a batch effect, and is it confounded with my variable of
      interest?

      - How many tests am I running, and how am I controlling the false
      discovery rate?

      - Do I have more features than samples, and is my cross-validation
      leak-free?

      - Which reference genome and annotation version, and how does that bias
      what I can detect?

      - Can someone rerun this and get the identical result, and will the
      signature replicate in an independent cohort?
  - heading: Decision Frameworks
    markdown: >-
      - **QC gate before analysis.** Run FastQC/MultiQC, check duplication,
      adapter content, GC bias, and sex/genotype concordance; drop failing
      samples first. No QC, no analysis.

      - **Choosing correction stringency.** For a confirmatory genome-wide
      association, use the genome-wide significance threshold (p < 5e-8); for
      differential expression, Benjamini-Hochberg FDR at 0.05 or 0.1. Match
      stringency to the cost of a false positive.

      - **Alignment versus assembly.** Use a reference aligner (BWA, STAR,
      minimap2) when a good reference exists; use de novo assembly for a novel
      organism or structural variation the reference cannot represent.

      - **Validation strategy.** Validate the top handful by an orthogonal
      method (qPCR, targeted sequencing) and the central claim in an independent
      cohort.

      - **Reproducibility commitment.** Pin tool versions, containerize
      (Docker/Singularity), workflow-manage (Nextflow/Snakemake), and
      version-control code and parameters — at project start, not after a
      reviewer asks.
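

      The stringency choice above comes down to simple arithmetic. A sketch,
      assuming the common approximation of about one million independent tests
      across the genome, which is where the 5e-8 convention comes from;
      `bonferroni_threshold` is a hypothetical helper name.


      ```python

      def bonferroni_threshold(alpha, n_tests):
          """Per-test p-value cutoff that holds the family-wise error rate at alpha."""
          return alpha / n_tests

      # Roughly one million independent common-variant tests (an approximation).

      gwas_threshold = bonferroni_threshold(0.05, 1_000_000)

      # Bonferroni over 20,000 genes, for contrast with an FDR of 0.05 or 0.1.

      transcriptome_threshold = bonferroni_threshold(0.05, 20_000)

      ```


      The second value, 2.5e-6, shows why differential expression usually
      prefers FDR: a family-wise cutoff that strict buries weak true effects.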
  - heading: Workflow
    markdown: >-
      1. **Question.** Translate a biological question ("which genes drive this
      tumor's drug resistance?") into a precise computational test.

      2. **Design.** Specify samples, replicates, and controls with the wet-lab
      team; randomize across batches; power the comparison.

      3. **Ingest and QC.** Pull raw reads with provenance; detect
      contamination, swaps, and batch structure.

      4. **Build the pipeline.** Align and quantify expression (STAR + Salmon)
      or align and call variants (BWA + GATK), with pinned versions inside a
      workflow manager and container.

      5. **Correct and analyze.** Model or remove batch effects; normalize; run
      the pre-specified test with the chosen correction; report effect sizes and
      corrected p-values.

      6. **Interpret.** Filter hits through pathway, conservation, and prior
      biological knowledge; resist ranking by p-value alone.

      7. **Validate.** Confirm top findings by orthogonal assay and in an
      independent cohort.

      8. **Deposit and document.** Submit data to public or controlled-access
      archives (SRA, GEO, EGA) per FAIR; publish code; write methods precise
      enough to rerun.
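

      One way to make the pipeline and documentation steps above checkable is to
      write a small provenance manifest next to every result. This is a sketch
      under stated assumptions: the helper names and the manifest layout are
      hypothetical, and a workflow manager's own reports may already cover the
      same ground.


      ```python

      import hashlib

      import json

      def sha256_of(path, chunk_size=1 << 20):
          """Stream a file through SHA-256 so large inputs need not fit in memory."""
          digest = hashlib.sha256()
          with open(path, "rb") as handle:
              for chunk in iter(lambda: handle.read(chunk_size), b""):
                  digest.update(chunk)
          return digest.hexdigest()

      def build_manifest(input_paths, tool_versions, parameters):
          """Bundle input checksums, pinned versions, and parameters into one record."""
          return {
              "inputs": {path: sha256_of(path) for path in sorted(input_paths)},
              "tool_versions": dict(sorted(tool_versions.items())),
              "parameters": parameters,
          }

      def write_manifest(manifest, out_path):
          """Write the manifest as stable, sorted JSON so two runs can be diffed."""
          with open(out_path, "w") as handle:
              json.dump(manifest, handle, indent=2, sort_keys=True)

      ```


      If two runs yield different figures, comparing their manifests shows
      whether an input, a tool version, or a parameter changed.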
  - heading: Common Tradeoffs
    markdown: >-
      - **Sensitivity versus specificity in variant calling.** Loose filters
      catch real rare variants and a flood of false ones; strict filters are
      clean but miss true signal. Tune to the downstream use.

      - **Reference-based speed versus de novo unbiased discovery.** Mapping is
      fast and biased; assembly is unbiased and expensive.

      - **FDR stringency versus discovery.** A harsher correction protects
      against false positives but buries weak true effects; the right point
      depends on validation cost.

      - **Speed versus the wall between discovery and validation.** It is
      tempting to report the discovery-cohort finding; the credible result waits
      for the independent one.
  - heading: Rules of Thumb
    markdown: >-
      - Look at the QC report before you look at the biology.

      - If cases and controls were run on different days, suspect the day, not
      the disease.

      - Correct for multiple testing before you get excited about any single
      gene.

      - If p >> n, your classifier's accuracy is a fantasy until cross-validated
      without leakage.

      - Pin every tool version; "latest" is how a result becomes irreproducible.

      - A signature that doesn't replicate in a second cohort never existed.
  - heading: Failure Modes
    markdown: >-
      - **Batch confounding.** Reporting a "disease signature" that is really
      the difference between two sequencing runs.

      - **Uncorrected multiple testing.** Celebrating genes at nominal p < 0.05
      across a transcriptome-wide scan, most of them noise.

      - **Data leakage in cross-validation.** Normalizing or selecting features
      on the whole dataset before splitting, inflating accuracy to fiction.

      - **Reference bias ignored.** Missing variants in underrepresented
      populations, then generalizing anyway.

      - **Pipeline irreproducibility.** Unpinned tools and undocumented
      parameters, so the figure cannot be regenerated.
  - heading: Anti-patterns
    markdown: >-
      - **The default-parameters analysis** — running GATK or DESeq2 on defaults
      that don't fit the data and trusting the output.

      - **The single-cohort discovery published as truth** — no validation, no
      independent replication.

      - **The undocumented Jupyter notebook** — a result no one, including the
      author, can reproduce in six months.

      - **Overfitting a biomarker** — a 95%-accurate classifier on 40 samples
      and 20,000 genes that collapses on new data.

      - **p-hacking via feature selection** — trying gene sets and thresholds
      until something crosses significance.
  - heading: Vocabulary
    markdown: >-
      - **Alignment** — placing sequencing reads against a reference genome or
      transcriptome (BWA, Bowtie, STAR, minimap2).

      - **Variant calling** — identifying differences (SNPs, indels, structural
      variants) from a reference, via GATK or DeepVariant.

      - **FDR / Benjamini-Hochberg** — false discovery rate; the expected
      proportion of false positives among declared hits, and the standard
      procedure to control it.

      - **Batch effect** — systematic technical variation tied to processing
      groups that mimics or masks biological signal.

      - **p >> n** — far more features (genes) than samples; the regime where
      overfitting is the default.

      - **Reference bias** — distortion from mapping to a reference that
      underrepresents some sequences or ancestries.

      - **FAIR data** — Findable, Accessible, Interoperable, Reusable principles
      for data stewardship.
  - heading: Tools
    markdown: >-
      - **Aligners and quantifiers** — BWA, Bowtie2, STAR, minimap2, Salmon,
      kallisto.

      - **Variant callers** — GATK, bcftools, DeepVariant, Manta.

      - **Statistical packages** — DESeq2, edgeR, limma in R/Bioconductor.

      - **Workflow, containers, environments** — Nextflow, Snakemake; Docker,
      Singularity, Conda.

      - **QC suites** — FastQC, MultiQC, Picard, samtools.

      - **Compute and provenance** — HPC and cloud (AWS, GCP) with SLURM; Git,
      Jupyter, R Markdown.
  - heading: Collaboration
    markdown: >-
      Bioinformatics sits at the seam between the wet lab and the cluster, and
      most of the job is translation. The analyst works with molecular
      biologists and clinicians who own the biological question, sequencing-core
      staff who control library prep and batch structure (and must be lobbied to
      randomize), statisticians who sharpen the inference, software engineers
      who harden pipelines, and data stewards who enforce FAIR and consent. The
      recurring friction is the dry-lab/wet-lab gap: the analyst sees confounds
      the bench scientist designed in months ago, and the bench scientist knows
      context the ranking ignores. The best results come from analysts embedded
      early enough to shape experimental design.
  - heading: Ethics
    markdown: >-
      Genomic data is among the most identifying information that exists — it
      cannot be truly anonymized, it implicates relatives who never consented,
      and it can reveal disease risk a person did not want to know. Consent must
      be specific, data access controlled (dbGaP, EGA tiers), and secondary use
      governed. Reference bias is an equity issue: genomic medicine built
      largely on European-ancestry references underserves everyone else. The
      analyst owes honesty about uncertainty — a variant interpretation that
      overstates confidence can drive a prophylactic surgery or a missed
      diagnosis. Reproducibility is an ethical commitment: irreproducible
      biomarkers waste public money and patient hope.
  - heading: Scenarios
    markdown: >-
      **A striking differential-expression result.** An analyst finds 300 genes
      "significantly" up-regulated in disease versus control at p < 0.05. The
      expert runs the checks first. Benjamini-Hochberg FDR at 0.05 collapses 300
      hits to 12. Then PCA: samples separate by sequencing run, and disease
      cases cluster in run 1 — the "signature" is largely a batch effect
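      confounded with condition. If each run contains both cases and controls,
      the fix is to model batch as a covariate in the DESeq2 design; if the two
      are completely confounded, no model can separate them and the samples must
      be rerun with cases randomized across runs. What survives correction and
      batch adjustment, and replicates in a held-out cohort, is the real
      finding.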
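

      Before adding batch to a design formula, it helps to check whether batch
      and condition can be separated at all. A minimal sketch, assuming a sample
      sheet with hypothetical `batch` and `condition` fields; both example
      sheets are invented for illustration.


      ```python

      from collections import Counter

      def batch_by_condition(samples):
          """Count samples in each (batch, condition) cell."""
          return Counter((s["batch"], s["condition"]) for s in samples)

      def is_fully_confounded(samples):
          """True if every batch contains exactly one condition."""
          conditions_per_batch = {}
          for s in samples:
              conditions_per_batch.setdefault(s["batch"], set()).add(s["condition"])
          return all(len(c) == 1 for c in conditions_per_batch.values())

      # Hypothetical sheet: every case in run1, every control in run2.

      confounded = (
          [{"batch": "run1", "condition": "case"}] * 6
          + [{"batch": "run2", "condition": "control"}] * 6
      )

      # Hypothetical balanced sheet: both conditions present in both runs.

      balanced = [
          {"batch": b, "condition": c}
          for b in ("run1", "run2")
          for c in ("case", "control")
          for _ in range(3)
      ]

      ```


      When `is_fully_confounded` returns True, a batch term cannot be estimated
      separately from condition, so the remedy lies in the experimental design
      rather than the model.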


      **Building a diagnostic classifier on p >> n.** A team wants a classifier
      to predict treatment response from 60 patients across 20,000 genes, and an
      early model reports 97% accuracy. The expert is suspicious: with p >> n,
      near-perfect separation is the expected artifact, and the likely culprit
      is leakage — feature selection and normalization on the full dataset
      before cross-validation. They rebuild the evaluation so every
      preprocessing and selection step happens inside each fold, on that fold's
      training data only. Honest nested cross-validation drops accuracy to 68% —
      disappointing but real. They refuse to deploy until it validates on an
      independent cohort.
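

      The leakage described here can be reproduced on pure noise. The sketch
      below is plain Python with several stand-ins: a nearest-centroid
      classifier, a mean-difference feature ranking, and simple leave-one-out
      rather than the nested cross-validation in the scenario. All names and the
      simulated data are hypothetical.


      ```python

      import random

      def top_features(X, y, rows, k):
          """Rank features by absolute gap in class means, using only `rows`."""
          pos = [i for i in rows if y[i] == 1]
          neg = [i for i in rows if y[i] == 0]
          def gap(j):
              mean_pos = sum(X[i][j] for i in pos) / len(pos)
              mean_neg = sum(X[i][j] for i in neg) / len(neg)
              return abs(mean_pos - mean_neg)
          return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

      def predict(X, y, train, test_row, features):
          """Nearest-centroid prediction for one held-out sample."""
          def centroid(label):
              rows = [i for i in train if y[i] == label]
              return [sum(X[i][j] for i in rows) / len(rows) for j in features]
          def dist(c):
              return sum((X[test_row][j] - cj) ** 2 for j, cj in zip(features, c))
          return 0 if dist(centroid(0)) <= dist(centroid(1)) else 1

      def loo_accuracy(X, y, k, leak):
          """Leave-one-out accuracy; leak=True selects features on all samples first."""
          all_rows = list(range(len(X)))
          leaked = top_features(X, y, all_rows, k) if leak else None
          correct = 0
          for held_out in all_rows:
              train = [i for i in all_rows if i != held_out]
              # The honest path repeats feature selection on the training rows only.
              features = leaked if leak else top_features(X, y, train, k)
              correct += predict(X, y, train, held_out, features) == y[held_out]
          return correct / len(X)

      # Pure noise: 40 simulated samples, 500 features, labels unrelated to the data.

      rng = random.Random(0)

      X = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(40)]

      y = [i % 2 for i in range(40)]

      leaky = loo_accuracy(X, y, k=10, leak=True)

      honest = loo_accuracy(X, y, k=10, leak=False)

      ```


      Because the labels carry no signal, any accuracy well above chance in
      `leaky` comes from selecting features with the held-out sample in view;
      `honest` is the estimate to trust.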


      **A variant absent from the reference.** A clinical analysis fails to find
      a suspected pathogenic variant in a patient of non-European ancestry.
      Rather than concluding it is absent, the expert finds the reference region
      poorly represented and the haplotype divergent — reads carrying the
      variant aligned poorly and were filtered. Realigning against a graph-based
      reference recovers the reads, and the variant appears. The lesson,
      recorded in the protocol: "not found" against a biased reference is not
      "not present."
  - heading: Related Occupations
    markdown: >-
      A bioinformatics scientist sits between biology and computation, sharing
      the inferential discipline of science with the engineering discipline of
      reproducible software. The research scientist supplies the
      hypothesis-driven method; the biologist supplies the molecular questions
      and wet-lab validation. The data scientist and machine-learning engineer
      share the high-dimensional statistics and cross-validation, though
      molecular data adds its own p >> n hazards. The data engineer shares the
      pipeline craft.
  - heading: References
    markdown: >-
      - *Bioinformatics and Functional Genomics* — Jonathan Pevsner

      - *Modern Statistics for Modern Biology* — Holmes & Huber

      - "Controlling the False Discovery Rate" — Benjamini & Hochberg, *JRSS B*
      (1995)

      - "The FAIR Guiding Principles for scientific data management" — Wilkinson
      et al., *Scientific Data* (2016)

      - *Biostar Handbook* — Istvan Albert
