Question 1

What is bioinformatics and why is it important in biotechnology?

Accepted Answer

Bioinformatics is an interdisciplinary field that combines biology, computer science, mathematics, and statistics to analyze and interpret biological data, particularly large-scale molecular data such as DNA, RNA, and protein sequences. It is important in biotechnology because it enables analysis of genomic and proteomic data, identification of drug targets, prediction of protein structure and function, study of evolutionary relationships, and management of the vast amounts of biological data generated by modern techniques such as next-generation sequencing.

Question 2

What are the common file formats used to store DNA sequences?

Accepted Answer

Common DNA sequence file formats include: FASTA (simple text format with a header line starting with > followed by the sequence), GenBank/EMBL (annotated formats with features, references, and metadata), FASTQ (sequence plus a quality score for each base, used for NGS reads), SAM/BAM (Sequence Alignment/Map format for aligned reads; BAM is the compressed binary version), VCF (Variant Call Format for genetic variants), and GFF/GTF (General Feature Format and Gene Transfer Format for annotations). FASTA is the most universal format for basic sequence storage and analysis.
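
As a rough illustration of the two simplest formats, the sketch below parses FASTA text into a dictionary and decodes a FASTQ quality string into per-base error probabilities. The function names are made up for this example, and the quality decoding assumes the common Phred+33 encoding; production code would normally use an established parser such as Biopython instead.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is taken as the first whitespace-delimited token after '>'.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            records[current_id].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}


def phred_to_error_prob(quality: str, offset: int = 33) -> List[float]:
    """Decode a FASTQ quality string: Q = ord(char) - offset, P = 10^(-Q/10)."""
    return [10 ** (-(ord(ch) - offset) / 10) for ch in quality]


example = """>seq1 toy record
ACGT
ACGT
>seq2
GGCC
"""
records = parse_fasta(example.splitlines())
```

Here `records` maps `seq1` to the joined eight-base sequence, and a quality character of `I` (Q40 under Phred+33) corresponds to an error probability of 1 in 10,000.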

Question 3

What is BLAST and what is it used for?

Accepted Answer

BLAST (Basic Local Alignment Search Tool) is a sequence similarity search algorithm used to compare a query sequence against a database of sequences to find regions of similarity. It identifies homologous genes and proteins by finding local alignments with statistically significant similarity. Different BLAST programs exist: BLASTn (nucleotide vs nucleotide), BLASTp (protein vs protein), BLASTx (translated nucleotide vs protein), tBLASTn (protein vs translated nucleotide), and tBLASTx (translated vs translated). BLAST is essential for gene identification, functional annotation, and evolutionary studies.

Question 4

What is NCBI and what databases does it provide?

Accepted Answer

NCBI (National Center for Biotechnology Information) is a major repository of biological data, part of the National Library of Medicine at the US National Institutes of Health. Key databases include: GenBank (nucleotide sequences), RefSeq (curated reference sequences), PubMed (biomedical literature), Protein (protein sequences), Gene (gene-centric information), dbSNP (short genetic variants, including single nucleotide polymorphisms), SRA (Sequence Read Archive for raw sequencing data), Structure (3D structures derived from the PDB), and ClinVar (clinical variants). NCBI provides free access to these interconnected databases along with analysis tools like BLAST.

Question 5

What is pairwise sequence alignment?

Accepted Answer

Pairwise sequence alignment is the comparison of two sequences (DNA, RNA, or protein) to identify regions of similarity that may indicate functional, structural, or evolutionary relationships. There are two types: global alignment (aligns entire sequences end-to-end, best for similar-length sequences with high similarity, uses Needleman-Wunsch algorithm) and local alignment (finds regions of high similarity within sequences, useful for sequences with different lengths or partial similarity, uses Smith-Waterman algorithm). Alignment scoring considers matches, mismatches, and gaps using scoring matrices like BLOSUM or PAM for proteins.

Question 6

What is UniProt and what information does it provide?

Accepted Answer

UniProt (Universal Protein Resource) is a comprehensive database of protein sequence and functional information. Its main components are UniProtKB, which is split into Swiss-Prot (manually reviewed, high-quality annotations) and TrEMBL (automatically annotated, unreviewed); UniRef (clustered sequences for faster searching); and UniParc (a non-redundant archive of protein sequences). UniProt provides information on protein function, domains, post-translational modifications, subcellular location, tissue specificity, disease associations, and 3D structures. It links to other databases and is essential for functional annotation of proteins and proteomics research.

Question 7

What is an E-value in BLAST and how is it interpreted?

Accepted Answer

The E-value (Expect value) in BLAST represents the number of alignments with similar or better scores expected to occur by chance in a database of a given size. A lower E-value indicates a more significant match (less likely to be random). General interpretation: E-value < 10^-50 indicates near-identical sequences, < 10^-10 suggests homology, < 0.01 may be biologically meaningful but requires further analysis, while values > 1 are likely random matches. E-value depends on database size, sequence length, and alignment score. It is more reliable than percent identity alone for assessing significance.
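
The dependence on database size and score can be made concrete with the Karlin-Altschul relationship between a bit score S' and the expected number of chance hits, E = m * n * 2^(-S'), where m is the query length and n the total database length. The sketch below is only an approximation: BLAST itself uses effective (edge-corrected) lengths, so reported E-values will differ somewhat, and the function name is invented for this example.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2^(-S').

    query_len (m) and db_len (n) are raw lengths here; BLAST applies
    effective-length corrections that this sketch deliberately ignores.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# The same alignment becomes less significant as the database grows:
small_db = expect_value(bit_score=50, query_len=300, db_len=10**7)
large_db = expect_value(bit_score=50, query_len=300, db_len=10**9)
```

Because E scales linearly with database length, `large_db` is 100 times `small_db`, which is why the same hit can look significant against a small custom database and marginal against a comprehensive one.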

Question 8

What is multiple sequence alignment and when is it used?

Accepted Answer

Multiple sequence alignment (MSA) is the alignment of three or more biological sequences to identify conserved regions and patterns across related sequences. Common tools include ClustalW/Clustal Omega, MUSCLE, T-Coffee, and MAFFT. MSA is used for: identifying conserved functional domains, constructing phylogenetic trees, detecting evolutionary relationships, predicting protein secondary structure, designing degenerate primers for PCR, and identifying functional motifs. Progressive alignment methods build the MSA by first aligning the most similar sequences and progressively adding others.

Question 9

What is a phylogenetic tree and what information does it convey?

Accepted Answer

A phylogenetic tree is a branching diagram that represents evolutionary relationships among organisms, genes, or proteins based on similarities and differences in their sequences or characteristics. Key components include: leaves/tips (representing current species or sequences), branches (representing evolutionary lineages), nodes (representing common ancestors), and branch lengths (often representing evolutionary distance or time). Trees can be rooted (with a defined ancestor) or unrooted. Phylogenetic analysis helps understand evolutionary history, predict gene function, and classify organisms.

Question 10

What are the four levels of protein structure?

Accepted Answer

The four levels of protein structure are: Primary structure - the linear sequence of amino acids in the polypeptide chain, determined by the gene sequence. Secondary structure - local folding patterns stabilized by hydrogen bonds, including alpha-helices, beta-sheets, turns, and loops. Tertiary structure - the overall 3D shape of a single polypeptide, determined by interactions between side chains (hydrophobic, ionic, disulfide bonds, hydrogen bonds). Quaternary structure - the arrangement of multiple polypeptide subunits into a functional complex, present only in multi-subunit proteins like hemoglobin.

Question 11

What is the PDB database and what information does it contain?

Accepted Answer

The PDB (Protein Data Bank) is the primary global repository for experimentally determined 3D structures of biological macromolecules (proteins, nucleic acids, and their complexes). It contains atomic coordinates determined by X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy (cryo-EM). Each entry includes: 3D coordinates, experimental method and resolution, sequence information, ligand binding information, and literature references. The PDB is essential for structural biology, drug design, and understanding protein function. It is maintained by the wwPDB consortium with access through RCSB PDB, PDBe, and PDBj.

Question 12

What is a genome and what is genomics?

Accepted Answer

A genome is the complete set of genetic material (DNA) in an organism, including all genes and non-coding sequences. Genomics is the study of genomes, including their structure, function, evolution, and mapping. Key areas include: structural genomics (physical organization and mapping), functional genomics (gene function through expression studies), comparative genomics (comparing genomes across species), and personal genomics (individual genetic variation). Technologies like next-generation sequencing have enabled rapid genome sequencing and analysis, revolutionizing our understanding of biology and enabling personalized medicine.

Question 13

What is transcriptomics and how is it different from genomics?

Accepted Answer

Transcriptomics is the study of the transcriptome - the complete set of RNA transcripts produced by a genome at a specific time and condition. Unlike genomics, which studies the static DNA sequence, transcriptomics captures dynamic gene expression that varies by cell type, developmental stage, and environmental conditions. Key technologies include RNA-Seq and microarrays. Transcriptomics reveals which genes are active, their expression levels, alternative splicing variants, and non-coding RNAs. It is essential for understanding gene regulation, disease mechanisms, and cellular responses to treatments.

Question 14

What are BLOSUM and PAM scoring matrices used for?

Accepted Answer

BLOSUM (BLOcks SUbstitution Matrix) and PAM (Point Accepted Mutation) are substitution matrices used to score amino acid alignments in protein sequence comparison. They assign scores based on how frequently amino acid substitutions occur in related proteins. PAM matrices (PAM1, PAM120, PAM250) are based on evolutionary models, with higher numbers for distantly related sequences. BLOSUM matrices (BLOSUM62, BLOSUM80) are derived from conserved blocks in alignments, with higher numbers for closely related sequences. BLOSUM62 is commonly used as default in BLAST. These matrices improve alignment accuracy by weighting biologically meaningful substitutions.

Question 15

What is an Open Reading Frame (ORF)?

Accepted Answer

An Open Reading Frame (ORF) is a continuous stretch of DNA sequence that begins with a start codon (usually ATG) and ends with a stop codon (TAA, TAG, or TGA), potentially encoding a protein. A single DNA sequence has six possible reading frames (three on each strand), and bioinformatics tools identify ORFs to predict potential genes. Key considerations include: minimum length requirements (typically >100 codons for significance), presence of regulatory elements like promoters and ribosome binding sites, and codon usage bias. ORF identification is a fundamental step in genome annotation and gene prediction.
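
A minimal six-frame ORF scan, under simplifying assumptions, might look like the sketch below. It only accepts ATG as a start codon, ignores ambiguous bases, and counts coding codons without the stop codon; the function names and the `min_codons` parameter are illustrative and not taken from any particular tool.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 100) -> List[Tuple[str, int, str]]:
    """Return (strand, frame, orf) for every ATG...stop ORF in six frames.

    An ORF is reported only if it has at least `min_codons` coding codons
    (the stop codon is not counted).
    """
    seq = seq.upper()
    orfs = []
    for strand, dna in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(dna) - 2, 3):
                codon = dna[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    orf = dna[start:i + 3]
                    if len(orf) // 3 - 1 >= min_codons:
                        orfs.append((strand, frame, orf))
                    start = None  # look for the next ORF in this frame
    return orfs
```

For example, `find_orfs("ATGAAATAG", min_codons=2)` reports a single forward-strand ORF in frame 0, while the default threshold of 100 codons filters it out.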

Question 16

Describe a typical NGS data analysis pipeline and its key steps.

Accepted Answer

A typical NGS data analysis pipeline includes: 1) Quality control - assess raw read quality using FastQC, check for adapter contamination and base quality scores. 2) Preprocessing - trim adapters and low-quality bases using Trimmomatic or Cutadapt. 3) Alignment - map reads to reference genome using BWA, Bowtie2, or STAR (for RNA-Seq). 4) Post-alignment processing - sort, index, mark duplicates using SAMtools and Picard. 5) Variant calling - identify SNPs/indels using GATK, FreeBayes, or application-specific tools. 6) Annotation - annotate variants with functional impact using ANNOVAR, VEP, or SnpEff. 7) Visualization and interpretation - use IGV, R/Bioconductor for downstream analysis.

Question 17

Explain how dynamic programming is used in sequence alignment.

Accepted Answer

Dynamic programming builds an optimal alignment by breaking the problem into smaller subproblems. For two sequences of length m and n, it fills an (m+1) x (n+1) matrix in which cell (i,j) holds the best alignment score for the first i characters of sequence 1 and the first j characters of sequence 2. Each cell is calculated from three possibilities: the diagonal cell (i-1,j-1) (match/mismatch), the cell above (i-1,j) (a character of sequence 1 aligned to a gap in sequence 2), and the cell to the left (i,j-1) (a character of sequence 2 aligned to a gap in sequence 1). Needleman-Wunsch (global) initializes the edges with gap penalties and traces back from the bottom-right cell. Smith-Waterman (local) floors scores at zero and traces back from the highest-scoring cell until it reaches a zero. Time complexity is O(mn); space can be reduced to O(min(m,n)) when only the score is needed, and Hirschberg's algorithm recovers the full alignment in linear space.
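
The recurrence can be sketched as a small Needleman-Wunsch implementation. The scoring values (match +1, mismatch -1, linear gap -2) are arbitrary assumptions for illustration; real tools use substitution matrices and affine gap penalties. Rows index the first sequence and columns the second.

```python
from typing import Tuple


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     gap: int = -2) -> Tuple[int, str, str]:
    """Global alignment with a linear gap penalty; returns (score, a', b')."""
    m, n = len(a), len(b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    # Edges: aligning a prefix against nothing costs one gap per character.
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # diagonal
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Traceback from the bottom-right cell to the origin.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[m][n], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Aligning `"GATTACA"` with `"GATACA"` under these parameters gives a score of 4: six matches and one gap. A Smith-Waterman variant would add 0 as a fourth option in the `max` and start the traceback from the highest-scoring cell.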

Question 18

What are Hidden Markov Models and how are they applied in bioinformatics?

Accepted Answer

Hidden Markov Models (HMMs) are probabilistic models for sequences where the underlying states are hidden but emit observable symbols. In bioinformatics, HMMs are used for: gene prediction (modeling exons, introns, and intergenic regions), protein family modeling (profile HMMs in HMMER and Pfam for domain detection), multiple sequence alignment construction, secondary structure prediction, and transmembrane topology prediction. Key algorithms include: Viterbi (find most probable state path), Forward-Backward (calculate state probabilities), and Baum-Welch (parameter training). Profile HMMs capture position-specific amino acid preferences and insertion/deletion probabilities for protein families.
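
To make the Viterbi step concrete, here is a small log-space sketch on a toy two-state model that labels a DNA string as GC-rich ("H") or AT-rich ("L"). The states, probabilities, and names are invented for illustration, and the code assumes every probability is non-zero so that the logarithms are defined.

```python
import math
from typing import Dict, List

# Toy model (assumed values): H = GC-rich region, L = AT-rich region.
STATES = ["H", "L"]
START = {"H": 0.5, "L": 0.5}
TRANS = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
EMIT = {
    "H": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},
    "L": {"A": 0.45, "C": 0.05, "G": 0.05, "T": 0.45},
}


def viterbi(obs: str, states: List[str], start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable hidden-state path for `obs` (non-empty)."""
    # best[t][s]: log-probability of the best path ending in state s at time t.
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in states}]
    back = []  # back[t - 1][s]: best predecessor of state s at time t
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states,
                       key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][obs[t]]))
            back[t - 1][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


labels = "".join(viterbi("GGCCAATT", STATES, START, TRANS, EMIT))
```

With these parameters the sequence `GGCCAATT` is decoded as `HHHHLLLL`: the model pays one transition penalty to switch states where the base composition changes.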

Question 19

Compare distance-based and character-based methods for phylogenetic tree construction.

Accepted Answer

Distance-based methods (UPGMA, Neighbor-Joining) convert sequences to a distance matrix and cluster based on pairwise distances. They are computationally fast but lose sequence information during matrix conversion. UPGMA assumes molecular clock (constant evolution rate), while Neighbor-Joining does not. Character-based methods (Maximum Parsimony, Maximum Likelihood, Bayesian) use sequence data directly. Maximum Parsimony finds trees minimizing evolutionary changes but can be inconsistent. Maximum Likelihood evaluates probability of data given a tree model, is statistically rigorous but computationally intensive. Bayesian methods estimate posterior probabilities of trees using MCMC sampling. ML and Bayesian methods are generally preferred for accuracy.
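
The first step of any distance-based method, turning aligned sequences into a distance matrix, can be sketched as follows. The Jukes-Cantor correction is used here as one common, simple substitution model; that choice is an assumption of this example, and the function names are illustrative. The resulting matrix is what UPGMA or Neighbor-Joining would then cluster.

```python
import math
from typing import Dict, List


def p_distance(a: str, b: str) -> float:
    """Fraction of differing sites, skipping columns with a gap in either row."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """JC69 corrected distance: d = -(3/4) * ln(1 - 4p/3)."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


def distance_matrix(alignment: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Pairwise JC69 distances between equal-length aligned sequences."""
    names: List[str] = list(alignment)
    return {
        a: {b: jukes_cantor(p_distance(alignment[a], alignment[b]))
            for b in names}
        for a in names
    }
```

The correction matters because observed differences underestimate the true number of substitutions: an observed p of 0.125 corresponds to a corrected distance of about 0.137. Reducing each pair of sequences to one number is also exactly where the loss of site-level information occurs.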

Question 20

Describe the main approaches to protein structure prediction.

Accepted Answer

Protein structure prediction approaches include: 1) Homology/Comparative modeling - uses known structures of homologous proteins as templates, most accurate when sequence identity >30%. Tools: SWISS-MODEL, MODELLER. 2) Threading/Fold recognition - matches query sequence to known fold library even without detectable sequence similarity. Tools: I-TASSER, Phyre2. 3) Ab initio/De novo - predicts structure from physical principles without templates, computationally intensive. Uses energy minimization and fragment assembly. 4) Deep learning methods - AlphaFold2 and RoseTTAFold use neural networks trained on known structures, achieving near-experimental accuracy. They revolutionized the field and can predict structures for most proteins with high confidence.

Question 21

How is differential gene expression analysis performed using RNA-Seq data?

Accepted Answer

Differential expression analysis from RNA-Seq involves: 1) Read alignment - map reads to the reference genome/transcriptome using STAR or HISAT2. 2) Quantification - count reads per gene using featureCounts or HTSeq, or estimate transcript abundances with alignment-free tools such as Salmon or kallisto. 3) Normalization - account for library size and composition using methods such as TMM (edgeR) or median-of-ratios (DESeq2); TPM is useful for reporting expression but is not recommended as input to count-based tests. 4) Statistical testing - identify differentially expressed genes using negative binomial models in DESeq2 or edgeR, which account for overdispersion in count data, or linear models with precision weights in limma-voom. 5) Multiple testing correction - apply FDR/Benjamini-Hochberg to control false positives. 6) Downstream analysis - gene set enrichment (GSEA), pathway analysis, clustering, and visualization with heatmaps and volcano plots.
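
The multiple testing step is easy to show in isolation. Below is a plain-Python sketch of the Benjamini-Hochberg adjustment; packages such as DESeq2 and edgeR apply this correction internally, so the function is only meant to show what the adjusted p-values are, not to replace them.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity:
    # adj_(rank) = min over ranks >= rank of p * n / rank.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

For the four p-values above, the adjusted values are 0.02, 0.04, 0.04, and 0.02 (up to floating-point rounding), so all four genes would pass a 5% FDR threshold.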

Question 22

What is Gene Ontology and how is GO enrichment analysis performed?

Accepted Answer

Gene Ontology (GO) is a standardized vocabulary describing gene and protein functions across three domains: Biological Process (BP - what the gene does), Molecular Function (MF - biochemical activity), and Cellular Component (CC - where it functions). GO enrichment analysis tests whether specific GO terms are over-represented in a gene list compared to background. Methods include: hypergeometric/Fisher's exact test (for gene lists), GSEA (for ranked lists), and topGO (accounting for GO hierarchy). Multiple testing correction is essential due to many tests. Tools include DAVID, g:Profiler, clusterProfiler, and Enrichr. Results help interpret biological meaning of gene sets from experiments.
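
The over-representation test for a single GO term can be sketched with the hypergeometric distribution, which is equivalent to a one-sided Fisher's exact test. The parameter names below are chosen for this example; in a real analysis the p-value would be computed for every term and then corrected for multiple testing.

```python
from math import comb


def enrichment_p_value(overlap: int, list_size: int,
                       term_size: int, background_size: int) -> float:
    """P(X >= overlap) when drawing `list_size` genes from a background of
    `background_size` genes, of which `term_size` carry the GO term."""
    max_overlap = min(list_size, term_size)
    total = comb(background_size, list_size)
    favourable = sum(
        comb(term_size, k) * comb(background_size - term_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favourable / total


# Hypothetical numbers: 20,000 background genes, 200 annotated to the term,
# a 100-gene list of which 8 carry the term (about 1 expected by chance).
p = enrichment_p_value(overlap=8, list_size=100,
                       term_size=200, background_size=20000)
```

The choice of background matters as much as the test itself: it should contain only genes that could have been detected in the experiment, not the whole genome by default.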

Question 23

Explain the process of variant calling for SNPs and indels from NGS data.

Accepted Answer

Variant calling identifies genetic variants from aligned NGS reads. The process involves: 1) Read preprocessing - quality filtering, duplicate marking, base quality recalibration (BQSR in GATK). 2) Pileup generation - stack reads at each genomic position to count alleles. 3) Variant detection - statistical models distinguish true variants from sequencing errors. GATK HaplotypeCaller performs local de novo assembly. FreeBayes uses Bayesian approaches. 4) Filtering - apply quality filters (QUAL score, depth, strand bias, mapping quality) using VQSR or hard filters. 5) Annotation - determine functional impact using VEP, ANNOVAR, or SnpEff. Challenges include: low-frequency variants, repetitive regions, indel alignment ambiguity, and systematic errors.

Question 24

What is molecular docking and how is it used in drug discovery?

Accepted Answer

Molecular docking predicts the preferred orientation and binding affinity of a small molecule (ligand) to a protein target. The process involves: 1) Protein preparation - add hydrogens, assign charges, define binding site. 2) Ligand preparation - generate 3D conformers, assign charges. 3) Docking - search algorithms (genetic algorithms, Monte Carlo, incremental construction) sample ligand poses while scoring functions estimate binding affinity. 4) Scoring - physics-based (force fields), empirical, or knowledge-based functions rank poses. Popular tools include AutoDock, Glide, GOLD, and MOE. In drug discovery, docking enables virtual screening of compound libraries, lead optimization, and understanding of binding mechanisms, significantly reducing experimental screening costs.

Question 25

What is metagenomics and how is metagenomic data analyzed?

Accepted Answer

Metagenomics is the study of genetic material recovered directly from environmental samples, enabling analysis of microbial communities without cultivation. Analysis approaches include: 1) Amplicon sequencing (16S/18S/ITS) - PCR-amplify marker genes, cluster into OTUs or ASVs (DADA2, QIIME2), assign taxonomy. 2) Shotgun metagenomics - sequence all DNA, provides functional information. Reads are assembled (metaSPAdes, MEGAHIT), binned into MAGs (MetaBAT, CONCOCT), and annotated. Taxonomic profiling uses MetaPhlAn, Kraken, or Kaiju. Functional profiling uses HUMAnN, MEGAN, or MG-RAST. Applications include: gut microbiome studies, environmental monitoring, antibiotic resistance surveillance, and novel enzyme discovery.

Question 26

How are protein domains and motifs identified from sequences?

Accepted Answer

Protein domains and motifs are identified using: 1) Profile HMM searches - HMMER searches against Pfam database to identify conserved domains with statistical significance. InterProScan integrates multiple databases (Pfam, SMART, CDD, PROSITE). 2) Pattern matching - PROSITE patterns define short motifs using regular expressions. 3) Neural networks - predictors like PSIPRED identify secondary structure, SignalP detects signal peptides. 4) Sequence features - PEST sequences, nuclear localization signals, transmembrane regions have characteristic compositions. Results help predict protein function, subcellular localization, and evolutionary relationships. Domain architecture comparison reveals domain shuffling and protein evolution patterns.

Question 27

Describe the process of genome annotation.

Accepted Answer

Genome annotation identifies functional elements in a genome sequence. It involves: 1) Repeat masking - identify and mask repetitive elements using RepeatMasker. 2) Gene prediction - ab initio predictors (AUGUSTUS, GeneMark) use HMMs trained on known genes; evidence-based methods incorporate EST/RNA-Seq alignments and protein homology. MAKER and BRAKER combine both approaches. 3) Functional annotation - assign gene names and functions through homology searches (BLAST), domain prediction (InterProScan), and pathway mapping (KEGG, GO). 4) Non-coding element annotation - identify rRNAs, tRNAs (tRNAscan-SE), small RNAs, and regulatory elements. 5) Quality control and manual curation - validate predictions, resolve conflicts. Output formats include GFF3 and GenBank.

Question 28

How is ChIP-Seq data analyzed to identify transcription factor binding sites?

Accepted Answer

ChIP-Seq analysis identifies DNA regions bound by proteins. Pipeline: 1) Quality control and preprocessing - FastQC, adapter trimming, filter low-quality reads. 2) Alignment - map reads to reference genome using Bowtie2 or BWA (allow multi-mappers carefully). 3) Peak calling - identify enriched regions using MACS2, which models fragment size and compares to input/IgG control. Broad peaks for histone modifications, narrow for transcription factors. 4) Peak annotation - assign peaks to nearby genes using ChIPseeker or HOMER, analyze genomic distribution. 5) Motif discovery - identify enriched sequence motifs using HOMER, MEME-ChIP. 6) Differential binding analysis - DiffBind compares binding between conditions. Integrative analysis with RNA-Seq reveals regulatory networks.

Question 29

What are the key considerations in single-cell RNA-Seq data analysis?

Accepted Answer

Single-cell RNA-Seq (scRNA-Seq) analysis requires handling unique challenges: 1) Quality control - filter cells based on gene count, UMI count, and mitochondrial percentage to remove empty droplets, doublets, and damaged cells. 2) Normalization - account for technical variation and library size differences using scran or SCTransform. 3) Dimensionality reduction - reduce noise with PCA, then visualize with UMAP/t-SNE. 4) Batch correction - remove technical batch effects using Harmony, Seurat integration, or MNN. 5) Clustering - identify cell populations using graph-based methods (Louvain, Leiden). 6) Cell type annotation - use marker genes or reference datasets (SingleR, CellTypist). 7) Trajectory analysis - infer developmental trajectories using Monocle, or RNA velocity using velocyto. Tools: Seurat, Scanpy.

Question 30

Explain the basic workflow of a Genome-Wide Association Study (GWAS).

Accepted Answer

GWAS identifies genetic variants associated with traits or diseases across many individuals. Workflow: 1) Sample collection and genotyping - collect DNA, genotype using SNP arrays or WGS. 2) Quality control - filter samples (call rate, relatedness, ancestry outliers) and SNPs (call rate, Hardy-Weinberg equilibrium, minor allele frequency). 3) Population stratification - detect with principal component analysis (PCA) and correct by including the top principal components as covariates. 4) Association testing - test each SNP for trait association using regression models (linear for quantitative, logistic for binary traits). Tools: PLINK, GCTA, BOLT-LMM. 5) Multiple testing correction - Bonferroni or permutation testing; genome-wide significance is typically p < 5x10^-8. 6) Post-GWAS - fine-mapping, functional annotation, pathway analysis, polygenic risk scores.
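
One of the SNP-level QC filters, the Hardy-Weinberg equilibrium check, can be sketched as a chi-square goodness-of-fit statistic on genotype counts. This is a simplified illustration with made-up names: GWAS tools typically use an exact test and a stringent threshold, usually applied to controls only.

```python
def hwe_chi_square(hom_ref: int, het: int, hom_alt: int) -> float:
    """Chi-square statistic (1 degree of freedom) for deviation from HWE."""
    n = hom_ref + het + hom_alt
    if n == 0:
        raise ValueError("no genotypes")
    p = (2 * hom_ref + het) / (2 * n)  # reference allele frequency
    q = 1 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [hom_ref, het, hom_alt]
    # Monomorphic sites have zero expected counts; skip those terms.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


in_equilibrium = hwe_chi_square(25, 50, 25)   # matches p^2, 2pq, q^2 exactly
no_heterozygotes = hwe_chi_square(50, 0, 50)  # a typical genotyping artifact
```

A site with no heterozygotes at an allele frequency of 0.5 gives a very large statistic and would be removed, since such patterns more often reflect genotyping error than biology.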

Question 31

How are protein-protein interaction networks constructed and analyzed?

Accepted Answer

Protein-protein interaction (PPI) networks represent physical or functional relationships between proteins. Construction sources: 1) Experimental - yeast two-hybrid, co-immunoprecipitation, affinity purification mass spectrometry (AP-MS). 2) Databases - STRING, BioGRID, IntAct compile curated interactions. 3) Computational prediction - gene co-expression, domain interactions, phylogenetic profiles. Analysis methods: 1) Network topology - identify hubs (highly connected nodes), betweenness centrality (information flow). 2) Clustering - detect protein complexes and functional modules using MCL, MCODE. 3) Enrichment analysis - determine pathway/GO enrichment of clusters. 4) Integration - overlay with expression data, mutations. Tools: Cytoscape for visualization, NetworkX/igraph for analysis.

Question 32

How is machine learning applied in bioinformatics?

Accepted Answer

Machine learning is widely applied in bioinformatics for prediction and classification: 1) Sequence-based predictions - protein secondary structure (PSIPRED), subcellular localization, gene prediction, splice site detection. 2) Structure - protein structure prediction (AlphaFold), protein-ligand binding. 3) Function prediction - enzyme function, drug-target interactions. 4) Expression analysis - cell type classification, biomarker discovery. 5) Variant interpretation - pathogenicity prediction (CADD, REVEL). Methods include: Random Forests (feature interpretability), SVMs (small datasets), Neural Networks/Deep Learning (large datasets, complex patterns), CNNs (sequence motifs), RNNs/Transformers (sequential data). Key considerations: feature engineering, cross-validation, overfitting prevention, biological interpretability.

Question 33

What is bootstrapping in phylogenetics and how should bootstrap values be interpreted?

Accepted Answer

Bootstrapping in phylogenetics assesses the statistical support for branches in a phylogenetic tree. The process: 1) Resample alignment columns with replacement to create pseudo-replicates of the same size. 2) Build a tree from each bootstrap replicate (typically 100-1000 replicates). 3) Calculate bootstrap support - percentage of replicate trees containing each clade. Interpretation: >90% strong support, 70-90% moderate support, <70% weak support. Limitations: bootstrap values are measures of repeatability, not accuracy; can be inflated with many sites; don't indicate if the correct tree was found. Alternatives include Bayesian posterior probabilities (which measure probability of clade) and SH-aLRT support values (faster approximation).
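
The resampling and support-counting steps can be sketched as below. Tree building is left out: the sketch assumes each replicate tree has already been reduced to a set of clades, with each clade represented as a frozenset of taxon names. That representation and the function names are choices made for this example.

```python
import random
from typing import Dict, FrozenSet, List, Set


def bootstrap_alignment(alignment: Dict[str, str],
                        rng: random.Random) -> Dict[str, str]:
    """Resample alignment columns with replacement, keeping the same length."""
    n_cols = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same column indices are used for every sequence, so each sampled
    # column stays intact across taxa.
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


def clade_support(replicate_clades: List[Set[FrozenSet[str]]],
                  clade: FrozenSet[str]) -> float:
    """Percentage of replicate trees that contain `clade`."""
    hits = sum(clade in clades for clades in replicate_clades)
    return 100.0 * hits / len(replicate_clades)


alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "ATGTTC"}
replicate = bootstrap_alignment(alignment, random.Random(0))
```

Each pseudo-replicate would be passed to the same tree-building method as the original alignment, and `clade_support` would then be evaluated for every clade of the original tree.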

Question 34

Explain how de Bruijn graphs are used in genome assembly.

Accepted Answer

De Bruijn graphs are the basis of most short-read genome assemblers. Construction: 1) Break reads into k-mers (overlapping subsequences of length k). 2) Create nodes for each unique (k-1)-mer. 3) Create directed edges representing k-mers connecting consecutive (k-1)-mers. In the idealized case, assembly corresponds to finding an Eulerian path (visiting each edge once) through the graph; in practice, assemblers simplify the graph and report unambiguous paths as contigs. Advantages: handles high coverage efficiently, implicitly captures read overlaps without all-vs-all comparison. Challenges: repeats create tangles and cycles, sequencing errors and heterozygous sites create tips and bubbles, and k-mer choice affects resolution (larger k resolves more repeats but requires higher coverage). Assemblers (SPAdes, MEGAHIT) use error correction, coverage information, paired-end reads, and multiple k values to improve contiguity.
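
A toy version of the construction and the Eulerian walk is sketched below. It assumes error-free reads and a genome with no repeated (k-1)-mers, keeps one edge per distinct k-mer (real assemblers track multiplicity as coverage), and ignores reverse complements. None of this reflects how SPAdes or MEGAHIT are actually implemented.

```python
from collections import defaultdict
from typing import Dict, List


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, List[str]]:
    """Nodes are (k-1)-mers; each distinct k-mer adds one prefix->suffix edge."""
    graph: Dict[str, List[str]] = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph: Dict[str, List[str]]) -> List[str]:
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg: Dict[str, int] = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    edges = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads: List[str], k: int) -> str:
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(build_de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


contig = assemble(["ATGGCGT", "GCGTGCA"], k=4)
```

The two overlapping reads are merged into `ATGGCGTGCA` without ever being compared to each other directly, which is the property that makes the approach scale to high coverage.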

Question 35

How are predicted protein structures validated?

Accepted Answer

Protein structure validation assesses quality at multiple levels: 1) Stereochemistry - Ramachandran plot analysis (most residues in favored regions), bond lengths and angles, rotamer conformations. Tools: MolProbity, PROCHECK, WHAT_CHECK. 2) Packing - check for atomic clashes, cavity analysis, and proper hydrophobic burial. 3) Energy-based - molecular mechanics scoring, compatibility with expected energy profiles. 4) Sequence-structure compatibility - 3D-1D profile methods (Verify3D), ProSA checks against statistical potentials. 5) For experimental structures - R-factor and R-free, resolution (X-ray), restraint violations (NMR). 6) For predicted structures - AlphaFold pLDDT scores, predicted aligned error (PAE), and, when a reference structure is available, similarity scores such as GDT-TS and TM-score. Poor scores indicate potential errors requiring attention.

Question 36

What are the unique challenges and strategies for long-read sequencing data analysis?

Accepted Answer

Long-read technologies (PacBio, Oxford Nanopore) present unique challenges: 1) Error rates - raw reads have historically had high error rates (5-15% vs roughly 0.1% for Illumina), which calls for error correction by consensus calling (multiple passes over the same molecule in PacBio HiFi, which brings accuracy above 99%) or hybrid correction with short reads (LoRDEC, FMLRC). 2) Assembly strategies - overlap-layout-consensus and related graph approaches (Canu, Flye) instead of de Bruijn graphs; long reads span repeats, enabling more contiguous assemblies. 3) Alignment considerations - minimap2 handles high error rates with appropriate presets; alignment scoring must tolerate indels. 4) Polishing - use long reads (Racon, Medaka) or short reads (Pilon) to correct consensus errors in assemblies. 5) Structural variant detection - long reads excel at detecting large insertions, deletions, and complex rearrangements. 6) Direct modification detection - nanopore sequencing can detect base modifications (methylation) without bisulfite treatment.

Question 37

How does spatial transcriptomics data analysis differ from standard scRNA-Seq?

Accepted Answer

Spatial transcriptomics captures gene expression with spatial context, requiring specialized analysis: 1) Technology-specific preprocessing - Visium (10x) captures RNA on spots that usually cover several cells, so deconvolution is needed; MERFISH/seqFISH provide single-cell resolution but limited gene panels; Slide-seq uses barcoded beads. 2) Quality control - spatial artifacts, tissue edge effects, spot cell composition. 3) Normalization - must account for spatial variation in cell density and RNA capture efficiency. 4) Spatial analysis - identify spatially variable genes (SpatialDE, SPARK), spatially co-expressed gene modules, and tissue domains through spatial clustering (BayesSpace, or SpaGCN, which uses a graph convolutional network). 5) Cell type deconvolution - estimate cell type proportions per spot using scRNA-Seq references (SPOTlight, Cell2location, RCTD). 6) Integration - combine with histology images, multi-modal data. 7) Spatial statistics - Moran's I for spatial autocorrelation, Ripley's K for point patterns.
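
As an example of the spatial statistics mentioned above, the sketch below computes Moran's I for one gene across spots, using binary neighbour weights from spot coordinates. The radius-based neighbourhood and the function names are assumptions of this example; dedicated packages offer more weighting schemes and significance testing.

```python
import math
from typing import List, Sequence, Tuple


def neighbor_weights(coords: Sequence[Tuple[float, float]],
                     radius: float) -> List[List[int]]:
    """Binary weights: 1 if two distinct spots lie within `radius`, else 0."""
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) <= radius else 0
             for j in range(n)] for i in range(n)]


def morans_i(values: Sequence[float], weights: List[List[int]]) -> float:
    """Moran's I = (N / W) * sum_ij w_ij * d_i * d_j / sum_i d_i^2,
    where d_i is the deviation of spot i from the mean expression."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    denominator = sum(d * d for d in dev)
    if total_weight == 0 or denominator == 0:
        raise ValueError("Moran's I is undefined for this input")
    numerator = sum(weights[i][j] * dev[i] * dev[j]
                    for i in range(n) for j in range(n))
    return (n / total_weight) * (numerator / denominator)


# Four spots in a row, one unit apart: clustered vs alternating expression.
spots = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
w = neighbor_weights(spots, radius=1.0)
clustered = morans_i([1, 1, 0, 0], w)
alternating = morans_i([1, 0, 1, 0], w)
```

The clustered pattern yields a positive value (neighbouring spots are alike), while the alternating pattern yields -1 (neighbouring spots are always different). Values near zero indicate no spatial structure.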

Question 38

Describe strategies for integrating multi-omics data (genomics, transcriptomics, proteomics, metabolomics).

Accepted Answer

Multi-omics integration combines data layers for comprehensive biological understanding. Strategies: 1) Concatenation-based - merge features after normalization and scaling; simple but ignores inter-omic relationships. 2) Transformation-based - project to a common latent space using factor models (MOFA), joint latent-variable clustering (iCluster), CCA, or autoencoders. 3) Network-based - build interaction networks across omics layers; integrate with prior knowledge (STRING, Reactome). 4) Bayesian approaches - model relationships probabilistically, propagate information across layers. 5) Pathway-based - aggregate signals at pathway level (PARADIGM). Challenges: different scales, missing data, batch effects across platforms, sample mismatch, biological vs technical variation. Best practices: careful experimental design, matched samples, appropriate normalization, validation across cohorts. Tools: MOFA+, mixOmics, SNF, NetICS.

Question 39

How should AlphaFold predictions be interpreted and what are its limitations?

Accepted Answer

AlphaFold2 interpretation requires understanding its outputs and limitations: 1) Confidence metrics - pLDDT (per-residue confidence: >90 very high, 70-90 confident, 50-70 low, <50 very low and often disordered); PAE (predicted aligned error) indicates domain relationships and confidence in relative positions. 2) Strengths - excellent for globular domains with homologs; accurately predicts backbone and most side chains. 3) Limitations - predicts a single static structure, not conformational ensembles or dynamics; may not capture effects of ligands, post-translational modifications, or partner proteins; struggles with intrinsically disordered regions, membrane proteins in lipid context, and large conformational changes; novel folds without homologs have lower accuracy. 4) Multi-domain proteins - domains may be correctly folded but their relative orientations uncertain (check PAE). 5) Complexes - AlphaFold-Multimer addresses this but accuracy varies. Best practices: validate with experimental data, use molecular dynamics to study dynamics, be cautious with low-confidence regions.
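
A quick way to apply the pLDDT guidance is to bin per-residue scores into confidence bands before looking at a model. The sketch below uses the four bands commonly shown for AlphaFold models (very high, confident, low, very low); how scores exactly on a boundary are assigned is an assumption of this example, as are the function names and the input scores.

```python
from collections import Counter
from typing import Dict, Sequence

BANDS = ("very high", "confident", "low", "very low")


def plddt_band(score: float) -> str:
    """Map a per-residue pLDDT score (0-100) to a confidence band."""
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"  # often corresponds to disordered regions


def band_fractions(scores: Sequence[float]) -> Dict[str, float]:
    """Fraction of residues falling in each confidence band."""
    counts = Counter(plddt_band(s) for s in scores)
    return {band: counts[band] / len(scores) for band in BANDS}


# Hypothetical per-residue scores, e.g. read from the B-factor column
# of an AlphaFold model file.
summary = band_fractions([95.2, 91.0, 88.4, 72.3, 64.9, 41.7, 38.2, 35.0])
```

A model where most residues fall in the two lower bands should be treated as a hypothesis about disorder or missing context rather than as a structure, and PAE should still be checked separately for domain placement.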

Question 40

What methods are used for rare variant association testing and why are they different from common variant GWAS?

Accepted Answer

Rare variants (MAF < 1%) require specialized association methods due to statistical challenges: 1) Problem - standard single-variant tests lack power because few individuals carry each variant; multiple testing burden is severe. 2) Aggregation strategies - collapse variants in functional units (genes, pathways, regulatory regions). Methods: burden tests (CAST, CMC) sum variant effects assuming same direction; variance-component tests (SKAT) allow mixed directions using kernel methods; SKAT-O optimally combines both. 3) Weighting - weight variants by MAF (rarer variants are assumed to have larger effects) or functional annotation (CADD, PolyPhen, loss-of-function). 4) Sequencing requirements - exome or whole-genome sequencing observes rare variants directly, whereas genotyping arrays mainly capture tagged common variation. 5) Large sample sizes - cohorts such as UK Biobank provide power for discovery; gnomAD supplies reference allele frequencies. 6) Study designs - family-based studies enrich for rare variants; extreme-phenotype sampling. 7) Interpretation challenges - functional validation critical; many rare variants of uncertain significance.
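
To make the burden-test idea concrete, here is a small sketch that collapses the rare variants of one gene into a weighted score per individual and compares cases with controls by label permutation. The weight 1/sqrt(MAF(1-MAF)), the permutation test, and the toy genotypes are illustrative assumptions; this is not the CAST, CMC, or SKAT procedure.

```python
import random
from math import sqrt

def maf_weights(genotypes):
    """Weight each variant by 1/sqrt(MAF*(1-MAF)) so rarer variants count more."""
    n_alleles = 2 * len(genotypes)
    weights = []
    for column in zip(*genotypes):
        maf = sum(column) / n_alleles
        weights.append(1.0 / sqrt(maf * (1.0 - maf)) if 0 < maf < 1 else 0.0)
    return weights

def burden_scores(genotypes, weights):
    """Collapse each individual's alternate-allele counts into one weighted score."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

def permutation_p_value(scores, is_case, n_perm=2000, seed=1):
    """One-sided p-value for a higher mean burden in cases, by label permutation."""
    def mean_diff(labels):
        cases = [s for s, c in zip(scores, labels) if c]
        controls = [s for s, c in zip(scores, labels) if not c]
        return sum(cases) / len(cases) - sum(controls) / len(controls)
    observed = mean_diff(is_case)
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if mean_diff(labels) >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical gene with 3 rare variants; rows are individuals (0/1/2 allele counts).
genotypes = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0],   # cases
             [0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0]]   # controls
is_case = [True] * 4 + [False] * 4
weights = maf_weights(genotypes)
scores = burden_scores(genotypes, weights)
p_value = permutation_p_value(scores, is_case)
```

Because all effects are summed with the same sign, a score like this loses power when a gene carries both risk and protective variants, which is the case variance-component tests are meant to handle.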

Question 41

How are gene regulatory networks inferred from expression data and what are the challenges?

Accepted Answer

Gene regulatory network (GRN) inference reconstructs transcription factor-target relationships from expression data. Methods: 1) Correlation-based - WGCNA identifies co-expression modules; limited to undirected associations. 2) Information theory - mutual information (ARACNe, CLR) captures non-linear relationships; data processing inequality reduces indirect edges. 3) Regression-based - GENIE3 uses random forests to predict each gene from all TFs; TIGRESS uses stability selection. 4) Bayesian networks - model causal relationships but computationally expensive. 5) Perturbation data - knockouts/knockdowns provide causal information. Challenges: distinguishing direct from indirect interactions; causality vs correlation; combinatorial regulation; context-specificity; validation is expensive. Best practices: integrate multiple inference methods (wisdom of crowds); incorporate ChIP-Seq, ATAC-Seq, and motif data; validate key predictions experimentally.
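
The data processing inequality (DPI) step mentioned for ARACNe can be sketched as follows: within every fully connected gene triplet, the edge with the lowest mutual information is treated as indirect and removed. The dictionary layout, the tolerance handling, and the MI values are assumptions for illustration, not the ARACNe implementation.

```python
from itertools import combinations

def apply_dpi(mi, tolerance=0.0):
    """Drop the weakest edge of every fully connected gene triplet.

    mi maps frozenset({gene_a, gene_b}) to a mutual-information estimate.
    All triplets are evaluated on the original network, then edges are removed.
    """
    genes = sorted({g for pair in mi for g in pair})
    to_remove = set()
    for a, b, c in combinations(genes, 3):
        edges = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if not all(e in mi for e in edges):
            continue  # DPI only applies to closed triangles
        weakest = min(edges, key=lambda e: mi[e])
        others = [mi[e] for e in edges if e != weakest]
        if mi[weakest] < min(others) * (1.0 - tolerance):
            to_remove.add(weakest)
    return {e: v for e, v in mi.items() if e not in to_remove}

# Hypothetical chain TF1 -> G1 -> G2: the TF1-G2 association is indirect.
mi = {
    frozenset(("TF1", "G1")): 0.9,
    frozenset(("G1", "G2")): 0.8,
    frozenset(("TF1", "G2")): 0.5,
}
pruned = apply_dpi(mi)
```

A tolerance above zero keeps edges whose MI is close to the other two, which guards against removing true edges because of estimation noise.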

Question 42

What are the challenges in detecting structural variants and what methods are used?

Accepted Answer

Structural variants (SVs: deletions, duplications, insertions, inversions, and translocations of at least 50 bp) are challenging to detect due to complexity and size. Detection methods by evidence type: 1) Read-pair - discordant mapping (unexpected orientation/distance) indicates SVs; Delly, Lumpy. 2) Split-reads - alignments spanning breakpoints; Pindel. 3) Read-depth - copy number changes from coverage; CNVnator, cn.MOPS. 4) Assembly - local or de novo assembly resolves complex events; SvABA, GRIDSS. 5) Long-reads - span entire SVs enabling direct detection; Sniffles, pbsv, cuteSV; gold standard for complex regions. Challenges: repetitive regions, false positives at mapping artifacts, merging calls across methods, breakpoint resolution, haplotype phasing, genotyping. Best practice: combine callers that use different evidence types, merge calls into an ensemble set (SURVIVOR), and validate with orthogonal methods.
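
The read-pair evidence type can be illustrated with a toy classifier that labels a pair of mates by orientation and distance. It assumes a forward-reverse paired-end library, both mates on the same chromosome, and hypothetical insert-size parameters; real callers cluster many such pairs before calling an SV.

```python
def classify_pair(pos1, strand1, pos2, strand2,
                  mean_insert=400, sd_insert=50, n_sd=4):
    """Label one paired-end alignment by the SV signature it suggests.

    Positions are leftmost alignment coordinates; their difference is used as a
    simplified proxy for insert size.
    """
    if pos1 > pos2:  # order mates by genomic position
        pos1, strand1, pos2, strand2 = pos2, strand2, pos1, strand1
    distance = pos2 - pos1
    if strand1 == strand2:
        return "inversion"            # both mates on the same strand
    if strand1 == "-" and strand2 == "+":
        return "tandem_duplication"   # everted orientation
    if distance > mean_insert + n_sd * sd_insert:
        return "deletion"             # mates map farther apart than expected
    if distance < mean_insert - n_sd * sd_insert:
        return "insertion"            # mates map closer than expected
    return "concordant"

labels = [
    classify_pair(1000, "+", 1400, "-"),
    classify_pair(1000, "+", 5000, "-"),
    classify_pair(1000, "+", 1300, "+"),
    classify_pair(1000, "-", 1400, "+"),
    classify_pair(1000, "+", 1100, "-"),
]
```

A single discordant pair is weak evidence; mapping artifacts in repeats produce the same signatures, which is why the answer recommends combining evidence types.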

Question 43

How are B-cell and T-cell epitopes predicted computationally?

Accepted Answer

Epitope prediction enables vaccine design and immunotherapy development. B-cell epitopes: 1) Linear - predict exposed, flexible, hydrophilic regions using propensity scales (Parker, Kolaskar-Tongaonkar) or machine learning (BepiPred, ABCpred). 2) Conformational (majority) - require 3D structure; predict surface accessibility, protrusion, electrostatics; tools: DiscoTope, ElliPro. T-cell epitopes: 1) MHC Class I (CD8+) - predict peptide-MHC binding using position-specific scoring matrices or neural networks (NetMHCpan, MHCflurry); allele-specific models. 2) MHC Class II (CD4+) - more challenging due to open-ended binding groove; NetMHCIIpan. 3) Processing prediction - proteasomal cleavage, TAP transport. 4) Immunogenicity - not all binders are immunogenic; consider T-cell recognition, self-tolerance. Challenges: polymorphic MHC molecules, rare alleles, validation requirements.
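
Position-specific scoring matrices for MHC class I binding can be pictured as a sliding-window scan over the antigen. The sketch below uses a deliberately tiny, hypothetical matrix with anchor preferences at positions 2 and 9; real predictors such as NetMHCpan learn allele-specific models from binding data and are not reproduced here.

```python
def score_peptide(peptide, pssm):
    """Sum position-specific scores; residues absent from a column score 0."""
    return sum(column.get(aa, 0.0) for aa, column in zip(peptide, pssm))

def scan_protein(sequence, pssm, top_n=3):
    """Score every window of len(pssm) residues and return the best candidates.

    Each result is (score, 1-based start position, peptide).
    """
    k = len(pssm)
    scored = [
        (score_peptide(sequence[i:i + k], pssm), i + 1, sequence[i:i + k])
        for i in range(len(sequence) - k + 1)
    ]
    return sorted(scored, key=lambda item: (-item[0], item[1]))[:top_n]

# HYPOTHETICAL matrix for a 9-mer motif; only the two anchor positions score.
toy_pssm = [{} for _ in range(9)]
toy_pssm[1] = {"L": 2.0, "M": 1.5}   # position 2 anchor
toy_pssm[8] = {"V": 2.0, "L": 1.5}   # position 9 anchor

protein = "GGSALAAAAAAVKKD"  # made-up sequence
best = scan_protein(protein, toy_pssm, top_n=1)[0]
```

A high binding score is only the first filter; as noted above, processing and immunogenicity decide whether a binder is an actual epitope.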

Question 44

How do you ensure reproducibility in bioinformatics analysis pipelines?

Accepted Answer

Reproducibility in bioinformatics requires systematic practices: 1) Environment management - containerization (Docker, Singularity) captures exact software versions; conda environments for reproducible installations. 2) Workflow managers - Snakemake, Nextflow, WDL define pipelines as code with automatic dependency handling, parallel execution, and resumability. 3) Version control - Git for code and pipeline definitions; track all modifications. 4) Data provenance - document input data sources, checksums, access dates; use data repositories (SRA, GEO) with accession numbers. 5) Parameter documentation - record all parameters, random seeds, reference versions. 6) Testing - unit tests for individual tools; integration tests for pipelines; benchmark datasets with known outputs. 7) Documentation - README files, inline comments, analysis notebooks (R Markdown, Jupyter). 8) Publishing - provide code repositories, container images, workflow definitions with publications. Standards: FAIR principles, GA4GH standards.
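
Data provenance and parameter documentation can be partly automated. The sketch below, using only the Python standard library, records input checksums, parameters, and the random seed in a JSON-serializable manifest. The manifest field names are assumptions chosen for illustration, not a community standard.

```python
import hashlib
import json
import platform
import sys

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large FASTQ/BAM inputs are never fully loaded."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(input_paths, parameters, seed):
    """Collect the facts needed to rerun an analysis into one dictionary."""
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "random_seed": seed,
        "parameters": parameters,
        "inputs": {path: sha256_of(path) for path in sorted(input_paths)},
    }

def write_manifest(manifest, out_path):
    """Write the manifest with sorted keys so diffs between runs stay readable."""
    with open(out_path, "w") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

A manifest like this complements containers and workflow managers rather than replacing them: it documents what went in, while the container pins what ran.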

Question 45

What are the key considerations in tumor genomics analysis?

Accepted Answer

Tumor genomics presents unique challenges: 1) Tumor heterogeneity - subclonal populations require high sequencing depth (>100x); clonal architecture reconstruction (PyClone, SciClone). 2) Somatic vs germline - distinguish somatic mutations using matched normal tissue; specialized callers (Mutect2, Strelka2, VarScan2). 3) Tumor purity and ploidy - estimate using copy number tools (ASCAT, ABSOLUTE, FACETS); affects variant allele frequency interpretation. 4) Driver identification - distinguish drivers from passengers using databases (COSMIC, OncoKB), prediction tools (CHASMplus, PolyPhen), and statistical methods (MutSig, dNdScv for significantly mutated genes). 5) Copy number alterations - GISTIC identifies recurrent CNAs; integrate with expression. 6) Mutational signatures - decompose mutation patterns to infer etiologies (COSMIC signatures, SigProfiler). 7) Neoantigen prediction - identify immunogenic mutations for immunotherapy. 8) Liquid biopsy - ctDNA detection requires specialized sensitive methods.
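
The effect of purity and copy number on variant allele frequency can be written down directly. The sketch below uses the common mixture model in which tumor cells (fraction = purity) carry the local tumor copy number and normal cells carry two copies; the function names and example numbers are hypothetical.

```python
def expected_vaf(purity, tumor_cn, mutant_copies, ccf=1.0, normal_cn=2):
    """Expected VAF of a somatic mutation in a tumor/normal cell mixture.

    ccf is the cancer cell fraction: the share of tumor cells carrying the mutation.
    """
    mutant_alleles = purity * ccf * mutant_copies
    total_alleles = purity * tumor_cn + (1.0 - purity) * normal_cn
    return mutant_alleles / total_alleles

def cancer_cell_fraction(vaf, purity, tumor_cn, mutant_copies=1, normal_cn=2):
    """Invert the same model to estimate the cancer cell fraction from an observed VAF."""
    total_alleles = purity * tumor_cn + (1.0 - purity) * normal_cn
    return vaf * total_alleles / (purity * mutant_copies)

# Clonal heterozygous mutation in a diploid region of a 60% pure tumor.
clonal_vaf = expected_vaf(purity=0.6, tumor_cn=2, mutant_copies=1)
# The same region, but the mutation is observed at half that VAF.
subclonal_ccf = cancer_cell_fraction(vaf=0.15, purity=0.6, tumor_cn=2)
```

Under this model a clonal heterozygous mutation in a 60% pure tumor is expected near 30% VAF rather than 50%, which is why purity estimates must precede any clonal versus subclonal call.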

Question 46

How does cryo-EM structure determination workflow differ from X-ray crystallography in terms of computational analysis?

Accepted Answer

Cryo-EM computational analysis has distinct characteristics: 1) Data collection - hundreds of thousands to millions of 2D particle images from frozen-hydrated samples; no crystallization required. 2) Particle picking - identify and extract particles using template matching or neural networks (Topaz, crYOLO). 3) 2D classification - group particles by similar views; remove bad particles and heterogeneous classes. 4) 3D reconstruction - iterative refinement of 3D map from 2D projections; ab initio or reference-based (RELION, cryoSPARC). 5) Resolution determination - FSC (Fourier Shell Correlation) between half-maps. 6) Heterogeneity - 3D classification reveals conformational states; continuous heterogeneity methods emerging. 7) Model building - fit atomic models into density maps; automated tools (ModelAngelo) and manual building (Coot); real-space refinement (Phenix). 8) Validation - map-model FSC, geometry checks. Advantages: captures multiple states, no upper size limit (although small particles, roughly below 50 kDa, remain difficult); challenges: resolution dependent on particle behavior, computational demands.
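
Resolution is usually read off the half-map FSC curve at a fixed threshold. The sketch below assumes the FSC values per Fourier shell have already been computed by the reconstruction software and applies the widely used 0.143 criterion with linear interpolation; the curve, box size, and pixel size are made-up numbers.

```python
def fsc_resolution(fsc, box_size, pixel_size, threshold=0.143):
    """Return the resolution in angstroms where the FSC first drops below threshold.

    fsc[k] is the correlation in Fourier shell k (k = 0 is the origin);
    shell k corresponds to spatial frequency k / (box_size * pixel_size).
    """
    for k in range(1, len(fsc)):
        if fsc[k - 1] >= threshold > fsc[k]:
            # Linear interpolation between the two shells bracketing the threshold.
            fraction = (fsc[k - 1] - threshold) / (fsc[k - 1] - fsc[k])
            shell = (k - 1) + fraction
            if shell <= 0:
                return None  # crossing at the origin is not a meaningful resolution
            return box_size * pixel_size / shell
    return None  # curve never crosses the threshold

# Hypothetical FSC curve for a 100-pixel box sampled at 1.2 angstroms per pixel.
curve = [1.0, 0.99, 0.95, 0.80, 0.50, 0.20, 0.086, 0.01]
resolution = fsc_resolution(curve, box_size=100, pixel_size=1.2)
```

The reported number is a global average; local resolution often varies across a map, especially where conformational heterogeneity remains.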

Question 47

What methods are used for haplotype phasing and why is it important?

Accepted Answer

Haplotype phasing determines which alleles are on the same chromosome, critical for: compound heterozygosity assessment, population genetics, disease inheritance patterns, and personalized medicine. Methods: 1) Statistical phasing - uses population reference panels (1000 Genomes, TOPMed) and linkage disequilibrium patterns; tools: SHAPEIT, Eagle, Beagle; accurate for common variants. 2) Read-backed phasing - phase nearby variants covered by same read/read-pair; limited by read/fragment length; WhatsHap, HapCUT2. 3) Long-read phasing - PacBio HiFi and ONT (Oxford Nanopore) reads span many variants; enables direct phasing over tens of kb. 4) Linked-read phasing (10x Genomics) - molecular barcodes link reads from same molecule. 5) Family-based phasing - use parental genotypes to phase most sites in the child. 6) Hi-C phasing - chromatin contacts provide chromosome-scale phasing. Challenges: rare variants, complex regions, computational cost for long-range phasing.
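
Compound heterozygosity assessment is a direct use of phased genotypes. The sketch below reads two VCF-style phased genotype strings and reports whether the alternate alleles sit on the same haplotype (cis) or on opposite haplotypes (trans). It assumes biallelic sites that belong to the same phase set; the function names are hypothetical.

```python
def parse_phased(gt):
    """Split a phased VCF genotype such as '0|1' into (haplotype 1, haplotype 2) alleles."""
    if "|" not in gt:
        raise ValueError(f"genotype {gt!r} is unphased")
    left, right = gt.split("|")
    return int(left), int(right)

def configuration(gt_a, gt_b):
    """Classify two heterozygous variants in one phase set as 'cis' or 'trans'."""
    a, b = parse_phased(gt_a), parse_phased(gt_b)
    if a[0] == a[1] or b[0] == b[1]:
        return "not both heterozygous"
    # Identical haplotype patterns mean both alternate alleles share a chromosome.
    return "cis" if a == b else "trans"

# Two damaging variants in the same recessive-disease gene.
same_copy = configuration("0|1", "0|1")      # both hits on one copy, other copy intact
both_copies = configuration("0|1", "1|0")    # one hit on each copy of the gene
```

Only the trans configuration disrupts both copies of the gene, which is why unphased heterozygous calls are not enough to establish compound heterozygosity.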

Question 48

How are deep learning architectures adapted for genomic sequence analysis?

Accepted Answer

Deep learning architectures for genomics: 1) Convolutional Neural Networks (CNNs) - detect motifs and patterns in DNA/protein sequences; 1D convolutions scan sequences like text; DeepBind, Basenji for regulatory element prediction. 2) Recurrent Networks (LSTMs, GRUs) - capture long-range dependencies in sequences; used in protein language models. 3) Attention/Transformers - self-attention models capture long-range interactions without position constraints; protein language models (ESM, ProtTrans); Enformer for expression prediction spanning 200kb. 4) Graph Neural Networks - represent molecules and protein structures as graphs; capture spatial relationships. 5) Variational Autoencoders - generate novel sequences, protein design. Considerations: one-hot encoding vs embeddings; handling variable-length sequences; interpretability (attention weights, gradient-based attribution, in-silico mutagenesis); data augmentation (reverse complement); transfer learning from pre-trained models.
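
Two of the considerations above, one-hot encoding and reverse-complement augmentation, fit in a few lines. This sketch uses plain Python lists rather than a tensor library, orders the columns as A, C, G, T, and encodes N as an all-zero row; those choices are assumptions, not a fixed convention.

```python
ALPHABET = "ACGT"
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def one_hot(sequence):
    """Encode DNA as a length x 4 matrix (columns A, C, G, T); N becomes all zeros."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    matrix = []
    for base in sequence.upper():
        row = [0.0] * 4
        if base in index:
            row[index[base]] = 1.0
        matrix.append(row)
    return matrix

def reverse_complement(sequence):
    """Return the reverse complement of a DNA string."""
    return sequence.upper().translate(COMPLEMENT)[::-1]

def reverse_complement_one_hot(matrix):
    """Reverse both axes: with A, C, G, T column order this is the reverse complement."""
    return [row[::-1] for row in matrix[::-1]]

encoded = one_hot("ACGTN")
augmented = reverse_complement_one_hot(encoded)
```

Flipping both axes works only because A pairs with T and C with G at mirrored column positions, so the augmentation can be applied to encoded batches without going back to strings.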

Question 49

How is causal inference performed in omics studies and what are the limitations?

Accepted Answer

Causal inference distinguishes causation from correlation in omics data. Approaches: 1) Mendelian Randomization (MR) - uses genetic variants as instrumental variables to test causal effects of exposures on outcomes; SNPs associated with exposure used as instruments; robust to confounding and reverse causation when the instrument assumptions hold. Methods: inverse variance weighted, MR-Egger (tests for directional pleiotropy), GSMR. 2) Mediation analysis - tests if effect of A on C is mediated through B. 3) Intervention studies - perturbations (CRISPR, drugs) provide causal evidence; Perturb-seq combines single-cell and perturbation. 4) Time-series - Granger causality, dynamic causal modeling. 5) Causal discovery - PC algorithm, FCI for learning causal graphs from observational data. Limitations: genetic instrument validity (pleiotropy, weak instruments); unmeasured confounders in observational data; model assumptions; distinguishing direct from indirect effects; generalizing across populations.
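
The inverse variance weighted (IVW) estimator combines per-SNP ratio estimates into one causal effect. The sketch below implements the fixed-effect form from summary statistics, assuming independent instruments and negligible error in the SNP-exposure effects; the effect sizes are invented so that the true ratio is 0.5.

```python
from math import sqrt

def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect IVW Mendelian randomization estimate and its standard error.

    Equivalent to a weighted regression of SNP-outcome effects on SNP-exposure
    effects through the origin, with weights 1 / se_outcome**2.
    """
    numerator = sum(bx * by / se ** 2
                    for bx, by, se in zip(beta_exposure, beta_outcome, se_outcome))
    denominator = sum(bx ** 2 / se ** 2
                      for bx, se in zip(beta_exposure, se_outcome))
    return numerator / denominator, sqrt(1.0 / denominator)

# Hypothetical summary statistics for three independent instruments.
beta_exposure = [0.10, 0.20, 0.15]
beta_outcome = [0.05, 0.10, 0.075]
se_outcome = [0.01, 0.02, 0.015]
effect, standard_error = ivw_estimate(beta_exposure, beta_outcome, se_outcome)
```

Because the regression is forced through the origin, any pleiotropic effect that does not average to zero biases the estimate, which is the situation MR-Egger's intercept is designed to detect.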

Question 50

How are genetic variants classified for clinical interpretation and what resources are used?

Accepted Answer

Clinical variant interpretation follows ACMG/AMP guidelines classifying variants into 5 categories: Pathogenic, Likely Pathogenic, Uncertain Significance (VUS), Likely Benign, Benign. Evidence types: 1) Population data - allele frequency in gnomAD; absence or rarity in population databases supports pathogenicity, while a frequency too high for the disease supports a benign classification. 2) Computational predictions - CADD, REVEL, SpliceAI for predicted impact; supporting evidence only. 3) Functional data - experimental assays, cell-based studies, animal models; strong evidence. 4) Segregation - co-segregation with disease in families. 5) De novo status - confirmed de novo variants in sporadic cases. Resources: ClinVar (aggregated interpretations), HGMD (mutation database), gnomAD (population frequencies), UniProt/InterPro (functional domains), PubMed (literature). Challenges: VUS resolution requires functional studies or additional cases; reinterpretation as knowledge grows; phenotype-genotype correlation; somatic vs germline distinction in cancer. Tools: VarSome and InterVar automate much of the ACMG criteria scoring, but results still need expert review.
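
Once individual criteria have been assigned, they are combined into one of the five categories. The sketch below encodes the combining rules of the 2015 ACMG/AMP guideline as they are commonly summarized, taking counts of criteria per strength level. It is a simplification (it uses "at least" wherever the guideline lists exact counts and ignores later point-based refinements) and should be checked against the published table before any real use.

```python
def acmg_classify(pvs=0, ps=0, pm=0, pp=0, ba=0, bs=0, bp=0):
    """Combine counts of met criteria into a five-tier class.

    pvs/ps/pm/pp: very strong, strong, moderate, supporting pathogenic criteria.
    ba/bs/bp: stand-alone, strong, supporting benign criteria.
    """
    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3 or (pm >= 2 and pp >= 2) or (pm >= 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "Uncertain Significance"  # conflicting evidence
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely Pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely Benign"
    return "Uncertain Significance"      # not enough evidence either way

# Hypothetical variant: a null variant (very strong) with one strong criterion met.
example = acmg_classify(pvs=1, ps=1)
```

Deciding which criteria are met is the hard part and remains a matter of expert judgment; the combination step shown here is the only part that is purely mechanical.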

Bioinformatics Interview Questions