Question 1

What is a database and what is a DBMS?

Accepted Answer

A database is an organized collection of structured data stored electronically. A DBMS (Database Management System) is software that manages databases, providing interfaces for defining, creating, querying, updating, and administering databases. DBMS handles data storage, retrieval, security, concurrency, backup, and recovery. Examples include MySQL, PostgreSQL, Oracle, MongoDB, and SQL Server.

Question 2

What is the difference between a primary key and a foreign key?

Accepted Answer

A primary key uniquely identifies each row in a table - it must be unique, not null, and there can be only one per table. A foreign key is a column that references the primary key of another table, establishing a relationship between tables. Foreign keys enforce referential integrity - values must exist in the referenced table or be null (if allowed). Example: order.customer_id references customer.id.

Question 3

What are the ACID properties in database transactions?

Accepted Answer

ACID ensures reliable transaction processing: Atomicity (transactions are all-or-nothing - either all operations complete or none do), Consistency (transactions bring database from one valid state to another), Isolation (concurrent transactions don't interfere with each other), Durability (committed transactions persist even after system failure). ACID properties are fundamental to relational databases and ensure data integrity.

Question 4

What are the different types of SQL JOINs?

Accepted Answer

INNER JOIN returns matching rows from both tables. LEFT JOIN returns all rows from left table plus matching rows from right (nulls for non-matches). RIGHT JOIN is opposite of LEFT. FULL OUTER JOIN returns all rows from both tables (nulls for non-matches). CROSS JOIN returns Cartesian product of both tables. Self-join joins a table to itself. Each type serves different purposes for combining data across tables.

Question 5

What is a database index and why is it used?

Accepted Answer

An index is a data structure that improves the speed of data retrieval operations on a table by providing quick access paths to rows. Like a book index, it maps column values to row locations. Without indexes, queries perform full table scans (O(n)). Indexes enable O(log n) lookups using B-trees. Trade-off: indexes speed up reads but slow down writes and use additional storage. Index columns used in WHERE, JOIN, and ORDER BY clauses.

Question 6

What is the difference between SQL and NoSQL databases?

Accepted Answer

SQL databases are relational, using structured tables with predefined schemas, supporting ACID transactions and complex queries via SQL. NoSQL databases are non-relational with flexible schemas - types include document (MongoDB), key-value (Redis), wide-column (Cassandra), and graph (Neo4j). NoSQL offers horizontal scaling and schema flexibility but often sacrifices ACID for eventual consistency. Choose based on data structure, scalability needs, and consistency requirements.

Question 7

What is database normalization and why is it important?

Accepted Answer

Normalization is the process of organizing a database to reduce data redundancy and improve data integrity. It divides large tables into smaller related tables linked by relationships. Benefits: eliminates redundant data (saves storage), prevents update anomalies (inconsistent data), enforces data dependencies logically. Normal forms (1NF through 5NF) define progressive levels of normalization. Most databases aim for 3NF or BCNF as a practical balance.

Question 8

What is the difference between WHERE and HAVING clauses in SQL?

Accepted Answer

WHERE filters rows before grouping and aggregation - it cannot use aggregate functions. HAVING filters groups after GROUP BY and aggregation - it can use aggregate functions. Example: WHERE filters individual orders (WHERE amount > 100), HAVING filters customer groups (HAVING SUM(amount) > 1000). Execution order: FROM -> WHERE -> GROUP BY -> HAVING -> SELECT -> ORDER BY. Use WHERE when possible for better performance.

Question 9

What is a database transaction?

Accepted Answer

A transaction is a sequence of database operations executed as a single logical unit of work. It either completes entirely (commits) or has no effect (rolls back). Example: bank transfer debits one account and credits another - both must succeed or neither should. Transactions are controlled with BEGIN, COMMIT, and ROLLBACK statements. They ensure data consistency even with concurrent access and system failures.

Question 10

What is the difference between UNIQUE and PRIMARY KEY constraints?

Accepted Answer

Both ensure uniqueness, but PRIMARY KEY also enforces NOT NULL and only one is allowed per table - it defines the main identifier for rows. UNIQUE allows NULL values (one null per column in most databases) and multiple UNIQUE constraints can exist on a table. PRIMARY KEY typically creates a clustered index (physical row ordering), while UNIQUE creates a non-clustered index. Use PRIMARY KEY for main identifier, UNIQUE for alternate keys.

Question 11

What are aggregate functions in SQL? Give examples.

Accepted Answer

Aggregate functions perform calculations on sets of rows and return a single value. Common functions: COUNT (number of rows), SUM (total of numeric column), AVG (average value), MIN (smallest value), MAX (largest value). Used with GROUP BY to calculate aggregates per group. Example: SELECT department, AVG(salary) FROM employees GROUP BY department. COUNT(*) counts all rows; COUNT(column) counts non-null values.

Question 12

What is the difference between clustered and non-clustered indexes?

Accepted Answer

A clustered index determines the physical order of data in the table - the table IS the index. Only one clustered index per table (usually on primary key). A non-clustered index is a separate structure with pointers to data rows - multiple can exist per table. Clustered index lookups are faster as data is retrieved directly. Non-clustered requires additional lookup to get actual data (bookmark lookup). Choose based on query patterns.

Question 13

How does NULL work in SQL and what are its implications?

Accepted Answer

NULL represents missing or unknown data - it's not zero, empty string, or false. NULL comparisons require IS NULL or IS NOT NULL (using = returns UNKNOWN, not TRUE/FALSE). NULL in calculations propagates: 5 + NULL = NULL. Aggregate functions ignore NULLs (except COUNT(*)). NVL/COALESCE/IFNULL handle NULLs. NULL affects index usage, join behavior, and sorting. Design carefully: sometimes a default value is better than allowing NULLs.

Question 14

What is an ER diagram and what are its components?

Accepted Answer

An Entity-Relationship (ER) diagram is a visual representation of database structure showing entities (tables), their attributes (columns), and relationships between them. Components: Rectangles represent entities, Ovals represent attributes, Diamonds represent relationships, Lines connect entities through relationships. Relationship cardinalities show one-to-one (1:1), one-to-many (1:N), or many-to-many (M:N). ER diagrams help in database design before implementation.

Question 15

What is the difference between DELETE and TRUNCATE commands?

Accepted Answer

DELETE removes rows based on condition (WHERE clause), logs each deletion for rollback, fires triggers, and is slower. TRUNCATE removes all rows, uses minimal logging (deallocates pages), doesn't fire row triggers, resets identity/auto-increment, and is faster. DELETE can be rolled back in transaction; TRUNCATE behavior varies (some DBMS support rollback). Use DELETE for selective removal, TRUNCATE for clearing entire table. Neither removes table structure.

Question 16

Explain the different normal forms (1NF, 2NF, 3NF, BCNF).

Accepted Answer

1NF: atomic values, no repeating groups (each cell has single value). 2NF: 1NF + no partial dependencies (non-key attributes depend on entire primary key, not part of it). 3NF: 2NF + no transitive dependencies (non-key attributes depend only on primary key, not through other non-key attributes). BCNF: every determinant is a candidate key (stronger than 3NF). Higher forms reduce redundancy but may impact query performance through more joins.

Question 17

Explain the different transaction isolation levels.

Accepted Answer

Isolation levels define visibility of uncommitted changes: READ UNCOMMITTED (dirty reads allowed - sees uncommitted data), READ COMMITTED (only sees committed data - default in many DBMS), REPEATABLE READ (same query returns same results within transaction), SERIALIZABLE (highest isolation - transactions appear sequential). Higher isolation prevents more anomalies (dirty/non-repeatable reads, phantom rows) but reduces concurrency. Trade-off between consistency and performance.

Question 18

How does a composite index work and when should you use it?

Accepted Answer

A composite (compound) index includes multiple columns. Column order matters - index is most effective when queries filter on leftmost columns (leftmost prefix principle). Index on (A, B, C) helps queries on A, (A,B), or (A,B,C), but not B alone or (B,C). Consider selectivity: place most selective columns first. Useful for queries with multiple WHERE conditions or ORDER BY on multiple columns. Covering index includes all query columns, avoiding table access.

Question 19

What is a query execution plan and how do you read it?

Accepted Answer

An execution plan shows how the database executes a query - operations, order, costs, and data access methods. View with EXPLAIN (MySQL/PostgreSQL) or SET STATISTICS IO (SQL Server). Key elements: Seq Scan vs Index Scan, Join types (nested loop, hash, merge), estimated rows, cost/actual time. Read bottom-up or inside-out (innermost operations first). Use to identify full table scans, missing indexes, inefficient joins. Essential for query optimization.

Question 20

What is database sharding and what are its strategies?

Accepted Answer

Sharding horizontally partitions data across multiple database instances for scalability. Strategies: Range-based (shard by value ranges - e.g., user IDs 1-1M on shard1), Hash-based (consistent hashing of key - even distribution but no range queries), Directory-based (lookup service maps keys to shards - flexible but lookup overhead), Geographic (by user location). Challenges: cross-shard queries, transactions across shards, rebalancing. Often used with application-level routing.

Question 21

When should you use a subquery vs a JOIN?

Accepted Answer

JOINs are typically preferred: optimizers handle them well, easier to read, can return columns from both tables. Use subqueries when: checking existence (EXISTS), comparing against aggregates, or for clarity in complex logic. Correlated subqueries (reference outer query) execute per row - can be slow. Modern optimizers often transform subqueries to joins. EXISTS can outperform IN for large datasets. Always test with execution plans - performance varies by query and data.

Question 22

What is database replication and what are the different types?

Accepted Answer

Replication copies data across multiple database servers. Types: Master-Slave (writes to master, reads from slaves - read scaling), Master-Master (writes to any master, complex conflict resolution), Synchronous (waits for replicas - consistency but slower), Asynchronous (doesn't wait - faster but potential data loss). Benefits: read scaling, high availability, geographic distribution. Challenges: replication lag, conflict resolution, consistency guarantees.

Question 23

What are stored procedures and what are their advantages and disadvantages?

Accepted Answer

Stored procedures are precompiled SQL code stored in the database. Advantages: reduced network traffic (send procedure call not full SQL), security (grant access to procedure not tables), performance (precompiled, cached execution plans), encapsulation (centralized business logic). Disadvantages: vendor lock-in, harder to version control, debugging challenges, may not scale horizontally, logic split between app and database. Modern trend favors application-level logic.

Question 24

Compare document databases with relational databases.

Accepted Answer

Document databases (MongoDB) store JSON-like documents with flexible schemas - good for hierarchical data, rapid development, varying structures. Relational databases use fixed tables with relationships - good for structured data, complex queries, ACID transactions. Documents embed related data (fewer joins but possible duplication); relational normalizes (no duplication but more joins). Choose based on: data relationships, consistency needs, schema stability, query complexity.

Question 25

What is a database deadlock and how is it handled?

Accepted Answer

Deadlock occurs when two transactions wait for each other's locks indefinitely. Example: T1 locks row A, wants B; T2 locks row B, wants A. Detection: wait-for graph with cycles. Resolution: one transaction is chosen as victim and rolled back (based on cost, age, or priority). Prevention: acquire all locks upfront, order lock acquisition consistently, use shorter transactions, lower isolation levels. Most databases detect and resolve automatically.

Question 26

What are SQL window functions and how are they used?

Accepted Answer

Window functions perform calculations across rows related to current row without collapsing rows (unlike GROUP BY). Syntax: function() OVER (PARTITION BY cols ORDER BY cols). Functions include: ROW_NUMBER(), RANK(), DENSE_RANK() for ranking; SUM(), AVG() for running totals; LAG(), LEAD() for accessing adjacent rows; FIRST_VALUE(), LAST_VALUE() for boundary values. Example: ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) ranks employees within each department.

Question 27

Explain the CAP theorem and its implications.

Accepted Answer

CAP theorem states a distributed system can guarantee at most two of three properties: Consistency (all nodes see same data), Availability (every request gets response), Partition tolerance (system works despite network partitions). Since partitions are inevitable, choice is between CP (consistent but may be unavailable - MongoDB, HBase) and AP (available but eventually consistent - Cassandra, DynamoDB). PACELC extends this: when no partition, trade-off is Latency vs Consistency.

Question 28

When should you NOT use an index?

Accepted Answer

Avoid indexes when: table is small (full scan is fast enough), columns have low selectivity (many duplicates - e.g., gender), table is heavily write-intensive (index maintenance overhead), columns are rarely queried, using expressions/functions on columns (unless using expression index), wide columns (large index size). Also avoid over-indexing - each index slows writes and uses storage. Monitor unused indexes and remove them.

Question 29

What is denormalization and when should you use it?

Accepted Answer

Denormalization intentionally adds redundancy to improve read performance by reducing joins. Techniques: adding computed columns, duplicating data across tables, pre-aggregating values. Use when: read performance is critical, joins are expensive, data changes infrequently, query patterns are well-known. Trade-offs: increased storage, update complexity (must update all copies), potential inconsistency. Common in data warehouses, caching, and reporting systems. Always measure actual performance impact.

Question 30

Compare optimistic and pessimistic locking strategies.

Accepted Answer

Pessimistic locking locks resources when accessed, preventing conflicts but reducing concurrency - SELECT FOR UPDATE. Good when conflicts are frequent. Optimistic locking assumes conflicts are rare, checks at commit time (using version/timestamp) - if data changed, reject and retry. Better for high-read systems with infrequent writes. ORM frameworks often support optimistic locking. Choice depends on conflict frequency and acceptable retry rate.

Question 31

What data structures does Redis support and when would you use each?

Accepted Answer

Redis supports: Strings (caching, counters), Lists (queues, timelines), Sets (unique items, tags), Sorted Sets (leaderboards, rate limiting), Hashes (objects, user sessions), Bitmaps (flags, bloom filters), HyperLogLog (cardinality estimation), Streams (event sourcing, logs). Use Strings for simple caching, Sorted Sets for ranked data with scores, Hashes for structured objects, Lists for FIFO queues. Pub/Sub enables messaging. Choose based on access patterns.

Question 32

What are common SQL query optimization techniques?

Accepted Answer

Key techniques: Use appropriate indexes (check EXPLAIN), avoid SELECT * (retrieve only needed columns), use JOINs over subqueries when possible, limit result sets early (WHERE before JOIN), avoid functions on indexed columns in WHERE, use EXISTS over IN for large datasets, batch operations, use appropriate data types, avoid OR (use UNION instead), leverage query caching. Monitor slow query logs and analyze execution plans regularly.

Question 33

Explain the MongoDB aggregation pipeline.

Accepted Answer

Aggregation pipeline processes documents through stages: $match (filter), $project (reshape), $group (aggregate by key), $sort, $limit, $skip, $unwind (deconstruct arrays), $lookup (join collections), $facet (multiple pipelines). Documents flow through stages in order. Each stage transforms documents and passes to next. Example: match active users, group by country, calculate average age. Similar to SQL GROUP BY but more flexible. Consider indexes on $match fields for performance.

Question 34

What is database connection pooling and why is it important?

Accepted Answer

Connection pooling maintains a cache of database connections for reuse, avoiding the overhead of establishing new connections for each request. Creating connections is expensive (TCP handshake, authentication, resource allocation). Pool manages min/max connections, connection timeout, validation. Benefits: reduced latency, better resource utilization, connection limit enforcement. Libraries: HikariCP (Java), pgBouncer (PostgreSQL), mysql-connector-pool. Configure based on load and database limits.

Question 35

What are database views and materialized views?

Accepted Answer

Views are virtual tables defined by queries - they don't store data but execute the query each time accessed. Uses: simplify complex queries, provide security (expose subset of data), abstract schema changes. Materialized views store query results physically, updating periodically or on demand. They trade storage for query speed. Use regular views for security/abstraction, materialized views for expensive queries with acceptable staleness. PostgreSQL supports incremental refresh.

Question 36

How does Multi-Version Concurrency Control (MVCC) work?

Accepted Answer

MVCC maintains multiple versions of data to allow concurrent reads without blocking writes. Each transaction sees a snapshot at its start time. PostgreSQL: old versions kept in same table (requires VACUUM cleanup). MySQL InnoDB: undo logs store old versions, rolled back after oldest transaction completes. Versions tracked by transaction IDs. Readers never block writers and vice versa. Trade-off: storage overhead, cleanup complexity. Enables READ COMMITTED and SNAPSHOT isolation efficiently.

Question 37

Compare B-tree and LSM-tree storage engines.

Accepted Answer

B-trees (MySQL InnoDB, PostgreSQL) update data in-place, optimized for reads with O(log n) random I/O per operation. LSM-trees (RocksDB, Cassandra) batch writes to memory, flush sorted segments to disk, merge in background. LSM optimizes writes (sequential I/O) but reads may check multiple levels (bloom filters help). B-tree: better read latency, consistent performance. LSM: better write throughput, space efficiency with compression. Choose based on read/write ratio.

Question 38

How do distributed transactions work (2PC, Saga pattern)?

Accepted Answer

Two-Phase Commit (2PC): coordinator sends prepare to all participants; if all vote yes, sends commit, else abort. Guarantees atomicity but blocking and slow. Saga pattern: sequence of local transactions with compensating transactions for rollback. Choreography (events) or orchestration (central coordinator). Eventual consistency, no distributed locks. 2PC for strong consistency (traditional databases), Saga for microservices and long-running transactions. Consider Paxos/Raft for consensus.

Question 39

Explain horizontal vs vertical partitioning and partition strategies.

Accepted Answer

Vertical partitioning splits columns across tables (normalize or separate hot/cold columns). Horizontal partitioning (sharding) splits rows. Strategies: Range (by value ranges - time, ID), Hash (consistent hash of key - even distribution), List (by explicit values - geography), Composite (combined). Consider: query patterns (avoid cross-partition queries), data distribution (avoid hotspots), rebalancing needs. PostgreSQL has declarative partitioning; application-level sharding for cross-node.

Question 40

How does a database query optimizer work?

Accepted Answer

Query optimizer transforms logical query into efficient physical plan. Steps: parsing (syntax tree), semantic analysis (resolve names, types), logical optimization (predicate pushdown, join reordering), physical optimization (choose algorithms, indexes). Uses statistics (cardinality estimates, histograms) to estimate costs. Algorithms: dynamic programming for join ordering, branch-and-bound pruning. Generates multiple plans, estimates costs, chooses cheapest. Hints can override. Bad statistics cause poor plans.

Question 41

Explain the Raft consensus algorithm used in distributed databases.

Accepted Answer

Raft ensures distributed consensus for replicated state machines. Nodes are leader, follower, or candidate. Leader election: followers timeout waiting for heartbeat, become candidates, request votes. Leader handles all writes, replicates log entries to followers. Safety: committed entries are durable. Split-brain prevented by requiring majority quorum. Simpler than Paxos. Used in etcd, CockroachDB, TiKV. Handles leader failure, network partitions, and ensures linearizability.

Question 42

How are time-series databases optimized for temporal data?

Accepted Answer

Time-series databases (InfluxDB, TimescaleDB, Prometheus) optimize for high-volume timestamped data. Optimizations: column-oriented storage, time-based partitioning (automatic retention), compression (delta, XOR for floats, dictionary for strings), write batching, downsampling/aggregation, specialized indexes (time range). Query patterns: time ranges, aggregations, latest values. Schema: measurement + tags (indexed) + fields. Hot/cold data tiering. Regular data expiration. Consider cardinality limits.

Question 43

How do graph databases handle traversals differently from relational joins?

Accepted Answer

Graph databases (Neo4j, Neptune) store relationships as first-class citizens with direct pointers between nodes - traversals follow pointers with O(1) per hop regardless of data size. Relational joins compute relationships at query time using indexes - cost grows with data size. Graph advantages: variable-length paths, pattern matching (Cypher, Gremlin), relationship properties. Index-free adjacency enables efficient multi-hop traversals. Best for highly connected data: social networks, recommendations, fraud detection.

Question 44

What is Change Data Capture (CDC) and how is it implemented?

Accepted Answer

CDC captures database changes (inserts, updates, deletes) for streaming to other systems. Methods: Log-based (read transaction log - Debezium, most efficient, no query overhead), Trigger-based (database triggers write changes - adds latency), Timestamp-based (poll for recent changes - simple but misses deletes), Diff-based (compare snapshots - resource intensive). Use cases: event sourcing, search index sync, data warehousing, microservices sync. Log-based preferred for production; consider schema evolution.

Question 45

Compare different database caching strategies.

Accepted Answer

Cache-Aside: application checks cache, on miss queries DB and populates cache. Simple but cache misses slow. Write-Through: writes go to cache and DB synchronously - consistent but slower writes. Write-Behind (Write-Back): writes to cache, async to DB - fast writes but potential data loss. Read-Through: cache handles DB reads transparently. Refresh-Ahead: predict and preload frequently accessed data. Consider: invalidation (TTL, events), consistency requirements, cache size, hit rate. Redis, Memcached commonly used.

Question 46

Explain Serializable Snapshot Isolation (SSI).

Accepted Answer

SSI provides serializable isolation without the blocking of traditional locking. Based on snapshot isolation, it detects and aborts transactions that could cause serialization anomalies. Tracks read-write dependencies between concurrent transactions; if dangerous pattern detected (write skew potential), one transaction aborts. Used in PostgreSQL SERIALIZABLE mode and CockroachDB. Offers better performance than S2PL while maintaining serializability. Some false positives lead to unnecessary aborts.

Question 47

How do columnar databases optimize for analytical workloads?

Accepted Answer

Columnar databases (Redshift, BigQuery, ClickHouse) store data by column instead of row. Benefits: read only needed columns (reduced I/O), better compression (similar values together - RLE, dictionary, delta), vectorized processing (SIMD on column batches), zone maps/min-max indexes (skip blocks). Trade-offs: row reconstruction for wide reads, slower single-row operations, complex updates. Perfect for aggregations, scans on few columns. Row stores better for OLTP with full-row access.

Question 48

How do globally distributed databases achieve low-latency writes?

Accepted Answer

Global databases (CockroachDB, Spanner, YugabyteDB) face latency challenge: consensus requires round-trips across regions. Optimizations: Spanner uses TrueTime (GPS + atomic clocks) for lock-free reads. Follower reads (read from local replica at slightly older timestamp). Write locality (place data near writers using geo-partitioning). Multi-region Paxos groups. CockroachDB uses hybrid logical clocks. Trade-offs: consistency vs latency. Consider: data placement, read/write patterns, consistency requirements per use case.

Question 49

Explain table bloat and VACUUM in PostgreSQL.

Accepted Answer

PostgreSQL MVCC keeps old row versions (dead tuples) until no transaction needs them. Bloat: dead tuples waste space, degrade performance. VACUUM marks space reusable but doesn't return to OS. VACUUM FULL rewrites table (locks it). Autovacuum runs automatically based on thresholds. Bloat causes: long transactions, heavy updates/deletes, autovacuum lag. Monitor: pg_stat_user_tables dead tuple count, pgstattuple extension. Prevention: tune autovacuum, avoid long transactions, consider fillfactor for update-heavy tables.

Question 50

Design a database disaster recovery strategy.

Accepted Answer

Comprehensive DR involves: RPO (Recovery Point Objective - max data loss) and RTO (Recovery Time Objective - max downtime). Components: continuous backups (WAL archiving, snapshots), replication (sync for zero data loss, async for performance), geographic distribution (multi-region replicas), automated failover (health checks, DNS/proxy switch), backup testing (regular restore drills). Tools: pg_basebackup, Barman, cloud snapshots. Consider: cost vs recovery guarantees, data sovereignty, network partitions, runbook documentation.

Database Systems Interview Questions