Question 1

What is the difference between horizontal and vertical scaling?

Accepted Answer

Vertical scaling (scaling up) adds more resources (CPU, RAM) to existing servers - simpler, but bounded by hardware limits and still a single point of failure. Horizontal scaling (scaling out) adds more servers to distribute load - more complex, but offers better fault tolerance, theoretically unlimited scale, and is often more cost-effective. Most large-scale systems use horizontal scaling. Stateless applications are easier to scale horizontally; stateful ones require data partitioning or session management.

Question 2

What is a load balancer and why is it needed?

Accepted Answer

A load balancer distributes incoming network traffic across multiple servers to ensure no single server is overwhelmed. Benefits: improved availability (servers can fail), better performance (distribute load), scalability (add servers easily), health monitoring (route away from unhealthy servers). Types: hardware (F5), software (Nginx, HAProxy), cloud (AWS ALB/NLB). Algorithms: round-robin, least connections, IP hash. Essential component for any scalable web application.
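
As a rough illustration of two of the algorithms named above, the sketch below implements round-robin and least-connections selection in memory. The class names and the server names (`app-1`, `app-2`, `app-3`) are made up for the example; a real load balancer would also handle health checks and concurrency.

```python
import itertools


class RoundRobinBalancer:
    """Cycles through the servers in a fixed order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Picks the server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties are broken by insertion order of the dict.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


rr = RoundRobinBalancer(["app-1", "app-2", "app-3"])
rr_picks = [rr.pick() for _ in range(4)]

lc = LeastConnectionsBalancer(["app-1", "app-2"])
first = lc.acquire()    # both idle, first server wins the tie
second = lc.acquire()   # the other server now has fewer connections
lc.release(first)       # the first request finishes
third = lc.acquire()    # the first server is the least loaded again
```

Round-robin ignores how long requests take, while least connections adapts to uneven request durations at the cost of tracking state per server.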

Question 3

What is caching and why is it important in system design?

Accepted Answer

Caching stores copies of frequently accessed data in faster storage (memory vs disk, closer to user). Benefits: reduced latency (faster responses), decreased database load, improved throughput, cost savings. Cache layers: browser cache, CDN, application cache (Redis, Memcached), database query cache. Key considerations: cache invalidation (TTL, events), cache-aside vs read-through, cold start, memory limits. Caching is fundamental for performance at scale.
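
A minimal sketch of the cache-aside pattern with TTL-based invalidation, assuming an in-process dictionary stands in for Redis or Memcached. `TTLCache`, `get_user`, and `fake_db` are hypothetical names; the clock is injectable only so the example behaves deterministically.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)


def get_user(user_id, cache, load_from_db):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    value = load_from_db(user_id)
    cache.set(user_id, value)
    return value


now = [0.0]    # fake clock so the example does not need to sleep
db_calls = []


def fake_db(user_id):
    db_calls.append(user_id)
    return {"id": user_id, "name": "Ada"}


cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
get_user(1, cache, fake_db)   # miss: loads from the database
get_user(1, cache, fake_db)   # hit: served from the cache
now[0] = 61.0
get_user(1, cache, fake_db)   # entry expired: loads from the database again
```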

Question 4

When would you choose SQL vs NoSQL databases?

Accepted Answer

SQL databases (PostgreSQL, MySQL) for: structured data with relationships, complex queries and joins, ACID transactions, data integrity critical. NoSQL databases (MongoDB, Cassandra, Redis) for: flexible/evolving schemas, high write throughput, horizontal scaling, specific data models (document, key-value, graph). Consider: consistency requirements (CAP theorem), query patterns, scaling needs, team expertise. Many systems use both (polyglot persistence) for different use cases.

Question 5

What are the key principles of RESTful API design?

Accepted Answer

REST principles: stateless (no server-side session), resource-based URLs (/users, /orders/{id}), HTTP methods for actions (GET read, POST create, PUT update, DELETE remove), standard status codes (200 OK, 201 Created, 404 Not Found), JSON for data exchange. Design practices: use nouns for resources, version APIs (/v1/), pagination for lists, consistent error format, HATEOAS for discoverability. RESTful APIs are the standard for web services.

Question 6

What are the benefits and challenges of microservices architecture?

Accepted Answer

Benefits: independent deployment, technology flexibility per service, team autonomy, fault isolation, scalability per service. Challenges: distributed system complexity, network latency, data consistency across services, operational overhead (deployment, monitoring), service discovery, debugging distributed transactions. Microservices require: mature DevOps practices, observability stack, clear service boundaries. Start with monolith if domain unclear. Microservices solve organizational scaling as much as technical.

Question 7

What is a message queue and when would you use one?

Accepted Answer

Message queues enable asynchronous communication between services by storing messages until consumers process them. Benefits: decoupling (producers/consumers independent), buffering (handle traffic spikes), reliability (messages persist until processed), scaling (multiple consumers). Use cases: order processing, sending emails/notifications, video processing, event streaming. Examples: RabbitMQ (traditional queuing), Kafka (event streaming), SQS (cloud). Essential for building resilient, scalable systems.

Question 8

What is database replication and why is it used?

Accepted Answer

Replication copies data across multiple database servers. Benefits: high availability (failover if primary fails), read scaling (distribute reads across replicas), data locality (replicas in different regions). Types: synchronous (waits for replicas, stronger consistency) vs asynchronous (faster, eventual consistency). Master-slave (writes to master only) vs master-master (writes to any). Challenges: replication lag, conflict resolution (multi-master). Most production databases use replication for availability.

Question 9

What does it mean to design a stateless service?

Accepted Answer

A stateless service does not store client session state between requests - each request contains all the information needed to process it. Benefits: easier horizontal scaling (any server can handle any request), simpler failover (no session to migrate), better caching. State storage: externalize to a database or cache (Redis). Stateless examples: REST APIs, serverless functions. Inherently stateful: WebSocket connections (tied to one server) and in-memory shopping carts; move cart data to an external store such as Redis so any server can serve it. Design for statelessness to maximize scalability and resilience.

Question 10

How does a CDN improve application performance?

Accepted Answer

CDN (Content Delivery Network) caches content at edge locations worldwide, serving users from nearby servers. Performance benefits: reduced latency (shorter physical distance), decreased origin load, improved availability (distributed, redundant). Content types: static files (images, CSS, JS), video streaming, API responses (edge caching). Providers: Cloudflare, Akamai, CloudFront. Additional features: DDoS protection, SSL termination, compression. Essential for global applications and media delivery.

Question 11

What is a single point of failure (SPOF) and how do you eliminate it?

Accepted Answer

SPOF is any component whose failure would bring down the entire system. Elimination strategies: redundancy (multiple instances), load balancing (distribute across instances), failover (automatic switch to backup), geographic distribution. Common SPOFs: single database server, single load balancer, single DNS provider, single region. Design for no single points: redundant everything critical, health checks, automatic failover. Cloud providers help with managed services. Accept some SPOF for non-critical components based on cost.

Question 12

Why is API rate limiting important and how does it work?

Accepted Answer

Rate limiting controls request frequency to protect resources and ensure fair usage. Purpose: prevent abuse/DDoS, protect backend services, ensure availability for all users, manage costs. Implementation: track requests per API key/IP, return 429 Too Many Requests when exceeded. Algorithms: token bucket, leaky bucket, fixed/sliding window. Tools: API gateways, Nginx, Redis counters. Provide: clear limits in documentation, rate limit headers (X-RateLimit-Remaining), exponential backoff guidance.
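
The token bucket algorithm mentioned above can be sketched as follows. This is a single-process illustration with an injectable clock, not a production limiter; `TokenBucket` and its parameters are names chosen for the example.

```python
import time


class TokenBucket:
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.updated
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller would answer with 429 Too Many Requests


now = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: now[0])
results = [bucket.allow(), bucket.allow(), bucket.allow()]  # burst of three
now[0] = 1.0                                                # one second later
results.append(bucket.allow())
```

The capacity controls how large a burst is tolerated, while the rate sets the sustained request frequency.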

Question 13

What is eventual consistency and when is it acceptable?

Accepted Answer

Eventual consistency means all replicas will eventually converge to the same value, but may temporarily show different values. Trade-off for availability and partition tolerance (CAP theorem). Acceptable when: stale reads are tolerable (social media feeds), high availability more important than immediate consistency, system can handle temporary inconsistency. Not acceptable: financial transactions, inventory (overselling), security-critical data. Design with conflict resolution (last-write-wins, merge). Common in distributed databases.

Question 14

What are health checks and why are they important?

Accepted Answer

Health checks verify service/component status for load balancing and alerting. Types: liveness (is process running), readiness (can handle traffic), deep health (dependencies working). Implementation: HTTP endpoint (/health, /ready), return status code 200 or 503. Used by: load balancers (route traffic away from unhealthy), container orchestrators (restart unhealthy containers), monitoring (alert on failures). Design: fast execution, check critical dependencies, don't cache. Essential for self-healing systems.

Question 15

What is service discovery and why is it needed in microservices?

Accepted Answer

Service discovery enables services to find each other's network locations without hardcoding. Necessary because: instances scale dynamically, IPs change, need load balancing across instances. Patterns: client-side (clients query registry and load balance - Eureka), server-side (load balancer queries registry - Kubernetes services). Registry: Consul, etcd, ZooKeeper, DNS-based. Services register on startup, deregister on shutdown, use heartbeats. Essential infrastructure for dynamic microservices environments.

Question 16

What are different cache invalidation strategies?

Accepted Answer

Time-based: TTL (expiration after fixed time) - simple but stale data possible. Event-based: invalidate on data change - more complex but fresher data. Write-through: write to cache and database simultaneously. Write-behind: write to cache, async to database - faster but risk of data loss. Cache-aside: application manages cache (check cache, load from DB, update cache). Strategies: versioned keys (append version to key), tag-based invalidation, pub/sub for distributed invalidation. 'Cache invalidation is one of two hard problems in CS.'

Question 17

What are database sharding strategies and their trade-offs?

Accepted Answer

Sharding horizontally partitions data across databases. Strategies: Range-based (by value ranges - easy but hotspots), Hash-based (consistent hash of key - even distribution, no range queries), Directory-based (lookup table - flexible but overhead), Geographic (by region - data locality). Trade-offs: cross-shard queries are expensive, joins limited, transactions complex, rebalancing difficult. Implementation: application-level routing, proxy layer (Vitess), or native support (CockroachDB). Consider: shard key selection critical, avoid hotspots.

Question 18

What is the Circuit Breaker pattern and when should you use it?

Accepted Answer

Circuit Breaker prevents cascading failures by stopping calls to failing services. States: Closed (normal operation, tracking failures), Open (fails fast, doesn't call service), Half-Open (periodically tests if service recovered). Configuration: failure threshold, timeout, recovery time. Benefits: fail fast, prevent resource exhaustion, allow recovery time. Libraries: Hystrix, Resilience4j, Polly. Use for: external API calls, database connections, any unreliable dependency. Combine with: timeouts, retries with backoff, fallbacks.
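
The three states can be sketched as a small wrapper around any callable. This is a simplified, single-threaded illustration rather than the behavior of Hystrix, Resilience4j, or Polly; `CircuitBreaker`, `CircuitOpenError`, and `flaky` are hypothetical names.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_time=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_time = recovery_time
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_time:
                raise CircuitOpenError("failing fast")
            self.state = "half_open"  # recovery time elapsed: allow a trial
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if (self.state == "half_open"
                    or self.failures >= self.failure_threshold):
                self.state = "open"
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.state = "closed"
        return result


now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, recovery_time=10.0,
                         clock=lambda: now[0])


def flaky():
    raise ConnectionError("downstream unavailable")


states = []
for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
states.append(breaker.state)        # threshold reached

try:
    breaker.call(flaky)
except CircuitOpenError:
    states.append("rejected")       # failed fast, flaky() was not called

now[0] = 11.0                       # recovery time has passed
breaker.call(lambda: "ok")          # trial call succeeds
states.append(breaker.state)
```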

Question 19

What is an API Gateway and what problems does it solve?

Accepted Answer

API Gateway is the single entry point for all client requests to microservices. Functions: routing (to appropriate service), authentication/authorization (centralized), rate limiting, request/response transformation, aggregation (combine multiple services), caching, logging/monitoring, SSL termination. Benefits: client simplification, cross-cutting concerns centralized, backend service protection. Examples: Kong, AWS API Gateway, Apigee. Considerations: potential bottleneck, single point of failure (needs redundancy), added latency.

Question 20

What is event-driven architecture and what are its benefits?

Accepted Answer

Event-driven architecture uses events to trigger and communicate between decoupled services. Components: event producers, event channel (message broker), event consumers. Patterns: event notification (minimal data, consumer queries), event-carried state transfer (full data), event sourcing (events as source of truth). Benefits: loose coupling, scalability, real-time processing, audit trail. Challenges: eventual consistency, debugging complexity, event ordering. Technologies: Kafka, RabbitMQ, AWS EventBridge. Foundation for reactive systems.

Question 21

How does consistent hashing work and why is it used?

Accepted Answer

Consistent hashing distributes data across nodes such that adding/removing nodes minimally affects data distribution. Keys and nodes are mapped to a hash ring; keys are assigned to nearest node clockwise. Adding node: only keys between new node and predecessor remapped. Benefits: minimal redistribution (K/n keys move when node added, K=keys, n=nodes), scalability, load balancing. Virtual nodes: multiple positions per node for even distribution. Used in: distributed caches, databases (DynamoDB), CDNs. Essential for distributed systems.
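
A minimal hash ring with virtual nodes might look like the sketch below. The choice of SHA-256, 100 virtual nodes per node, and the names `HashRing`, `node-a` through `node-d` are assumptions for illustration only.

```python
import bisect
import hashlib


def _hash(value):
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node occupies several positions to even out the distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def lookup(self, key):
        # First virtual node clockwise from the key, wrapping at the end.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.lookup(k) for k in keys}

ring.add("node-d")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` ends up on `node-d`; keys never shuffle between the existing nodes, which is the property that distinguishes this scheme from `hash(key) % n`.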

Question 22

What is idempotency and how do you design idempotent APIs?

Accepted Answer

Idempotency means repeated identical requests have same effect as single request - safe to retry. Essential for: payment processing, order creation, any state-changing operation. Implementation: idempotency keys (client-generated unique ID), store processed keys with result, return cached result on duplicate. HTTP methods: GET, PUT, DELETE naturally idempotent; POST needs explicit handling. Storage: Redis with TTL, database table. Response: return same result for duplicate request. Critical for reliable distributed systems.
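
The idempotency-key flow can be sketched with an in-memory store. `PaymentService` and its fields are hypothetical; a real implementation would need an atomic check-and-store (for example a unique constraint or a conditional write) and an expiry on stored keys.

```python
import uuid


class PaymentService:
    def __init__(self):
        self._results = {}   # idempotency key -> stored response
        self.charges = []    # side effects that were actually performed

    def charge(self, idempotency_key, account, amount_cents):
        # Duplicate request: return the stored response, do not charge again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        response = {"status": "charged", "charge_id": len(self.charges)}
        self._results[idempotency_key] = response
        return response


service = PaymentService()
key = str(uuid.uuid4())                        # generated once by the client
first = service.charge(key, "acct-1", 2500)
retry = service.charge(key, "acct-1", 2500)    # e.g. retry after a timeout
```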

Question 23

How do you implement read/write splitting for database scaling?

Accepted Answer

Read/write splitting directs writes to the primary and reads to replicas. Implementation: application-level routing, proxy (ProxySQL for MySQL, Pgpool-II for PostgreSQL), framework support. Considerations: replication lag (writes may not be immediately readable from replicas), session consistency (route a user's reads to the same replica, or to the primary after a write), query classification. Patterns: read-your-writes (read from primary after write), monotonic reads (sticky sessions). Benefits: scale reads independently, reduce primary load. Common for read-heavy workloads.

Question 24

How do you design a distributed logging system?

Accepted Answer

Distributed logging collects and centralizes logs from all services. Architecture: log agents (Fluentd, Filebeat) on each node, message queue for buffering (Kafka), log aggregator (Logstash, Fluentd), storage (Elasticsearch), visualization (Kibana, Grafana). Design considerations: structured logging (JSON), correlation IDs (trace across services), log levels, retention policies, sampling for high-volume. ELK Stack common. Cloud options: CloudWatch, Datadog. Essential for debugging distributed systems.

Question 25

How would you design a URL shortener like bit.ly?

Accepted Answer

Requirements: shorten URLs, redirect to original, analytics. Design: Generate short code (base62 encoding of incremented ID or hash), store mapping in database (short_code -> original_url), redirect service (lookup, 301/302 redirect). Scaling: read-heavy (cache mappings in Redis), database sharding by short_code, CDN for redirects. Short code generation: counter service, random with collision check. Features: custom aliases, expiration, click analytics. Handle: high read throughput, URL validation, abuse prevention.
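
The base62 encoding of an incremented ID could be sketched as below. The alphabet ordering (digits, lowercase, uppercase) is an arbitrary choice for this example; any fixed ordering of the 62 characters works as long as encoder and decoder agree.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def encode_base62(n):
    """Convert a non-negative integer ID into a short code."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode_base62(code):
    """Convert a short code back into the integer ID."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n


short_code = encode_base62(1_000_000)
original_id = decode_base62(short_code)
```

Seven base62 characters cover 62^7 (about 3.5 trillion) IDs, which is why short codes of that length are usually sufficient.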

Question 26

Compare Kafka and RabbitMQ. When would you use each?

Accepted Answer

Kafka: distributed log, high throughput (millions/sec), persistent storage, replay capability, consumer groups, ordered within partition. Best for: event streaming, log aggregation, real-time analytics. RabbitMQ: traditional message broker, flexible routing (exchanges, queues), message acknowledgment, lower latency for small messages. Best for: task queues, RPC, complex routing. Kafka: more operational complexity, larger infrastructure. RabbitMQ: easier setup, better for simpler use cases. Consider: throughput needs, data retention, exactly-once requirements.

Question 27

What are different API pagination strategies and their trade-offs?

Accepted Answer

Offset pagination: page/offset parameters - simple but slow for large offsets, inconsistent with data changes. Cursor-based: opaque cursor pointing to position - consistent, efficient, but no random access. Keyset: filter by last seen key - efficient, handles inserts well, requires sortable unique field. Time-based: filter by timestamp - good for chronological data. Response: include total_count, next_cursor, has_more. Choose based on: data characteristics, consistency needs, performance requirements. Cursor-based preferred for infinite scroll; offset for traditional pagination.
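
Keyset pagination can be illustrated with an in-memory SQLite table. The `items` table, the `fetch_page` helper, and the response fields are assumptions for the example; the technique is the `WHERE id > ? ORDER BY id LIMIT ?` query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, f"item-{i}") for i in range(1, 8)],
)


def fetch_page(conn, after_id=0, limit=3):
    rows = conn.execute(
        "SELECT id, name FROM items WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit + 1),   # one extra row tells us if more pages exist
    ).fetchall()
    has_more = len(rows) > limit
    rows = rows[:limit]
    next_cursor = rows[-1][0] if has_more else None
    return {"items": rows, "next_cursor": next_cursor, "has_more": has_more}


pages = []
cursor = 0
while cursor is not None:
    page = fetch_page(conn, after_id=cursor)
    pages.append([row[0] for row in page["items"]])
    cursor = page["next_cursor"]
```

Because the query seeks directly to `id > ?` through the primary key index, the cost of a page does not grow with how deep into the result set it is, unlike `OFFSET`.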

Question 28

How would you design a blob storage system like S3?

Accepted Answer

Requirements: store/retrieve large files, high durability, scalability. Architecture: metadata service (mapping object names to storage locations), storage nodes (actual data, chunked for large files), load balancer. Features: versioning, encryption, access control, lifecycle policies. Durability: replicate across nodes/regions, erasure coding (store data + parity, reconstruct from subset). Scaling: partition by hash of object name. CDN integration for frequent access. Consistency: eventual for better availability. Consider: multipart upload, range requests, storage tiers (hot/cold).

Question 29

How do you handle transactions across microservices?

Accepted Answer

Challenges: no single ACID database, network failures, partial failures. Patterns: Saga (sequence of local transactions with compensating actions - choreography or orchestration), Two-Phase Commit (2PC - coordinator ensures all-or-nothing, but blocking and slow), Event Sourcing + CQRS (events as truth, eventual consistency). Best practices: design for eventual consistency when possible, use idempotent operations, implement compensating transactions, track transaction state. Avoid distributed transactions when possible through service boundaries.
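
An orchestrated saga with compensating actions can be sketched in a few lines. The step names and the `run_saga` helper are hypothetical, and the sketch assumes compensations themselves succeed; a real orchestrator would persist saga state and retry compensations.

```python
class SagaFailed(Exception):
    pass


def run_saga(steps):
    """steps: list of (name, action, compensation) tuples."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            # Undo the already-completed steps in reverse order.
            for _, undo in reversed(completed):
                undo()
            raise SagaFailed(f"step '{name}' failed") from exc
        completed.append((name, compensate))


log = []


def fail_shipping():
    raise RuntimeError("no courier available")


steps = [
    ("reserve_inventory",
     lambda: log.append("reserved"), lambda: log.append("released")),
    ("charge_payment",
     lambda: log.append("charged"), lambda: log.append("refunded")),
    ("book_shipping",
     fail_shipping, lambda: log.append("shipping cancelled")),
]

try:
    run_saga(steps)
    outcome = "committed"
except SagaFailed:
    outcome = "rolled back"
```

The failed step's own compensation is not run because that step never completed; only the two earlier local transactions are undone.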

Question 30

How would you design a notification system for a mobile app?

Accepted Answer

Components: notification service (create/schedule), delivery workers (by channel), device token management, preference service. Channels: push (FCM/APNS), email, SMS, in-app. Design: event triggers notification, queue for processing, fan-out to channels, track delivery/opens. Challenges: high volume (batch, rate limit), device token management (refresh, multiple devices), preferences (user opt-outs), templating, scheduling. Scale: partition by user, horizontal workers, prioritize real-time vs batch. Analytics: delivery rates, open rates, engagement.

Question 31

How do you handle content updates with a CDN?

Accepted Answer

Strategies: cache versioning (include version/hash in URL - new URL bypasses cache), TTL-based (set appropriate expiration - balance freshness vs hit rate), purge/invalidation API (explicitly clear cache - propagation delay), stale-while-revalidate (serve stale while fetching fresh). Best practices: long TTL + versioned URLs for static assets, shorter TTL for dynamic content, use cache tags for grouped invalidation. Consider: invalidation cost, edge propagation time, origin shield (reduce origin load). Different strategies for different content types.

Question 32

What is leader election and how is it implemented in distributed systems?

Accepted Answer

Leader election selects one node as coordinator for a task, ensuring single active leader. Uses: distributed locks, primary database replica, job scheduling. Algorithms: Raft/Paxos (consensus-based), ZooKeeper (ephemeral nodes + watches), etcd (lease-based), Redis (SETNX with expiry). Requirements: detect leader failure, elect new leader quickly, prevent split-brain. Implementation: acquire lock/lease, maintain heartbeat, others monitor, re-election on failure. Fencing tokens prevent stale leaders from taking actions. Critical for consistency in distributed systems.

Question 33

Why is database connection pooling important and how do you configure it?

Accepted Answer

Connection pooling maintains reusable database connections, avoiding the overhead of creating a new connection per request. Benefits: reduced latency (no connection setup), resource efficiency (limit connections), connection management. Configuration: min/max pool size (based on load and database limits), connection timeout, idle timeout, validation query. Sizing: too small causes waiting; too large overloads the database. Tools: HikariCP (in-process pool for Java), PgBouncer (external pooler for PostgreSQL). Monitor: pool exhaustion, wait times. Essential for high-throughput applications.

Question 34

How do you design a reliable webhook system?

Accepted Answer

Webhooks deliver events to external URLs. Design: event queue (decouple from main flow), delivery workers, retry logic (exponential backoff), delivery logging. Reliability: at-least-once delivery (retry failures), idempotency keys (receiver dedupes), signatures (HMAC for authenticity), timeouts (protect against slow receivers). Management: registration API, secret rotation, pause/resume, delivery history. Challenges: receiver errors, high volume, security. Features: filtering (subscribe to specific events), batching, test endpoints. SLAs: define max delivery time, retry policy.
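
Two of the reliability pieces above, HMAC signatures and an exponential backoff schedule, can be sketched with the standard library. The secret value, the event payload, and the helper names are made up for illustration; how the signature is transmitted (for example in a request header) is left out.

```python
import hashlib
import hmac
import json


def sign_payload(secret: bytes, body: bytes) -> str:
    """Sender side: HMAC-SHA256 over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiver side: recompute and compare in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, signature)


def backoff_delays(attempts, base=1.0, factor=2.0, cap=60.0):
    """Seconds to wait before each retry, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]


secret = b"whsec_example"  # hypothetical shared secret
body = json.dumps({"event": "order.created", "id": 123}).encode()

signature = sign_payload(secret, body)
valid = verify_signature(secret, body, signature)
tampered = verify_signature(secret, body + b" ", signature)
delays = backoff_delays(attempts=5)
```

Signing the raw bytes rather than a re-serialized object matters: the receiver must verify exactly what was sent, since JSON serialization is not guaranteed to be byte-identical across libraries.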

Question 35

What is request coalescing and when should you use it?

Accepted Answer

Request coalescing combines multiple identical concurrent requests into one, sharing the result - prevents thundering herd on cache miss. Implementation: for same key/request, first caller proceeds, others wait and receive same result. Use cases: cache population (single DB query on miss), API calls, expensive computations. Libraries: SingleFlight (Go), singleflight pattern. Also: request collapsing (batch multiple different requests). Caution: error handling (propagate to all waiters), timeout handling, lock contention. Essential for high-concurrency cache systems.
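
The "first caller proceeds, others wait" behavior can be sketched with threads. This is loosely modeled on the singleflight idea, not a port of the Go package; `SingleFlight`, `waiting`, and `slow_query` are hypothetical names, and the `release` event exists only to make the demonstration deterministic.

```python
import threading
import time


class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> record of the in-flight call

    def waiting(self, key):
        with self._lock:
            call = self._calls.get(key)
            return call["waiters"] if call else 0

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = {"done": threading.Event(), "result": None,
                        "error": None, "waiters": 0}
                self._calls[key] = call
            else:
                call["waiters"] += 1
        if leader:
            try:
                call["result"] = fn()
            except Exception as exc:
                call["error"] = exc       # propagated to every waiter
            finally:
                with self._lock:
                    del self._calls[key]
                call["done"].set()
        else:
            call["done"].wait()
        if call["error"] is not None:
            raise call["error"]
        return call["result"]


sf = SingleFlight()
release = threading.Event()
db_calls = []
results = []


def slow_query():
    db_calls.append("load user 42")
    release.wait()                        # simulate a slow database call
    return {"id": 42}


def worker():
    results.append(sf.do("user:42", slow_query))


threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
while sf.waiting("user:42") < 4:          # wait until four callers coalesce
    time.sleep(0.01)
release.set()
for t in threads:
    t.join()
```

Five concurrent callers result in a single call to `slow_query`, and all five receive the same result.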

Question 36

How would you design a distributed caching system like Memcached or Redis Cluster?

Accepted Answer

Architecture: client library for routing, cache nodes, cluster manager. Data distribution: consistent hashing with virtual nodes for balance. Replication: master-slave per partition for durability, async replication for speed. Cluster management: gossip protocol for membership, failure detection, automatic failover. Features: TTL expiration, eviction policies (LRU, LFU), memory management. Consistency: eventual for performance, read-your-writes via routing. Hot keys: local caching, key replication. Scaling: add nodes, automatic rebalancing. Monitor: hit rates, memory usage, latency percentiles.

Question 37

How would you design a web search engine like Google?

Accepted Answer

Components: Web Crawler (BFS crawl, politeness, URL frontier, deduplication), Indexer (inverted index - word to document IDs, index sharding by term or document), Query Processor (parse query, retrieve matching docs, rank), Ranking (PageRank for authority, TF-IDF for relevance, ML models). Scale: billions of documents, distributed crawling, index partitioning. Serving: query routing, result aggregation, caching popular queries. Freshness: continuous crawling, prioritize by importance. Features: autocomplete, spell correction, personalization. Challenges: spam detection, query understanding, real-time indexing.

Question 38

How would you design a real-time chat system like WhatsApp?

Accepted Answer

Architecture: connection servers (WebSocket), chat servers, message storage, push notification service. Real-time: WebSocket for online users, push notifications for offline. Message flow: client -> server -> recipient's connection server or push. Storage: message DB (by conversation, time-partitioned), media storage (S3), message queues. Features: delivery receipts (sent, delivered, read), group chat (fan-out), presence (online status). Scale: shard by user ID, connection affinity, message ID generation (Snowflake). E2E encryption: client-side encryption, key exchange. Offline: message queue per user.

Question 39

How would you design a distributed rate limiter?

Accepted Answer

Algorithms: Token Bucket (tokens replenish at rate, request consumes token), Leaky Bucket (fixed output rate, queue excess), Sliding Window (count requests in rolling window). Distributed: centralized counter (Redis INCR with TTL) or local with sync. Challenges: clock synchronization, race conditions (Lua scripts for atomicity), hot partitions. Architecture: embedded in service, API gateway, or standalone service. Features: per-user, per-API, burst handling, graduated limits. Response: 429 Too Many Requests, Retry-After header, X-RateLimit headers. Accuracy vs performance trade-off.

Question 40

How would you design a video streaming platform like YouTube?

Accepted Answer

Upload pipeline: client upload -> transcoding service (multiple resolutions, formats), thumbnail generation, metadata extraction, CDN distribution. Storage: blob storage for videos, CDN for delivery, database for metadata. Streaming: adaptive bitrate (HLS/DASH), CDN edge servers, origin shield. Features: recommendations (ML service), search (Elasticsearch), comments, likes, subscriptions. Scale: read-heavy (cache metadata), geographic distribution, hot video caching. Transcoding: distributed workers, priority queues. Analytics: view counts (eventually consistent counters), watch time. Monetization: ad insertion service.

Question 41

How would you implement a distributed locking mechanism?

Accepted Answer

Requirements: mutual exclusion, deadlock avoidance, fault tolerance. Redis approach: SET key value NX EX (set if not exists with expiry), owner token for safe release. Redlock (multi-node): acquire lock on majority of independent nodes. ZooKeeper: ephemeral sequential nodes, watch for predecessor deletion. Challenges: lock holder failure (TTL), clock drift (fencing tokens), split-brain. Best practices: always set timeout, use fencing tokens to detect stale locks, prefer local coordination when possible. Libraries: Redisson, Curator. Consider: is strong consistency needed, or eventual consistency acceptable?

Question 42

How would you design a social media news feed like Facebook?

Accepted Answer

Approaches: Pull (query friends' posts at read time - slow for many friends), Push/Fanout (write post to all followers' feeds at write time - fast reads, write amplification), Hybrid (push for normal users, pull for celebrities). Architecture: post service, feed service, graph service (followers), ranking service. Storage: posts DB, feed cache (sorted sets per user), activity log. Ranking: chronological + ML ranking (engagement prediction). Scale: denormalize, cache feeds, shard by user. Hot users: separate handling for high-follower accounts. Real-time: WebSocket for live updates.

Question 43

How would you design a search autocomplete/typeahead system?

Accepted Answer

Data structure: Trie for prefix matching, with frequency counts at nodes. Architecture: prefix trees in memory, sharded by prefix ranges, replication for availability. Query flow: client sends prefix, service returns top-k suggestions. Ranking: frequency, recency, personalization, trending. Updates: offline batch processing for global frequencies, real-time updates for user history. Optimization: precompute top-k at each node, limit tree depth. Scale: horizontal sharding, CDN edge caching for common prefixes. Features: spelling correction (edit distance), phrase suggestions. Response time critical (<100ms).
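
A small trie with frequency counts shows the core lookup. This sketch walks the subtree at query time instead of precomputing top-k at each node, so it illustrates the data structure rather than the optimized design; the sample queries and counts are invented.

```python
import heapq


class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # how often a query ending at this node was seen


class Autocomplete:
    def __init__(self):
        self.root = TrieNode()

    def add(self, query, count=1):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def suggest(self, prefix, k=3):
        # Walk down to the node that represents the prefix.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every completed query below that node.
        matches = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.count:
                matches.append((current.count, text))
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        # Highest count first, ties broken alphabetically.
        top = heapq.nsmallest(k, matches, key=lambda m: (-m[0], m[1]))
        return [text for _, text in top]


ac = Autocomplete()
ac.add("system design", 50)
ac.add("system call", 30)
ac.add("systemd", 20)
ac.add("sharding", 40)

top_two = ac.suggest("sys", k=2)
all_s = ac.suggest("s", k=3)
```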

Question 44

How would you design a ride-sharing system like Uber?

Accepted Answer

Services: rider app, driver app, matching service, pricing service, map/routing, payment. Real-time location: drivers send location updates, geospatial indexing (S2, Geohash), efficient radius queries. Matching: find nearby available drivers, ETA calculation, optimal assignment (Hungarian algorithm or greedy). Dispatch: WebSocket/push to driver, timeout and retry. Surge pricing: supply/demand per geohash cell. Data: trip history (time-series), driver locations (in-memory with persistence). Scale: city-based sharding, location service handles millions of updates. Consistency: ride state machine, idempotent state transitions.

Question 45

How would you design a payment processing system?

Accepted Answer

Requirements: exactly-once processing, high availability, security, compliance. Flow: payment request -> validation -> authorization (payment gateway) -> capture -> settlement. Idempotency: unique transaction IDs, store processed transactions. Double-entry: every credit has debit, reconciliation. State machine: pending -> authorized -> captured -> settled. Failure handling: retry with backoff, compensation, manual review queue. Security: tokenization, PCI compliance, encryption. Features: refunds, disputes, multiple payment methods, fraud detection. Reconciliation: match with bank statements. Audit trail: immutable event log. Consistency critical.

Question 46

How would you design a distributed task scheduling system?

Accepted Answer

Requirements: schedule tasks for future, recurring tasks, exactly-once execution, high availability. Architecture: scheduler service (determine when to run), executor workers (run tasks), task queue (pending tasks), task store (persistent). Scheduling: check upcoming tasks, push to queue at execution time. At-least-once: queue with acknowledgment, retry on failure, idempotent tasks. Exactly-once: distributed lock per task instance. Recurring: cron parser, generate next run time. Scale: partition by task ID, multiple schedulers with leader election. Features: priorities, dependencies, rate limiting. Examples: Airflow, Celery Beat.

Question 47

How would you design a metrics collection and monitoring system?

Accepted Answer

Pipeline: agents collect (StatsD, Prometheus scrapers) -> message queue (Kafka) -> stream processing (aggregation) -> time-series DB (InfluxDB, TimescaleDB) -> visualization (Grafana). Metrics types: counters, gauges, histograms. Aggregation: pre-aggregate at collection, rollups for longer retention. Query: downsampling for large ranges, caching. Alerting: rules engine, notification routing (PagerDuty). Scale: shard by metric name or tag combination, compression (Gorilla-style delta and XOR encoding). Retention: high-resolution recent, rollups for historical. Cardinality: limit unique tag combinations. Distributed tracing integration: correlate metrics with traces.

Question 48

How would you design a distributed file system like HDFS or GFS?

Accepted Answer

Architecture: master (namespace, metadata, chunk locations) and chunk servers (store data chunks). Files split into chunks (64-128MB), replicated across chunk servers (typically 3 replicas). Write: client gets chunk servers from master, writes to primary which replicates. Read: client gets chunk locations, reads from nearest replica. Master: single (with hot standby), handles metadata only, uses operation log + checkpoints. Chunk servers: report chunk lists to master, heartbeat for failure detection. Consistency: lease-based write ordering, atomic record append. Fault tolerance: re-replicate on failure detection. GC: lazy deletion, background cleanup.

Question 49

How would you design a real-time analytics dashboard system?

Accepted Answer

Architecture: event ingestion (Kafka) -> stream processing (Flink, Spark Streaming) -> serving layer (pre-aggregated results) -> dashboard (WebSocket updates). Processing: windowed aggregations (tumbling, sliding, session windows), exactly-once semantics. Serving: materialized views in fast store (Redis, Druid), denormalized for query patterns. Real-time updates: poll or WebSocket to frontend. Historical: Lambda architecture (batch + stream) or Kappa (stream only). Scale: partition by dimension, parallel processing. Challenges: late data (watermarks), high cardinality dimensions. Query: OLAP databases (ClickHouse, Druid) for ad-hoc analysis.

Question 50

How would you design a distributed key-value store like DynamoDB?

Accepted Answer

Architecture: consistent hashing for partitioning, virtual nodes for load balance. Replication: configurable N replicas per key, quorum reads/writes (W+R>N for consistency). Conflict resolution: vector clocks + application-specified resolution, or last-write-wins. Storage engine: LSM tree for write optimization (in-memory + SSTables). Operations: get, put, delete with conditional operations (compare-and-swap). Consistency: tunable (eventual to strong via quorum). Failure handling: hinted handoff, anti-entropy with Merkle trees. Scale: add nodes, automatic rebalancing. Features: secondary indexes (local or global), TTL, transactions (Dynamo-style). CAP: AP by default, CP possible with strict quorum.
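
The quorum condition W+R>N can be checked by brute force: the sketch below enumerates every possible set of W written replicas and R read replicas and confirms they always share at least one replica exactly when W+R>N. The function name is hypothetical, and the check ignores sloppy quorums and hinted handoff.

```python
from itertools import combinations


def read_always_overlaps_write(n, w, r):
    """True if every R-replica read set intersects every W-replica write set."""
    replicas = range(n)
    for written in combinations(replicas, w):
        for read in combinations(replicas, r):
            if not set(written) & set(read):
                return False  # this read could miss the latest write
    return True


strict = read_always_overlaps_write(n=3, w=2, r=2)     # 2 + 2 > 3
eventual = read_always_overlaps_write(n=3, w=1, r=1)   # 1 + 1 <= 3
```

With N=3, choosing W=2 and R=2 guarantees a read contacts at least one replica holding the latest acknowledged write, while W=1 and R=1 trades that guarantee for lower latency.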

System Design Interview Questions