System Design Interview Questions
Scalability, microservices, distributed systems, caching, and load balancing
1 What is the difference between horizontal and vertical scaling?
Easy
Vertical scaling (scaling up) adds more resources (CPU, RAM) to existing servers - simpler, but bounded by hardware limits and leaves a single point of failure. Horizontal scaling (scaling out) adds more servers to distribute load - more complex, but offers better fault tolerance, theoretically unlimited scale, and is often more cost-effective. Most large-scale systems use horizontal scaling. Stateless applications are easier to scale horizontally; stateful ones require data partitioning or session management.
2 What is a load balancer and why is it needed?
Easy
A load balancer distributes incoming network traffic across multiple servers to ensure no single server is overwhelmed. Benefits: improved availability (servers can fail), better performance (distribute load), scalability (add servers easily), health monitoring (route away from unhealthy servers). Types: hardware (F5), software (Nginx, HAProxy), cloud (AWS ALB/NLB). Algorithms: round-robin, least connections, IP hash. Essential component for any scalable web application.
3 What is caching and why is it important in system design?
Easy
Caching stores copies of frequently accessed data in faster storage (memory vs disk, closer to user). Benefits: reduced latency (faster responses), decreased database load, improved throughput, cost savings. Cache layers: browser cache, CDN, application cache (Redis, Memcached), database query cache. Key considerations: cache invalidation (TTL, events), cache-aside vs read-through, cold start, memory limits. Caching is fundamental for performance at scale.
4 When would you choose SQL vs NoSQL databases?
Easy
SQL databases (PostgreSQL, MySQL) for: structured data with relationships, complex queries and joins, ACID transactions, data integrity critical. NoSQL databases (MongoDB, Cassandra, Redis) for: flexible/evolving schemas, high write throughput, horizontal scaling, specific data models (document, key-value, graph). Consider: consistency requirements (CAP theorem), query patterns, scaling needs, team expertise. Many systems use both (polyglot persistence) for different use cases.
5 What are the key principles of RESTful API design?
Easy
REST principles: stateless (no server-side session), resource-based URLs (/users, /orders/{id}), HTTP methods for actions (GET read, POST create, PUT update, DELETE remove), standard status codes (200 OK, 201 Created, 404 Not Found), JSON for data exchange. Design practices: use nouns for resources, version APIs (/v1/), pagination for lists, consistent error format, HATEOAS for discoverability. RESTful APIs are the standard for web services.
6 What are the benefits and challenges of microservices architecture?
Easy
Benefits: independent deployment, technology flexibility per service, team autonomy, fault isolation, scalability per service. Challenges: distributed system complexity, network latency, data consistency across services, operational overhead (deployment, monitoring), service discovery, debugging distributed transactions. Microservices require: mature DevOps practices, observability stack, clear service boundaries. Start with monolith if domain unclear. Microservices solve organizational scaling as much as technical.
7 What is a message queue and when would you use one?
Easy
Message queues enable asynchronous communication between services by storing messages until consumers process them. Benefits: decoupling (producers/consumers independent), buffering (handle traffic spikes), reliability (messages persist until processed), scaling (multiple consumers). Use cases: order processing, sending emails/notifications, video processing, event streaming. Examples: RabbitMQ (traditional queuing), Kafka (event streaming), SQS (cloud). Essential for building resilient, scalable systems.
8 What is database replication and why is it used?
Easy
Replication copies data across multiple database servers. Benefits: high availability (failover if primary fails), read scaling (distribute reads across replicas), data locality (replicas in different regions). Types: synchronous (waits for replicas, stronger consistency) vs asynchronous (faster, eventual consistency). Master-slave (writes to master only) vs master-master (writes to any). Challenges: replication lag, conflict resolution (multi-master). Most production databases use replication for availability.
9 What does it mean to design a stateless service?
Easy
A stateless service doesn't store client session state between requests - each request contains all needed information. Benefits: easier horizontal scaling (any server can handle any request), simpler failover (no session to migrate), better caching. State storage: externalize to database or cache (Redis). Stateless examples: REST APIs, serverless functions. Inherently stateful: WebSocket connections (each is tied to one server). Shopping carts are stateful data, but externalizing them to Redis keeps the service itself stateless. Design for statelessness to maximize scalability and resilience.
10 How does a CDN improve application performance?
Easy
CDN (Content Delivery Network) caches content at edge locations worldwide, serving users from nearby servers. Performance benefits: reduced latency (shorter physical distance), decreased origin load, improved availability (distributed, redundant). Content types: static files (images, CSS, JS), video streaming, API responses (edge caching). Providers: Cloudflare, Akamai, CloudFront. Additional features: DDoS protection, SSL termination, compression. Essential for global applications and media delivery.
11 What is a single point of failure (SPOF) and how do you eliminate it?
Easy
SPOF is any component whose failure would bring down the entire system. Elimination strategies: redundancy (multiple instances), load balancing (distribute across instances), failover (automatic switch to backup), geographic distribution. Common SPOFs: single database server, single load balancer, single DNS provider, single region. Design for no single points: redundant everything critical, health checks, automatic failover. Cloud providers help with managed services. Accept some SPOF for non-critical components based on cost.
12 Why is API rate limiting important and how does it work?
Easy
Rate limiting controls request frequency to protect resources and ensure fair usage. Purpose: prevent abuse/DDoS, protect backend services, ensure availability for all users, manage costs. Implementation: track requests per API key/IP, return 429 Too Many Requests when exceeded. Algorithms: token bucket, leaky bucket, fixed/sliding window. Tools: API gateways, Nginx, Redis counters. Provide: clear limits in documentation, rate limit headers (X-RateLimit-Remaining), exponential backoff guidance.
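As an illustration of the token bucket algorithm named above, here is a minimal single-process sketch. The class name, the `rate` and `capacity` parameters, and the injectable `clock` are assumptions made for the example; a production limiter would usually keep this state per API key in a shared store.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the behaviour can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill according to elapsed time, never exceeding capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller would answer with 429 Too Many Requests
```

The capacity sets how large a burst is tolerated, while the rate sets the sustained throughput.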
13 What is eventual consistency and when is it acceptable?
Easy
Eventual consistency means all replicas will eventually converge to the same value, but may temporarily show different values. Trade-off for availability and partition tolerance (CAP theorem). Acceptable when: stale reads are tolerable (social media feeds), high availability more important than immediate consistency, system can handle temporary inconsistency. Not acceptable: financial transactions, inventory (overselling), security-critical data. Design with conflict resolution (last-write-wins, merge). Common in distributed databases.
14 What are health checks and why are they important?
Easy
Health checks verify service/component status for load balancing and alerting. Types: liveness (is process running), readiness (can handle traffic), deep health (dependencies working). Implementation: HTTP endpoint (/health, /ready), return status code 200 or 503. Used by: load balancers (route traffic away from unhealthy), container orchestrators (restart unhealthy containers), monitoring (alert on failures). Design: fast execution, check critical dependencies, don't cache. Essential for self-healing systems.
15 What is service discovery and why is it needed in microservices?
Easy
Service discovery enables services to find each other's network locations without hardcoding. Necessary because: instances scale dynamically, IPs change, need load balancing across instances. Patterns: client-side (clients query registry and load balance - Eureka), server-side (load balancer queries registry - Kubernetes services). Registry: Consul, etcd, ZooKeeper, DNS-based. Services register on startup, deregister on shutdown, use heartbeats. Essential infrastructure for dynamic microservices environments.
16 What are different cache invalidation strategies?
Medium
Time-based: TTL (expiration after fixed time) - simple but stale data possible. Event-based: invalidate on data change - more complex but fresher data. Related write patterns: write-through (write to cache and database in the same operation), write-behind (write to cache, async to database - faster but risk of data loss), cache-aside (application manages cache: check cache, on a miss load from DB and populate cache, invalidate or update on write). Techniques: versioned keys (append version to key), tag-based invalidation, pub/sub for distributed invalidation. As the quip attributed to Phil Karlton goes, there are only two hard things in computer science: cache invalidation and naming things.
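The cache-aside flow combined with TTL expiry and explicit invalidation can be sketched as follows. This is an in-memory illustration only: `CacheAside`, `load_from_db`, and the injectable `clock` are hypothetical names, and a real deployment would typically back the store with Redis or Memcached.

```python
import time


class CacheAside:
    """Cache-aside with TTL: check the cache, fall back to the loader on a miss."""

    def __init__(self, load_from_db, ttl_seconds, clock=time.monotonic):
        self.load_from_db = load_from_db   # hypothetical loader callback
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}                    # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]                # fresh hit
        value = self.load_from_db(key)     # miss or expired: go to the source
        self.store[key] = (value, self.clock() + self.ttl)
        return value

    def invalidate(self, key):
        """Event-based invalidation: call this when the underlying row changes."""
        self.store.pop(key, None)
```

Without the `invalidate` call, a reader may see the old value until the TTL elapses, which is the staleness trade-off described above.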
17 What are database sharding strategies and their trade-offs?
Medium
Sharding horizontally partitions data across databases. Strategies: Range-based (by value ranges - easy but hotspots), Hash-based (consistent hash of key - even distribution, no range queries), Directory-based (lookup table - flexible but overhead), Geographic (by region - data locality). Trade-offs: cross-shard queries are expensive, joins limited, transactions complex, rebalancing difficult. Implementation: application-level routing, proxy layer (Vitess), or native support (CockroachDB). Consider: shard key selection critical, avoid hotspots.
18 What is the Circuit Breaker pattern and when should you use it?
Medium
Circuit Breaker prevents cascading failures by stopping calls to failing services. States: Closed (normal operation, tracking failures), Open (fails fast, doesn't call service), Half-Open (after the recovery time, lets a trial request through to test if the service recovered). Configuration: failure threshold, timeout, recovery time. Benefits: fail fast, prevent resource exhaustion, allow recovery time. Libraries: Resilience4j, Polly, Hystrix (now in maintenance mode). Use for: external API calls, database connections, any unreliable dependency. Combine with: timeouts, retries with backoff, fallbacks.
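The three states can be sketched in a few lines. This is a simplified, single-threaded illustration rather than how any of the libraries above implement it; `CircuitBreaker`, `CircuitOpenError`, and the parameter names are assumptions for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_time=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_time = recovery_time
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.recovery_time:
            return "half-open"  # recovery time elapsed: allow a trial call
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("failing fast, dependency marked unhealthy")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or reaching the threshold, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers would normally catch `CircuitOpenError` and serve a fallback instead of waiting on a dependency that is known to be down.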
19 What is an API Gateway and what problems does it solve?
Medium
API Gateway is the single entry point for all client requests to microservices. Functions: routing (to appropriate service), authentication/authorization (centralized), rate limiting, request/response transformation, aggregation (combine multiple services), caching, logging/monitoring, SSL termination. Benefits: client simplification, cross-cutting concerns centralized, backend service protection. Examples: Kong, AWS API Gateway, Apigee. Considerations: potential bottleneck, single point of failure (needs redundancy), added latency.
20 What is event-driven architecture and what are its benefits?
Medium
Event-driven architecture uses events to trigger and communicate between decoupled services. Components: event producers, event channel (message broker), event consumers. Patterns: event notification (minimal data, consumer queries), event-carried state transfer (full data), event sourcing (events as source of truth). Benefits: loose coupling, scalability, real-time processing, audit trail. Challenges: eventual consistency, debugging complexity, event ordering. Technologies: Kafka, RabbitMQ, AWS EventBridge. Foundation for reactive systems.
21 How does consistent hashing work and why is it used?
Medium
Consistent hashing distributes data across nodes such that adding/removing nodes minimally affects data distribution. Keys and nodes are mapped to a hash ring; each key is assigned to the first node found moving clockwise from the key's position. Adding node: only keys between the new node and its predecessor are remapped. Benefits: minimal redistribution (on average about K/n keys move when a node is added, K=keys, n=nodes), scalability, load balancing. Virtual nodes: multiple positions per node for even distribution. Used in: distributed caches, Dynamo-style databases, CDNs. Essential for distributed systems.
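A minimal hash ring with virtual nodes might look like the sketch below. The class name `HashRing`, the `vnodes` default, and the choice of SHA-256 are illustrative assumptions, not a description of any particular product.

```python
import bisect
import hashlib


def _position(value):
    """Map a string to a position on the ring."""
    return int(hashlib.sha256(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node occupies several positions to even out the distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_position(f"{node}#{i}"), node))

    def remove(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_position(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

When a node is added, every key that changes owner moves to the new node, and all other keys stay where they were. That is the property that keeps rebalancing cheap.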
22 What is idempotency and how do you design idempotent APIs?
Medium
Idempotency means repeated identical requests have same effect as single request - safe to retry. Essential for: payment processing, order creation, any state-changing operation. Implementation: idempotency keys (client-generated unique ID), store processed keys with result, return cached result on duplicate. HTTP methods: GET, PUT, DELETE naturally idempotent; POST needs explicit handling. Storage: Redis with TTL, database table. Response: return same result for duplicate request. Critical for reliable distributed systems.
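The idempotency-key approach can be illustrated with a toy in-memory service. `PaymentService`, `charge`, and the response fields are hypothetical. In a real system the key lookup and the store would need to be atomic (for example a unique constraint or a Redis SET with NX) and the stored results would carry a TTL.

```python
class PaymentService:
    """Toy service showing how an idempotency key makes POST-style calls retry-safe."""

    def __init__(self):
        self._results = {}   # idempotency_key -> stored response
        self.charges = []    # side effects actually performed

    def charge(self, idempotency_key, account, amount):
        # Duplicate request: return the stored result, perform no new side effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount))
        response = {"status": "charged", "charge_id": len(self.charges)}
        self._results[idempotency_key] = response
        return response
```

A client that times out and retries with the same key gets the original response back, and the account is charged once.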
23 How do you implement read/write splitting for database scaling?
Medium
Read/write splitting directs writes to primary and reads to replicas. Implementation: application-level routing, proxy (ProxySQL for MySQL, Pgpool-II for PostgreSQL), framework support. Considerations: replication lag (writes may not be immediately readable), session consistency (route user's reads to same replica or primary after write), query classification. Patterns: read-your-writes (read from primary after write), monotonic reads (sticky sessions). Benefits: scale reads independently, reduce primary load. Common for read-heavy workloads.
24 How do you design a distributed logging system?
Medium
Distributed logging collects and centralizes logs from all services. Architecture: log agents (Fluentd, Filebeat) on each node, message queue for buffering (Kafka), log aggregator (Logstash, Fluentd), storage (Elasticsearch), visualization (Kibana, Grafana). Design considerations: structured logging (JSON), correlation IDs (trace across services), log levels, retention policies, sampling for high-volume. ELK Stack common. Cloud options: CloudWatch, Datadog. Essential for debugging distributed systems.
25 How would you design a URL shortener like bit.ly?
Medium
Requirements: shorten URLs, redirect to original, analytics. Design: Generate short code (base62 encoding of incremented ID or hash), store mapping in database (short_code -> original_url), redirect service (lookup, 301/302 redirect). Scaling: read-heavy (cache mappings in Redis), database sharding by short_code, CDN for redirects. Short code generation: counter service, random with collision check. Features: custom aliases, expiration, click analytics. Handle: high read throughput, URL validation, abuse prevention.
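The base62 encoding of an incremented ID mentioned above can be sketched as follows. The alphabet order is an arbitrary choice made for this example; any fixed ordering of the 62 characters works as long as encoder and decoder agree.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62


def encode(n):
    """Turn a non-negative integer ID into a short base62 code."""
    if n < 0:
        raise ValueError("ID must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode(code):
    """Inverse of encode: recover the integer ID from a short code."""
    n = 0
    for ch in code:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Seven characters cover 62^7 (about 3.5 trillion) distinct IDs, which is why short codes of that length are usually sufficient.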
26 Compare Kafka and RabbitMQ. When would you use each?
Medium
Kafka: distributed log, high throughput (millions/sec), persistent storage, replay capability, consumer groups, ordered within partition. Best for: event streaming, log aggregation, real-time analytics. RabbitMQ: traditional message broker, flexible routing (exchanges, queues), message acknowledgment, lower latency for small messages. Best for: task queues, RPC, complex routing. Kafka: more operational complexity, larger infrastructure. RabbitMQ: easier setup, better for simpler use cases. Consider: throughput needs, data retention, exactly-once requirements.
27 What are different API pagination strategies and their trade-offs?
Medium
Offset pagination: page/offset parameters - simple but slow for large offsets, inconsistent with data changes. Cursor-based: opaque cursor pointing to position - consistent, efficient, but no random access. Keyset: filter by last seen key - efficient, handles inserts well, requires sortable unique field. Time-based: filter by timestamp - good for chronological data. Response: include total_count, next_cursor, has_more. Choose based on: data characteristics, consistency needs, performance requirements. Cursor-based preferred for infinite scroll; offset for traditional pagination.
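Keyset pagination can be illustrated with SQLite from the standard library. The `posts` table, its columns, and the helper names are assumptions made for this sketch. The key idea is filtering on the last seen key instead of using OFFSET.

```python
import sqlite3


def make_demo_db(rows=7):
    """Hypothetical demo table with sequential integer IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO posts (id, title) VALUES (?, ?)",
        [(i, f"post {i}") for i in range(1, rows + 1)],
    )
    return conn


def fetch_page(conn, after_id=0, limit=3):
    """Return (items, next_cursor); next_cursor is None on the last page."""
    # Fetch one extra row to learn whether another page exists.
    rows = conn.execute(
        "SELECT id, title FROM posts WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, limit + 1),
    ).fetchall()
    has_more = len(rows) > limit
    items = rows[:limit]
    next_cursor = items[-1][0] if has_more else None
    return items, next_cursor
```

Because the query seeks directly to `id > ?` through the primary key index, the cost of a page does not grow with how deep into the result set the client has scrolled.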
28 How would you design a blob storage system like S3?
Medium
Requirements: store/retrieve large files, high durability, scalability. Architecture: metadata service (mapping object names to storage locations), storage nodes (actual data, chunked for large files), load balancer. Features: versioning, encryption, access control, lifecycle policies. Durability: replicate across nodes/regions, erasure coding (store data + parity, reconstruct from subset). Scaling: partition by hash of object name. CDN integration for frequent access. Consistency: eventual for better availability. Consider: multipart upload, range requests, storage tiers (hot/cold).
29 How do you handle transactions across microservices?
Medium
Challenges: no single ACID database, network failures, partial failures. Patterns: Saga (sequence of local transactions with compensating actions - choreography or orchestration), Two-Phase Commit (2PC - coordinator ensures all-or-nothing, but blocking and slow), Event Sourcing + CQRS (events as truth, eventual consistency). Best practices: design for eventual consistency when possible, use idempotent operations, implement compensating transactions, track transaction state. Avoid distributed transactions when possible through service boundaries.
30 How would you design a notification system for a mobile app?
Medium
Components: notification service (create/schedule), delivery workers (by channel), device token management, preference service. Channels: push (FCM/APNS), email, SMS, in-app. Design: event triggers notification, queue for processing, fan-out to channels, track delivery/opens. Challenges: high volume (batch, rate limit), device token management (refresh, multiple devices), preferences (user opt-outs), templating, scheduling. Scale: partition by user, horizontal workers, prioritize real-time vs batch. Analytics: delivery rates, open rates, engagement.
31 How do you handle content updates with a CDN?
Medium
Strategies: cache versioning (include version/hash in URL - new URL bypasses cache), TTL-based (set appropriate expiration - balance freshness vs hit rate), purge/invalidation API (explicitly clear cache - propagation delay), stale-while-revalidate (serve stale while fetching fresh). Best practices: long TTL + versioned URLs for static assets, shorter TTL for dynamic content, use cache tags for grouped invalidation. Consider: invalidation cost, edge propagation time, origin shield (reduce origin load). Different strategies for different content types.
32 What is leader election and how is it implemented in distributed systems?
Medium
Leader election selects one node as coordinator for a task, ensuring single active leader. Uses: distributed locks, primary database replica, job scheduling. Algorithms: Raft/Paxos (consensus-based), ZooKeeper (ephemeral nodes + watches), etcd (lease-based), Redis (SETNX with expiry). Requirements: detect leader failure, elect new leader quickly, prevent split-brain. Implementation: acquire lock/lease, maintain heartbeat, others monitor, re-election on failure. Fencing tokens prevent stale leaders from taking actions. Critical for consistency in distributed systems.
33 Why is database connection pooling important and how do you configure it?
Medium
Connection pooling maintains reusable database connections, avoiding overhead of creating new connections per request. Benefits: reduced latency (no connection setup), resource efficiency (limit connections), connection management. Configuration: min/max pool size (based on load and database limits), connection timeout, idle timeout, validation query. Sizing: too small causes waiting; too large overloads database. Tools: HikariCP (in-process pool for Java), PgBouncer (external pooler for PostgreSQL). Monitor: pool exhaustion, wait times. Essential for high-throughput applications.
34 How do you design a reliable webhook system?
Medium
Webhooks deliver events to external URLs. Design: event queue (decouple from main flow), delivery workers, retry logic (exponential backoff), delivery logging. Reliability: at-least-once delivery (retry failures), idempotency keys (receiver dedupes), signatures (HMAC for authenticity), timeouts (protect against slow receivers). Management: registration API, secret rotation, pause/resume, delivery history. Challenges: receiver errors, high volume, security. Features: filtering (subscribe to specific events), batching, test endpoints. SLAs: define max delivery time, retry policy.
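The HMAC signature check mentioned above can be sketched with the standard library. The function names and the hex encoding are assumptions for this example; providers differ in header names, encodings, and whether a timestamp is included in the signed payload.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Sender side: compute an HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiver side: recompute and compare in constant time."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected, signature)
```

The receiver should verify against the raw bytes it received, since re-serializing the JSON can change whitespace or key order and break the signature.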
35 What is request coalescing and when should you use it?
Medium
Request coalescing combines multiple identical concurrent requests into one, sharing the result - prevents thundering herd on cache miss. Implementation: for same key/request, first caller proceeds, others wait and receive same result. Use cases: cache population (single DB query on miss), API calls, expensive computations. Libraries: singleflight (Go, golang.org/x/sync/singleflight). Related: request collapsing (batch multiple different requests into one call). Caution: error handling (propagate to all waiters), timeout handling, lock contention. Essential for high-concurrency cache systems.
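A thread-based sketch of the "first caller proceeds, others wait" behaviour is shown below. It is loosely modelled on the singleflight idea, but the `SingleFlight` class, its `do` method, and the `waiters` counter are written for this example and are not the Go library's API. Timeouts are left out for brevity.

```python
import threading


class _Call:
    """State shared between the leader and the callers waiting on it."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None
        self.waiters = 0


class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> in-flight _Call

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
            else:
                call.waiters += 1
        if leader:
            try:
                call.result = fn()      # only the leader does the expensive work
            except Exception as exc:
                call.error = exc        # propagated to every waiter below
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result
```

Callers that arrive after the in-flight call has finished start a new one, so this deduplicates concurrent work and does not act as a cache.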
36 How would you design a distributed caching system like Memcached or Redis Cluster?
Hard
Architecture: client library for routing, cache nodes, cluster manager. Data distribution: consistent hashing with virtual nodes for balance. Replication: master-slave per partition for durability, async replication for speed. Cluster management: gossip protocol for membership, failure detection, automatic failover. Features: TTL expiration, eviction policies (LRU, LFU), memory management. Consistency: eventual for performance, read-your-writes via routing. Hot keys: local caching, key replication. Scaling: add nodes, automatic rebalancing. Monitor: hit rates, memory usage, latency percentiles.
37 How would you design a web search engine like Google?
Hard
Components: Web Crawler (BFS crawl, politeness, URL frontier, deduplication), Indexer (inverted index - word to document IDs, index sharding by term or document), Query Processor (parse query, retrieve matching docs, rank), Ranking (PageRank for authority, TF-IDF for relevance, ML models). Scale: billions of documents, distributed crawling, index partitioning. Serving: query routing, result aggregation, caching popular queries. Freshness: continuous crawling, prioritize by importance. Features: autocomplete, spell correction, personalization. Challenges: spam detection, query understanding, real-time indexing.
38 How would you design a real-time chat system like WhatsApp?
Hard
Architecture: connection servers (WebSocket), chat servers, message storage, push notification service. Real-time: WebSocket for online users, push notifications for offline. Message flow: client -> server -> recipient's connection server or push. Storage: message DB (by conversation, time-partitioned), media storage (S3), message queues. Features: delivery receipts (sent, delivered, read), group chat (fan-out), presence (online status). Scale: shard by user ID, connection affinity, message ID generation (Snowflake). E2E encryption: client-side encryption, key exchange. Offline: message queue per user.
39 How would you design a distributed rate limiter?
Hard
Algorithms: Token Bucket (tokens replenish at rate, request consumes token), Leaky Bucket (fixed output rate, queue excess), Sliding Window (count requests in rolling window). Distributed: centralized counter (Redis INCR with TTL) or local with sync. Challenges: clock synchronization, race conditions (Lua scripts for atomicity), hot partitions. Architecture: embedded in service, API gateway, or standalone service. Features: per-user, per-API, burst handling, graduated limits. Response: 429 Too Many Requests, Retry-After header, X-RateLimit headers. Accuracy vs performance trade-off.
40 How would you design a video streaming platform like YouTube?
Hard
Upload pipeline: client upload -> transcoding service (multiple resolutions, formats), thumbnail generation, metadata extraction, CDN distribution. Storage: blob storage for videos, CDN for delivery, database for metadata. Streaming: adaptive bitrate (HLS/DASH), CDN edge servers, origin shield. Features: recommendations (ML service), search (Elasticsearch), comments, likes, subscriptions. Scale: read-heavy (cache metadata), geographic distribution, hot video caching. Transcoding: distributed workers, priority queues. Analytics: view counts (eventually consistent counters), watch time. Monetization: ad insertion service.
41 How would you implement a distributed locking mechanism?
Hard
Requirements: mutual exclusion, deadlock avoidance, fault tolerance. Redis approach: SET key value NX EX (set if not exists with expiry), owner token for safe release. Redlock (multi-node): acquire lock on majority of independent nodes. ZooKeeper: ephemeral sequential nodes, watch for predecessor deletion. Challenges: lock holder failure (TTL), clock drift (fencing tokens), split-brain. Best practices: always set timeout, use fencing tokens to detect stale locks, prefer local coordination when possible. Libraries: Redisson, Curator. Consider: is strong consistency needed, or eventual consistency acceptable?
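The role of fencing tokens can be shown from the resource's side. In this sketch the lock service is assumed to hand out a strictly increasing token with each lock grant, and the protected store (`FencedStore`, a hypothetical name) rejects any write carrying an older token.

```python
class FencedStore:
    """Protected resource that remembers the highest fencing token it has seen."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        # A holder whose lock expired (e.g. after a long pause) presents an
        # older token than the current holder and is rejected here.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True
```

This check is what makes an expired lock harmless. The TTL alone frees the lock, but it cannot stop a paused former holder from writing once it resumes.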
42 How would you design a social media news feed like Facebook?
Hard
Approaches: Pull (query friends' posts at read time - slow for many friends), Push/Fanout (write post to all followers' feeds at write time - fast reads, write amplification), Hybrid (push for normal users, pull for celebrities). Architecture: post service, feed service, graph service (followers), ranking service. Storage: posts DB, feed cache (sorted sets per user), activity log. Ranking: chronological + ML ranking (engagement prediction). Scale: denormalize, cache feeds, shard by user. Hot users: separate handling for high-follower accounts. Real-time: WebSocket for live updates.
43 How would you design a search autocomplete/typeahead system?
Hard
Data structure: Trie for prefix matching, with frequency counts at nodes. Architecture: prefix trees in memory, sharded by prefix ranges, replication for availability. Query flow: client sends prefix, service returns top-k suggestions. Ranking: frequency, recency, personalization, trending. Updates: offline batch processing for global frequencies, real-time updates for user history. Optimization: precompute top-k at each node, limit tree depth. Scale: horizontal sharding, CDN edge caching for common prefixes. Features: spelling correction (edit distance), phrase suggestions. Response time critical (<100ms).
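A small trie with frequency counts and a top-k lookup might look like this. The names `Autocomplete`, `add`, and `suggest` are hypothetical, and this version walks the subtree at query time. The precomputed top-k per node mentioned above would replace that walk in a latency-sensitive service.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # how often the query ending at this node was seen


class Autocomplete:
    def __init__(self):
        self.root = TrieNode()

    def add(self, query, count=1):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def suggest(self, prefix, k=3):
        # Walk down to the node for the prefix; no node means no suggestions.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every completed query below that node.
        found = []
        stack = [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.count:
                found.append((cur.count, text))
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        # Highest frequency first, ties broken alphabetically.
        found.sort(key=lambda item: (-item[0], item[1]))
        return [text for _, text in found[:k]]
```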
44 How would you design a ride-sharing system like Uber?
Hard
Services: rider app, driver app, matching service, pricing service, map/routing, payment. Real-time location: drivers send location updates, geospatial indexing (S2, Geohash), efficient radius queries. Matching: find nearby available drivers, ETA calculation, optimal assignment (Hungarian algorithm or greedy). Dispatch: WebSocket/push to driver, timeout and retry. Surge pricing: supply/demand per geohash cell. Data: trip history (time-series), driver locations (in-memory with persistence). Scale: city-based sharding, location service handles millions of updates. Consistency: ride state machine, idempotent state transitions.
45 How would you design a payment processing system?
Hard
Requirements: exactly-once processing, high availability, security, compliance. Flow: payment request -> validation -> authorization (payment gateway) -> capture -> settlement. Idempotency: unique transaction IDs, store processed transactions. Double-entry: every credit has debit, reconciliation. State machine: pending -> authorized -> captured -> settled. Failure handling: retry with backoff, compensation, manual review queue. Security: tokenization, PCI compliance, encryption. Features: refunds, disputes, multiple payment methods, fraud detection. Reconciliation: match with bank statements. Audit trail: immutable event log. Consistency critical.
46 How would you design a distributed task scheduling system?
Hard
Requirements: schedule tasks for future, recurring tasks, exactly-once execution, high availability. Architecture: scheduler service (determine when to run), executor workers (run tasks), task queue (pending tasks), task store (persistent). Scheduling: check upcoming tasks, push to queue at execution time. At-least-once: queue with acknowledgment, retry on failure, idempotent tasks. Exactly-once: in practice approximated as at-least-once delivery plus idempotent tasks; a distributed lock per task instance reduces duplicate runs but cannot rule them out if a lock expires mid-run. Recurring: cron parser, generate next run time. Scale: partition by task ID, multiple schedulers with leader election. Features: priorities, dependencies, rate limiting. Examples: Airflow, Celery Beat.
47 How would you design a metrics collection and monitoring system?
Hard
Pipeline: agents collect (StatsD, Prometheus scrapers) -> message queue (Kafka) -> stream processing (aggregation) -> time-series DB (InfluxDB, TimescaleDB) -> visualization (Grafana). Metrics types: counters, gauges, histograms. Aggregation: pre-aggregate at collection, rollups for longer retention. Query: downsampling for large ranges, caching. Alerting: rules engine, notification routing (PagerDuty). Scale: shard by metric name or tag combination, compression (gorilla encoding). Retention: high-resolution recent, rollups for historical. Cardinality: limit unique tag combinations. Distributed tracing integration: correlate metrics with traces.
48 How would you design a distributed file system like HDFS or GFS?
Hard
Architecture: master (namespace, metadata, chunk locations) and chunk servers (store data chunks). Files split into large chunks (64 MB in GFS, 128 MB by default in HDFS), replicated across chunk servers (typically 3 replicas). Write: client gets chunk servers from master, writes to primary which replicates. Read: client gets chunk locations, reads from nearest replica. Master: single (with hot standby), handles metadata only, uses operation log + checkpoints. Chunk servers: report chunk lists to master, heartbeat for failure detection. Consistency: lease-based write ordering, atomic record append. Fault tolerance: re-replicate on failure detection. GC: lazy deletion, background cleanup.
49 How would you design a real-time analytics dashboard system?
Hard
Architecture: event ingestion (Kafka) -> stream processing (Flink, Spark Streaming) -> serving layer (pre-aggregated results) -> dashboard (WebSocket updates). Processing: windowed aggregations (tumbling, sliding, session windows), exactly-once semantics. Serving: materialized views in fast store (Redis, Druid), denormalized for query patterns. Real-time updates: poll or WebSocket to frontend. Historical: Lambda architecture (batch + stream) or Kappa (stream only). Scale: partition by dimension, parallel processing. Challenges: late data (watermarks), high cardinality dimensions. Query: OLAP databases (ClickHouse, Druid) for ad-hoc analysis.
50 How would you design a distributed key-value store like DynamoDB?
Hard
Architecture: consistent hashing for partitioning, virtual nodes for load balance. Replication: configurable N replicas per key, quorum reads/writes (W+R>N so that every read quorum overlaps the latest write quorum). Conflict resolution: vector clocks + application-specified resolution, or last-write-wins. Storage engine: LSM tree for write optimization (in-memory + SSTables). Operations: get, put, delete with conditional operations (compare-and-swap). Consistency: tunable (eventual to strong via quorum). Failure handling: hinted handoff, anti-entropy with Merkle trees. Scale: add nodes, automatic rebalancing. Features: secondary indexes (local or global), TTL, transactions (as offered by DynamoDB). CAP: AP by default, CP possible with strict quorum.
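The W+R>N rule can be checked by brute force for small clusters. The helper below (a hypothetical name, written for this example) enumerates every possible write set and read set and tests whether they always share at least one replica.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """True if every write set of size w shares a replica with every read set of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )
```

With N=3, the common choice W=2, R=2 overlaps, so a read contacts at least one replica holding the latest acknowledged write. W=1, R=1 does not overlap, which is the eventually consistent configuration.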