System Design Interview Questions - Computer Science | Skill-Lync Resources

System Design Interview Questions

Scalability, microservices, distributed systems, caching, and load balancing

50 Questions
15 Easy
20 Medium
15 Hard
Scalability Concepts, Caching Strategies, Load Balancing, Database Design, Microservices, Message Queues, API Design, Distributed Systems
1

What is the difference between horizontal and vertical scaling?

Easy

Vertical scaling (scaling up) adds more resources (CPU, RAM) to existing servers - simpler, but it hits hardware limits and the server remains a single point of failure. Horizontal scaling (scaling out) adds more servers to distribute load - more complex, but it offers better fault tolerance, theoretically unlimited scale, and is often more cost-effective. Most large-scale systems use horizontal scaling. Stateless applications are easier to scale horizontally; stateful ones require data partitioning or session management.

Subtopic: Scalability Concepts
Relevant for: Software Engineer, Systems Architect, DevOps Engineer
2

What is a load balancer and why is it needed?

Easy

A load balancer distributes incoming network traffic across multiple servers to ensure no single server is overwhelmed. Benefits: improved availability (servers can fail), better performance (distribute load), scalability (add servers easily), health monitoring (route away from unhealthy servers). Types: hardware (F5), software (Nginx, HAProxy), cloud (AWS ALB/NLB). Algorithms: round-robin, least connections, IP hash. Essential component for any scalable web application.
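As a rough illustration of two of the algorithms named above, the sketch below implements round-robin and least-connections selection in memory. The class names and server names are hypothetical; a real load balancer would also handle health checks and concurrency.

```python
import itertools


class RoundRobinBalancer:
    """Cycles through the servers in a fixed order."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(list(servers))

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Picks the server with the fewest in-flight requests."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Ties go to the server that was registered first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1


rr = RoundRobinBalancer(["app-1", "app-2", "app-3"])
rr_picks = [rr.pick() for _ in range(4)]

lc = LeastConnectionsBalancer(["app-1", "app-2"])
first = lc.acquire()
second = lc.acquire()
lc.release(first)      # the first request finishes
third = lc.acquire()   # so its server is the least loaded again
```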

Subtopic: Load Balancing
Relevant for: Software Engineer, DevOps Engineer, Systems Engineer
3

What is caching and why is it important in system design?

Easy

Caching stores copies of frequently accessed data in faster storage (memory vs disk, closer to user). Benefits: reduced latency (faster responses), decreased database load, improved throughput, cost savings. Cache layers: browser cache, CDN, application cache (Redis, Memcached), database query cache. Key considerations: cache invalidation (TTL, events), cache-aside vs read-through, cold start, memory limits. Caching is fundamental for performance at scale.
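The cache-aside pattern with TTL-based invalidation mentioned above could look roughly like this. It is a minimal in-memory sketch: `TTLCache`, `get_user`, and `load_from_db` are hypothetical names, and in practice the cache would usually be Redis or Memcached.

```python
import time


class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def delete(self, key):
        # Event-based invalidation: call this when the underlying data changes.
        self._store.pop(key, None)


def get_user(user_id, cache, load_from_db):
    """Cache-aside read: check the cache, fall back to the DB, populate the cache."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached
    value = load_from_db(user_id)
    cache.set(user_id, value)
    return value
```

The injectable `clock` is only there so expiry can be exercised without sleeping.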

Subtopic: Caching Strategies
Relevant for: Software Engineer, Backend Developer, Systems Architect
4

When would you choose SQL vs NoSQL databases?

Easy

SQL databases (PostgreSQL, MySQL) for: structured data with relationships, complex queries and joins, ACID transactions, data integrity critical. NoSQL databases (MongoDB, Cassandra, Redis) for: flexible/evolving schemas, high write throughput, horizontal scaling, specific data models (document, key-value, graph). Consider: consistency requirements (CAP theorem), query patterns, scaling needs, team expertise. Many systems use both (polyglot persistence) for different use cases.

Subtopic: Database Design
Relevant for: Software Engineer, Backend Developer, Data Engineer
5

What are the key principles of RESTful API design?

Easy

REST principles: stateless (no server-side session), resource-based URLs (/users, /orders/{id}), HTTP methods for actions (GET read, POST create, PUT update, DELETE remove), standard status codes (200 OK, 201 Created, 404 Not Found), JSON for data exchange. Design practices: use nouns for resources, version APIs (/v1/), pagination for lists, consistent error format, HATEOAS for discoverability. RESTful APIs are the standard for web services.

Subtopic: API Design
Relevant for: Software Engineer, Backend Developer, API Developer
6

What are the benefits and challenges of microservices architecture?

Easy

Benefits: independent deployment, technology flexibility per service, team autonomy, fault isolation, scalability per service. Challenges: distributed system complexity, network latency, data consistency across services, operational overhead (deployment, monitoring), service discovery, debugging distributed transactions. Microservices require: mature DevOps practices, observability stack, clear service boundaries. Start with monolith if domain unclear. Microservices solve organizational scaling as much as technical.

Subtopic: Microservices
Relevant for: Software Engineer, Software Architect, Backend Developer
7

What is a message queue and when would you use one?

Easy

Message queues enable asynchronous communication between services by storing messages until consumers process them. Benefits: decoupling (producers/consumers independent), buffering (handle traffic spikes), reliability (messages persist until processed), scaling (multiple consumers). Use cases: order processing, sending emails/notifications, video processing, event streaming. Examples: RabbitMQ (traditional queuing), Kafka (event streaming), SQS (cloud). Essential for building resilient, scalable systems.

Subtopic: Message Queues
Relevant for: Software Engineer, Backend Developer, Systems Architect
8

What is database replication and why is it used?

Easy

Replication copies data across multiple database servers. Benefits: high availability (failover if primary fails), read scaling (distribute reads across replicas), data locality (replicas in different regions). Types: synchronous (waits for replicas, stronger consistency) vs asynchronous (faster, eventual consistency). Master-slave (writes to master only) vs master-master (writes to any). Challenges: replication lag, conflict resolution (multi-master). Most production databases use replication for availability.

Subtopic: Database Design
Relevant for: Backend Developer, Database Administrator, DevOps Engineer
9

What does it mean to design a stateless service?

Easy

A stateless service doesn't store client session state between requests - each request contains all the information needed to process it. Benefits: easier horizontal scaling (any server can handle any request), simpler failover (no session to migrate), better caching. State storage: externalize to database or cache (Redis). Stateless examples: REST APIs, serverless functions. Stateful examples: WebSocket connections, shopping carts (externalize cart state to Redis). Design for statelessness to maximize scalability and resilience.

Subtopic: Scalability Concepts
Relevant for: Software Engineer, Backend Developer, Systems Architect
10

How does a CDN improve application performance?

Easy

CDN (Content Delivery Network) caches content at edge locations worldwide, serving users from nearby servers. Performance benefits: reduced latency (shorter physical distance), decreased origin load, improved availability (distributed, redundant). Content types: static files (images, CSS, JS), video streaming, API responses (edge caching). Providers: Cloudflare, Akamai, CloudFront. Additional features: DDoS protection, SSL termination, compression. Essential for global applications and media delivery.

Subtopic: Caching Strategies
Relevant for: Software Engineer, DevOps Engineer, Frontend Developer
11

What is a single point of failure (SPOF) and how do you eliminate it?

Easy

SPOF is any component whose failure would bring down the entire system. Elimination strategies: redundancy (multiple instances), load balancing (distribute across instances), failover (automatic switch to backup), geographic distribution. Common SPOFs: single database server, single load balancer, single DNS provider, single region. Design for no single points: redundant everything critical, health checks, automatic failover. Cloud providers help with managed services. Accept some SPOF for non-critical components based on cost.

Subtopic: Scalability Concepts
Relevant for: Software Engineer, Systems Architect, DevOps Engineer
12

Why is API rate limiting important and how does it work?

Easy

Rate limiting controls request frequency to protect resources and ensure fair usage. Purpose: prevent abuse/DDoS, protect backend services, ensure availability for all users, manage costs. Implementation: track requests per API key/IP, return 429 Too Many Requests when exceeded. Algorithms: token bucket, leaky bucket, fixed/sliding window. Tools: API gateways, Nginx, Redis counters. Provide: clear limits in documentation, rate limit headers (X-RateLimit-Remaining), exponential backoff guidance.
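One way to picture the token bucket algorithm listed above is the following single-process sketch. `TokenBucket` and its parameters are illustrative names; a production limiter would keep one bucket per API key or IP and return 429 when `allow` is false.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.last_refill = clock()

    def allow(self, cost=1):
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller would answer with 429 Too Many Requests
```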

Subtopic: API Design
Relevant for: Backend Developer, API Developer, Systems Engineer
13

What is eventual consistency and when is it acceptable?

Easy

Eventual consistency means all replicas will eventually converge to the same value, but may temporarily show different values. Trade-off for availability and partition tolerance (CAP theorem). Acceptable when: stale reads are tolerable (social media feeds), high availability more important than immediate consistency, system can handle temporary inconsistency. Not acceptable: financial transactions, inventory (overselling), security-critical data. Design with conflict resolution (last-write-wins, merge). Common in distributed databases.

Subtopic: Distributed Systems
Relevant for: Software Engineer, Backend Developer, Systems Architect
14

What are health checks and why are they important?

Easy

Health checks verify service/component status for load balancing and alerting. Types: liveness (is process running), readiness (can handle traffic), deep health (dependencies working). Implementation: HTTP endpoint (/health, /ready), return status code 200 or 503. Used by: load balancers (route traffic away from unhealthy), container orchestrators (restart unhealthy containers), monitoring (alert on failures). Design: fast execution, check critical dependencies, don't cache. Essential for self-healing systems.

Subtopic: Scalability Concepts
Relevant for: Software Engineer, DevOps Engineer, SRE
15

What is service discovery and why is it needed in microservices?

Easy

Service discovery enables services to find each other's network locations without hardcoding. Necessary because: instances scale dynamically, IPs change, need load balancing across instances. Patterns: client-side (clients query registry and load balance - Eureka), server-side (load balancer queries registry - Kubernetes services). Registry: Consul, etcd, ZooKeeper, DNS-based. Services register on startup, deregister on shutdown, use heartbeats. Essential infrastructure for dynamic microservices environments.

Subtopic: Microservices
Relevant for: Software Engineer, Backend Developer, DevOps Engineer
16

What are different cache invalidation strategies?

Medium

Time-based: TTL (expiration after fixed time) - simple but stale data possible. Event-based: invalidate on data change - more complex but fresher data. Write-through: write to cache and database simultaneously. Write-behind: write to cache, async to database - faster but risk of data loss. Cache-aside: application manages cache (check cache, load from DB, update cache). Strategies: versioned keys (append version to key), tag-based invalidation, pub/sub for distributed invalidation. 'Cache invalidation is one of two hard problems in CS.'

Subtopic: Caching Strategies
Relevant for: Software Engineer, Backend Developer, Systems Architect
17

What are database sharding strategies and their trade-offs?

Medium

Sharding horizontally partitions data across databases. Strategies: Range-based (by value ranges - easy but hotspots), Hash-based (consistent hash of key - even distribution, no range queries), Directory-based (lookup table - flexible but overhead), Geographic (by region - data locality). Trade-offs: cross-shard queries are expensive, joins limited, transactions complex, rebalancing difficult. Implementation: application-level routing, proxy layer (Vitess), or native support (CockroachDB). Consider: shard key selection critical, avoid hotspots.

Subtopic: Database Design
Relevant for: Software Engineer, Database Architect, Backend Developer
18

What is the Circuit Breaker pattern and when should you use it?

Medium

Circuit Breaker prevents cascading failures by stopping calls to failing services. States: Closed (normal operation, tracking failures), Open (fails fast, doesn't call service), Half-Open (periodically tests if service recovered). Configuration: failure threshold, timeout, recovery time. Benefits: fail fast, prevent resource exhaustion, allow recovery time. Libraries: Hystrix, Resilience4j, Polly. Use for: external API calls, database connections, any unreliable dependency. Combine with: timeouts, retries with backoff, fallbacks.
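The three states described above can be sketched as a small wrapper around a call. This is an illustrative version, not the behavior of any particular library: `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the dependency is not called."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.recovery_seconds:
                raise CircuitOpenError("failing fast; dependency not called")
            self.state = "half_open"  # recovery time elapsed: allow one trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial call, or too many failures while closed, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.state = "closed"
        self.failures = 0
```

Callers would typically catch `CircuitOpenError` and return a fallback rather than surface the error.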

Subtopic: Microservices
Relevant for: Software Engineer, Backend Developer, Systems Architect
19

What is an API Gateway and what problems does it solve?

Medium

API Gateway is the single entry point for all client requests to microservices. Functions: routing (to appropriate service), authentication/authorization (centralized), rate limiting, request/response transformation, aggregation (combine multiple services), caching, logging/monitoring, SSL termination. Benefits: client simplification, cross-cutting concerns centralized, backend service protection. Examples: Kong, AWS API Gateway, Apigee. Considerations: potential bottleneck, single point of failure (needs redundancy), added latency.

Subtopic: Microservices
Relevant for: Software Engineer, Backend Developer, Systems Architect
20

What is event-driven architecture and what are its benefits?

Medium

Event-driven architecture uses events to trigger and communicate between decoupled services. Components: event producers, event channel (message broker), event consumers. Patterns: event notification (minimal data, consumer queries), event-carried state transfer (full data), event sourcing (events as source of truth). Benefits: loose coupling, scalability, real-time processing, audit trail. Challenges: eventual consistency, debugging complexity, event ordering. Technologies: Kafka, RabbitMQ, AWS EventBridge. Foundation for reactive systems.

Subtopic: Message Queues
Relevant for: Software Engineer, Systems Architect, Backend Developer
21

How does consistent hashing work and why is it used?

Medium

Consistent hashing distributes data across nodes such that adding/removing nodes minimally affects data distribution. Keys and nodes are mapped to a hash ring; each key is assigned to the nearest node clockwise. Adding node: only keys between the new node and its predecessor are remapped. Benefits: minimal redistribution (on average about K/n keys move when a node is added, K=keys, n=nodes), scalability, load balancing. Virtual nodes: multiple positions per node for even distribution. Used in: distributed caches, databases (DynamoDB), CDNs. Essential for distributed systems.
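A minimal sketch of a hash ring with virtual nodes follows. The choice of MD5 as the ring hash, 100 virtual nodes per server, and the `HashRing` name are assumptions made for illustration; hash collisions between ring positions are ignored.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Maps a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._positions = []  # sorted ring positions
        self._owners = {}     # position -> node
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node gets `vnodes` positions so load spreads more evenly.
        for i in range(self.vnodes):
            pos = ring_hash(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owners[pos] = node

    def remove(self, node):
        self._positions = [p for p in self._positions if self._owners[p] != node]
        self._owners = {p: n for p, n in self._owners.items() if n != node}

    def lookup(self, key):
        if not self._positions:
            raise LookupError("ring is empty")
        # First position at or after the key's hash, wrapping around the ring.
        idx = bisect.bisect_left(self._positions, ring_hash(key)) % len(self._positions)
        return self._owners[self._positions[idx]]
```

With this structure, adding a node only takes keys away from existing nodes; no key moves between two nodes that were already present.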

Subtopic: Distributed Systems
Relevant for: Software Engineer, Systems Architect, Backend Developer
22

What is idempotency and how do you design idempotent APIs?

Medium

Idempotency means repeated identical requests have same effect as single request - safe to retry. Essential for: payment processing, order creation, any state-changing operation. Implementation: idempotency keys (client-generated unique ID), store processed keys with result, return cached result on duplicate. HTTP methods: GET, PUT, DELETE naturally idempotent; POST needs explicit handling. Storage: Redis with TTL, database table. Response: return same result for duplicate request. Critical for reliable distributed systems.
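The idempotency-key flow above might be sketched as follows. `OrderService` and its fields are hypothetical, and the in-memory dict stands in for Redis or a database table; a real implementation would need the check-and-store step to be atomic.

```python
import uuid


class OrderService:
    def __init__(self):
        self._responses = {}  # idempotency_key -> stored response
        self.orders = []

    def create_order(self, idempotency_key, item, quantity):
        # Duplicate request: replay the stored result instead of acting again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        order = {"order_id": len(self.orders) + 1, "item": item, "quantity": quantity}
        self.orders.append(order)
        self._responses[idempotency_key] = order
        return order


service = OrderService()
key = str(uuid.uuid4())  # generated once by the client and reused on every retry
first = service.create_order(key, "book", 2)
retry = service.create_order(key, "book", 2)  # e.g. after a network timeout
```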

Subtopic: API Design
Relevant for: Backend Developer, API Developer, Systems Architect
23

How do you implement read/write splitting for database scaling?

Medium

Read/write splitting directs writes to the primary and reads to replicas. Implementation: application-level routing, proxy (ProxySQL for MySQL, Pgpool-II for PostgreSQL), framework support. Considerations: replication lag (writes may not be immediately readable), session consistency (route user's reads to same replica or primary after write), query classification. Patterns: read-your-writes (read from primary after write), monotonic reads (sticky sessions). Benefits: scale reads independently, reduce primary load. Common for read-heavy workloads.

Subtopic: Database Design
Relevant for: Backend Developer, Database Administrator, Systems Architect
24

How do you design a distributed logging system?

Medium

Distributed logging collects and centralizes logs from all services. Architecture: log agents (Fluentd, Filebeat) on each node, message queue for buffering (Kafka), log aggregator (Logstash, Fluentd), storage (Elasticsearch), visualization (Kibana, Grafana). Design considerations: structured logging (JSON), correlation IDs (trace across services), log levels, retention policies, sampling for high-volume. ELK Stack common. Cloud options: CloudWatch, Datadog. Essential for debugging distributed systems.

Subtopic: Microservices
Relevant for: DevOps Engineer, SRE, Software Engineer
25

How would you design a URL shortener like bit.ly?

Medium

Requirements: shorten URLs, redirect to original, analytics. Design: Generate short code (base62 encoding of incremented ID or hash), store mapping in database (short_code -> original_url), redirect service (lookup, 301/302 redirect). Scaling: read-heavy (cache mappings in Redis), database sharding by short_code, CDN for redirects. Short code generation: counter service, random with collision check. Features: custom aliases, expiration, click analytics. Handle: high read throughput, URL validation, abuse prevention.
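The base62 encoding of an incremented ID mentioned above can be sketched like this. The alphabet ordering is an arbitrary assumption; any fixed ordering of the 62 characters works as long as encode and decode agree.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode_base62(n: int) -> str:
    """Turns a non-negative counter value into a short code."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode_base62(code: str) -> int:
    """Inverse of encode_base62."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Seven characters cover 62**7 (about 3.5 trillion) distinct IDs, which is why short codes stay short even at large scale.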

Subtopic: Scalability Concepts
Relevant for: Software Engineer, Backend Developer, Systems Architect
26

Compare Kafka and RabbitMQ. When would you use each?

Medium

Kafka: distributed log, high throughput (millions/sec), persistent storage, replay capability, consumer groups, ordered within partition. Best for: event streaming, log aggregation, real-time analytics. RabbitMQ: traditional message broker, flexible routing (exchanges, queues), message acknowledgment, lower latency for small messages. Best for: task queues, RPC, complex routing. Kafka: more operational complexity, larger infrastructure. RabbitMQ: easier setup, better for simpler use cases. Consider: throughput needs, data retention, exactly-once requirements.

Subtopic: Message Queues
Relevant for: Software Engineer, Backend Developer, Data Engineer
27

What are different API pagination strategies and their trade-offs?

Medium

Offset pagination: page/offset parameters - simple but slow for large offsets, inconsistent with data changes. Cursor-based: opaque cursor pointing to position - consistent, efficient, but no random access. Keyset: filter by last seen key - efficient, handles inserts well, requires sortable unique field. Time-based: filter by timestamp - good for chronological data. Response: include total_count, next_cursor, has_more. Choose based on: data characteristics, consistency needs, performance requirements. Cursor-based preferred for infinite scroll; offset for traditional pagination.
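To make the cursor and keyset ideas concrete, here is a small in-memory sketch. The list comprehension plays the role of a query such as `WHERE id > :after_id ORDER BY id LIMIT :limit + 1`; the function names and the base64-encoded JSON cursor format are assumptions.

```python
import base64
import json


def encode_cursor(last_id):
    """Wraps the last seen key in an opaque token."""
    return base64.urlsafe_b64encode(json.dumps({"after_id": last_id}).encode()).decode()


def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["after_id"]


def fetch_page(rows, cursor=None, limit=2):
    """Keyset pagination over rows that carry a unique, sortable 'id'."""
    after_id = decode_cursor(cursor) if cursor else None
    matching = [
        row for row in sorted(rows, key=lambda r: r["id"])
        if after_id is None or row["id"] > after_id
    ]
    window = matching[: limit + 1]  # fetch one extra row to detect has_more
    items = window[:limit]
    has_more = len(window) > limit
    return {
        "items": items,
        "has_more": has_more,
        "next_cursor": encode_cursor(items[-1]["id"]) if has_more else None,
    }
```

Because each page is defined by the last seen key rather than an offset, rows inserted before the cursor do not shift later pages.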

Subtopic: API Design
Relevant for: Backend Developer, API Developer, Software Engineer
28

How would you design a blob storage system like S3?

Medium

Requirements: store/retrieve large files, high durability, scalability. Architecture: metadata service (mapping object names to storage locations), storage nodes (actual data, chunked for large files), load balancer. Features: versioning, encryption, access control, lifecycle policies. Durability: replicate across nodes/regions, erasure coding (store data + parity, reconstruct from subset). Scaling: partition by hash of object name. CDN integration for frequent access. Consistency: eventual for better availability. Consider: multipart upload, range requests, storage tiers (hot/cold).

Subtopic: Distributed Systems
Relevant for: Software Engineer, Storage Engineer, Systems Architect
29

How do you handle transactions across microservices?

Medium

Challenges: no single ACID database, network failures, partial failures. Patterns: Saga (sequence of local transactions with compensating actions - choreography or orchestration), Two-Phase Commit (2PC - coordinator ensures all-or-nothing, but blocking and slow), Event Sourcing + CQRS (events as truth, eventual consistency). Best practices: design for eventual consistency when possible, use idempotent operations, implement compensating transactions, track transaction state. Avoid distributed transactions when possible through service boundaries.

Subtopic: Microservices
Relevant for: Software Engineer, Systems Architect, Backend Developer
30

How would you design a notification system for a mobile app?

Medium

Components: notification service (create/schedule), delivery workers (by channel), device token management, preference service. Channels: push (FCM/APNs), email, SMS, in-app. Design: event triggers notification, queue for processing, fan-out to channels, track delivery/opens. Challenges: high volume (batch, rate limit), device token management (refresh, multiple devices), preferences (user opt-outs), templating, scheduling. Scale: partition by user, horizontal workers, prioritize real-time vs batch. Analytics: delivery rates, open rates, engagement.

Subtopic: Message Queues
Relevant for: Software Engineer, Backend Developer, Mobile Developer
31

How do you handle content updates with a CDN?

Medium

Strategies: cache versioning (include version/hash in URL - new URL bypasses cache), TTL-based (set appropriate expiration - balance freshness vs hit rate), purge/invalidation API (explicitly clear cache - propagation delay), stale-while-revalidate (serve stale while fetching fresh). Best practices: long TTL + versioned URLs for static assets, shorter TTL for dynamic content, use cache tags for grouped invalidation. Consider: invalidation cost, edge propagation time, origin shield (reduce origin load). Different strategies for different content types.

Subtopic: Caching Strategies
Relevant for: Software Engineer, DevOps Engineer, Frontend Developer
32

What is leader election and how is it implemented in distributed systems?

Medium

Leader election selects one node as coordinator for a task, ensuring single active leader. Uses: distributed locks, primary database replica, job scheduling. Algorithms: Raft/Paxos (consensus-based), ZooKeeper (ephemeral nodes + watches), etcd (lease-based), Redis (SETNX with expiry). Requirements: detect leader failure, elect new leader quickly, prevent split-brain. Implementation: acquire lock/lease, maintain heartbeat, others monitor, re-election on failure. Fencing tokens prevent stale leaders from taking actions. Critical for consistency in distributed systems.

Subtopic: Distributed Systems
Relevant for: Software Engineer, Systems Architect, Distributed Systems Engineer
33

Why is database connection pooling important and how do you configure it?

Medium

Connection pooling maintains reusable database connections, avoiding the overhead of creating a new connection per request. Benefits: reduced latency (no connection setup), resource efficiency (limit connections), connection management. Configuration: min/max pool size (based on load and database limits), connection timeout, idle timeout, validation query. Sizing: too small causes waiting; too large overloads database. Tools: HikariCP (in-process pool for Java), PgBouncer (external pooler for PostgreSQL). Monitor: pool exhaustion, wait times. Essential for high-throughput applications.

Subtopic: Database Design
Relevant for: Backend Developer, Database Administrator, DevOps Engineer
34

How do you design a reliable webhook system?

Medium

Webhooks deliver events to external URLs. Design: event queue (decouple from main flow), delivery workers, retry logic (exponential backoff), delivery logging. Reliability: at-least-once delivery (retry failures), idempotency keys (receiver dedupes), signatures (HMAC for authenticity), timeouts (protect against slow receivers). Management: registration API, secret rotation, pause/resume, delivery history. Challenges: receiver errors, high volume, security. Features: filtering (subscribe to specific events), batching, test endpoints. SLAs: define max delivery time, retry policy.
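Two of the reliability pieces above, HMAC signatures and exponential backoff, can be sketched with the standard library. The function names, the SHA-256 choice, and the default delays are assumptions; real senders usually also add random jitter to the retry delays.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Sender side: compute the signature sent alongside the webhook body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiver side: recompute and compare in constant time."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, signature)


def backoff_schedule(base_seconds=1.0, factor=2.0, max_seconds=60.0, attempts=5):
    """Delays between retries: base * factor**i, capped at max_seconds."""
    return [min(max_seconds, base_seconds * factor ** i) for i in range(attempts)]
```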

Subtopic: API Design
Relevant for: Backend Developer, API Developer, Systems Engineer
35

What is request coalescing and when should you use it?

Medium

Request coalescing combines multiple identical concurrent requests into one, sharing the result - prevents thundering herd on cache miss. Implementation: for same key/request, first caller proceeds, others wait and receive same result. Use cases: cache population (single DB query on miss), API calls, expensive computations. Libraries: singleflight (Go, golang.org/x/sync/singleflight). Also: request collapsing (batch multiple different requests). Caution: error handling (propagate to all waiters), timeout handling, lock contention. Essential for high-concurrency cache systems.
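A thread-based sketch of the pattern is shown below. It is loosely modeled on the idea behind Go's singleflight but is not a port of it; the `SingleFlight` class and its `do` method are hypothetical, and waiter timeouts are left out.

```python
import threading


class _Call:
    """Shared state for one in-flight computation."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    def __init__(self):
        self._lock = threading.Lock()
        self._calls = {}  # key -> _Call currently in flight

    def do(self, key, fn):
        with self._lock:
            call = self._calls.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._calls[key] = call
        if leader:
            # The first caller runs fn; everyone else waits for its outcome.
            try:
                call.result = fn()
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._calls[key]
                call.done.set()
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error  # propagate the failure to every waiter
        return call.result
```

A cache layer would call `do(cache_key, load_from_db)` on a miss, so that concurrent misses for the same key produce a single database query.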

Subtopic: Caching Strategies
Relevant for: Software Engineer, Backend Developer, Performance Engineer
36

How would you design a distributed caching system like Memcached or Redis Cluster?

Hard

Architecture: client library for routing, cache nodes, cluster manager. Data distribution: consistent hashing with virtual nodes for balance. Replication: master-slave per partition for durability, async replication for speed. Cluster management: gossip protocol for membership, failure detection, automatic failover. Features: TTL expiration, eviction policies (LRU, LFU), memory management. Consistency: eventual for performance, read-your-writes via routing. Hot keys: local caching, key replication. Scaling: add nodes, automatic rebalancing. Monitor: hit rates, memory usage, latency percentiles.

Subtopic: Distributed Systems
Relevant for: Senior Software Engineer, Systems Architect, Infrastructure Engineer
37

How would you design a web search engine like Google?

Hard

Components: Web Crawler (BFS crawl, politeness, URL frontier, deduplication), Indexer (inverted index - word to document IDs, index sharding by term or document), Query Processor (parse query, retrieve matching docs, rank), Ranking (PageRank for authority, TF-IDF for relevance, ML models). Scale: billions of documents, distributed crawling, index partitioning. Serving: query routing, result aggregation, caching popular queries. Freshness: continuous crawling, prioritize by importance. Features: autocomplete, spell correction, personalization. Challenges: spam detection, query understanding, real-time indexing.
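The inverted index at the heart of the indexer can be illustrated with a toy example. It assumes whitespace tokenization and AND semantics for multi-term queries, with no ranking; `build_index` and `search` are hypothetical names.

```python
from collections import defaultdict


def build_index(docs):
    """Maps each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Returns IDs of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "distributed systems design",
    2: "systems programming",
    3: "design patterns",
}
index = build_index(docs)
```

At real scale the posting lists would be sharded by term or by document, as the answer notes, and a ranking stage would order the matches.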

Subtopic: Scalability Concepts
Relevant for: Senior Software Engineer, Systems Architect, Search Engineer
38

How would you design a real-time chat system like WhatsApp?

Hard

Architecture: connection servers (WebSocket), chat servers, message storage, push notification service. Real-time: WebSocket for online users, push notifications for offline. Message flow: client -> server -> recipient's connection server or push. Storage: message DB (by conversation, time-partitioned), media storage (S3), message queues. Features: delivery receipts (sent, delivered, read), group chat (fan-out), presence (online status). Scale: shard by user ID, connection affinity, message ID generation (Snowflake). E2E encryption: client-side encryption, key exchange. Offline: message queue per user.

Subtopic: Scalability Concepts
Relevant for: Senior Software Engineer, Systems Architect, Backend Developer
39

How would you design a distributed rate limiter?

Hard

Algorithms: Token Bucket (tokens replenish at rate, request consumes token), Leaky Bucket (fixed output rate, queue excess), Sliding Window (count requests in rolling window). Distributed: centralized counter (Redis INCR with TTL) or local with sync. Challenges: clock synchronization, race conditions (Lua scripts for atomicity), hot partitions. Architecture: embedded in service, API gateway, or standalone service. Features: per-user, per-API, burst handling, graduated limits. Response: 429 Too Many Requests, Retry-After header, X-RateLimit headers. Accuracy vs performance trade-off.
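The sliding window variant can be sketched as a per-client log of request timestamps. This version keeps state in process memory purely for illustration; in the distributed setting described above, the same bookkeeping would live in a shared store such as Redis and be updated atomically. The class name and return shape are assumptions.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allows at most `limit` requests per client in any rolling window."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        self._hits = defaultdict(deque)  # client_id -> request timestamps

    def allow(self, client_id):
        now = self.clock()
        hits = self._hits[client_id]
        # Drop timestamps that have fallen out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            retry_after = hits[0] + self.window - now
            return False, retry_after  # maps to 429 plus a Retry-After header
        hits.append(now)
        return True, 0.0
```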

Subtopic: API Design
Relevant for: Senior Software Engineer, Backend Developer, Systems Architect
40

How would you design a video streaming platform like YouTube?

Hard

Upload pipeline: client upload -> transcoding service (multiple resolutions, formats), thumbnail generation, metadata extraction, CDN distribution. Storage: blob storage for videos, CDN for delivery, database for metadata. Streaming: adaptive bitrate (HLS/DASH), CDN edge servers, origin shield. Features: recommendations (ML service), search (Elasticsearch), comments, likes, subscriptions. Scale: read-heavy (cache metadata), geographic distribution, hot video caching. Transcoding: distributed workers, priority queues. Analytics: view counts (eventually consistent counters), watch time. Monetization: ad insertion service.

Subtopic: Scalability Concepts
Relevant for: Senior Software Engineer, Systems Architect, Streaming Engineer
41

How would you implement a distributed locking mechanism?

Hard

Requirements: mutual exclusion, deadlock avoidance, fault tolerance. Redis approach: SET key value NX EX (set if not exists with expiry), owner token for safe release. Redlock (multi-node): acquire lock on majority of independent nodes. ZooKeeper: ephemeral sequential nodes, watch for predecessor deletion. Challenges: lock holder failure (TTL), clock drift (fencing tokens), split-brain. Best practices: always set timeout, use fencing tokens to detect stale locks, prefer local coordination when possible. Libraries: Redisson, Curator. Consider: is strong consistency needed, or eventual consistency acceptable?

Subtopic: Distributed Systems
Relevant for: Senior Software Engineer, Distributed Systems Engineer, Systems Architect
42

How would you design a social media news feed like Facebook?

Hard

Approaches: Pull (query friends' posts at read time - slow for many friends), Push/Fanout (write post to all followers' feeds at write time - fast reads, write amplification), Hybrid (push for normal users, pull for celebrities). Architecture: post service, feed service, graph service (followers), ranking service. Storage: posts DB, feed cache (sorted sets per user), activity log. Ranking: chronological + ML ranking (engagement prediction). Scale: denormalize, cache feeds, shard by user. Hot users: separate handling for high-follower accounts. Real-time: WebSocket for live updates.
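The hybrid approach can be pictured with a toy in-memory model: posts from ordinary authors are pushed into follower feeds at write time, while posts from high-follower authors are pulled at read time. `FeedService`, the `CELEBRITY_THRESHOLD` value, and the data layout are all assumptions, and the sketch ignores follower counts that cross the threshold over time.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 2  # hypothetical cutoff; real systems use far larger numbers


class FeedService:
    def __init__(self):
        self.followers = defaultdict(set)         # author -> users following them
        self.following = defaultdict(set)         # user -> authors they follow
        self.posts_by_author = defaultdict(list)  # author -> [(timestamp, post_id)]
        self.feeds = defaultdict(list)            # user -> pushed [(timestamp, post_id)]

    def follow(self, user, author):
        self.followers[author].add(user)
        self.following[user].add(author)

    def _is_celebrity(self, author):
        return len(self.followers[author]) >= CELEBRITY_THRESHOLD

    def publish(self, author, post_id, timestamp):
        self.posts_by_author[author].append((timestamp, post_id))
        if not self._is_celebrity(author):
            # Push: write the post into every follower's feed now.
            for follower in self.followers[author]:
                self.feeds[follower].append((timestamp, post_id))

    def read_feed(self, user, limit=10):
        entries = list(self.feeds[user])
        # Pull: merge in posts from followed celebrities at read time.
        for author in self.following[user]:
            if self._is_celebrity(author):
                entries.extend(self.posts_by_author[author])
        entries.sort(reverse=True)  # newest first
        return [post_id for _, post_id in entries[:limit]]
```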

Subtopic: Scalability Concepts
Relevant for: Senior Software Engineer, Systems Architect, Backend Developer
43

How would you design a search autocomplete/typeahead system?

Hard

Data structure: Trie for prefix matching, with frequency counts at nodes. Architecture: prefix trees in memory, sharded by prefix ranges, replication for availability. Query flow: client sends prefix, service returns top-k suggestions. Ranking: frequency, recency, personalization, trending. Updates: offline batch processing for global frequencies, real-time updates for user history. Optimization: precompute top-k at each node, limit tree depth. Scale: horizontal sharding, CDN edge caching for common prefixes. Features: spelling correction (edit distance), phrase suggestions. Response time critical (<100ms).
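A small trie with frequency counts illustrates the query flow above. For brevity this sketch walks the subtree under the prefix at query time and sorts the results; it does not implement the precomputed top-k optimization. `Autocomplete` and its methods are hypothetical names.

```python
class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0  # how often the term ending at this node was searched


class Autocomplete:
    def __init__(self):
        self.root = TrieNode()

    def record(self, term, times=1):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.count += times

    def suggest(self, prefix, k=3):
        # Walk down to the node for the prefix, if it exists.
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # Collect every complete term below that node.
        found = []
        stack = [(node, prefix)]
        while stack:
            current, text = stack.pop()
            if current.count:
                found.append((text, current.count))
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        # Highest frequency first; ties broken alphabetically.
        found.sort(key=lambda item: (-item[1], item[0]))
        return [text for text, _ in found[:k]]
```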

Subtopic: Scalability Concepts
Relevant for: Senior Software Engineer, Search Engineer, Backend Developer
44

How would you design a ride-sharing system like Uber?

Hard

Services: rider app, driver app, matching service, pricing service, map/routing, payment. Real-time location: drivers send location updates, geospatial indexing (S2, Geohash), efficient radius queries. Matching: find nearby available drivers, ETA calculation, optimal assignment (Hungarian algorithm or greedy). Dispatch: WebSocket/push to driver, timeout and retry. Surge pricing: supply/demand per geohash cell. Data: trip history (time-series), driver locations (in-memory with persistence). Scale: city-based sharding, location service handles millions of updates. Consistency: ride state machine, idempotent state transitions.
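To make the location-index step concrete, here is a rough sketch that uses a fixed latitude/longitude grid as a simplified stand-in for S2 or Geohash cells. `CELL_DEG`, `DriverIndex`, and the coordinates are assumptions for illustration, and matching is the greedy nearest-first variant.

```python
import math
from collections import defaultdict

CELL_DEG = 0.01  # hypothetical cell size, roughly 1 km of latitude


def cell_of(lat, lon):
    return (math.floor(lat / CELL_DEG), math.floor(lon / CELL_DEG))


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


class DriverIndex:
    def __init__(self):
        self.position = {}              # driver_id -> (lat, lon)
        self.cells = defaultdict(set)   # cell -> ids of drivers currently in it

    def update(self, driver_id, lat, lon):
        """Apply a location update, moving the driver between cells if needed."""
        old = self.position.get(driver_id)
        if old is not None:
            self.cells[cell_of(*old)].discard(driver_id)
        self.position[driver_id] = (lat, lon)
        self.cells[cell_of(lat, lon)].add(driver_id)

    def nearest(self, lat, lon, limit=3):
        """Search the rider's cell and its 8 neighbours, closest driver first."""
        cx, cy = cell_of(lat, lon)
        candidates = set()
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                candidates |= self.cells.get((cx + dx, cy + dy), set())
        return sorted(
            candidates, key=lambda d: haversine_km(lat, lon, *self.position[d])
        )[:limit]


idx = DriverIndex()
idx.update("d1", 12.9716, 77.5946)
idx.update("d2", 12.9750, 77.5990)
idx.update("d3", 13.2000, 77.9000)          # far away, different cell
before = idx.nearest(12.9720, 77.5950)
idx.update("d1", 13.2000, 77.9000)          # d1 drives out of the area
after = idx.nearest(12.9720, 77.5950)
```

Only nine cells are scanned per query regardless of how many drivers exist in the city, which is the property the geospatial index is there to provide.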

Subtopic: Scalability Concepts
Relevant for: Senior Software Engineer, Systems Architect, Backend Developer
45

How would you design a payment processing system?

Hard

Requirements: exactly-once processing, high availability, security, compliance. Flow: payment request -> validation -> authorization (payment gateway) -> capture -> settlement. Idempotency: unique transaction IDs, store processed transactions so retries return the original result. Double-entry: every credit has a matching debit, which makes reconciliation possible. State machine: pending -> authorized -> captured -> settled. Failure handling: retry with backoff, compensation, manual review queue. Security: tokenization, PCI compliance, encryption. Features: refunds, disputes, multiple payment methods, fraud detection. Reconciliation: match with bank statements. Audit trail: immutable event log. Consistency is critical.
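The idempotency, state-machine, and double-entry points fit together as in the sketch below. It is an in-memory illustration under assumed names (`PaymentService`, `idempotency_key`, amounts in cents), not a real gateway integration.

```python
# Allowed forward transitions, taken from the state machine in the answer.
ALLOWED = {
    "pending": {"authorized"},
    "authorized": {"captured"},
    "captured": {"settled"},
    "settled": set(),
}


class PaymentService:
    def __init__(self):
        self.payments = {}   # idempotency_key -> payment record
        self.ledger = []     # append-only double-entry rows: (key, account, amount)

    def create(self, idempotency_key, amount_cents, payer, payee):
        """Create a payment; a retry with the same key returns the original."""
        existing = self.payments.get(idempotency_key)
        if existing is not None:
            return existing
        payment = {
            "key": idempotency_key,
            "amount_cents": amount_cents,
            "payer": payer,
            "payee": payee,
            "state": "pending",
        }
        self.payments[idempotency_key] = payment
        return payment

    def advance(self, idempotency_key, new_state):
        """Move to `new_state`; repeating the current state is a no-op."""
        payment = self.payments[idempotency_key]
        if payment["state"] == new_state:
            return payment                      # duplicate delivery
        if new_state not in ALLOWED[payment["state"]]:
            raise ValueError(f"{payment['state']} -> {new_state} not allowed")
        payment["state"] = new_state
        if new_state == "captured":
            amount = payment["amount_cents"]
            # Double entry: the debit and credit always sum to zero.
            self.ledger.append((idempotency_key, payment["payer"], -amount))
            self.ledger.append((idempotency_key, payment["payee"], amount))
        return payment


svc = PaymentService()
first = svc.create("order-1", 2500, payer="alice", payee="shop")
retry = svc.create("order-1", 2500, payer="alice", payee="shop")
svc.advance("order-1", "authorized")
svc.advance("order-1", "captured")
svc.advance("order-1", "captured")   # retried message, ledger is untouched
```

Because duplicates are absorbed at both `create` and `advance`, callers can retry freely, which is how at-least-once delivery ends up looking like exactly-once processing.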

Subtopic: Distributed Systems
Relevant for: Senior Software Engineer, Fintech Engineer, Systems Architect
46

How would you design a distributed task scheduling system?

Hard

Requirements: schedule tasks for future, recurring tasks, exactly-once execution, high availability. Architecture: scheduler service (determine when to run), executor workers (run tasks), task queue (pending tasks), task store (persistent). Scheduling: check upcoming tasks, push to queue at execution time. At-least-once: queue with acknowledgment, retry on failure, idempotent tasks. Exactly-once (in effect): at-least-once delivery plus a distributed lock or deduplication key per task instance. Recurring: cron parser, generate next run time. Scale: partition by task ID, multiple schedulers with leader election. Features: priorities, dependencies, rate limiting. Examples: Airflow, Celery Beat.
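The "check upcoming tasks, push to queue at execution time" loop can be sketched with a min-heap keyed by run time. This is a single-scheduler illustration: a fixed positive `interval` stands in for cron parsing, the `queue` list stands in for a real task queue, and the class name is hypothetical.

```python
import heapq
import itertools


class Scheduler:
    def __init__(self):
        self._heap = []                  # entries: (run_at, seq, name, interval)
        self._seq = itertools.count()    # tie-breaker for equal run times
        self.queue = []                  # stand-in for the task queue

    def schedule(self, name, run_at, interval=None):
        """Register a one-off task, or a recurring one if `interval` > 0 is given."""
        heapq.heappush(self._heap, (run_at, next(self._seq), name, interval))

    def tick(self, now):
        """Move every task due at or before `now` onto the queue."""
        dispatched = []
        while self._heap and self._heap[0][0] <= now:
            run_at, _, name, interval = heapq.heappop(self._heap)
            # (name, run_at) identifies this task instance; workers could use
            # it as a lock or deduplication key.
            self.queue.append((name, run_at))
            dispatched.append(name)
            if interval is not None:
                # Recurring: compute the next run from the scheduled time,
                # not from `now`, so runs do not drift.
                self.schedule(name, run_at + interval, interval)
        return dispatched


s = Scheduler()
s.schedule("report", run_at=10)
s.schedule("heartbeat", run_at=5, interval=5)
early = s.tick(4)     # nothing is due yet
due = s.tick(10)      # the missed heartbeat at t=5 is caught up as well
```

Note that a late tick dispatches every missed run of a recurring task; whether to catch up or skip missed runs is a policy choice the answer leaves open.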

Subtopic: Distributed Systems
Relevant for: Senior Software Engineer, Platform Engineer, Systems Architect
47

How would you design a metrics collection and monitoring system?

Hard

Pipeline: agents collect (StatsD, Prometheus scrapers) -> message queue (Kafka) -> stream processing (aggregation) -> time-series DB (InfluxDB, TimescaleDB) -> visualization (Grafana). Metrics types: counters, gauges, histograms. Aggregation: pre-aggregate at collection, rollups for longer retention. Query: downsampling for large ranges, caching. Alerting: rules engine, notification routing (PagerDuty). Scale: shard by metric name or tag combination, compression (Gorilla-style delta-of-delta timestamps and XOR-encoded values). Retention: high-resolution recent, rollups for historical. Cardinality: limit unique tag combinations. Distributed tracing integration: correlate metrics with traces.
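Rollups for longer retention amount to bucketing raw points by time and keeping a few aggregates per bucket. A minimal sketch, assuming integer timestamps in seconds and a hypothetical `rollup` helper:

```python
from collections import defaultdict


def rollup(points, bucket_seconds):
    """Downsample (timestamp, value) points into per-bucket min/max/sum/count."""
    buckets = defaultdict(
        lambda: {"min": float("inf"), "max": float("-inf"), "sum": 0.0, "count": 0}
    )
    for ts, value in points:
        start = ts - ts % bucket_seconds      # align to the bucket boundary
        b = buckets[start]
        b["min"] = min(b["min"], value)
        b["max"] = max(b["max"], value)
        b["sum"] += value
        b["count"] += 1
    return dict(sorted(buckets.items()))


raw = [(0, 1.0), (30, 3.0), (59, 2.0), (60, 10.0)]
per_minute = rollup(raw, bucket_seconds=60)
```

Storing `sum` and `count` instead of an average is a deliberate choice here: coarser rollups (for example minute to hour) can then be computed from finer ones without losing accuracy.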

Subtopic: Microservices
Relevant for: SRE, Platform Engineer, DevOps Engineer
48

How would you design a distributed file system like HDFS or GFS?

Hard

Architecture: master (namespace, metadata, chunk locations) and chunk servers (store data chunks). Files split into large chunks (64 MB in GFS, 128 MB by default in HDFS), replicated across chunk servers (typically 3 replicas). Write: client gets chunk servers from master, writes to primary which replicates. Read: client gets chunk locations, reads from nearest replica. Master: single (with hot standby), handles metadata only, uses operation log + checkpoints. Chunk servers: report chunk lists to master, heartbeat for failure detection. Consistency: lease-based write ordering, atomic record append. Fault tolerance: re-replicate on failure detection. GC: lazy deletion, background cleanup.
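The master's metadata role can be sketched as a map from chunks to replica locations plus a re-replication step. Everything below is a simplified assumption: placement picks the least-loaded servers (real systems also weigh racks and disk usage), and `Master`, `create_file`, and `handle_failure` are hypothetical names.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB
REPLICAS = 3


def chunk_ranges(file_size, chunk_size=CHUNK_SIZE):
    """Byte ranges [start, end) of each chunk in a file."""
    return [
        (off, min(off + chunk_size, file_size))
        for off in range(0, file_size, chunk_size)
    ]


class Master:
    """Holds metadata only: which servers store which chunk."""

    def __init__(self, servers):
        self.load = {s: 0 for s in servers}   # server -> number of chunks held
        self.locations = {}                   # (path, chunk_index) -> [servers]

    def _least_loaded(self, exclude=()):
        return sorted(
            (s for s in self.load if s not in exclude),
            key=lambda s: (self.load[s], s),
        )

    def create_file(self, path, file_size, chunk_size=CHUNK_SIZE):
        for index, _ in enumerate(chunk_ranges(file_size, chunk_size)):
            chosen = self._least_loaded()[:REPLICAS]   # first entry acts as primary
            for server in chosen:
                self.load[server] += 1
            self.locations[(path, index)] = chosen

    def handle_failure(self, dead):
        """Called when heartbeats stop: restore the replica count elsewhere."""
        del self.load[dead]
        for servers in self.locations.values():
            if dead in servers:
                servers.remove(dead)
                spares = self._least_loaded(exclude=servers)
                if spares:
                    servers.append(spares[0])
                    self.load[spares[0]] += 1


m = Master(["cs1", "cs2", "cs3", "cs4"])
m.create_file("/logs/a", file_size=250, chunk_size=100)   # tiny sizes for the demo
m.handle_failure("cs2")
```

File data never passes through `Master` in this sketch, which mirrors the metadata-only design that lets a single master serve a large cluster.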

Subtopic: Distributed Systems
Relevant for: Senior Software Engineer, Storage Engineer, Systems Architect
49

How would you design a real-time analytics dashboard system?

Hard

Architecture: event ingestion (Kafka) -> stream processing (Flink, Spark Streaming) -> serving layer (pre-aggregated results) -> dashboard (WebSocket updates). Processing: windowed aggregations (tumbling, sliding, session windows), exactly-once semantics. Serving: materialized views in fast store (Redis, Druid), denormalized for query patterns. Real-time updates: poll or WebSocket to frontend. Historical: Lambda architecture (batch + stream) or Kappa (stream only). Scale: partition by dimension, parallel processing. Challenges: late data (watermarks), high cardinality dimensions. Query: OLAP databases (ClickHouse, Druid) for ad-hoc analysis.
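Tumbling windows and watermark-based handling of late data can be shown in a few lines. This is a single-process sketch with assumed names, not the Flink or Spark API: the watermark is taken as the highest event time seen minus an `allowed_lateness`, and a window is finalized once the watermark passes its end.

```python
from collections import defaultdict


class TumblingCounter:
    def __init__(self, window, allowed_lateness):
        self.window = window
        self.allowed_lateness = allowed_lateness
        self.open = defaultdict(lambda: defaultdict(int))  # window_start -> key -> count
        self.watermark = float("-inf")
        self.dropped = 0    # events that arrived after their window was finalized

    def ingest(self, event_time, key):
        """Count one event; return the windows finalized as a result."""
        start = event_time - event_time % self.window
        if start + self.window <= self.watermark:
            self.dropped += 1                 # too late, window already emitted
        else:
            self.open[start][key] += 1
        self.watermark = max(self.watermark, event_time - self.allowed_lateness)
        closed = sorted(s for s in self.open if s + self.window <= self.watermark)
        return [(s, dict(self.open.pop(s))) for s in closed]


c = TumblingCounter(window=10, allowed_lateness=5)
r1 = c.ingest(1, "a")     # window [0, 10)
r2 = c.ingest(12, "a")    # window [10, 20); watermark moves to 7
r3 = c.ingest(8, "b")     # out of order but within lateness: still counted
r4 = c.ingest(16, "a")    # watermark moves to 11, so window [0, 10) is emitted
r5 = c.ingest(3, "a")     # arrives after [0, 10) was emitted: dropped
```

The lateness allowance trades freshness for completeness: a larger value admits more stragglers but delays every window's result on the dashboard.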

Subtopic: Distributed Systems
Relevant for: Data Engineer, Senior Software Engineer, Analytics Engineer
50

How would you design a distributed key-value store like DynamoDB?

Hard

Architecture: consistent hashing for partitioning, virtual nodes for load balance. Replication: configurable N replicas per key, quorum reads/writes (W+R>N so that read and write sets overlap). Conflict resolution: vector clocks + application-specified resolution, or last-write-wins. Storage engine: LSM tree for write optimization (in-memory + SSTables). Operations: get, put, delete with conditional operations (compare-and-swap). Consistency: tunable (eventual to strong via quorum). Failure handling: hinted handoff, anti-entropy with Merkle trees. Scale: add nodes, automatic rebalancing. Features: secondary indexes (local or global), TTL, transactions (as offered by DynamoDB). CAP: AP by default, CP possible with strict quorum.
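Consistent hashing with virtual nodes and the W+R>N rule can be sketched as below. The `HashRing` class, the `node#i` virtual-node naming, and the use of SHA-256 are illustrative assumptions rather than how any particular store implements its ring.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a position on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each physical node appears `vnodes` times to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]
        self._node_count = len(set(nodes))

    def preference_list(self, key, n=3):
        """First n distinct physical nodes clockwise from the key's position."""
        n = min(n, self._node_count)
        idx = bisect.bisect_right(self._points, _hash(key))
        result = []
        while len(result) < n:
            node = self._ring[idx % len(self._ring)][1]
            if node not in result:
                result.append(node)
            idx += 1
        return result


def quorum_overlaps(n, w, r):
    """True when every read quorum intersects every write quorum."""
    return w + r > n


ring = HashRing(["a", "b", "c", "d"])
replicas = ring.preference_list("user:42", n=3)

# Removing node "d" only reassigns keys whose primary was "d".
smaller = HashRing(["a", "b", "c"])
keys = [f"key-{i}" for i in range(200)]
moved = [
    k for k in keys
    if ring.preference_list(k, 1) != smaller.preference_list(k, 1)
]
```

With N=3, choosing W=2 and R=2 satisfies the overlap rule, while W=1 and R=1 favors latency and availability at the cost of possibly stale reads.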

Subtopic: Distributed Systems
Relevant for: Senior Software Engineer, Database Engineer, Systems Architect