System Design - Complete Interview Preparation Guide

Table of Contents

  1. Introduction to System Design
  2. Fundamental Concepts
  3. Core Components
  4. Scalability Patterns
  5. Database Design
  6. Caching Strategies
  7. Message Queues and Async Processing
  8. Microservices Architecture
  9. API Design
  10. Real-World System Design Examples
  11. Common Interview Questions
  12. Domain-Specific System Design Topics

Introduction to System Design

What is System Design?

System design is the process of defining the architecture, components, modules, interfaces, and data for a system to satisfy specified requirements. It involves making high-level decisions about how different parts of a system will work together.

Why is it important?



The System Design Interview Process

Typical Structure:

  1. Requirements Clarification (5 mins) - Ask questions to understand scope
  2. High-Level Design (10-15 mins) - Draw architecture diagram
  3. Deep Dive (15-20 mins) - Discuss specific components in detail
  4. Bottlenecks & Trade-offs (5-10 mins) - Identify limitations and improvements

Key Framework (RESHADE):


Fundamental Concepts

1. Scalability

Vertical Scaling (Scale Up)

Horizontal Scaling (Scale Out)

When to use what?

2. Latency vs Throughput

Latency

Throughput

Trade-off: Sometimes optimizing for low latency can reduce throughput and vice versa.

3. Availability vs Consistency (CAP Theorem)

CAP Theorem states that a distributed system can only guarantee 2 out of 3:

Consistency (C)

Availability (A)

Partition Tolerance (P)

Real-World Choices:

4. Consistency Models

Strong Consistency

Eventual Consistency

Read-after-Write Consistency

5. Load Balancing Fundamentals

What is Load Balancing?
Distributing incoming traffic across multiple servers to ensure no single server is overwhelmed.

Load Balancing Algorithms:

  1. Round Robin

    • Requests distributed sequentially
    • Simple and fair
    • Doesn't account for server capacity
    Request 1 → Server A
    Request 2 → Server B
    Request 3 → Server C
    Request 4 → Server A (cycle repeats)
    
  2. Weighted Round Robin

    • Servers with higher capacity get more requests
    • Server A (weight 3) gets 3 requests before Server B (weight 1) gets 1
  3. Least Connections

    • New request goes to server with fewest active connections
    • Good for long-lived connections
  4. Least Response Time

    • Routes to server with fastest response time
    • Accounts for server performance
  5. IP Hash

    • Client IP determines which server handles request
    • Ensures same client always goes to same server (session persistence)
    • Useful for stateful applications
  6. Random

    • Randomly select a server
    • Simple but can cause uneven distribution
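
The first two algorithms are short enough to sketch directly. This is a minimal illustration (server names are made up, and a real balancer would also track health checks and remove dead servers):

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand out servers in a fixed rotation, ignoring capacity."""
    def __init__(self, servers):
        self._cycle = cycle(servers)

    def next_server(self):
        return next(self._cycle)

class WeightedRoundRobinBalancer:
    """Expand each server into the rotation `weight` times."""
    def __init__(self, weighted_servers):
        expanded = []
        for server, weight in weighted_servers:
            expanded.extend([server] * weight)
        self._cycle = cycle(expanded)

    def next_server(self):
        return next(self._cycle)

rr = RoundRobinBalancer(["A", "B", "C"])
print([rr.next_server() for _ in range(4)])   # ['A', 'B', 'C', 'A']

wrr = WeightedRoundRobinBalancer([("A", 3), ("B", 1)])
print([wrr.next_server() for _ in range(4)])  # ['A', 'A', 'A', 'B']
```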

Layer 4 (Transport Layer) vs Layer 7 (Application Layer) Load Balancing:

Layer 4 Load Balancer

Layer 7 Load Balancer

6. Database Sharding

What is Sharding?
Splitting a large database into smaller, faster, more manageable pieces called shards.

Why Shard?

Sharding Strategies:

  1. Horizontal Sharding (Row-based)

    • Split rows across multiple databases
    • Example: Users 1-1000 on Shard1, Users 1001-2000 on Shard2
  2. Vertical Sharding (Column-based)

    • Split columns/tables across databases
    • Example: User profiles in DB1, User activity in DB2
  3. Hash-based Sharding

    shard_number = hash(user_id) % number_of_shards
    
    • Pros: Uniform distribution
    • Cons: Adding/removing shards is difficult
  4. Range-based Sharding

    • User IDs 1-10000 → Shard 1
    • User IDs 10001-20000 → Shard 2
    • Pros: Simple, easy to add shards
    • Cons: Can lead to uneven distribution (hot shards)
  5. Geographic Sharding

    • Users by region (India users → Asia shard)
    • Pros: Low latency for regional users
    • Cons: Complex to implement
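
Hash-based routing from the list above takes only a few lines, and a small experiment shows why "adding/removing shards is difficult": changing the shard count remaps most keys. This is an illustration, not a production router:

```python
import hashlib

def shard_for(user_id: int, num_shards: int) -> int:
    """Hash-based sharding: stable digest of the key, modulo the shard count."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_shards

# The classic drawback: going from 4 to 5 shards moves most keys
moved = sum(1 for uid in range(10_000)
            if shard_for(uid, 4) != shard_for(uid, 5))
print(f"keys remapped when going from 4 to 5 shards: {moved / 10_000:.0%}")
```

Because the routing function changes for almost every key, resharding with naive modulo hashing means mass data migration.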

Challenges:

7. Replication

What is Replication?
Maintaining copies of data across multiple servers for reliability and performance.

Types of Replication:

  1. Master-Slave (Primary-Replica)

    • One master handles writes
    • Multiple slaves handle reads
    • Master replicates data to slaves
    Client → Write → Master → Replicate → Slave 1, Slave 2
    Client → Read → Slave 1 or Slave 2
    
    • Pros: Scales reads, simple
    • Cons: Master is single point of failure, replication lag
  2. Master-Master (Multi-Master)

    • Multiple masters can handle writes
    • Masters sync with each other
    • Pros: No single point of failure, scales writes
    • Cons: Conflict resolution needed, complex
  3. Synchronous Replication

    • Write confirmed only after all replicas update
    • Pros: Strong consistency
    • Cons: Slower writes, one slow replica affects all
  4. Asynchronous Replication

    • Write confirmed immediately, replicas update later
    • Pros: Fast writes
    • Cons: Potential data loss, eventual consistency

8. Partitioning vs Sharding vs Replication

Partitioning

Sharding

Replication


Core Components

1. DNS (Domain Name System)

What it does:
Translates domain names (www.example.com) to IP addresses (192.168.1.1)

How it works:

  1. User types www.example.com
  2. Browser checks cache
  3. If not found, queries DNS resolver
  4. Resolver queries root server → TLD server → Authoritative server
  5. IP address returned to browser
  6. Browser connects to that IP

DNS Record Types:

TTL (Time To Live):

2. CDN (Content Delivery Network)

What it does:
Distributes static content (images, CSS, JS) across geographically distributed servers.

How it works:

  1. User in Mumbai requests image
  2. CDN routes to nearest edge server (Mumbai)
  3. If image not in cache, fetch from origin server
  4. Cache for future requests
  5. Serve to user (low latency!)

Benefits:

What to cache:

Popular CDNs:

3. Reverse Proxy

What it does:
Sits in front of web servers and forwards client requests.

Functions:

Reverse Proxy vs Forward Proxy:

Popular Tools:

4. Application Servers

What they do:
Run your business logic and handle API requests.

Stateless vs Stateful:

Stateless Servers

Stateful Servers

Best Practice: Keep application servers stateless for easier scaling.

5. Databases

Relational Databases (SQL)

NoSQL Databases

  1. Document Stores

    • Store JSON-like documents
    • Flexible schema
    • Example: MongoDB, CouchDB
    • Use when:
      • Schema flexibility needed
      • Hierarchical data (user profiles, catalogs)
      • Rapid development
  2. Key-Value Stores

    • Simple key-value pairs
    • Fast reads/writes
    • Example: Redis, DynamoDB
    • Use when:
      • Session management
      • Caching
      • Simple data structures
      • Need high performance
  3. Column-Family Stores

    • Store data in columns rather than rows
    • Good for analytics
    • Example: Cassandra, HBase
    • Use when:
      • Write-heavy workloads
      • Time-series data
      • Need high availability
  4. Graph Databases

    • Store relationships between entities
    • Example: Neo4j, Amazon Neptune
    • Use when:
      • Social networks
      • Recommendation engines
      • Fraud detection

Database Selection Criteria:


Scalability Patterns

1. Horizontal vs Vertical Scaling (Detailed)

Vertical Scaling Deep Dive:

Before: 4 cores, 8GB RAM, 500GB SSD
After:  16 cores, 64GB RAM, 2TB SSD

Horizontal Scaling Deep Dive:

Before: 1 server handling 1000 req/sec
After:  10 servers each handling 100 req/sec

2. Caching Patterns (Detailed in separate section)

3. Database Scaling Patterns

Read Replicas

Master (Writes) → Replicate → Replica 1 (Reads)
                            → Replica 2 (Reads)
                            → Replica 3 (Reads)

Write Sharding

CQRS (Command Query Responsibility Segregation)

4. Asynchronous Processing

Why Async?

Patterns:

  1. Fire and Forget

    User uploads image → Return 200 OK immediately
    → Background job processes image (resize, compress)
    
  2. Callback/Webhook

    User initiates payment → Return processing status
    → Payment gateway calls webhook when done
    
  3. Polling

    User requests report → Get job ID
    → Poll status endpoint until complete
    
  4. WebSockets/Server-Sent Events

    Real-time updates pushed to client
    Example: Live stock prices, chat messages
    
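The polling pattern above can be sketched with an in-memory job store standing in for a real queue plus status table (all names here are illustrative assumptions):

```python
import uuid

# In-memory job store; in practice this would be a DB table or cache
jobs = {}

def submit_report_job(params):
    """Return immediately with a job ID; the work happens in the background."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    return job_id

def complete_job(job_id, result):
    """Called by the background worker when processing finishes."""
    jobs[job_id] = {"status": "done", "result": result}

def poll_status(job_id):
    """The client hits this endpoint repeatedly until status == 'done'."""
    return jobs[job_id]

job_id = submit_report_job({"range": "daily"})
print(poll_status(job_id)["status"])  # pending
complete_job(job_id, {"rows": 42})
print(poll_status(job_id)["status"])  # done
```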

5. Batch Processing

Use Case: Process large volumes of data periodically

Example: Calculating daily portfolio returns

Technologies: Apache Spark, Hadoop, AWS Batch

6. Stream Processing

Use Case: Process data in real-time as it arrives

Example: Fraud detection in payment transactions

Technologies: Apache Kafka, AWS Kinesis, Apache Flink


Database Design

1. Normalization

What is Normalization?
Organizing data to reduce redundancy and improve integrity.

Normal Forms:

1NF (First Normal Form)

Before (Not 1NF):

UserID | Name | PhoneNumbers
1      | Jash | 123, 456, 789

After (1NF):

UserID | Name | PhoneNumber
1      | Jash | 123
1      | Jash | 456
1      | Jash | 789

2NF (Second Normal Form)

Before (Not 2NF):

OrderID | ProductID | ProductName | Quantity
1       | 101       | iPhone      | 2

Problem: ProductName depends only on ProductID, not the full key (OrderID, ProductID)

After (2NF):

Orders:

OrderID | ProductID | Quantity
1       | 101       | 2

Products:

ProductID | ProductName
101       | iPhone

3NF (Third Normal Form)

Before (Not 3NF):

EmployeeID | DepartmentID | DepartmentName
1          | 10           | Engineering

Problem: DepartmentName depends on DepartmentID (transitive dependency)

After (3NF):

Employees:

EmployeeID | DepartmentID
1          | 10

Departments:

DepartmentID | DepartmentName
10           | Engineering

2. Denormalization

What is Denormalization?
Intentionally adding redundancy to improve read performance.

When to Denormalize:

Example for a financial platform:

Normalized:

Portfolio: portfolio_id, user_id
Holdings: holding_id, portfolio_id, stock_id, quantity
Stocks: stock_id, symbol, name, current_price

Denormalized for faster reads:

Portfolio_Snapshot: portfolio_id, user_id, total_value, holdings_json
{
  "holdings": [
    {"symbol": "RELIANCE", "quantity": 10, "price": 2500, "value": 25000}
  ]
}

3. Indexing

What is an Index?
Data structure that improves query speed at the cost of write performance and storage.

Types of Indexes:

1. Primary Index

CREATE TABLE users (
  id SERIAL PRIMARY KEY,  -- Primary index
  email VARCHAR(255)
);

2. Secondary Index

CREATE INDEX idx_email ON users(email);

3. Composite Index

CREATE INDEX idx_name_age ON users(last_name, first_name);

4. Unique Index

CREATE UNIQUE INDEX idx_email ON users(email);

5. Full-Text Index

CREATE FULLTEXT INDEX idx_description ON products(description);

Index Trade-offs:

When to Index:

When NOT to Index:

4. Database Transactions and ACID

ACID Properties:

Atomicity

BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT; -- Both happen or neither

Consistency

-- If account balance < 0 constraint exists, transaction will fail

Isolation

Durability

Isolation Levels:

  1. Read Uncommitted

    • Lowest isolation
    • Can read uncommitted changes (dirty reads)
    • Rarely used
  2. Read Committed (PostgreSQL default)

    • Only see committed data
    • Prevents dirty reads
    • Non-repeatable reads possible
  3. Repeatable Read

    • Same query returns same results throughout transaction
    • Phantom reads possible (new rows can appear)
  4. Serializable

    • Highest isolation
    • Transactions appear to run serially
    • Slowest, used for critical operations

5. Connection Pooling

Problem:
Creating new database connection for each request is expensive (TCP handshake, authentication).

Solution:
Maintain a pool of reusable connections.

How it works:

  1. Application starts → Create 10 connections to database
  2. Request comes → Borrow connection from pool
  3. Execute query
  4. Return connection to pool (don't close)
  5. Next request reuses same connection

Benefits:

Configuration:

# Example with PostgreSQL
import psycopg2.pool

pool = psycopg2.pool.SimpleConnectionPool(
    minconn=5,      # Minimum connections
    maxconn=20,     # Maximum connections
    host='localhost',
    database='mydb'
)

Best Practices:


Caching Strategies

1. Why Cache?

Benefits:

What to Cache:

2. Cache Levels

1. Browser Cache

2. CDN Cache

3. Application Cache

4. Distributed Cache

5. Database Cache

3. Caching Patterns

1. Cache-Aside (Lazy Loading)

def get_user(user_id):
    # 1. Check cache
    user = cache.get(f"user:{user_id}")
    
    if user is not None:
        return user  # Cache hit
    
    # 2. Cache miss - Query database
    user = db.query("SELECT * FROM users WHERE id = ?", user_id)
    
    # 3. Update cache
    cache.set(f"user:{user_id}", user, ttl=3600)
    
    return user

Pros:

Cons:

2. Read-Through

# Cache library handles DB interaction
user = cache.get(f"user:{user_id}")  # If not in cache, cache fetches from DB

Pros:

Cons:

3. Write-Through

def update_user(user_id, data):
    # 1. Update cache
    cache.set(f"user:{user_id}", data)
    
    # 2. Cache updates database
    # (Handled by cache layer)

Pros:

Cons:

4. Write-Behind (Write-Back)

def update_user(user_id, data):
    # 1. Update cache only
    cache.set(f"user:{user_id}", data)
    
    # 2. Asynchronously write to DB later
    # (Handled by cache)

Pros:

Cons:

5. Refresh-Ahead

# Cache proactively refreshes data before expiration
# Based on usage patterns

Pros:

Cons:

4. Cache Invalidation

Problem: How to ensure cache has fresh data?

Strategies:

1. TTL (Time-To-Live)

cache.set("user:123", user_data, ttl=3600)  # Expires in 1 hour

2. Explicit Invalidation

def update_user(user_id, data):
    db.update(user_id, data)
    cache.delete(f"user:{user_id}")  # Remove from cache

3. Event-Based Invalidation

# When user updates profile
event_bus.publish("user.updated", user_id)

# Cache listener invalidates
event_bus.subscribe("user.updated", lambda user_id: cache.delete(f"user:{user_id}"))

5. Cache Eviction Policies

When cache is full, which item to remove?

1. LRU (Least Recently Used)

2. LFU (Least Frequently Used)

3. FIFO (First In First Out)

4. Random

5. TTL-based
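
LRU is the most common of these policies and is short to sketch with an ordered dict. An illustration only; real caches like Redis implement eviction internally:

```python
from collections import OrderedDict

class LRUCache:
    """Evict the least recently used key once capacity is exceeded."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)        # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop least recently used

cache = LRUCache(2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")         # touch "a", so "b" becomes least recently used
cache.set("c", 3)      # over capacity: evicts "b"
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```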

6. Cache Stampede / Thundering Herd

Problem:

Cache expires → 1000 concurrent requests → All hit database → DB overload

Solutions:

1. Probabilistic Early Expiration

def get_with_early_expiration(key, ttl):
    value, expiry = cache.get_with_expiry(key)
    
    # Refresh randomly before actual expiration; the closer to expiry,
    # the higher the refresh probability
    time_remaining = max(expiry - now(), 0.001)  # guard against division by zero
    if random() < (1.0 / time_remaining):
        refresh_cache_async(key)
    
    return value

2. Locking

def get_user(user_id):
    user = cache.get(f"user:{user_id}")
    if user:
        return user
    
    # Acquire lock
    if cache.set_if_not_exists(f"lock:user:{user_id}", "1", ttl=10):
        try:
            user = db.query("SELECT * FROM users WHERE id = ?", user_id)
            cache.set(f"user:{user_id}", user)
            return user
        finally:
            cache.delete(f"lock:user:{user_id}")
    else:
        # Another request is fetching, wait and retry
        sleep(0.1)
        return get_user(user_id)

3. Serve Stale Data

# Serve slightly stale data while refreshing in background

7. Distributed Caching Challenges

1. Consistency

2. Hot Keys

3. Cache Penetration

4. Cache Breakdown


Message Queues and Async Processing

1. What is a Message Queue?

Definition: A message queue is a form of asynchronous communication where messages are stored until they're processed.

Components:

Benefits:

2. Use Cases

Email Notifications

User signs up → Enqueue welcome email → Email service processes queue

Image Processing

User uploads photo → Enqueue processing job → Worker generates thumbnails

Payment Processing

User initiates payment → Enqueue → Payment service processes → Notify user

Data Analytics

User action → Enqueue event → Analytics service processes → Update dashboards

Order Processing (example)

User places order to buy stocks → Enqueue → Order processing service → Execute trade

3. Queue vs Pub/Sub

Queue (Point-to-Point)

Producer → [M1, M2, M3] → Consumer A (gets M1)
                       → Consumer B (gets M2)
                       → Consumer C (gets M3)

Pub/Sub (Publish/Subscribe)

Publisher → [Event] → Subscriber A (gets Event)
                   → Subscriber B (gets Event)
                   → Subscriber C (gets Event)

4. Popular Message Queue Technologies

1. RabbitMQ

2. Apache Kafka

3. AWS SQS

4. Redis (Pub/Sub + Streams)

5. Message Queue Patterns

1. Work Queue Pattern

Producer → Queue → Worker 1
                 → Worker 2
                 → Worker 3
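
The work queue pattern above can be sketched with the standard library, `queue.Queue` standing in for the broker (an illustration, not a production setup):

```python
import queue
import threading

task_queue = queue.Queue()
results = []
lock = threading.Lock()

def worker(name):
    """Workers compete for messages; each message is processed exactly once."""
    while True:
        msg = task_queue.get()
        if msg is None:               # sentinel: shut this worker down
            task_queue.task_done()
            break
        with lock:
            results.append((name, msg))
        task_queue.task_done()

workers = [threading.Thread(target=worker, args=(f"w{i}",)) for i in range(3)]
for w in workers:
    w.start()

for msg in range(10):                 # producer enqueues work
    task_queue.put(msg)
for _ in workers:                     # one sentinel per worker
    task_queue.put(None)

task_queue.join()
for w in workers:
    w.join()

print(sorted(msg for _, msg in results))  # [0, 1, 2, ..., 9]
```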

2. Pub/Sub Pattern

Publisher → Topic → Subscriber 1 (Analytics)
                  → Subscriber 2 (Email)
                  → Subscriber 3 (Logging)

3. Request/Reply Pattern

Client → Request Queue → Server
Client ← Reply Queue   ← Server

4. Priority Queue

High Priority: [M1, M2]
Low Priority:  [M3, M4, M5]

5. Dead Letter Queue

Main Queue → Failed Message → DLQ

6. Message Delivery Guarantees

At-Most-Once

At-Least-Once

Exactly-Once

7. Idempotency (Important!)

What is Idempotency?
Performing an operation multiple times has the same effect as performing it once.

Why important for queues?
Messages can be delivered multiple times (at-least-once delivery).

Idempotent Operations:

# Idempotent - can safely retry
user.email = "new@email.com"
SET balance = 100

# NOT Idempotent - multiple executions cause issues  
balance += 100  # Retry adds 100 multiple times!

Making Operations Idempotent:

1. Idempotency Key

def process_payment(order_id, amount, idempotency_key):
    # Check if already processed
    if db.exists(f"payment:{idempotency_key}"):
        return "Already processed"
    
    # Process payment
    execute_payment(order_id, amount)
    
    # Mark as processed
    db.set(f"payment:{idempotency_key}", "processed")

2. Natural Idempotency

# Use absolute values instead of increments
UPDATE portfolio SET shares = 100 WHERE id = 1
# Instead of
UPDATE portfolio SET shares = shares + 10 WHERE id = 1

8. Handling Failures

Retry Strategies:

1. Immediate Retry

for attempt in range(3):
    try:
        process_message(msg)
        break
    except Exception:
        if attempt == 2:
            raise
        continue

2. Exponential Backoff

import time

delay = 1
for attempt in range(5):
    try:
        process_message(msg)
        break
    except Exception:
        if attempt == 4:
            raise       # give up after the last attempt
        time.sleep(delay)
        delay *= 2      # 1s, 2s, 4s, 8s

3. Dead Letter Queue

try:
    process_message(msg)
except Exception:
    if msg.retry_count > 3:
        dlq.send(msg)  # Send to DLQ
    else:
        msg.retry_count += 1
        queue.send(msg, delay=60)  # Retry after 60s

9. Message Queue Best Practices

  1. Keep messages small - Large messages slow down queue
  2. Use idempotent consumers - Handle duplicate messages
  3. Monitor queue depth - Alert if queue growing (consumer too slow)
  4. Set appropriate TTL - Remove old messages
  5. Use DLQ - Don't lose failed messages
  6. Scale consumers based on queue depth - Auto-scaling
  7. Log message processing - For debugging
  8. Use batch processing - Process multiple messages together for efficiency

Microservices Architecture

1. Monolith vs Microservices

Monolithic Architecture

Single Application
├── User Service
├── Payment Service
├── Notification Service
└── Shared Database

Pros:

Cons:

Microservices Architecture

User Service      → User DB
Payment Service   → Payment DB  
Notification Service → Notification DB

Pros:

Cons:

2. When to Use Microservices?

Use Microservices when:

Stick with Monolith when:

Rule of thumb: Start with monolith, break into microservices when needed.

3. Microservice Design Principles

1. Single Responsibility

2. Autonomy

3. Domain-Driven Design

4. Decentralization

5. Fault Tolerance

4. Service Communication

Synchronous Communication

1. REST APIs

User Service → HTTP GET → Product Service
             ← JSON Response ←

Pros:

Cons:

2. gRPC

Service A → gRPC call → Service B
          ← Binary Response ←

Pros:

Cons:

Asynchronous Communication

Message Queues

Service A → Message Queue → Service B

Pros:

Cons:

Event-Driven

Order Service → Event: "Order Placed" → Inventory Service (updates stock)
                                     → Email Service (sends confirmation)
                                     → Analytics Service (tracks metrics)

Pros:

Cons:

5. Service Discovery

Problem: Service A needs to call Service B, but Service B's location changes (new instances, scaling).

Solution: Service Discovery

Client-Side Discovery

Service A → Query Registry → Get Service B's address → Call Service B

Server-Side Discovery

Service A → Load Balancer → Routes to Service B instances

Service Registry:

6. API Gateway

What is API Gateway?
Single entry point for all client requests.

Client → API Gateway → User Service
                    → Order Service
                    → Product Service

Responsibilities:

Example:

GET /api/user/123/orders

API Gateway:
1. Authenticates request
2. Calls User Service to verify user
3. Calls Order Service to get orders
4. Aggregates responses
5. Returns combined result

Popular Tools:

7. Data Management in Microservices

Database per Service Pattern

User Service → Users DB
Order Service → Orders DB
Product Service → Products DB

Pros:

Cons:

Handling Distributed Transactions:

1. Saga Pattern

Choreography-based Saga:

Order Service → Create Order → Publishes "OrderCreated" event
Payment Service → Listens → Process Payment → Publishes "PaymentProcessed"
Inventory Service → Listens → Reserve Items → Publishes "ItemsReserved"

If Payment fails:
Payment Service → Publishes "PaymentFailed"
Order Service → Listens → Cancel Order

Orchestration-based Saga:

Saga Orchestrator:
1. Call Order Service → Create Order
2. Call Payment Service → Process Payment
3. If payment succeeds:
   Call Inventory Service → Reserve Items
   Call Notification Service → Send Email
4. If any step fails:
   Execute compensating transactions (undo previous steps)
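
The orchestration flow above can be sketched as a generic saga runner that pairs each action with a compensating action (a minimal illustration; the step and log names are made up):

```python
def run_saga(steps):
    """Run each (action, compensate) pair; on failure, undo in reverse order."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):  # compensating transactions
                undo()
            return "rolled back"
    return "committed"

log = []

def fail_payment():
    raise RuntimeError("payment declined")

steps = [
    (lambda: log.append("order created"), lambda: log.append("order cancelled")),
    (fail_payment,                        lambda: log.append("payment refunded")),
]

print(run_saga(steps))  # rolled back
print(log)              # ['order created', 'order cancelled']
```

Note that compensation is a new business action (cancel, refund), not a database rollback, so each step must be designed with its undo in mind.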

2. Event Sourcing

3. CQRS (Command Query Responsibility Segregation)

8. Microservices Challenges & Solutions

1. Network Latency

2. Partial Failures

3. Circuit Breaker Pattern

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN
        self.timeout = timeout
        self.opened_at = None  # when the circuit last opened
    
    def call(self, func):
        if self.state == "OPEN":
            if time.time() - self.opened_at > self.timeout:
                self.state = "HALF_OPEN"
            else:
                raise ServiceUnavailableError("Circuit is OPEN")
        
        try:
            result = func()
            if self.state == "HALF_OPEN":
                self.state = "CLOSED"
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.time()
            raise

# Usage
breaker = CircuitBreaker()
breaker.call(lambda: call_payment_service())

States:

4. Data Consistency

5. Monitoring & Debugging

Distributed Tracing Example:

Request ID: abc123

User Service (20ms) → Order Service (50ms) → Payment Service (100ms)
                                           → Inventory Service (30ms)

Total: 200ms, Bottleneck: Payment Service

6. Testing

9. Microservices Best Practices

  1. Start with monolith - Don't over-engineer early
  2. Define clear boundaries - Use domain-driven design
  3. Automate everything - CI/CD pipelines
  4. Monitor relentlessly - Metrics, logs, traces
  5. Design for failure - Circuit breakers, retries, timeouts
  6. Version your APIs - /v1/users, /v2/users
  7. Document APIs - OpenAPI/Swagger
  8. Use async communication - Prefer events over sync calls
  9. Secure service-to-service - mTLS, JWT
  10. Keep services small - Should fit in developer's head

API Design

1. REST API Design

REST Principles:

  1. Resource-Based

    • URLs represent resources (nouns), not actions (verbs)
    • Good: GET /users/123
    • Bad: GET /getUser?id=123
  2. HTTP Methods

    • GET: Retrieve resource
    • POST: Create resource
    • PUT: Update entire resource
    • PATCH: Partial update
    • DELETE: Remove resource
  3. Stateless

    • Each request contains all info needed
    • Server doesn't store client state
  4. Uniform Interface

    • Consistent naming and structure

RESTful API Examples:

# Users
GET    /users              # List all users
GET    /users/123          # Get specific user
POST   /users              # Create new user
PUT    /users/123          # Update entire user
PATCH  /users/123          # Partially update user
DELETE /users/123          # Delete user

# Nested Resources
GET    /users/123/orders   # Get orders for user 123
POST   /users/123/orders   # Create order for user 123
GET    /users/123/orders/456  # Get specific order

# Filtering, Sorting, Pagination
GET /users?status=active&sort=created_at&page=2&limit=20

HTTP Status Codes:

2xx Success

3xx Redirection

4xx Client Errors

5xx Server Errors

Response Format:

{
  "success": true,
  "data": {
    "id": 123,
    "name": "Jash Rashne",
    "email": "jash@example.com"
  },
  "meta": {
    "page": 1,
    "total": 100
  }
}

Error Response:

{
  "success": false,
  "error": {
    "code": "INVALID_EMAIL",
    "message": "Email format is invalid",
    "details": {
      "field": "email",
      "value": "invalid-email"
    }
  }
}

2. API Versioning

Why Version?

Versioning Strategies:

1. URI Versioning

/v1/users
/v2/users

2. Header Versioning

GET /users
Accept: application/vnd.myapi.v1+json

3. Query Parameter

/users?version=1

Best Practice: URI versioning for major versions, deprecate old versions gracefully.

3. Pagination

Why Paginate?

Offset-based Pagination:

GET /users?page=2&limit=20

Response:
{
  "data": [...],
  "meta": {
    "page": 2,
    "limit": 20,
    "total": 1000,
    "total_pages": 50
  }
}

Pros: Simple
Cons: Inconsistent results if data changes (items added/removed)

Cursor-based Pagination:

GET /users?cursor=eyJpZCI6MTAwfQ&limit=20

Response:
{
  "data": [...],
  "meta": {
    "next_cursor": "eyJpZCI6MTIwfQ",
    "has_more": true
  }
}

Pros: Consistent, efficient
Cons: Can't jump to specific page

Best Practice: Use cursor-based for real-time feeds, offset-based for tables.
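
The opaque cursors in the example above are typically just base64-encoded JSON of the last row's sort key. A sketch of the codec (the exact cursor format is an assumption):

```python
import base64
import json

def encode_cursor(last_id: int) -> str:
    """Opaque cursor = unpadded base64url of the last row's sort key."""
    raw = json.dumps({"id": last_id}, separators=(",", ":")).encode()
    return base64.urlsafe_b64encode(raw).decode().rstrip("=")

def decode_cursor(cursor: str) -> int:
    padded = cursor + "=" * (-len(cursor) % 4)   # restore base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))["id"]

print(encode_cursor(100))               # eyJpZCI6MTAwfQ
print(decode_cursor("eyJpZCI6MTAwfQ"))  # 100
```

The query then becomes `WHERE id > :last_id ORDER BY id LIMIT 20`, which stays consistent even as rows are inserted or deleted between pages.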

4. Rate Limiting

Why Rate Limit?

Algorithms:

1. Token Bucket

import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate  # tokens per second
        self.last_refill = time.time()
    
    def allow_request(self):
        self.refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
    
    def refill(self):
        now = time.time()
        elapsed = now - self.last_refill
        tokens_to_add = elapsed * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + tokens_to_add)
        self.last_refill = now

2. Leaky Bucket

3. Fixed Window

Time: 10:00:00 - 10:00:59
Requests: 100 allowed per minute
Reset: At 10:01:00

Problem: Burst at window boundaries

4. Sliding Window
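
A common space-efficient variant is the sliding window counter, which smooths the fixed-window boundary problem by weighting the previous window's count by its remaining overlap. A minimal in-memory sketch (class name and structure are illustrative):

```python
class SlidingWindowCounter:
    """Approximate sliding window: blend the previous fixed window's count
    with the current one, weighted by how much of it still overlaps."""
    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # fixed-window index -> request count

    def allow_request(self, now: float) -> bool:
        current = int(now // self.window)
        prev_count = self.counts.get(current - 1, 0)
        curr_count = self.counts.get(current, 0)
        # Fraction of the previous window still inside the sliding window
        overlap = 1 - (now % self.window) / self.window
        estimated = prev_count * overlap + curr_count
        if estimated < self.limit:
            self.counts[current] = curr_count + 1
            return True
        return False

limiter = SlidingWindowCounter(limit=100, window_seconds=60)
print(limiter.allow_request(now=0.0))  # True
```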

Rate Limit Headers:

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1609459200
Retry-After: 60

5. API Security

1. Authentication

JWT (JSON Web Token)

Header.Payload.Signature

Example:
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.
eyJ1c2VyX2lkIjoxMjMsImV4cCI6MTYwOTQ1OTIwMH0.
signature_here

Flow:

  1. User logs in with credentials
  2. Server generates JWT
  3. Client stores JWT
  4. Client sends JWT in header: Authorization: Bearer <token>
  5. Server verifies signature
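
The sign-and-verify flow above can be sketched with the standard library (an HS256-style illustration; a real service would use a vetted JWT library and also validate claims such as `exp`):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret"  # assumption: server-side signing key

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode().rstrip("=")

def create_jwt(payload: dict) -> str:
    """Token = base64url(header).base64url(payload).HMAC-SHA256 signature."""
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = _b64(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode()
    sig = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"

def verify_jwt(token: str) -> bool:
    """Recompute the signature and compare in constant time."""
    header, body, sig = token.split(".")
    signing_input = f"{header}.{body}".encode()
    expected = _b64(hmac.new(SECRET, signing_input, hashlib.sha256).digest())
    return hmac.compare_digest(sig, expected)

token = create_jwt({"user_id": 123, "exp": 1609459200})
print(verify_jwt(token))        # True
print(verify_jwt(token + "x"))  # False (tampered signature)
```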

Advantages:

2. API Keys

GET /api/users
X-API-Key: your_api_key_here

3. OAuth 2.0

4. Input Validation

def create_user(request):
    email = request.data.get('email')
    
    # Validate
    if not email or '@' not in email:
        return Response({"error": "Invalid email"}, status=400)
    
    # Sanitize (prevent SQL injection, XSS)
    email = sanitize(email)
    
    # Process
    user = User.create(email=email)
    return Response(user.to_dict(), status=201)

5. HTTPS

6. CORS (Cross-Origin Resource Sharing)

# Allow requests from specific origin
Access-Control-Allow-Origin: https://mycompany.com
Access-Control-Allow-Methods: GET, POST, PUT
Access-Control-Allow-Headers: Content-Type, Authorization

6. Idempotency in APIs

Idempotent HTTP Methods:

Making POST Idempotent:

POST /orders
Idempotency-Key: uuid-1234-5678

Server:
1. Check if request with this key already processed
2. If yes, return same response (don't create duplicate)
3. If no, process and store result with key

Example:

def create_order(request):
    idempotency_key = request.headers.get('Idempotency-Key')
    
    # Check cache
    cached_response = cache.get(f"order:{idempotency_key}")
    if cached_response:
        return cached_response
    
    # Process order
    order = Order.create(request.data)
    response = {"order_id": order.id}
    
    # Cache response
    cache.set(f"order:{idempotency_key}", response, ttl=86400)
    
    return response

7. API Documentation

OpenAPI/Swagger Specification

openapi: 3.0.0
info:
  title: My App API
  version: 1.0.0
paths:
  /users/{userId}:
    get:
      summary: Get user by ID
      parameters:
        - name: userId
          in: path
          required: true
          schema:
            type: integer
      responses:
        200:
          description: Successful response
          content:
            application/json:
              schema:
                type: object
                properties:
                  id:
                    type: integer
                  name:
                    type: string
                  email:
                    type: string

Tools:


Real-World System Design Examples

1. Design URL Shortener (bit.ly)

Requirements:

Functional Requirements:

Non-Functional Requirements:

Estimations:

Write QPS = 100M / (30 days * 24 hrs * 3600 sec) = ~40 writes/sec
Read QPS = 40 * 100 = 4000 reads/sec

Storage (5 years):
100M * 12 months * 5 years = 6B URLs
Average URL size = 500 bytes
Total = 6B * 500 bytes = 3TB

Database Schema:

CREATE TABLE urls (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    short_code VARCHAR(10) UNIQUE NOT NULL,
    original_url TEXT NOT NULL,
    created_at TIMESTAMP,
    expires_at TIMESTAMP,
    INDEX(short_code)
);

URL Encoding:

Approach 1: Hash-based

import hashlib

def shorten_url(url):
    hash_value = hashlib.md5(url.encode()).hexdigest()
    short_code = hash_value[:7]  # Take first 7 characters
    return short_code

Problem: Collision (two URLs get same hash)

Solution: Check if exists, if yes, append counter and rehash

Approach 2: Base62 Encoding

def base62_encode(num):
    chars = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    if num == 0:
        return chars[0]
    
    result = []
    while num:
        result.append(chars[num % 62])
        num //= 62
    
    return ''.join(reversed(result))

# Auto-increment ID = 12345
short_code = base62_encode(12345)  # "3d7"

62^7 = 3.5 trillion possible codes (enough!)
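
Redirects need the inverse mapping, from short code back to the database ID. A sketch mirroring the encoder above:

```python
def base62_decode(code: str) -> int:
    """Inverse of base62_encode: short code back to the auto-increment ID."""
    chars = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    num = 0
    for ch in code:
        num = num * 62 + chars.index(ch)
    return num

print(base62_decode("3d7"))  # 12345
```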

High-Level Design:

Client
   ↓
Load Balancer
   ↓
Application Servers (stateless)
   ↓
   ├→ Write → Master DB
   └→ Read → Read Replicas (3 slaves)
   
Cache (Redis)
- Store popular URLs (80/20 rule)

API Design:

POST /api/v1/shorten
Request:
{
  "url": "https://www.example.com/very/long/url",
  "custom_alias": "mylink" (optional),
  "expires_in": 86400 (optional)
}

Response:
{
  "short_url": "https://short.ly/abc123",
  "created_at": "2026-02-12T10:00:00Z"
}

GET /{short_code}
Response: 301 Redirect to original URL

Workflow:

Create Short URL:

  1. User submits long URL
  2. Generate unique short code (base62 of auto-increment ID)
  3. Store in database: short_code → original_url
  4. Return short URL

Redirect:

  1. User visits short URL
  2. Check cache for short_code
  3. If not in cache, query database
  4. Update cache
  5. Return 301 redirect to original URL
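Steps 2-4 of the redirect are the classic cache-aside pattern; a minimal sketch with plain dicts standing in for Redis and the database:

```python
def resolve(short_code, cache, db):
    # Cache-aside: serve from cache when possible, else fill from the DB
    url = cache.get(short_code)
    if url is not None:
        return url
    url = db.get(short_code)
    if url is not None:
        cache[short_code] = url  # a real cache would also set a TTL
    return url
```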

Optimizations:

Advanced Features:

2. Design Pastebin

Requirements:

Similar to URL shortener with differences:

Database Schema:

CREATE TABLE pastes (
    id BIGINT PRIMARY KEY,
    short_code VARCHAR(10) UNIQUE,
    content TEXT,  -- or S3 key if large
    language VARCHAR(50),
    expires_at TIMESTAMP,
    created_at TIMESTAMP
);

Storage Strategy:

3. Design Rate Limiter

Requirements:

Approaches:

1. In-Memory (Single Server)

from collections import defaultdict
import time

class RateLimiter:
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.requests = defaultdict(list)
    
    def allow_request(self, user_id):
        now = time.time()
        window_start = now - self.window_seconds
        
        # Remove old requests
        self.requests[user_id] = [
            req_time for req_time in self.requests[user_id]
            if req_time > window_start
        ]
        
        # Check limit
        if len(self.requests[user_id]) < self.max_requests:
            self.requests[user_id].append(now)
            return True
        
        return False

2. Distributed (Redis)

import time
import uuid

import redis as redis_lib

redis = redis_lib.Redis()

def rate_limit(user_id, max_requests, window_seconds):
    key = f"rate_limit:{user_id}"
    now = time.time()
    window_start = now - window_seconds
    
    # Redis pipeline batches the round trips (note: a pipeline does not
    # make the check-then-add atomic; use a Lua script for strictness)
    pipe = redis.pipeline()
    
    # Remove entries that fell out of the window
    pipe.zremrangebyscore(key, 0, window_start)
    
    # Count requests still in the window
    pipe.zcard(key)
    
    # Add current request; unique member so two requests with the same
    # timestamp don't overwrite each other
    pipe.zadd(key, {f"{now}:{uuid.uuid4().hex}": now})
    
    # Expire idle keys
    pipe.expire(key, window_seconds)
    
    results = pipe.execute()
    request_count = results[1]
    
    return request_count < max_requests

Token Bucket (Recommended):

import json
import time

def token_bucket_rate_limit(user_id, capacity, refill_rate):
    key = f"rate_limit:{user_id}"
    
    # Get current tokens and last refill time
    # (this read-modify-write is racy across instances; wrap it in a
    # Lua script or WATCH/MULTI for production use)
    data = redis.get(key)
    if data:
        tokens, last_refill = json.loads(data)
    else:
        tokens = capacity
        last_refill = time.time()
    
    # Refill tokens based on elapsed time, capped at capacity
    now = time.time()
    elapsed = now - last_refill
    tokens = min(capacity, tokens + elapsed * refill_rate)
    
    # Spend a token if one is available; always store `now` as the refill
    # time, otherwise the same elapsed interval gets credited twice
    if tokens >= 1:
        redis.set(key, json.dumps([tokens - 1, now]), ex=3600)
        return True
    
    redis.set(key, json.dumps([tokens, now]), ex=3600)
    return False

High-Level Design:

Client Request
   ↓
API Gateway / Load Balancer
   ↓
Rate Limiter (checks Redis)
   ↓
If allowed → Application Server
If blocked → Return 429 Too Many Requests

4. Design Notification System

Requirements:

Components:

  1. API Service - Receives notification requests
  2. Message Queue - Stores notifications to process
  3. Workers - Process notifications
  4. Third-party Services - SendGrid (email), Twilio (SMS), FCM (push)
  5. Database - Store notification status

Flow:

Application → POST /notifications → API Service
                                        ↓
                                   Message Queue
                                        ↓
                    Worker 1      Worker 2      Worker 3
                        ↓             ↓             ↓
                    SendGrid       Twilio         FCM

Database Schema:

CREATE TABLE notifications (
    id BIGINT PRIMARY KEY,
    user_id BIGINT,
    type ENUM('email', 'sms', 'push'),
    message TEXT,
    status ENUM('pending', 'sent', 'failed'),
    retry_count INT DEFAULT 0,
    created_at TIMESTAMP,
    sent_at TIMESTAMP
);

API:

POST /api/notifications
{
  "user_id": 123,
  "type": "email",
  "template": "welcome_email",
  "data": {
    "name": "Jash"
  }
}

Worker Logic:

def process_notification(notification):
    try:
        if notification.type == 'email':
            sendgrid.send(notification.email, notification.message)
        elif notification.type == 'sms':
            twilio.send(notification.phone, notification.message)
        elif notification.type == 'push':
            fcm.send(notification.device_token, notification.message)
        
        db.update(notification.id, status='sent', sent_at=now())
    
    except Exception as e:
        if notification.retry_count < 3:
            # Retry with exponential backoff (2**n minutes: 1, 2, 4)
            delay = (2 ** notification.retry_count) * 60
            queue.send_delayed(notification, delay=delay)
            db.update(notification.id, retry_count=notification.retry_count + 1)
        else:
            db.update(notification.id, status='failed')

Preventing Duplicates:
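A standard approach is to deduplicate on the notification ID before delivery; a minimal sketch, with a Python set standing in for something like Redis SETNX with a TTL:

```python
def send_once(notification_id, seen, deliver):
    # 'seen' is a set-like dedup store; deliver() performs the actual send
    key = f"notif:{notification_id}"
    if key in seen:
        return False   # already sent (or in flight): skip
    seen.add(key)
    deliver()
    return True
```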

User Preferences:

CREATE TABLE notification_preferences (
    user_id BIGINT,
    type ENUM('email', 'sms', 'push'),
    enabled BOOLEAN,
    PRIMARY KEY(user_id, type)
);

5. Design News Feed (Twitter/Facebook)

Requirements:

Approaches:

1. Pull Model (Read-Heavy)

User requests feed
→ Fetch list of followed users
→ Fetch recent posts from each
→ Merge and sort by timestamp
→ Return feed

Pros: Simple, saves storage
Cons: Slow for users following many people

2. Push Model (Write-Heavy - Fanout)

User posts update
→ Fetch list of followers
→ Write post to each follower's feed
→ Followers read their pre-computed feed (fast)

Pros: Fast reads
Cons: Expensive writes (celebrity with millions of followers)

3. Hybrid Model

Normal users: Push model
Celebrities: Pull model
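A sketch of the hybrid read path; the pre-computed feed, the celebrity list, and the `pull_recent_posts` callback are all hypothetical stand-ins:

```python
def get_hybrid_feed(precomputed_feed, followed_celebrities, pull_recent_posts, limit=20):
    # Merge the pre-computed (push) feed with celebrity posts pulled at
    # read time, then sort newest-first and trim to one page
    items = list(precomputed_feed)
    for celeb_id in followed_celebrities:
        items.extend(pull_recent_posts(celeb_id))
    items.sort(key=lambda p: p["created_at"], reverse=True)
    return items[:limit]
```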

Database Schema:

-- Posts
CREATE TABLE posts (
    id BIGINT PRIMARY KEY,
    user_id BIGINT,
    content TEXT,
    created_at TIMESTAMP,
    INDEX(user_id, created_at)
);

-- Followers
CREATE TABLE followers (
    follower_id BIGINT,
    followee_id BIGINT,
    PRIMARY KEY(follower_id, followee_id)
);

-- Feed (for push model)
CREATE TABLE feed (
    user_id BIGINT,
    post_id BIGINT,
    created_at TIMESTAMP,
    PRIMARY KEY(user_id, post_id),  -- (user_id, created_at) would collide on same-second posts
    INDEX(user_id, created_at)
);

Feed Generation (Push Model):

def create_post(user_id, content):
    # 1. Create post
    post = db.create_post(user_id, content)
    
    # 2. Get followers
    followers = db.get_followers(user_id)
    
    # 3. Fanout to followers (async via queue)
    for follower_id in followers:
        queue.send({
            'action': 'add_to_feed',
            'user_id': follower_id,
            'post_id': post.id
        })
    
    return post

def add_to_feed_worker(message):
    db.insert_feed(message.user_id, message.post_id)

Feed Retrieval:

def get_feed(user_id, page=1, limit=20):
    # Simple read from feed table
    # (for deep pages, keyset pagination on created_at beats OFFSET)
    return db.query("""
        SELECT posts.* FROM feed
        JOIN posts ON feed.post_id = posts.id
        WHERE feed.user_id = ?
        ORDER BY feed.created_at DESC
        LIMIT ? OFFSET ?
    """, user_id, limit, (page - 1) * limit)

Optimizations:

6. Design YouTube / Video Streaming

Requirements:

Components:

  1. Upload Service - Handle video uploads
  2. Transcoding Service - Convert to multiple formats/resolutions
  3. Object Storage - Store videos (S3, GCS)
  4. CDN - Distribute videos globally
  5. Metadata Database - Video info, comments, likes

Upload Flow:

User uploads video (large file)
   ↓
Upload Service (chunks upload)
   ↓
Store raw video in S3
   ↓
Trigger transcoding (message queue)
   ↓
Transcoding Service
- Generate 360p, 480p, 720p, 1080p
- Generate thumbnail
   ↓
Store transcoded videos in S3
   ↓
Update metadata in database

Streaming Flow:

User requests video
   ↓
CDN (nearest edge server)
   ↓
If not in CDN, fetch from S3
   ↓
Stream to user (HLS, DASH protocol)

Database Schema:

CREATE TABLE videos (
    id BIGINT PRIMARY KEY,
    user_id BIGINT,
    title VARCHAR(255),
    description TEXT,
    s3_key VARCHAR(255),  -- Raw video
    thumbnail_url VARCHAR(255),
    duration INT,
    status ENUM('processing', 'ready', 'failed'),
    created_at TIMESTAMP,
    INDEX(user_id),
    INDEX(created_at)
);

CREATE TABLE video_files (
    id BIGINT PRIMARY KEY,
    video_id BIGINT,
    resolution VARCHAR(10),  -- 360p, 720p, etc.
    s3_key VARCHAR(255),
    format VARCHAR(10),  -- mp4, webm
    FOREIGN KEY(video_id) REFERENCES videos(id)
);

Adaptive Bitrate Streaming:
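Adaptive bitrate streaming encodes each video at several bitrates and lets the player switch renditions as measured bandwidth changes. A minimal client-side selection sketch (the 25% headroom factor is an illustrative assumption):

```python
def pick_rendition(bandwidth_bps, renditions):
    # Pick the highest-bitrate rendition that fits the measured bandwidth,
    # leaving ~25% headroom; fall back to the lowest if nothing fits
    budget = bandwidth_bps * 0.75
    viable = [r for r in renditions if r["bitrate"] <= budget]
    if not viable:
        return min(renditions, key=lambda r: r["bitrate"])
    return max(viable, key=lambda r: r["bitrate"])
```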

Challenges:

7. Design Instagram

Requirements:

Functional Requirements:

Database Schema:

CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    username VARCHAR(50) UNIQUE,
    email VARCHAR(255) UNIQUE,
    created_at TIMESTAMP
);

CREATE TABLE photos (
    id BIGINT PRIMARY KEY,
    user_id BIGINT,
    image_url VARCHAR(500),
    caption TEXT,
    created_at TIMESTAMP,
    FOREIGN KEY(user_id) REFERENCES users(id),
    INDEX(user_id, created_at)
);

CREATE TABLE follows (
    follower_id BIGINT,
    followee_id BIGINT,
    created_at TIMESTAMP,
    PRIMARY KEY(follower_id, followee_id)
);

CREATE TABLE likes (
    user_id BIGINT,
    photo_id BIGINT,
    created_at TIMESTAMP,
    PRIMARY KEY(user_id, photo_id)
);

CREATE TABLE comments (
    id BIGINT PRIMARY KEY,
    user_id BIGINT,
    photo_id BIGINT,
    content TEXT,
    created_at TIMESTAMP,
    FOREIGN KEY(user_id) REFERENCES users(id),
    FOREIGN KEY(photo_id) REFERENCES photos(id)
);

Photo Upload Flow:

User uploads photo
   ↓
API Server
   ↓
Upload to S3 (original)
   ↓
Trigger processing (queue)
   ↓
Image Processing Service:
- Resize to multiple sizes (thumbnail, medium, large)
- Apply filters
- Compress
   ↓
Store processed images in S3
   ↓
Store metadata in database
   ↓
Fanout to followers' feeds (like Twitter feed)

Feed Generation:

Challenges:


Common Interview Questions

1. How would you design a system to handle 1 million concurrent users?

Approach:

  1. Horizontal Scaling

    • Add more application servers
    • Use load balancer to distribute traffic
  2. Caching

    • Cache frequently accessed data (Redis)
    • Reduce database load
  3. Database Optimization

    • Read replicas for read-heavy workload
    • Indexing for faster queries
    • Connection pooling
  4. Asynchronous Processing

    • Use message queues for non-critical tasks
    • Background jobs
  5. CDN

    • Serve static content from edge servers
    • Reduce latency
  6. Microservices

    • Break monolith into services
    • Scale services independently
  7. Auto-scaling

    • Scale servers based on traffic
    • AWS Auto Scaling, Kubernetes HPA

2. How do you ensure data consistency in a distributed system?

Approaches:

  1. Strong Consistency

    • Use distributed transactions (2PC, 3PC)
    • Example: Google Spanner
    • Pros: Always consistent
    • Cons: Slow, complex
  2. Eventual Consistency

    • Accept temporary inconsistency
    • Example: DynamoDB, Cassandra
    • Pros: Fast, available
    • Cons: Complexity in application logic
  3. Use Saga Pattern

    • Break transaction into steps
    • Compensate on failure
  4. Event Sourcing

    • Store events, derive state
    • Replay events for consistency
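The saga pattern (item 3 above) can be sketched as a list of (action, compensation) pairs, undoing completed steps in reverse order on failure:

```python
def run_saga(steps):
    # Each step is (action, compensate); if any action fails, run the
    # compensations for the steps that already completed, in reverse
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()
        return False
    return True
```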

3. How would you design a system with 99.99% availability?

Strategies:

  1. Redundancy

    • Multiple servers, databases
    • No single point of failure
  2. Load Balancing

    • Distribute traffic across servers
    • Health checks
  3. Database Replication

    • Master-slave replication
    • Automatic failover
  4. Monitoring & Alerts

    • Detect failures quickly
    • Prometheus, Grafana, PagerDuty
  5. Graceful Degradation

    • If service fails, degrade gracefully
    • Example: Netflix continues with cached data
  6. Circuit Breaker

    • Stop calling failed services
    • Prevent cascade failures
  7. Multi-region Deployment

    • Deploy in multiple AWS regions
    • If one region fails, others serve
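The circuit-breaker strategy (item 6) can be sketched as a small wrapper; the threshold and reset timeout are illustrative defaults:

```python
import time

class CircuitBreaker:
    # Open after `threshold` consecutive failures; allow a trial call
    # (half-open) after `reset_timeout` seconds
    def __init__(self, threshold=3, reset_timeout=30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0  # success resets the failure count
        return result
```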

Calculation:
99.99% uptime ≈ 52.56 minutes of downtime per year (about 4.4 minutes per month)
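The downtime budget follows directly from the availability target:

```python
def max_downtime_minutes_per_year(availability):
    # Allowed downtime for a given availability target (365-day year)
    return (1 - availability) * 365 * 24 * 60

# 99.9%  -> ~525.6 minutes/year (~8.8 hours)
# 99.99% -> ~52.56 minutes/year
```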

4. How do you handle database schema changes in production?

Strategies:

  1. Blue-Green Deployment

    • Deploy new version alongside old
    • Switch traffic when ready
    • Rollback if issues
  2. Backward Compatible Changes

    • Add new column (nullable)
    • Deploy code that uses new column
    • Populate data
    • Remove old column later
  3. Online Schema Change Tools

    • pt-online-schema-change (Percona)
    • gh-ost (GitHub)
    • Make changes without locking table
  4. Database Migrations

    • Version controlled migration scripts
    • Alembic (Python), Flyway (Java)

Example:

-- Step 1: Add new column (nullable)
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);

-- Step 2: Deploy code that writes to both name and full_name

-- Step 3: Backfill data (batch the UPDATE on large tables to avoid long locks)
UPDATE users SET full_name = CONCAT(first_name, ' ', last_name);

-- Step 4: Make column NOT NULL
ALTER TABLE users MODIFY full_name VARCHAR(255) NOT NULL;

-- Step 5: Deploy code that only uses full_name

-- Step 6: Drop old columns
ALTER TABLE users DROP COLUMN first_name, DROP COLUMN last_name;

5. How would you debug a slow API?

Steps:

  1. Check Logs

    • Look for errors
    • Check API latency
  2. Application Performance Monitoring (APM)

    • New Relic, DataDog
    • Identify slow database queries, API calls
  3. Database Query Analysis

    • Use EXPLAIN on SQL queries
    • Check if indexes used
    • Slow query log
  4. Check External Dependencies

    • Third-party API calls
    • Timeouts?
  5. Profiling

    • CPU profiling (which function is slow?)
    • Memory profiling (memory leaks?)
  6. Load Testing

    • JMeter, Locust
    • Identify bottlenecks
  7. Caching

    • Are we caching frequently accessed data?
    • Cache hit rate?

Common Culprits:


Domain-Specific System Design Topics

1. Design Stock Portfolio Tracker

Requirements:

Database Schema:

CREATE TABLE users (
    id BIGINT PRIMARY KEY,
    name VARCHAR(255),
    email VARCHAR(255) UNIQUE
);

CREATE TABLE portfolios (
    id BIGINT PRIMARY KEY,
    user_id BIGINT,
    name VARCHAR(255),
    created_at TIMESTAMP,
    FOREIGN KEY(user_id) REFERENCES users(id)
);

CREATE TABLE holdings (
    id BIGINT PRIMARY KEY,
    portfolio_id BIGINT,
    stock_symbol VARCHAR(10),
    quantity DECIMAL(15, 2),
    avg_buy_price DECIMAL(15, 2),
    last_updated TIMESTAMP,
    FOREIGN KEY(portfolio_id) REFERENCES portfolios(id)
);

CREATE TABLE transactions (
    id BIGINT PRIMARY KEY,
    portfolio_id BIGINT,
    stock_symbol VARCHAR(10),
    type ENUM('BUY', 'SELL'),
    quantity DECIMAL(15, 2),
    price DECIMAL(15, 2),
    executed_at TIMESTAMP,
    FOREIGN KEY(portfolio_id) REFERENCES portfolios(id)
);

CREATE TABLE stock_prices (
    symbol VARCHAR(10),
    price DECIMAL(15, 2),
    timestamp TIMESTAMP,
    PRIMARY KEY(symbol, timestamp)
);

Real-time Portfolio Value:

Approach 1: Calculate on Demand

def get_portfolio_value(portfolio_id):
    holdings = db.get_holdings(portfolio_id)
    total_value = 0
    
    for holding in holdings:
        current_price = get_stock_price(holding.symbol)  # External API or cache
        value = holding.quantity * current_price
        total_value += value
    
    return total_value

Problem: Slow if portfolio has many stocks

Approach 2: Pre-compute and Cache

# Batch job every minute
def update_portfolio_values():
    portfolios = db.get_all_portfolios()
    stock_prices = get_all_stock_prices()  # Batch fetch
    
    for portfolio in portfolios:
        holdings = db.get_holdings(portfolio.id)
        total_value = 0
        
        for holding in holdings:
            current_price = stock_prices[holding.symbol]
            value = holding.quantity * current_price
            total_value += value
        
        # Cache portfolio value
        cache.set(f"portfolio_value:{portfolio.id}", total_value, ttl=60)

Get portfolio value:

def get_portfolio_value(portfolio_id):
    # Try cache first ("is not None", so a legitimately zero value still counts as a hit)
    value = cache.get(f"portfolio_value:{portfolio_id}")
    if value is not None:
        return value
    
    # Calculate and cache
    value = calculate_portfolio_value(portfolio_id)
    cache.set(f"portfolio_value:{portfolio_id}", value, ttl=60)
    return value

Real-time Stock Prices:

Option 1: Poll External API

Option 2: WebSocket from Exchange

Stock Exchange (WebSocket) → Our Server → Redis Pub/Sub → Client WebSocket

Historical Performance:

CREATE TABLE portfolio_snapshots (
    portfolio_id BIGINT,
    value DECIMAL(15, 2),
    date DATE,
    PRIMARY KEY(portfolio_id, date)
);

Daily job:

# Run at market close
def create_daily_snapshots():
    portfolios = db.get_all_portfolios()
    stock_prices = get_closing_prices()
    
    for portfolio in portfolios:
        value = calculate_portfolio_value(portfolio.id, stock_prices)
        db.insert_snapshot(portfolio.id, value, today())

Charts:

def get_performance(portfolio_id, period='1M'):
    snapshots = db.query("""
        SELECT date, value FROM portfolio_snapshots
        WHERE portfolio_id = ? AND date >= ?
        ORDER BY date
    """, portfolio_id, start_date(period))
    
    return snapshots

2. Design Order Execution System

Requirements:

Order States:

PENDING → SUBMITTED → EXECUTED → SETTLED
            ↓
         FAILED/CANCELLED

Database Schema:

CREATE TABLE orders (
    id BIGINT PRIMARY KEY,
    user_id BIGINT,
    portfolio_id BIGINT,
    stock_symbol VARCHAR(10),
    type ENUM('BUY', 'SELL'),
    quantity DECIMAL(15, 2),
    price DECIMAL(15, 2),
    status ENUM('PENDING', 'SUBMITTED', 'EXECUTED', 'SETTLED', 'FAILED', 'CANCELLED'),
    broker_order_id VARCHAR(255),
    created_at TIMESTAMP,
    executed_at TIMESTAMP,
    INDEX(user_id, created_at),
    INDEX(status)
);

Order Flow:

1. User places order
   ↓
2. Validate (sufficient balance, valid quantity)
   ↓
3. Create order record (status=PENDING)
   ↓
4. Enqueue order processing
   ↓
5. Worker picks order
   ↓
6. Call broker API (with idempotency key)
   ↓
7. Update status (SUBMITTED)
   ↓
8. Poll broker for execution status
   ↓
9. Update status (EXECUTED)
   ↓
10. Update holdings
   ↓
11. Update status (SETTLED)

Idempotency:

def submit_order_to_broker(order):
    idempotency_key = f"order:{order.id}"
    
    # Check if already submitted
    broker_order = cache.get(idempotency_key)
    if broker_order:
        return broker_order
    
    # Submit to broker
    broker_order = broker_api.place_order(
        symbol=order.stock_symbol,
        quantity=order.quantity,
        price=order.price,
        idempotency_key=idempotency_key
    )
    
    # Cache result
    cache.set(idempotency_key, broker_order, ttl=86400)
    
    # Update order
    db.update_order(order.id, {
        'status': 'SUBMITTED',
        'broker_order_id': broker_order.id
    })
    
    return broker_order

Handling Failures:

def process_order(order_id):
    order = db.get_order(order_id)
    
    try:
        # Submit to broker
        broker_order = submit_order_to_broker(order)
        
        # Poll for execution
        for _ in range(10):  # Max 10 attempts
            status = broker_api.get_order_status(broker_order.id)
            
            if status == 'EXECUTED':
                db.update_order(order.id, {'status': 'EXECUTED'})
                update_holdings(order)
                db.update_order(order.id, {'status': 'SETTLED'})
                break
            elif status == 'FAILED':
                db.update_order(order.id, {'status': 'FAILED'})
                break
            
            time.sleep(5)  # Poll every 5 seconds
    
    except BrokerAPIException as e:
        if order.retry_count < 3:
            # Retry with exponential backoff; mark FAILED only once retries are exhausted
            queue.send_delayed(order, delay=(2 ** order.retry_count) * 60)
            db.update_order(order.id, {'retry_count': order.retry_count + 1})
        else:
            db.update_order(order.id, {'status': 'FAILED', 'error': str(e)})

Scalability:

3. Design Real-time Stock Price Updates

Requirements:

Approaches:

1. Server-Sent Events (SSE)

Server → Client (one-way stream)
# Server
@app.route('/stream/<symbol>')
def stream_price(symbol):
    def generate():
        pubsub = redis.pubsub()
        pubsub.subscribe(f'stock:{symbol}')
        
        for message in pubsub.listen():
            if message['type'] == 'message':
                yield f"data: {message['data']}\n\n"
    
    return Response(generate(), mimetype='text/event-stream')

# Client (JavaScript)
const eventSource = new EventSource('/stream/RELIANCE');
eventSource.onmessage = (event) => {
    const price = JSON.parse(event.data);
    updateUI(price);
};

2. WebSockets

Server ↔ Client (two-way)
# Server
@socketio.on('subscribe')
def handle_subscribe(symbol):
    join_room(f'stock:{symbol}')

# When price updates
def broadcast_price(symbol, price):
    socketio.emit('price_update', {'symbol': symbol, 'price': price}, 
                  room=f'stock:{symbol}')

Architecture:

Stock Exchange API (WebSocket)
   ↓
Price Aggregator Service
   ↓
Redis Pub/Sub
   ↓
WebSocket Servers (multiple instances)
   ↓
Clients

Optimizations:


Key Takeaways & Interview Tips

1. Framework for System Design Interviews

Step 1: Clarify Requirements (5 mins)

Example Questions:

Step 2: High-Level Design (10 mins)

Step 3: Deep Dive (15 mins)

Step 4: Address Bottlenecks (5 mins)

2. What Interviewers Look For

  1. Communication

    • Think aloud
    • Ask clarifying questions
    • Explain trade-offs
  2. Problem Solving

    • Break down complex problem
    • Identify constraints
    • Propose solutions
  3. Technical Knowledge

    • Understanding of systems
    • Knowledge of trade-offs
    • Awareness of technologies
  4. Pragmatism

    • Don't over-engineer
    • Start simple, iterate
    • Consider real-world constraints

3. Common Mistakes to Avoid

  1. Jumping into solution without clarifying requirements

    • Always ask questions first
  2. Over-engineering

    • Start simple, add complexity if needed
    • Don't use Kafka if a simple queue suffices
  3. Ignoring scale

    • Always think about scalability
    • But don't optimize prematurely
  4. Not considering trade-offs

    • Every decision has pros/cons
    • Discuss them
  5. Poor communication

    • Don't stay silent
    • Explain your thought process

4. Must-Know Concepts Summary

Foundational:

Intermediate:

Advanced:

5. Technologies to Know

Databases:

Caching:

Message Queues:

Load Balancers:

Monitoring:

Cloud:


Practice Questions

Easy

  1. Design a URL shortener
  2. Design Pastebin
  3. Design a rate limiter
  4. Design a cache system

Medium

  1. Design Instagram
  2. Design Twitter feed
  3. Design notification system
  4. Design API gateway
  5. Design leaderboard system
  6. Design stock portfolio tracker

Hard

  1. Design YouTube
  2. Design Uber
  3. Design distributed cache
  4. Design search autocomplete
  5. Design payment system

Additional Resources

Books

Websites

Practice


Good Luck with Your Interview!