Chapter 8: System Design Interview Mastery

8.5 Design a messaging system (WhatsApp / Facebook Messenger)

1. Problem Definition and Scope

We are designing a global, real-time messaging application where users can communicate text and media instantly.

The goal is to build a highly reliable system that delivers messages with low latency while handling billions of active users.

  • Main User Groups: Mobile users (iOS and Android).

  • Main Actions: Sending text messages, sharing images, and seeing message delivery status.

  • Scope:

    • Included: 1-on-1 text chat, basic Group chat, Message receipts (Sent/Delivered/Read), Offline support, and Online status.
    • Excluded: Voice/Video calling (VoIP), Status/Stories, Web client specifics, and deep cryptographic details of End-to-End Encryption (E2EE).

2. Clarify functional requirements

Must Have:

  • 1-on-1 Messaging: Users can send and receive text messages in real-time.

  • Group Messaging: Users can broadcast messages to a group of people (up to 256 members).

  • Message Receipts:

    • Sent: Reached the server (1 gray tick).
    • Delivered: Reached the recipient's device (2 gray ticks).
    • Read: User opened the chat (2 blue ticks).
  • Offline Support: If a recipient is offline, the message is queued and delivered immediately when they return.

  • Media Sharing: Users can upload and send images/videos.

Nice to Have:

  • Last Seen: Users can see when a contact was last online.
  • Push Notifications: Wake up the device when a new message arrives.


3. Clarify non-functional requirements

  • Target Users: 1 Billion Daily Active Users (DAU).

  • Message Volume: 50 Billion messages per day (approx. 50 messages per user per day).

  • Latency: Extremely low (Real-time). Messages should arrive within 200ms if both users are online.

  • Availability: High (99.99%). The service is a utility; it must essentially never be down.

  • Consistency: Message ordering must be strict (FIFO) within a single chat.

  • Data Retention: Messages are stored permanently to support restoring history on new devices.


4. Back of the envelope estimates

  • Traffic (QPS):
    • Total Messages = 50 Billion / day.
    • Seconds in a day $\approx$ 86,400.
    • Average QPS = $50,000,000,000 / 86,400 \approx$ 580,000 requests/sec.
    • Peak QPS (e.g., New Year's) $\approx$ 3x Average $\approx$ 1.7 Million requests/sec.
  • Storage (Text):
    • Average message size $\approx$ 100 Bytes.
    • Daily Text Storage = 50 Billion $\times$ 100 Bytes = 5 TB / day.
    • Yearly Text Storage = 5 TB $\times$ 365 $\approx$ 1.8 PB (Petabytes).
    • Conclusion: We need a database that scales horizontally (Sharding).
  • Storage (Media):
    • Assume 10% of messages are media. Average size $\approx$ 200 KB.
    • Daily Media Storage = 5 Billion $\times$ 200 KB = 1 PB / day.
    • Conclusion: Media must use Object Storage (S3), not the database.
  • Bandwidth:
    • Ingress (Media) $\approx$ 1 PB / day $\approx$ 11.6 GB/sec of uploads alone; downloads add at least as much again, so media must be served through a CDN.
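This arithmetic is easy to mistype in an interview; a few lines of Python reproduce every figure above:

```python
# Quick sanity check of the back-of-the-envelope estimates (pure arithmetic).
MESSAGES_PER_DAY = 50_000_000_000
SECONDS_PER_DAY = 86_400

avg_qps = MESSAGES_PER_DAY / SECONDS_PER_DAY            # ~580,000 req/sec
peak_qps = 3 * avg_qps                                  # ~1.7M req/sec

AVG_MSG_BYTES = 100
daily_text_tb = MESSAGES_PER_DAY * AVG_MSG_BYTES / 1e12     # 5 TB/day
yearly_text_pb = daily_text_tb * 365 / 1000                 # ~1.8 PB/year

MEDIA_FRACTION = 0.10                                   # 10% of messages are media
AVG_MEDIA_BYTES = 200_000                               # 200 KB average
daily_media_pb = MESSAGES_PER_DAY * MEDIA_FRACTION * AVG_MEDIA_BYTES / 1e15  # 1 PB/day
media_ingress_gb_per_sec = daily_media_pb * 1e15 / SECONDS_PER_DAY / 1e9     # ~11.6 GB/sec
```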


5. API design

For real-time chat, plain HTTP request/response is a poor fit: every request repeats header overhead, and the server cannot push new messages to the client without constant polling.

We will use WebSockets (or a persistent TCP protocol) to keep a connection open between the client and server.

1. Connect (WebSocket Handshake)

  • GET /ws/connect
  • Headers: Authorization: Bearer <token>, Upgrade: websocket, Connection: Upgrade
  • Response: 101 Switching Protocols (Connection Established).

2. Send Message (WebSocket Event)

  • Client sends:
{
  "type": "send_message",
  "data": {
    "to_user_id": "user-uuid-2",
    "content": "encrypted-blob",
    "type": "TEXT"
  }
}

3. Upload Media (REST Endpoint)

  • POST /v1/media
  • Request: Multipart file (image/video).
  • Response:

{
  "media_id": "media-uuid-99",
  "url": "https://s3.region.amazonaws.com/bucket/media-uuid-99.jpg"
}

4. Acknowledge Message (WebSocket Event)

  • Client sends:
{
  "type": "ack",
  "data": { "message_id": "msg-123", "status": "DELIVERED" }
}

6. High level architecture

We will decouple the layer that holds the connections from the layer that processes the logic.

Client $\rightarrow$ Load Balancer $\rightarrow$ Gateway Service $\rightarrow$ Message Service $\rightarrow$ Database

  1. Client: The mobile app. Maintains a local SQLite database for chat history and a persistent WebSocket connection to the server.

  2. Load Balancer: Distributes millions of WebSocket connections across the Gateway servers.

  3. Gateway Service (Connection Holder):

    • A stateful server that holds the persistent TCP/WebSocket connection for users.
    • It maintains a map of UserID $\rightarrow$ Socket Connection.
    • It acts as a pipe: pushes data to clients and forwards requests to the Message Service.
  4. Session Cache (Redis): A global lookup that tells us "Which Gateway server is User B connected to?".

  5. Message Service: Stateless microservice. It handles message validation, persistence to the DB, and routing logic.

  6. Group Service: Manages group members and fan-out logic.

  7. Database (Cassandra): Stores chat history and offline message queues.

  8. Object Storage (S3): Stores images and videos.
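The contract between components 3 and 4 is the key one: the Gateway owns the local socket map, and every connect/disconnect must also update the global session cache. A minimal sketch, with a plain dict standing in for Redis:

```python
# In-memory sketch: Gateway's local UserID -> socket map plus the global
# session cache (user_id -> gateway_id). A dict stands in for Redis.

class Gateway:
    def __init__(self, gateway_id: str, session_cache: dict):
        self.gateway_id = gateway_id
        self.sockets = {}                     # user_id -> socket (local)
        self.session_cache = session_cache    # user_id -> gateway_id (global)

    def on_connect(self, user_id: str, socket) -> None:
        self.sockets[user_id] = socket
        self.session_cache[user_id] = self.gateway_id   # register globally

    def on_disconnect(self, user_id: str) -> None:
        self.sockets.pop(user_id, None)
        self.session_cache.pop(user_id, None)           # mark user offline

cache = {}
gw1 = Gateway("gateway-1", cache)
gw1.on_connect("user-a", object())   # any socket object
```

In production the cache entry would also carry a TTL refreshed by heartbeats, so a crashed Gateway's users do not look online forever.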


7. Data model

We need a database optimized for extremely high write throughput. SQL databases struggle to scale to billions of writes/day without complex sharding. We will use a Wide-Column NoSQL store (Cassandra).

1. Messages Table (Chat History)

  • Partition Key: chat_id (Unique ID for 1-on-1 or Group conversation). This keeps all messages for a chat on the same database node, making "Load History" extremely fast.
  • Clustering Key: message_id (TimeUUID). This sorts messages chronologically.
  | Column     | Type     | Description            |
  | ---------- | -------- | ---------------------- |
  | chat_id    | UUID     | Partition Key.         |
  | message_id | TIMEUUID | Clustering Key.        |
  | sender_id  | UUID     | Who sent it.           |
  | content    | BLOB     | Encrypted payload.     |
  | status     | INT      | Sent, Delivered, Read. |
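A small illustration of why a TIMEUUID clustering key sorts messages chronologically: Python's `uuid.uuid1()` embeds the same kind of 60-bit timestamp, so sorting by it recovers send order.

```python
import time
import uuid

# uuid1() embeds a 100-ns-resolution timestamp (the same idea as Cassandra's
# TIMEUUID), so sorting message ids by that field sorts them by creation time.

first = uuid.uuid1()
time.sleep(0.01)                 # guarantee a later embedded timestamp
second = uuid.uuid1()

ordered = sorted([second, first], key=lambda u: u.time)
# `first` sorts before `second` because its embedded timestamp is earlier.
```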

2. User_Unread_Queue (Offline Optimization)

  • Partition Key: user_id.
  • Clustering Key: message_id.
  • Purpose: When a user comes online, we query this table to quickly find only the messages they missed, rather than scanning the entire chat history.

3. Group_Members (SQL or NoSQL)

  • group_id (PK), user_id (PK). Used to look up "Who do I send this group message to?".

8. Core flows end to end

This section details exactly how data moves through the system. We will trace the path of a message step-by-step to understand how the components (Gateways, Database, Caches) interact.

Flow 1: User A sends a message to User B (Both Online)

This is the "happy path" where both users have the app open and are connected to the internet.

  1. Connection Established (Pre-requisite)
    • Both User A and User B have already established a long-lived WebSocket connection with the Gateway Service.
    • User A is connected to Gateway Node 1.
    • User B is connected to Gateway Node 2.
    • Redis (Session Cache) stores this mapping: User A -> Gateway 1 and User B -> Gateway 2.
  2. Message Sending (User A)
    • User A types "Hello" and presses send.
    • The mobile app generates a message_id locally (used later for deduplication) and sends a JSON payload over the existing WebSocket connection to Gateway Node 1.
    • Note: The UI immediately marks the message as pending (e.g., a clock icon) to make the app feel responsive, even before the server confirms.
  3. Processing and Persistence
    • Gateway Node 1 receives the payload. It is just a pipe, so it forwards the request to the Message Service (the brain of the system).
    • The Message Service first persists the message to the Cassandra Database (Messages Table) with a status of SENT.
    • Why Persist First? We save to disk before attempting delivery to ensure we never lose the message, even if the server crashes milliseconds later.
  4. Acknowledgment to Sender
    • Once Cassandra confirms the write, the Message Service tells Gateway Node 1 to send an ACK back to User A.
    • User A's screen updates: the pending indicator becomes a single grey tick (Sent: reached the server).
  5. Routing and Delivery
    • The Message Service needs to find User B. It queries the Redis Session Cache: "Where is User B connected?"
    • Redis replies: "User B is on Gateway Node 2."
    • The Message Service forwards the message payload to Gateway Node 2.
    • Gateway Node 2 pushes the message down the open WebSocket connection to User B's device.
  6. Delivery Receipt
    • User B's app receives the message and immediately sends a background ACK: DELIVERED packet back to Gateway Node 2.
    • This Ack travels back: Gateway 2 -> Message Service -> Update Cassandra Status -> Gateway 1 -> User A.
    • User A’s screen updates: The tick becomes a double grey tick (Delivered).
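Flow 1's persist-then-route rule can be sketched in a few lines (dicts stand in for Cassandra, Redis, and the Gateway fleet; all names are illustrative):

```python
# Sketch of the Message Service's core logic for Flow 1:
# persist to the database FIRST, then look up the recipient's Gateway.

def handle_send(db: dict, session_cache: dict, gateways: dict,
                chat_id: str, message_id: str, sender: str,
                recipient: str, content: str) -> str:
    # 1. Persist first: the message survives even if the server crashes
    #    before delivery completes.
    db.setdefault(chat_id, {})[message_id] = {
        "sender_id": sender, "content": content, "status": "SENT",
    }
    # 2. Route: which Gateway holds the recipient's socket?
    gateway_id = session_cache.get(recipient)
    if gateway_id is None:
        return "QUEUED"                     # recipient offline -> Flow 2
    gateways[gateway_id].append((recipient, message_id))  # push down socket
    return "PUSHED"

db, cache, gateways = {}, {"user-b": "gateway-2"}, {"gateway-2": []}
result = handle_send(db, cache, gateways, "chat-1", "msg-1",
                     "user-a", "user-b", "encrypted-blob")
```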


Flow 2: User A sends a message to User B (User B is Offline)

This flow handles the case where the recipient has no internet connection or has the app closed.

  1. Send and Persist (Same as Flow 1)
    • User A sends the message.
    • The Message Service saves it to Cassandra as SENT.
    • User A receives the single grey tick (Server Confirmation).
  2. Routing Failure
    • The Message Service queries Redis for User B.
    • Redis returns NULL or an "Offline" status.
  3. Queuing for Later
    • The Message Service writes a new entry into the User_Unread_Queue (a specific table in Cassandra designed for fast lookups).
    • It stores user_id: B and message_id: XYZ.
  4. External Notification
    • Since the WebSocket path is dead, the Message Service calls the Push Notification Service.
    • This service sends a payload to Apple (APNS) or Google (FCM) to wake up User B’s phone with a system notification: "New message from User A."
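The offline branch is the natural extension of the routing step: on a session-cache miss, queue the message id and fire a push notification. A sketch with in-memory stand-ins (a list models the push-notification call):

```python
# Sketch of Flow 2: no session-cache entry means the recipient is offline,
# so the message id goes into the unread queue and a push is triggered.

def deliver_or_queue(session_cache: dict, unread_queue: dict,
                     push_service: list, user_id: str, message_id: str) -> str:
    if session_cache.get(user_id) is None:
        unread_queue.setdefault(user_id, []).append(message_id)  # for Flow 3
        push_service.append((user_id, "New message"))            # APNS/FCM call
        return "QUEUED"
    return "PUSHED"

queue, pushes = {}, []
status = deliver_or_queue({}, queue, pushes, "user-b", "msg-42")
```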


Flow 3: User B comes Online (Synchronization)

This flow explains how User B catches up on missed messages when they open the app.

  1. Reconnection
    • User B opens the app. The client establishes a new WebSocket connection to whichever Gateway Node the load balancer assigns (e.g., Gateway Node 3).
    • Gateway Node 3 updates Redis: User B -> Gateway 3.
  2. Checking the Queue
    • Upon connection, Gateway Node 3 (or the Message Service) automatically queries the User_Unread_Queue table in Cassandra for User B.
  3. Bulk Delivery
    • If the queue has messages (e.g., 5 unread messages), the service fetches the full message content from the Messages Table.
    • It sends all 5 messages down the WebSocket to User B in a batch.
    • User B's client sends ACK: DELIVERED for all of them.
  4. Cleanup
    • Once the specific ACKs are received, the system deletes those entries from the User_Unread_Queue so they aren't redelivered later.
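Flow 3's drain-then-delete contract — queue entries are removed only after an explicit ACK — can be sketched as follows (in-memory stand-ins, illustrative names):

```python
# Sketch of Flow 3: batch-deliver queued messages on reconnect, then delete
# queue entries only for the ids the client actually acknowledged.

def sync_on_connect(unread_queue: dict, messages: dict, user_id: str) -> list:
    pending = unread_queue.get(user_id, [])
    return [messages[mid] for mid in pending]   # batch sent over the socket

def on_ack(unread_queue: dict, user_id: str, acked_ids: set) -> None:
    remaining = [m for m in unread_queue.get(user_id, []) if m not in acked_ids]
    if remaining:
        unread_queue[user_id] = remaining       # partial ack: keep the rest
    else:
        unread_queue.pop(user_id, None)         # fully drained: clean up

queue = {"user-b": ["msg-1", "msg-2"]}
msgs = {"msg-1": "hi", "msg-2": "there"}
batch = sync_on_connect(queue, msgs, "user-b")
on_ack(queue, "user-b", {"msg-1", "msg-2"})
```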


9. Caching and read performance

  • Session Cache (Redis):
    • Data: Key: UserID, Value: Gateway_ID.
    • Why: This is hit for every message sent. It must be blazing fast (latency < 5ms).
    • TTL: Keys expire if the WebSocket connection is closed or heartbeats fail.
  • User Profile Cache:
    • Data: Display Name, Avatar URL, Status.
    • Why: Profile data rarely changes but is read often.
  • Chat History:
    • We do not cache chat history in Redis. It is too large (Petabytes).
    • Cassandra is optimized for sequential reads. When a user opens a chat, we fetch the last 50 messages from Cassandra efficiently using the chat_id.

10. Storage, indexing and media

  • Database (Cassandra):
    • Used for all text messages and metadata.
    • Indexing: The primary key (chat_id, message_id) allows us to query "Get messages for Chat X where time > Y".
  • Media Storage (S3 + CDN):
    • We never store binary images in Cassandra.
    • Flow:
      1. User uploads image to API Server $\rightarrow$ S3.
      2. API Server returns a URL.
      3. User sends a text message containing the URL.
      4. Receiver's phone sees the URL and downloads the image from a CDN (like CloudFront) to ensure fast download speeds globally.

11. Scaling strategies

  • Gateway Scaling:
    • To handle 1 Billion users, we need thousands of Gateway servers.
    • We use a Load Balancer to route new users to the least busy server.
    • If a Gateway crashes, the 100k users on it disconnect and automatically reconnect to a new Gateway.
  • Database Partitioning:
    • Cassandra partitions data automatically based on the chat_id.
    • This ensures even distribution of data across the cluster.
  • Group Chat Fan-out:
    • For a group with 200 people, the server must generate 200 outgoing messages.
    • To prevent blocking the sender, we use a Message Queue (Kafka). The Message Service publishes a "Group Message" event to Kafka. Background workers consume this event, look up the 200 members, and process the delivery for each member in parallel.
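The publish/consume split above can be sketched with a deque standing in for the Kafka topic (all names are illustrative):

```python
from collections import deque

# Sketch of group fan-out: the Message Service publishes ONE event; a
# background worker expands it into one delivery task per group member.

def publish_group_message(topic: deque, group_id: str, message_id: str) -> None:
    # Sender's request returns as soon as this single event is enqueued.
    topic.append({"group_id": group_id, "message_id": message_id})

def fanout_worker(topic: deque, group_members: dict) -> list:
    deliveries = []
    while topic:
        event = topic.popleft()
        for member in group_members[event["group_id"]]:
            deliveries.append((member, event["message_id"]))  # per-member task
    return deliveries

topic = deque()
members = {"group-7": ["user-b", "user-c", "user-d"]}
publish_group_message(topic, "group-7", "msg-9")
deliveries = fanout_worker(topic, members)
```

Real workers would run the per-member deliveries in parallel; the point here is only that the sender pays for one enqueue, not 200 pushes.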

12. Reliability, failure handling and backpressure

  • No Single Point of Failure:
    • Gateways, Services, and Databases are all replicated across multiple Availability Zones (AZs).
  • Store and Forward:
    • We guarantee At-Least-Once delivery. If we are not sure if User B got the message (e.g., network timeout during Ack), we resend it. The client app handles deduplication using the message_id.
  • Backpressure:
    • If a user is being spammed (e.g., receiving 1000 messages/sec), the Gateway buffer will fill up. The Gateway can stop reading from the socket (TCP flow control), forcing the sender to slow down.
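The at-least-once guarantee above pushes deduplication to the client; a minimal sketch of that check:

```python
# Client-side deduplication for at-least-once delivery: a redelivered
# message is dropped if its message_id has already been applied.

def apply_message(seen: set, inbox: list, message_id: str, content: str) -> bool:
    if message_id in seen:
        return False             # duplicate resend; ignore it
    seen.add(message_id)
    inbox.append(content)        # first delivery: show it in the chat
    return True

seen, inbox = set(), []
apply_message(seen, inbox, "msg-1", "hello")
duplicate_applied = apply_message(seen, inbox, "msg-1", "hello")  # resend
```

A real client would persist the seen set (e.g., in its local SQLite database) so deduplication survives app restarts.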

13. Security, privacy and abuse

  • End-to-End Encryption (E2EE):
    • This is the gold standard (Signal Protocol).
    • Keys are generated on the user's device. The server only holds the Public Key.
    • The server processes encrypted blobs (content). It cannot read the messages.
  • Authentication:
    • Phone number verification via SMS OTP.
    • WebSocket connections are authenticated via a secure, short-lived session token.
  • Abuse:
    • Rate limiting on the "Send Message" API to prevent bots.
    • Ability to Block users (the server drops messages from blocked users before delivery).
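Rate limiting the send path is commonly implemented as a token bucket; a minimal sketch (capacity and refill values are illustrative, not production tuning):

```python
import time

# Token-bucket rate limiter for the "Send Message" API: each send costs one
# token; tokens refill continuously up to a fixed capacity.

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = capacity            # start full
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True                   # request admitted
        return False                      # over the limit; reject or delay

bucket = TokenBucket(capacity=2, refill_per_sec=0.0)  # no refill: 2 sends max
results = [bucket.allow() for _ in range(3)]
```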

14. Bottlenecks and next steps

  • Bottleneck: Group Fan-out:
    • For very large groups (e.g., 10,000 users), the loop-based fan-out is too slow.
    • Next Step: Switch to a "Pull" model for large groups, where users poll for the latest messages when they open the group, rather than the server pushing every message.
  • Bottleneck: Global Latency:
    • A user in Australia messaging a user in the UK might experience lag if the data center is in the US.
    • Next Step: Deploy Gateways in multiple regions (Edge locations). Connect these regions via a high-speed private backbone network to reduce latency.

Summary

  1. Architecture: Stateful Gateways (WebSockets) + Stateless Services.

  2. Storage: Cassandra for write-heavy chat logs; S3 for media.

  3. Reliability: "Unread Queue" ensures messages are never lost if a user is offline.

  4. Performance: Routing via Redis Session Cache ensures millisecond delivery.