1. Problem Definition and Scope
We are designing a global, real-time messaging application where users can communicate text and media instantly.
The goal is to build a highly reliable system that delivers messages with low latency while handling billions of active users.
- Main User Groups: Mobile users (iOS and Android).
- Main Actions: Sending text messages, sharing images, and seeing message delivery status.
- Scope:
- Included: 1-on-1 text chat, basic Group chat, Message receipts (Sent/Delivered/Read), Offline support, and Online status.
- Excluded: Voice/Video calling (VoIP), Status/Stories, Web client specifics, and deep cryptographic details of End-to-End Encryption (E2EE).
2. Clarify functional requirements
Must Have:
- 1-on-1 Messaging: Users can send and receive text messages in real-time.
- Group Messaging: Users can broadcast messages to a group of people (up to 256 members).
- Message Receipts:
  - Sent: Reached the server (1 gray tick).
  - Delivered: Reached the recipient's device (2 gray ticks).
  - Read: User opened the chat (2 blue ticks).
- Offline Support: If a recipient is offline, the message is queued and delivered as soon as they reconnect.
- Media Sharing: Users can upload and send images/videos.
Nice to Have:
- Last Seen: Users can see when a contact was last online.
- Push Notifications: Wake up the device when a new message arrives.
3. Clarify non-functional requirements
- Target Users: 1 Billion Daily Active Users (DAU).
- Message Volume: 50 Billion messages per day (approx. 50 messages per user).
- Latency: Extremely low (real-time). Messages should arrive within 200 ms if both users are online.
- Availability: High (99.99%). The service is a utility; it must essentially never be down.
- Consistency: Message ordering must be strict (FIFO) within a single chat.
- Data Retention: Messages are stored permanently to support restoring history on new devices.
4. Back of the envelope estimates
- Traffic (QPS):
- Total Messages = 50 Billion / day.
- Seconds in a day $\approx$ 86,400.
- Average QPS = $50,000,000,000 / 86,400 \approx$ 580,000 requests/sec.
- Peak QPS (e.g., New Year's) $\approx$ 3x Average $\approx$ 1.7 Million requests/sec.
- Storage (Text):
- Average message size $\approx$ 100 Bytes.
- Daily Text Storage = 50 Billion $\times$ 100 Bytes = 5 TB / day.
- Yearly Text Storage = 5 TB $\times$ 365 $\approx$ 1.8 PB (Petabytes).
- Conclusion: We need a database that scales horizontally (Sharding).
- Storage (Media):
- Assume 10% of messages are media. Average size $\approx$ 200 KB.
- Daily Media Storage = 5 Billion $\times$ 200 KB = 1 PB / day.
- Conclusion: Media must use Object Storage (S3), not the database.
- Bandwidth:
- Ingress (Media) $\approx$ 1 PB / day $\approx$ 11.5 GB/sec. Massive bandwidth usage; requires a CDN.
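The estimates above can be reproduced with a few lines of arithmetic. This is only a sanity check; every input below is one of the stated assumptions, not a measurement.

```python
# Sanity-check of the back-of-the-envelope figures above.
MESSAGES_PER_DAY = 50_000_000_000
SECONDS_PER_DAY = 86_400
AVG_MSG_BYTES = 100          # average text message size
MEDIA_RATIO = 0.10           # 10% of messages carry media
AVG_MEDIA_BYTES = 200_000    # ~200 KB per media file

avg_qps = MESSAGES_PER_DAY / SECONDS_PER_DAY          # ~580K req/s
peak_qps = 3 * avg_qps                                # ~1.7M req/s at peak
daily_text_tb = MESSAGES_PER_DAY * AVG_MSG_BYTES / 1e12      # ~5 TB/day
yearly_text_pb = daily_text_tb * 365 / 1000                  # ~1.8 PB/year
daily_media_pb = MESSAGES_PER_DAY * MEDIA_RATIO * AVG_MEDIA_BYTES / 1e15  # ~1 PB/day
media_ingress_gb_s = daily_media_pb * 1e15 / SECONDS_PER_DAY / 1e9        # ~11.5 GB/s
```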
5. API design
For real-time chat, plain HTTP request/response is a poor fit: the server cannot push data to the client, and polling adds latency plus per-request header overhead.
We will use WebSockets (or a persistent TCP protocol) to keep a connection open between the client and server.
1. Connect (WebSocket Handshake)
- GET /ws/connect
- Headers: Authorization: Bearer <token>
- Response: 101 Switching Protocols (Connection Established).
2. Send Message (WebSocket Event)
- Client sends:
{
  "type": "send_message",
  "data": {
    "to_user_id": "user-uuid-2",
    "content": "encrypted-blob",
    "type": "TEXT"
  }
}
3. Upload Media (REST Endpoint)
- POST /v1/media
- Request: Multipart file (image/video).
- Response:
{
  "media_id": "media-uuid-99",
  "url": "https://s3.region.amazonaws.com/bucket/media-uuid-99.jpg"
}
4. Acknowledge Message (WebSocket Event)
- Client sends:
{
  "type": "ack",
  "data": { "message_id": "msg-123", "status": "DELIVERED" }
}
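A minimal client-side sketch of the two WebSocket events above. Flow 1 later notes that the client generates a `message_id` locally for deduplication, so the sketch includes that field even though the sample payload above omits it; the function names are illustrative, not part of any real API.

```python
import json
import uuid

def make_send_frame(to_user_id: str, content: str) -> str:
    """Build the 'send_message' frame described above. The client
    generates message_id locally so the server can deduplicate retries."""
    return json.dumps({
        "type": "send_message",
        "data": {
            "message_id": str(uuid.uuid4()),  # client-side ID for dedup
            "to_user_id": to_user_id,
            "content": content,
            "type": "TEXT",
        },
    })

def make_ack_frame(message_id: str, status: str = "DELIVERED") -> str:
    """Build the 'ack' frame the recipient sends back."""
    return json.dumps({
        "type": "ack",
        "data": {"message_id": message_id, "status": status},
    })
```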
6. High level architecture
We will decouple the layer that holds the connections from the layer that processes the logic.
Client $\rightarrow$ Load Balancer $\rightarrow$ Gateway Service $\rightarrow$ Message Service $\rightarrow$ Database
- Client: The mobile app. Maintains a local SQLite database for chat history and a persistent WebSocket connection to the server.
- Load Balancer: Distributes millions of WebSocket connections across the Gateway servers.
- Gateway Service (Connection Holder):
  - A stateful server that holds the persistent TCP/WebSocket connections for users.
  - It maintains a map of UserID $\rightarrow$ Socket Connection.
  - It acts as a pipe: pushes data to clients and forwards requests to the Message Service.
- Session Cache (Redis): A global lookup that tells us "Which Gateway server is User B connected to?".
- Message Service: Stateless microservice. It handles message validation, persistence to the DB, and routing logic.
- Group Service: Manages group membership and fan-out logic.
- Database (Cassandra): Stores chat history and offline message queues.
- Object Storage (S3): Stores images and videos.
7. Data model
We need a database optimized for extremely high write throughput. SQL databases struggle to scale to billions of writes/day without complex sharding. We will use a Wide-Column NoSQL store (Cassandra).
1. Messages Table (Chat History)
- Partition Key: chat_id (Unique ID for 1-on-1 or Group conversation). This keeps all messages for a chat on the same database node, making "Load History" extremely fast.
- Clustering Key: message_id (TimeUUID). This sorts messages chronologically.
| Column | Type | Description |
|---|---|---|
| chat_id | UUID | Partition Key. |
| message_id | TIMEUUID | Clustering Key. |
| sender_id | UUID | Who sent it. |
| content | BLOB | Encrypted payload. |
| status | INT | Sent, Delivered, Read. |
2. User_Unread_Queue (Offline Optimization)
- Partition Key: user_id.
- Clustering Key: message_id.
- Purpose: When a user comes online, we query this table to quickly find only the messages they missed, rather than scanning the entire chat history.
3. Group_Members (SQL or NoSQL)
- group_id (PK), user_id (PK). Used to look up "Who do I send this group message to?".
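The chronological sorting promised by the clustering key relies on TimeUUIDs (UUID version 1), which embed their creation timestamp. A quick illustration with Python's standard library:

```python
import time
import uuid

# A TimeUUID (UUID v1) embeds a 60-bit timestamp (100 ns ticks), so
# sorting rows by the clustering key also sorts messages by send time
# within a chat partition.
first = uuid.uuid1()
time.sleep(0.01)                 # ensure a strictly later timestamp
second = uuid.uuid1()

# .time exposes the embedded timestamp field of the UUID
assert first.time < second.time  # clustering order == chronological order
```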
8. Core flows end to end
This section details exactly how data moves through the system. We will trace the path of a message step-by-step to understand how the components (Gateways, Database, Caches) interact.
Flow 1: User A sends a message to User B (Both Online)
This is the "happy path" where both users have the app open and are connected to the internet.
- Connection Established (Pre-requisite)
- Both User A and User B have already established a long-lived WebSocket connection with the Gateway Service.
- User A is connected to Gateway Node 1.
- User B is connected to Gateway Node 2.
- Redis (Session Cache) stores this mapping: User A -> Gateway 1 and User B -> Gateway 2.
- Message Sending (User A)
- User A types "Hello" and presses send.
- The mobile app generates a message_id locally (for deduplication) and sends a JSON payload over the existing WebSocket connection to Gateway Node 1.
- Note: The UI immediately shows a pending indicator (a clock icon meaning "Sending...") to make the app feel responsive, even before the server confirms.
- Processing and Persistence
- Gateway Node 1 receives the payload. It is just a pipe, so it forwards the request to the Message Service (the brain of the system).
- The Message Service first persists the message to the Cassandra Database (Messages Table) with a status of SENT.
- Why Persist First? We save to disk before attempting delivery to ensure we never lose the message, even if the server crashes milliseconds later.
- Acknowledgment to Sender
- Once Cassandra confirms the write, the Message Service tells Gateway Node 1 to send an ACK back to User A.
- User A’s screen updates: the message now shows a single grey tick (Sent: reached the server).
- Routing and Delivery
- The Message Service needs to find User B. It queries the Redis Session Cache: "Where is User B connected?"
- Redis replies: "User B is on Gateway Node 2."
- The Message Service forwards the message payload to Gateway Node 2.
- Gateway Node 2 pushes the message down the open WebSocket connection to User B's device.
- Delivery Receipt
- User B's app receives the message and immediately sends a background ACK: DELIVERED packet back to Gateway Node 2.
- This Ack travels back: Gateway 2 -> Message Service -> Update Cassandra Status -> Gateway 1 -> User A.
- User A’s screen updates: The tick becomes a double grey tick (Delivered).
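The routing step (5) can be sketched with in-memory dictionaries standing in for Redis and the Gateway nodes' socket maps. All names here (`session_cache`, `route_message`) are illustrative, not a real API.

```python
# Minimal sketch of online routing (Flow 1, step "Routing and Delivery").
session_cache = {"user_a": "gateway_1", "user_b": "gateway_2"}  # Redis stand-in
gateways = {"gateway_1": [], "gateway_2": []}  # gateway_id -> frames pushed to clients

def route_message(to_user: str, payload: dict) -> bool:
    """Look up the recipient's gateway and push; False means offline."""
    gateway_id = session_cache.get(to_user)
    if gateway_id is None:
        return False                          # recipient offline -> Flow 2 takes over
    gateways[gateway_id].append(payload)      # push down the open WebSocket
    return True

delivered = route_message("user_b", {"from": "user_a", "content": "Hello"})
```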
Flow 2: User A sends a message to User B (User B is Offline)
This flow handles the case where the recipient has no internet connection or has the app closed.
- Send and Persist (Same as Flow 1)
- User A sends the message.
- The Message Service saves it to Cassandra as SENT.
- User A receives the single grey tick (Server Confirmation).
- Routing Failure
- The Message Service queries Redis for User B.
- Redis returns NULL or an "Offline" status.
- Queuing for Later
- The Message Service writes a new entry into the User_Unread_Queue (a specific table in Cassandra designed for fast lookups).
- It stores user_id: B and message_id: XYZ.
- External Notification
- Since the WebSocket path is dead, the Message Service calls the Push Notification Service.
- This service sends a payload to Apple (APNS) or Google (FCM) to wake up User B’s phone with a system notification: "New message from User A."
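The offline branch can be sketched the same way. The `unread_queue` dict stands in for the User_Unread_Queue table and `push_log` for the APNS/FCM call; both names are illustrative.

```python
# Minimal sketch of the offline path (Flow 2): queue, then push-notify.
session_cache = {}                       # User B has no live connection
unread_queue = {}                        # user_id -> queued message_ids (Cassandra stand-in)
push_log = []                            # stands in for calls to APNS/FCM

def send_to_user(user_id: str, message_id: str) -> None:
    """If routing fails, queue the message and wake the device."""
    if session_cache.get(user_id) is None:               # Redis returned NULL
        unread_queue.setdefault(user_id, []).append(message_id)
        push_log.append(f"push:{user_id}")               # system notification

send_to_user("user_b", "msg-xyz")
```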
Flow 3: User B comes Online (Synchronization)
This flow explains how User B catches up on missed messages when they open the app.
- Reconnection
- User B opens the app. The client establishes a new WebSocket connection to a random Gateway Node (e.g., Gateway Node 3).
- Gateway Node 3 updates Redis: User B -> Gateway 3.
- Checking the Queue
- Upon connection, Gateway Node 3 (or the Message Service) automatically queries the User_Unread_Queue table in Cassandra for User B.
- Bulk Delivery
- If the queue has messages (e.g., 5 unread messages), the service fetches the full message content from the Messages Table.
- It sends all 5 messages down the WebSocket to User B in a batch.
- User B's client sends ACK: DELIVERED for all of them.
- Cleanup
- Once the specific ACKs are received, the system deletes those entries from the User_Unread_Queue so they aren't redelivered later.
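Flow 3's drain-and-cleanup can be sketched with in-memory stand-ins for the two tables. Note that queue entries are deleted only after the client ACKs, so a crash mid-sync just means a harmless redelivery.

```python
# Minimal sketch of reconnection sync (Flow 3).
unread_queue = {"user_b": ["msg-1", "msg-2", "msg-3"]}       # User_Unread_Queue
messages = {"msg-1": "hi", "msg-2": "there", "msg-3": "!"}   # Messages table

def sync_on_reconnect(user_id: str) -> list:
    """Fetch the full content of every queued message, as one batch."""
    return [messages[mid] for mid in unread_queue.get(user_id, [])]

def on_ack(user_id: str, message_id: str) -> None:
    """Delete a queue entry only once the client ACKs DELIVERED."""
    unread_queue[user_id].remove(message_id)

batch = sync_on_reconnect("user_b")            # pushed down the WebSocket
for mid in list(unread_queue["user_b"]):       # client ACKs each message
    on_ack("user_b", mid)
```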
9. Caching and read performance
- Session Cache (Redis):
- Data: Key: UserID, Value: Gateway_ID.
- Why: This is hit for every message sent. It must be blazing fast (latency < 5ms).
- TTL: Keys expire if the WebSocket connection is closed or heartbeats fail.
- User Profile Cache:
- Data: Display Name, Avatar URL, Status.
- Why: Profile data rarely changes but is read often.
- Chat History:
- We do not cache chat history in Redis. It is too large (Petabytes).
- Cassandra is optimized for sequential reads. When a user opens a chat, we fetch the last 50 messages from Cassandra efficiently using the chat_id.
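"Fetch the last 50 messages" maps onto a single-partition read because all of a chat's rows share one partition key. A toy model of that read path (the dict stands in for a Cassandra partition; names are illustrative):

```python
# One partition per chat_id, rows kept in clustering-key (time) order.
chat_partition = {
    "chat-42": [f"msg-{i}" for i in range(200)],  # oldest -> newest
}

def load_history(chat_id: str, limit: int = 50) -> list:
    """Rough equivalent of:
    SELECT * FROM messages WHERE chat_id = ? ORDER BY message_id DESC LIMIT ?"""
    return chat_partition.get(chat_id, [])[-limit:]

recent = load_history("chat-42")
```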
10. Storage, indexing and media
- Database (Cassandra):
- Used for all text messages and metadata.
- Indexing: The primary key (chat_id, message_id) allows us to query "Get messages for Chat X where time > Y".
- Media Storage (S3 + CDN):
- We never store binary images in Cassandra.
- Flow:
- User uploads image to API Server $\rightarrow$ S3.
- API Server returns a URL.
- User sends a text message containing the URL.
- Receiver's phone sees the URL and downloads the image from a CDN (like CloudFront) to ensure fast download speeds globally.
11. Scaling strategies
- Gateway Scaling:
- To handle 1 Billion users, we need thousands of Gateway servers.
- We use a Load Balancer to route new users to the least busy server.
- If a Gateway crashes, the 100k users on it disconnect and automatically reconnect to a new Gateway.
- Database Partitioning:
- Cassandra partitions data automatically based on the chat_id.
- This ensures even distribution of data across the cluster.
- Group Chat Fan-out:
- For a group with 200 people, the server must generate 200 outgoing messages.
- To prevent blocking the sender, we use a Message Queue (Kafka). The Message Service publishes a "Group Message" event to Kafka. Background workers consume this event, look up the 200 members, and process the delivery for each member in parallel.
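The queued fan-out can be sketched as below, with `queue.Queue` standing in for Kafka and a single worker expanding one group event into per-member deliveries. All names are illustrative.

```python
import queue

# Kafka stand-in: the Message Service publishes ONE event per group
# message and returns immediately; workers do the per-member expansion.
group_members = {"group-7": [f"user-{i}" for i in range(200)]}
events = queue.Queue()
outbox = []                           # (recipient, content) deliveries

def publish_group_message(group_id: str, content: str) -> None:
    """Sender path: enqueue one event; the sender is never blocked."""
    events.put({"group_id": group_id, "content": content})

def fanout_worker() -> None:
    """Worker path: look up members and fan out one delivery each."""
    event = events.get()
    for member in group_members[event["group_id"]]:
        outbox.append((member, event["content"]))

publish_group_message("group-7", "hello group")
fanout_worker()
```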
12. Reliability, failure handling and backpressure
- No Single Point of Failure:
- Gateways, Services, and Databases are all replicated across multiple Availability Zones (AZs).
- Store and Forward:
- We guarantee At-Least-Once delivery. If we are not sure if User B got the message (e.g., network timeout during Ack), we resend it. The client app handles deduplication using the message_id.
- Backpressure:
- If a user is being spammed (e.g., receiving 1000 messages/sec), the Gateway buffer will fill up. The Gateway can stop reading from the socket (TCP flow control), forcing the sender to slow down.
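The at-least-once contract above makes client-side deduplication essential: a sketch of how the client drops resends by `message_id` (names illustrative).

```python
# At-least-once delivery: the server resends any message whose ACK was
# lost, so the client must drop duplicates by message_id.
seen_ids = set()
inbox = []

def on_receive(message_id: str, content: str) -> None:
    if message_id in seen_ids:
        return                    # duplicate resend -> drop silently
    seen_ids.add(message_id)
    inbox.append(content)

on_receive("msg-1", "Hello")
on_receive("msg-1", "Hello")      # server retry after a lost ACK
```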
13. Security, privacy and abuse
- End-to-End Encryption (E2EE):
- This is the gold standard (Signal Protocol).
- Keys are generated on the user's device. The server only holds the Public Key.
- The server processes encrypted blobs (content). It cannot read the messages.
- Authentication:
- Phone number verification via SMS OTP.
- WebSocket connections are authenticated via a secure, short-lived session token.
- Abuse:
- Rate limiting on the "Send Message" API to prevent bots.
- Ability to Block users (the server drops messages from blocked users before delivery).
14. Bottlenecks and next steps
- Bottleneck: Group Fan-out:
- For very large groups (e.g., 10,000 users), the loop-based fan-out is too slow.
- Next Step: Switch to a "Pull" model for large groups, where users poll for the latest messages when they open the group, rather than the server pushing every message.
- Bottleneck: Global Latency:
- A user in Australia messaging a user in the UK might experience lag if the data center is in the US.
- Next Step: Deploy Gateways in multiple regions (Edge locations). Connect these regions via a high-speed private backbone network to reduce latency.
Summary
- Architecture: Stateful Gateways (WebSockets) + Stateless Services.
- Storage: Cassandra for write-heavy chat logs; S3 for media.
- Reliability: The "Unread Queue" ensures messages are never lost while a user is offline.
- Performance: Routing via the Redis Session Cache enables millisecond delivery.