1. Problem Definition and Scope
We are designing a global, real-time messaging application where users can communicate text and media instantly.
The goal is to build a highly reliable system that delivers messages with low latency while handling billions of active users.
- Main User Groups: Mobile users (iOS and Android).
- Main Actions: Sending text messages, sharing images, and seeing message delivery status.
- Scope:
- Included: 1-on-1 text chat, basic Group chat, Message receipts (Sent/Delivered/Read), Offline support, and Online status.
- Excluded: Voice/Video calling (VoIP), Status/Stories, Web client specifics, and deep cryptographic details of End-to-End Encryption (E2EE).
2. Clarify functional requirements
Must Have:
- 1-on-1 Messaging: Users can send and receive text messages in real-time.
- Group Messaging: Users can broadcast messages to a group of people (up to 256 members).
- Message Receipts:
  - Sent: Reached the server (1 gray tick).
  - Delivered: Reached the recipient's device (2 gray ticks).
  - Read: User opened the chat (2 blue ticks).
- Offline Support: If a recipient is offline, the message is queued and delivered as soon as they reconnect.
- Media Sharing: Users can upload and send images/videos.
Nice to Have:
- Last Seen: Users can see when a contact was last online.
- Push Notifications: Wake up the device when a new message arrives.
3. Clarify non-functional requirements
- Target Users: 1 Billion Daily Active Users (DAU).
- Message Volume: 50 Billion messages per day (approx. 50 messages per user).
- Latency: Extremely low (real-time). Messages should arrive within 200 ms if both users are online.
- Availability: High (99.99%). The service is a utility; it must essentially never be down.
- Consistency: Message ordering must be strict (FIFO) within a single chat.
- Data Retention: Messages are stored permanently to support restoring history on new devices.
4. Back of the envelope estimates
- Traffic (QPS):
- Total Messages = 50 Billion / day.
- Seconds in a day $\approx$ 86,400.
- Average QPS = $50,000,000,000 / 86,400 \approx$ 580,000 requests/sec.
- Peak QPS (e.g., New Year's) $\approx$ 3x Average $\approx$ 1.7 Million requests/sec.
- Storage (Text):
- Average message size $\approx$ 100 Bytes.
- Daily Text Storage = 50 Billion $\times$ 100 Bytes = 5 TB / day.
- Yearly Text Storage = 5 TB $\times$ 365 $\approx$ 1.8 PB (Petabytes).
- Conclusion: We need a database that scales horizontally (Sharding).
- Storage (Media):
- Assume 10% of messages are media. Average size $\approx$ 200 KB.
- Daily Media Storage = 5 Billion $\times$ 200 KB = 1 PB / day.
- Conclusion: Media must use Object Storage (S3), not the database.
- Bandwidth:
- Ingress (Media) $\approx$ 1 PB / day $\approx$ 11.5 GB/sec. Massive bandwidth usage; requires a CDN.
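The estimates above can be reproduced with a few lines of arithmetic. This is only a sanity check; every input below is one of the stated assumptions, not a measurement.

```python
# Sanity-check of the back-of-the-envelope figures above.
MESSAGES_PER_DAY = 50_000_000_000
SECONDS_PER_DAY = 86_400
AVG_MSG_BYTES = 100          # average text message size
MEDIA_RATIO = 0.10           # 10% of messages carry media
AVG_MEDIA_BYTES = 200_000    # ~200 KB per media file

avg_qps = MESSAGES_PER_DAY / SECONDS_PER_DAY          # ~580K req/s
peak_qps = 3 * avg_qps                                # ~1.7M req/s at peak
daily_text_tb = MESSAGES_PER_DAY * AVG_MSG_BYTES / 1e12      # ~5 TB/day
yearly_text_pb = daily_text_tb * 365 / 1000                  # ~1.8 PB/year
daily_media_pb = MESSAGES_PER_DAY * MEDIA_RATIO * AVG_MEDIA_BYTES / 1e15  # ~1 PB/day
media_ingress_gb_s = daily_media_pb * 1e15 / SECONDS_PER_DAY / 1e9        # ~11.5 GB/s
```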
5. API design
For real-time chat, plain HTTP request/response is a poor fit: the server cannot push data to the client, and polling adds latency plus per-request header overhead.
We will use WebSockets (or a persistent TCP protocol) to keep a connection open between the client and server.
1. Connect (WebSocket Handshake)
- GET /ws/connect
- Headers: Authorization: Bearer <token>
- Response: 101 Switching Protocols (Connection Established).
2. Send Message (WebSocket Event)
- Client sends:
{
  "type": "send_message",
  "data": {
    "to_user_id": "user-uuid-2",
    "content": "encrypted-blob",
    "type": "TEXT"
  }
}
3. Upload Media (REST Endpoint)
- POST /v1/media
- Request: Multipart file (image/video).
- Response:
{
  "media_id": "media-uuid-99",
  "url": "https://s3.region.amazonaws.com/bucket/media-uuid-99.jpg"
}
4. Acknowledge Message (WebSocket Event)
- Client sends:
{
  "type": "ack",
  "data": { "message_id": "msg-123", "status": "DELIVERED" }
}
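A minimal client-side sketch of the two WebSocket events above. Flow 1 later notes that the client generates a `message_id` locally for deduplication, so the sketch includes that field even though the sample payload above omits it; the function names are illustrative, not part of any real API.

```python
import json
import uuid

def make_send_frame(to_user_id: str, content: str) -> str:
    """Build the 'send_message' frame described above. The client
    generates message_id locally so the server can deduplicate retries."""
    return json.dumps({
        "type": "send_message",
        "data": {
            "message_id": str(uuid.uuid4()),  # client-side ID for dedup
            "to_user_id": to_user_id,
            "content": content,
            "type": "TEXT",
        },
    })

def make_ack_frame(message_id: str, status: str = "DELIVERED") -> str:
    """Build the 'ack' frame the recipient sends back."""
    return json.dumps({
        "type": "ack",
        "data": {"message_id": message_id, "status": status},
    })
```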
6. High level architecture
We will decouple the layer that holds the connections from the layer that processes the logic.
Client $\rightarrow$ Load Balancer $\rightarrow$ Gateway Service $\rightarrow$ Message Service $\rightarrow$ Database
- Client: The mobile app. Maintains a local SQLite database for chat history and a persistent WebSocket connection to the server.
- Load Balancer: Distributes millions of WebSocket connections across the Gateway servers.
- Gateway Service (Connection Holder):
  - A stateful server that holds the persistent TCP/WebSocket connections for users.
  - It maintains a map of UserID $\rightarrow$ Socket Connection.
  - It acts as a pipe: pushes data to clients and forwards requests to the Message Service.
- Session Cache (Redis): A global lookup that tells us "Which Gateway server is User B connected to?".
- Message Service: Stateless microservice. It handles message validation, persistence to the DB, and routing logic.
- Group Service: Manages group membership and fan-out logic.
- Database (Cassandra): Stores chat history and offline message queues.
- Object Storage (S3): Stores images and videos.
7. Data model
We need a database optimized for extremely high write throughput. SQL databases struggle to scale to billions of writes/day without complex sharding. We will use a Wide-Column NoSQL store (Cassandra).
1. Messages Table (Chat History)
- Partition Key: chat_id (Unique ID for 1-on-1 or Group conversation). This keeps all messages for a chat on the same database node, making "Load History" extremely fast.
- Clustering Key: message_id (TimeUUID). This sorts messages chronologically.
| Column | Type | Description |
|---|---|---|
| chat_id | UUID | Partition Key. |
| message_id | TIMEUUID | Clustering Key. |
| sender_id | UUID | Who sent it. |
| content | BLOB | Encrypted payload. |
| status | INT | Sent, Delivered, Read. |
2. User_Unread_Queue (Offline Optimization)
- Partition Key: user_id.
- Clustering Key: message_id.
- Purpose: When a user comes online, we query this table to quickly find only the messages they missed, rather than scanning the entire chat history.
3. Group_Members (SQL or NoSQL)
- group_id (PK), user_id (PK). Used to look up "Who do I send this group message to?".
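The chronological sorting promised by the clustering key relies on TimeUUIDs (UUID version 1), which embed their creation timestamp. A quick illustration with Python's standard library:

```python
import time
import uuid

# A TimeUUID (UUID v1) embeds a 60-bit timestamp (100 ns ticks), so
# sorting rows by the clustering key also sorts messages by send time
# within a chat partition.
first = uuid.uuid1()
time.sleep(0.01)                 # ensure a strictly later timestamp
second = uuid.uuid1()

# .time exposes the embedded timestamp field of the UUID
assert first.time < second.time  # clustering order == chronological order
```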
8. Core flows end to end
This section details exactly how data moves through the system. We will trace the path of a message step-by-step to understand how the components (Gateways, Database, Caches) interact.
Flow 1: User A sends a message to User B (Both Online)
This is the "happy path" where both users have the app open and are connected to the internet.
- Connection Established (Pre-requisite)
- Both User A and User B have already established a long-lived WebSocket connection with the Gateway Service.
- User A is connected to Gateway Node 1.
- User B is connected to Gateway Node 2.
- Redis (Session Cache) stores this mapping: User A -> Gateway 1 and User B -> Gateway 2.
- Message Sending (User A)
- User A types "Hello" and presses send.
- The mobile app generates a message_id locally (for deduplication) and sends a JSON payload over the existing WebSocket connection to Gateway Node 1.
- Note: The UI immediately shows a pending indicator (a clock icon meaning "Sending...") to make the app feel responsive, even before the server confirms.
- Processing and Persistence
- Gateway Node 1 receives the payload. It is just a pipe, so it forwards the request to the Message Service (the brain of the system).
- The Message Service first persists the message to the Cassandra Database (Messages Table) with a status of SENT.
- Why Persist First? We save to disk before attempting delivery to ensure we never lose the message, even if the server crashes milliseconds later.
- Acknowledgment to Sender
- Once Cassandra confirms the write, the Message Service tells Gateway Node 1 to send an ACK back to User A.
- User A’s screen updates: the message now shows a single grey tick (Sent: reached the server).
- Routing and Delivery
- The Message Service needs to find User B. It queries the Redis Session Cache: "Where is User B connected?"
- Redis replies: "User B is on Gateway Node 2."
- The Message Service forwards the message payload to Gateway Node 2.
- Gateway Node 2 pushes the message down the open WebSocket connection to User B's device.
- Delivery Receipt
- User B's app receives the message and immediately sends a background ACK: DELIVERED packet back to Gateway Node 2.
- This Ack travels back: Gateway 2 -> Message Service -> Update Cassandra Status -> Gateway 1 -> User A.
- User A’s screen updates: The tick becomes a double grey tick (Delivered).
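The routing step (5) can be sketched with in-memory dictionaries standing in for Redis and the Gateway nodes' socket maps. All names here (`session_cache`, `route_message`) are illustrative, not a real API.

```python
# Minimal sketch of online routing (Flow 1, step "Routing and Delivery").
session_cache = {"user_a": "gateway_1", "user_b": "gateway_2"}  # Redis stand-in
gateways = {"gateway_1": [], "gateway_2": []}  # gateway_id -> frames pushed to clients

def route_message(to_user: str, payload: dict) -> bool:
    """Look up the recipient's gateway and push; False means offline."""
    gateway_id = session_cache.get(to_user)
    if gateway_id is None:
        return False                          # recipient offline -> Flow 2 takes over
    gateways[gateway_id].append(payload)      # push down the open WebSocket
    return True

delivered = route_message("user_b", {"from": "user_a", "content": "Hello"})
```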
Flow 2: User A sends a message to User B (User B is Offline)
This flow handles the case where the recipient has no internet connection or has the app closed.
- Send and Persist (Same as Flow 1)
- User A sends the message.
- The Message Service saves it to Cassandra as SENT.
- User A receives the single grey tick (Server Confirmation).
- Routing Failure
- The Message Service queries Redis for User B.
- Redis returns NULL or an "Offline" status.
- Queuing for Later
- The Message Service writes a new entry into the User_Unread_Queue (a specific table in Cassandra designed for fast lookups).
- It stores user_id: B and message_id: XYZ.
- External Notification
- Since the WebSocket path is dead, the Message Service calls the Push Notification Service.
- This service sends a payload to Apple (APNS) or Google (FCM) to wake up User B’s phone with a system notification: "New message from User A."
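The offline branch can be sketched the same way. The `unread_queue` dict stands in for the User_Unread_Queue table and `push_log` for the APNS/FCM call; both names are illustrative.

```python
# Minimal sketch of the offline path (Flow 2): queue, then push-notify.
session_cache = {}                       # User B has no live connection
unread_queue = {}                        # user_id -> queued message_ids (Cassandra stand-in)
push_log = []                            # stands in for calls to APNS/FCM

def send_to_user(user_id: str, message_id: str) -> None:
    """If routing fails, queue the message and wake the device."""
    if session_cache.get(user_id) is None:               # Redis returned NULL
        unread_queue.setdefault(user_id, []).append(message_id)
        push_log.append(f"push:{user_id}")               # system notification

send_to_user("user_b", "msg-xyz")
```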
Flow 3: User B comes Online (Synchronization)
This flow explains how User B catches up on missed messages when they open the app.
- Reconnection
- User B opens the app. The client establishes a new WebSocket connection to a random Gateway Node (e.g., Gateway Node 3).
- Gateway Node 3 updates Redis: User B -> Gateway 3.
- Checking the Queue
- Upon connection, Gateway Node 3 (or the Message Service) automatically queries the User_Unread_Queue table in Cassandra for User B.
- Bulk Delivery
- If the queue has messages (e.g., 5 unread messages), the service fetches the full message content from the Messages Table.
- It sends all 5 messages down the WebSocket to User B in a batch.
- User B's client sends ACK: DELIVERED for all of them.
- Cleanup
- Once the specific ACKs are received, the system deletes those entries from the User_Unread_Queue so they aren't redelivered later.
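Flow 3's drain-and-cleanup can be sketched with in-memory stand-ins for the two tables. Note that queue entries are deleted only after the client ACKs, so a crash mid-sync just means a harmless redelivery.

```python
# Minimal sketch of reconnection sync (Flow 3).
unread_queue = {"user_b": ["msg-1", "msg-2", "msg-3"]}       # User_Unread_Queue
messages = {"msg-1": "hi", "msg-2": "there", "msg-3": "!"}   # Messages table

def sync_on_reconnect(user_id: str) -> list:
    """Fetch the full content of every queued message, as one batch."""
    return [messages[mid] for mid in unread_queue.get(user_id, [])]

def on_ack(user_id: str, message_id: str) -> None:
    """Delete a queue entry only once the client ACKs DELIVERED."""
    unread_queue[user_id].remove(message_id)

batch = sync_on_reconnect("user_b")            # pushed down the WebSocket
for mid in list(unread_queue["user_b"]):       # client ACKs each message
    on_ack("user_b", mid)
```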
9. Caching and read performance
- Session Cache (Redis):
- Data: Key: UserID, Value: Gateway_ID.
- Why: This is hit for every message sent. It must be blazing fast (latency < 5ms).
- TTL: Keys expire if the WebSocket connection is closed or heartbeats fail.
- User Profile Cache:
- Data: Display Name, Avatar URL, Status.
- Why: Profile data rarely changes but is read often.
- Chat History:
- We do not cache chat history in Redis. It is too large (Petabytes).
- Cassandra is optimized for sequential reads. When a user opens a chat, we fetch the last 50 messages from Cassandra efficiently using the chat_id.
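"Fetch the last 50 messages" maps onto a single-partition read because all of a chat's rows share one partition key. A toy model of that read path (the dict stands in for a Cassandra partition; names are illustrative):

```python
# One partition per chat_id, rows kept in clustering-key (time) order.
chat_partition = {
    "chat-42": [f"msg-{i}" for i in range(200)],  # oldest -> newest
}

def load_history(chat_id: str, limit: int = 50) -> list:
    """Rough equivalent of:
    SELECT * FROM messages WHERE chat_id = ? ORDER BY message_id DESC LIMIT ?"""
    return chat_partition.get(chat_id, [])[-limit:]

recent = load_history("chat-42")
```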
10. Storage, indexing and media
- Database (Cassandra):
- Used for all text messages and metadata.
- Indexing: The primary key (chat_id, message_id) allows us to query "Get messages for Chat X where time > Y".
- Media Storage (S3 + CDN):
- We never store binary images in Cassandra.
- Flow:
- User uploads image to API Server $\rightarrow$ S3.
- API Server returns a URL.
- User sends a text message containing the URL.
- Receiver's phone sees the URL and downloads the image from a CDN (like CloudFront) to ensure fast download speeds globally.
11. Scaling strategies
- Gateway Scaling:
- To handle 1 Billion users, we need thousands of Gateway servers.
- We use a Load Balancer to route new users to the least busy server.
- If a Gateway crashes, the 100k users on it disconnect and automatically reconnect to a new Gateway.
- Database Partitioning:
- Cassandra partitions data automatically based on the chat_id.
- This ensures even distribution of data across the cluster.
- Group Chat Fan-out:
- For a group with 200 people, the server must generate 200 outgoing messages.
- To prevent blocking the sender, we use a Message Queue (Kafka). The Message Service publishes a "Group Message" event to Kafka. Background workers consume this event, look up the 200 members, and process the delivery for each member in parallel.
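The queued fan-out can be sketched as below, with `queue.Queue` standing in for Kafka and a single worker expanding one group event into per-member deliveries. All names are illustrative.

```python
import queue

# Kafka stand-in: the Message Service publishes ONE event per group
# message and returns immediately; workers do the per-member expansion.
group_members = {"group-7": [f"user-{i}" for i in range(200)]}
events = queue.Queue()
outbox = []                           # (recipient, content) deliveries

def publish_group_message(group_id: str, content: str) -> None:
    """Sender path: enqueue one event; the sender is never blocked."""
    events.put({"group_id": group_id, "content": content})

def fanout_worker() -> None:
    """Worker path: look up members and fan out one delivery each."""
    event = events.get()
    for member in group_members[event["group_id"]]:
        outbox.append((member, event["content"]))

publish_group_message("group-7", "hello group")
fanout_worker()
```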
12. Reliability, failure handling and backpressure
- No Single Point of Failure:
- Gateways, Services, and Databases are all replicated across multiple Availability Zones (AZs).
- Store and Forward:
- We guarantee At-Least-Once delivery. If we are not sure if User B got the message (e.g., network timeout during Ack), we resend it. The client app handles deduplication using the message_id.
- Backpressure:
- If a user is being spammed (e.g., receiving 1000 messages/sec), the Gateway buffer will fill up. The Gateway can stop reading from the socket (TCP flow control), forcing the sender to slow down.
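The at-least-once contract above makes client-side deduplication essential: a sketch of how the client drops resends by `message_id` (names illustrative).

```python
# At-least-once delivery: the server resends any message whose ACK was
# lost, so the client must drop duplicates by message_id.
seen_ids = set()
inbox = []

def on_receive(message_id: str, content: str) -> None:
    if message_id in seen_ids:
        return                    # duplicate resend -> drop silently
    seen_ids.add(message_id)
    inbox.append(content)

on_receive("msg-1", "Hello")
on_receive("msg-1", "Hello")      # server retry after a lost ACK
```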
13. Security, privacy and abuse
- End-to-End Encryption (E2EE):
- This is the gold standard (Signal Protocol).
- Keys are generated on the user's device. The server only holds the Public Key.
- The server processes encrypted blobs (content). It cannot read the messages.
- Authentication:
- Phone number verification via SMS OTP.
- WebSocket connections are authenticated via a secure, short-lived session token.
- Abuse:
- Rate limiting on the "Send Message" API to prevent bots.
- Ability to Block users (the server drops messages from blocked users before delivery).
14. Bottlenecks and next steps
- Bottleneck: Group Fan-out:
- For very large groups (e.g., 10,000 users), the loop-based fan-out is too slow.
- Next Step: Switch to a "Pull" model for large groups, where users poll for the latest messages when they open the group, rather than the server pushing every message.
- Bottleneck: Global Latency:
- A user in Australia messaging a user in the UK might experience lag if the data center is in the US.
- Next Step: Deploy Gateways in multiple regions (Edge locations). Connect these regions via a high-speed private backbone network to reduce latency.
Summary
- Architecture: Stateful Gateways (WebSockets) + Stateless Services.
- Storage: Cassandra for write-heavy chat logs; S3 for media.
- Reliability: The "Unread Queue" ensures messages are never lost while a user is offline.
- Performance: Routing via the Redis Session Cache enables millisecond delivery.