1. Problem Definition and Scope
We are designing the backend for the "Home Timeline" of a Twitter-like social network. This is the primary feed where a user sees a stream of tweets posted by the people they follow, arranged in reverse chronological order (newest first).
- Users: Hundreds of millions of active users.
- Main Actions: Posting tweets, scrolling through the feed, and following other users.
- Scope:
  - In Scope: Posting text/image tweets, following users, generating the home timeline (fan-out), and viewing the timeline.
  - Out of Scope: Search, Direct Messages, Trending Topics, User Profiles, Notifications, and complex algorithmic ranking (we will focus on time-based sorting).
2. Clarify functional requirements
Must Have:
- Post Tweet: Users can post messages (up to 280 chars) with optional images.
- View Home Timeline: Users can view an aggregated list of tweets from followees, sorted by time.
- Pagination: Users can scroll infinitely to load older tweets.
- Follow/Unfollow: Users can subscribe to other users' updates.
- Low Latency: The timeline must load very fast (< 200ms).
Nice to Have:
- Media Support: Efficient upload and serving of images/videos.
- Mentions/Hashtags: Basic storage of metadata (parsing logic is secondary).
3. Clarify non-functional requirements
- Target Users: 300 million Daily Active Users (DAU).
- Read/Write Pattern: Extremely read-heavy. Users view their timeline far more often than they post.
  - Ratio: on the order of 100:1 when counting individual tweet reads (each timeline load fetches dozens of tweets).
- Latency Goals:
  - Reads: < 200ms (critical).
  - Writes: < 500ms. Propagation to followers can be eventually consistent (a few seconds of delay is fine).
- Availability: High (99.99%). It is better to show a slightly stale timeline than an error page (AP system).
- Data Retention: Tweets are permanent. Timeline caches are transient.
4. Back of the envelope estimates
- Traffic Estimates:
  - Writes (Tweets): Average 2 tweets per user/day.
    - 300M x 2 = 600M tweets/day.
    - Average TPS = 600M / 86,400 ≈ 7,000 tweets/sec.
    - Peak TPS ≈ 15,000 tweets/sec (roughly 2x average).
  - Reads (Timeline): Average 20 timeline visits per user/day.
    - 300M x 20 = 6B reads/day.
    - Average QPS = 6B / 86,400 ≈ 70,000 reads/sec.
    - Peak QPS ≈ 150,000 reads/sec.
- Storage Estimates:
  - Text: Assume 300 bytes per tweet (id, text, metadata).
    - 600M x 300 bytes ≈ 180 GB/day.
    - 180 GB x 365 ≈ 65 TB/year (requires database sharding).
  - Media: Assume 20% of tweets have images (avg 500 KB).
    - 120M x 500 KB = 60 TB/day.
    - Requires Object Storage (e.g., S3).
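The estimates above can be written out as checkable arithmetic. This is just the same back-of-the-envelope math in runnable form; the constants (2 tweets/day, 20 visits/day, 300 bytes, 500 KB) are the assumptions stated above.

```python
# Back-of-the-envelope numbers from the estimates above, as checkable arithmetic.
DAU = 300_000_000
SECONDS_PER_DAY = 86_400

tweets_per_day = DAU * 2                       # 600M writes/day
avg_tps = tweets_per_day / SECONDS_PER_DAY     # ~7,000 tweets/sec

reads_per_day = DAU * 20                       # 6B timeline reads/day
avg_qps = reads_per_day / SECONDS_PER_DAY      # ~70,000 reads/sec

text_per_day_gb = tweets_per_day * 300 / 1e9   # ~180 GB/day of tweet text
text_per_year_tb = text_per_day_gb * 365 / 1e3 # ~65 TB/year

media_per_day_tb = tweets_per_day * 0.20 * 500_000 / 1e12  # ~60 TB/day of media
```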
5. API design
We will use a REST API.
- Post Tweet
- POST /v1/tweets
- Params: content (string), media_ids (list of strings).
- Response: 201 Created
{ "tweet_id": "16705050", "timestamp": 1679000000 }
- Get Home Timeline
- GET /v1/timeline/home
- Params: user_id (from auth), limit (default 20), cursor (ID of the last tweet seen).
- Response: 200 OK
  {
    "data": [ { "tweet_id": "...", "content": "...", "author": {...} } ],
    "next_cursor": "16705040"
  }
- Follow User
- POST /v1/users/{target_id}/follow
- Response: 200 OK.
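The cursor parameter can be sketched as follows. This is a minimal in-memory illustration, not the real handler: because Snowflake IDs are time-sortable, "tweets older than the cursor" is simply "tweets with id < cursor".

```python
# Sketch of cursor-based pagination over a newest-first list of tweet IDs.
# The in-memory `timeline` list is a stand-in for the real storage layer.

def get_home_timeline(tweet_ids, limit=20, cursor=None):
    """tweet_ids: the user's timeline, sorted newest-first (descending id)."""
    if cursor is not None:
        tweet_ids = [t for t in tweet_ids if t < cursor]  # strictly older
    page = tweet_ids[:limit]
    next_cursor = page[-1] if len(page) == limit else None  # None = end of feed
    return {"data": page, "next_cursor": next_cursor}

timeline = [110, 108, 105, 104, 102]
first = get_home_timeline(timeline, limit=2)   # → ids 110, 108
second = get_home_timeline(timeline, limit=2, cursor=first["next_cursor"])
```

The client simply echoes back `next_cursor` on the next request; the server never tracks per-client scroll state.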
6. High level architecture
We will use a microservices architecture with a "Fan-out on Write" approach for most users.
Components:
- Client (App/Web): Fetches data and uploads media.
- Load Balancer: Distributes traffic.
- Tweet Service: Handles posting and storage of tweets.
- User Graph Service: Manages who follows whom (SQL DB).
- Timeline Service: Retrieves the pre-computed feed from Redis.
- Fan-out Workers: Asynchronously push new tweet IDs to followers' timeline caches.
- Redis Cache: Stores the "Home Timeline" (list of Tweet IDs) for active users.
- Object Storage (S3): Stores images and videos.
7. Data model
We split data based on access patterns.
1. Tweets (NoSQL - Cassandra or DynamoDB)
- Optimized for high write throughput and simple key-value lookups.
- Table: Tweets
- tweet_id (PK) - Snowflake ID (Time-sortable).
- user_id (Partition Key).
- content (Text).
- media_urls (List).
2. User Graph (SQL - PostgreSQL)
- Relational data is best for graph edges.
- Table: Follows
- follower_id, followee_id.
- Primary Key: (follower_id, followee_id).
- Index: (followee_id, follower_id) for "Who follows X?".
3. Timeline Cache (Redis)
- Key: timeline:{user_id}
- Value: List of tweet_ids (e.g., [105, 104, 102]).
- Stores IDs only, not full content, to save RAM.
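The timeline cache boils down to three Redis list operations: LPUSH (prepend), LTRIM (cap length), and LRANGE (read a page). A minimal in-memory stand-in, with a dict of lists playing the role of Redis:

```python
from collections import defaultdict

# In-memory stand-in for the Redis list operations the timeline cache uses.
# In production these map to LPUSH, LTRIM, and LRANGE on key "timeline:{user_id}".
class TimelineCache:
    def __init__(self, max_len=800):
        self.lists = defaultdict(list)
        self.max_len = max_len

    def push(self, user_id, tweet_id):       # LPUSH + LTRIM 0 {max_len-1}
        key = f"timeline:{user_id}"
        self.lists[key].insert(0, tweet_id)  # newest ID goes to the front
        del self.lists[key][self.max_len:]   # cap memory per user

    def range(self, user_id, start, stop):   # LRANGE start stop (stop inclusive)
        return self.lists[f"timeline:{user_id}"][start:stop + 1]

cache = TimelineCache()
for tid in [102, 104, 105]:
    cache.push(42, tid)
cache.range(42, 0, 1)  # → [105, 104]  (newest first)
```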
8. Core flows end to end
This section describes the "Life of a Tweet."
To handle millions of users, we cannot just query the database every time someone reloads their feed. Instead, we split our system into two distinct paths: a Write Path (Posting) and a Read Path (Viewing).
The core design philosophy is "Fan-out on Write": We do the heavy processing work when the tweet is posted (asynchronously) so that reading the timeline is effortless and instant.
Flow 1: Posting a Tweet (The Write Path)
Goal: Securely save the tweet and return "Success" to the user immediately, while distributing it to followers in the background.
We split this into Synchronous (User waits) and Asynchronous (Background) steps.
Part A: The Synchronous Path (User Experience)
- Media Upload (The "Sidecar" Pattern):
  - If the tweet has an image/video, the client does not send the file to our main API server (which would block threads).
  - Instead, the client requests a Presigned URL from the backend and uploads the binary file directly to S3 (Object Storage). S3 returns a reference key.
- Request:
  - The client sends a POST /tweets request with the text and the S3 media key.
- Persist & Ack:
  - The Tweet Service generates a unique Snowflake ID. (We use Snowflake IDs because they are time-sortable, avoiding the need for a separate created_at index.)
  - It saves the tweet metadata to the Tweet DB (Cassandra).
  - Crucial Step: The service returns 201 Created to the user immediately.
  - Why? The user feels the app is fast (< 200ms). They do not wait for the tweet to appear in their followers' feeds.
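The synchronous path can be sketched as a single handler. This is a simplified illustration: `db` and `events` are plain-Python stand-ins for Cassandra and Kafka, and the Snowflake generator is reduced to "timestamp bits plus a sequence".

```python
import itertools
import time

# Minimal sketch of the synchronous post path: validate, persist, ack.
# `db` (dict) and `events` (list) stand in for Cassandra and the Kafka topic.
_seq = itertools.count()

def next_snowflake_id(now_ms=None):
    """Simplified time-sortable ID: millisecond timestamp in the high bits."""
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    return (now_ms << 22) | (next(_seq) & 0x3FFFFF)

def post_tweet(db, events, user_id, content, media_key=None):
    if len(content) > 280:
        return {"status": 400, "error": "content too long"}
    tweet_id = next_snowflake_id()
    db[tweet_id] = {"user_id": user_id, "content": content, "media": media_key}
    events.append({"type": "NewTweet", "tweet_id": tweet_id, "author_id": user_id})
    return {"status": 201, "tweet_id": tweet_id}  # ack before fan-out happens

db, events = {}, []
resp = post_tweet(db, events, user_id=1, content="hello")
```

Note that the event is enqueued and the 201 returned without waiting for any follower's timeline to be updated; that is the whole point of Part B.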
Part B: The Asynchronous Path (Fan-out)
- Event Stream:
  - After saving, the Tweet Service pushes an event (NewTweet: {tweet_id, author_id}) to a Kafka topic.
- Fan-out Workers (The Heavy Lifting):
  - Workers consume the message and query the User Graph Service to find the author's followers.
  - The worker iterates through the followers and pushes the new tweet_id into the Redis Timeline List of each follower.
  - Redis Operation: LPUSH timeline:{follower_id} {tweet_id}
Mental Model: Imagine the author drops a letter in a mailbox (the DB). A team of workers (fan-out) then copies that letter's reference and places a copy into the private mailbox (Redis List) of every person following the author.
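A fan-out worker's core step can be sketched as follows. This is an illustration under stated assumptions: `get_followers` stands in for a User Graph Service call, and a dict of lists stands in for Redis.

```python
# Sketch of one fan-out step: one NewTweet event becomes one LPUSH per follower.
# `get_followers` and `timelines` stand in for the User Graph Service and Redis.

def fan_out(event, get_followers, timelines, max_len=800):
    tweet_id = event["tweet_id"]
    for follower_id in get_followers(event["author_id"]):
        lst = timelines.setdefault(f"timeline:{follower_id}", [])
        lst.insert(0, tweet_id)  # LPUSH timeline:{follower_id} {tweet_id}
        del lst[max_len:]        # LTRIM: cap each cached timeline's length

timelines = {}
fan_out({"tweet_id": 105, "author_id": 7},
        get_followers=lambda author_id: [1, 2, 3],
        timelines=timelines)
# timelines["timeline:1"] == [105], and likewise for followers 2 and 3
```

In production this loop is batched (pipelined Redis writes, followers fetched in pages), but the shape is the same.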
Flow 2: Reading the Home Timeline (The Read Path)
Goal: Load the feed in under 200ms.
Since we pre-computed the lists in Flow 1, the Read Flow is primarily about fetching and formatting. We use a pattern called "Hydration."
- Retrieve IDs (The Skeleton):
  - User sends GET /timeline.
  - The Timeline Service fetches the pre-computed list of tweet IDs from the user's Redis cache (timeline:{user_id}).
  - Result: A list of IDs, e.g., [105, 104, 102].
- Hydration (The Meat):
  - The app cannot display bare IDs. It needs text, handles, and avatars.
  - The service performs a Multi-Get request against the Tweet Object Cache (Memcached) to fetch the content for these IDs in parallel.
  - Result: Returns {id: 105, content: "Hello", author: "..."}.
- Response:
- The service assembles the final JSON and returns it to the client.
Why separate IDs and Content? If a viral tweet appears in 1 million timelines, storing the full text 1 million times in Redis would waste massive amounts of RAM. By storing the ID (8 bytes) in the Redis Lists and the Content (300 bytes) once in the Object Cache, we save significant memory. This is called Normalization.
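Hydration reduces to a multi-get plus reassembly in the original ID order. A minimal sketch, with a dict standing in for Memcached:

```python
# Hydration sketch: turn the ID skeleton from Redis into full tweet objects
# via a multi-get against the object cache (a dict stands in for Memcached).

def hydrate(tweet_ids, object_cache):
    objects = {tid: object_cache.get(tid) for tid in tweet_ids}  # multi-get
    # Skip IDs whose objects were evicted; a real system would fall back to
    # the Tweet DB for the misses and backfill the cache.
    return [objects[tid] for tid in tweet_ids if objects[tid] is not None]

object_cache = {
    105: {"id": 105, "content": "Hello", "author": "alice"},
    104: {"id": 104, "content": "Hi", "author": "bob"},
}
feed = hydrate([105, 104, 102], object_cache)  # 102 was evicted → 2 results
```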
9. Caching and read performance
Since the system is read-heavy (100:1), we rely heavily on caching.
- Timeline Cache (Redis):
- Stores Tweet IDs only.
- Size: Keep only the last 800 IDs per user. If a user scrolls deeper, we query the DB (slower path).
- TTL: Expire cache if user is inactive for 14 days. Rebuild on login.
- Tweet Object Cache (Memcached/Redis):
- Stores tweet_id -> { content, author, media_url }.
- Since viral tweets appear in millions of timelines, this cache reduces DB load significantly.
10. Storage, indexing and media
- ID Generation (Snowflake):
- We use 64-bit IDs that contain a timestamp.
- This allows us to sort tweets by ID naturally (ORDER BY id DESC), avoiding expensive sorting on a separate created_at column.
- Media Storage:
- Images/Videos go to S3.
- CDN: All media is served via a CDN (CloudFront) so users fetch data from a server near them, reducing latency.
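The time-sortable property of Snowflake IDs can be demonstrated with a simplified encoder/decoder. The 41-bit-timestamp layout and Twitter's custom epoch are the standard published scheme; the machine/sequence split below is illustrative.

```python
# Simplified Snowflake layout: 41-bit millisecond timestamp, then machine and
# sequence bits. Because the timestamp occupies the high bits, sorting by ID
# sorts by creation time.
EPOCH_MS = 1288834974657  # Twitter's custom epoch (Nov 2010)

def make_snowflake(ts_ms, machine_id=0, seq=0):
    return ((ts_ms - EPOCH_MS) << 22) | (machine_id << 12) | seq

def snowflake_timestamp_ms(snowflake_id):
    return (snowflake_id >> 22) + EPOCH_MS

a = make_snowflake(1_679_000_000_000)
b = make_snowflake(1_679_000_000_001)
# b > a: later tweets always get larger IDs, so ORDER BY id DESC is newest-first
```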
11. Scaling strategies
The "Celebrity" Problem (Hot Partition)
The "Push" model breaks for celebrities (e.g., a user with 100M followers). A single tweet would require 100M Redis updates, causing massive write amplification and fan-out lag.
Solution: Hybrid Model (Push + Pull)
- Normal Users (Push): If an author has < 100k followers, push their tweets to followers' caches (standard fan-out).
- Celebrities (Pull): If an author has > 100k followers, do not push. Save to the DB only.
- Read Time: When a user loads their timeline:
  - Fetch the pre-computed list (normal friends) from Redis.
  - Fetch recent tweets from the "Celebrities" they follow (pull) from a separate cache/DB.
  - Merge the two lists in memory by timestamp.
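The read-time merge is a standard merge of two already-sorted streams. A sketch, assuming both inputs arrive as (timestamp, tweet_id) pairs sorted newest-first:

```python
import heapq
import itertools

# Read-time merge for the hybrid model: the pre-computed Redis list and the
# pulled celebrity tweets are each newest-first; the merge preserves that order.
def merge_timeline(pushed, pulled, limit=20):
    """Each input: list of (timestamp, tweet_id), sorted newest-first."""
    merged = heapq.merge(pushed, pulled, key=lambda t: t[0], reverse=True)
    return [tweet_id for _, tweet_id in itertools.islice(merged, limit)]

friends = [(1_000_030, 105), (1_000_010, 102)]  # from the Redis push list
celebs  = [(1_000_020, 104)]                    # pulled at read time
merged = merge_timeline(friends, celebs)        # → [105, 104, 102]
```

`heapq.merge` never materializes the full combined list, which matters when the pushed list holds hundreds of IDs and only one page is needed.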
Sharding:
- Tweet DB: Sharded by tweet_id (Snowflake) to spread write load.
- User Graph: Sharded by follower_id.
12. Reliability, failure handling and backpressure
- Message Queues: Kafka decouples posting from fan-out. If workers fall behind (backpressure), the API remains fast and tweets simply appear slightly later.
- Redundancy: Redis Cluster and database replicas (leader-follower). If a node fails, a replica takes over.
- Circuit Breakers: If the Tweet Service or DB is down, the Timeline Service can return a degraded response (e.g., "Timeline currently unavailable") instead of crashing the app.
13. Security, privacy and abuse
- Authentication: OAuth 2.0 / JWT.
- Rate Limiting: Use Redis to count requests. Limit users to X tweets/min to prevent spam bots.
- Privacy: Before pushing a tweet to a follower's cache, the worker must check whether the author is "Private" and whether the follow relationship is valid.
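The rate limiter can be a fixed-window counter. A minimal sketch: in production the counter would live in Redis (INCR plus EXPIRE on a per-user, per-window key); here a dict stands in and the clock is injected for clarity.

```python
# Fixed-window rate limiter sketch. In Redis this is INCR + EXPIRE on a key
# like "rate:{user_id}:{window}"; a plain dict stands in here.

def allow_request(counters, user_id, now_s, limit=5, window_s=60):
    key = (user_id, now_s // window_s)  # one counter per user per time window
    counters[key] = counters.get(key, 0) + 1
    return counters[key] <= limit

counters = {}
results = [allow_request(counters, "u1", now_s=100, limit=3) for _ in range(4)]
# → [True, True, True, False]: the fourth request in the window is rejected
```

Fixed windows allow short bursts at window boundaries; a sliding-window or token-bucket variant smooths that out at the cost of slightly more state.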
14. Bottlenecks and next steps
- Bottleneck: Cost of RAM. Storing timeline lists for 300M users is expensive.
- Next Step: Optimize by only caching for active users (e.g., logged in last 3 days).
- Bottleneck: Feed Relevance. As users follow more people, a pure chronological feed becomes noisy.
- Next Step: Introduce a Ranking Service. Instead of just merging lists by time, we pass the candidates to a Machine Learning model to score and re-order them based on relevance.
Summary
We designed a read-optimized system using Redis Timelines and Async Fan-out.
We handled scale using a Hybrid Push/Pull architecture and Snowflake IDs for efficient sorting.