1. Problem statement and scope
We are designing a cloud-based file storage and synchronization service similar to Dropbox or Google Drive.
Users want to store files in a designated local folder that automatically backs up to the cloud and synchronizes those exact files across all their other devices (like a laptop, desktop, and phone).
The main user groups are individuals who care about having reliable, anywhere-access to their files without worrying about manual backups.
The main actions they care about are adding files, updating them, and seeing those changes reflect instantly everywhere.
To keep the scope manageable, we will focus strictly on the core file upload, download, and real-time synchronization engine for individual users.
We will not cover complex enterprise permission models, real-time collaborative document editing (like Google Docs), or billing systems.
2. Clarify functional requirements
Must have:
- User can upload a file from any device to the cloud.
- User can download a file from the cloud to their device.
- Files automatically synchronize across all of a user’s logged-in devices in near real-time.
- System handles large files efficiently by only uploading the changed parts of a file (delta sync) and resuming interrupted transfers.
Nice to have:
- File versioning so users can restore a previous state of a modified file.
- Offline support, allowing users to make local file changes that automatically sync when the internet reconnects.
3. Clarify non-functional requirements
- Target users: 50 million Daily Active Users (DAU).
- Read and write pattern: Read-heavy overall. Users typically upload or edit a file once, which triggers automatic sync downloads on their other connected devices. Let's assume a 3:1 read-to-write ratio.
- Latency goals: Metadata updates (like renaming a file or moving a folder) should feel instant and reflect in under 500ms. File transfer latency is bound by the user's network speed, but our system should maximize their available bandwidth.
- Availability targets: High availability, targeting 99.99% ("four nines") uptime for the core sync operations to ensure users can always access their backups.
- Consistency preference: Strong consistency is strictly required for metadata (folder structures, file names). If a user renames a folder on their phone, their laptop must see the exact same state to avoid corrupted directories. Eventual consistency is acceptable for the actual file bytes.
- Data retention: Files are kept indefinitely until explicitly deleted by the user. Deleted files and older versions are retained for 30 days.
4. Back-of-the-envelope estimates
Let’s do some quick math to size our system.
- Estimate QPS for main operations:
- Assume 50 million DAU, and each active user uploads or modifies 4 files per day on average.
- Write QPS: 50M * 4 = 200 million writes/day. 200M / 86,400 seconds ≈ 2,300 write QPS (average).
- Read QPS: With a 3:1 ratio, we expect 600 million reads/day. 600M / 86,400 seconds ≈ 6,900 read QPS (average).
- Estimate peak traffic versus average:
- Assume peak traffic is double the average. Peak Write QPS ≈ 4,600. Peak Read QPS ≈ 13,800.
- Estimate needed storage per day and per year:
- Assume the average file size (or size of a delta update) is 2 MB.
- Storage per day: 200 million files * 2 MB = 400 Terabytes (TB) of new storage per day.
- Storage per year: 400 TB * 365 days = ~146 Petabytes (PB) per year.
- Estimate rough bandwidth needs:
- Ingress (Writes): 400 TB/day / 86,400 = ~4.6 Gigabytes/second (GB/s).
- Egress (Reads): 1,200 TB/day / 86,400 = ~13.8 GB/s.
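The arithmetic above can be checked with a short script; all inputs are the assumptions stated in this section.

```python
# Back-of-the-envelope estimator using the assumptions from this section.
DAU = 50_000_000
WRITES_PER_USER_PER_DAY = 4
READ_TO_WRITE_RATIO = 3
AVG_FILE_SIZE_MB = 2
SECONDS_PER_DAY = 86_400

writes_per_day = DAU * WRITES_PER_USER_PER_DAY        # 200 million
reads_per_day = writes_per_day * READ_TO_WRITE_RATIO  # 600 million

avg_write_qps = writes_per_day / SECONDS_PER_DAY      # ~2,300
avg_read_qps = reads_per_day / SECONDS_PER_DAY        # ~6,900
peak_write_qps = avg_write_qps * 2                    # ~4,600
peak_read_qps = avg_read_qps * 2                      # ~13,800

new_storage_tb_per_day = writes_per_day * AVG_FILE_SIZE_MB / 1_000_000  # 400 TB
storage_pb_per_year = new_storage_tb_per_day * 365 / 1_000              # ~146 PB

ingress_gb_per_s = new_storage_tb_per_day * 1_000 / SECONDS_PER_DAY     # ~4.6 GB/s
egress_gb_per_s = ingress_gb_per_s * READ_TO_WRITE_RATIO                # ~13.8 GB/s
```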
5. API design
We will use a REST API over HTTPS for standard metadata operations, and WebSockets for real-time sync notifications.
To handle large files efficiently, files are split into 4MB chunks (blocks) on the client before interacting with the API.
- POST /v1/files/upload_request
  - Request parameters: parent_folder_id (string), file_name (string), file_size (int), chunk_hashes (list of strings).
  - Response body: missing_chunk_hashes (list of strings), upload_urls (list of pre-signed S3 URLs).
  - Main error cases: 400 Bad Request, 401 Unauthorized.
- POST /v1/files/commit
  - Request parameters: file_id, version_hash, chunk_hashes.
  - Response body: success (boolean), new_version_id (string).
  - Main error cases: 409 Conflict (if another device modified the file at the same time).
- GET /v1/files/metadata
  - Request parameters: file_id or folder_id.
  - Response body: file_id, latest_version, size, chunk_hashes.
  - Main error cases: 404 Not Found.
- WebSocket /v1/sync/notifications
  - A persistent connection where the server pushes JSON events. Example: { "event": "file_updated", "file_path": "/docs/resume.pdf" }
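The client-side chunking that precedes every API call can be sketched in a few lines, assuming 4MB blocks and SHA-256 as described above:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB blocks, as described above

def chunk_and_hash(data: bytes) -> list[tuple[str, bytes]]:
    """Split a file's bytes into 4 MB chunks and compute a SHA-256 hash per chunk."""
    chunks = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunks.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return chunks

# A 10 MB file yields two 4 MB chunks and one 2 MB chunk.
file_bytes = b"x" * (10 * 1024 * 1024)
chunks = chunk_and_hash(file_bytes)
print([len(c) for _, c in chunks])  # [4194304, 4194304, 2097152]
```

Because identical bytes always produce an identical hash, these hashes double as the deduplication keys used by /upload_request.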
6. High-level architecture
To handle this efficiently, we separate the heavy lifting of moving bytes (files) from the complex logic of managing folder structures (metadata).
- Client -> Load balancer -> API Gateway -> Metadata Servers -> Cache -> SQL Database
- Client -> Object Storage (direct upload/download)
- Metadata Servers -> Message Queue -> Notification Servers -> Client (WebSockets)
- Clients: Desktop, mobile, and web apps. They monitor local folders, split files into chunks, calculate hashes, and talk to our backend.
- Load balancers: Distribute incoming HTTP requests and WebSocket connections evenly across our servers.
- API Gateway: Handles user authentication, rate limiting, and routes requests to the correct internal service.
- Metadata Servers: Application servers that manage the folder hierarchy, file names, and versions. They read from and write to the SQL database.
- Notification Servers: Maintain long-lived WebSocket connections with clients to instantly push “sync needed” alerts.
- Cache (Redis): Stores frequently accessed folder contents and user profiles to reduce database read load.
- SQL Database: Stores the file system hierarchy, user data, and file-to-chunk mappings.
- Message Queue (Kafka): Decouples the Metadata servers from the Notification servers to process events asynchronously.
- Object storage (AWS S3): Massive, highly scalable storage for the actual binary file chunks. Clients read and write directly to it.
7. Data model
We will use a SQL (Relational) Database (like PostgreSQL) for the metadata. File systems are inherently hierarchical (folders inside folders).
Operations like renaming or moving a folder require strict ACID guarantees.
If two devices try to rename a folder at the exact same millisecond, a relational database uses transactions and locks to prevent data corruption.
Main Tables:
- Workspaces: workspace_id (PK), user_id (Index). Represents a user's root drive.
- Files: file_id (PK), workspace_id (Index), parent_id (Index), name, is_folder, latest_version_id.
- File_Versions: version_id (PK), file_id, updated_at.
- Chunks: chunk_hash (PK), s3_path, size. Allows global deduplication across the whole system.
- Version_Chunks: version_id (Index), chunk_hash (Index), chunk_order (int). Maps a specific file version to its specific chunks.
Media Storage: The actual file bytes are never stored in the SQL database. They are split into 4MB chunks, hashed, and stored in Object Storage.
The SQL database only stores the pointers (chunk_hash and s3_path) to support our core queries of assembling a file.
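The core query — assemble a file version from its ordered chunk pointers — can be sketched with in-memory stand-ins for the Chunks and Version_Chunks tables (the table contents here are hypothetical):

```python
# Hypothetical stand-ins for the Chunks and Version_Chunks tables.
chunks_table = {  # chunk_hash -> s3_path
    "aaa": "s3://chunks/aaa",
    "bbb": "s3://chunks/bbb",
    "ccc": "s3://chunks/ccc",
}
version_chunks_table = [  # (version_id, chunk_hash, chunk_order)
    ("v42", "bbb", 1),
    ("v42", "aaa", 0),
    ("v42", "ccc", 2),
]

def s3_paths_for_version(version_id: str) -> list[str]:
    """Return the S3 paths for a version's chunks, in chunk_order."""
    rows = [r for r in version_chunks_table if r[0] == version_id]
    rows.sort(key=lambda r: r[2])  # ORDER BY chunk_order
    return [chunks_table[r[1]] for r in rows]

print(s3_paths_for_version("v42"))
# ['s3://chunks/aaa', 's3://chunks/bbb', 's3://chunks/ccc']
```

In production this is a two-table join on the sharded SQL database, but the shape of the lookup is the same.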
8. Core flows end to end
Here is exactly how the system behaves for the most critical features, broken down step by step.
Flow 1: Uploading a new file (with deduplication)
- Client chunking: The user drops a 10MB file into their local sync folder. The client app silently splits this file into three chunks (two 4MB chunks and one 2MB chunk) and calculates a SHA-256 hash for each.
- Upload request: The client calls the /upload_request API with the three chunk hashes.
- Deduplication: The Metadata server checks the SQL database and replies: "I already have chunks 1 and 2 in the system; just send me chunk 3." (This happens if the user copied a file, or if another user uploaded the exact same public file.)
- Direct upload: The server provides a pre-signed S3 URL for chunk 3. The client uploads only chunk 3 directly to Object Storage. This bypasses our app servers entirely, saving massive bandwidth. Meanwhile, the user sees a “Syncing...” icon on their desktop.
- Commit: Once the upload is successful, the client calls the /commit API. The database creates a new file version and links it to all three chunk hashes. The user sees a green "Synced" checkmark.
- Async trigger: The Metadata server drops a "File Updated" event into the Message Queue to be processed asynchronously.
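The deduplication handshake at the heart of this flow can be sketched as follows, with a hypothetical in-memory set standing in for the server's chunk index:

```python
# Flow 1 sketch: the server reports which chunks the client must actually send.
known_chunks = {"hash1", "hash2"}                 # chunks already in object storage
local_chunk_hashes = ["hash1", "hash2", "hash3"]  # from the client's chunking step

def upload_request(chunk_hashes: list[str]) -> list[str]:
    """Server side of /upload_request: return only the missing chunk hashes."""
    return [h for h in chunk_hashes if h not in known_chunks]

missing = upload_request(local_chunk_hashes)
print(missing)  # ['hash3'] — only one chunk travels over the network

for h in missing:          # client uploads the missing chunk directly to storage
    known_chunks.add(h)

# /commit then links the new version to all three hashes, old and new alike.
new_version = {"version_id": "v2", "chunk_hashes": local_chunk_hashes}
```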
Flow 2: Syncing updates to a second device (Delta sync)
- Listening: The user’s laptop has an open WebSocket connection to a Notification server.
- Notification: The Notification server consumes the “File Updated” event from the Message Queue and sends a tiny ping to the laptop: “File X has changed.”
- Fetching metadata: The laptop calls the API server to get the new metadata for File X. It looks at the chunk hashes and compares them to its local version. It realizes it already has chunks 1 and 2 locally, but needs chunk 3 (the delta).
- Delta download: The laptop requests a download URL for chunk 3 and fetches it directly from Object Storage.
- Reassembly: Once downloaded, the laptop stitches chunks 1, 2, and 3 together locally to reconstruct the 10MB file and saves it to the disk.
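The delta comparison and reassembly steps above amount to a set difference followed by an ordered join; a minimal sketch with hypothetical chunk hashes:

```python
# Flow 2 sketch: download only the chunks the laptop is missing, then reassemble.
local_hashes = {"c1", "c2"}          # what the laptop already has on disk
remote_hashes = ["c1", "c2", "c3"]   # latest version's chunks, in chunk_order

to_download = [h for h in remote_hashes if h not in local_hashes]
print(to_download)  # ['c3'] — only the delta travels over the network

# Reassembly: stitch chunks back together in chunk_order.
chunk_store = {"c1": b"AAAA", "c2": b"BBBB", "c3": b"CC"}  # local cache + downloads
rebuilt = b"".join(chunk_store[h] for h in remote_hashes)
```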
Flow 3: Handling offline edits and conflicts
- Offline work: A user edits a text document while on an airplane with no Wi-Fi. The client queues all changes locally.
- Reconnecting: When internet access is restored, the client sends its queued /upload_request and /commit calls to the Metadata server.
- Conflict detection: The server checks the SQL database. If the file's latest_version moved forward while the user was offline (meaning another device changed it), the server rejects the commit and returns a 409 Conflict error.
- Resolution: The client handles this gracefully. It renames the local offline version to "Document (Conflicted copy).txt", uploads it as a brand-new file, and then downloads the server's latest version of the original "Document.txt". No data is lost.
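The conflict check is classic optimistic concurrency control; a minimal sketch, assuming the client sends the version it last saw (a hypothetical base_version field) with its commit:

```python
# Flow 3 sketch: reject a commit whose base version is stale.
server_state = {"doc.txt": "v7"}  # latest_version per file, as in the Files table

def commit(file_name: str, base_version: str, new_version: str) -> int:
    """Return an HTTP-style status code: 200 on success, 409 on conflict."""
    if server_state[file_name] != base_version:
        return 409  # another device moved the file forward in the meantime
    server_state[file_name] = new_version
    return 200

print(commit("doc.txt", base_version="v7", new_version="v8"))  # 200
print(commit("doc.txt", base_version="v7", new_version="v9"))  # 409: stale base
```

On a 409, the client creates the "Conflicted copy" rather than overwriting either side.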
9. Caching and read performance
- What we cache: User sessions, permissions, and recent folder contents (e.g., a list of file metadata objects inside a specific folder). We do not cache actual file bytes, as Object Storage is already optimized for this.
- Where it sits: A Redis cluster placed between the Metadata servers and the SQL database.
- Cache keys and values: A key like
workspace:{id}:folder:{id}:contents. The value is a JSON array of the files inside that folder. - Update and invalidation rules: We use a cache-aside strategy with strict invalidation. Whenever a file is added, renamed, or deleted, the Metadata server updates the database and immediately deletes the corresponding folder’s key from the cache. The next client read will fetch fresh data from the DB and re-cache it.
- Eviction policy: We use LRU (Least Recently Used). Users typically interact with a small set of active files, so old folders from years ago can safely fall out of the cache to save memory.
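The cache-aside pattern with delete-on-write invalidation can be sketched with plain dicts standing in for Redis and the SQL database:

```python
# Cache-aside sketch: dicts stand in for Redis and the SQL database.
db = {"folder:1": ["a.txt", "b.txt"]}   # source of truth
cache: dict[str, list[str]] = {}        # Redis stand-in

def read_folder(folder_id: str) -> list[str]:
    key = f"folder:{folder_id}"
    if key in cache:
        return cache[key]               # cache hit
    contents = db[key]                  # cache miss: read the DB...
    cache[key] = contents               # ...then re-cache for the next reader
    return contents

def add_file(folder_id: str, name: str) -> None:
    key = f"folder:{folder_id}"
    db[key] = db[key] + [name]          # write the DB first
    cache.pop(key, None)                # then delete the key, never patch it

read_folder("1")                        # miss, then cached
add_file("1", "c.txt")                  # write invalidates the stale entry
print(read_folder("1"))                 # fresh from the DB: ['a.txt', 'b.txt', 'c.txt']
```

Deleting the key instead of updating it in place avoids races where a concurrent writer leaves a stale value behind.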
10. Storage, indexing, and media
- Primary data storage: Our PostgreSQL database handles all metadata.
- Indexes: The heaviest queries are "give me all files in this folder." By indexing workspace_id and parent_id, the database can return directory structures in milliseconds.
- Media storage: Storing file chunks in Object Storage (like AWS S3) is the secret sauce of this design. It is highly durable, and chunking lets clients resume an upload exactly where they left off if a connection drops.
- Serving media with a CDN: We generally do not use a CDN for private user files. CDNs are pull-based and cache assets at the edge, which is great for public viral videos, but useless (and a security risk) for private tax documents accessed by only one person. Clients pull chunks directly from Object Storage.
- Trade-offs: Chunking files saves massive upload bandwidth and storage costs (via deduplication), but it increases the complexity of our database, as every file now requires multiple rows mapping it to its chunks.
11. Scaling Strategies
- First simple version: A single monolithic API server, one SQL database, and local disk storage can handle a few thousand users.
- Database replication: As read traffic (sync checks) grows, we add Read Replicas to our SQL database. Writes go to the primary node, while reads go to the replicas.
- Sharding strategy: Eventually, a single primary database cannot handle 4,600+ complex write QPS. We will shard the database by workspace_id. This ensures all files, folders, and versions for a specific user live on the same database shard, keeping folder operations fast and ACID-compliant.
- Routing to shards: We use a lightweight metadata lookup service (or a consistent hashing ring on the API Gateway) that tells the server, "User 123's metadata lives on Database Shard #7."
- Separating read and write paths: We deploy separate auto-scaling groups for Metadata servers versus Notification servers, as WebSockets require high memory for open connections, while Metadata servers require high CPU.
- Queues and worker pools: We use message queues and background workers to handle bursts of heavy work, such as generating image thumbnails or permanently deleting garbage chunks from Object Storage when files are permanently deleted.
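The simplest form of the shard routing described above is hash-mod placement; a sketch assuming a hypothetical fixed pool of 8 shards (a real deployment would use the lookup service or consistent hashing so shards can be added without remapping every workspace):

```python
import hashlib

NUM_SHARDS = 8  # hypothetical fixed shard count for illustration

def shard_for_workspace(workspace_id: str) -> int:
    """Deterministically map a workspace to one of the metadata DB shards."""
    digest = hashlib.sha256(workspace_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every query for one workspace lands on the same shard, so folder
# operations stay single-shard and fully transactional.
assert shard_for_workspace("user-123") == shard_for_workspace("user-123")
```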
12. Reliability, failure handling, and backpressure
- Removing single points of failure: Every layer (Gateways, API servers, Caches, DBs) is horizontally scaled across multiple Availability Zones (AZs). If one data center loses power, traffic routes to another.
- Timeouts, retries, and idempotency: Mobile networks are flaky. If a chunk upload fails, the client retries with exponential backoff (waiting 1s, 2s, 4s, 8s). Because chunks are identified by their secure hash, the upload is an idempotent operation—uploading the same chunk twice by accident is perfectly safe and will not duplicate data.
- Circuit breakers: If Object Storage experiences an outage and latency spikes, our Metadata servers will trip a circuit breaker. Instead of keeping connections open and crashing the whole backend, the API immediately returns a 503 Service Unavailable to the client until storage recovers.
- Overload behavior: Under extreme load (e.g., millions of clients reconnecting after an ISP outage), the API Gateway applies rate limiting (returning 429 Too Many Requests). Notification servers shed non-essential work, dropping WebSockets and forcing clients to gracefully degrade to HTTP long-polling every few minutes.
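The retry-with-backoff behavior can be sketched as a small client loop; put_chunk is a hypothetical upload callable that raises ConnectionError on network failure, and the retries are safe precisely because chunk uploads are idempotent:

```python
import random
import time

def upload_with_backoff(put_chunk, chunk_hash: str, data: bytes,
                        max_attempts: int = 5, base_delay: float = 1.0) -> bool:
    """Retry a chunk upload with exponential backoff (1s, 2s, 4s, 8s...) plus jitter."""
    for attempt in range(max_attempts):
        try:
            put_chunk(chunk_hash, data)
            return True
        except ConnectionError:
            if attempt == max_attempts - 1:
                return False  # give up; the sync engine will retry later
            # Exponential backoff with jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
    return False
```

Jitter matters at this scale: without it, millions of clients recovering from the same outage would retry in lockstep.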
13. Security, privacy, and abuse
- Authentication and authorization: Clients use OAuth 2.0 to obtain short-lived access tokens. Every API call verifies the token and checks that the user has permission to access that workspace_id.
- Encryption: All data in transit uses TLS (HTTPS/WSS). All file chunks in Object Storage are encrypted at rest using AES-256 via a Key Management Service (KMS).
- Handling sensitive data: Pre-signed S3 URLs used for uploading and downloading chunks must expire very quickly (e.g., within 10 minutes) so they cannot be leaked or reused maliciously.
- Rate limiting and abuse: We enforce hard storage limits at the database level so free-tier users cannot upload infinite data. Background workers scan newly uploaded chunks against known malware databases to prevent virus distribution.
- Privacy considerations: Because we use global data deduplication, we must ensure a malicious user cannot simply guess a hash and download a file they don’t own. The API must always verify the user has a database record linking them to that chunk hash before providing a download URL.
14. Bottlenecks and next steps
- Bottleneck: Managing millions of open WebSocket connections. Keeping WebSockets open for 50 million DAU requires massive server memory and OS file descriptors.
- Next step: Optimize the Notification server tier by writing it in a highly concurrent language like Go or Erlang. Strip out all business logic from this layer; it should do nothing but hold lightweight connections and pass tiny pings.
- Bottleneck: The chunk mapping tables growing infinitely. Storing multiple rows per file version means the Version_Chunks table will quickly grow to billions of rows, slowing down database inserts.
- Next step: Implement an aggressive background archiving job. Move the metadata of deleted files and 30-day-old versions out of the hot SQL database and into a cheaper, slower cold storage tier (such as S3 Glacier).
- Bottleneck: Very large shared folders. Sharding by workspace_id is great, but a massive enterprise account with millions of files might overload a single database shard.
- Next step: Transition highly active shared folders to their own dedicated database shards (sharding by folder_id), or migrate to a globally distributed NewSQL database (like Google Spanner) that automatically manages hot spots.
Summary of the design:
- Chunking is the secret sauce: Splitting files into 4MB blocks saves massive bandwidth and storage via deduplication, delta syncs, and effortless resumption of broken uploads.
- Separation of concerns: We keep heavy binary file bytes strictly in Object Storage and lightweight structure in a SQL database.
- Strong metadata consistency: Relational databases sharded by user guarantee that users do not suffer data corruption during concurrent syncs.
- Real-time sync: Decoupling uploads from notifications via WebSockets and message queues ensures a seamless, instant experience across multiple devices.