Here is a step-by-step system design for a ride-hailing platform like Uber or Lyft.
1. Problem Definition and Scope
We are designing a two-sided marketplace that connects Riders (demand) with Drivers (supply) in real-time. The system must track driver locations, match riders to nearby drivers, and manage the trip lifecycle.
- Main User Groups:
  - Riders: Request rides, track vehicle arrival, and pay.
  - Drivers: Share location, receive ride offers, and navigate.
- Scope:
  - We will focus on the Dispatch System (Matching) and the Location System (Tracking).
  - We will cover the flow from "Requesting a Ride" to "Trip Completion."
  - Out of Scope: Shared rides (Pool), food delivery (Eats), user registration, and complex payment gateways (we assume a generic payment service).
2. Clarify functional requirements
Must Have:
- Driver Location Updates: Drivers send GPS coordinates every few seconds.
- Nearby Drivers: Riders can see available cars on a map before requesting.
- Request Ride: Riders can enter a destination and book a trip.
- Matching (Dispatch): The system finds the best available driver and offers the ride.
- Trip Management: Drivers can accept/decline rides; the system tracks status (Arrived, In Progress, Ended).
- Fare Calculation: Price is estimated based on distance and time.
Nice to Have:
- Surge Pricing: Dynamic pricing during high demand.
- Ride History: Viewing past trip details.
- Scheduled Rides: Booking in advance.
3. Clarify non-functional requirements
- Scale:
- 10 Million Daily Active Riders.
- 1 Million Active Drivers online at peak.
- Latency:
- Matching: < 2 seconds to find a driver.
- Map Updates: Real-time feel (driver icon moves within seconds of real life).
- Availability: 99.99%. Downtime causes immediate financial loss and safety issues.
- Consistency:
- Strong Consistency: Required for Trip State (a ride cannot be double-booked).
- Eventual Consistency: Acceptable for Location Data (if the map view is 2 seconds behind, it's acceptable).
4. Back of the envelope estimates
Traffic (QPS):
- Location Updates (Write Heavy):
- 1 million drivers.
- Update every 4 seconds.
- $1,000,000 / 4 = 250,000$ writes per second.
- Insight: This is too high for a standard SQL database; we need a specialized throughput strategy.
- Ride Requests:
- 10 million rides / day.
- $10,000,000 / 86,400 \approx 115$ requests/sec avg.
- Peak (5x) $\approx 600$ requests/sec.
- Insight: Manageable by standard web servers.
Storage:
- Trip Data: 1KB per trip. $10\text{M} \times 1\text{KB} = 10\text{GB/day}$.
- Location History: If we archive all GPS points for analytics:
  - 250k QPS $\times$ 100 bytes $\approx$ 25 MB/s.
  - $25 \text{ MB/s} \times 86,400 \approx 2.1 \text{ TB/day}$.
  - We need "Cold Storage" (like S3) for this history.
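These estimates are quick to sanity-check with Python as a calculator:

```python
# Sanity-check the back-of-the-envelope numbers above with plain arithmetic.
SECONDS_PER_DAY = 86_400

# Location writes: 1M drivers, one update every 4 seconds.
write_qps = 1_000_000 / 4                      # 250,000 writes/sec

# Ride requests: 10M rides/day, peak assumed 5x average.
avg_rps = 10_000_000 / SECONDS_PER_DAY         # ~116 requests/sec
peak_rps = avg_rps * 5                         # ~580 requests/sec

# Storage: 1 KB per trip row, 100 bytes per GPS point.
trip_gb_per_day = 10_000_000 * 1_000 / 1e9     # 10 GB/day
gps_mb_per_sec = write_qps * 100 / 1e6         # 25 MB/s
gps_tb_per_day = gps_mb_per_sec * SECONDS_PER_DAY / 1e6   # ~2.16 TB/day
```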
5. API design
We use WebSockets for the high-frequency location stream and REST for transactional actions.
1. Update Location (Driver -> Server)
- Protocol: WebSocket
- Payload: { "lat": 37.77, "lng": -122.41, "status": "available" }
- Response: ACK
2. Request Ride (Rider -> Server)
- Method: POST /v1/trips
- Body: { "pickup_lat": ..., "pickup_lng": ..., "dest_lat": ..., "service_type": "standard" }
- Response: { "trip_id": "t123", "status": "matching", "eta": 5 }
3. Accept Ride (Driver -> Server)
- Method: POST /v1/trips/{trip_id}/accept
- Body: { "driver_id": "d456" }
- Response: 200 OK or 409 Conflict (if ride is no longer available).
4. Trip Status Update
- Method: PATCH /v1/trips/{trip_id}
- Body: { "status": "picked_up" }
6. High level architecture
We separate the system into two distinct parts: the Location System (handles high-volume, ephemeral data) and the Trip System (handles business logic and transactions).
Components:
- Load Balancer: Routes HTTP requests and terminates SSL.
- WebSocket Gateway: Maintains persistent connections with millions of drivers.
- Location Service: Ingests the 250k QPS stream and updates the cache.
- Redis (Geospatial): Stores the current location of every active driver.
- Dispatch Service: The "brain." It queries Redis to find nearby drivers and sends offers.
- Trip Service: Manages the state machine (Requested -> Matched -> In Progress) and writes to SQL.
- PostgreSQL: Stores durable data (user profiles, trip logs, payments).
7. Data model
1. Primary Database (PostgreSQL)
Used for critical data requiring ACID guarantees.
- Trips Table:
- trip_id (PK)
- rider_id (FK)
- driver_id (FK) - Nullable initially
- status (Enum: REQUESTED, MATCHING, IN_PROGRESS, COMPLETED, CANCELLED)
- pickup_loc, dropoff_loc
- fare
- created_at, updated_at
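The status column implies a small state machine in which only certain transitions are legal. A minimal sketch (the transition table is an assumption for illustration; it also includes the MATCHED state that the dispatch flow uses later):

```python
# Hypothetical Trip state machine behind the `status` enum.
# The allowed transitions are illustrative assumptions, not a real spec.
from enum import Enum

class TripStatus(Enum):
    REQUESTED = "REQUESTED"
    MATCHING = "MATCHING"
    MATCHED = "MATCHED"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    CANCELLED = "CANCELLED"

# Legal transitions; anything else is rejected by the Trip Service.
TRANSITIONS = {
    TripStatus.REQUESTED:   {TripStatus.MATCHING, TripStatus.CANCELLED},
    TripStatus.MATCHING:    {TripStatus.MATCHED, TripStatus.CANCELLED},
    TripStatus.MATCHED:     {TripStatus.IN_PROGRESS, TripStatus.CANCELLED},
    TripStatus.IN_PROGRESS: {TripStatus.COMPLETED},
    TripStatus.COMPLETED:   set(),   # terminal
    TripStatus.CANCELLED:   set(),   # terminal
}

def advance(current: TripStatus, nxt: TripStatus) -> TripStatus:
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

Centralizing the transition rules like this is what makes "a completed trip cannot be re-matched" a guarantee rather than a convention.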
2. Ephemeral Storage (Redis)
Used for high-speed spatial lookups.
- Key: driver_locations
- Value: Geospatial Set (Driver ID, Lat, Long).
- Logic: We use GEOADD to update driver positions and GEOSEARCH (the successor to the deprecated GEORADIUS) to find drivers within X kilometers.
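The semantics of those two commands can be mimicked with a tiny in-memory stand-in (illustrative only; real Redis uses a geohash-encoded sorted set, not a linear scan):

```python
# In-memory stand-in for Redis GEOADD / GEOSEARCH semantics.
import math

driver_locations: dict[str, tuple[float, float]] = {}

def geoadd(driver_id: str, lat: float, lng: float) -> None:
    driver_locations[driver_id] = (lat, lng)

def haversine_km(lat1, lng1, lat2, lng2) -> float:
    # Great-circle distance between two (lat, lng) points in kilometers.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def geosearch(lat: float, lng: float, radius_km: float) -> list[str]:
    # Return driver IDs within the radius, nearest first.
    hits = [(haversine_km(lat, lng, dlat, dlng), did)
            for did, (dlat, dlng) in driver_locations.items()]
    return [did for dist, did in sorted(hits) if dist <= radius_km]

geoadd("d1", 37.77, -122.41)   # downtown SF
geoadd("d2", 37.80, -122.27)   # across the bay, ~13 km away
print(geosearch(37.775, -122.418, 3.0))   # ['d1']
```

The point here is only the query contract: nearest-first results within a radius. Redis delivers the same contract at in-memory speed via its geohash index.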
3. History Storage (Cassandra or S3)
Used for analytics (debugging accidents, calculating demand maps).
- The Location Service batches updates to a Kafka queue, which dumps data into a data lake. We do not query this for live features.
8. Core flows end to end
This section outlines the two most critical operational loops in a ride-sharing system: tracking where drivers are (Flow 1) and connecting them with riders (Flow 2).
Flow 1: Driver Location Update (The "Heartbeat")
This flow is designed to answer the question: "Who is available right now?" It favors high throughput and low latency over permanent storage.
1. Emit: The 4-Second Interval
- The Logic: The Driver App wakes up every 4 seconds to grab GPS coordinates.
- Why not every 1 second? Real-time accuracy is good, but sending data every second drains the driver's battery and consumes massive amounts of cellular bandwidth.
- Why not every 30 seconds? A car moving at 60 km/h (37 mph) travels 500 meters in 30 seconds. If the data is that old, the dispatch system might assign a ride to a driver who has already driven past the pickup point. 4 seconds is the "Goldilocks" zone: fresh enough for matching, but efficient enough for hardware.
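The staleness numbers behind that trade-off check out directly:

```python
# Staleness check: distance drifted = speed * update interval.
speed_ms = 60 * 1000 / 3600        # 60 km/h in meters/second (~16.7 m/s)

drift_1s = speed_ms * 1            # ~17 m: overkill, costly on battery
drift_4s = speed_ms * 4            # ~67 m: fresh enough for matching
drift_30s = speed_ms * 30          # 500 m: the stale-driver problem
```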
2. Send: The WebSocket Connection
- The Mechanism: Instead of opening a new HTTP connection for every update (which incurs heavy TCP handshake overhead), the app keeps a persistent WebSocket connection open.
- Payload: The app sends a lightweight JSON packet: {driver_id, lat, lng, timestamp}.
- Efficiency: This keeps per-update latency in the milliseconds, so the server sees the car where it is now, not where it was seconds ago due to network lag.
3. Update: Redis & GEOADD
- The Storage Choice: We do not write these updates to a traditional SQL database (like PostgreSQL or MySQL). SQL databases are too slow for hundreds of thousands of write operations per second.
- The Solution: We use Redis, an in-memory data structure store.
- The Command (GEOADD): Redis has specialized commands for geospatial data. GEOADD takes the coordinates and converts them into a Geohash (a string representation of coordinates).
  - Technical Detail: This allows Redis to store the driver in a sorted set, enabling extremely fast queries like "Find points within 2km of X."
4. Expire: The TTL (Time To Live) Strategy
- The Problem: Mobile networks are unreliable. Drivers drive into tunnels, their phones die, or the app crashes. If a driver disconnects without hitting "Log Off," they become a "Ghost Driver": appearing on the map but unable to accept rides.
- The Fix: Every time the location updates, we set a TTL of 15 seconds on that driver's record in Redis.
- How it works:
  - 0s: Driver sends update. Redis key expiry set to 15s.
  - 4s: Driver sends update. Redis key expiry reset to 15s.
  - 8s: Driver sends a final update, then enters a tunnel (loses internet).
  - 23s: No further update received. The TTL hits 0. Redis automatically deletes the key.
- Result: The driver vanishes from the "Available" pool automatically. No complex "disconnect" logic is needed.
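The expiry timeline above can be simulated with plain timestamps (real Redis handles this itself via key TTLs):

```python
# Minimal simulation of the Redis TTL trick using manual timestamps.
TTL_SECONDS = 15

expiry: dict[str, float] = {}   # driver_id -> time at which the record dies

def heartbeat(driver_id: str, now: float) -> None:
    expiry[driver_id] = now + TTL_SECONDS   # each update resets the TTL

def available(driver_id: str, now: float) -> bool:
    return expiry.get(driver_id, -1.0) > now

heartbeat("d1", 0.0)    # update at t=0
heartbeat("d1", 4.0)    # update at t=4
heartbeat("d1", 8.0)    # last update at t=8, then the driver enters a tunnel
print(available("d1", 20.0))   # True: TTL expires at t=23
print(available("d1", 23.5))   # False: the "ghost driver" has vanished
```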
Flow 2: Matching a Ride (The "Dispatch")
This flow manages the transaction of a ride. It requires strict data consistency (ACID) because money and contracts are involved.
1. Request: Trip Creation
- Action: The Rider taps "Request."
- State: The Trip Service creates a row in the primary database (SQL).
- trip_id: 101
- rider_id: 50
- status: REQUESTED
- Importance: This is the "source of truth." If the system crashes, we still have a record that this rider wants a ride.
2. Search: The Spatial Query
- Query: The Dispatch Service asks Redis (using GEORADIUS or GEOSEARCH): "Give me the IDs of drivers within a 3km radius of this lat/long."
- Speed: Because Redis stores locations in memory using Geohashes, it can return the 10 closest IDs in roughly 1-2 milliseconds, even with millions of drivers.
3. Filter: Business Logic
Redis only knows where drivers are, not who they are. The service must filter the list of IDs returned by Redis:
- Availability: Is the driver's status actually AVAILABLE in the main DB? (Maybe they just accepted another ride 1 millisecond ago).
- Blocking: Has this driver previously blocked this rider (or vice versa)?
- Vehicle Type: Did the rider request "UberXL" but the closest driver is "UberX"?
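A hypothetical filter pass over those IDs might look like this (the field names are illustrative, not a real schema):

```python
# Business-logic filter applied to the IDs Redis returned, nearest first.
def filter_candidates(candidate_ids, drivers, rider):
    eligible = []
    for did in candidate_ids:                     # keep nearest-first order
        d = drivers[did]
        if d["status"] != "AVAILABLE":
            continue                              # just took another ride
        if rider["id"] in d["blocked"] or did in rider["blocked"]:
            continue                              # mutual block list
        if d["vehicle_type"] != rider["requested_type"]:
            continue                              # e.g. XL requested, X offered
        eligible.append(did)
    return eligible

drivers = {
    "d1": {"status": "AVAILABLE", "blocked": set(),  "vehicle_type": "standard"},
    "d2": {"status": "ON_TRIP",   "blocked": set(),  "vehicle_type": "standard"},
    "d3": {"status": "AVAILABLE", "blocked": {"r1"}, "vehicle_type": "standard"},
}
rider = {"id": "r1", "blocked": set(), "requested_type": "standard"}
print(filter_candidates(["d2", "d3", "d1"], drivers, rider))   # ['d1']
```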
4. Offer: The Push
- Action: The system picks the absolute closest driver (Driver A) after filtering.
- Transport: It sends a message via the WebSocket Gateway to Driver A's phone: "New Ride! 3 mins away. $15 fare. Accept?"
5. Driver Decision: The Critical "Lock"
This is the most complex part of the flow due to Race Conditions.
Scenario: Driver A sees the offer.
- If Accepted:
- Request hits Trip Service.
- Database Transaction: The service runs a query similar to:

```sql
UPDATE trips
SET status = 'MATCHED', driver_id = 'DriverA'
WHERE trip_id = 101 AND status = 'REQUESTED';
```
- The Check: The WHERE status = 'REQUESTED' is vital. If another process (or error) already matched this trip, the update affects 0 rows. The transaction fails, and we tell Driver A "Sorry, ride no longer available."
- Success: If the row updates, we lock the deal and notify the rider.
- If Ignored/Declined:
  - The system waits for a timeout (e.g., 15s) or an explicit "Decline" button press.
  - The Dispatch Service removes Driver A from the candidate list.
  - It picks the next closest driver (Driver B) from the original list (or re-queries Redis) and repeats the loop.
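The compare-and-set guard in the accept step can be demonstrated end to end with SQLite, which exposes the same rowcount behavior:

```python
# Demonstrating the conditional UPDATE: the second "accept" touches zero rows
# because the status guard no longer matches, so the race has a single winner.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trips (trip_id INTEGER PRIMARY KEY, driver_id TEXT, status TEXT)")
db.execute("INSERT INTO trips VALUES (101, NULL, 'REQUESTED')")

def try_accept(driver_id: str) -> bool:
    cur = db.execute(
        "UPDATE trips SET status = 'MATCHED', driver_id = ? "
        "WHERE trip_id = 101 AND status = 'REQUESTED'",
        (driver_id,),
    )
    db.commit()
    return cur.rowcount == 1   # 1 row -> we won the race; 0 rows -> too late

print(try_accept("DriverA"))   # True: lock acquired
print(try_accept("DriverB"))   # False: trip already MATCHED
```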
9. Caching and read performance
- Driver Locations (Hot Data):
- We treat Redis as the source of truth for "current location." We do not read from SQL for map views. This ensures sub-millisecond read times.
- ETA Calculations:
- Routing (calculating drive time) is expensive. We cache ETAs for common routes (e.g., "SFO Airport" to "Downtown") in a dedicated cache cluster to avoid calling the Routing Engine repeatedly.
- User Profiles:
- Rider/Driver names and photos are cached in a standard LRU Redis cache to reduce load on Postgres.
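A bounded LRU cache around the routing call is a one-liner in Python; `compute_eta_minutes` below is a stand-in for the real Routing Engine, not an actual API:

```python
# Sketch of caching an expensive routing call behind a bounded LRU cache.
from functools import lru_cache

routing_calls = 0   # counts how often the "expensive" engine is actually hit

@lru_cache(maxsize=10_000)
def cached_eta(origin: str, dest: str) -> int:
    global routing_calls
    routing_calls += 1
    return compute_eta_minutes(origin, dest)   # the expensive part

def compute_eta_minutes(origin: str, dest: str) -> int:
    # Placeholder: pretend this calls the routing engine.
    return 25 if (origin, dest) == ("SFO", "Downtown") else 10

cached_eta("SFO", "Downtown")
cached_eta("SFO", "Downtown")   # served from cache
print(routing_calls)            # 1: only the first call hit the engine
```

In production the cache would also need a TTL, since traffic conditions change; `lru_cache` only demonstrates the hit-rate mechanics.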
10. Storage, indexing and media
- Geospatial Indexing:
- To efficiently find "drivers near X," we cannot scan the whole list.
- Geohashing: We divide the world into a grid. Each cell has a unique ID. Drivers in the same cell share the same ID prefix.
- Implementation: Redis GEO commands handle this mathematically (using Geohash or similar). It turns a 2D search into a fast O(log N) lookup.
- Media:
- Driver profile photos and car images are stored in Object Storage (S3).
- They are served via CDN (CloudFront) so images load instantly even if the user is on a slow network.
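The prefix property that makes geohashing work can be shown with a minimal encoder (illustrative; Redis internally maps the geohash to a 52-bit sorted-set score, but the prefix idea is the same):

```python
# Minimal geohash encoder: interleave longitude/latitude bits, emit base32.
# Nearby points share a prefix, turning "everything in this cell" into a
# prefix/range scan instead of a full scan.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lng: float, precision: int = 6) -> str:
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    out, bits, nbits, even = "", 0, 0, True   # even=True -> longitude bit
    while len(out) < precision:
        rng, val = (lng_rng, lng) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bits = (bits << 1) | (1 if val >= mid else 0)
        if val >= mid:
            rng[0] = mid       # keep the upper half of the interval
        else:
            rng[1] = mid       # keep the lower half
        even, nbits = not even, nbits + 1
        if nbits == 5:         # 5 bits -> one base32 character
            out += BASE32[bits]
            bits, nbits = 0, 0
    return out

print(geohash(42.605, -5.603, 5))   # ezs42 (the classic example)
print(geohash(42.606, -5.604, 3))   # ezs -- a nearby point, same cell prefix
```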
11. Scaling strategies
1. Geographic Sharding (The "Golden Key"):
- Ride-sharing is a regional problem. A request in New York does not affect London.
- We shard our Location Service, Redis, and Dispatch Service by Region (e.g., US-East, EU-West, Asia-Pacific).
- This isolates failures. If the "US-East" cluster goes down, Europe is unaffected.
2. Batch Matching (Optimizing High Load):
- In high-density areas (e.g., a stadium after a concert), "Greedy Matching" (first driver gets the ride) is inefficient.
- Strategy: We accumulate requests for a short window (e.g., 2 seconds) and run a "Bipartite Matching" algorithm to pair the pool of riders to the pool of drivers in a way that minimizes total wait time for everyone.
3. Database Replication:
- PostgreSQL uses a Leader-Follower setup.
- Writes (Ride Requests) go to the Leader.
- Reads (View History, View Profile) go to Followers.
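The payoff of batch matching over greedy matching shows up even in a two-rider toy example (brute force here for clarity; a real dispatcher would use a proper bipartite algorithm such as the Hungarian method):

```python
# Toy batch matcher: find the rider->driver assignment minimizing total
# pickup distance (a stand-in for total wait time).
from itertools import permutations

def total_cost(assignment, cost):
    return sum(cost[r][d] for r, d in enumerate(assignment))

def batch_match(cost):
    """cost[r][d] = pickup distance from driver d to rider r (km)."""
    n_riders, n_drivers = len(cost), len(cost[0])
    best = min(permutations(range(n_drivers), n_riders),
               key=lambda a: total_cost(a, cost))
    return list(best)   # best[r] = driver index assigned to rider r

# Greedy would give rider 0 the shared closest driver 0 (cost 1) and force
# rider 1 onto distant driver 1 (cost 10): total 11.
cost = [[1, 2],    # rider 0: driver 0 is 1 km away, driver 1 is 2 km
        [2, 10]]   # rider 1: driver 0 is 2 km away, driver 1 is 10 km
print(batch_match(cost))   # [1, 0]: total 4 beats greedy's 11
```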
12. Reliability, failure handling and backpressure
- Graceful Degradation:
- If the Google Maps API (Routing) fails, we fall back to Euclidean Distance (straight line). It's less accurate for ETAs but keeps the business running.
- Circuit Breakers:
- If the Dispatch Service is overwhelmed, the Trip Service will stop sending requests and return "High Demand, please wait" to riders instead of crashing the backend.
- Redundancy:
- Redis runs in Cluster mode with replicas. If a primary node fails, a replica is promoted automatically to prevent losing location data.
13. Security, privacy and abuse
- Privacy (Phone Masking):
- Riders and Drivers never see real phone numbers. We use a proxy service (like Twilio) to bridge calls anonymously.
- Data Fuzzing:
- In the analytics warehouse, we "fuzz" (randomize) the start/end coordinates of trips so employees cannot look up exactly where a celebrity or user lives.
- Fraud:
- GPS Spoofing: We run algorithms to detect "Teleportation" (impossible travel speeds) and ban drivers using fake GPS apps to farm bonuses.
14. Bottlenecks and next steps
- Bottleneck: Redis Memory.
- Risk: If we store too much metadata with the location, Redis fills up.
- Fix: Store only DriverID + Lat/Long in Redis. Fetch other details (Car type, Name) from a separate cache only when needed.
- Bottleneck: Map Service Costs.
- Risk: Paying external providers (Google/Mapbox) for every ETA is expensive.
- Next Step: Build an internal routing engine using OpenStreetMap (OSRM) to calculate routes for free.
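The memory bound behind the "store only ID + coordinates" fix is easy to check (100 bytes per entry is a rough assumption covering the member string, the 52-bit geo score, and sorted-set overhead):

```python
# Rough Redis memory footprint if we keep only driver ID + coordinates.
drivers = 1_000_000
bytes_per_entry = 100     # assumed: member string + geo score + overhead
total_mb = drivers * bytes_per_entry / 1e6
# ~100 MB: the live location set fits comfortably on a single Redis node,
# so the real risk is attaching extra metadata, not the coordinates themselves.
```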
Summary
- We solved the write-heavy location problem using Redis/Geospatial.
- We solved scale using Regional Sharding.
- We ensured reliability via redundancy, circuit breakers, and graceful fallbacks.