Here is a step-by-step system design for a ride-hailing platform like Uber or Lyft.
1. Problem Definition and Scope
We are designing a two-sided marketplace that connects Riders (demand) with Drivers (supply) in real-time. The system must track driver locations, match riders to nearby drivers, and manage the trip lifecycle.
- Main User Groups:
  - Riders: Request rides, track vehicle arrival, and pay.
  - Drivers: Share location, receive ride offers, and navigate.
- Scope:
  - We will focus on the Dispatch System (Matching) and the Location System (Tracking).
  - We will cover the flow from "Requesting a Ride" to "Trip Completion."
  - Out of Scope: Shared rides (Pool), food delivery (Eats), user registration, and complex payment gateways (we assume a generic payment service).
2. Clarify functional requirements
Must Have:
- Driver Location Updates: Drivers send GPS coordinates every few seconds.
- Nearby Drivers: Riders can see available cars on a map before requesting.
- Request Ride: Riders can enter a destination and book a trip.
- Matching (Dispatch): The system finds the best available driver and offers the ride.
- Trip Management: Drivers can accept/decline rides; the system tracks status (Arrived, In Progress, Ended).
- Fare Calculation: Price is estimated based on distance and time.
Nice to Have:
- Surge Pricing: Dynamic pricing during high demand.
- Ride History: Viewing past trip details.
- Scheduled Rides: Booking in advance.
3. Clarify non-functional requirements
- Scale:
- 10 Million Daily Active Riders.
- 1 Million Active Drivers online at peak.
- Latency:
- Matching: < 2 seconds to find a driver.
- Map Updates: Real-time feel (driver icon moves within seconds of real life).
- Availability: 99.99%. Downtime causes immediate financial loss and safety issues.
- Consistency:
- Strong Consistency: Required for Trip State (a ride cannot be double-booked).
- Eventual Consistency: Acceptable for Location Data (if the map view is 2 seconds behind, it's acceptable).
4. Back of the envelope estimates
Traffic (QPS):
- Location Updates (Write Heavy):
- 1 million drivers.
- Update every 4 seconds.
- $1,000,000 / 4 = 250,000$ writes per second.
- Insight: This is too high for a standard SQL database; we need a specialized throughput strategy.
- Ride Requests:
- 10 million rides / day.
- $10,000,000 / 86,400 \approx 115$ requests/sec avg.
- Peak (5x) $\approx 600$ requests/sec.
- Insight: Manageable by standard web servers.
Storage:
- Trip Data: 1KB per trip. $10\text{M} \times 1\text{KB} = 10\text{GB/day}$.
- Location History: If we archive all GPS points for analytics:
  - 250k QPS $\times$ 100 bytes $\approx$ 25 MB/s.
  - $25 \text{ MB/s} \times 86,400 \approx 2.1 \text{ TB/day}$.
  - We need "Cold Storage" (like S3) for this history.
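These estimates are quick to sanity-check with Python as a calculator:

```python
# Sanity-check the back-of-the-envelope numbers above with plain arithmetic.
SECONDS_PER_DAY = 86_400

# Location writes: 1M drivers, one update every 4 seconds.
write_qps = 1_000_000 / 4                      # 250,000 writes/sec

# Ride requests: 10M rides/day, peak assumed 5x average.
avg_rps = 10_000_000 / SECONDS_PER_DAY         # ~116 requests/sec
peak_rps = avg_rps * 5                         # ~580 requests/sec

# Storage: 1 KB per trip row, 100 bytes per GPS point.
trip_gb_per_day = 10_000_000 * 1_000 / 1e9     # 10 GB/day
gps_mb_per_sec = write_qps * 100 / 1e6         # 25 MB/s
gps_tb_per_day = gps_mb_per_sec * SECONDS_PER_DAY / 1e6   # ~2.16 TB/day
```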
5. API design
We use WebSockets for the high-frequency location stream and REST for transactional actions.
1. Update Location (Driver -> Server)
- Protocol: WebSocket
- Payload: { "lat": 37.77, "lng": -122.41, "status": "available" }
- Response: ACK
2. Request Ride (Rider -> Server)
- Method: POST /v1/trips
- Body: { "pickup_lat": ..., "pickup_lng": ..., "dest_lat": ..., "service_type": "standard" }
- Response: { "trip_id": "t123", "status": "matching", "eta": 5 }
3. Accept Ride (Driver -> Server)
- Method: POST /v1/trips/{trip_id}/accept
- Body: { "driver_id": "d456" }
- Response: 200 OK or 409 Conflict (if ride is no longer available).
4. Trip Status Update
- Method: PATCH /v1/trips/{trip_id}
- Body: { "status": "picked_up" }
6. High level architecture
We separate the system into two distinct parts: the Location System (handles high-volume, ephemeral data) and the Trip System (handles business logic and transactions).
Components:
- Load Balancer: Routes HTTP requests and terminates SSL.
- WebSocket Gateway: Maintains persistent connections with millions of drivers.
- Location Service: Ingests the 250k QPS stream and updates the cache.
- Redis (Geospatial): Stores the current location of every active driver.
- Dispatch Service: The "brain." It queries Redis to find nearby drivers and sends offers.
- Trip Service: Manages the state machine (Requested -> Matched -> In Progress) and writes to SQL.
- PostgreSQL: Stores durable data (user profiles, trip logs, payments).
7. Data model
1. Primary Database (PostgreSQL)
Used for critical data requiring ACID guarantees.
- Trips Table:
- trip_id (PK)
- rider_id (FK)
- driver_id (FK) - Nullable initially
- status (Enum: REQUESTED, MATCHING, IN_PROGRESS, COMPLETED, CANCELLED)
- pickup_loc, dropoff_loc
- fare
- created_at, updated_at
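The status column implies a small state machine in which only certain transitions are legal. A minimal sketch (the transition table is an assumption for illustration; it also includes the MATCHED state that the dispatch flow uses later):

```python
# Hypothetical Trip state machine behind the `status` enum.
# The allowed transitions are illustrative assumptions, not a real spec.
from enum import Enum

class TripStatus(Enum):
    REQUESTED = "REQUESTED"
    MATCHING = "MATCHING"
    MATCHED = "MATCHED"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    CANCELLED = "CANCELLED"

# Legal transitions; anything else is rejected by the Trip Service.
TRANSITIONS = {
    TripStatus.REQUESTED:   {TripStatus.MATCHING, TripStatus.CANCELLED},
    TripStatus.MATCHING:    {TripStatus.MATCHED, TripStatus.CANCELLED},
    TripStatus.MATCHED:     {TripStatus.IN_PROGRESS, TripStatus.CANCELLED},
    TripStatus.IN_PROGRESS: {TripStatus.COMPLETED},
    TripStatus.COMPLETED:   set(),   # terminal
    TripStatus.CANCELLED:   set(),   # terminal
}

def advance(current: TripStatus, nxt: TripStatus) -> TripStatus:
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```

Centralizing the transition rules like this is what makes "a completed trip cannot be re-matched" a guarantee rather than a convention.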
2. Ephemeral Storage (Redis)
Used for high-speed spatial lookups.
- Key: driver_locations
- Value: Geospatial Set (Driver ID, Lat, Long).
- Logic: We use GEOADD to update driver positions and GEOSEARCH (the successor to the deprecated GEORADIUS) to find drivers within X kilometers.
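The semantics of those two commands can be mimicked with a tiny in-memory stand-in (illustrative only; real Redis uses a geohash-encoded sorted set, not a linear scan):

```python
# In-memory stand-in for Redis GEOADD / GEOSEARCH semantics.
import math

driver_locations: dict[str, tuple[float, float]] = {}

def geoadd(driver_id: str, lat: float, lng: float) -> None:
    driver_locations[driver_id] = (lat, lng)

def haversine_km(lat1, lng1, lat2, lng2) -> float:
    # Great-circle distance between two (lat, lng) points in kilometers.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def geosearch(lat: float, lng: float, radius_km: float) -> list[str]:
    # Return driver IDs within the radius, nearest first.
    hits = [(haversine_km(lat, lng, dlat, dlng), did)
            for did, (dlat, dlng) in driver_locations.items()]
    return [did for dist, did in sorted(hits) if dist <= radius_km]

geoadd("d1", 37.77, -122.41)   # downtown SF
geoadd("d2", 37.80, -122.27)   # across the bay, ~13 km away
print(geosearch(37.775, -122.418, 3.0))   # ['d1']
```

The point here is only the query contract: nearest-first results within a radius. Redis delivers the same contract at in-memory speed via its geohash index.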
3. History Storage (Cassandra or S3)
Used for analytics (debugging accidents, calculating demand maps).
- The Location Service batches updates to a Kafka queue, which dumps data into a data lake. We do not query this for live features.
8. Core flows end to end
This section outlines the two most critical operational loops in a ride-sharing system: tracking where drivers are (Flow 1) and connecting them with riders (Flow 2).
Flow 1: Driver Location Update (The "Heartbeat")
This flow is designed to answer the question: "Who is available right now?" It favors high throughput and low latency over permanent storage.
1. Emit: The 4-Second Interval
- The Logic: The Driver App wakes up every 4 seconds to grab GPS coordinates.
- Why not every 1 second? Real-time accuracy is good, but sending data every second drains the driver's battery and consumes massive amounts of cellular bandwidth.
- Why not every 30 seconds? A car moving at 60 km/h (37 mph) travels 500 meters in 30 seconds. If the data is that old, the dispatch system might assign a ride to a driver who has already driven past the pickup point. 4 seconds is the "Goldilocks" zone: fresh enough for matching, but efficient enough for hardware.
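The staleness numbers behind that trade-off check out directly:

```python
# Staleness check: distance drifted = speed * update interval.
speed_ms = 60 * 1000 / 3600        # 60 km/h in meters/second (~16.7 m/s)

drift_1s = speed_ms * 1            # ~17 m: overkill, costly on battery
drift_4s = speed_ms * 4            # ~67 m: fresh enough for matching
drift_30s = speed_ms * 30          # 500 m: the stale-driver problem
```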
2. Send: The WebSocket Connection
- The Mechanism: Instead of opening a new HTTP connection for every update (which incurs heavy TCP handshake overhead), the app keeps a persistent WebSocket connection open.
- Payload: The app sends a lightweight JSON packet: {driver_id, lat, lng, timestamp}.
- Efficiency: This keeps per-update latency in the milliseconds, so the server sees the car where it is now, not where it was seconds ago due to network lag.
3. Update: Redis & GEOADD
- The Storage Choice: We do not write these updates to a traditional SQL database (like PostgreSQL or MySQL). SQL databases are too slow for hundreds of thousands of write operations per second.
- The Solution: We use Redis, an in-memory data structure store.
- The Command (GEOADD): Redis has specialized commands for geospatial data. GEOADD takes the coordinates and converts them into a Geohash (a string representation of coordinates).
  - Technical Detail: This allows Redis to store the driver in a sorted set, enabling extremely fast queries like "Find points within 2km of X."
4. Expire: The TTL (Time To Live) Strategy
- The Problem: Mobile networks are unreliable. Drivers drive into tunnels, their phones die, or the app crashes. If a driver disconnects without hitting "Log Off," they become a "Ghost Driver": appearing on the map but unable to accept rides.
- The Fix: Every time the location updates, we set a TTL of 15 seconds on that driver's record in Redis.
- How it works:
  - 0s: Driver sends update. Redis key expiry set to 15s.
  - 4s: Driver sends update. Redis key expiry reset to 15s.
  - 8s: Driver sends a final update, then enters a tunnel (loses internet).
  - 23s: No further update received. The TTL hits 0. Redis automatically deletes the key.
- Result: The driver vanishes from the "Available" pool automatically. No complex "disconnect" logic is needed.
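The expiry timeline above can be simulated with plain timestamps (real Redis handles this itself via key TTLs):

```python
# Minimal simulation of the Redis TTL trick using manual timestamps.
TTL_SECONDS = 15

expiry: dict[str, float] = {}   # driver_id -> time at which the record dies

def heartbeat(driver_id: str, now: float) -> None:
    expiry[driver_id] = now + TTL_SECONDS   # each update resets the TTL

def available(driver_id: str, now: float) -> bool:
    return expiry.get(driver_id, -1.0) > now

heartbeat("d1", 0.0)    # update at t=0
heartbeat("d1", 4.0)    # update at t=4
heartbeat("d1", 8.0)    # last update at t=8, then the driver enters a tunnel
print(available("d1", 20.0))   # True: TTL expires at t=23
print(available("d1", 23.5))   # False: the "ghost driver" has vanished
```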
Flow 2: Matching a Ride (The "Dispatch")
This flow manages the transaction of a ride. It requires strict data consistency (ACID) because money and contracts are involved.
1. Request: Trip Creation
- Action: The Rider taps "Request."
- State: The Trip Service creates a row in the primary database (SQL).
- trip_id: 101
- rider_id: 50
- status: REQUESTED
- Importance: This is the "source of truth." If the system crashes, we still have a record that this rider wants a ride.
2. Search: The Spatial Query
- Query: The Dispatch Service asks Redis (using GEORADIUS or GEOSEARCH): "Give me the IDs of drivers within a 3km radius of this lat/long."
- Speed: Because Redis stores locations in memory using Geohashes, it can return the 10 closest IDs in roughly 1-2 milliseconds, even with millions of drivers.
3. Filter: Business Logic
Redis only knows where drivers are, not who they are. The service must filter the list of IDs returned by Redis:
- Availability: Is the driver's status actually AVAILABLE in the main DB? (Maybe they just accepted another ride 1 millisecond ago).
- Blocking: Has this driver previously blocked this rider (or vice versa)?
- Vehicle Type: Did the rider request "UberXL" but the closest driver is "UberX"?
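A hypothetical filter pass over those IDs might look like this (the field names are illustrative, not a real schema):

```python
# Business-logic filter applied to the IDs Redis returned, nearest first.
def filter_candidates(candidate_ids, drivers, rider):
    eligible = []
    for did in candidate_ids:                     # keep nearest-first order
        d = drivers[did]
        if d["status"] != "AVAILABLE":
            continue                              # just took another ride
        if rider["id"] in d["blocked"] or did in rider["blocked"]:
            continue                              # mutual block list
        if d["vehicle_type"] != rider["requested_type"]:
            continue                              # e.g. XL requested, X offered
        eligible.append(did)
    return eligible

drivers = {
    "d1": {"status": "AVAILABLE", "blocked": set(),  "vehicle_type": "standard"},
    "d2": {"status": "ON_TRIP",   "blocked": set(),  "vehicle_type": "standard"},
    "d3": {"status": "AVAILABLE", "blocked": {"r1"}, "vehicle_type": "standard"},
}
rider = {"id": "r1", "blocked": set(), "requested_type": "standard"}
print(filter_candidates(["d2", "d3", "d1"], drivers, rider))   # ['d1']
```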
4. Offer: The Push
- Action: The system picks the absolute closest driver (Driver A) after filtering.
- Transport: It sends a message via the WebSocket Gateway to Driver A's phone: "New Ride! 3 mins away. $15 fare. Accept?"
5. Driver Decision: The Critical "Lock"
This is the most complex part of the flow due to Race Conditions.
Scenario: Driver A sees the offer.
- If Accepted:
- Request hits Trip Service.
- Database Transaction: The service runs a query similar to:

```sql
UPDATE trips
SET status = 'MATCHED', driver_id = 'DriverA'
WHERE trip_id = 101 AND status = 'REQUESTED';
```
- The Check: The WHERE status = 'REQUESTED' is vital. If another process (or error) already matched this trip, the update affects 0 rows. The transaction fails, and we tell Driver A "Sorry, ride no longer available."
- Success: If the row updates, we lock the deal and notify the rider.
- If Ignored/Declined:
  - The system waits for a timeout (e.g., 15s) or an explicit "Decline" button press.
  - The Dispatch Service removes Driver A from the candidate list.
  - It picks the next closest driver (Driver B) from the original list (or re-queries Redis) and repeats the loop.
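The compare-and-set guard in the accept step can be demonstrated end to end with SQLite, which exposes the same rowcount behavior:

```python
# Demonstrating the conditional UPDATE: the second "accept" touches zero rows
# because the status guard no longer matches, so the race has a single winner.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trips (trip_id INTEGER PRIMARY KEY, driver_id TEXT, status TEXT)")
db.execute("INSERT INTO trips VALUES (101, NULL, 'REQUESTED')")

def try_accept(driver_id: str) -> bool:
    cur = db.execute(
        "UPDATE trips SET status = 'MATCHED', driver_id = ? "
        "WHERE trip_id = 101 AND status = 'REQUESTED'",
        (driver_id,),
    )
    db.commit()
    return cur.rowcount == 1   # 1 row -> we won the race; 0 rows -> too late

print(try_accept("DriverA"))   # True: lock acquired
print(try_accept("DriverB"))   # False: trip already MATCHED
```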
9. Caching and read performance
- Driver Locations (Hot Data):
- We treat Redis as the source of truth for "current location." We do not read from SQL for map views. This ensures sub-millisecond read times.
- ETA Calculations:
- Routing (calculating drive time) is expensive. We cache ETAs for common routes (e.g., "SFO Airport" to "Downtown") in a dedicated cache cluster to avoid calling the Routing Engine repeatedly.
- User Profiles:
- Rider/Driver names and photos are cached in a standard LRU Redis cache to reduce load on Postgres.
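A bounded LRU cache around the routing call is a one-liner in Python; `compute_eta_minutes` below is a stand-in for the real Routing Engine, not an actual API:

```python
# Sketch of caching an expensive routing call behind a bounded LRU cache.
from functools import lru_cache

routing_calls = 0   # counts how often the "expensive" engine is actually hit

@lru_cache(maxsize=10_000)
def cached_eta(origin: str, dest: str) -> int:
    global routing_calls
    routing_calls += 1
    return compute_eta_minutes(origin, dest)   # the expensive part

def compute_eta_minutes(origin: str, dest: str) -> int:
    # Placeholder: pretend this calls the routing engine.
    return 25 if (origin, dest) == ("SFO", "Downtown") else 10

cached_eta("SFO", "Downtown")
cached_eta("SFO", "Downtown")   # served from cache
print(routing_calls)            # 1: only the first call hit the engine
```

In production the cache would also need a TTL, since traffic conditions change; `lru_cache` only demonstrates the hit-rate mechanics.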
10. Storage, indexing and media
- Geospatial Indexing:
- To efficiently find "drivers near X," we cannot scan the whole list.
- Geohashing: We divide the world into a grid. Each cell has a unique ID. Drivers in the same cell share the same ID prefix.
- Implementation: Redis GEO commands handle this mathematically (using Geohash or similar). It turns a 2D search into a fast O(log N) lookup.
- Media:
- Driver profile photos and car images are stored in Object Storage (S3).
- They are served via CDN (CloudFront) so images load instantly even if the user is on a slow network.
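The prefix property that makes geohashing work can be shown with a minimal encoder (illustrative; Redis internally maps the geohash to a 52-bit sorted-set score, but the prefix idea is the same):

```python
# Minimal geohash encoder: interleave longitude/latitude bits, emit base32.
# Nearby points share a prefix, turning "everything in this cell" into a
# prefix/range scan instead of a full scan.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat: float, lng: float, precision: int = 6) -> str:
    lat_rng, lng_rng = [-90.0, 90.0], [-180.0, 180.0]
    out, bits, nbits, even = "", 0, 0, True   # even=True -> longitude bit
    while len(out) < precision:
        rng, val = (lng_rng, lng) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        bits = (bits << 1) | (1 if val >= mid else 0)
        if val >= mid:
            rng[0] = mid       # keep the upper half of the interval
        else:
            rng[1] = mid       # keep the lower half
        even, nbits = not even, nbits + 1
        if nbits == 5:         # 5 bits -> one base32 character
            out += BASE32[bits]
            bits, nbits = 0, 0
    return out

print(geohash(42.605, -5.603, 5))   # ezs42 (the classic example)
print(geohash(42.606, -5.604, 3))   # ezs -- a nearby point, same cell prefix
```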
11. Scaling strategies
1. Geographic Sharding (The "Golden Key"):
- Ride-sharing is a regional problem. A request in New York does not affect London.
- We shard our Location Service, Redis, and Dispatch Service by Region (e.g., US-East, EU-West, Asia-Pacific).
- This isolates failures. If the "US-East" cluster goes down, Europe is unaffected.
2. Batch Matching (Optimizing High Load):
- In high-density areas (e.g., a stadium after a concert), "Greedy Matching" (first driver gets the ride) is inefficient.
- Strategy: We accumulate requests for a short window (e.g., 2 seconds) and run a "Bipartite Matching" algorithm to pair the pool of riders to the pool of drivers in a way that minimizes total wait time for everyone.
3. Database Replication:
- PostgreSQL uses a Leader-Follower setup.
- Writes (Ride Requests) go to the Leader.
- Reads (View History, View Profile) go to Followers.
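The payoff of batch matching over greedy matching shows up even in a two-rider toy example (brute force here for clarity; a real dispatcher would use a proper bipartite algorithm such as the Hungarian method):

```python
# Toy batch matcher: find the rider->driver assignment minimizing total
# pickup distance (a stand-in for total wait time).
from itertools import permutations

def total_cost(assignment, cost):
    return sum(cost[r][d] for r, d in enumerate(assignment))

def batch_match(cost):
    """cost[r][d] = pickup distance from driver d to rider r (km)."""
    n_riders, n_drivers = len(cost), len(cost[0])
    best = min(permutations(range(n_drivers), n_riders),
               key=lambda a: total_cost(a, cost))
    return list(best)   # best[r] = driver index assigned to rider r

# Greedy would give rider 0 the shared closest driver 0 (cost 1) and force
# rider 1 onto distant driver 1 (cost 10): total 11.
cost = [[1, 2],    # rider 0: driver 0 is 1 km away, driver 1 is 2 km
        [2, 10]]   # rider 1: driver 0 is 2 km away, driver 1 is 10 km
print(batch_match(cost))   # [1, 0]: total 4 beats greedy's 11
```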
12. Reliability, failure handling and backpressure
- Graceful Degradation:
- If the Google Maps API (Routing) fails, we fall back to Euclidean Distance (straight line). It's less accurate for ETAs but keeps the business running.
- Circuit Breakers:
- If the Dispatch Service is overwhelmed, the Trip Service will stop sending requests and return "High Demand, please wait" to riders instead of crashing the backend.
- Redundancy:
- Redis runs in Cluster mode with replicas. If a primary node fails, a replica is promoted automatically to prevent losing location data.
13. Security, privacy and abuse
- Privacy (Phone Masking):
- Riders and Drivers never see real phone numbers. We use a proxy service (like Twilio) to bridge calls anonymously.
- Data Fuzzing:
- In the analytics warehouse, we "fuzz" (randomize) the start/end coordinates of trips so employees cannot look up exactly where a celebrity or user lives.
- Fraud:
- GPS Spoofing: We run algorithms to detect "Teleportation" (impossible travel speeds) and ban drivers using fake GPS apps to farm bonuses.
14. Bottlenecks and next steps
- Bottleneck: Redis Memory.
- Risk: If we store too much metadata with the location, Redis fills up.
- Fix: Store only DriverID + Lat/Long in Redis. Fetch other details (Car type, Name) from a separate cache only when needed.
- Bottleneck: Map Service Costs.
- Risk: Paying external providers (Google/Mapbox) for every ETA is expensive.
- Next Step: Build an internal routing engine using OpenStreetMap (OSRM) to calculate routes for free.
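The memory bound behind the "store only ID + coordinates" fix is easy to check (100 bytes per entry is a rough assumption covering the member string, the 52-bit geo score, and sorted-set overhead):

```python
# Rough Redis memory footprint if we keep only driver ID + coordinates.
drivers = 1_000_000
bytes_per_entry = 100     # assumed: member string + geo score + overhead
total_mb = drivers * bytes_per_entry / 1e6
# ~100 MB: the live location set fits comfortably on a single Redis node,
# so the real risk is attaching extra metadata, not the coordinates themselves.
```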
Summary
- We solved the write-heavy location problem using Redis/Geospatial.
- We solved scale using Regional Sharding.
- We ensured reliability via redundancy, circuit breakers, and graceful fallbacks.