Chapter 8: System Design Interview Mastery

8.9 Design a ride-sharing service (Uber / Lyft)

Here is a step-by-step system design for a ride-hailing platform like Uber or Lyft.

1. Problem Definition and Scope

We are designing a two-sided marketplace that connects Riders (demand) with Drivers (supply) in real time. The system must track driver locations, match riders to nearby drivers, and manage the trip lifecycle.

  • Main User Groups:
    • Riders: Request rides, track vehicle arrival, and pay.

    • Drivers: Share location, receive ride offers, and navigate.

  • Scope:
    • We will focus on the Dispatch System (Matching) and Location System (Tracking).

    • We will cover the flow from "Requesting a Ride" to "Trip Completion."

    • Out of Scope: Shared rides (Pool), Food delivery (Eats), User registration, and complex Payment Gateways (we assume a generic payment service).

2. Clarify functional requirements

Must Have:

  • Driver Location Updates: Drivers send GPS coordinates every few seconds.

  • Nearby Drivers: Riders can see available cars on a map before requesting.

  • Request Ride: Riders can enter a destination and book a trip.

  • Matching (Dispatch): The system finds the best available driver and offers the ride.

  • Trip Management: Drivers can accept/decline rides; system tracks status (Arrived, In Progress, Ended).

  • Fare Calculation: Price is estimated based on distance and time.

Nice to Have:

  • Surge Pricing: Dynamic pricing during high demand.

  • Ride History: Viewing past trip details.

  • Scheduled Rides: Booking in advance.


3. Clarify non-functional requirements

  • Scale:
    • 10 Million Daily Active Riders.
    • 1 Million Active Drivers online at peak.
  • Latency:
    • Matching: < 2 seconds to find a driver.
    • Map Updates: Real-time feel (driver icon moves within seconds of real life).
  • Availability: 99.99%. Downtime causes immediate financial loss and safety issues.
  • Consistency:
    • Strong Consistency: Required for Trip State (a ride cannot be double-booked).
    • Eventual Consistency: Acceptable for Location Data (if the map view is 2 seconds behind, it's acceptable).


4. Back of the envelope estimates

Traffic (QPS):

  • Location Updates (Write Heavy):
    • 1 million drivers.
    • Update every 4 seconds.
    • $1,000,000 / 4 = 250,000$ Writes Per Second (QPS).
    • Insight: This is too high for a standard SQL database; we need a specialized throughput strategy.
  • Ride Requests:
    • 10 million rides / day.
    • $10,000,000 / 86,400 \approx 115$ requests/sec avg.
    • Peak (5x) $\approx 600$ requests/sec.
    • Insight: Manageable by standard web servers.

Storage:

  • Trip Data: 1KB per trip. $10\text{M} \times 1\text{KB} = 10\text{GB/day}$.

  • Location History: If we archive all GPS points for analytics:

    • 250k QPS $\times$ 100 bytes $\approx$ 25 MB/s.
    • $25 \text{ MB/s} \times 86,400 \approx 2.1 \text{ TB/day}$.
    • We need "Cold Storage" (like S3) for this history.
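The arithmetic above is easy to sanity-check with a few lines (a throwaway script, not part of the design):

```python
# Sanity check of the back-of-envelope numbers above.

drivers = 1_000_000
update_interval_s = 4
location_qps = drivers // update_interval_s           # writes per second
print(location_qps)                                   # 250000

rides_per_day = 10_000_000
avg_ride_qps = rides_per_day / 86_400                 # seconds in a day
peak_ride_qps = avg_ride_qps * 5                      # ~580, rounded up to ~600
print(round(avg_ride_qps), round(peak_ride_qps))

bytes_per_update = 100
mb_per_s = location_qps * bytes_per_update / 1e6      # ~25 MB/s
tb_per_day = mb_per_s * 86_400 / 1e6                  # ~2.16 TB/day
print(round(mb_per_s), round(tb_per_day, 2))
```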


5. API design

We use WebSockets for the high-frequency location stream and REST for transactional actions.

1. Update Location (Driver -> Server)

  • Protocol: WebSocket
  • Payload: { "lat": 37.77, "lng": -122.41, "status": "available" }
  • Response: ACK

2. Request Ride (Rider -> Server)

  • Method: POST /v1/trips
  • Body: { "pickup_lat": ..., "pickup_lng": ..., "dest_lat": ..., "service_type": "standard" }
  • Response: { "trip_id": "t123", "status": "matching", "eta": 5 }

3. Accept Ride (Driver -> Server)

  • Method: POST /v1/trips/{trip_id}/accept
  • Body: { "driver_id": "d456" }
  • Response: 200 OK or 409 Conflict (if ride is no longer available).

4. Trip Status Update

  • Method: PATCH /v1/trips/{trip_id}
  • Body: { "status": "picked_up" }
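As a rough sketch of how the POST /v1/trips handler could map to the payloads above (framework-agnostic and in-memory; the function and store names are illustrative, and in production this would be a web-framework route writing to SQL):

```python
import itertools
import json

_trip_ids = itertools.count(1)
TRIPS = {}  # trip_id -> trip record (stands in for the Trips table)

def request_trip(body: dict) -> dict:
    """Create a trip in 'matching' state and return the API response shape."""
    trip_id = f"t{next(_trip_ids)}"
    TRIPS[trip_id] = {
        "trip_id": trip_id,
        "pickup": (body["pickup_lat"], body["pickup_lng"]),
        "dest_lat": body["dest_lat"],
        "service_type": body.get("service_type", "standard"),
        "status": "matching",
    }
    # The ETA is a placeholder here; the real value comes from the routing engine.
    return {"trip_id": trip_id, "status": "matching", "eta": 5}

resp = request_trip({"pickup_lat": 37.77, "pickup_lng": -122.41,
                     "dest_lat": 37.79, "service_type": "standard"})
print(json.dumps(resp))
```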

6. High level architecture

We separate the system into two distinct parts: the Location System (handles high-volume, ephemeral data) and the Trip System (handles business logic and transactions).


Components:

  1. Load Balancer: Routes HTTP requests and terminates SSL.

  2. WebSocket Gateway: Maintains persistent connections with millions of drivers.

  3. Location Service: Ingests the 250k QPS stream. Updates the cache.

  4. Redis (Geospatial): Stores the current location of every active driver.

  5. Dispatch Service: The "Brain." It queries Redis to find nearby drivers and sends offers.

  6. Trip Service: Manages the state machine (Requested -> Matched -> In Progress) and writes to SQL.

  7. PostgreSQL: Stores durable data (User profiles, Trip logs, Payments).

7. Data model

1. Primary Database (PostgreSQL)

Used for critical data requiring ACID guarantees.

  • Trips Table:
    • trip_id (PK)
    • rider_id (FK)
    • driver_id (FK) - Nullable initially
    • status (Enum: REQUESTED, MATCHING, IN_PROGRESS, COMPLETED, CANCELLED)
    • pickup_loc, dropoff_loc
    • fare
    • created_at, updated_at

2. Ephemeral Storage (Redis)

Used for high-speed spatial lookups.

  • Key: driver_locations
  • Value: Geospatial Set (Driver ID, Lat, Long).
  • Logic: We use GEOADD to update driver positions and GEOSEARCH (or the older GEORADIUS, deprecated since Redis 6.2) to find drivers within X kilometers.
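The two commands above, as they would look in a redis-cli session (coordinates and member names are illustrative; GEOSEARCH requires Redis 6.2+, and note that GEOADD takes longitude before latitude):

```text
# Register / refresh two drivers' positions (lng lat member):
GEOADD driver_locations -122.41 37.77 driver:42
GEOADD driver_locations -122.40 37.78 driver:43

# Find drivers within 3 km of the rider's pickup point, closest first:
GEOSEARCH driver_locations FROMLONLAT -122.41 37.77 BYRADIUS 3 km ASC
```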

3. History Storage (Cassandra or S3)

Used for analytics (debugging accidents, calculating demand maps).

  • The Location Service batches updates to a Kafka queue, which dumps data into a data lake. We do not query this for live features.
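One way the batching inside the Location Service might look (a pure-Python sketch; the injected `produce` callable stands in for a real Kafka producer, and the thresholds are illustrative, not tuned values):

```python
import time
from typing import Callable, List

class LocationBatcher:
    """Buffers GPS updates and flushes them to a producer in batches.

    Batching amortizes per-message queue overhead; 250k tiny messages/sec
    would otherwise overwhelm the producer.
    """

    def __init__(self, produce: Callable[[List[dict]], None],
                 max_batch: int = 500, max_age_s: float = 1.0):
        self.produce = produce
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self.buffer: List[dict] = []
        self.oldest = None  # monotonic time of the oldest buffered update

    def add(self, update: dict) -> None:
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(update)
        # Flush when the batch is full or the oldest update is getting stale.
        if (len(self.buffer) >= self.max_batch or
                time.monotonic() - self.oldest >= self.max_age_s):
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.produce(self.buffer)
            self.buffer = []

sent = []
batcher = LocationBatcher(produce=sent.append, max_batch=3)
for i in range(7):
    batcher.add({"driver_id": i, "lat": 37.77, "lng": -122.41})
batcher.flush()
print([len(b) for b in sent])  # [3, 3, 1]
```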

8. Core flows end to end

This section outlines the two most critical operational loops in a ride-sharing system: tracking where drivers are (Flow 1) and connecting them with riders (Flow 2).

Flow 1: Driver Location Update (The "Heartbeat")

This flow is designed to answer the question: "Who is available right now?" It favors high throughput and low latency over permanent storage.

1. Emit: The 4-Second Interval

  • The Logic: The Driver App wakes up every 4 seconds to grab GPS coordinates.

  • Why not every 1 second? Real-time accuracy is good, but sending data every second drains the driver's battery and consumes massive amounts of cellular data bandwidth.

  • Why not every 30 seconds? A car moving at 60 km/h (37 mph) travels 500 meters in 30 seconds. If the data is that old, the dispatch system might assign a ride to a driver who has already driven past the pickup point. 4 seconds is the "Goldilocks" zone—fresh enough for matching, but efficient enough for hardware.

2. Send: The WebSocket Connection

  • The Mechanism: Instead of opening a new HTTP connection for every update (which involves a heavy TCP handshake overhead), the app keeps a persistent WebSocket connection open.

  • Payload: The app sends a lightweight JSON packet: {driver_id, lat, lng, timestamp}.

  • Efficiency: This reduces latency to milliseconds, ensuring the server sees the car exactly where it is now, not where it was 2 seconds ago due to network lag.

3. Update: Redis & GEOADD

  • The Storage Choice: We do not write these updates to a traditional SQL database (like PostgreSQL or MySQL) immediately. SQL databases are too slow for millions of write operations per second.

  • The Solution: We use Redis, an in-memory data structure store.

  • The Command (GEOADD): Redis has specialized commands for geospatial data. GEOADD interleaves the coordinates into a 52-bit Geohash-style integer, stored as the score of a sorted-set member.

    • Technical Detail: This allows Redis to store the driver in a sorted set, enabling extremely fast queries like "Find points within 2km of X."

4. Expire: The TTL (Time To Live) Strategy

  • The Problem: Mobile networks are unreliable. Drivers drive into tunnels, their phones die, or the app crashes. If a driver disconnects without hitting "Log Off," they become a "Ghost Driver"—appearing on the map but unable to accept rides.

  • The Fix: Every time the location updates, we refresh a 15-second TTL on that driver's record in Redis. (Members of a Redis geo set cannot expire individually, so in practice this is a companion per-driver key with a TTL, plus a periodic sweep that removes stale members from the geo set.)

  • How it works:

    • 0s: Driver sends update. Redis Key expiry set to 15s.
    • 4s: Driver sends update. Redis Key expiry reset to 15s.
    • 8s: Driver enters a tunnel (loses internet).
    • 23s: No update received. The TTL hits 0. Redis automatically deletes the key.
  • Result: The driver vanishes from the "Available" pool automatically. No complex "disconnect" logic is needed.
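The expiry logic can be sketched without Redis by tracking a last-seen timestamp per driver (in Redis this would be GEOADD plus a companion key with EXPIRE; names and timestamps here are illustrative):

```python
import time

TTL_S = 15

last_seen = {}   # driver_id -> timestamp of last heartbeat
positions = {}   # driver_id -> (lat, lng)

def heartbeat(driver_id, lat, lng, now=None):
    """Record a location update; equivalent to GEOADD + EXPIRE in Redis."""
    now = time.monotonic() if now is None else now
    positions[driver_id] = (lat, lng)
    last_seen[driver_id] = now

def available_drivers(now=None):
    """Only drivers whose last heartbeat is fresher than the TTL."""
    now = time.monotonic() if now is None else now
    return [d for d, t in last_seen.items() if now - t < TTL_S]

# Simulated timeline from the walkthrough above:
heartbeat("driver:42", 37.77, -122.41, now=0)   # 0s: update, TTL set
heartbeat("driver:42", 37.78, -122.41, now=4)   # 4s: update, TTL reset
# 8s: driver enters a tunnel -- no more heartbeats arrive
print(available_drivers(now=10))   # ['driver:42']  (still within the TTL)
print(available_drivers(now=23))   # []             (TTL expired: ghost removed)
```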


Flow 2: Matching a Ride (The "Dispatch")

This flow manages the transaction of a ride. It requires strict data consistency (ACID) because money and contracts are involved.

1. Request: Trip Creation

  • Action: The Rider taps "Request."
  • State: The Trip Service creates a row in the primary database (SQL).
    • trip_id: 101
    • rider_id: 50
    • status: REQUESTED
  • Importance: This is the "source of truth." If the system crashes, we still have a record that this rider wants a ride.

2. Search: The Spatial Query

  • Query: The Dispatch Service asks Redis (using GEORADIUS or GEOSEARCH): "Give me the IDs of drivers within a 3km radius of this lat/long."
  • Speed: Because Redis stores locations in memory using Geohashes, it can return the 10 closest IDs in roughly 1-2 milliseconds, even with millions of drivers.

3. Filter: Business Logic

Redis only knows where drivers are, not who they are. The service must filter the list of IDs returned by Redis:

  • Availability: Is the driver's status actually AVAILABLE in the main DB? (Maybe they just accepted another ride 1 millisecond ago).
  • Blocking: Has this driver previously blocked this rider (or vice versa)?
  • Vehicle Type: Did the rider request "UberXL" but the closest driver is "UberX"?

4. Offer: The Push

  • Action: The system picks the absolute closest driver (Driver A) after filtering.
  • Transport: It sends a message via the WebSocket Gateway to Driver A's phone: "New Ride! 3 mins away. $15 fare. Accept?"

5. Driver Decision: The Critical "Lock"

This is the most complex part of the flow due to Race Conditions.

Scenario:

Driver A sees the offer.

  • If Accepted:
    1. Request hits Trip Service.
    2. Database Transaction: The service runs a query similar to:

       UPDATE trips
       SET status = 'MATCHED', driver_id = 'DriverA'
       WHERE trip_id = 101 AND status = 'REQUESTED';

    3. The Check: The WHERE status = 'REQUESTED' clause is vital. If another process (or error) already matched this trip, the update affects 0 rows. The transaction fails, and we tell Driver A "Sorry, ride no longer available."
    4. Success: If the row updates, we lock the deal and notify the rider.
  • If Ignored/Declined:
    1. The system waits for a timeout (e.g., 15s) or an explicit "Decline" button press.

    2. The Dispatch Service removes Driver A from the candidate list.

    3. It picks the next closest driver (Driver B) from the original list (or re-queries Redis) and repeats the loop.
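The compare-and-set accept from step 5 works on any SQL engine; a sqlite3 sketch (table schema and IDs are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trips (
    trip_id   INTEGER PRIMARY KEY,
    status    TEXT NOT NULL,
    driver_id TEXT)""")
conn.execute("INSERT INTO trips (trip_id, status) VALUES (101, 'REQUESTED')")

def accept_ride(trip_id: int, driver_id: str) -> bool:
    """Atomically claim the trip; False means someone else already won it."""
    cur = conn.execute(
        "UPDATE trips SET status = 'MATCHED', driver_id = ? "
        "WHERE trip_id = ? AND status = 'REQUESTED'",
        (driver_id, trip_id))
    conn.commit()
    return cur.rowcount == 1   # 0 rows -> the WHERE guard failed

print(accept_ride(101, "DriverA"))  # True  -- first accept wins the race
print(accept_ride(101, "DriverB"))  # False -- maps to 409 Conflict for Driver B
```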


9. Caching and read performance

  • Driver Locations (Hot Data):
    • We treat Redis as the source of truth for "current location." We do not read from SQL for map views. This ensures sub-millisecond read times.
  • ETA Calculations:
    • Routing (calculating drive time) is expensive. We cache ETAs for common routes (e.g., "SFO Airport" to "Downtown") in a dedicated cache cluster to avoid calling the Routing Engine repeatedly.
  • User Profiles:
    • Rider/Driver names and photos are cached in a standard LRU Redis cache to reduce load on Postgres.

10. Storage, indexing and media

  • Geospatial Indexing:
    • To efficiently find "drivers near X," we cannot scan the whole list.
    • Geohashing: We divide the world into a grid. Each cell has a unique ID. Drivers in the same cell share the same ID prefix.
    • Implementation: Redis GEO commands handle this mathematically (using Geohash or similar). It turns a 2D search into a fast O(log N) lookup.
  • Media:
    • Driver profile photos and car images are stored in Object Storage (S3).
    • They are served via CDN (CloudFront) so images load instantly even if the user is on a slow network.
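The grid idea behind geohashing can be shown in a toy form: alternately halve the longitude and latitude ranges, emitting one bit per split (real geohashes base32-encode these interleaved bits). Nearby points end up sharing a long bit prefix:

```python
def geohash_bits(lat: float, lng: float, precision: int = 24) -> str:
    """Binary geohash: alternately subdivide longitude and latitude ranges."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    bits = []
    for i in range(precision):
        if i % 2 == 0:  # even bits refine longitude
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits.append("1")
                lng_lo = mid
            else:
                bits.append("0")
                lng_hi = mid
        else:           # odd bits refine latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append("1")
                lat_lo = mid
            else:
                bits.append("0")
                lat_hi = mid
    return "".join(bits)

def common_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

sf_a = geohash_bits(37.77, -122.41)   # two nearby points in San Francisco
sf_b = geohash_bits(37.78, -122.40)
london = geohash_bits(51.50, -0.12)
print(common_prefix_len(sf_a, sf_b), common_prefix_len(sf_a, london))
```

The SF pair shares a much longer prefix than SF vs. London, which is exactly why a prefix (cell) lookup narrows a 2D search to a handful of candidates.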

11. Scaling strategies

1. Geographic Sharding (The "Golden Key"):

  • Ride-sharing is a regional problem. A request in New York does not affect London.
  • We shard our Location Service, Redis, and Dispatch Service by Region (e.g., US-East, EU-West, Asia-Pacific).
  • This isolates failures. If the "US-East" cluster goes down, Europe is unaffected.

2. Batch Matching (Optimizing High Load):

  • In high-density areas (e.g., a stadium after a concert), "Greedy Matching" (first driver gets the ride) is inefficient.
  • Strategy: We accumulate requests for a short window (e.g., 2 seconds) and run a "Bipartite Matching" algorithm to pair the pool of riders to the pool of drivers in a way that minimizes total wait time for everyone.
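A tiny illustration of why batch matching beats greedy matching, using a brute-force optimal assignment over a toy cost matrix (fine for the small pools of a 2-second window; production systems use proper bipartite-matching algorithms):

```python
from itertools import permutations

# wait[r][d] = estimated pickup time (minutes) for rider r with driver d.
# Toy numbers chosen so greedy is suboptimal.
wait = [
    [2, 3],   # rider 0
    [3, 9],   # rider 1
]

def greedy_total(wait):
    """Each rider, in order, grabs the closest remaining driver."""
    taken, total = set(), 0
    for row in wait:
        d = min((d for d in range(len(row)) if d not in taken),
                key=lambda d: row[d])
        taken.add(d)
        total += row[d]
    return total

def optimal_total(wait):
    """Brute-force minimum-total-wait assignment (bipartite matching)."""
    n = len(wait)
    return min(sum(wait[r][p[r]] for r in range(n))
               for p in permutations(range(n)))

# Greedy: rider 0 takes the 2-minute driver, forcing rider 1 into a
# 9-minute pickup (total 11). The batch optimum pays 3 + 3 = 6.
print(greedy_total(wait), optimal_total(wait))  # 11 6
```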

3. Database Replication:

  • PostgreSQL uses a Leader-Follower setup.
  • Writes (Ride Requests) go to the Leader.
  • Reads (View History, View Profile) go to Followers.

12. Reliability, failure handling and backpressure

  • Graceful Degradation:
    • If the Google Maps API (Routing) fails, we fall back to straight-line distance (e.g., haversine on the raw coordinates). It's less accurate for ETAs but keeps the business running.
  • Circuit Breakers:
    • If the Dispatch Service is overwhelmed, the Trip Service will stop sending requests and return "High Demand, please wait" to riders instead of crashing the backend.
  • Redundancy:
    • Redis runs in Cluster mode with replicas. If a primary node fails, a replica is promoted automatically to prevent losing location data.
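The circuit-breaker behavior described above, as a minimal sketch (thresholds and the `failing` downstream call are illustrative):

```python
import time

class CircuitBreaker:
    """Stops calling a failing downstream service for a cool-down period."""

    def __init__(self, max_failures=3, reset_after_s=30.0):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # set when the breaker trips open

    def call(self, fn, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_after_s:
                # Fail fast: surface "High Demand, please wait" upstream
                # without touching the overwhelmed downstream service.
                raise RuntimeError("circuit open")
            self.opened_at = None          # half-open: let one call through
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = now       # trip the breaker
            raise
        self.failures = 0                  # success resets the count
        return result

def failing():
    raise ValueError("downstream timeout")

cb = CircuitBreaker(max_failures=2, reset_after_s=30)
for _ in range(2):
    try:
        cb.call(failing, now=0)
    except ValueError:
        pass
try:
    cb.call(failing, now=1)
except RuntimeError as e:
    print(e)                               # circuit open -- no downstream call
print(cb.call(lambda: "ok", now=100))      # cool-down elapsed: call goes through
```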

13. Security, privacy and abuse

  • Privacy (Phone Masking):
    • Riders and Drivers never see real phone numbers. We use a proxy service (like Twilio) to bridge calls anonymously.
  • Data Fuzzing:
    • In the analytics warehouse, we "fuzz" (randomize) the start/end coordinates of trips so employees cannot look up exactly where a celebrity or user lives.
  • Fraud:
    • GPS Spoofing: We run algorithms to detect "Teleportation" (impossible travel speeds) and ban drivers using fake GPS apps to farm bonuses.

14. Bottlenecks and next steps

  • Bottleneck: Redis Memory.
    • Risk: If we store too much metadata with the location, Redis fills up.
    • Fix: Store only DriverID + Lat/Long in Redis. Fetch other details (Car type, Name) from a separate cache only when needed.
  • Bottleneck: Map Service Costs.
    • Risk: Paying external providers (Google/Mapbox) for every ETA is expensive.
    • Next Step: Build an internal routing engine using OpenStreetMap (OSRM) to calculate routes for free.

Summary

  • We solved the write-heavy location problem using Redis/Geospatial.
  • We solved scale using Regional Sharding.
  • We ensured reliability via WebSockets and Graceful Fallbacks.