Chapter 8: System Design Interview Mastery

8.11 Design a food delivery system (DoorDash / Uber Eats)

1. Restate the Problem and Pick the Scope

We are designing a food delivery platform, similar to DoorDash or Uber Eats, where customers can browse nearby restaurants, place food orders, and have a delivery driver bring the food to their door. The system coordinates three different user types in real time: customers ordering food, restaurants preparing it, and drivers delivering it.

Main user groups and actions:

  • Customers -- browse restaurants, view menus, place orders, track delivery in real time, and pay.
  • Restaurants -- receive incoming orders, update order status (accepted, preparing, ready for pickup), and manage their menu.
  • Drivers (Dashers) -- go online, get matched to delivery requests, navigate to the restaurant, pick up food, and deliver it to the customer.

Scope decisions:

  • We will focus on the core order flow: restaurant discovery, placing an order, driver matching, and real-time order tracking.
  • We will NOT cover: restaurant onboarding/management portal, detailed menu management CMS, promotional campaigns/coupons engine, tipping, ratings/reviews, customer support chat, or the payment processing internals (we treat payments as an external service). These can be layered on later.

2. Clarify Functional Requirements

Must-Have Features

  • A customer can search for nearby restaurants based on their delivery address (location-based discovery).
  • A customer can view a restaurant's menu with item names, descriptions, prices, and availability.
  • A customer can add items to a cart, customize them (e.g., "no onions"), and place an order.
  • The system calculates an order total including item prices, delivery fee, taxes, and estimated delivery time.
  • After an order is placed, the system sends it to the restaurant for acceptance.
  • The system matches an available nearby driver to pick up the order.
  • The customer can track the order status in real time: confirmed, preparing, driver en route to restaurant, picked up, driver en route to customer, delivered.
  • The customer can see the driver's live location on a map during delivery.
  • The driver can accept or decline delivery requests and update status (arrived at restaurant, picked up, delivered).
  • The restaurant can accept or reject an incoming order and mark it as "ready for pickup."

Nice-to-Have Features

  • Estimated delivery time shown before placing the order, updated dynamically during delivery.
  • Push notifications for key status changes (order confirmed, driver assigned, food delivered).
  • Order history for customers.


3. Clarify Non-Functional Requirements

Metric | Assumption / Target
Monthly active users (MAU) | 50 million customers; 500K restaurants; 2 million drivers
Daily active users (DAU) | 10 million customers; 300K restaurants; 500K drivers
Orders per day | 5 million (each active customer orders ~0.5 times/day on average)
Read:Write ratio | ~10:1 -- restaurant browsing and menu views far exceed order placements
Order placement latency | < 500 ms p99 -- the customer should see confirmation quickly
Restaurant/menu search latency | < 200 ms p99 -- browsing must feel instant
Location update latency | < 1 second -- driver location must stream to the customer in near real time
Availability | 99.99% (four nines) -- orders involve money and real-world logistics; downtime directly loses revenue
Consistency | Strong consistency for orders and payments (no double charges, no lost orders); eventual consistency is acceptable for restaurant search results and driver location
Data retention | Orders retained for 3 years (financial/legal); location data retained for 30 days

4. Back-of-the-Envelope Estimates

Write QPS (order placements)

Orders/day = 5 million
Order write QPS = 5M / 86,400 ≈ 58 QPS (average)
Peak QPS = 58 × 5 ≈ 290 QPS (dinner rush)

Each order triggers multiple writes: order record, payment, restaurant notification, driver match. Effective write QPS is ~4x the order rate.

Effective peak write QPS = 290 × 4 ≈ 1,160 QPS

Read QPS (restaurant browsing, menu views)

Read:Write ratio is 10:1 based on order writes.

Read QPS (avg) = 58 × 10 = 580 QPS
Peak read QPS = 580 × 5 = ~2,900 QPS

These are API-level reads. Each "browse nearby restaurants" call may fan out to multiple DB queries.

Driver location updates

500K active drivers, each sending a GPS ping every 5 seconds.

Location update QPS = 500,000 / 5 = 100,000 QPS

This is the highest-throughput write in the system. It requires a specialized store optimized for rapid, ephemeral writes (not a traditional relational DB).

Storage

Order data: Each order record is ~2 KB (items, addresses, status history, payment reference).

Order storage/day = 5M × 2 KB = 10 GB/day
Order storage/year = 10 GB × 365 = ~3.6 TB/year

Driver location: Each location ping is ~50 bytes (driver_id, lat, lng, timestamp). We only keep 30 days.

Location data/day = 100,000 QPS × 50 bytes × 86,400 = ~432 GB/day
Location data/month = 432 GB × 30 = ~13 TB

Location data is high-volume but short-lived. A time-series or in-memory store works well.

Restaurant and menu data: ~500K restaurants × 50 menu items × 500 bytes = ~12.5 GB total. This is small and mostly static.
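
The arithmetic above can be collected into a few lines and re-checked. Every input below is an assumption from this section, not a measurement:

```python
# Re-derives the section's back-of-the-envelope numbers. All inputs are the
# assumptions stated above, not measured values.

ORDERS_PER_DAY = 5_000_000
SECONDS_PER_DAY = 86_400
PEAK_MULTIPLIER = 5            # dinner-rush factor
WRITES_PER_ORDER = 4           # order record, payment, notification, match
READ_WRITE_RATIO = 10

avg_order_qps = ORDERS_PER_DAY / SECONDS_PER_DAY              # ~58
peak_order_qps = avg_order_qps * PEAK_MULTIPLIER              # ~290
effective_peak_write_qps = peak_order_qps * WRITES_PER_ORDER  # ~1,160

avg_read_qps = avg_order_qps * READ_WRITE_RATIO               # ~580
peak_read_qps = avg_read_qps * PEAK_MULTIPLIER                # ~2,900

ACTIVE_DRIVERS = 500_000
PING_INTERVAL_S = 5
location_qps = ACTIVE_DRIVERS / PING_INTERVAL_S               # 100,000

ORDER_SIZE_BYTES = 2_000
PING_SIZE_BYTES = 50
order_gb_per_day = ORDERS_PER_DAY * ORDER_SIZE_BYTES / 1e9    # 10 GB/day
location_gb_per_day = (
    location_qps * PING_SIZE_BYTES * SECONDS_PER_DAY / 1e9    # ~432 GB/day
)
```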


5. API Design

5.1 Search Nearby Restaurants

Field | Value
Method & Path | GET /api/v1/restaurants?lat={lat}&lng={lng}&radius=5km&cuisine=pizza&page=1
Success (200) | { "restaurants": [ { "id", "name", "cuisine_type", "rating", "delivery_fee", "estimated_time", "distance", "thumbnail_url" } ... ], "next_page": 2 }
Error codes | 400 -- missing location

5.2 Get Restaurant Menu

Field | Value
Method & Path | GET /api/v1/restaurants/{restaurant_id}/menu
Success (200) | { "restaurant": { "name", "address", "hours" }, "categories": [ { "name": "Appetizers", "items": [ { "id", "name", "price", "description", "available", "customizations" } ] } ] }
Error codes | 404 -- restaurant not found

5.3 Place an Order

Field | Value
Method & Path | POST /api/v1/orders
Request body | { "restaurant_id": "...", "items": [ { "item_id", "quantity", "customizations": ["no onions"] } ], "delivery_address": { "lat", "lng", "formatted" }, "payment_method_id": "..." }
Success (201) | { "order_id": "...", "status": "pending", "estimated_delivery": "35-45 min", "total": 28.50 }
Error codes | 400 -- empty cart, invalid items; 402 -- payment failed; 422 -- restaurant closed

5.4 Get Order Status

Field | Value
Method & Path | GET /api/v1/orders/{order_id}
Success (200) | { "order_id", "status", "restaurant", "items", "driver": { "name", "lat", "lng" }, "estimated_delivery", "status_history": [...] }
Error codes | 404 -- order not found

5.5 Real-Time Order Tracking (WebSocket)

Field | Value
Connection | WSS /ws/orders/{order_id}/track
Server pushes | { "type": "status_update", "status": "driver_picked_up" } and { "type": "location_update", "lat": 31.52, "lng": 74.35 }

5.6 Driver: Update Location

Field | Value
Method & Path | POST /api/v1/drivers/location
Request body | { "lat": 31.52, "lng": 74.35, "timestamp": "..." }
Success (200) | { "ack": true }

5.7 Driver: Accept/Decline Delivery Request

Field | Value
Method & Path | POST /api/v1/drivers/deliveries/{delivery_id}/accept or /decline
Success (200) | { "delivery_id", "restaurant_address", "customer_address", "pickup_eta" }

5.8 Restaurant: Update Order Status

Field | Value
Method & Path | PATCH /api/v1/restaurants/orders/{order_id}
Request body | { "status": "accepted" } (or "preparing", "ready_for_pickup", "rejected")
Success (200) | { "order_id", "status" }

6. High-Level Architecture


Component Responsibilities

  • API Gateway + Load Balancer -- routes requests to the correct microservice, terminates TLS, enforces rate limits, handles authentication.
  • Restaurant Service -- handles restaurant search (geo queries), menu retrieval, and restaurant-side order management. Reads are heavy and cacheable.
  • Order Service -- the most critical service. Handles cart validation, order creation, payment orchestration, and order state machine transitions. Every state change emits an event to Kafka.
  • Driver Matching Service -- the "brain" of delivery logistics. When an order is ready for driver assignment, this service finds the best available driver nearby, sends them a delivery request, and handles accept/decline/timeout logic.
  • Location Service -- ingests high-frequency GPS pings from drivers (100K QPS) and stores them in Redis with geospatial indexes. This data powers both driver matching (find nearby drivers) and live tracking (stream driver location to customer).
  • Tracking Service -- manages WebSocket connections from customers tracking their orders. It subscribes to order status events (from Kafka) and driver location updates (from Redis pub/sub or polling), and pushes them to the connected client.
  • Notification Service -- a background consumer that listens to order events on Kafka and sends push notifications (e.g., "Your order has been picked up!").
  • Kafka (Event Bus) -- the backbone for async communication. Order events flow through Kafka so that downstream services (matching, tracking, notifications) are decoupled from the order service.

7. Data Model

Database Choice

  • PostgreSQL for orders, restaurants, menus, and users. These are structured, relational, and benefit from ACID transactions (especially orders + payments).
  • Redis for driver locations (geospatial index using GEOADD/GEORADIUS), restaurant search cache, and ephemeral real-time data.
  • Kafka for the event stream connecting services.

Table: users (customers and drivers)

Column | Type | Notes
user_id | BIGINT PK | Snowflake ID
role | ENUM | 'customer', 'driver', 'restaurant_owner'
name | VARCHAR |
email | VARCHAR | UNIQUE
phone | VARCHAR |
default_address | JSONB | { "lat", "lng", "formatted" }
created_at | TIMESTAMP |

Table: restaurants

Column | Type | Notes
restaurant_id | BIGINT PK |
name | VARCHAR |
address | TEXT |
location | POINT (PostGIS) | Geospatial index for nearby search
cuisine_type | VARCHAR | Index for filtering
rating | DECIMAL |
is_open | BOOLEAN |
delivery_radius_km | INTEGER |
operating_hours | JSONB |

Table: menu_items

Column | Type | Notes
item_id | BIGINT PK |
restaurant_id | BIGINT FK | Index -- fetch all items for a restaurant
name | VARCHAR |
description | TEXT |
price | DECIMAL |
category | VARCHAR | e.g., "Appetizers", "Entrees"
available | BOOLEAN |
customizations | JSONB | [ { "name": "Size", "options": ["Small","Large"] } ]

Table: orders

Column | Type | Notes
order_id | BIGINT PK | Snowflake ID
customer_id | BIGINT FK | Index -- order history
restaurant_id | BIGINT FK | Index -- restaurant's active orders
driver_id | BIGINT FK | Nullable until assigned
status | ENUM | pending, confirmed, preparing, ready, driver_assigned, driver_at_restaurant, picked_up, delivering, delivered, cancelled
items | JSONB | Snapshot of ordered items with prices
delivery_address | JSONB | { "lat", "lng", "formatted" }
subtotal | DECIMAL |
delivery_fee | DECIMAL |
tax | DECIMAL |
total | DECIMAL |
payment_id | VARCHAR | Reference to external payment
estimated_delivery_at | TIMESTAMP |
created_at | TIMESTAMP |
updated_at | TIMESTAMP |

Table: order_status_history

Column | Type | Notes
id | BIGINT PK |
order_id | BIGINT FK | Index
status | ENUM |
changed_at | TIMESTAMP |
changed_by | BIGINT | User who triggered the change

Driver Locations in Redis

Key: driver_locations (Redis sorted set with geospatial index)
Value: GEOADD driver_locations {lng} {lat} {driver_id}
Query: GEORADIUS driver_locations {restaurant_lng} {restaurant_lat} 5 km

Each driver's latest location is stored as a geospatial entry. The GEORADIUS command (superseded by GEOSEARCH in Redis 6.2, with the same semantics) finds all drivers within a given radius of a restaurant, which powers the driver matching algorithm.
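
To make the semantics concrete, here is a pure-Python sketch of what that radius query computes. In production this is a single Redis call over a geohash-backed sorted set; the haversine loop below is only an illustration of the result it returns (nearest drivers first, capped at a count):

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two (lat, lng) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def drivers_within_radius(driver_locations, lat, lng, radius_km, limit=20):
    """Mimics GEORADIUS ... COUNT {limit} ASC: nearest drivers first.

    driver_locations: {driver_id: (lat, lng)} -- the data GEOADD maintains.
    """
    hits = [
        (haversine_km(lat, lng, d_lat, d_lng), driver_id)
        for driver_id, (d_lat, d_lng) in driver_locations.items()
        if haversine_km(lat, lng, d_lat, d_lng) <= radius_km
    ]
    hits.sort()  # ascending distance == ASC
    return [driver_id for _, driver_id in hits[:limit]]
```

Usage: `drivers_within_radius(positions, restaurant_lat, restaurant_lng, 5.0)` returns the same candidate list the matching service would get from Redis.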

Indexes Summary

  • restaurants: PostGIS spatial index on location -- powers "nearby restaurants" queries.
  • menu_items: Index on restaurant_id -- fetches all items for a menu.
  • orders: Index on customer_id -- order history. Index on restaurant_id -- restaurant's active orders. Index on driver_id -- driver's current delivery. Index on status -- find active orders.
  • order_status_history: Index on order_id -- full timeline of an order.

8. Core Flows -- End to End

Flow 1: Customer Places an Order

This is the most critical flow in the entire system. It involves money, coordinates three parties, and must be reliable above all else.

  • Step 1 -- Customer browses and adds items to cart. The customer opens the app, searches for nearby restaurants (a geo query against the Restaurant Service), taps on a restaurant, and views the menu. They add items to a local cart on their device. No server writes happen yet -- the cart is client-side until checkout.

  • Step 2 -- Customer taps "Place Order." The app sends POST /api/v1/orders to the API gateway with the cart contents, delivery address, and payment method. The load balancer routes the request to an Order Service instance.

  • Step 3 -- Order Service validates the order. The service checks: Is the restaurant currently open? Are all menu items still available? Is the delivery address within the restaurant's delivery radius? Are item prices correct (prevents client-side tampering)? If any check fails, it returns a descriptive 400/422 error to the customer immediately.

  • Step 4 -- Order Service calculates totals. The service computes the subtotal (sum of item prices), delivery fee (based on distance), taxes, and the final total. It also estimates delivery time based on historical data (restaurant preparation time + estimated drive time).

  • Step 5 -- Order Service initiates payment. The service calls the external Payment Service to authorize (not yet capture) the total amount on the customer's payment method. If payment authorization fails, the service returns 402 to the customer. No order is created.

    Why authorize, not charge? We place a hold on the funds. We only capture (actually charge) after the restaurant accepts the order. If the restaurant rejects it, we release the hold and the customer is never charged.

  • Step 6 -- Order Service writes the order to the database. With payment authorized, the service inserts a new row into the orders table with status pending and all the order details. It also writes an initial entry to order_status_history. This is a PostgreSQL transaction -- either the order is fully created, or nothing is written.

  • Step 7 -- Order Service publishes an "order_created" event to Kafka. This event contains the order_id, restaurant_id, and all relevant details. This is the trigger for everything that happens next. The service does not wait for downstream processing.

  • Step 8 -- Return success to the customer. The Order Service responds with HTTP 201, the order_id, status "pending", and the estimated delivery time. The customer sees a confirmation screen: "Your order has been placed! Waiting for restaurant confirmation." Total time from "Place Order" tap to this screen: ~300-500 ms.

  • Step 9 -- Restaurant receives the order (async). The Restaurant Service (or a dedicated Order Routing worker) consumes the "order_created" event from Kafka. It sends a push notification and an in-app alert to the restaurant's tablet app: "New order #12345 -- 2x Margherita Pizza, 1x Garlic Bread." The restaurant owner sees the order and taps "Accept" or "Reject."

  • Step 10 -- Restaurant accepts the order. The restaurant app calls PATCH /api/v1/restaurants/orders/{order_id} with status "accepted." The Order Service updates the order status to confirmed, captures the payment (calls the Payment Service to finalize the charge), and publishes an "order_confirmed" event to Kafka. If the restaurant rejects the order, the status changes to cancelled, the payment hold is released, and the customer is notified.

  • Step 11 -- Driver matching begins (async). The Driver Matching Service consumes the "order_confirmed" event. It now needs to find a driver. The detailed matching flow is described in Flow 2 below.

  • Step 12 -- Customer is notified of confirmation. The Tracking Service (via WebSocket) and the Notification Service (via push notification) both consume the "order_confirmed" event. The customer's app updates to show "Order confirmed! Restaurant is preparing your food."

  • What the customer sees, step by step:

    1. Taps "Place Order" -- sees a loading spinner for ~500 ms.
    2. Confirmation screen: "Order placed! Waiting for restaurant."
    3. Within 1-3 minutes: "Order confirmed! Preparing your food."
    4. Within 5-15 minutes: "Driver assigned! [Driver name] is heading to the restaurant."
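
Steps 3 and 4 (validation and pricing) can be sketched as one server-side function. The fee and tax policies below are illustrative assumptions, not values from the text; the key point is that prices come from the server-side menu, never from the client:

```python
# Sketch of Order Service validation + pricing (Flow 1, steps 3-4).
# TAX_RATE and the delivery-fee formula are illustrative assumptions.

TAX_RATE = 0.08                     # assumed flat tax rate
BASE_FEE, FEE_PER_KM = 1.99, 0.50   # assumed delivery-fee policy

def validate_and_price(order, restaurant, menu):
    """Return priced totals, or raise ValueError (mapped to 400/422)."""
    if not restaurant["is_open"]:
        raise ValueError("restaurant closed")               # -> 422
    if order["distance_km"] > restaurant["delivery_radius_km"]:
        raise ValueError("address outside delivery radius")  # -> 422
    subtotal = 0.0
    for line in order["items"]:
        item = menu.get(line["item_id"])
        if item is None or not item["available"]:
            raise ValueError(f"item unavailable: {line['item_id']}")  # -> 400
        # Price from the server-side menu, never the client (anti-tampering).
        subtotal += item["price"] * line["quantity"]
    delivery_fee = BASE_FEE + FEE_PER_KM * order["distance_km"]
    tax = subtotal * TAX_RATE
    return {
        "subtotal": round(subtotal, 2),
        "delivery_fee": round(delivery_fee, 2),
        "tax": round(tax, 2),
        "total": round(subtotal + delivery_fee + tax, 2),
    }
```

If this function raises, no payment is authorized and no order row is written, matching the fail-fast behavior of step 3.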


Flow 2: Driver Matching and Assignment

This is the logistical heart of the system. It runs every time an order is confirmed and needs a driver.

  • Step 1 -- Driver Matching Service receives the trigger. It consumes the "order_confirmed" event from Kafka. The event contains the restaurant's location and the estimated pickup time.

  • Step 2 -- Find nearby available drivers. The service queries Redis: GEORADIUS driver_locations {restaurant_lng} {restaurant_lat} 5 km COUNT 20 ASC. This returns up to 20 drivers within 5 km of the restaurant, sorted by distance. The service filters this list: only drivers who are currently "online" and not already on a delivery (checked against a driver_status hash in Redis).

  • Step 3 -- Score and rank candidates. The matching algorithm considers: distance to the restaurant (shorter is better), driver's direction of travel (a driver heading toward the restaurant is preferred), the driver's acceptance rate (drivers who frequently decline are ranked lower), and estimated time to arrive at the restaurant. The system picks the top candidate.

  • Step 4 -- Send a delivery request to the chosen driver. The service sends a push notification and an in-app alert to the driver: "New delivery! Pickup: Mario's Pizza (0.8 km away). Dropoff: 123 Main St. Estimated earnings: $8.50. Accept?" The driver has 30 seconds to respond.

  • Step 5 -- Driver accepts. The driver taps "Accept" in their app, which calls POST /api/v1/drivers/deliveries/{delivery_id}/accept. The Matching Service updates the order's driver_id in the database and changes the status to driver_assigned. It publishes a "driver_assigned" event to Kafka.

    What if the driver declines or times out? The Matching Service moves to the next candidate from the ranked list and sends them the request. If no driver accepts after 3 rounds (or 90 seconds total), the system widens the search radius to 10 km and tries again. If still no driver is found, the order is flagged, and the customer is notified of a delay. In extreme cases, the customer can choose to cancel for a full refund.

  • Step 6 -- Customer and restaurant are notified. The Tracking Service pushes a WebSocket update to the customer: "Driver [name] has been assigned and is heading to the restaurant." The restaurant app also shows the driver's name and ETA. The customer can now see the driver's live location on the map.

  • Step 7 -- Driver navigates to the restaurant. The driver app shows turn-by-turn directions (using a maps API). Every 5 seconds, the driver's app sends a GPS ping to the Location Service, which updates Redis. The Tracking Service polls or subscribes to these updates and streams the driver's location to the customer's WebSocket connection.
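
The scoring step (step 3) can be sketched as a weighted ranking function. The weights below are illustrative assumptions; a real system would tune them against historical acceptance and ETA data:

```python
# Sketch of candidate scoring and ranking (Flow 2, steps 2-3).
# Weights are illustrative assumptions, lower score = better candidate.

def score(candidate):
    """Weighted ETA-like score for one driver (lower is better)."""
    return (
        candidate["distance_km"] * 1.0             # closer is better
        + candidate["eta_min"] * 0.5               # faster arrival is better
        - candidate["acceptance_rate"] * 2.0       # frequent decliners rank lower
        - (1.0 if candidate["heading_toward"] else 0.0)  # favorable direction
    )

def rank_candidates(candidates):
    """Filter to online, unassigned drivers; return best candidate first."""
    eligible = [c for c in candidates if c["online"] and not c["on_delivery"]]
    return sorted(eligible, key=score)
```

The matching service offers the delivery to `rank_candidates(...)[0]` and, on decline or timeout, walks down the list exactly as step 5 describes.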


Flow 3: Real-Time Order Tracking

This flow describes how the customer sees live updates from order confirmation through delivery.

  • Step 1 -- Customer opens the tracking screen. The app establishes a WebSocket connection: WSS /ws/orders/{order_id}/track. The Tracking Service accepts the connection, authenticates the user, and subscribes to two data streams for this order:

    • Order status events from Kafka (e.g., "order_confirmed", "driver_assigned", "picked_up", "delivered").
    • Driver location updates from the Location Service (either via Redis pub/sub or by polling Redis every 2-3 seconds).
  • Step 2 -- Status updates flow to the customer. When the restaurant marks the order as "ready for pickup," the Order Service publishes an event to Kafka. The Tracking Service consumes it and pushes a WebSocket message to the customer: { "type": "status_update", "status": "ready_for_pickup" }. The app updates the progress bar and status text.

  • Step 3 -- Location updates flow to the customer. While the driver is en route, the Location Service receives GPS pings and writes them to Redis. The Tracking Service reads the driver's latest location every 2-3 seconds and pushes it to the customer: { "type": "location_update", "lat": 31.5204, "lng": 74.3587 }. The app moves the driver's icon on the map smoothly.

    Why poll Redis instead of using Kafka for location? Location updates are extremely high frequency (100K QPS across all drivers) and ephemeral. Kafka is designed for durable event streams. For location, we only care about the latest position, not the full history. Redis geospatial queries are ideal: fast reads, automatic overwrite of old data, and built-in geo commands.

  • Step 4 -- Driver arrives at the restaurant. The driver taps "Arrived at restaurant" in their app. The status updates to driver_at_restaurant. The customer sees: "Your driver has arrived at the restaurant."

  • Step 5 -- Driver picks up the food. The driver taps "Picked up." Status becomes picked_up. The customer sees: "Your food is on the way!" Now the driver's live location becomes the primary focus of the tracking screen.

  • Step 6 -- Driver delivers the food. The driver arrives at the customer's address and taps "Delivered." Status becomes delivered. The customer receives a push notification: "Your food has been delivered! Enjoy your meal." The WebSocket connection is closed. The order is complete.

  • Step 7 -- Post-delivery processing (async). Background workers finalize the order: mark the payment as captured (if not already done at confirmation), calculate driver earnings, update the driver's delivery count, and log the completed delivery for analytics. None of this affects the customer experience.

  • What the customer sees on the tracking screen:

    • A progress bar: Placed --> Confirmed --> Preparing --> Driver Assigned --> Picked Up --> On the Way --> Delivered.
    • A map showing the driver's live location (updated every 2-3 seconds).
    • An estimated delivery time that adjusts dynamically based on the driver's real-time progress.
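
The Tracking Service's push cycle boils down to translating its two inputs (pending Kafka status events and the driver's latest Redis position) into the WebSocket frames defined in section 5.5. A minimal sketch of that translation:

```python
# Sketch of one Tracking Service push cycle (Flow 3). Frame shapes follow
# the WebSocket API in section 5.5; the cycle cadence is the 2-3 s poll.

def status_frame(event):
    """Kafka order event -> WebSocket status frame."""
    return {"type": "status_update", "status": event["status"]}

def location_frame(lat, lng):
    """Latest Redis position -> WebSocket location frame."""
    return {"type": "location_update", "lat": lat, "lng": lng}

def frames_for_cycle(new_events, driver_position):
    """All pending status events first, then the current driver position
    (if the driver has been assigned and is reporting GPS)."""
    frames = [status_frame(e) for e in new_events]
    if driver_position is not None:
        frames.append(location_frame(*driver_position))
    return frames
```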


9. Caching and Read Performance

What We Cache

  • Restaurant listings by location (Redis): restaurants:geo:{city_zone} -- a pre-computed list of open restaurants in each geographic zone. Updated every few minutes. This avoids hitting PostgreSQL with geo queries on every app open.
  • Restaurant menus (Redis): menu:{restaurant_id} -- the full menu JSON. Menus change infrequently (a few times per day), so caching with a 10-minute TTL is very effective.
  • Driver locations (Redis geospatial): driver_locations -- this IS the primary store for current driver positions. It is not a cache in front of a database; it is the real-time source of truth. Historical locations are written to cold storage separately.
  • Active order status (Redis hash): order:{order_id} -- the current status and key details of active orders. Read by the tracking service on every WebSocket push cycle.

Where the Cache Sits

Redis sits between the application services and PostgreSQL. For restaurant browsing, the read path is: App -> Restaurant Service -> Redis (cache hit) -> return. On cache miss: App -> Restaurant Service -> PostgreSQL (+ PostGIS) -> populate Redis -> return.

For driver locations, Redis IS the primary read/write store. No database in the hot path.
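
The cache-aside path for menus can be sketched as follows. `InMemoryCache` and `MenuDB` are local stand-ins for Redis and PostgreSQL so the read/miss/invalidate flow can be exercised without either:

```python
# Cache-aside read path for menus (section 9). The classes are in-memory
# stand-ins for Redis and PostgreSQL, for illustration only.

MENU_TTL_S = 600  # 10-minute safety-net TTL

class InMemoryCache:
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value, ttl):
        self.data[key] = value  # TTL ignored in this stand-in
    def delete(self, key):
        self.data.pop(key, None)

class MenuDB:
    def __init__(self, menus):
        self.menus, self.reads = menus, 0
    def fetch_menu(self, restaurant_id):
        self.reads += 1         # counts how often PostgreSQL is hit
        return self.menus[restaurant_id]

def get_menu(restaurant_id, cache, db):
    key = f"menu:{restaurant_id}"
    menu = cache.get(key)
    if menu is not None:
        return menu                        # hit: database never touched
    menu = db.fetch_menu(restaurant_id)    # miss: read PostgreSQL
    cache.set(key, menu, ttl=MENU_TTL_S)   # populate for later readers
    return menu

def on_menu_change(restaurant_id, cache):
    cache.delete(f"menu:{restaurant_id}")  # explicit invalidation on write
```

This mirrors the invalidation policy below: explicit delete on write, with the TTL as a safety net for missed invalidations.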

Cache Update and Invalidation

  • Restaurant cache: Updated when a restaurant changes its hours, availability, or menu. The Restaurant Service invalidates the cache key on write. TTL of 10 minutes as a safety net.
  • Menu cache: Invalidated when a restaurant updates an item. Since menu changes are infrequent, a short TTL (5-10 min) plus explicit invalidation on write is sufficient.
  • Order cache: Updated on every status transition by the Order Service. Deleted when the order is completed or after 24 hours.
  • Driver locations: Overwritten every 5 seconds by the driver's GPS ping. No explicit invalidation needed -- the data is always fresh.

Eviction Policy

LRU (Least Recently Used) on the Redis instance level. Restaurant and menu data for less-popular restaurants gets evicted during peak hours, freeing memory for hot data. A cache miss simply results in a database read.

10. Storage, Indexing, and Media

Primary Data Storage

  • PostgreSQL for orders, users, restaurants, menu items. At ~3.6 TB/year for order data, a sharded PostgreSQL setup handles this for years.
  • Redis for driver locations and real-time caches.
  • Time-series or cold storage for historical location data (13 TB/month). After 30 days, this is archived or deleted.

Indexes

  • restaurants: PostGIS spatial index on location -- essential for "nearby" queries. Index on cuisine_type for filtering.
  • menu_items: Index on restaurant_id -- fetch entire menu.
  • orders: Index on (customer_id, created_at DESC) -- order history. Index on (restaurant_id, status) -- restaurant's active orders. Index on (driver_id, status) -- driver's current delivery.

Media Storage

Restaurant photos, menu item images, and driver profile pictures are stored in S3 (object storage). The database stores only the CDN URL.

Serving path: All image URLs in API responses point to a CDN (e.g., https://cdn.delivery.io/restaurants/123/cover.jpg). The CDN operates in pull-based mode: first request pulls from S3, subsequent requests served from edge. Restaurant images are relatively static and cache extremely well.

Trade-offs

  • Cost: Driver location data is the highest-volume write. Redis handles it in memory (expensive per GB but fast). We keep only 30 days of history to control costs.
  • Write load: 100K QPS for location writes. Redis handles this easily. Order writes (~1,160 peak QPS) are modest for PostgreSQL.
  • Read latency: Sub-millisecond from Redis for locations and menus. ~5-10 ms from PostgreSQL for order lookups. CDN serves images in 10-50 ms from edge.

11. Scaling Strategies

Version 1: Simple Setup

For a single-city launch serving tens of thousands of users:

  • A single PostgreSQL instance for all tables.
  • A single Redis instance for caching and driver locations.
  • A few app server instances behind a load balancer.
  • A small Kafka cluster (3 brokers) for events.
  • A handful of matching and notification workers.

Growing the System

Database replication: Add PostgreSQL read replicas. The Restaurant Service and Tracking Service read from replicas. All writes go to the primary.

Database sharding: At millions of orders per day, shard the orders table by order_id (or by a hash of customer_id for order history locality). Restaurant data is small enough to remain on a single instance (or be replicated fully to each region).

Geographic partitioning: A food delivery service is inherently local -- an order in New York has nothing to do with an order in London. We can partition the entire stack by city or region: each region has its own set of services, databases, Redis instances, and Kafka clusters. This dramatically reduces cross-region complexity and lets us scale each city independently.

Redis scaling for driver locations: As we expand to more cities with more drivers, shard the Redis geospatial index by city/zone. Each city's drivers live in a separate Redis instance. This keeps GEORADIUS queries fast (smaller dataset per instance).

Separating read and write paths: The restaurant browsing path (high-frequency reads) and the order placement path (critical writes) have very different requirements. We scale them independently: many read-optimized instances for browsing, fewer but more reliable instances for order processing.

Handling Bursts

  • Kafka absorbs spikes. During dinner rush, order events spike 5x. Kafka buffers the events, and downstream consumers (matching, notifications) process them at a steady rate. No events are lost.
  • Driver matching queue. If more orders come in than drivers can be matched to immediately, the matching service queues requests and processes them in priority order (older orders first, high-value orders prioritized).

12. Reliability, Failure Handling, and Backpressure

Removing Single Points of Failure

  • App servers: Multiple instances of each service, auto-scaling behind the load balancer.
  • PostgreSQL: Primary + synchronous standby in a different AZ. Automatic failover via managed service or Patroni.
  • Redis: Redis Sentinel for failover. If the driver location Redis goes down, matching is temporarily degraded (we fall back to a wider broadcast to all nearby drivers rather than precise ranking).
  • Kafka: 3-broker cluster with replication factor 3. Tolerates single-broker failure without data loss.
  • WebSocket servers: Stateful (they hold connections), so we need sticky sessions at the load balancer. If a WebSocket server dies, clients reconnect to another instance and re-subscribe.

Timeouts, Retries, and Idempotency

  • Payment calls: 5-second timeout. Retry up to 2 times with exponential backoff. Each payment call uses an idempotency key (the order_id) so retries never double-charge.
  • Driver matching requests: 30-second timeout per driver. If the driver does not respond, automatically move to the next candidate.
  • Order creation: The POST /api/v1/orders endpoint uses a client-generated idempotency key (in a header). If the customer's app retries due to a network glitch, the server returns the existing order instead of creating a duplicate.
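
The order-creation idempotency described above can be sketched in a few lines. A dict stands in for a database table with a unique constraint on the idempotency key; in production the lookup and insert must be one atomic operation (e.g., INSERT ... ON CONFLICT):

```python
# Sketch of idempotent order creation (POST /api/v1/orders). The client
# sends an Idempotency-Key header; a retry with the same key returns the
# original order instead of creating a duplicate.

import uuid

_orders_by_idempotency_key = {}  # stand-in for a unique-keyed table

def create_order(idempotency_key, order_payload):
    """Return (order, created). Retries return the existing order."""
    existing = _orders_by_idempotency_key.get(idempotency_key)
    if existing is not None:
        return existing, False             # retry: same order, no new row
    order = {"order_id": str(uuid.uuid4()), "status": "pending", **order_payload}
    _orders_by_idempotency_key[idempotency_key] = order
    return order, True                     # first attempt: order created
```

The same pattern covers payment calls, where the order_id itself serves as the idempotency key so retries never double-charge.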

Circuit Breakers

If the Payment Service becomes slow or unresponsive, a circuit breaker on the Order Service stops sending new payment requests after a threshold of failures. Orders are queued and retried when the circuit closes. Customers see "Order is being processed" rather than an error.
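
A minimal breaker for the payment path might look like this. The thresholds are illustrative, and the clock is injectable so the open/cooldown/half-open cycle can be exercised without real waiting:

```python
# Minimal circuit breaker sketch for the Order Service -> Payment Service
# call path. Thresholds are illustrative assumptions.

class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=30, now=lambda: 0.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.now = now                 # injectable clock (e.g., time.monotonic)
        self.failures = 0
        self.opened_at = None          # None = circuit closed

    def allow_request(self):
        if self.opened_at is None:
            return True                # closed: calls flow normally
        if self.now() - self.opened_at >= self.cooldown_s:
            self.opened_at = None      # half-open: let a probe through
            self.failures = 0
            return True
        return False                   # open: fail fast, queue the order

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.now()  # trip the breaker
```

While `allow_request()` returns False, the Order Service queues the payment attempt and shows "Order is being processed" instead of an error.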

Behavior Under Overload

  • Rate limiting: Per-user rate limits on order placement (e.g., 5 orders/hour) and API calls.
  • Shed non-essential work: Under extreme load, disable push notifications and view-count tracking. Order placement and driver matching are never shed -- they involve money and real-world logistics.
  • Degrade matching quality: If the matching service is overloaded, use a simpler algorithm (nearest available driver, no scoring) rather than failing entirely.
  • Queue backpressure: If Kafka consumers fall behind, new orders are still accepted (Kafka buffers them) but downstream processing slows. Alerts fire, and we add more consumer instances.
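
The per-user order limit above (e.g., 5 orders/hour) is naturally expressed as a token bucket. A sketch with an injectable clock, so refill can be tested without waiting:

```python
# Token-bucket rate limiter sketch for per-user order placement.
# Capacity/refill values are the example limits from this section.

class TokenBucket:
    def __init__(self, capacity=5, refill_per_s=5 / 3600, now=lambda: 0.0):
        self.capacity = capacity          # e.g., 5 orders...
        self.refill_per_s = refill_per_s  # ...refilled over an hour
        self.now = now                    # injectable clock
        self.tokens = float(capacity)
        self.last = now()

    def allow(self):
        """Consume one token if available; otherwise reject the request."""
        t = self.now()
        self.tokens = min(
            self.capacity, self.tokens + (t - self.last) * self.refill_per_s
        )
        self.last = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

One bucket is kept per user (e.g., in a Redis hash keyed by user_id), with separate, looser buckets per IP for general API calls.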

13. Security, Privacy, and Abuse

Authentication and Authorization

  • All API requests require a valid JWT or session token issued after login.
  • Role-based access: customers can only see their own orders, restaurants can only manage their own orders, drivers can only see their assigned deliveries.
  • The restaurant tablet app uses a separate API key tied to the restaurant account.

Encryption

  • In transit: HTTPS everywhere (TLS 1.3). WebSocket connections use WSS.
  • At rest: PostgreSQL and S3 encryption enabled via the cloud provider. Redis data is in-memory and transient; encryption at rest is less critical but can be enabled.

Handling Sensitive Data

  • Customer addresses are sensitive. They are stored encrypted in the database and only decrypted when needed by the Order Service or shown to the assigned driver.
  • Payment information is never stored in our system. We use a PCI-compliant payment gateway (e.g., Stripe) and store only a payment method token.
  • Driver location history is retained for only 30 days and is access-controlled. It is used for dispute resolution and fraud detection, not for general analytics.

Abuse Protection

  • Rate limiting: Per-IP and per-user rate limits on all endpoints. Aggressive limits on order placement to prevent fraud.
  • Fraud detection: Flag orders with suspicious patterns: very high value, unusual delivery distance, frequent cancellations, or mismatched billing/delivery addresses. A fraud scoring model can run asynchronously after order creation and flag orders for manual review.
  • Driver fraud: Monitor for GPS spoofing (sudden location jumps, impossibly fast travel), fake delivery confirmations, and collusion between drivers and restaurants.
  • Restaurant fraud: Monitor for excessive cancellations or artificially inflated menu prices.
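The GPS-spoofing check above ("sudden location jumps, impossibly fast travel") can be implemented as a speed sanity test between consecutive pings. A hedged sketch; the 150 km/h threshold and the ping format `(lat, lon, epoch_seconds)` are assumptions:

```python
import math

MAX_SPEED_KMH = 150.0  # assumed ceiling; faster implied travel is suspicious

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def is_suspicious(prev, curr):
    """prev/curr: (lat, lon, epoch_seconds). True if implied speed is impossible."""
    dt_hours = (curr[2] - prev[2]) / 3600.0
    if dt_hours <= 0:
        return True  # out-of-order or duplicate timestamps are themselves suspect
    dist = haversine_km(prev[0], prev[1], curr[0], curr[1])
    return dist / dt_hours > MAX_SPEED_KMH

# A jump of several hundred kilometers in 10 minutes -> flagged.
print(is_suspicious((37.77, -122.42, 0), (40.0, -120.0, 600)))  # True
```

A production detector would also smooth out ordinary GPS jitter (e.g., require several consecutive violations before flagging) so one noisy ping does not penalize a legitimate driver.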

Privacy Notes

  • A food delivery system handles three types of sensitive location data: customer home addresses, driver real-time GPS, and restaurant addresses. Each must be handled carefully under privacy regulations (GDPR, CCPA).
  • Customer addresses should only be visible to the assigned driver during the active delivery window, not stored longer than necessary.
  • Driver location data should be anonymized or deleted after the retention period.

14. Bottlenecks and Next Steps

Main Bottlenecks and Risks

  • Driver matching under surge demand. During peak hours (Friday dinner), there may be far more orders than available drivers. Mitigation: Implement surge pricing to incentivize more drivers to go online. Queue orders by wait time and prioritize fairly. Next step: Build a demand prediction model that proactively alerts drivers of upcoming high-demand periods by area.
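"Queue orders by wait time and prioritize fairly" can be expressed as a min-heap keyed on order creation time, so the longest-waiting order always dispatches first. A hypothetical sketch (the class and integer timestamps are illustrative, not the book's implementation):

```python
import heapq

class SurgeQueue:
    """Dispatch orders strictly by waiting time during driver shortages."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker for identical timestamps

    def add(self, order_id, created_at):
        heapq.heappush(self._heap, (created_at, self._seq, order_id))
        self._seq += 1

    def next_order(self):
        """Pop the order that has been waiting the longest, or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

q = SurgeQueue()
q.add("order-late", created_at=200)
q.add("order-early", created_at=100)
print(q.next_order())  # order-early
```

Strict wait-time ordering prevents starvation: a high-value order placed later cannot jump ahead of one that has already been waiting through the surge.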

  • Driver location ingestion at scale (100K QPS). This is the highest-throughput component. Mitigation: Redis geospatial handles this well. Next step: Shard by city/zone. If Redis becomes a bottleneck, use a dedicated geospatial database or a custom in-memory service.

  • Order state machine complexity. An order transitions through many states (pending, confirmed, preparing, ready, picked_up, delivering, delivered, cancelled, refunded), and each transition involves multiple services. A bug in state management can cause lost orders or double charges. Next step: Implement a formal state machine library with strict transition rules and comprehensive event logging for auditability.
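The "formal state machine with strict transition rules" suggested above can be as small as a transition table plus a guard that raises on anything undeclared. A minimal sketch, assuming the states listed in the bullet; the audit log shape is an assumption:

```python
# Every legal transition is declared here; anything else raises,
# so a buggy service call cannot silently corrupt an order's lifecycle.
TRANSITIONS = {
    "pending":    {"confirmed", "cancelled"},
    "confirmed":  {"preparing", "cancelled"},
    "preparing":  {"ready", "cancelled"},
    "ready":      {"picked_up"},
    "picked_up":  {"delivering"},
    "delivering": {"delivered"},
    "delivered":  {"refunded"},
    "cancelled":  {"refunded"},
    "refunded":   set(),
}

class InvalidTransition(Exception):
    pass

class Order:
    def __init__(self, order_id):
        self.order_id = order_id
        self.state = "pending"
        self.history = [("pending", "created")]  # event log for auditability

    def transition(self, new_state, reason=""):
        if new_state not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state} -> {new_state}")
        self.state = new_state
        self.history.append((new_state, reason))

o = Order("o1")
o.transition("confirmed", "restaurant accepted")
o.transition("preparing")
# o.transition("delivered") here would raise InvalidTransition
```

Centralizing the table means adding a state later (say, "awaiting_driver") is a one-line, reviewable change rather than scattered `if` checks across services.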

  • WebSocket connection management. Millions of customers tracking orders simultaneously means millions of open WebSocket connections. Next step: Use a dedicated WebSocket gateway (e.g., a fleet of stateless gateway nodes behind a load balancer, or a managed push service) that can scale horizontally and handle connection persistence across deployments.
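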

  • Payment failure recovery. If the payment service goes down mid-order, the system must gracefully handle partial states. Next step: Implement a saga pattern with compensating transactions (e.g., if payment capture fails after restaurant confirms, queue a retry rather than cancelling the order immediately).
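The retry-before-compensate behavior described above can be sketched as one saga step with a bounded retry queue. Everything here is a hypothetical illustration (the retry cap, the order dict shape, and the injected `charge_fn`/`cancel_fn` hooks are assumptions):

```python
MAX_RETRIES = 3  # assumed cap before running the compensating transaction

def capture_payment_saga(order, charge_fn, retry_queue, cancel_fn):
    """Try to capture payment; queue a retry on failure, cancel only after the cap."""
    try:
        charge_fn(order)
        order["status"] = "paid"
    except Exception:
        order["retries"] = order.get("retries", 0) + 1
        if order["retries"] < MAX_RETRIES:
            retry_queue.append(order)   # try again later instead of cancelling
        else:
            cancel_fn(order)            # compensating transaction
            order["status"] = "cancelled"

retry_queue = []
order = {"id": "o1", "retries": 0}
flaky = {"calls": 0}

def charge(o):
    # Simulate a gateway that fails twice, then recovers.
    flaky["calls"] += 1
    if flaky["calls"] < 3:
        raise RuntimeError("gateway down")

def cancel(o):
    pass  # would release the restaurant's reservation, notify the customer, etc.

capture_payment_saga(order, charge, retry_queue, cancel)              # fails, queued
capture_payment_saga(retry_queue.pop(), charge, retry_queue, cancel)  # fails, queued
capture_payment_saga(retry_queue.pop(), charge, retry_queue, cancel)  # succeeds
print(order["status"])  # paid
```

In production the retry queue would be a durable store (e.g., a Kafka topic with delayed redelivery) so a crashed worker cannot lose the pending capture.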

Design Summary

| Aspect | Decision | Key Trade-off |
| --- | --- | --- |
| Architecture style | Event-driven microservices connected by Kafka | Decoupled and scalable, but adds operational complexity |
| Driver location store | Redis geospatial index | Fast reads and writes (sub-ms), but data is ephemeral (no persistence for history) |
| Order processing | Synchronous for creation + payment auth; async for matching, notifications, tracking | Customer gets fast confirmation; background work handles logistics |
| Driver matching | Geo-query + scoring algorithm, with timeout and fallback | Optimizes for speed and quality; degrades gracefully under load |
| Feed/tracking delivery | WebSocket for real-time push; Kafka for event distribution | Low-latency user experience; decoupled backend processing |
| Scaling approach | Geographic partitioning (per city/region) + database sharding | Natural fit for a local service; each city scales independently |

This design is built around one core insight: a food delivery system is a real-time coordination problem between three parties (customer, restaurant, driver), and the order is the central entity that connects them.

By using an event-driven architecture with Kafka as the backbone, we decouple the order creation (which must be fast and reliable) from the downstream logistics (matching, tracking, notifications), allowing each piece to scale and fail independently.

The driver location system is handled separately with Redis because it has fundamentally different performance characteristics (100K writes/second, ephemeral data, geo queries) from the rest of the application.