1. Restate the Problem and Pick the Scope
We are designing a food delivery platform, similar to DoorDash or Uber Eats, where customers can browse nearby restaurants, place food orders, and have a delivery driver bring the food to their door. The system coordinates three different user types in real time: customers ordering food, restaurants preparing it, and drivers delivering it.
Main user groups and actions:
- Customers -- browse restaurants, view menus, place orders, track delivery in real time, and pay.
- Restaurants -- receive incoming orders, update order status (accepted, preparing, ready for pickup), and manage their menu.
- Drivers (Dashers) -- go online, get matched to delivery requests, navigate to the restaurant, pick up food, and deliver it to the customer.
Scope decisions:
- We will focus on the core order flow: restaurant discovery, placing an order, driver matching, and real-time order tracking.
- We will NOT cover: restaurant onboarding/management portal, detailed menu management CMS, promotional campaigns/coupons engine, tipping, ratings/reviews, customer support chat, or the payment processing internals (we treat payments as an external service). These can be layered on later.
2. Clarify Functional Requirements
Must-Have Features
- A customer can search for nearby restaurants based on their delivery address (location-based discovery).
- A customer can view a restaurant's menu with item names, descriptions, prices, and availability.
- A customer can add items to a cart, customize them (e.g., "no onions"), and place an order.
- The system calculates an order total including item prices, delivery fee, taxes, and estimated delivery time.
- After an order is placed, the system sends it to the restaurant for acceptance.
- The system matches an available nearby driver to pick up the order.
- The customer can track the order status in real time: confirmed, preparing, driver en route to restaurant, picked up, driver en route to customer, delivered.
- The customer can see the driver's live location on a map during delivery.
- The driver can accept or decline delivery requests and update status (arrived at restaurant, picked up, delivered).
- The restaurant can accept or reject an incoming order and mark it as "ready for pickup."
Nice-to-Have Features
- Estimated delivery time shown before placing the order, updated dynamically during delivery.
- Push notifications for key status changes (order confirmed, driver assigned, food delivered).
- Order history for customers.

3. Clarify Non-Functional Requirements
| Metric | Assumption / Target |
|---|---|
| Monthly active users (MAU) | 50 million customers; 500K restaurants; 2 million drivers |
| Daily active users (DAU) | 10 million customers; 300K restaurants; 500K drivers |
| Orders per day | 5 million (each active customer orders ~0.5 times/day on average) |
| Read:Write ratio | ~10:1 -- restaurant browsing and menu views far exceed order placements |
| Order placement latency | < 500 ms p99 -- the customer should see confirmation quickly |
| Restaurant/menu search latency | < 200 ms p99 -- browsing must feel instant |
| Location update latency | < 1 second -- driver location must stream to the customer in near real time |
| Availability | 99.99% (four nines) -- orders involve money and real-world logistics; downtime directly loses revenue |
| Consistency | Strong consistency for orders and payments (no double charges, no lost orders); eventual consistency is acceptable for restaurant search results and driver location |
| Data retention | Orders retained for 3 years (financial/legal); location data retained for 30 days |
4. Back-of-the-Envelope Estimates
Write QPS (order placements)
Orders/day = 5 million
Order write QPS = 5M / 86,400 ≈ 58 QPS (average)
Peak QPS = 58 × 5 ≈ 290 QPS (dinner rush)
Each order triggers multiple writes: order record, payment, restaurant notification, driver match. Effective write QPS is ~4x the order rate.
Effective peak write QPS = 290 × 4 ≈ 1,160 QPS
Read QPS (restaurant browsing, menu views)
Read:Write ratio is 10:1 based on order writes.
Read QPS (avg) = 58 × 10 = 580 QPS
Peak read QPS = 580 × 5 ≈ 2,900 QPS
These are API-level reads. Each "browse nearby restaurants" call may fan out to multiple DB queries.
Driver location updates
500K active drivers, each sending a GPS ping every 5 seconds.
Location update QPS = 500,000 / 5 = 100,000 QPS
This is the highest-throughput write in the system. It requires a specialized store optimized for rapid, ephemeral writes (not a traditional relational DB).
Storage
Order data: Each order record is ~2 KB (items, addresses, status history, payment reference).
Order storage/day = 5M × 2 KB = 10 GB/day
Order storage/year = 10 GB × 365 = ~3.6 TB/year
Driver location: Each location ping is ~50 bytes (driver_id, lat, lng, timestamp). We only keep 30 days.
Location data/day = 100,000 QPS × 50 bytes × 86,400 = ~432 GB/day
Location data/month = 432 GB × 30 = ~13 TB
Location data is high-volume but short-lived. A time-series or in-memory store works well.
Restaurant and menu data: ~500K restaurants × 50 menu items × 500 bytes = ~12.5 GB total. This is small and mostly static.
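As a sanity check, the arithmetic above can be reproduced in a few lines. The 5x peak factor and the 4x write amplification are the assumptions stated earlier, not measured values:

```python
# Back-of-the-envelope check for the traffic and storage estimates above.
SECONDS_PER_DAY = 86_400

orders_per_day = 5_000_000
order_write_qps = orders_per_day / SECONDS_PER_DAY        # ~58 avg
peak_order_qps = order_write_qps * 5                      # ~290 (dinner rush)
effective_peak_write_qps = peak_order_qps * 4             # ~1,160 with write amplification

read_qps = order_write_qps * 10                           # 10:1 read:write ratio
peak_read_qps = read_qps * 5                              # ~2,900

drivers = 500_000
location_qps = drivers / 5                                # one GPS ping per 5 s -> 100,000

order_storage_per_day_gb = orders_per_day * 2 / 1e6       # 2 KB per order -> 10 GB/day
location_per_day_gb = location_qps * 50 * SECONDS_PER_DAY / 1e9  # 50 B/ping -> ~432 GB/day

print({"order_qps": round(order_write_qps),
       "location_qps": round(location_qps),
       "location_gb_per_day": round(location_per_day_gb)})
```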

5. API Design
5.1 Search Nearby Restaurants
| Field | Value |
|---|---|
| Method & Path | GET /api/v1/restaurants?lat={lat}&lng={lng}&radius=5km&cuisine=pizza&page=1 |
| Success (200) | { "restaurants": [ { "id", "name", "cuisine_type", "rating", "delivery_fee", "estimated_time", "distance", "thumbnail_url" } ... ], "next_page": 2 } |
| Error codes | 400 -- missing location |
5.2 Get Restaurant Menu
| Field | Value |
|---|---|
| Method & Path | GET /api/v1/restaurants/{restaurant_id}/menu |
| Success (200) | { "restaurant": { "name", "address", "hours" }, "categories": [ { "name": "Appetizers", "items": [ { "id", "name", "price", "description", "available", "customizations" } ] } ] } |
| Error codes | 404 -- restaurant not found |
5.3 Place an Order
| Field | Value |
|---|---|
| Method & Path | POST /api/v1/orders |
| Request body | { "restaurant_id": "...", "items": [ { "item_id", "quantity", "customizations": ["no onions"] } ], "delivery_address": { "lat", "lng", "formatted" }, "payment_method_id": "..." } |
| Success (201) | { "order_id": "...", "status": "pending", "estimated_delivery": "35-45 min", "total": 28.50 } |
| Error codes | 400 -- empty cart, invalid items; 402 -- payment failed; 422 -- restaurant closed |
5.4 Get Order Status
| Field | Value |
|---|---|
| Method & Path | GET /api/v1/orders/{order_id} |
| Success (200) | { "order_id", "status", "restaurant", "items", "driver": { "name", "lat", "lng" }, "estimated_delivery", "status_history": [...] } |
| Error codes | 404 -- order not found |
5.5 Real-Time Order Tracking (WebSocket)
| Field | Value |
|---|---|
| Connection | WSS /ws/orders/{order_id}/track |
| Server pushes | { "type": "status_update", "status": "driver_picked_up" } and { "type": "location_update", "lat": 31.52, "lng": 74.35 } |
5.6 Driver: Update Location
| Field | Value |
|---|---|
| Method & Path | POST /api/v1/drivers/location |
| Request body | { "lat": 31.52, "lng": 74.35, "timestamp": "..." } |
| Success (200) | { "ack": true } |
5.7 Driver: Accept/Decline Delivery Request
| Field | Value |
|---|---|
| Method & Path | POST /api/v1/drivers/deliveries/{delivery_id}/accept or /decline |
| Success (200) | { "delivery_id", "restaurant_address", "customer_address", "pickup_eta" } |
5.8 Restaurant: Update Order Status
| Field | Value |
|---|---|
| Method & Path | PATCH /api/v1/restaurants/orders/{order_id} |
| Request body | { "status": "accepted" } (or "preparing", "ready_for_pickup", "rejected") |
| Success (200) | { "order_id", "status" } |
6. High-Level Architecture

Component Responsibilities
- API Gateway + Load Balancer -- routes requests to the correct microservice, terminates TLS, enforces rate limits, handles authentication.
- Restaurant Service -- handles restaurant search (geo queries), menu retrieval, and restaurant-side order management. Reads are heavy and cacheable.
- Order Service -- the most critical service. Handles cart validation, order creation, payment orchestration, and order state machine transitions. Every state change emits an event to Kafka.
- Driver Matching Service -- the "brain" of delivery logistics. When an order is ready for driver assignment, this service finds the best available driver nearby, sends them a delivery request, and handles accept/decline/timeout logic.
- Location Service -- ingests high-frequency GPS pings from drivers (100K QPS) and stores them in Redis with geospatial indexes. This data powers both driver matching (find nearby drivers) and live tracking (stream driver location to customer).
- Tracking Service -- manages WebSocket connections from customers tracking their orders. It subscribes to order status events (from Kafka) and driver location updates (from Redis pub/sub or polling), and pushes them to the connected client.
- Notification Service -- a background consumer that listens to order events on Kafka and sends push notifications (e.g., "Your order has been picked up!").
- Kafka (Event Bus) -- the backbone for async communication. Order events flow through Kafka so that downstream services (matching, tracking, notifications) are decoupled from the order service.
7. Data Model
Database Choice
- PostgreSQL for orders, restaurants, menus, and users. These are structured, relational, and benefit from ACID transactions (especially orders + payments).
- Redis for driver locations (geospatial index using GEOADD/GEORADIUS), restaurant search cache, and ephemeral real-time data.
- Kafka for the event stream connecting services.
Table: users (customers and drivers)
| Column | Type | Notes |
|---|---|---|
| user_id | BIGINT PK | Snowflake ID |
| role | ENUM | 'customer', 'driver', 'restaurant_owner' |
| name | VARCHAR | |
| email | VARCHAR | UNIQUE |
| phone | VARCHAR | |
| default_address | JSONB | { "lat", "lng", "formatted" } |
| created_at | TIMESTAMP | |
Table: restaurants
| Column | Type | Notes |
|---|---|---|
| restaurant_id | BIGINT PK | |
| name | VARCHAR | |
| address | TEXT | |
| location | POINT (PostGIS) | Geospatial index for nearby search |
| cuisine_type | VARCHAR | Index for filtering |
| rating | DECIMAL | |
| is_open | BOOLEAN | |
| delivery_radius_km | INTEGER | |
| operating_hours | JSONB | |
Table: menu_items
| Column | Type | Notes |
|---|---|---|
| item_id | BIGINT PK | |
| restaurant_id | BIGINT FK | Index -- fetch all items for a restaurant |
| name | VARCHAR | |
| description | TEXT | |
| price | DECIMAL | |
| category | VARCHAR | e.g., "Appetizers", "Entrees" |
| available | BOOLEAN | |
| customizations | JSONB | [ { "name": "Size", "options": ["Small","Large"] } ] |
Table: orders
| Column | Type | Notes |
|---|---|---|
| order_id | BIGINT PK | Snowflake ID |
| customer_id | BIGINT FK | Index -- order history |
| restaurant_id | BIGINT FK | Index -- restaurant's active orders |
| driver_id | BIGINT FK | Nullable until assigned |
| status | ENUM | pending, confirmed, preparing, ready, driver_assigned, driver_at_restaurant, picked_up, delivering, delivered, cancelled, refunded (covers every transition used in the flows below) |
| items | JSONB | Snapshot of ordered items with prices |
| delivery_address | JSONB | { "lat", "lng", "formatted" } |
| subtotal | DECIMAL | |
| delivery_fee | DECIMAL | |
| tax | DECIMAL | |
| total | DECIMAL | |
| payment_id | VARCHAR | Reference to external payment |
| estimated_delivery_at | TIMESTAMP | |
| created_at | TIMESTAMP | |
| updated_at | TIMESTAMP | |
Table: order_status_history
| Column | Type | Notes |
|---|---|---|
| id | BIGINT PK | |
| order_id | BIGINT FK | Index |
| status | ENUM | |
| changed_at | TIMESTAMP | |
| changed_by | BIGINT | User who triggered the change |
Driver Locations in Redis
Key: driver_locations (Redis sorted set with geospatial index)
Value: GEOADD driver_locations {lng} {lat} {driver_id}
Query: GEORADIUS driver_locations {restaurant_lng} {restaurant_lat} 5 km
Each driver's latest location is stored as a geospatial entry; re-running GEOADD for the same driver_id overwrites the previous position. The GEORADIUS command (superseded by GEOSEARCH in Redis 6.2, but equivalent here) finds all drivers within a given radius of a restaurant, which powers the driver matching algorithm.
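To make the geo query concrete, here is a pure-Python sketch of what GEORADIUS computes under the hood: a haversine distance filter, sorted nearest-first. The driver coordinates are made up for illustration:

```python
import math

def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km -- the metric Redis geo commands use."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def drivers_within(drivers, lat, lng, radius_km):
    """Equivalent of GEORADIUS ... ASC: members within radius, nearest first."""
    hits = [(haversine_km(lat, lng, d_lat, d_lng), driver_id)
            for driver_id, (d_lat, d_lng) in drivers.items()]
    return [(d_id, round(dist, 2)) for dist, d_id in sorted(hits) if dist <= radius_km]

# Hypothetical driver positions near a restaurant at lat 31.52, lng 74.35.
drivers = {"d1": (31.53, 74.36), "d2": (31.60, 74.40), "d3": (31.52, 74.34)}
print(drivers_within(drivers, 31.52, 74.35, 5))
```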
Indexes Summary
- restaurants: PostGIS spatial index on location -- powers "nearby restaurants" queries.
- menu_items: Index on restaurant_id -- fetches all items for a menu.
- orders: Index on customer_id -- order history. Index on restaurant_id -- restaurant's active orders. Index on driver_id -- driver's current delivery. Index on status -- find active orders.
- order_status_history: Index on order_id -- full timeline of an order.
8. Core Flows -- End to End
Flow 1: Customer Places an Order
This is the most critical flow in the entire system. It involves money, coordinates three parties, and must be reliable above all else.
- Step 1 -- Customer browses and adds items to cart. The customer opens the app, searches for nearby restaurants (a geo query against the Restaurant Service), taps on a restaurant, and views the menu. They add items to a local cart on their device. No server writes happen yet -- the cart is client-side until checkout.
- Step 2 -- Customer taps "Place Order." The app sends POST /api/v1/orders to the API gateway with the cart contents, delivery address, and payment method. The load balancer routes the request to an Order Service instance.
- Step 3 -- Order Service validates the order. The service checks: Is the restaurant currently open? Are all menu items still available? Is the delivery address within the restaurant's delivery radius? Are item prices correct (prevents client-side tampering)? If any check fails, it returns a descriptive 400/422 error to the customer immediately.
- Step 4 -- Order Service calculates totals. The service computes the subtotal (sum of item prices), delivery fee (based on distance), taxes, and the final total. It also estimates delivery time based on historical data (restaurant preparation time + estimated drive time).
- Step 5 -- Order Service initiates payment. The service calls the external Payment Service to authorize (not yet capture) the total amount on the customer's payment method. If payment authorization fails, the service returns 402 to the customer. No order is created.
  Why authorize, not charge? We place a hold on the funds. We only capture (actually charge) after the restaurant accepts the order. If the restaurant rejects it, we release the hold and the customer is never charged.
- Step 6 -- Order Service writes the order to the database. With payment authorized, the service inserts a new row into the orders table with status pending and all the order details. It also writes an initial entry to order_status_history. This is a PostgreSQL transaction -- either the order is fully created, or nothing is written.
- Step 7 -- Order Service publishes an "order_created" event to Kafka. This event contains the order_id, restaurant_id, and all relevant details. This is the trigger for everything that happens next. The service does not wait for downstream processing.
- Step 8 -- Return success to the customer. The Order Service responds with HTTP 201, the order_id, status "pending", and the estimated delivery time. The customer sees a confirmation screen: "Your order has been placed! Waiting for restaurant confirmation." Total time from "Place Order" tap to this screen: ~300-500 ms.
- Step 9 -- Restaurant receives the order (async). The Restaurant Service (or a dedicated Order Routing worker) consumes the "order_created" event from Kafka. It sends a push notification and an in-app alert to the restaurant's tablet app: "New order #12345 -- 2x Margherita Pizza, 1x Garlic Bread." The restaurant owner sees the order and taps "Accept" or "Reject."
- Step 10 -- Restaurant accepts the order. The restaurant app calls PATCH /api/v1/restaurants/orders/{order_id} with status "accepted." The Order Service updates the order status to confirmed, captures the payment (calls the Payment Service to finalize the charge), and publishes an "order_confirmed" event to Kafka. If the restaurant rejects the order, the status changes to cancelled, the payment hold is released, and the customer is notified.
- Step 11 -- Driver matching begins (async). The Driver Matching Service consumes the "order_confirmed" event. It now needs to find a driver. The detailed matching flow is described in Flow 2 below.
- Step 12 -- Customer is notified of confirmation. The Tracking Service (via WebSocket) and the Notification Service (via push notification) both consume the "order_confirmed" event. The customer's app updates to show "Order confirmed! Restaurant is preparing your food."

What the customer sees, step by step:
- Taps "Place Order" -- sees a loading spinner for ~500 ms.
- Confirmation screen: "Order placed! Waiting for restaurant."
- Within 1-3 minutes: "Order confirmed! Preparing your food."
- Within 5-15 minutes: "Driver assigned! [Driver name] is heading to the restaurant."
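Step 4's total calculation deserves exact decimal arithmetic, since binary floats drift on money. A minimal sketch, with a hypothetical fee schedule and tax rate:

```python
from decimal import Decimal, ROUND_HALF_UP

def order_total(items, distance_km, tax_rate=Decimal("0.08")):
    """Compute subtotal, delivery fee, tax, and total as exact decimals.
    The fee schedule and tax rate are illustrative assumptions."""
    cents = Decimal("0.01")
    subtotal = sum(Decimal(price) * qty for price, qty in items)
    # Hypothetical fee: $1.99 base plus $0.50 per km of driving distance.
    delivery_fee = (Decimal("1.99") + Decimal("0.50") * Decimal(str(distance_km))).quantize(cents)
    tax = (subtotal * tax_rate).quantize(cents, rounding=ROUND_HALF_UP)
    total = subtotal + delivery_fee + tax
    return {"subtotal": subtotal, "delivery_fee": delivery_fee, "tax": tax, "total": total}

# 2x pizza at $9.99, 1x garlic bread at $4.50, restaurant 3 km away.
t = order_total([("9.99", 2), ("4.50", 1)], 3)
print(t)
```

Validating prices server-side against the menu table (Step 3) and only then running this calculation is what prevents client-side price tampering.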

Flow 2: Driver Matching and Assignment
This is the logistical heart of the system. It runs every time an order is confirmed and needs a driver.
- Step 1 -- Driver Matching Service receives the trigger. It consumes the "order_confirmed" event from Kafka. The event contains the restaurant's location and the estimated pickup time.
- Step 2 -- Find nearby available drivers. The service queries Redis: GEORADIUS driver_locations {restaurant_lng} {restaurant_lat} 5 km COUNT 20 ASC. This returns up to 20 drivers within 5 km of the restaurant, sorted by distance. The service filters this list: only drivers who are currently "online" and not already on a delivery (checked against a driver_status hash in Redis).
- Step 3 -- Score and rank candidates. The matching algorithm considers: distance to the restaurant (shorter is better), driver's direction of travel (a driver heading toward the restaurant is preferred), the driver's acceptance rate (drivers who frequently decline are ranked lower), and estimated time to arrive at the restaurant. The system picks the top candidate.
- Step 4 -- Send a delivery request to the chosen driver. The service sends a push notification and an in-app alert to the driver: "New delivery! Pickup: Mario's Pizza (0.8 km away). Dropoff: 123 Main St. Estimated earnings: $8.50. Accept?" The driver has 30 seconds to respond.
- Step 5 -- Driver accepts. The driver taps "Accept" in their app, which calls POST /api/v1/drivers/deliveries/{delivery_id}/accept. The Matching Service updates the order's driver_id in the database and changes the status to driver_assigned. It publishes a "driver_assigned" event to Kafka.
  What if the driver declines or times out? The Matching Service moves to the next candidate from the ranked list and sends them the request. If no driver accepts after 3 rounds (or 90 seconds total), the system widens the search radius to 10 km and tries again. If still no driver is found, the order is flagged, and the customer is notified of a delay. In extreme cases, the customer can choose to cancel for a full refund.
- Step 6 -- Customer and restaurant are notified. The Tracking Service pushes a WebSocket update to the customer: "Driver [name] has been assigned and is heading to the restaurant." The restaurant app also shows the driver's name and ETA. The customer can now see the driver's live location on the map.
- Step 7 -- Driver navigates to the restaurant. The driver app shows turn-by-turn directions (using a maps API). Every 5 seconds, the driver's app sends a GPS ping to the Location Service, which updates Redis. The Tracking Service polls or subscribes to these updates and streams the driver's location to the customer's WebSocket connection.
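The decline/timeout cascade in Step 5 can be sketched as a ranked retry loop. The scoring weights and the offer callback below are illustrative stand-ins for the real scoring model and the push-notification round trip:

```python
def rank_candidates(candidates):
    """Score drivers: closer distance and higher acceptance rate rank first.
    Weights are illustrative; production scoring would also use heading and ETA."""
    return sorted(candidates, key=lambda c: c["distance_km"] - 2.0 * c["acceptance_rate"])

def assign_driver(candidates, offer, rounds=3):
    """Offer the delivery to ranked drivers one at a time. `offer` stands in for
    the 30-second push-notification round trip: True on accept, False on
    decline/timeout. Returns the winning driver id, or None after `rounds` tries."""
    for candidate in rank_candidates(candidates)[:rounds]:
        if offer(candidate["driver_id"]):
            return candidate["driver_id"]
    return None  # caller widens the radius and retries, then flags the order

candidates = [
    {"driver_id": "d1", "distance_km": 0.8, "acceptance_rate": 0.4},
    {"driver_id": "d2", "distance_km": 1.5, "acceptance_rate": 0.9},
    {"driver_id": "d3", "distance_km": 3.0, "acceptance_rate": 0.95},
]
# d2 ranks first (close and reliable) but declines; the loop falls through to d1.
print(assign_driver(candidates, offer=lambda d: d != "d2"))
```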

Flow 3: Real-Time Order Tracking
This flow describes how the customer sees live updates from order confirmation through delivery.
- Step 1 -- Customer opens the tracking screen. The app establishes a WebSocket connection: WSS /ws/orders/{order_id}/track. The Tracking Service accepts the connection, authenticates the user, and subscribes to two data streams for this order:
  - Order status events from Kafka (e.g., "order_confirmed", "driver_assigned", "picked_up", "delivered").
  - Driver location updates from the Location Service (either via Redis pub/sub or by polling Redis every 2-3 seconds).
- Step 2 -- Status updates flow to the customer. When the restaurant marks the order as "ready for pickup," the Order Service publishes an event to Kafka. The Tracking Service consumes it and pushes a WebSocket message to the customer: { "type": "status_update", "status": "ready_for_pickup" }. The app updates the progress bar and status text.
- Step 3 -- Location updates flow to the customer. While the driver is en route, the Location Service receives GPS pings and writes them to Redis. The Tracking Service reads the driver's latest location every 2-3 seconds and pushes it to the customer: { "type": "location_update", "lat": 31.5204, "lng": 74.3587 }. The app moves the driver's icon on the map smoothly.
  Why poll Redis instead of using Kafka for location? Location updates are extremely high frequency (100K QPS across all drivers) and ephemeral. Kafka is designed for durable event streams. For location, we only care about the latest position, not the full history. Redis geospatial queries are ideal: fast reads, automatic overwrite of old data, and built-in geo commands.
- Step 4 -- Driver arrives at the restaurant. The driver taps "Arrived at restaurant" in their app. The status updates to driver_at_restaurant. The customer sees: "Your driver has arrived at the restaurant."
- Step 5 -- Driver picks up the food. The driver taps "Picked up." Status becomes picked_up. The customer sees: "Your food is on the way!" Now the driver's live location becomes the primary focus of the tracking screen.
- Step 6 -- Driver delivers the food. The driver arrives at the customer's address and taps "Delivered." Status becomes delivered. The customer receives a push notification: "Your food has been delivered! Enjoy your meal." The WebSocket connection is closed. The order is complete.
- Step 7 -- Post-delivery processing (async). Background workers finalize the order: mark the payment as captured (if not already done at confirmation), calculate driver earnings, update the driver's delivery count, and log the completed delivery for analytics. None of this affects the customer experience.

What the customer sees on the tracking screen:
- A progress bar: Placed --> Confirmed --> Preparing --> Driver Assigned --> Picked Up --> On the Way --> Delivered.
- A map showing the driver's live location (updated every 2-3 seconds).
- An estimated delivery time that adjusts dynamically based on the driver's real-time progress.

9. Caching and Read Performance
What We Cache
- Restaurant listings by location (Redis): restaurants:geo:{city_zone} -- a pre-computed list of open restaurants in each geographic zone. Updated every few minutes. This avoids hitting PostgreSQL with geo queries on every app open.
- Restaurant menus (Redis): menu:{restaurant_id} -- the full menu JSON. Menus change infrequently (a few times per day), so caching with a 10-minute TTL is very effective.
- Driver locations (Redis geospatial): driver_locations -- this IS the primary store for current driver positions. It is not a cache in front of a database; it is the real-time source of truth. Historical locations are written to cold storage separately.
- Active order status (Redis hash): order:{order_id} -- the current status and key details of active orders. Read by the Tracking Service on every WebSocket push cycle.
Where the Cache Sits
Redis sits between the application services and PostgreSQL. For restaurant browsing, the read path is: App -> Restaurant Service -> Redis (cache hit) -> return. On cache miss: App -> Restaurant Service -> PostgreSQL (+ PostGIS) -> populate Redis -> return.
For driver locations, Redis IS the primary read/write store. No database in the hot path.
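The cache-aside read path described above, sketched with in-memory stand-ins for Redis and PostgreSQL:

```python
def get_menu(restaurant_id, cache, db, ttl_s=600):
    """Cache-aside read path: a Redis hit returns immediately; a miss reads
    the database and populates the cache with a TTL as a safety net."""
    key = f"menu:{restaurant_id}"
    menu = cache.get(key)
    if menu is not None:
        return menu, "hit"
    menu = db[restaurant_id]          # stands in for the PostgreSQL query
    cache.setex(key, ttl_s, menu)     # same call shape as redis-py SETEX
    return menu, "miss"

class FakeCache(dict):
    """Minimal in-memory stand-in for the Redis get/setex calls used above."""
    def get(self, key):
        return dict.get(self, key)
    def setex(self, key, ttl_s, value):
        self[key] = value             # TTL ignored in this sketch

cache, db = FakeCache(), {42: {"items": ["Margherita"]}}
print(get_menu(42, cache, db)[1])  # first read misses and populates the cache
print(get_menu(42, cache, db)[1])  # second read hits
```

Explicit invalidation on menu writes (deleting the key) combines with the TTL so a missed invalidation self-heals within ten minutes.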
Cache Update and Invalidation
- Restaurant cache: Updated when a restaurant changes its hours, availability, or menu. The Restaurant Service invalidates the cache key on write. TTL of 10 minutes as a safety net.
- Menu cache: Invalidated when a restaurant updates an item. Since menu changes are infrequent, a short TTL (5-10 min) plus explicit invalidation on write is sufficient.
- Order cache: Updated on every status transition by the Order Service. Deleted when the order is completed or after 24 hours.
- Driver locations: Overwritten every 5 seconds by the driver's GPS ping. No explicit invalidation needed -- the data is always fresh.
Eviction Policy
LRU (Least Recently Used) on the Redis instance level. Restaurant and menu data for less-popular restaurants gets evicted during peak hours, freeing memory for hot data. A cache miss simply results in a database read.
10. Storage, Indexing, and Media
Primary Data Storage
- PostgreSQL for orders, users, restaurants, menu items. At ~3.6 TB/year for order data, a sharded PostgreSQL setup handles this for years.
- Redis for driver locations and real-time caches.
- Time-series or cold storage for historical location data (13 TB/month). After 30 days, this is archived or deleted.
Indexes
- restaurants: PostGIS spatial index on location -- essential for "nearby" queries. Index on cuisine_type for filtering.
- menu_items: Index on restaurant_id -- fetch entire menu.
- orders: Index on (customer_id, created_at DESC) -- order history. Index on (restaurant_id, status) -- restaurant's active orders. Index on (driver_id, status) -- driver's current delivery.
Media Storage
Restaurant photos, menu item images, and driver profile pictures are stored in S3 (object storage). The database stores only the CDN URL.
Serving path: All image URLs in API responses point to a CDN (e.g., https://cdn.delivery.io/restaurants/123/cover.jpg). The CDN operates in pull-based mode: first request pulls from S3, subsequent requests served from edge. Restaurant images are relatively static and cache extremely well.
Trade-offs
- Cost: Driver location data is the highest-volume write. Redis handles it in memory (expensive per GB but fast). We keep only 30 days of history to control costs.
- Write load: 100K QPS for location writes. Redis handles this easily. Order writes (~1,160 peak QPS) are modest for PostgreSQL.
- Read latency: Sub-millisecond from Redis for locations and menus. ~5-10 ms from PostgreSQL for order lookups. CDN serves images in 10-50 ms from edge.
11. Scaling Strategies
Version 1: Simple Setup
For a single-city launch serving tens of thousands of users:
- A single PostgreSQL instance for all tables.
- A single Redis instance for caching and driver locations.
- A few app server instances behind a load balancer.
- A small Kafka cluster (3 brokers) for events.
- A handful of matching and notification workers.
Growing the System
Database replication: Add PostgreSQL read replicas. The Restaurant Service and Tracking Service read from replicas. All writes go to the primary.
Database sharding: At millions of orders per day, shard the orders table by order_id (or by a hash of customer_id for order history locality). Restaurant data is small enough to remain on a single instance (or be replicated fully to each region).
Geographic partitioning: A food delivery service is inherently local -- an order in New York has nothing to do with an order in London. We can partition the entire stack by city or region: each region has its own set of services, databases, Redis instances, and Kafka clusters. This dramatically reduces cross-region complexity and lets us scale each city independently.
Redis scaling for driver locations: As we expand to more cities with more drivers, shard the Redis geospatial index by city/zone. Each city's drivers live in a separate Redis instance. This keeps GEORADIUS queries fast (smaller dataset per instance).
Separating read and write paths: The restaurant browsing path (high-frequency reads) and the order placement path (critical writes) have very different requirements. We scale them independently: many read-optimized instances for browsing, fewer but more reliable instances for order processing.
Handling Bursts
- Kafka absorbs spikes. During dinner rush, order events spike 5x. Kafka buffers the events, and downstream consumers (matching, notifications) process them at a steady rate. No events are lost.
- Driver matching queue. If more orders come in than drivers can be matched to immediately, the matching service queues requests and processes them in priority order (older orders first, high-value orders prioritized).
12. Reliability, Failure Handling, and Backpressure
Removing Single Points of Failure
- App servers: Multiple instances of each service, auto-scaling behind the load balancer.
- PostgreSQL: Primary + synchronous standby in a different AZ. Automatic failover via managed service or Patroni.
- Redis: Redis Sentinel for failover. If the driver location Redis goes down, matching is temporarily degraded (we fall back to a wider broadcast to all nearby drivers rather than precise ranking).
- Kafka: 3-broker cluster with replication factor 3. Tolerates single-broker failure without data loss.
- WebSocket servers: Stateful (they hold connections), so we need sticky sessions at the load balancer. If a WebSocket server dies, clients reconnect to another instance and re-subscribe.
Timeouts, Retries, and Idempotency
- Payment calls: 5-second timeout. Retry up to 2 times with exponential backoff. Each payment call uses an idempotency key (the order_id) so retries never double-charge.
- Driver matching requests: 30-second timeout per driver. If the driver does not respond, automatically move to the next candidate.
- Order creation: The POST /api/v1/orders endpoint uses a client-generated idempotency key (in a header). If the customer's app retries due to a network glitch, the server returns the existing order instead of creating a duplicate.
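A minimal sketch of that idempotency-key check, using an in-memory map; in production the key-to-order mapping would live in PostgreSQL behind a unique constraint so that concurrent retries race safely:

```python
import uuid

class OrderService:
    """Idempotent order creation: a client-generated key maps to the order it
    produced, so a network-glitch retry returns the same order, never a duplicate."""
    def __init__(self):
        self.orders = {}
        self.by_idempotency_key = {}

    def create_order(self, idempotency_key, payload):
        existing = self.by_idempotency_key.get(idempotency_key)
        if existing is not None:
            return existing, False          # replay: return the original order
        order_id = str(uuid.uuid4())
        self.orders[order_id] = payload
        self.by_idempotency_key[idempotency_key] = order_id
        return order_id, True

svc = OrderService()
first, created = svc.create_order("key-123", {"items": ["pizza"]})
retry, created_again = svc.create_order("key-123", {"items": ["pizza"]})
print(first == retry, created, created_again)  # True True False
```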
Circuit Breakers
If the Payment Service becomes slow or unresponsive, a circuit breaker on the Order Service stops sending new payment requests after a threshold of failures. Orders are queued and retried when the circuit closes. Customers see "Order is being processed" rather than an error.
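A stripped-down version of that breaker logic; the threshold and cooldown values are illustrative:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: open after `threshold` consecutive failures,
    reject calls while open, and allow one trial call after `cooldown_s`."""
    def __init__(self, threshold=5, cooldown_s=30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown_s:
                raise RuntimeError("circuit open: queue the payment for retry")
            self.opened_at = None        # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                # any success resets the failure count
        return result
```

The Order Service would wrap every Payment Service call in `breaker.call(...)` and translate the open-circuit error into the queued "Order is being processed" path.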
Behavior Under Overload
- Rate limiting: Per-user rate limits on order placement (e.g., 5 orders/hour) and API calls.
- Shed non-essential work: Under extreme load, disable push notifications and view-count tracking. Order placement and driver matching are never shed -- they involve money and real-world logistics.
- Degrade matching quality: If the matching service is overloaded, use a simpler algorithm (nearest available driver, no scoring) rather than failing entirely.
- Queue backpressure: If Kafka consumers fall behind, new orders are still accepted (Kafka buffers them) but downstream processing slows. Alerts fire, and we add more consumer instances.
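The per-user order limit above can be enforced with a token bucket; the capacity and refill rate below approximate the "5 orders/hour" example:

```python
class TokenBucket:
    """Per-user token bucket: `capacity` tokens, refilled at `rate` tokens/second.
    capacity=5 with rate=5/3600 approximates '5 orders per hour'."""
    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(capacity=5, rate=5 / 3600)
print([bucket.allow(now=0.0) for _ in range(6)])  # five allowed, sixth rejected
```

In practice the bucket state lives in Redis (one hash per user) so all API gateway instances share the same count.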
13. Security, Privacy, and Abuse
Authentication and Authorization
- All API requests require a valid JWT or session token issued after login.
- Role-based access: customers can only see their own orders, restaurants can only manage their own orders, drivers can only see their assigned deliveries.
- The restaurant tablet app uses a separate API key tied to the restaurant account.
Encryption
- In transit: HTTPS everywhere (TLS 1.3). WebSocket connections use WSS.
- At rest: PostgreSQL and S3 encryption enabled via the cloud provider. Redis data is in-memory and transient; encryption at rest is less critical but can be enabled.
Handling Sensitive Data
- Customer addresses are sensitive. They are stored encrypted in the database and only decrypted when needed by the Order Service or shown to the assigned driver.
- Payment information is never stored in our system. We use a PCI-compliant payment gateway (e.g., Stripe) and store only a payment method token.
- Driver location history is retained for only 30 days and is access-controlled. It is used for dispute resolution and fraud detection, not for general analytics.
Abuse Protection
- Rate limiting: Per-IP and per-user rate limits on all endpoints. Aggressive limits on order placement to prevent fraud.
- Fraud detection: Flag orders with suspicious patterns such as very high value, unusual delivery distance, frequent cancellations, or mismatched billing/delivery addresses. A fraud scoring model can run asynchronously after order creation and flag orders for manual review.
- Driver fraud: Monitor for GPS spoofing (sudden location jumps, impossibly fast travel), fake delivery confirmations, and collusion between drivers and restaurants.
- Restaurant fraud: Monitor for excessive cancellations or artificially inflated menu prices.
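The "impossibly fast travel" check for GPS spoofing amounts to a plausibility test on implied speed between consecutive location pings. A sketch (the 150 km/h threshold is an assumed value, not a tuned one):

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 150.0  # assumed ceiling for road travel

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def is_plausible_move(prev, curr):
    """prev and curr are (lat, lon, unix_timestamp) pings for one driver.
    Returns False for sudden jumps that imply impossibly fast travel."""
    dt_hours = (curr[2] - prev[2]) / 3600.0
    if dt_hours <= 0:
        return False  # out-of-order or duplicate timestamps are suspect too
    speed_kmh = haversine_km(prev[0], prev[1], curr[0], curr[1]) / dt_hours
    return speed_kmh <= MAX_PLAUSIBLE_KMH
```

In practice this check would run in the location ingestion pipeline and emit a fraud signal rather than rejecting the ping outright, since GPS noise produces occasional outliers.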
Privacy Notes
- A food delivery system handles three types of sensitive location data: customer home addresses, driver real-time GPS, and restaurant addresses. Each must be handled carefully under privacy regulations (GDPR, CCPA).
- Customer addresses should only be visible to the assigned driver during the active delivery window, not stored longer than necessary.
- Driver location data should be anonymized or deleted after the retention period.
14. Bottlenecks and Next Steps
Main Bottlenecks and Risks
- Driver matching under surge demand. During peak hours (Friday dinner), there may be far more orders than available drivers. Mitigation: implement surge pricing to incentivize more drivers to go online; queue orders by wait time and prioritize fairly. Next step: build a demand prediction model that proactively alerts drivers of upcoming high-demand periods by area.
- Driver location ingestion at scale (100K QPS). This is the highest-throughput component. Mitigation: Redis geospatial handles this well. Next step: shard by city/zone. If Redis becomes a bottleneck, use a dedicated geospatial database or a custom in-memory service.
- Order state machine complexity. An order transitions through many states (pending, confirmed, preparing, ready, picked_up, delivering, delivered, cancelled, refunded), and each transition involves multiple services. A bug in state management can cause lost orders or double charges. Next step: implement a formal state machine library with strict transition rules and comprehensive event logging for auditability.
- WebSocket connection management. Millions of customers tracking orders simultaneously means millions of open WebSocket connections. Next step: use a dedicated WebSocket gateway (e.g., a service mesh or a managed service) that can scale horizontally and handle connection persistence across deployments.
- Payment failure recovery. If the payment service goes down mid-order, the system must gracefully handle partial states. Next step: implement a saga pattern with compensating transactions (e.g., if payment capture fails after restaurant confirms, queue a retry rather than cancelling the order immediately).
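The strict-transition idea for the order state machine can be sketched as a table-driven design: every legal transition is listed explicitly, everything else raises, and each accepted transition is appended to an audit log. The `Order` class here is illustrative, not a proposed API:

```python
# Allowed order state transitions; anything not listed is rejected.
TRANSITIONS = {
    "pending":    {"confirmed", "cancelled"},
    "confirmed":  {"preparing", "cancelled"},
    "preparing":  {"ready", "cancelled"},
    "ready":      {"picked_up"},
    "picked_up":  {"delivering"},
    "delivering": {"delivered"},
    "delivered":  {"refunded"},
    "cancelled":  {"refunded"},
    "refunded":   set(),  # terminal state
}

class InvalidTransition(Exception):
    pass

class Order:
    def __init__(self, order_id):
        self.order_id = order_id
        self.state = "pending"
        self.history = []  # (from_state, to_state, actor) tuples for auditability

    def transition(self, new_state, actor):
        if new_state not in TRANSITIONS[self.state]:
            raise InvalidTransition(f"{self.state} -> {new_state}")
        self.history.append((self.state, new_state, actor))
        self.state = new_state
```

Centralizing the transition table makes illegal paths (e.g., pending straight to delivered) fail loudly instead of silently corrupting order state, and the history list gives each order a replayable audit trail.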
Design Summary
| Aspect | Decision | Key Trade-off |
|---|---|---|
| Architecture style | Event-driven microservices connected by Kafka | Decoupled and scalable, but adds operational complexity |
| Driver location store | Redis geospatial index | Fast reads and writes (sub-ms), but data is ephemeral (no persistence for history) |
| Order processing | Synchronous for creation + payment auth; async for matching, notifications, tracking | Customer gets fast confirmation; background work handles logistics |
| Driver matching | Geo-query + scoring algorithm, with timeout and fallback | Optimizes for speed and quality; degrades gracefully under load |
| Feed/tracking delivery | WebSocket for real-time push; Kafka for event distribution | Low-latency user experience; decoupled backend processing |
| Scaling approach | Geographic partitioning (per city/region) + database sharding | Natural fit for a local service; each city scales independently |
This design is built around one core insight: a food delivery system is a real-time coordination problem between three parties (customer, restaurant, driver), and the order is the central entity that connects them.
By using an event-driven architecture with Kafka as the backbone, we decouple the order creation (which must be fast and reliable) from the downstream logistics (matching, tracking, notifications), allowing each piece to scale and fail independently.
The driver location system is handled separately with Redis because it has fundamentally different performance characteristics (100K writes/second, ephemeral data, geo queries) from the rest of the application.