Chapter 8: System Design Interview Mastery

8.10 Design a hotel/event booking system (Booking.com / BookMyShow / Ticketmaster)

1. Restate the problem and pick the scope

We are designing a large-scale ticketing platform like Ticketmaster. The system allows users to browse upcoming live events, view venue seating charts, and purchase tickets securely.

The main user groups are event organizers (who list events) and ticket buyers (who purchase seats). The most critical action ticket buyers care about is successfully securing seats for highly anticipated events without the system crashing or double booking their selection.

Because this is a massive system, we will focus strictly on the customer booking flow: browsing events, viewing seat maps, temporarily reserving seats, and completing the purchase.

We will explicitly exclude event creation, admin dashboards, dynamic pricing algorithms, and the secondary ticket resale market.

2. Clarify functional requirements

Must have:

  • User can view a list of upcoming events.
  • User can view a seating map showing available, held, and booked seats for an event in real time.
  • User can select seats and place a temporary “hold” on them (for example, 10 minutes) while they check out.
  • User can complete payment for their held seats to finalize the booking.
  • System must automatically release held seats if the 10 minute timer expires before payment is completed.

Nice to have:

  • User can search for events by keyword or city.
  • A “virtual waiting room” for highly anticipated events to ensure fairness and prevent system crashes.

3. Clarify non-functional requirements

  • Target users: 100 million Monthly Active Users (MAU), 10 million Daily Active Users (DAU).
  • Read and write pattern: Extremely read-heavy under normal conditions (roughly a 100:1 read-to-write ratio). However, during major concert ticket drops, the system experiences massive, localized write-heavy bursts (the thundering herd problem).
  • Latency goals: < 200ms for browsing events and loading seat maps. < 500ms for reserving a seat.
  • Availability targets: 99.99% overall availability.
  • Consistency preference: Strong consistency is absolutely mandatory for seat reservations and payments to prevent double booking. We favor Consistency over Availability (CP in the CAP theorem) during checkout. Eventual consistency is acceptable for search results.
  • Data retention: Keep ticket and order history for 5 years for legal compliance.

4. Back-of-the-envelope estimates

  • Estimate average read QPS: 10 million DAU * 10 page views per day = 100 million read requests per day.
    • 100,000,000 / 86,400 seconds = ~1,150 average read QPS.
  • Estimate average write QPS: Assume 1 million tickets are sold globally per day.
    • 1,000,000 / 86,400 seconds = ~12 average write QPS.
  • Estimate peak traffic versus average: A 50,000 seat stadium might sell out in 5 minutes during a highly anticipated drop.
    • 50,000 / 300 seconds = ~166 write QPS focused entirely on a single event. Read traffic (users refreshing) can easily hit 100,000+ QPS. Handling this massive localized spike is the real challenge.
  • Estimate needed storage: 1 million tickets per day * 500 bytes per ticket record = 500 MB per day.
    • Over a year, this is roughly 180 GB. Storage volume is very small; the real challenge is concurrent write operations.
  • Estimate bandwidth: Venue seating maps are heavy visual assets (for example, 1 MB per image).
    • 1,150 average read QPS * 1 MB = ~1.15 GB per second. A Content Delivery Network (CDN) is absolutely necessary to handle this bandwidth.
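
The arithmetic above can be double-checked as a short Python script. The numbers come straight from the bullets; the script's rounding differs slightly from the text's (for example, ~182 GB versus the rounded ~180 GB).

```python
# Back-of-the-envelope numbers from the bullets above, as checkable arithmetic.
DAU = 10_000_000
PAGE_VIEWS_PER_USER = 10
SECONDS_PER_DAY = 86_400

read_qps = DAU * PAGE_VIEWS_PER_USER / SECONDS_PER_DAY   # ~1,150 average read QPS
write_qps = 1_000_000 / SECONDS_PER_DAY                  # ~12 average write QPS

# Peak: a 50,000-seat stadium selling out in 5 minutes.
peak_write_qps = 50_000 / 300                            # ~166 write QPS on one event

# Storage: 1 million tickets per day at ~500 bytes each.
daily_storage_mb = 1_000_000 * 500 / 1_000_000           # 500 MB per day
yearly_storage_gb = daily_storage_mb * 365 / 1_000       # ~182 GB per year
```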

5. API design

We will use a RESTful API for our core flows over HTTPS.

  • Search Events
    • GET /v1/events?keyword={keyword}&city={city}
    • Response body: List of { event_id, name, date, venue_name, thumbnail_url }.
    • Status codes: 200 OK.
  • View Seats
    • GET /v1/events/{event_id}/seats
    • Response body: List of { seat_id, row, number, status, price }.
    • Status codes: 200 OK, 404 Not Found.
  • Reserve Seats (The Hold)
    • POST /v1/reservations
    • Request parameters: { event_id: "e123", seat_ids: ["s1", "s2"] }
    • Response body: { reservation_id: "r999", expires_at: "2026-05-01T20:10:00Z" }
    • Error cases: 409 Conflict (Seats already taken), 400 Bad Request.
  • Confirm Booking
    • POST /v1/bookings
    • Request parameters: { reservation_id: "r999", payment_token: "tok_xyz", idempotency_key: "uuid" }
    • Response body: { booking_id: "b111", status: "CONFIRMED" }
    • Error cases: 400 Bad Request (Reservation expired), 402 Payment Required (Card declined).
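
A minimal in-memory sketch of the Reserve Seats contract, showing the all-or-nothing hold and the 409 path. The dict-based store, `HOLD_MINUTES` constant, and helper names are illustrative; the real handler would run inside a database transaction (section 8, Flow 2).

```python
import uuid
from datetime import datetime, timedelta, timezone

HOLD_MINUTES = 10

# In-memory stand-ins for the seat store and reservations.
seats = {("e123", "s1"): "AVAILABLE", ("e123", "s2"): "AVAILABLE"}
reservations = {}

def reserve_seats(event_id, seat_ids):
    """Handler sketch for POST /v1/reservations: the hold is all-or-nothing."""
    keys = [(event_id, s) for s in seat_ids]
    if any(seats.get(k) != "AVAILABLE" for k in keys):
        return 409, {"error": "Seats already taken"}
    reservation_id = str(uuid.uuid4())
    expires_at = datetime.now(timezone.utc) + timedelta(minutes=HOLD_MINUTES)
    for k in keys:
        seats[k] = "HELD"
    reservations[reservation_id] = {"seats": keys, "expires_at": expires_at}
    return 200, {"reservation_id": reservation_id,
                 "expires_at": expires_at.isoformat()}
```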

6. High-level architecture

Here is the step-by-step path a request takes from the user to our data stores:

Client -> CDN / WAF -> Load Balancer -> Virtual Waiting Room -> API Gateway -> App Servers -> Cache -> Database -> Message Queue

  • Clients (web, mobile): The frontend interfaces used by ticket buyers.
  • CDN and WAF: The Content Delivery Network serves heavy static assets instantly. The Web Application Firewall blocks malicious bot traffic.
  • Load Balancer: Distributes incoming traffic evenly across our backend servers.
  • Virtual Waiting Room: A specialized reverse proxy layer that buffers traffic during massive spikes, letting a manageable number of users into the active site.
  • API Gateway: Routes incoming requests, handles authentication, and enforces rate limits.
  • App Servers (Microservices): Independent stateless services handling business logic (Search Service, Booking Service, Payment Service).
  • Cache (Redis): Stores event details and the real time status of seats to protect the database from massive read spikes.
  • Database (PostgreSQL): The ultimate source of truth for seat ownership and financial transactions.
  • Message Queues (Kafka): Handles asynchronous background tasks like generating digital tickets and sending confirmation emails.

7. Data model

We must use a Relational SQL Database (such as PostgreSQL) as our primary data store.

The most critical requirement of this system is preventing double booking. This requires strict ACID transactions and row-level locking, which SQL databases are perfectly designed for.

The eventual consistency offered by many NoSQL stores is too risky here: two replicas could briefly disagree about a seat’s status and sell the same seat twice.

Main tables:

  • Events Table: id (Primary Key), name, venue_id, start_time.
  • Seats Table: id (Primary Key), venue_id, row, seat_number.
  • Event_Seats Table: event_id + seat_id (composite Primary Key), status (AVAILABLE, HELD, BOOKED), reservation_id, held_until.
    • Index: We will place a composite index on (event_id, status). This allows the database to instantly answer the API’s most common query: “Find all AVAILABLE seats for Event X.”
  • Bookings Table: id (Primary Key), user_id, event_id, total_price, status.
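
The tables and the composite index above can be sketched as DDL. The column types and names are illustrative; SQLite is used only so the snippet is self-contained, while production would target PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    id          TEXT PRIMARY KEY,
    name        TEXT NOT NULL,
    venue_id    TEXT NOT NULL,
    start_time  TEXT NOT NULL
);
CREATE TABLE seats (
    id          TEXT PRIMARY KEY,
    venue_id    TEXT NOT NULL,
    "row"       TEXT NOT NULL,
    seat_number INTEGER NOT NULL
);
CREATE TABLE event_seats (
    event_id       TEXT NOT NULL,
    seat_id        TEXT NOT NULL,
    status         TEXT NOT NULL CHECK (status IN ('AVAILABLE','HELD','BOOKED')),
    reservation_id TEXT,
    held_until     TEXT,
    PRIMARY KEY (event_id, seat_id)
);
-- The hot-path index: "find all AVAILABLE seats for event X".
CREATE INDEX idx_event_seats_status ON event_seats (event_id, status);
CREATE TABLE bookings (
    id          TEXT PRIMARY KEY,
    user_id     TEXT NOT NULL,
    event_id    TEXT NOT NULL,
    total_price INTEGER NOT NULL,
    status      TEXT NOT NULL
);
""")
```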

Media Storage:

The actual image files for venue seating maps and artist banners are stored in Cloud Object Storage (like AWS S3). The database simply stores a string URL pointing to that object.

8. Core flows end-to-end

Flow 1: Browsing events and viewing the seat map

  1. The user opens an event page. The client application fetches the heavy venue map images directly from the CDN.
  2. Simultaneously, the client sends a GET /v1/events/{id}/seats request to see which seats are open.
  3. The request passes through the API gateway to the Booking Service.
  4. The Booking Service checks the Redis cache first. Because this is read heavy, 99% of the time the seat statuses are found in the cache and returned instantly.
  5. If it is a cache miss, the service queries PostgreSQL, saves the available seats to Redis with a short expiration time, and returns the data.
  6. The user sees the seating map populate, with available seats colored green.

Flow 2: Reserving a seat (The critical path)

  1. The user selects a green seat and taps “Checkout”. The client immediately shows a loading spinner.
  2. The client sends a POST /v1/reservations request to the Booking Service.
  3. The Booking Service opens a transaction in the PostgreSQL database.
  4. It runs a query with a strict row level lock: SELECT * FROM Event_Seats WHERE event_id = X AND seat_id = Y AND status = 'AVAILABLE' FOR UPDATE.
  5. If the seat is available, the database locks that specific row. The service updates the status to HELD, sets the held_until timestamp for 10 minutes from now, and commits the transaction.
  6. What if another user clicked the exact same seat at the exact same millisecond? The database forces the second user’s query to wait. Once the first user’s transaction finishes, the second user’s query sees the seat is now HELD. The service returns a 409 Conflict error (“Seat no longer available”).
  7. For the successful user, the Booking Service updates the Redis cache to reflect the new HELD status. The user’s screen transitions to checkout, displaying a 10 minute countdown timer.

Flow 3: Completing the purchase and releasing holds

  1. The user enters their credit card details and taps “Pay”. The client sends a POST /v1/bookings request.
  2. The Payment Service securely forwards the payment token to a third party gateway like Stripe.
  3. Once Stripe confirms the charge, the Payment Service updates the seat status in PostgreSQL to BOOKED.
  4. We use asynchronous queues here to keep the user experience fast. The Payment Service drops an “Order Completed” message into a Kafka queue. The user immediately sees a “Payment Successful” screen. In the background, a worker service reads the Kafka message, generates the digital ticket, and emails it.
  5. What if the user abandons the checkout? A background cron job runs constantly, looking for rows where status = 'HELD' and held_until < NOW(). It flips those seats back to AVAILABLE and updates the Redis cache so other users can buy them.
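
The cleanup sweep in step 5 can be sketched as a single conditional UPDATE; the schema here is a trimmed-down stand-in for the Event_Seats table, and SQLite keeps the snippet self-contained.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE event_seats (
    event_id TEXT, seat_id TEXT, status TEXT, held_until TEXT,
    PRIMARY KEY (event_id, seat_id))""")

now = datetime.now(timezone.utc)
expired = (now - timedelta(minutes=1)).isoformat()   # hold timed out, unpaid
live = (now + timedelta(minutes=9)).isoformat()      # hold still counting down
conn.execute("INSERT INTO event_seats VALUES ('e1', 's1', 'HELD', ?)", (expired,))
conn.execute("INSERT INTO event_seats VALUES ('e1', 's2', 'HELD', ?)", (live,))
conn.commit()

def release_expired_holds(conn, now_iso):
    """The sweep the background job runs: flip timed-out holds back to AVAILABLE."""
    cur = conn.execute(
        """UPDATE event_seats SET status = 'AVAILABLE', held_until = NULL
           WHERE status = 'HELD' AND held_until < ?""",
        (now_iso,))
    conn.commit()
    return cur.rowcount  # how many seats were released
```

After the sweep, the released seats would also be written back to the Redis cache so other users see them as available again.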

9. Caching and read performance

  • What we cache: Event metadata (descriptions, dates) and the real time status of the seat map.
  • Where the cache sits: Redis sits as a “look aside” cache directly behind the microservices, shielding the PostgreSQL database from the heavy 100:1 read traffic.
  • Cache keys and values: We use keys like event:{id}:details (storing JSON) and event:{id}:seats (storing a Hash map of seat IDs to their status).
  • Cache updates and invalidation: Seat availability is highly dynamic. Whenever a user successfully holds or books a seat, the Booking Service updates PostgreSQL and immediately overwrites the specific seat’s status in the Redis Hash.
  • Eviction policy: Least Recently Used (LRU) is perfect here. Old, past events naturally stop getting queried and fall out of memory to make room for hot, upcoming events.
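
The look-aside read path and targeted invalidation can be sketched with a plain dict standing in for the Redis hash. Real code would use a Redis client (HGETALL for reads, HSET for the single-field overwrite) with a short TTL; the names here are illustrative.

```python
# A dict stands in for Redis; keys mirror the scheme above.
cache = {}  # "event:{id}:seats" -> {seat_id: status}

def get_seat_map(event_id, db_fetch):
    """Look-aside read: try the cache, fall back to the database on a miss."""
    key = f"event:{event_id}:seats"
    if key in cache:                 # cache hit: the database is never touched
        return cache[key]
    seat_map = db_fetch(event_id)    # cache miss: read from PostgreSQL
    cache[key] = seat_map            # populate (real Redis: set a short TTL)
    return seat_map

def on_seat_held(event_id, seat_id):
    """Write path: overwrite only the changed field, like HSET on the hash."""
    key = f"event:{event_id}:seats"
    if key in cache:
        cache[key][seat_id] = "HELD"
```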

10. Storage, indexing, and media

  • Primary data storage: PostgreSQL handles the core entities. The write load is manageable on average, so a standard relational setup works for daily operations.
  • Indexes: The composite index on (event_id, status) is vital. When users load an event, we must quickly find all available seats among tens of thousands of rows without locking the database with full table scans.
  • Media storage: Large files (artist banners, stadium map SVGs) are stored in Object Storage.
  • Serving media: A pull based CDN is placed in front of Object Storage. When a user in London requests an image, the CDN pulls it once, caches it at a London edge server, and serves it to all subsequent local users instantly, drastically lowering read latency and bandwidth costs.

11. Scaling strategies

  • Simple version: For a small platform, a single load balancer, a few stateless app servers, one Redis instance, and one primary PostgreSQL database are sufficient.
  • Database replication: To handle read growth, we add PostgreSQL Read Replicas. All queries for browsing events or viewing past orders are routed to replicas. Only reservations and payments go to the Primary writer node.
  • Sharding by event: If write traffic grows too large for one primary database, we shard the Event_Seats table by event_id. This means a massive global pop star’s concert might live entirely on Shard A, while a local comedy show lives on Shard B. This isolates heavy events so they do not impact the rest of the system.
  • Virtual Waiting Room: This is our ultimate scaling tool. When 1 million users rush the site, the Load Balancer redirects excess traffic to a lightweight waiting room page. We only allow users through to the API Gateway at a rate our database can safely handle (for example, 2,000 users per minute). This effectively sheds non essential load and prevents backend meltdown.
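
The waiting room's admission logic boils down to a fixed-rate gate. A minimal sketch, assuming a simple per-minute window (a production system would use a distributed counter or queue position instead of in-process state):

```python
class WaitingRoomAdmitter:
    """Admit at most rate_per_minute users per window; everyone else waits."""

    def __init__(self, rate_per_minute):
        self.rate = rate_per_minute
        self.window = None      # which minute we are counting admissions for
        self.admitted = 0

    def try_admit(self, now_minute):
        if self.window != now_minute:         # new minute: reset the counter
            self.window, self.admitted = now_minute, 0
        if self.admitted < self.rate:
            self.admitted += 1
            return True                       # forward to the API Gateway
        return False                          # keep on the waiting-room page
```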

12. Reliability, failure handling, and backpressure

  • Removing single points of failure: All app servers are stateless and run in multiple Availability Zones. Redis is deployed as a cluster. PostgreSQL runs with automatic failover to a hot standby node.
  • Timeouts and retries: When calling the external payment gateway, we use strict timeouts (for example, 3 seconds).
  • Idempotency keys: If a user’s network connection drops and they tap “Pay” twice, we do not want to charge them twice. The API requires an idempotency_key (usually the reservation ID). The backend checks this key to ensure it only processes the charge once, returning a cached success response for the second tap.
  • Overload and backpressure: If database CPU spikes dangerously high, the API Gateway actively sheds load by routing all new users into the Virtual Waiting Room, prioritizing the compute resources for users who are already in the checkout flow.
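
The idempotency-key check can be sketched as a lookup before the charge and a cached response after it. The dict store is a stand-in; a real backend would keep processed keys in a durable store so replays survive a server restart.

```python
processed = {}  # idempotency_key -> cached response

def confirm_booking(idempotency_key, charge_fn):
    """Process each key at most once; replays return the cached response."""
    if idempotency_key in processed:
        return processed[idempotency_key]   # second tap: no second charge
    result = charge_fn()                    # calls the payment gateway once
    processed[idempotency_key] = result
    return result
```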

13. Security, privacy and abuse

  • Authentication and authorization: User sessions are secured using JWT tokens passed in the authorization header.
  • Handling sensitive user data: We never store raw credit card numbers. They are tokenized on the frontend and sent directly to a PCI compliant payment gateway (like Stripe). Personally Identifiable Information (PII) like names and emails are encrypted at rest.
  • Encryption: All data in transit is encrypted using TLS.
  • Abuse protection: Scalpers use bots to buy hundreds of tickets in seconds. We mitigate this using strict IP rate limiting, a Web Application Firewall (WAF) to detect bot patterns, mandatory CAPTCHAs before high demand sales, and hardcoded business limits (maximum 4 tickets per user).

14. Bottlenecks and next steps

  • Bottleneck: Database row lock contention. Even with sharding by event, 100,000 people trying to book seats for the exact same event at the exact same second will cause severe row level lock contention in PostgreSQL.
    • Next step: Move the seat hold locking mechanism entirely into Redis using single threaded Lua scripts. Redis can process tens of thousands of atomic lock checks per second in memory. We lock the seat in Redis instantly, and then asynchronously write the hold to PostgreSQL behind the scenes.
  • Bottleneck: Expired hold cleanup. Scanning millions of database rows every minute to find and release 10 minute expired holds is slow and wastes database CPU.
    • Next step: Utilize a delayed message queue. When a seat is held, push a message with a 10 minute delay. A worker picks it up exactly 10 minutes later, checks if the seat is still marked HELD (meaning unpaid), and releases it efficiently without table scans.
  • Summary of design:
    • We use a Relational Database with row level locking (SELECT FOR UPDATE) to strictly prevent double booking seats.
    • We leverage a Virtual Waiting Room as a powerful backpressure mechanism to throttle massive traffic spikes and protect backend services.
    • We rely heavily on Redis caching and a CDN to absorb the massive 100:1 read to write ratio.
    • We partition data using sharding by event to isolate database load and prevent noisy neighbor problems during mega ticket drops.
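
The Redis-based hold described under the first bottleneck can be sketched as follows. Redis executes a Lua script atomically on its single thread, so the check-then-set below cannot interleave with another client; this Python function is a stand-in for that script, with the key scheme and names chosen for illustration.

```python
def hold_seats_atomically(store, seat_keys, reservation_id, ttl_seconds=600):
    """Stand-in for the Redis Lua script: check every seat, then take them all.

    In Redis this whole function runs as one atomic script; the hold would be
    written with a TTL (e.g. SET key value EX 600) so it expires on its own,
    and persisted to PostgreSQL asynchronously afterwards.
    """
    if any(store.get(k) is not None for k in seat_keys):  # any seat taken?
        return False                                      # the whole hold fails
    for k in seat_keys:
        store[k] = reservation_id
    return True
```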