Real-Time Notification Systems (Push Notifications, In-App)
Notifications are how your application reaches users who are not actively looking at it.
A message notification pulls a user back into a chat app.
A price drop alert drives a purchase.
A security alert warns about an unauthorized login.
The system design challenge is delivering the right notification to the right user at the right time, reliably and at scale.
Notification Channels
Different channels have different delivery characteristics and infrastructure requirements.
Push notifications (mobile) are delivered through platform-specific services: Apple Push Notification Service (APNs) for iOS and Firebase Cloud Messaging (FCM) for Android. Your server sends a notification payload to APNs or FCM, and the platform delivers it to the user's device. You do not control delivery timing or guarantee delivery (the device might be offline, the user might have disabled notifications).
In-app notifications appear while the user is actively using your application. They are delivered through a persistent connection (WebSocket or SSE) between the client and your server. When a notification is triggered, the server pushes it to the user's active connection immediately. In-app notifications are the most reliable channel because you know the user is online and actively engaged.
Email notifications are handled by email delivery services (SendGrid, SES, Mailgun). They have the highest latency (seconds to minutes) but the broadest reach (every user has an email address). Email is appropriate for non-urgent notifications: daily summaries, weekly reports, marketing campaigns.
SMS notifications are delivered through SMS providers (Twilio, Vonage). They reach users without internet access but cost money per message and are limited to short text.
Notification System Architecture
A production notification system has several components.
Event producer. Any service in your system that triggers a notification publishes an event to a message queue or event stream. The order service publishes "OrderShipped." The security service publishes "SuspiciousLogin." The social service publishes "NewFollower."
Notification service. A dedicated service consumes these events, determines which users should be notified, resolves their notification preferences (does this user want push notifications for shipping updates? do they want email or in-app for social events?), and constructs the notification payload.
User preference store. A database (or cache) that stores each user's notification preferences: which channels they have enabled, which notification types they want, quiet hours during which notifications should be suppressed.
Channel dispatchers. Separate services for each delivery channel. The push dispatcher sends to APNs and FCM. The email dispatcher sends through SendGrid. The in-app dispatcher pushes through WebSocket connections. Separating dispatchers by channel lets each one scale independently and handle channel-specific retry logic.
Delivery tracking. Track the status of each notification: sent, delivered, opened, failed. This feeds analytics (notification open rates) and reliability monitoring (is the push notification failure rate spiking?).
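The event-to-dispatcher flow can be sketched in a few lines. This is a minimal illustration, not a production design: the event shape, the preference table, and the channel names are all hypothetical stand-ins for the preference store and dispatcher queues described above.

```python
from dataclasses import dataclass

# Hypothetical event shape for illustration.
@dataclass
class Event:
    event_type: str   # e.g. "OrderShipped", "NewFollower"
    user_id: str

# Stand-in for the user preference store: which channels each user
# has enabled for each notification type.
PREFERENCES = {
    ("alice", "OrderShipped"): ["push", "email"],
    ("bob", "OrderShipped"): ["in_app"],
}

def dispatch(event: Event) -> list:
    """Resolve preferences and route to channel dispatchers.

    Returns (channel, user_id) pairs that would be enqueued to the
    channel-specific dispatcher for that channel.
    """
    channels = PREFERENCES.get((event.user_id, event.event_type), [])
    return [(channel, event.user_id) for channel in channels]

print(dispatch(Event("OrderShipped", "alice")))
# [('push', 'alice'), ('email', 'alice')]
```

A real notification service would consume events from a queue and publish each (channel, user) pair to that channel's dispatcher queue rather than returning a list.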
Challenges at Scale
Fan-out: A celebrity with 10 million followers posts a photo. The notification system must generate 10 million individual notifications. Publishing 10 million events synchronously would take hours. Instead, the system fans out asynchronously: the "NewPost" event enters a fan-out service that reads the follower list and enqueues individual notification tasks to a high-throughput queue (Kafka or SQS). Worker pools process the queue and dispatch notifications to each channel.
Deduplication: A user should not receive the same notification twice. If the notification service processes the same event twice (because of at-least-once delivery), it must detect the duplicate and skip it. An idempotency key (event ID + user ID) stored in Redis with a short TTL handles this.
Rate limiting per user: Sending a user 50 notifications in one hour is harassment. Per-user rate limits (at most 10 push notifications per hour, at most 3 emails per day) protect user experience and prevent your messages from being classified as spam.
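Deduplication and per-user rate limiting can share one gate in front of the dispatchers. The sketch below simulates Redis with in-memory dictionaries (a real system would use Redis SETNX/EXPIRE for the idempotency keys and a counter or sorted set for the limit); the TTL and limit values are illustrative.

```python
import time

# In-memory stand-ins for Redis state.
seen_keys = {}    # idempotency key -> expiry timestamp
sent_times = {}   # user_id -> list of recent send timestamps

DEDUP_TTL = 3600          # remember processed events for an hour
MAX_PUSHES_PER_HOUR = 10  # per-user rate limit

def should_send(event_id, user_id, now=None):
    now = time.time() if now is None else now
    key = f"{event_id}:{user_id}"        # idempotency key: event ID + user ID
    if seen_keys.get(key, 0) > now:
        return False                     # duplicate delivery: skip
    seen_keys[key] = now + DEDUP_TTL
    window = [t for t in sent_times.get(user_id, []) if t > now - 3600]
    if len(window) >= MAX_PUSHES_PER_HOUR:
        return False                     # rate limit hit: suppress
    window.append(now)
    sent_times[user_id] = window
    return True

assert should_send("evt1", "alice", now=0.0) is True
assert should_send("evt1", "alice", now=1.0) is False  # duplicate skipped
```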
Presence and Online Status Tracking
Presence tracking shows whether a user is currently online, idle, or offline.
Chat applications (Slack, WhatsApp, Discord) display green dots next to active users.
Collaborative tools show who is currently viewing a document. Gaming platforms show who is available to play.
The Challenge
Presence seems simple: track whether a user has an active connection to the server. But at scale (millions of concurrent users), presence tracking becomes a distributed systems problem.
Each user maintains a connection (WebSocket) to one of many server instances.
Server A knows that users 1-1000 are connected to it. Server B knows about users 1001-2000.
When user 42 wants to know if user 1500 is online, user 42's server (A) needs to query user 1500's server (B).
With thousands of servers, point-to-point queries are impractical.
Architecture Approaches
Centralized Presence Store
All servers report their connected users to a shared store (Redis).
When a user connects, the server writes a presence record: presence:user_1500 = {server: B, status: online, last_seen: timestamp}. When any server needs to check a user's status, it reads from Redis. Redis handles the read/write throughput comfortably for millions of records.
Heartbeats maintain presence accuracy. Each client sends a heartbeat to the server every 30 seconds.
The server refreshes the presence record's TTL in Redis.
If a user disconnects without sending a clean "offline" signal (network drop, app crash), the presence record expires when its TTL runs out, and the user transitions to "offline."
A typical TTL is 60 to 90 seconds.
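The heartbeat-plus-TTL mechanism can be sketched with an in-memory dictionary standing in for Redis (production code would use SET with an EX option and refresh the TTL on each heartbeat); the record shape mirrors the example above.

```python
PRESENCE_TTL = 90  # seconds; survives a couple of missed 30s heartbeats

store = {}  # stand-in for Redis

def heartbeat(user_id, server, now):
    # Each heartbeat rewrites the record, refreshing its expiry.
    store[f"presence:{user_id}"] = {
        "server": server, "status": "online", "expires": now + PRESENCE_TTL,
    }

def status(user_id, now):
    record = store.get(f"presence:{user_id}")
    if record is None or record["expires"] <= now:
        return "offline"  # TTL ran out: no clean disconnect required
    return record["status"]

heartbeat("user_1500", "B", now=0)
assert status("user_1500", now=30) == "online"
assert status("user_1500", now=120) == "offline"  # heartbeats stopped
```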
Status Transitions
Online means the user has an active connection and recent heartbeat.
Idle means the user has an active connection but no recent activity (no clicks, no keystrokes for 5 minutes).
The client detects idle state locally and sends a status update to the server. Offline means no active connection and the presence record has expired.
Presence Fan-out
When user 42's status changes from offline to online, their contacts need to be notified.
For a user with 500 contacts, this means 500 presence update messages.
The fan-out strategy from the notification section applies: publish the status change to a message queue, and workers deliver the update to each contact's active connection.
For very large contact lists (a Slack workspace with 50,000 members), presence updates can be expensive.
Optimizations include only sending presence updates for users currently visible on the recipient's screen, batching presence updates (send one batch every 5 seconds rather than individual updates), and using a pub/sub model where clients subscribe to presence channels for the users they care about.
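The batching optimization can be sketched as a per-recipient buffer that is flushed on an interval instead of pushing every change individually. Names here are illustrative; the flush would run on a timer and write to each recipient's WebSocket.

```python
from collections import defaultdict

pending = defaultdict(dict)  # recipient -> {user: latest status}

def queue_update(recipient, user, new_status):
    # Later updates for the same user overwrite earlier ones, so a
    # quick online -> idle flicker collapses into one final state.
    pending[recipient][user] = new_status

def flush(recipient):
    # In production: called every few seconds, result pushed over the
    # recipient's active connection as a single batch message.
    return pending.pop(recipient, {})

queue_update("bob", "user_42", "online")
queue_update("bob", "user_42", "idle")
assert flush("bob") == {"user_42": "idle"}  # one message, latest state only
```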
Real-Time Collaboration (Operational Transforms, CRDTs)
Real-time collaboration lets multiple users edit the same document, spreadsheet, or design simultaneously.
Google Docs, Figma, and Notion all implement this.
The fundamental challenge is conflict resolution: when two users edit the same paragraph at the same instant, whose edit wins, and how do you merge them so both users see a consistent result?
Operational Transformation (OT)
OT is the algorithm that powers Google Docs. It works by transforming operations based on concurrent operations from other users.
Each edit is represented as an operation: insert("Hello", position=5), delete(position=10, length=3), or replace(position=7, text="world").
When user A and user B simultaneously edit the document, each sends their operation to the server.
The server determines the order and transforms one operation against the other so both produce the same result.
Example: the document contains "ABCD". User A inserts "X" at position 1 (result: "AXBCD"). Simultaneously, user B inserts "Y" at position 3 (result: "ABCYD").
Without transformation, applying both operations sequentially would produce different results depending on the order. OT transforms B's operation based on A's: since A inserted a character before position 3, B's position shifts to 4.
The result is "AXBCYD" regardless of the application order.
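The insert-against-insert transformation from this example can be written out directly. This is a deliberately minimal sketch covering only concurrent inserts; full OT also transforms deletes and needs a tie-break rule (typically by user or site ID) when two inserts land at the same position.

```python
def apply_insert(doc, pos, text):
    return doc[:pos] + text + doc[pos:]

def transform_insert(pos_b, pos_a, len_a):
    """Shift B's insert position if A's concurrent insert landed at or
    before it. (Equal positions are a tie-break case; here A wins.)"""
    return pos_b + len_a if pos_a <= pos_b else pos_b

doc = "ABCD"
# User A inserts "X" at 1; user B concurrently inserts "Y" at 3.
after_a = apply_insert(doc, 1, "X")        # "AXBCD"
b_pos = transform_insert(3, 1, len("X"))   # B's position shifts to 4
assert apply_insert(after_a, b_pos, "Y") == "AXBCYD"

# Applying in the opposite order converges to the same document.
after_b = apply_insert(doc, 3, "Y")        # "ABCYD"
a_pos = transform_insert(1, 3, len("Y"))   # stays 1 (A is before B)
assert apply_insert(after_b, a_pos, "X") == "AXBCYD"
```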
OT requires a central server to order operations and perform transformations. This introduces a single point of coordination but guarantees consistency.
The server is the authority on the document state, and all clients converge to its version.
CRDTs for Collaboration
CRDTs (covered in Part IV, Lesson 6) provide an alternative to OT that works without a central server. Each character in the document is assigned a unique, globally ordered identifier. Insertions create new identifiers between existing ones.
Deletions mark characters as tombstones (logically deleted but still present in the data structure).
Because identifiers are globally unique and the ordering is deterministic, all replicas converge to the same document state without a central coordinator.
Figma uses CRDTs for its real-time collaboration.
Each user's client can operate independently (even offline) and sync changes when connectivity is available. The CRDTs guarantee convergence without conflicts.
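The identifier-plus-tombstone idea can be shown with a toy sequence CRDT. This sketch uses fractional positions with a site-ID tie-break as the globally ordered identifiers; real implementations (Yjs, Automerge) use more robust identifier schemes, but the convergence argument is the same: replicas merge by union and sort deterministically.

```python
def insert(chars, left_pos, right_pos, ch, site):
    # New identifier falls between its neighbors; (pos, site) is
    # globally unique and deterministically ordered.
    pos = (left_pos + right_pos) / 2
    chars.append({"pos": pos, "site": site, "ch": ch, "deleted": False})

def render(chars):
    visible = [c for c in chars if not c["deleted"]]
    ordered = sorted(visible, key=lambda c: (c["pos"], c["site"]))
    return "".join(c["ch"] for c in ordered)

replica = []
insert(replica, 0.0, 1.0, "B", site=1)   # pos 0.5
insert(replica, 0.0, 0.5, "A", site=1)   # pos 0.25, orders before "B"
insert(replica, 0.5, 1.0, "C", site=2)   # pos 0.75, orders after "B"
assert render(replica) == "ABC"

replica[0]["deleted"] = True             # tombstone "B", don't remove it
assert render(replica) == "AC"
```

Because rendering depends only on the set of entries and their identifiers, any replica holding the same entries renders the same string, regardless of the order in which edits arrived.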
OT vs. CRDTs
| Aspect | OT | CRDTs |
|---|---|---|
| Central server | Required (orders operations) | Not required (peer-to-peer possible) |
| Offline support | Limited (operations must reach server) | Full (local edits merge on reconnect) |
| Complexity | Transformation logic can be complex | Data structure design is complex |
| Memory | Lower (no tombstones) | Higher (tombstones accumulate) |
| Adopted by | Google Docs, Microsoft Office Online | Figma, Yjs, Automerge |
For most applications, either approach works.
OT is well-understood and has decades of production use.
CRDTs are newer and enable peer-to-peer and offline-first architectures but require more memory management (garbage collecting tombstones).
Collaboration Infrastructure
A real-time collaboration system requires a document server that maintains the authoritative document state and applies operations, a presence layer that shows which users are currently viewing and editing (with cursors and selections), a transport layer (WebSockets) for low-latency bidirectional communication between clients and the server, a conflict resolution engine (OT or CRDT) that merges concurrent edits, and a persistence layer that periodically saves the document state to a database for durability.
Live Streaming and Video Delivery Architecture
Live streaming delivers video content to viewers in near-real-time, with latency from a few seconds (standard live streaming) to under a second (ultra-low-latency streaming).
The architecture is fundamentally different from on-demand video (like YouTube) because the content is being produced and consumed simultaneously.
Ingest
The broadcaster (a camera, a screen capture, a production studio) encodes the video feed and sends it to an ingest server.
The most common ingest protocol is RTMP (Real-Time Messaging Protocol), though newer alternatives like SRT (Secure Reliable Transport) and WHIP (WebRTC-HTTP Ingestion Protocol) are gaining adoption.
The ingest server receives the raw stream and passes it to the transcoding pipeline.
Transcoding
A single incoming stream must be converted into multiple quality levels (resolutions and bitrates) so viewers on different devices and network speeds can watch without buffering.
A 1080p stream at 6 Mbps might be transcoded into 720p at 3 Mbps, 480p at 1.5 Mbps, and 360p at 0.8 Mbps.
Transcoding is computationally intensive and happens in real time.
Cloud transcoding services (AWS MediaLive, Azure Media Services) or dedicated transcoding servers with hardware encoding (NVIDIA NVENC) handle this at scale.
Delivery
The transcoded streams are segmented into small chunks (typically 2 to 6 seconds each) and distributed through a CDN.
Viewers' players download these chunks sequentially, with the player selecting the appropriate quality level based on current bandwidth (adaptive bitrate streaming).
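The player's quality-selection logic can be sketched simply: pick the highest rung of the bitrate ladder that fits within measured bandwidth, with a safety margin. The ladder values match the transcoding example in this lesson; the 0.8 margin is an illustrative assumption.

```python
LADDER = [  # (label, bitrate in Mbps), highest quality first
    ("1080p", 6.0), ("720p", 3.0), ("480p", 1.5), ("360p", 0.8),
]

def select_quality(bandwidth_mbps, margin=0.8):
    # Reserve headroom so a small bandwidth dip doesn't stall playback.
    budget = bandwidth_mbps * margin
    for label, bitrate in LADDER:
        if bitrate <= budget:
            return label
    return LADDER[-1][0]  # below everything: fall back to lowest rung

assert select_quality(10.0) == "1080p"  # 8.0 Mbps budget fits 6.0
assert select_quality(2.5) == "480p"    # 2.0 Mbps budget fits 1.5
assert select_quality(0.5) == "360p"    # floor
```

Real players also smooth bandwidth estimates over several chunks and avoid oscillating between adjacent rungs, but the core decision is this comparison, re-run before each chunk download.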
HLS (HTTP Live Streaming) by Apple is the most widely used delivery protocol. It uses HTTP-based chunk delivery and is supported by every major browser and device.
Standard HLS has a latency of 10 to 30 seconds (because the player buffers several chunks before playing).
Low-Latency HLS (LL-HLS) and Low-Latency CMAF reduce latency to 2 to 5 seconds by using smaller chunks and partial segment delivery.
WebRTC achieves sub-second latency by establishing peer-to-peer connections between the broadcaster and viewers.
WebRTC is used for video calls and ultra-low-latency streams (live auctions, sports betting, interactive live events).
The trade-off is that WebRTC is harder to scale to large audiences because each viewer connection consumes more server resources than HLS.
| Protocol | Latency | Scalability | Best For |
|---|---|---|---|
| HLS / DASH | 10-30 seconds | Excellent (CDN-based) | Standard live streaming, sports, events |
| LL-HLS / LL-CMAF | 2-5 seconds | Good (CDN-compatible) | Live events where moderate latency is acceptable |
| WebRTC | < 1 second | Limited (more server resources per viewer) | Video calls, auctions, interactive streams |
Chat and Interaction
Most live streams include a chat or interaction component.
A chat system (Part VIII, Lesson 2) alongside the stream uses WebSockets for real-time message delivery.
At large scale (100,000 concurrent viewers chatting), the chat system needs its own scaling strategy: message rate limiting (viewers can send at most 1 message per 2 seconds), message fan-out through pub/sub, and content moderation for filtering inappropriate messages.
Real-Time Gaming Backends
Online multiplayer games are among the most demanding real-time systems. They require extremely low latency (under 50ms for competitive games), high reliability (a disconnection can lose a ranked match), and complex state management (tracking the positions, actions, and interactions of many players simultaneously).
Game Server Architecture
Authoritative server model: The game server maintains the official game state. Players send their inputs (move, shoot, jump) to the server. The server validates the inputs (is this move legal? is this player allowed to perform this action?), updates the game state, and broadcasts the updated state to all players. This prevents cheating because the server is the authority on what happened, not the client.
Client-side prediction: Waiting for the server's response before showing the result of a player's action adds a round-trip delay (50-100ms) that makes the game feel sluggish. Client-side prediction shows the predicted result of the player's action immediately on their screen while the input is sent to the server. When the server's authoritative response arrives, the client reconciles its predicted state with the server's state. If they match, the experience is seamless. If they differ (the server rejected the action or another player's action changed the outcome), the client corrects its state, sometimes causing a visible "snap" or rubber-banding effect.
State synchronization: The server sends game state updates to all players at a fixed rate (often called the tick rate). A 64-tick server sends 64 state updates per second. Higher tick rates produce smoother gameplay but consume more bandwidth and server resources. Competitive shooters typically use 64 or 128 ticks. Casual games might use 20 to 30.
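The tick-rate numbers translate directly into update intervals and bandwidth. A quick sketch of the arithmetic, where the 120-byte snapshot size is an illustrative assumption:

```python
def tick_interval_ms(tick_rate):
    # Time between consecutive state broadcasts.
    return 1000.0 / tick_rate

def downstream_kbps(tick_rate, snapshot_bytes):
    # Per-player bandwidth for state updates alone.
    return tick_rate * snapshot_bytes * 8 / 1000.0

assert tick_interval_ms(64) == 15.625    # one update every ~15.6 ms
assert tick_interval_ms(128) == 7.8125   # 128-tick halves the interval
assert downstream_kbps(64, 120) == 61.44  # ~61 kbps per player
```

Doubling the tick rate halves the interval between updates but doubles both the bandwidth per player and the per-second simulation work on the server, which is why casual games settle for 20 to 30 ticks.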
Networking for Games
Games use UDP rather than TCP for state updates because TCP's reliability mechanisms (retransmission, ordering) add latency that is worse than occasional packet loss.
A dropped game state update is irrelevant once the next update arrives a few milliseconds later.
UDP lets the latest update arrive as quickly as possible without waiting for retransmission of stale data.
Custom reliability layers on top of UDP provide selective reliability: critical events (a player scored, a match ended) are sent reliably with acknowledgment and retransmission.
Frequent state updates (player positions) are sent unreliably and superseded by the next update.
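The unreliable path can be sketched with sequence numbers: the receiver applies an update only if it is newer than the last one seen, so late or out-of-order UDP packets are simply dropped rather than retransmitted.

```python
latest_seq = {}   # player -> newest sequence number applied
positions = {}    # player -> latest known position

def on_position_update(player, seq, pos):
    # Stale or duplicate packet: ignore it. The next fresh update
    # supersedes it anyway, so no retransmission is needed.
    if seq <= latest_seq.get(player, -1):
        return False
    latest_seq[player] = seq
    positions[player] = pos
    return True

assert on_position_update("p1", 1, (0, 0)) is True
assert on_position_update("p1", 3, (5, 2)) is True   # packet 2 was lost
assert on_position_update("p1", 2, (1, 1)) is False  # arrived late: dropped
assert positions["p1"] == (5, 2)
```

Critical events (scores, match end) would bypass this path and go through the reliable layer with acknowledgments instead.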
Matchmaking and Session Management
Before a game begins, the matchmaking system pairs players of similar skill levels.
The matchmaker maintains a queue of waiting players, scores each player on skill (using Elo, Glicko, or TrueSkill rating systems), and forms matches that minimize skill disparity while keeping queue times short.
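A minimal matchmaking pass can be sketched as a greedy pairing: sort the queue by rating and match neighbors whose skill gap is small enough. The 100-point gap is an illustrative threshold; real matchmakers also widen the acceptable gap as a player's queue time grows.

```python
def make_matches(queue, max_gap=100):
    """queue: list of (player, rating) pairs. Returns matched pairs;
    unmatched players stay in the queue for the next pass."""
    ordered = sorted(queue, key=lambda p: p[1])
    matches, i = [], 0
    while i + 1 < len(ordered):
        a, b = ordered[i], ordered[i + 1]
        if b[1] - a[1] <= max_gap:   # adjacent in skill: match them
            matches.append((a[0], b[0]))
            i += 2
        else:
            i += 1                   # leave a waiting for a closer peer
    return matches

queue = [("ann", 1500), ("bo", 1540), ("cy", 2100), ("di", 2150)]
assert make_matches(queue) == [("ann", "bo"), ("cy", "di")]
```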
Once matched, players are assigned to a game server instance.
The session management service tracks active game sessions, handles reconnection (if a player temporarily disconnects), and manages session cleanup when a game ends.
Scaling Game Servers
Game servers are stateful (each server holds the state for one or more active games) and cannot be load-balanced freely like stateless web servers. Scaling requires spinning up new game server instances as demand increases and routing new matches to servers with available capacity. Container orchestration (Kubernetes with game server operators like Agones) automates this scaling.
Geographic distribution matters for latency.
A player in Seoul connecting to a server in Virginia experiences 150ms+ latency, which is unplayable for competitive games.
Game servers run in multiple regions, and the matchmaker prioritizes placing players on servers close to them geographically.
IoT Data Ingestion and Processing Pipelines
Internet of Things (IoT) systems collect data from millions of devices: temperature sensors, industrial machines, smart home devices, connected vehicles, and wearable health monitors.
The engineering challenge is ingesting data from an enormous number of devices, processing it in real time, and acting on it within tight time constraints.
Ingestion Layer
IoT devices are typically resource-constrained: limited CPU, limited memory, unreliable network connections. The communication protocol must be lightweight.
MQTT (originally MQ Telemetry Transport) is the dominant protocol for IoT. It is a lightweight pub/sub protocol designed for constrained devices and unreliable networks. Messages are tiny (a few bytes of header overhead). Connections are persistent, so a device does not need to re-establish a connection for each message. MQTT supports three quality-of-service levels: fire-and-forget (QoS 0), acknowledged delivery (QoS 1), and exactly-once delivery (QoS 2).
AWS IoT Core, Azure IoT Hub, and Google Cloud IoT Core (deprecated, replaced by Pub/Sub) provide managed MQTT brokers that handle millions of simultaneous device connections.
For devices that communicate over HTTP (more common in consumer IoT), an API gateway or a lightweight ingestion service receives periodic data posts from devices.
Processing Pipeline
Raw IoT data flows through a multi-stage processing pipeline.
Real-time processing handles time-sensitive events. A temperature sensor reading above a critical threshold triggers an immediate alert. An industrial machine vibration pattern indicating imminent failure triggers a shutdown command. Stream processors (Flink, Kafka Streams) evaluate rules against incoming data and emit alerts or commands in milliseconds.
Near-real-time aggregation computes rolling metrics. The average temperature across all sensors in a building over the last 5 minutes. The count of active devices per region. These aggregations feed monitoring dashboards and trend detection.
Batch processing analyzes historical data. Identifying seasonal patterns in energy consumption. Training predictive maintenance models on months of machine sensor data. Batch pipelines (Spark) process data stored in the data lake or data warehouse.
Storage
IoT data is almost exclusively time-series: each data point is a value associated with a device ID and a timestamp. Time-series databases (InfluxDB, TimescaleDB, covered in Part IV, Lesson 2) are the natural storage choice for IoT data.
Data volume is a defining challenge.
A fleet of 1 million devices each sending one reading per second generates 1 million data points per second, or about 86 billion per day (86,400 seconds in a day).
Storage strategies include downsampling (storing per-second readings for the last hour, per-minute averages for the last week, per-hour averages for the last year), partitioning by device ID and time range, and tiered storage (recent data on fast SSDs, older data on cheap object storage).
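The downsampling step can be sketched directly: collapse fine-grained readings into per-bucket averages before moving them to a cheaper retention tier. Timestamps and values here are illustrative.

```python
def downsample(points, bucket_s=60):
    """points: list of (unix_timestamp, value) readings.
    Returns (bucket_start, mean_value) pairs, one per time bucket."""
    buckets = {}
    for ts, value in points:
        buckets.setdefault(ts - ts % bucket_s, []).append(value)
    return sorted((b, sum(v) / len(v)) for b, v in buckets.items())

# Per-30s readings collapsed into per-minute averages.
readings = [(0, 20.0), (30, 22.0), (60, 25.0), (90, 27.0)]
assert downsample(readings) == [(0, 21.0), (60, 26.0)]
```

Time-series databases like TimescaleDB and InfluxDB run this kind of aggregation natively (continuous aggregates, retention policies), so application code rarely needs to implement it by hand.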
Device Management
Beyond data ingestion, IoT systems need to manage the devices themselves: provisioning (registering new devices with authentication credentials), firmware updates (pushing new firmware to thousands of devices without bricking them), commands (sending instructions to devices: adjust thermostat, restart, enter maintenance mode), and health monitoring (detecting unresponsive devices, tracking battery levels, monitoring connectivity).
Managed IoT platforms (AWS IoT Core, Azure IoT Hub) provide device management capabilities alongside data ingestion.
Beginner Mistake to Avoid
New engineers sometimes design IoT ingestion pipelines with the same patterns they use for web APIs: synchronous HTTP requests with JSON payloads processed by a traditional web server.
At IoT scale (millions of devices, constant data streams), this falls apart.
HTTP is too heavy for constrained devices. JSON is too verbose for sensor readings that are a few numbers.
Synchronous processing cannot keep up with millions of messages per second.
Use MQTT for communication, compact binary formats for payloads, and stream processing for real-time analysis.
Design for the scale of IoT from the start, because the transition from a web API pattern to an IoT pattern is a fundamental architectural change, not a minor optimization.
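To make the payload-size point concrete, here is a compact binary encoding of a sensor reading using Python's struct module. The field layout (device ID, timestamp, temperature, humidity, occupancy) is an illustrative assumption, not a standard format.

```python
import struct

# Little-endian, no padding: uint32 device, uint32 timestamp,
# float32 temperature, float32 humidity, uint16 occupancy.
FMT = "<IIffH"

def encode(device, ts, temp, hum, occ):
    return struct.pack(FMT, device, ts, temp, hum, occ)

payload = encode(42, 1_700_000_000, 21.5, 48.0, 3)
assert len(payload) == 18  # vs roughly 80+ bytes for equivalent JSON
device, ts, temp, hum, occ = struct.unpack(FMT, payload)
assert (device, occ) == (42, 3)
assert abs(temp - 21.5) < 1e-6
```

An 18-byte payload over a persistent MQTT connection is a very different cost profile from a JSON body over a fresh HTTPS request, and that difference compounds across millions of devices.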
Interview-Style Question
Q: You are designing a system for a smart building platform that monitors 10,000 sensors across 500 buildings. Sensors report temperature, humidity, and occupancy every 10 seconds. The system must display real-time dashboards and send alerts when conditions exceed thresholds. How do you design this?
A: Ingestion: sensors connect to an MQTT broker (AWS IoT Core) and publish readings to topics organized by building and sensor type (building/42/temperature, building/42/humidity). At 10,000 sensors reporting every 10 seconds, the ingestion rate is 1,000 messages per second, well within managed MQTT broker capacity.
Processing: a Kafka cluster receives all sensor data from the MQTT broker (via a Kafka Connect MQTT connector). Two consumer pipelines run in parallel. A real-time alerting pipeline (Flink) evaluates threshold rules per sensor (temperature above 35°C triggers an alert) and per-building aggregates (average occupancy above 90% triggers a capacity alert). Alerts are published to a notification service (Part VII, Lesson 3 notification architecture). A dashboard pipeline (Kafka Streams or Flink) computes rolling 1-minute and 5-minute averages per building and writes to a time-series database (TimescaleDB).
Storage: raw readings are stored in TimescaleDB with per-second granularity retained for 7 days, per-minute aggregates retained for 90 days, and per-hour aggregates retained for 2 years. Older data is exported to S3 for archive.
Dashboards: Grafana connects to TimescaleDB, displaying real-time building conditions with drill-down from building overview to individual sensor readings.
The total data volume is manageable: 1,000 messages/second × ~100 bytes per message = 100 KB/s ingestion, about 8.6 GB per day of raw data.
KEY TAKEAWAYS
- Notification systems fan out events through channel-specific dispatchers (push, email, in-app, SMS). Deduplication, per-user rate limiting, and user preferences are essential at scale.
- Presence tracking uses a centralized store (Redis) with TTL-based heartbeats. Optimize fan-out for users with large contact lists by batching updates and sending only for visible contacts.
- Real-time collaboration uses OT (centralized, proven, Google Docs) or CRDTs (decentralized, offline-capable, Figma) to merge concurrent edits without conflicts.
- Live streaming transcodes a single ingest stream into multiple quality levels and delivers via CDN. HLS provides 10-30 second latency at massive scale. WebRTC provides sub-second latency at limited scale.
- Game backends use an authoritative server with client-side prediction. UDP carries state updates. Tick rate determines smoothness. Geographic distribution is critical for latency-sensitive games.
- IoT ingestion uses MQTT for lightweight device communication, Kafka for buffering, stream processors for real-time alerts, and time-series databases for storage. Downsampling and tiered storage manage the massive data volume.