1. Restate the Problem and Pick the Scope
We are designing a notification system that delivers messages to users across multiple channels -- push notifications (mobile), email, SMS, and in-app notifications. The system receives notification requests from many internal services (e.g., social feed, payments, marketing) and reliably delivers them to the right users through the right channels at the right time.
Main user groups and actions:
- End users -- receive notifications across channels (push, email, SMS, in-app) and manage their notification preferences (opt in/out of specific types, choose channels, set quiet hours).
- Internal services (producers) -- send notification requests to the system (e.g., "notify user X that someone liked their photo", "send order confirmation email to user Y").
- Operations/marketing teams -- send bulk notifications to large user segments (e.g., promotional campaigns).
Scope decisions:
- We will focus on the core notification pipeline: accepting requests, applying user preferences, routing to channels, and delivering through push/email/SMS/in-app.
- We will NOT cover: the UI for composing marketing campaigns, A/B testing of notification content, deep analytics dashboards, or the internals of third-party delivery services (APNs, FCM, Twilio, SendGrid). We treat those as external APIs we call.
2. Clarify Functional Requirements
Must-Have Features
- Internal services can send a notification request for a specific user or a group of users.
- The system supports four delivery channels: mobile push (iOS and Android), email, SMS, and in-app (stored in a notification inbox).
- Each user has notification preferences: which notification types they want, through which channels, and quiet hours (do not disturb).
- The system respects user preferences before sending -- if a user has opted out of marketing emails, the email is not sent.
- Notifications are delivered reliably -- at-least-once delivery. No notification should be silently lost.
- The system supports templated notifications -- producers send a notification type and variables (e.g., `{ type: "order_shipped", order_id: "123", carrier: "FedEx" }`), and the system renders the final message per channel.
- In-app notifications are stored and retrievable -- users can open their notification inbox and see a history of recent notifications.
- The system tracks delivery status per notification per channel (queued, sent, delivered, failed, opened).
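The templating requirement above can be illustrated with a minimal renderer. This is a sketch, not a specific library: the `render_template` helper is a hypothetical name, and the `{{variable}}` syntax follows the examples in this document.

```python
import re

def render_template(template: str, data: dict) -> str:
    """Substitute {{variable}} placeholders with values from the producer's data payload."""
    def substitute(match):
        key = match.group(1)
        if key not in data:
            # A missing variable should fail validation at the API, not at render time
            raise KeyError(f"missing template variable: {key}")
        return str(data[key])
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

# The producer sends only the type and variables; the system owns the per-channel template
payload = {"order_id": "123", "carrier": "FedEx"}
print(render_template("Your order #{{order_id}} has shipped via {{carrier}}!", payload))
# -> Your order #123 has shipped via FedEx!
```

The same variables render differently per channel because each channel has its own template string.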
Nice-to-Have Features
- Rate limiting per user -- do not bombard a user with more than N notifications per hour.
- Notification batching/digesting -- group multiple similar notifications into a single summary (e.g., "You have 5 new likes" instead of 5 separate notifications).
- Priority levels (urgent vs. normal vs. low) that affect delivery speed and channel selection.
Functional Requirements
3. Clarify Non-Functional Requirements
| Metric | Assumption / Target |
|---|---|
| Total users | 500 million registered users |
| Daily active users (DAU) | 100 million |
| Notifications sent per day | 1 billion (across all channels) |
| Read:Write ratio | Write-heavy for the pipeline (sending notifications); read-heavy for the in-app inbox |
| End-to-end delivery latency | < 5 seconds for urgent (transactional) notifications; < 30 seconds for normal; minutes acceptable for bulk/marketing |
| Availability | 99.99% -- lost or significantly delayed notifications directly impact user experience and revenue |
| Consistency | At-least-once delivery (duplicates are tolerable; lost notifications are not). Eventual consistency for inbox reads and analytics. |
| Data retention | In-app notifications retained for 90 days. Delivery logs retained for 30 days. |
Non-Functional Requirements
4. Back-of-the-Envelope Estimates
Write QPS (notifications entering the system)
Notifications/day = 1 billion
Write QPS (avg) = 1B / 86,400 ≈ 11,600 QPS
Peak QPS = 11,600 x 5 ≈ ~58,000 QPS
Each notification may fan out to multiple channels (push + in-app), so the downstream delivery QPS is higher:
Avg channels per notification ≈ 1.5
Delivery QPS (avg) = 11,600 x 1.5 ≈ 17,400 QPS
Peak delivery QPS = ~87,000 QPS
Read QPS (in-app inbox)
Assume 20% of DAU check their notification inbox once per day.
Inbox reads/day = 100M x 0.2 = 20 million
Read QPS (avg) = 20M / 86,400 ≈ 230 QPS
Peak read QPS = 230 x 5 ≈ ~1,150 QPS
Inbox reads are relatively low. The write/delivery path dominates.
Storage
In-app notifications: Each notification record is ~500 bytes (user_id, type, title, body, metadata, timestamp, read status).
Storage/day = 1B x 0.5 (fraction that go to in-app) x 500 bytes = 250 GB/day
Storage/90 days = 250 GB x 90 = ~22.5 TB
Delivery logs: Each log entry is ~200 bytes.
Log storage/day = 1B x 1.5 channels x 200 bytes = 300 GB/day
Log storage/30 days = 300 GB x 30 = ~9 TB
Bandwidth to Third-Party Providers
Push notifications are small (~1 KB each including headers).
Push bandwidth = 500M pushes/day x 1 KB = 500 GB/day ≈ ~46 Mbps average
Emails are larger (~10 KB average with HTML).
Email bandwidth = 300M emails/day x 10 KB = 3 TB/day ≈ ~280 Mbps
These are outbound to external providers - moderate and manageable.
Back-of-the-envelope estimation
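The arithmetic above can be double-checked in a few lines; the inputs are the assumptions from the Section 3 table, and the comments note how the rounded figures in the text were obtained.

```python
notifications_per_day = 1_000_000_000

avg_write_qps = notifications_per_day / 86_400      # ~11,574 (rounded to ~11,600 above)
peak_write_qps = avg_write_qps * 5                  # ~57,870 (~58K)
avg_delivery_qps = avg_write_qps * 1.5              # ~17,361 (~17.4K), 1.5 channels/notification
peak_delivery_qps = avg_delivery_qps * 5            # ~86,800 (~87K)

# Storage: half of notifications go to in-app at ~500 bytes each
inbox_bytes_per_day = notifications_per_day * 0.5 * 500   # 250 GB/day
inbox_bytes_90d = inbox_bytes_per_day * 90                # ~22.5 TB for the retention window

# Delivery logs: ~200 bytes per channel attempt
log_bytes_per_day = notifications_per_day * 1.5 * 200     # 300 GB/day
```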
5. API Design
5.1 Send Notification (Internal API, called by producer services)
| Field | Value |
|---|---|
| Method & Path | POST /api/v1/notifications |
| Request body | { "user_id": "abc123", "type": "order_shipped", "priority": "high", "data": { "order_id": "789", "carrier": "FedEx" }, "channels": ["push", "email", "in_app"] (optional -- if omitted, use user preferences) } |
| Success (202 Accepted) | { "notification_id": "notif_xyz", "status": "queued" } |
| Error codes | 400 -- invalid type or missing required data; 429 -- producer rate limited |
Returns 202 (Accepted), not 201, because the notification is queued for async processing, not delivered synchronously.
5.2 Send Bulk Notification (for campaigns)
| Field | Value |
|---|---|
| Method & Path | POST /api/v1/notifications/bulk |
| Request body | { "segment": "all_us_users", "type": "promo_summer_sale", "data": { ... }, "scheduled_at": "2026-04-08T10:00:00Z" (optional) } |
| Success (202) | { "batch_id": "batch_456", "estimated_recipients": 50000000, "status": "queued" } |
5.3 Get In-App Notifications (User-Facing)
| Field | Value |
|---|---|
| Method & Path | GET /api/v1/users/{user_id}/notifications?cursor={cursor}&limit=20 |
| Success (200) | { "notifications": [ { "id", "type", "title", "body", "created_at", "read": false, "action_url": "..." } ... ], "next_cursor": "...", "unread_count": 7 } |
| Error codes | 401 -- not authenticated |
5.4 Mark Notification as Read
| Field | Value |
|---|---|
| Method & Path | PATCH /api/v1/notifications/{notification_id} |
| Request body | { "read": true } |
| Success (200) | { "id": "notif_xyz", "read": true } |
5.5 Update Notification Preferences
| Field | Value |
|---|---|
| Method & Path | PUT /api/v1/users/{user_id}/notification-preferences |
| Request body | { "channels": { "push": true, "email": true, "sms": false }, "types": { "marketing": { "push": false, "email": true }, "social": { "push": true, "email": false } }, "quiet_hours": { "start": "22:00", "end": "07:00", "timezone": "Asia/Karachi" } } |
| Success (200) | Updated preferences object |
6. High-Level Architecture
Component Responsibilities
- Notification API -- the entry point. Validates the request (valid user, valid notification type, required data present), assigns a notification_id, and publishes to Kafka. Returns 202 immediately. The producer never waits for delivery.
- Kafka (Main Queue) -- the backbone of the system. Decouples producers from the processing pipeline. Absorbs traffic spikes (peak 58K QPS) and ensures no notification is lost. Each notification is a message on a topic, partitioned by user_id for ordering.
- Notification Processing Workers -- the brain of the system. They consume from Kafka and for each notification:
  - Look up the user's preferences and device tokens.
  - Determine which channels to use (respecting opt-outs and quiet hours).
  - Render the message using the template service (different formats for push vs. email vs. SMS).
  - Apply rate limiting (has this user received too many notifications recently?).
  - Publish to channel-specific queues.
- Channel-Specific Queues and Senders -- separate queues for push, email, SMS, and in-app. Each has its own pool of sender workers that call the external provider APIs (APNs for iOS push, FCM for Android push, SendGrid/SES for email, Twilio for SMS). Separating channels allows independent scaling and failure isolation -- if the SMS provider is down, push and email continue unaffected.
- In-App Writer -- writes notification records to the Notification Inbox DB. This is the only channel that is internal (no external provider).
- User Preferences DB -- stores each user's notification preferences (which types, which channels, quiet hours). Read on every notification processing step.
- Template Service -- stores notification templates per type per channel. For example, `order_shipped` has a push template ("Your order #{{order_id}} has shipped via {{carrier}}!"), an email template (full HTML with tracking details), and an SMS template (short text). The processing worker sends the type and variables, and the template service returns the rendered message.
- Delivery Status Tracker -- after each sender worker attempts delivery, it reports the result (success, failure, bounced, etc.) to a delivery log. This powers retry logic and debugging.
High-level Architecture
7. Data Model
Database Choice
- PostgreSQL for user preferences and notification templates. These are structured, relational, and benefit from strong consistency (a preference change must take effect immediately).
- Cassandra (or DynamoDB) for the in-app notification inbox. This is a write-heavy, append-only workload with a simple access pattern (all notifications for a user, sorted by time). Cassandra excels at high write throughput and time-series-like queries.
- Cassandra or a time-series store for delivery logs. Same reasoning: high write volume, simple queries (by notification_id or user_id + time range).
- Redis for user device tokens and rate limiting counters.
Table: user_preferences (PostgreSQL)
| Column | Type | Notes |
|---|---|---|
| user_id | BIGINT PK | |
| channel_push | BOOLEAN | Global push opt-in/out |
| channel_email | BOOLEAN | Global email opt-in/out |
| channel_sms | BOOLEAN | Global SMS opt-in/out |
| type_overrides | JSONB | Per-type per-channel overrides, e.g., { "marketing": { "email": false } } |
| quiet_start | TIME | Quiet hours start |
| quiet_end | TIME | Quiet hours end |
| timezone | VARCHAR | User's timezone |
| updated_at | TIMESTAMP | |
Table: device_tokens (Redis hash)
Key: devices:{user_id}
Value: { "ios": ["token1", "token2"], "android": ["token3"] }
Stored in Redis for fast lookup. Updated when the user's app registers or refreshes a token.
Table: notification_inbox (Cassandra)
| Column | Type | Notes |
|---|---|---|
| user_id | PARTITION KEY | All notifications for a user on one partition |
| created_at | CLUSTERING KEY (DESC) | Newest first |
| notification_id | UUID | Unique ID |
| type | VARCHAR | e.g., "order_shipped" |
| title | TEXT | Rendered title |
| body | TEXT | Rendered body |
| action_url | TEXT | Deep link when tapped |
| read | BOOLEAN | Default false |
Query supported: SELECT * FROM notification_inbox WHERE user_id = ? ORDER BY created_at DESC LIMIT 20 -- fetches the latest 20 notifications for a user. This is exactly how the in-app inbox API works.
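A sketch of how the inbox API pages through that partition with a `created_at` cursor. This is an in-memory stand-in for the Cassandra read; `fetch_inbox` and the row shape are hypothetical names chosen to mirror the table above.

```python
def fetch_inbox(rows, user_id, limit=20, cursor=None):
    """Return the newest `limit` notifications for a user, older than `cursor` if given."""
    matching = sorted(
        (r for r in rows
         if r["user_id"] == user_id and (cursor is None or r["created_at"] < cursor)),
        key=lambda r: r["created_at"],
        reverse=True,  # mirrors the CLUSTERING KEY (DESC) ordering
    )
    page = matching[:limit]
    # A full page means there may be older rows; hand the client a cursor to continue
    next_cursor = page[-1]["created_at"] if len(page) == limit else None
    return page, next_cursor

rows = [{"user_id": "abc123", "created_at": t, "title": f"n{t}"} for t in range(50)]
page, cur = fetch_inbox(rows, "abc123", limit=20)
# First page holds created_at 49..30; cur == 30 resumes at the next page
```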
Table: delivery_logs (Cassandra)
| Column | Type | Notes |
|---|---|---|
| notification_id | PARTITION KEY | |
| channel | CLUSTERING KEY | push, email, sms, in_app |
| status | VARCHAR | queued, sent, delivered, failed, bounced |
| provider_response | TEXT | Raw response from APNs/FCM/SendGrid/Twilio |
| attempted_at | TIMESTAMP | |
| delivered_at | TIMESTAMP | Nullable |
Table: notification_templates (PostgreSQL)
| Column | Type | Notes |
|---|---|---|
| type | VARCHAR | e.g., "order_shipped" |
| channel | VARCHAR | push, email, sms |
| template | TEXT | Template string with {{variable}} placeholders |
| subject | TEXT | Email subject template (nullable for non-email) |
| updated_at | TIMESTAMP | |
| Composite PK | (type, channel) | |
8. Core Flows - End to End
Flow 1: Send a Transactional Notification (e.g., "Your order has shipped")
This is the most common flow. An internal service wants to notify a single user about a specific event.
- Step 1 -- Producer service sends the request. The Order Service detects that order #789 has shipped. It calls the Notification API: `POST /api/v1/notifications` with `{ user_id: "abc123", type: "order_shipped", priority: "high", data: { order_id: "789", carrier: "FedEx" } }`. The request hits the load balancer and reaches a Notification API instance.
- Step 2 -- Notification API validates and enqueues. The API validates: is `user_id` a real user? Is `order_shipped` a registered notification type? Are the required template variables (`order_id`, `carrier`) present? If valid, the API generates a `notification_id`, publishes a message to Kafka on the `notifications` topic (partitioned by `user_id` for ordering), and immediately returns HTTP 202 to the producer. The Order Service does not wait for delivery -- it continues processing. Why 202, not 200? The notification is accepted for processing but not yet delivered. The producer does not need to know when or how it is delivered. This decoupling is critical for reliability and latency.
- Step 3 -- Processing worker consumes the message. A Notification Processing Worker picks up the message from Kafka. Now begins the decision-making pipeline:
- Step 4 -- Look up user preferences. The worker reads the user's preferences from PostgreSQL (cached in Redis for hot users). User abc123 has: push enabled, email enabled, SMS disabled, no type-level overrides for `order_shipped`, quiet hours from 10 PM to 7 AM (Asia/Karachi timezone). The current time in the user's timezone is 3 PM -- not in quiet hours. So the eligible channels are: push and email (SMS is globally disabled by the user).
- Step 5 -- Check rate limits. The worker checks a Redis counter: how many notifications has user abc123 received in the last hour? If the count is below the threshold (e.g., 30/hour), proceed. If above, this notification is still sent because it is priority "high" (transactional). Rate limiting primarily throttles low-priority and marketing notifications.
- Step 6 -- Render templates. The worker calls the Template Service (or looks up templates from a local cache) for `order_shipped` on each eligible channel:
  - Push template: "Your order #{{order_id}} has shipped via {{carrier}}! Track it now." Rendered: "Your order #789 has shipped via FedEx! Track it now."
  - Email template: Full HTML email with order details, tracking link, and branded header. Rendered with the same variables.
- Step 7 -- Route to channel queues. The worker publishes two messages:
  - To the `push-queue`: `{ notification_id, user_id, title: "Order Shipped", body: "Your order #789...", action_url: "/orders/789", device_tokens: ["token1", "token3"] }`
  - To the `email-queue`: `{ notification_id, user_id, to: "user@email.com", subject: "Your Order Has Shipped!", html_body: "..." }`
  - It also writes the notification to the `in-app-queue` (in-app notifications are always stored regardless of other channel preferences).
- Step 8 -- Push Sender Worker delivers the push notification. A Push Sender Worker consumes from the push queue. It sends the notification to APNs (for the iOS token) and FCM (for the Android token) via their respective APIs. The worker sets a 5-second timeout for each call.
  - APNs returns success (200). The worker logs `{ notification_id, channel: "push", device: "ios", status: "sent" }` to the delivery logs.
  - FCM returns an error (invalid token). The worker logs the failure and marks the Android token as invalid in Redis (so future notifications skip it). It does NOT retry for an invalid token -- retries are only for transient errors (5xx, timeout).
- Step 9 -- Email Sender Worker delivers the email. Similarly, an Email Sender Worker calls SendGrid's API with the rendered HTML email. SendGrid returns 202 (accepted). The worker logs `status: "sent"`. SendGrid will later send a webhook callback when the email is actually delivered (or bounces), which updates the delivery log.
- Step 10 -- In-App Writer stores the notification. An In-App Writer worker inserts a row into the Cassandra `notification_inbox` table for user abc123. This is a fast, single-partition write.
- Step 11 -- User receives the notifications.
  - Push: The user's phone displays a push notification within 1-3 seconds of Step 8. They see: "Order Shipped -- Your order #789 has shipped via FedEx! Track it now." Tapping it deep-links to the order tracking page.
  - Email: The email arrives in the user's inbox within 30-60 seconds (email delivery is inherently slower due to SMTP relays).
  - In-app: Next time the user opens the app, their notification inbox shows the notification with an unread badge. The unread count ("7") is shown on the bell icon.
- Total end-to-end latency for push: ~2-5 seconds from the Order Service's API call to the push notification appearing on the user's phone. Most of this time is Kafka consumer processing and the APNs/FCM round-trip.
Flow 1
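The decision-making in Steps 4-5 (preference and quiet-hour checks) can be sketched as a pure function. The preference shape mirrors the API in Section 5.5; the function name is hypothetical.

```python
from datetime import time

def eligible_channels(prefs: dict, notif_type: str, now_local: time) -> list:
    """Resolve channels: global opt-outs, then per-type overrides, then quiet hours."""
    channels = [c for c, enabled in prefs["channels"].items() if enabled]
    overrides = prefs.get("types", {}).get(notif_type, {})
    channels = [c for c in channels if overrides.get(c, True)]
    start, end = prefs["quiet_hours"]["start"], prefs["quiet_hours"]["end"]
    # Quiet windows may cross midnight (e.g., 22:00 -> 07:00)
    in_quiet = (now_local >= start or now_local < end) if start > end else (start <= now_local < end)
    return [] if in_quiet else channels  # empty = defer until quiet hours end

prefs = {
    "channels": {"push": True, "email": True, "sms": False},
    "types": {"marketing": {"push": False}},
    "quiet_hours": {"start": time(22, 0), "end": time(7, 0)},
}
print(eligible_channels(prefs, "order_shipped", time(15, 0)))  # -> ['push', 'email']
```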
Flow 2: User Opens Their In-App Notification Inbox
This is the read path for in-app notifications.
- Step 1 -- User taps the notification bell icon. The app sends `GET /api/v1/users/abc123/notifications?limit=20` to the load balancer, which routes to an API server.
- Step 2 -- API server checks the cache. The server first checks Redis for a cached version of this user's recent notifications: `inbox:{user_id}`. If the cache has a fresh result (e.g., cached within the last 60 seconds), return it immediately. For most users who check their inbox rarely, this will be a cache miss.
- Step 3 -- Cache miss: query Cassandra. The server queries: `SELECT * FROM notification_inbox WHERE user_id = 'abc123' ORDER BY created_at DESC LIMIT 20`. This is a single-partition read in Cassandra -- very fast (~5-10 ms). Cassandra returns the 20 most recent notifications.
- Step 4 -- Compute unread count. The server also needs the unread count for the badge. Rather than counting all unread notifications in Cassandra (potentially expensive), we maintain an `unread_count` counter in Redis: `unread:{user_id}`. This counter is incremented by 1 every time the In-App Writer stores a new notification, and decremented when the user marks a notification as read. The server reads this counter: `GET unread:abc123` returns `7`.
- Step 5 -- Populate cache and return. The server writes the notification list to Redis with a 60-second TTL: `SET inbox:abc123 [serialized list] EX 60`. It returns the JSON response to the client.
- Step 6 -- Client renders the inbox. The app displays the list of notifications with titles, bodies, timestamps, and read/unread status. The bell icon shows "7" as the unread badge. Total latency: ~20-50 ms.
- Step 7 -- User reads a notification. The user taps on "Order Shipped." The app sends `PATCH /api/v1/notifications/notif_xyz` with `{ read: true }`. The server updates the Cassandra row (`read = true`), decrements the Redis unread counter (`DECR unread:abc123`), and invalidates the inbox cache (`DEL inbox:abc123`). The badge updates to "6".
Flow 2
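The Redis bookkeeping in this flow (the unread counter plus cache invalidation) reduces to two operations. Here is an in-memory stand-in; `InboxCache` is a hypothetical class, and the equivalent Redis commands are noted in comments.

```python
class InboxCache:
    """In-memory stand-in for the Redis state used in Flow 2."""
    def __init__(self):
        self.unread = {}  # unread:{user_id} counters
        self.inbox = {}   # inbox:{user_id} cached pages (60s TTL in production)

    def on_new_notification(self, user_id):
        self.unread[user_id] = self.unread.get(user_id, 0) + 1  # INCR unread:{user_id}
        self.inbox.pop(user_id, None)                           # DEL inbox:{user_id}

    def on_read(self, user_id):
        self.unread[user_id] = max(0, self.unread.get(user_id, 0) - 1)  # DECR, floored at 0
        self.inbox.pop(user_id, None)                                    # invalidate cached page
```

Both writes invalidate the cached inbox page so the next read rebuilds it from Cassandra; the nightly reconciliation job described in Section 9 corrects any counter drift.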
Flow 3: Sending a Bulk Marketing Notification to Millions of Users
This flow is different from transactional notifications because it involves massive fan-out.
- Step 1 -- Marketing team triggers a campaign. They call `POST /api/v1/notifications/bulk` with a segment ("all US users who have been active in the last 30 days") and the notification type (`promo_summer_sale`). The API validates the request and creates a `batch_id`.
- Step 2 -- Segment resolution (async). A Batch Worker picks up the job and queries the user database for all user_ids matching the segment. This might return 50 million user IDs. The worker does NOT try to process all 50 million at once.
- Step 3 -- Chunk and enqueue. The worker splits the 50 million user IDs into chunks of 1,000. For each chunk, it publishes a message to a dedicated `bulk-notifications` Kafka topic: `{ batch_id, user_ids: [1000 ids], type: "promo_summer_sale", data: {...} }`. This creates 50,000 Kafka messages.
- Step 4 -- Bulk Processing Workers consume chunks. A pool of Bulk Processing Workers consumes from the `bulk-notifications` topic. For each chunk of 1,000 users, the worker:
  - Batch-reads preferences for all 1,000 users.
  - Filters out users who have opted out of marketing notifications.
  - Filters out users in quiet hours.
  - Applies rate limiting (skip users who have received too many notifications today).
  - For the remaining users, renders templates and publishes to channel queues.
  Why chunk instead of sending 50 million individual messages? Batching reduces Kafka overhead (50,000 messages vs. 50 million), allows efficient batch reads for preferences, and makes it easy to pause/resume or cancel a campaign mid-flight.
- Step 5 -- Channel senders deliver at a controlled rate. The push/email/SMS sender workers process the campaign messages, but with rate controls to avoid overwhelming external providers (APNs, SendGrid, Twilio all have rate limits). The system spreads delivery over 30-60 minutes rather than trying to send all 50 million at once.
- Step 6 -- Progress tracking. The Delivery Status Tracker aggregates results per batch_id: 50M targeted, 45M eligible (after preference filtering), 44M sent, 43M delivered, 500K failed, 500K pending. The marketing team can check progress via an internal dashboard.
- What the user sees: Over the course of an hour, users receive a push notification and/or email about the summer sale. Users who opted out see nothing. Users in quiet hours receive the notification after quiet hours end (the system queues it and releases it at the right time).
Flow 3
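Step 3's chunking is only a few lines. A sketch, scaled down for illustration; the chunk size of 1,000 and the message shape match the flow above.

```python
def chunk(user_ids, size=1000):
    """Split a resolved segment into fixed-size chunks, one Kafka message each."""
    for i in range(0, len(user_ids), size):
        yield user_ids[i:i + size]

resolved = list(range(50_000))  # stand-in for the 50M IDs a real segment resolves to
messages = [
    {"batch_id": "batch_456", "user_ids": ids, "type": "promo_summer_sale"}
    for ids in chunk(resolved)
]
# 50,000 IDs at 1,000 per chunk -> 50 messages on the bulk-notifications topic
```

Because each chunk is an independent message, pausing or cancelling a campaign is just a matter of stopping consumption of the remaining chunks for that `batch_id`.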
9. Caching and Read Performance
What We Cache
- User preferences (Redis): `prefs:{user_id}` -- the user's channel and type preferences. Read on every single notification processed (~17K QPS). Caching avoids hitting PostgreSQL on every notification.
- Device tokens (Redis hash): `devices:{user_id}` -- the user's push notification tokens. Read every time a push is sent.
- Notification inbox (Redis): `inbox:{user_id}` -- the most recent 20 notifications for the in-app inbox. Short TTL (60 seconds). Only populated when a user actually views their inbox.
- Unread count (Redis counter): `unread:{user_id}` -- atomic counter for the badge number.
- Templates (local in-memory cache on processing workers): Templates change rarely. Workers cache them locally and refresh every 5 minutes.
Where the Cache Sits
For the notification processing path: Workers check Redis for preferences and device tokens before touching PostgreSQL.
For the inbox read path: API servers check Redis for the cached inbox before querying Cassandra.
Cache Update and Invalidation
- Preferences cache: Written to Redis when a user updates their preferences. TTL of 1 hour as a safety net. If the cache is missing, read from PostgreSQL and populate.
- Device tokens cache: Updated when the app registers a new token. No TTL (tokens are managed explicitly). Invalid tokens are removed when a push provider reports them.
- Inbox cache: Invalidated (deleted) whenever a new notification is written to the inbox or a notification is marked as read. Short TTL (60 seconds) prevents serving very stale data.
- Unread count: Incremented atomically on each new notification, decremented on each read. If the counter gets out of sync (unlikely but possible), a background job reconciles it against Cassandra nightly.
Eviction Policy
LRU on Redis. Active users' preferences and tokens stay warm. Inactive users' data is evicted and reloaded from the database on the next notification. This is fine because the database read adds only a few milliseconds, and inactive users receive notifications infrequently.
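The preference path described in this section is the classic cache-aside pattern. A sketch with dict stand-ins for Redis and PostgreSQL; the function name is hypothetical.

```python
def get_preferences(user_id, cache, db):
    """Cache-aside read: try Redis first, fall back to PostgreSQL, then repopulate.

    In production the cache write would carry the 1-hour TTL used as a safety net."""
    prefs = cache.get(user_id)
    if prefs is None:
        prefs = db[user_id]     # source of truth
        cache[user_id] = prefs  # SET prefs:{user_id} ... EX 3600
    return prefs

db = {"abc123": {"channels": {"push": True, "email": True, "sms": False}}}
cache = {}
get_preferences("abc123", cache, db)  # miss: reads PostgreSQL, warms the cache
get_preferences("abc123", cache, db)  # hit: served from Redis
```

If the Redis entry is evicted under LRU, the next notification simply takes the miss path and re-warms it, which is the behavior described above.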
10. Storage, Indexing, and Media
Primary Data Storage
- PostgreSQL: User preferences, notification templates, batch campaign metadata. Small datasets (~500M preference rows at ~200 bytes each ≈ 100 GB), comfortably handled by PostgreSQL with read replicas.
- Cassandra: In-app notification inbox (~22.5 TB for 90 days) and delivery logs (~9 TB for 30 days). Write-heavy, time-ordered, and partitioned by user_id (inbox) or notification_id (logs). Cassandra is purpose-built for this workload.
- Redis: Device tokens, preference cache, unread counters, rate limit counters. Fast reads, ~50-100 GB total.
Indexes
- notification_inbox: Partition key on `user_id`, clustering key on `created_at DESC`. No secondary indexes needed -- the only query is "recent notifications for a user."
- delivery_logs: Partition key on `notification_id`, clustering key on `channel`. Supports "what happened to this notification across all channels?"
- user_preferences: Primary key on `user_id` in PostgreSQL.
Media
The notification system itself has minimal media. Emails may contain images, but those are typically hosted on the product's CDN and referenced by URL in the email HTML. Push notifications may include a small icon/image URL pointing to the CDN. No new media storage is needed in the notification system itself.
Trade-offs
- Cassandra for inbox: Excellent write throughput and partition-based reads, but no support for complex queries (e.g., "all unread notifications across all users" requires a full scan). For analytics, we export to a data warehouse.
- Redis for unread counts: Fast atomic increments, but if Redis loses data (restart without persistence), counts may drift. The nightly reconciliation job fixes this.
11. Scaling Strategies
Version 1: Simple Setup
For a startup with millions of users:
- A single PostgreSQL instance for preferences and templates.
- A single Cassandra cluster (3 nodes) for inbox and logs.
- A single Redis instance for caching and counters.
- A small Kafka cluster (3 brokers) for the notification pipeline.
- 5-10 processing workers, 5 push senders, 5 email senders.
This handles ~1,000 notifications/second comfortably.
Growing the System
Kafka scaling: Add more partitions to the notification topic. Each partition is consumed by one worker, so more partitions = more parallel processing. At 58K peak QPS, we might need 50-100 partitions with 50-100 processing workers.
Cassandra scaling: Add more nodes to the cluster. Cassandra scales linearly -- doubling nodes roughly doubles throughput. At 1 billion notifications/day writing to inbox, we need significant write capacity. A 20-30 node Cassandra cluster handles this well.
Channel-specific scaling: Each channel (push, email, SMS) has its own queue and worker pool, scaled independently. Push sending is typically the highest volume (most users have push enabled). Email is second. SMS is lowest (expensive, used sparingly). If the email provider has a rate limit of 10K emails/second, we configure the email worker pool accordingly.
PostgreSQL for preferences: Add read replicas. The processing workers read preferences very frequently (17K QPS), but writes are rare (users update preferences infrequently). One primary + 2-3 replicas handles this easily.
Geographic distribution: Deploy the notification pipeline in multiple regions. Notifications for US users are processed by US workers calling US-region provider endpoints, reducing latency. Kafka topics can be region-specific.
Handling Bursts
- Kafka absorbs spikes. If 10 producer services simultaneously fire notifications (e.g., during a flash sale), Kafka buffers them. Processing workers consume at a steady rate. No notifications are lost; they may be slightly delayed.
- Bulk campaign throttling. Marketing campaigns (50M+ notifications) are spread over 30-60 minutes using a token-bucket rate limiter on the bulk processing workers. This prevents campaigns from starving transactional notifications.
- Priority queues. High-priority (transactional) notifications are on a separate Kafka topic consumed by dedicated workers. Marketing and low-priority notifications are on a separate topic. Under load, transactional notifications are never delayed by a marketing blast.
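The token-bucket throttle mentioned above can be sketched as follows. The clock is injectable so the behavior is deterministic and testable; a production version would simply use `time.monotonic()`.

```python
import time

class TokenBucket:
    """Paces bulk sends so a campaign never exceeds `rate` deliveries/second sustained."""
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens, self.last = capacity, clock()

    def allow(self, n=1.0):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller re-queues or sleeps before retrying
```

A bulk worker checks `allow()` before publishing each chunk to the channel queues; denied chunks wait, so transactional traffic on the separate topic is never crowded out.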
12. Reliability, Failure Handling, and Backpressure
Removing Single Points of Failure
- Kafka: Multi-broker cluster with replication factor 3. Survives broker failures.
- Cassandra: Multi-node cluster with replication factor 3. Survives node failures.
- PostgreSQL: Primary + synchronous standby with automatic failover.
- Redis: Redis Sentinel or Cluster for failover. If Redis goes down, workers fall back to reading preferences from PostgreSQL directly (slower but functional).
- Processing workers: Stateless. If one dies, Kafka reassigns its partitions to other workers. In-flight messages are retried (Kafka's consumer offset is committed only after successful processing).
At-Least-Once Delivery
The entire pipeline is designed for at-least-once delivery:
- Kafka consumers commit offsets only after the notification is fully processed and handed to channel queues.
- Channel senders commit only after the provider API returns success.
- If a worker crashes mid-processing, the message is redelivered to another worker (Kafka's consumer group rebalancing).
- Duplicates are possible but rare. Push notifications are inherently idempotent (APNs/FCM deduplicate by message ID). Emails and SMS may be duplicated in rare cases -- acceptable and preferable to losing notifications.
Timeouts and Retries
- Provider API timeouts: Push senders set a 5-second timeout for APNs/FCM. Email senders set a 10-second timeout for SendGrid. On timeout, the message is retried up to 3 times with exponential backoff (5s, 15s, 45s).
- Dead letter queue: After 3 retries, a failed notification is moved to a dead letter topic in Kafka. An alerting system monitors the DLQ and pages on-call engineers if it grows beyond a threshold.
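The retry delays above follow a geometric schedule; a one-liner makes the policy explicit (a sketch with the document's numbers as defaults):

```python
def backoff_schedule(base=5.0, factor=3.0, retries=3):
    """Delay in seconds before each retry attempt: 5s, 15s, 45s with these defaults."""
    return [base * factor ** attempt for attempt in range(retries)]

print(backoff_schedule())  # -> [5.0, 15.0, 45.0]
```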
Circuit Breakers
If APNs starts returning 5xx errors at a high rate, a circuit breaker on the push sender opens -- it stops sending to APNs for 60 seconds (probe with a single request to check recovery). During this window, push notifications are buffered in the push queue (Kafka retains them). When the circuit closes, delivery resumes. Meanwhile, email and in-app channels continue working.
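A minimal breaker for the push sender might look like this. A sketch: the threshold, the injected clock, and the class shape are illustrative, but the closed/open/half-open behavior matches the paragraph above.

```python
import time

class CircuitBreaker:
    """Closed -> open after repeated failures; half-open probe after the cooldown."""
    def __init__(self, failure_threshold=5, cooldown=60.0, clock=time.monotonic):
        self.failure_threshold, self.cooldown, self.clock = failure_threshold, cooldown, clock
        self.failures, self.opened_at = 0, None

    def allow_request(self):
        if self.opened_at is None:
            return True                                    # closed: normal operation
        if self.clock() - self.opened_at >= self.cooldown:
            return True                                    # half-open: let one probe through
        return False                                       # open: fail fast, Kafka retains the message

    def record_success(self):
        self.failures, self.opened_at = 0, None            # close the circuit

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()                  # trip open for `cooldown` seconds
```

While the breaker is open, the sender simply stops consuming from the push queue, so messages buffer in Kafka exactly as described above.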
Behavior Under Overload
- Shed marketing notifications. If the pipeline is overloaded, pause bulk campaign processing. Transactional notifications (order confirmations, security alerts) always get through.
- Rate limit producers. If a producer service is flooding the system (bug or misconfiguration), the API gateway enforces per-producer rate limits and returns 429.
- Drop delivery logs. Under extreme load, delivery log writes can be skipped. Notifications are still delivered; we just lose some debugging data temporarily.
13. Security, Privacy, and Abuse
Authentication and Authorization
- Producer authentication: Internal services authenticate with the Notification API using service-to-service credentials (mutual TLS or API keys). Only authorized services can send notifications.
- User-facing API authentication: The inbox and preference endpoints require a valid user session token (JWT). A user can only read their own notifications and update their own preferences.
- Admin API: Bulk send and template management require elevated permissions.
Encryption
- In transit: All internal communication over TLS. External API calls to APNs/FCM/SendGrid/Twilio use HTTPS.
- At rest: Cassandra and PostgreSQL disk encryption. Kafka log encryption. S3 encryption if used for any data.
Handling Sensitive Data
- PII in notifications: Notifications may contain personal data (names, order details, account information). This data flows through Kafka, workers, and external providers. Minimize PII in notification payloads -- use references (order_id) rather than full details where possible, and let the client fetch details via authenticated APIs.
- Device tokens: Treat push notification device tokens as sensitive. Store them encrypted, and invalidate them promptly when the provider reports them invalid.
- Email addresses and phone numbers: Used for email and SMS delivery. Access-controlled and encrypted at rest.
Abuse Protection
- Per-user rate limiting: No user receives more than 30 notifications per hour (configurable). This protects against buggy producer services that accidentally fire thousands of notifications to one user.
- Per-producer rate limiting: Each internal service has a quota. Prevents a single service from monopolizing the pipeline.
- Spam prevention for user-triggered notifications: If notifications are triggered by user actions (e.g., "someone liked your photo"), an attacker could spam likes to flood someone's inbox. Mitigation: batch similar notifications ("5 people liked your photo") and apply per-type-per-user rate limits.
- Unsubscribe compliance: All marketing emails must include an unsubscribe link (CAN-SPAM, GDPR). The notification system enforces this at the template level -- marketing email templates are required to include {{unsubscribe_url}}.
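Enforcing the placeholder at template-save time means a non-compliant campaign can never be created in the first place. A minimal validation sketch (function name and category strings are illustrative):

```python
import re

# Matches the {{unsubscribe_url}} template placeholder, allowing inner whitespace.
UNSUBSCRIBE_PLACEHOLDER = re.compile(r"\{\{\s*unsubscribe_url\s*\}\}")

def validate_template(template_body, category):
    """Reject marketing email templates that lack the unsubscribe placeholder.

    Called when a template is saved, so the check never has to run on the
    hot send path.
    """
    if category == "marketing" and not UNSUBSCRIBE_PLACEHOLDER.search(template_body):
        raise ValueError("marketing templates must include {{unsubscribe_url}}")
    return True
```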
14. Bottlenecks and Next Steps
Main Bottlenecks and Risks
- Bulk campaign fan-out. Sending 50 million notifications from a single campaign creates a massive spike in the pipeline. Mitigation (already in design): Chunking, separate bulk topic, rate-controlled delivery. Next step: Implement a dedicated campaign scheduler that spreads sends over hours based on optimal delivery times per timezone.
- Provider rate limits and outages. Third-party providers (APNs, FCM, SendGrid, Twilio) have rate limits and occasional outages. Mitigation (already in design): Per-channel queues, circuit breakers, retries. Next step: Support multiple providers per channel (e.g., SendGrid as primary email, SES as fallback). Automatic failover when the primary is down.
- User preference lookup at scale. Every notification requires a preference check (~17K QPS). Mitigation (already in design): Redis cache. Next step: If Redis becomes a bottleneck, batch preference lookups for bulk campaigns (fetch 1,000 preferences in one round-trip instead of 1,000 individual reads).
- Unread count accuracy. Redis counters can drift if increments/decrements are lost (e.g., during a Redis failover). Mitigation (already in design): Nightly reconciliation job. Next step: Use Cassandra's counter columns or a separate counter table for durability, with Redis as a fast read cache.
- Quiet hours across timezones. Checking quiet hours for every notification requires knowing the user's timezone and current local time. For bulk campaigns targeting users across many timezones, this means different delivery times per user. Next step: The bulk scheduler pre-groups users by timezone and schedules per-timezone delivery windows.
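The per-timezone grouping step for the bulk scheduler can be sketched with the standard library alone. This is an assumed shape (the function, the `(user_id, tz_name)` input format, and the 10:00 local send time are illustrative, not from the design above):

```python
from collections import defaultdict
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

def schedule_by_timezone(users, send_local_time=time(10, 0), after_utc=None):
    """Group campaign recipients by timezone; pick one UTC send time per group.

    `users` is an iterable of (user_id, tz_name). Each group is scheduled
    for the next occurrence of `send_local_time` in its local timezone,
    so every user gets the campaign at a reasonable local hour.
    """
    after_utc = after_utc or datetime.now(ZoneInfo("UTC"))
    groups = defaultdict(list)
    for user_id, tz_name in users:
        groups[tz_name].append(user_id)

    schedule = {}
    for tz_name, members in groups.items():
        local_now = after_utc.astimezone(ZoneInfo(tz_name))
        target = local_now.replace(hour=send_local_time.hour,
                                   minute=send_local_time.minute,
                                   second=0, microsecond=0)
        if target <= local_now:          # today's slot already passed
            target += timedelta(days=1)  # send tomorrow instead
        schedule[tz_name] = (target.astimezone(ZoneInfo("UTC")), members)
    return schedule
```

One scheduled job per timezone group (a few hundred at most) then feeds chunks into the bulk topic, instead of computing a delivery time per user.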
Design Summary
| Aspect | Decision | Key Trade-off |
|---|---|---|
| Architecture | Event-driven pipeline: API -> Kafka -> Processing Workers -> Channel Queues -> Senders | Fully async, decoupled; producers never wait for delivery. Slight delay (2-5s) is acceptable. |
| Channel isolation | Separate queues and worker pools per channel (push, email, SMS, in-app) | Independent scaling and failure isolation; more operational complexity |
| Delivery guarantee | At-least-once via Kafka offset management and retries | No lost notifications; rare duplicates are acceptable |
| In-app storage | Cassandra partitioned by user_id | High write throughput for inbox; limited query flexibility |
| User preferences | PostgreSQL + Redis cache | Strong consistency on write; fast reads from cache for processing |
| Bulk campaigns | Chunked fan-out with rate-controlled delivery on a separate pipeline | Prevents campaigns from starving transactional notifications |
This design is built around one core principle: the notification system is an async pipeline, not a synchronous API.
Producers fire and forget; the pipeline handles preference checking, template rendering, channel routing, delivery, and retry logic entirely in the background.
By using Kafka as the central bus, separating channels into independent queues, and caching aggressively for the high-frequency lookups (preferences, device tokens), the system reliably delivers a billion notifications per day. It absorbs provider outages, traffic spikes, and bulk campaigns gracefully, without impacting the latency of transactional notifications.