1. Restate the Problem and Pick the Scope
We are designing a notification system that delivers messages to users across multiple channels -- push notifications (mobile), email, SMS, and in-app notifications. The system receives notification requests from many internal services (e.g., social feed, payments, marketing) and reliably delivers them to the right users through the right channels at the right time.
Main user groups and actions:
- End users -- receive notifications across channels (push, email, SMS, in-app) and manage their notification preferences (opt in/out of specific types, choose channels, set quiet hours).
- Internal services (producers) -- send notification requests to the system (e.g., "notify user X that someone liked their photo", "send order confirmation email to user Y").
- Operations/marketing teams -- send bulk notifications to large user segments (e.g., promotional campaigns).
Scope decisions:
- We will focus on the core notification pipeline: accepting requests, applying user preferences, routing to channels, and delivering through push/email/SMS/in-app.
- We will NOT cover: the UI for composing marketing campaigns, A/B testing of notification content, deep analytics dashboards, or the internals of third-party delivery services (APNs, FCM, Twilio, SendGrid). We treat those as external APIs we call.
2. Clarify Functional Requirements
Must-Have Features
- Internal services can send a notification request for a specific user or a group of users.
- The system supports four delivery channels: mobile push (iOS and Android), email, SMS, and in-app (stored in a notification inbox).
- Each user has notification preferences: which notification types they want, through which channels, and quiet hours (do not disturb).
- The system respects user preferences before sending -- if a user has opted out of marketing emails, the email is not sent.
- Notifications are delivered reliably -- at-least-once delivery. No notification should be silently lost.
- The system supports templated notifications -- producers send a notification type and variables (e.g., `{ type: "order_shipped", order_id: "123", carrier: "FedEx" }`), and the system renders the final message per channel.
- In-app notifications are stored and retrievable -- users can open their notification inbox and see a history of recent notifications.
- The system tracks delivery status per notification per channel (queued, sent, delivered, failed, opened).
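The templating requirement above can be illustrated with a minimal renderer. This is a sketch, not a specific library: the `render_template` helper is a hypothetical name, and the `{{variable}}` syntax follows the examples in this document.

```python
import re

def render_template(template: str, data: dict) -> str:
    """Substitute {{variable}} placeholders with values from the producer's data payload."""
    def substitute(match):
        key = match.group(1)
        if key not in data:
            # A missing variable should fail validation at the API, not at render time
            raise KeyError(f"missing template variable: {key}")
        return str(data[key])
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

# The producer sends only the type and variables; the system owns the per-channel template
payload = {"order_id": "123", "carrier": "FedEx"}
print(render_template("Your order #{{order_id}} has shipped via {{carrier}}!", payload))
# -> Your order #123 has shipped via FedEx!
```

The same variables render differently per channel because each channel has its own template string.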
Nice-to-Have Features
- Rate limiting per user -- do not bombard a user with more than N notifications per hour.
- Notification batching/digesting -- group multiple similar notifications into a single summary (e.g., "You have 5 new likes" instead of 5 separate notifications).
- Priority levels (urgent vs. normal vs. low) that affect delivery speed and channel selection.
Functional Requirements
3. Clarify Non-Functional Requirements
| Metric | Assumption / Target |
|---|---|
| Total users | 500 million registered users |
| Daily active users (DAU) | 100 million |
| Notifications sent per day | 1 billion (across all channels) |
| Read:Write ratio | Write-heavy for the pipeline (sending notifications); read-heavy for the in-app inbox |
| End-to-end delivery latency | < 5 seconds for urgent (transactional) notifications; < 30 seconds for normal; minutes acceptable for bulk/marketing |
| Availability | 99.99% -- lost or significantly delayed notifications directly impact user experience and revenue |
| Consistency | At-least-once delivery (duplicates are tolerable; lost notifications are not). Eventual consistency for inbox reads and analytics. |
| Data retention | In-app notifications retained for 90 days. Delivery logs retained for 30 days. |
Non-Functional Requirements
4. Back-of-the-Envelope Estimates
Write QPS (notifications entering the system)
Notifications/day = 1 billion
Write QPS (avg) = 1B / 86,400 ≈ 11,600 QPS
Peak QPS = 11,600 x 5 ≈ ~58,000 QPS
Each notification may fan out to multiple channels (push + in-app), so the downstream delivery QPS is higher:
Avg channels per notification ≈ 1.5
Delivery QPS (avg) = 11,600 x 1.5 ≈ 17,400 QPS
Peak delivery QPS = ~87,000 QPS
Read QPS (in-app inbox)
Assume 20% of DAU check their notification inbox once per day.
Inbox reads/day = 100M x 0.2 = 20 million
Read QPS (avg) = 20M / 86,400 ≈ 230 QPS
Peak read QPS = 230 x 5 ≈ ~1,150 QPS
Inbox reads are relatively low. The write/delivery path dominates.
Storage
In-app notifications: Each notification record is ~500 bytes (user_id, type, title, body, metadata, timestamp, read status).
Storage/day = 1B x 0.5 (fraction that go to in-app) x 500 bytes = 250 GB/day
Storage/90 days = 250 GB x 90 = ~22.5 TB
Delivery logs: Each log entry is ~200 bytes.
Log storage/day = 1B x 1.5 channels x 200 bytes = 300 GB/day
Log storage/30 days = 300 GB x 30 = ~9 TB
Bandwidth to Third-Party Providers
Push notifications are small (~1 KB each including headers).
Push bandwidth = 500M pushes/day x 1 KB = 500 GB/day ≈ ~46 Mbps average
Emails are larger (~10 KB average with HTML).
Email bandwidth = 300M emails/day x 10 KB = 3 TB/day ≈ ~280 Mbps
These are outbound to external providers - moderate and manageable.
Back-of-the-envelope estimation
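The arithmetic above can be double-checked in a few lines; the inputs are the assumptions from the Section 3 table, and the comments note how the rounded figures in the text were obtained.

```python
notifications_per_day = 1_000_000_000

avg_write_qps = notifications_per_day / 86_400      # ~11,574 (rounded to ~11,600 above)
peak_write_qps = avg_write_qps * 5                  # ~57,870 (~58K)
avg_delivery_qps = avg_write_qps * 1.5              # ~17,361 (~17.4K), 1.5 channels/notification
peak_delivery_qps = avg_delivery_qps * 5            # ~86,800 (~87K)

# Storage: half of notifications go to in-app at ~500 bytes each
inbox_bytes_per_day = notifications_per_day * 0.5 * 500   # 250 GB/day
inbox_bytes_90d = inbox_bytes_per_day * 90                # ~22.5 TB for the retention window

# Delivery logs: ~200 bytes per channel attempt
log_bytes_per_day = notifications_per_day * 1.5 * 200     # 300 GB/day
```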
5. API Design
5.1 Send Notification (Internal API, called by producer services)
| Field | Value |
|---|---|
| Method & Path | POST /api/v1/notifications |
| Request body | { "user_id": "abc123", "type": "order_shipped", "priority": "high", "data": { "order_id": "789", "carrier": "FedEx" }, "channels": ["push", "email", "in_app"] (optional -- if omitted, use user preferences) } |
| Success (202 Accepted) | { "notification_id": "notif_xyz", "status": "queued" } |
| Error codes | 400 -- invalid type or missing required data; 429 -- producer rate limited |
Returns 202 (Accepted), not 201, because the notification is queued for async processing, not delivered synchronously.
5.2 Send Bulk Notification (for campaigns)
| Field | Value |
|---|---|
| Method & Path | POST /api/v1/notifications/bulk |
| Request body | { "segment": "all_us_users", "type": "promo_summer_sale", "data": { ... }, "scheduled_at": "2026-04-08T10:00:00Z" (optional) } |
| Success (202) | { "batch_id": "batch_456", "estimated_recipients": 50000000, "status": "queued" } |
5.3 Get In-App Notifications (User-Facing)
| Field | Value |
|---|---|
| Method & Path | GET /api/v1/users/{user_id}/notifications?cursor={cursor}&limit=20 |
| Success (200) | { "notifications": [ { "id", "type", "title", "body", "created_at", "read": false, "action_url": "..." } ... ], "next_cursor": "...", "unread_count": 7 } |
| Error codes | 401 -- not authenticated |
5.4 Mark Notification as Read
| Field | Value |
|---|---|
| Method & Path | PATCH /api/v1/notifications/{notification_id} |
| Request body | { "read": true } |
| Success (200) | { "id": "notif_xyz", "read": true } |
5.5 Update Notification Preferences
| Field | Value |
|---|---|
| Method & Path | PUT /api/v1/users/{user_id}/notification-preferences |
| Request body | { "channels": { "push": true, "email": true, "sms": false }, "types": { "marketing": { "push": false, "email": true }, "social": { "push": true, "email": false } }, "quiet_hours": { "start": "22:00", "end": "07:00", "timezone": "Asia/Karachi" } } |
| Success (200) | Updated preferences object |
6. High-Level Architecture
Component Responsibilities
- Notification API -- the entry point. Validates the request (valid user, valid notification type, required data present), assigns a notification_id, and publishes to Kafka. Returns 202 immediately. The producer never waits for delivery.
- Kafka (Main Queue) -- the backbone of the system. Decouples producers from the processing pipeline. Absorbs traffic spikes (peak 58K QPS) and ensures no notification is lost. Each notification is a message on a topic, partitioned by user_id for ordering.
- Notification Processing Workers -- the brain of the system. They consume from Kafka and for each notification:
  - Look up the user's preferences and device tokens.
  - Determine which channels to use (respecting opt-outs and quiet hours).
  - Render the message using the template service (different formats for push vs. email vs. SMS).
  - Apply rate limiting (has this user received too many notifications recently?).
  - Publish to channel-specific queues.
- Channel-Specific Queues and Senders -- separate queues for push, email, SMS, and in-app. Each has its own pool of sender workers that call the external provider APIs (APNs for iOS push, FCM for Android push, SendGrid/SES for email, Twilio for SMS). Separating channels allows independent scaling and failure isolation -- if the SMS provider is down, push and email continue unaffected.
- In-App Writer -- writes notification records to the Notification Inbox DB. This is the only channel that is internal (no external provider).
- User Preferences DB -- stores each user's notification preferences (which types, which channels, quiet hours). Read on every notification processing step.
- Template Service -- stores notification templates per type per channel. For example, `order_shipped` has a push template ("Your order #{{order_id}} has shipped via {{carrier}}!"), an email template (full HTML with tracking details), and an SMS template (short text). The processing worker sends the type and variables, and the template service returns the rendered message.
- Delivery Status Tracker -- after each sender worker attempts delivery, it reports the result (success, failure, bounced, etc.) to a delivery log. This powers retry logic and debugging.
High-level Architecture
7. Data Model
Database Choice
- PostgreSQL for user preferences and notification templates. These are structured, relational, and benefit from strong consistency (a preference change must take effect immediately).
- Cassandra (or DynamoDB) for the in-app notification inbox. This is a write-heavy, append-only workload with a simple access pattern (all notifications for a user, sorted by time). Cassandra excels at high write throughput and time-series-like queries.
- Cassandra or a time-series store for delivery logs. Same reasoning: high write volume, simple queries (by notification_id or user_id + time range).
- Redis for user device tokens and rate limiting counters.
Table: user_preferences (PostgreSQL)
| Column | Type | Notes |
|---|---|---|
| user_id | BIGINT PK | |
| channel_push | BOOLEAN | Global push opt-in/out |
| channel_email | BOOLEAN | Global email opt-in/out |
| channel_sms | BOOLEAN | Global SMS opt-in/out |
| type_overrides | JSONB | Per-type per-channel overrides, e.g., { "marketing": { "email": false } } |
| quiet_start | TIME | Quiet hours start |
| quiet_end | TIME | Quiet hours end |
| timezone | VARCHAR | User's timezone |
| updated_at | TIMESTAMP | |
Table: device_tokens (Redis hash)
Key: devices:{user_id}
Value: { "ios": ["token1", "token2"], "android": ["token3"] }
Stored in Redis for fast lookup. Updated when the user's app registers or refreshes a token.
Table: notification_inbox (Cassandra)
| Column | Type | Notes |
|---|---|---|
| user_id | PARTITION KEY | All notifications for a user on one partition |
| created_at | CLUSTERING KEY (DESC) | Newest first |
| notification_id | UUID | Unique ID |
| type | VARCHAR | e.g., "order_shipped" |
| title | TEXT | Rendered title |
| body | TEXT | Rendered body |
| action_url | TEXT | Deep link when tapped |
| read | BOOLEAN | Default false |
Query supported: SELECT * FROM notification_inbox WHERE user_id = ? ORDER BY created_at DESC LIMIT 20 -- fetches the latest 20 notifications for a user. This is exactly how the in-app inbox API works.
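A sketch of how the inbox API pages through that partition with a `created_at` cursor. This is an in-memory stand-in for the Cassandra read; `fetch_inbox` and the row shape are hypothetical names chosen to mirror the table above.

```python
def fetch_inbox(rows, user_id, limit=20, cursor=None):
    """Return the newest `limit` notifications for a user, older than `cursor` if given."""
    matching = sorted(
        (r for r in rows
         if r["user_id"] == user_id and (cursor is None or r["created_at"] < cursor)),
        key=lambda r: r["created_at"],
        reverse=True,  # mirrors the CLUSTERING KEY (DESC) ordering
    )
    page = matching[:limit]
    # A full page means there may be older rows; hand the client a cursor to continue
    next_cursor = page[-1]["created_at"] if len(page) == limit else None
    return page, next_cursor

rows = [{"user_id": "abc123", "created_at": t, "title": f"n{t}"} for t in range(50)]
page, cur = fetch_inbox(rows, "abc123", limit=20)
# First page holds created_at 49..30; cur == 30 resumes at the next page
```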
Table: delivery_logs (Cassandra)
| Column | Type | Notes |
|---|---|---|
| notification_id | PARTITION KEY | |
| channel | CLUSTERING KEY | push, email, sms, in_app |
| status | VARCHAR | queued, sent, delivered, failed, bounced |
| provider_response | TEXT | Raw response from APNs/FCM/SendGrid/Twilio |
| attempted_at | TIMESTAMP | |
| delivered_at | TIMESTAMP | Nullable |
Table: notification_templates (PostgreSQL)
| Column | Type | Notes |
|---|---|---|
| type | VARCHAR | e.g., "order_shipped" |
| channel | VARCHAR | push, email, sms |
| template | TEXT | Template string with {{variable}} placeholders |
| subject | TEXT | Email subject template (nullable for non-email) |
| updated_at | TIMESTAMP | |
| Composite PK | (type, channel) | |
8. Core Flows - End to End
Flow 1: Send a Transactional Notification (e.g., "Your order has shipped")
This is the most common flow. An internal service wants to notify a single user about a specific event.
- Step 1 -- Producer service sends the request. The Order Service detects that order #789 has shipped. It calls the Notification API: `POST /api/v1/notifications` with `{ user_id: "abc123", type: "order_shipped", priority: "high", data: { order_id: "789", carrier: "FedEx" } }`. The request hits the load balancer and reaches a Notification API instance.
- Step 2 -- Notification API validates and enqueues. The API validates: is `user_id` a real user? Is `order_shipped` a registered notification type? Are the required template variables (`order_id`, `carrier`) present? If valid, the API generates a `notification_id`, publishes a message to Kafka on the `notifications` topic (partitioned by `user_id` for ordering), and immediately returns HTTP 202 to the producer. The Order Service does not wait for delivery -- it continues processing. Why 202, not 200? The notification is accepted for processing but not yet delivered. The producer does not need to know when or how it is delivered. This decoupling is critical for reliability and latency.
- Step 3 -- Processing worker consumes the message. A Notification Processing Worker picks up the message from Kafka. Now begins the decision-making pipeline:
- Step 4 -- Look up user preferences. The worker reads the user's preferences from PostgreSQL (cached in Redis for hot users). User abc123 has: push enabled, email enabled, SMS disabled, no type-level overrides for `order_shipped`, quiet hours from 10 PM to 7 AM (Asia/Karachi timezone). The current time in the user's timezone is 3 PM -- not in quiet hours. So the eligible channels are: push and email (SMS is globally disabled by the user).
- Step 5 -- Check rate limits. The worker checks a Redis counter: how many notifications has user abc123 received in the last hour? If the count is below the threshold (e.g., 30/hour), proceed. If above, this notification is still sent because it is priority "high" (transactional). Rate limiting primarily throttles low-priority and marketing notifications.
- Step 6 -- Render templates. The worker calls the Template Service (or looks up templates from a local cache) for `order_shipped` on each eligible channel:
  - Push template: "Your order #{{order_id}} has shipped via {{carrier}}! Track it now." Rendered: "Your order #789 has shipped via FedEx! Track it now."
  - Email template: Full HTML email with order details, tracking link, and branded header. Rendered with the same variables.
- Step 7 -- Route to channel queues. The worker publishes two messages:
  - To the `push-queue`: `{ notification_id, user_id, title: "Order Shipped", body: "Your order #789...", action_url: "/orders/789", device_tokens: ["token1", "token3"] }`
  - To the `email-queue`: `{ notification_id, user_id, to: "user@email.com", subject: "Your Order Has Shipped!", html_body: "..." }`
  - It also writes the notification to the `in-app-queue` (in-app notifications are always stored regardless of other channel preferences).
- Step 8 -- Push Sender Worker delivers the push notification. A Push Sender Worker consumes from the push queue. It sends the notification to APNs (for the iOS token) and FCM (for the Android token) via their respective APIs. The worker sets a 5-second timeout for each call.
  - APNs returns success (200). The worker logs `{ notification_id, channel: "push", device: "ios", status: "sent" }` to the delivery logs.
  - FCM returns an error (invalid token). The worker logs the failure and marks the Android token as invalid in Redis (so future notifications skip it). It does NOT retry for an invalid token -- retries are only for transient errors (5xx, timeout).
- Step 9 -- Email Sender Worker delivers the email. Similarly, an Email Sender Worker calls SendGrid's API with the rendered HTML email. SendGrid returns 202 (accepted). The worker logs `status: "sent"`. SendGrid will later send a webhook callback when the email is actually delivered (or bounces), which updates the delivery log.
- Step 10 -- In-App Writer stores the notification. An In-App Writer worker inserts a row into the Cassandra `notification_inbox` table for user abc123. This is a fast, single-partition write.
- Step 11 -- User receives the notifications.
  - Push: The user's phone displays a push notification within 1-3 seconds of Step 8. They see: "Order Shipped -- Your order #789 has shipped via FedEx! Track it now." Tapping it deep-links to the order tracking page.
  - Email: The email arrives in the user's inbox within 30-60 seconds (email delivery is inherently slower due to SMTP relays).
  - In-app: Next time the user opens the app, their notification inbox shows the notification with an unread badge. The unread count ("7") is shown on the bell icon.
- Total end-to-end latency for push: ~2-5 seconds from the Order Service's API call to the push notification appearing on the user's phone. Most of this time is Kafka consumer processing and the APNs/FCM round-trip.
Flow 1
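The decision-making in Steps 4-5 (preference and quiet-hour checks) can be sketched as a pure function. The preference shape mirrors the API in Section 5.5; the function name is hypothetical.

```python
from datetime import time

def eligible_channels(prefs: dict, notif_type: str, now_local: time) -> list:
    """Resolve channels: global opt-outs, then per-type overrides, then quiet hours."""
    channels = [c for c, enabled in prefs["channels"].items() if enabled]
    overrides = prefs.get("types", {}).get(notif_type, {})
    channels = [c for c in channels if overrides.get(c, True)]
    start, end = prefs["quiet_hours"]["start"], prefs["quiet_hours"]["end"]
    # Quiet windows may cross midnight (e.g., 22:00 -> 07:00)
    in_quiet = (now_local >= start or now_local < end) if start > end else (start <= now_local < end)
    return [] if in_quiet else channels  # empty = defer until quiet hours end

prefs = {
    "channels": {"push": True, "email": True, "sms": False},
    "types": {"marketing": {"push": False}},
    "quiet_hours": {"start": time(22, 0), "end": time(7, 0)},
}
print(eligible_channels(prefs, "order_shipped", time(15, 0)))  # -> ['push', 'email']
```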
Flow 2: User Opens Their In-App Notification Inbox
This is the read path for in-app notifications.
- Step 1 -- User taps the notification bell icon. The app sends `GET /api/v1/users/abc123/notifications?limit=20` to the load balancer, which routes to an API server.
- Step 2 -- API server checks the cache. The server first checks Redis for a cached version of this user's recent notifications: `inbox:{user_id}`. If the cache has a fresh result (e.g., cached within the last 60 seconds), return it immediately. For most users who check their inbox rarely, this will be a cache miss.
- Step 3 -- Cache miss: query Cassandra. The server queries: `SELECT * FROM notification_inbox WHERE user_id = 'abc123' ORDER BY created_at DESC LIMIT 20`. This is a single-partition read in Cassandra -- very fast (~5-10 ms). Cassandra returns the 20 most recent notifications.
- Step 4 -- Compute unread count. The server also needs the unread count for the badge. Rather than counting all unread notifications in Cassandra (potentially expensive), we maintain an `unread_count` counter in Redis: `unread:{user_id}`. This counter is incremented by 1 every time the In-App Writer stores a new notification, and decremented when the user marks a notification as read. The server reads this counter: `GET unread:abc123` returns `7`.
- Step 5 -- Populate cache and return. The server writes the notification list to Redis with a 60-second TTL: `SET inbox:abc123 [serialized list] EX 60`. It returns the JSON response to the client.
- Step 6 -- Client renders the inbox. The app displays the list of notifications with titles, bodies, timestamps, and read/unread status. The bell icon shows "7" as the unread badge. Total latency: ~20-50 ms.
- Step 7 -- User reads a notification. The user taps on "Order Shipped." The app sends `PATCH /api/v1/notifications/notif_xyz` with `{ read: true }`. The server updates the Cassandra row (`read = true`), decrements the Redis unread counter (`DECR unread:abc123`), and invalidates the inbox cache (`DEL inbox:abc123`). The badge updates to "6".
Flow 2
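The Redis bookkeeping in this flow (the unread counter plus cache invalidation) reduces to two operations. Here is an in-memory stand-in; `InboxCache` is a hypothetical class, and the equivalent Redis commands are noted in comments.

```python
class InboxCache:
    """In-memory stand-in for the Redis state used in Flow 2."""
    def __init__(self):
        self.unread = {}  # unread:{user_id} counters
        self.inbox = {}   # inbox:{user_id} cached pages (60s TTL in production)

    def on_new_notification(self, user_id):
        self.unread[user_id] = self.unread.get(user_id, 0) + 1  # INCR unread:{user_id}
        self.inbox.pop(user_id, None)                           # DEL inbox:{user_id}

    def on_read(self, user_id):
        self.unread[user_id] = max(0, self.unread.get(user_id, 0) - 1)  # DECR, floored at 0
        self.inbox.pop(user_id, None)                                    # invalidate cached page
```

Both writes invalidate the cached inbox page so the next read rebuilds it from Cassandra; the nightly reconciliation job described in Section 9 corrects any counter drift.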
Flow 3: Sending a Bulk Marketing Notification to Millions of Users
This flow is different from transactional notifications because it involves massive fan-out.
- Step 1 -- Marketing team triggers a campaign. They call `POST /api/v1/notifications/bulk` with a segment ("all US users who have been active in the last 30 days") and the notification type (`promo_summer_sale`). The API validates the request and creates a `batch_id`.
- Step 2 -- Segment resolution (async). A Batch Worker picks up the job and queries the user database for all user_ids matching the segment. This might return 50 million user IDs. The worker does NOT try to process all 50 million at once.
- Step 3 -- Chunk and enqueue. The worker splits the 50 million user IDs into chunks of 1,000. For each chunk, it publishes a message to a dedicated `bulk-notifications` Kafka topic: `{ batch_id, user_ids: [1000 ids], type: "promo_summer_sale", data: {...} }`. This creates 50,000 Kafka messages.
- Step 4 -- Bulk Processing Workers consume chunks. A pool of Bulk Processing Workers consumes from the `bulk-notifications` topic. For each chunk of 1,000 users, the worker:
  - Batch-reads preferences for all 1,000 users.
  - Filters out users who have opted out of marketing notifications.
  - Filters out users in quiet hours.
  - Applies rate limiting (skip users who have received too many notifications today).
  - For the remaining users, renders templates and publishes to channel queues.
  Why chunk instead of sending 50 million individual messages? Batching reduces Kafka overhead (50,000 messages vs. 50 million), allows efficient batch reads for preferences, and makes it easy to pause/resume or cancel a campaign mid-flight.
- Step 5 -- Channel senders deliver at a controlled rate. The push/email/SMS sender workers process the campaign messages, but with rate controls to avoid overwhelming external providers (APNs, SendGrid, Twilio all have rate limits). The system spreads delivery over 30-60 minutes rather than trying to send all 50 million at once.
- Step 6 -- Progress tracking. The Delivery Status Tracker aggregates results per batch_id: 50M targeted, 45M eligible (after preference filtering), 44M sent, 43M delivered, 500K failed, 500K pending. The marketing team can check progress via an internal dashboard.
- What the user sees: Over the course of an hour, users receive a push notification and/or email about the summer sale. Users who opted out see nothing. Users in quiet hours receive the notification after quiet hours end (the system queues it and releases it at the right time).
Flow 3
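Step 3's chunking is only a few lines. A sketch, scaled down for illustration; the chunk size of 1,000 and the message shape match the flow above.

```python
def chunk(user_ids, size=1000):
    """Split a resolved segment into fixed-size chunks, one Kafka message each."""
    for i in range(0, len(user_ids), size):
        yield user_ids[i:i + size]

resolved = list(range(50_000))  # stand-in for the 50M IDs a real segment resolves to
messages = [
    {"batch_id": "batch_456", "user_ids": ids, "type": "promo_summer_sale"}
    for ids in chunk(resolved)
]
# 50,000 IDs at 1,000 per chunk -> 50 messages on the bulk-notifications topic
```

Because each chunk is an independent message, pausing or cancelling a campaign is just a matter of stopping consumption of the remaining chunks for that `batch_id`.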
9. Caching and Read Performance
What We Cache
- User preferences (Redis): `prefs:{user_id}` -- the user's channel and type preferences. Read on every single notification processed (~17K QPS). Caching avoids hitting PostgreSQL on every notification.
- Device tokens (Redis hash): `devices:{user_id}` -- the user's push notification tokens. Read every time a push is sent.
- Notification inbox (Redis): `inbox:{user_id}` -- the most recent 20 notifications for the in-app inbox. Short TTL (60 seconds). Only populated when a user actually views their inbox.
- Unread count (Redis counter): `unread:{user_id}` -- atomic counter for the badge number.
- Templates (local in-memory cache on processing workers): Templates change rarely. Workers cache them locally and refresh every 5 minutes.
Where the Cache Sits
For the notification processing path: Workers check Redis for preferences and device tokens before touching PostgreSQL.
For the inbox read path: API servers check Redis for the cached inbox before querying Cassandra.
Cache Update and Invalidation
- Preferences cache: Written to Redis when a user updates their preferences. TTL of 1 hour as a safety net. If the cache is missing, read from PostgreSQL and populate.
- Device tokens cache: Updated when the app registers a new token. No TTL (tokens are managed explicitly). Invalid tokens are removed when a push provider reports them.
- Inbox cache: Invalidated (deleted) whenever a new notification is written to the inbox or a notification is marked as read. Short TTL (60 seconds) prevents serving very stale data.
- Unread count: Incremented atomically on each new notification, decremented on each read. If the counter gets out of sync (unlikely but possible), a background job reconciles it against Cassandra nightly.
Eviction Policy
LRU on Redis. Active users' preferences and tokens stay warm. Inactive users' data is evicted and reloaded from the database on the next notification. This is fine because the database read adds only a few milliseconds, and inactive users receive notifications infrequently.
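The preference path described in this section is the classic cache-aside pattern. A sketch with dict stand-ins for Redis and PostgreSQL; the function name is hypothetical.

```python
def get_preferences(user_id, cache, db):
    """Cache-aside read: try Redis first, fall back to PostgreSQL, then repopulate.

    In production the cache write would carry the 1-hour TTL used as a safety net."""
    prefs = cache.get(user_id)
    if prefs is None:
        prefs = db[user_id]     # source of truth
        cache[user_id] = prefs  # SET prefs:{user_id} ... EX 3600
    return prefs

db = {"abc123": {"channels": {"push": True, "email": True, "sms": False}}}
cache = {}
get_preferences("abc123", cache, db)  # miss: reads PostgreSQL, warms the cache
get_preferences("abc123", cache, db)  # hit: served from Redis
```

If the Redis entry is evicted under LRU, the next notification simply takes the miss path and re-warms it, which is the behavior described above.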
10. Storage, Indexing, and Media
Primary Data Storage
- PostgreSQL: User preferences, notification templates, batch campaign metadata. Small datasets (~500M preference rows at ~200 bytes each ≈ 100 GB), comfortably handled by PostgreSQL with read replicas.
- Cassandra: In-app notification inbox (~22.5 TB for 90 days) and delivery logs (~9 TB for 30 days). Write-heavy, time-ordered, and partitioned by user_id (inbox) or notification_id (logs). Cassandra is purpose-built for this workload.
- Redis: Device tokens, preference cache, unread counters, rate limit counters. Fast reads, ~50-100 GB total.
Indexes
- notification_inbox: Partition key on `user_id`, clustering key on `created_at DESC`. No secondary indexes needed -- the only query is "recent notifications for a user."
- delivery_logs: Partition key on `notification_id`, clustering key on `channel`. Supports "what happened to this notification across all channels?"
- user_preferences: Primary key on `user_id` in PostgreSQL.
Media
The notification system itself has minimal media. Emails may contain images, but those are typically hosted on the product's CDN and referenced by URL in the email HTML. Push notifications may include a small icon/image URL pointing to the CDN. No new media storage is needed in the notification system itself.
Trade-offs
- Cassandra for inbox: Excellent write throughput and partition-based reads, but no support for complex queries (e.g., "all unread notifications across all users" requires a full scan). For analytics, we export to a data warehouse.
- Redis for unread counts: Fast atomic increments, but if Redis loses data (restart without persistence), counts may drift. The nightly reconciliation job fixes this.
11. Scaling Strategies
Version 1: Simple Setup
For a startup with millions of users:
- A single PostgreSQL instance for preferences and templates.
- A single Cassandra cluster (3 nodes) for inbox and logs.
- A single Redis instance for caching and counters.
- A small Kafka cluster (3 brokers) for the notification pipeline.
- 5-10 processing workers, 5 push senders, 5 email senders.
This handles ~1,000 notifications/second comfortably.
Growing the System
Kafka scaling: Add more partitions to the notification topic. Each partition is consumed by one worker, so more partitions = more parallel processing. At 58K peak QPS, we might need 50-100 partitions with 50-100 processing workers.
Cassandra scaling: Add more nodes to the cluster. Cassandra scales linearly -- doubling nodes roughly doubles throughput. At 1 billion notifications/day writing to inbox, we need significant write capacity. A 20-30 node Cassandra cluster handles this well.
Channel-specific scaling: Each channel (push, email, SMS) has its own queue and worker pool, scaled independently. Push sending is typically the highest volume (most users have push enabled). Email is second. SMS is lowest (expensive, used sparingly). If the email provider has a rate limit of 10K emails/second, we configure the email worker pool accordingly.
PostgreSQL for preferences: Add read replicas. The processing workers read preferences very frequently (17K QPS), but writes are rare (users update preferences infrequently). One primary + 2-3 replicas handles this easily.
Geographic distribution: Deploy the notification pipeline in multiple regions. Notifications for US users are processed by US workers calling US-region provider endpoints, reducing latency. Kafka topics can be region-specific.
Handling Bursts
- Kafka absorbs spikes. If 10 producer services simultaneously fire notifications (e.g., during a flash sale), Kafka buffers them. Processing workers consume at a steady rate. No notifications are lost; they may be slightly delayed.
- Bulk campaign throttling. Marketing campaigns (50M+ notifications) are spread over 30-60 minutes using a token-bucket rate limiter on the bulk processing workers. This prevents campaigns from starving transactional notifications.
- Priority queues. High-priority (transactional) notifications are on a separate Kafka topic consumed by dedicated workers. Marketing and low-priority notifications are on a separate topic. Under load, transactional notifications are never delayed by a marketing blast.
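The token-bucket throttle mentioned above can be sketched as follows. The clock is injectable so the behavior is deterministic and testable; a production version would simply use `time.monotonic()`.

```python
import time

class TokenBucket:
    """Paces bulk sends so a campaign never exceeds `rate` deliveries/second sustained."""
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens, self.last = capacity, clock()

    def allow(self, n=1.0):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False  # caller re-queues or sleeps before retrying
```

A bulk worker checks `allow()` before publishing each chunk to the channel queues; denied chunks wait, so transactional traffic on the separate topic is never crowded out.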
12. Reliability, Failure Handling, and Backpressure
Removing Single Points of Failure
- Kafka: Multi-broker cluster with replication factor 3. Survives broker failures.
- Cassandra: Multi-node cluster with replication factor 3. Survives node failures.
- PostgreSQL: Primary + synchronous standby with automatic failover.
- Redis: Redis Sentinel or Cluster for failover. If Redis goes down, workers fall back to reading preferences from PostgreSQL directly (slower but functional).
- Processing workers: Stateless. If one dies, Kafka reassigns its partitions to other workers. In-flight messages are retried (Kafka's consumer offset is committed only after successful processing).
At-Least-Once Delivery
The entire pipeline is designed for at-least-once delivery:
- Kafka consumers commit offsets only after the notification is fully processed and handed to channel queues.
- Channel senders commit only after the provider API returns success.
- If a worker crashes mid-processing, the message is redelivered to another worker (Kafka's consumer group rebalancing).
- Duplicates are possible but rare. Push notifications are inherently idempotent (APNs/FCM deduplicate by message ID). Emails and SMS may be duplicated in rare cases -- acceptable and preferable to losing notifications.
Timeouts and Retries
- Provider API timeouts: Push senders set a 5-second timeout for APNs/FCM. Email senders set a 10-second timeout for SendGrid. On timeout, the message is retried up to 3 times with exponential backoff (5s, 15s, 45s).
- Dead letter queue: After 3 retries, a failed notification is moved to a dead letter topic in Kafka. An alerting system monitors the DLQ and pages on-call engineers if it grows beyond a threshold.
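The retry delays above follow a geometric schedule; a one-liner makes the policy explicit (a sketch with the document's numbers as defaults):

```python
def backoff_schedule(base=5.0, factor=3.0, retries=3):
    """Delay in seconds before each retry attempt: 5s, 15s, 45s with these defaults."""
    return [base * factor ** attempt for attempt in range(retries)]

print(backoff_schedule())  # -> [5.0, 15.0, 45.0]
```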
Circuit Breakers
If APNs starts returning 5xx errors at a high rate, a circuit breaker on the push sender opens -- it stops sending to APNs for 60 seconds (probe with a single request to check recovery). During this window, push notifications are buffered in the push queue (Kafka retains them). When the circuit closes, delivery resumes. Meanwhile, email and in-app channels continue working.
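A minimal breaker for the push sender might look like this. A sketch: the threshold, the injected clock, and the class shape are illustrative, but the closed/open/half-open behavior matches the paragraph above.

```python
import time

class CircuitBreaker:
    """Closed -> open after repeated failures; half-open probe after the cooldown."""
    def __init__(self, failure_threshold=5, cooldown=60.0, clock=time.monotonic):
        self.failure_threshold, self.cooldown, self.clock = failure_threshold, cooldown, clock
        self.failures, self.opened_at = 0, None

    def allow_request(self):
        if self.opened_at is None:
            return True                                    # closed: normal operation
        if self.clock() - self.opened_at >= self.cooldown:
            return True                                    # half-open: let one probe through
        return False                                       # open: fail fast, Kafka retains the message

    def record_success(self):
        self.failures, self.opened_at = 0, None            # close the circuit

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()                  # trip open for `cooldown` seconds
```

While the breaker is open, the sender simply stops consuming from the push queue, so messages buffer in Kafka exactly as described above.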
Behavior Under Overload
- Shed marketing notifications. If the pipeline is overloaded, pause bulk campaign processing. Transactional notifications (order confirmations, security alerts) always get through.
- Rate limit producers. If a producer service is flooding the system (bug or misconfiguration), the API gateway enforces per-producer rate limits and returns 429.
- Drop delivery logs. Under extreme load, delivery log writes can be skipped. Notifications are still delivered; we just lose some debugging data temporarily.
13. Security, Privacy, and Abuse
Authentication and Authorization
- Producer authentication: Internal services authenticate with the Notification API using service-to-service credentials (mutual TLS or API keys). Only authorized services can send notifications.
- User-facing API authentication: The inbox and preference endpoints require a valid user session token (JWT). A user can only read their own notifications and update their own preferences.
- Admin API: Bulk send and template management require elevated permissions.
Encryption
- In transit: All internal communication over TLS. External API calls to APNs/FCM/SendGrid/Twilio use HTTPS.
- At rest: Cassandra and PostgreSQL disk encryption. Kafka log encryption. S3 encryption if used for any data.
Handling Sensitive Data
- PII in notifications: Notifications may contain personal data (names, order details, account information). This data flows through Kafka, workers, and external providers. Minimize PII in notification payloads -- use references (order_id) rather than full details where possible, and let the client fetch details via authenticated APIs.
- Device tokens: Treat push notification device tokens as sensitive. Store them encrypted, and invalidate them promptly when the provider reports them invalid.
- Email addresses and phone numbers: Used for email and SMS delivery. Access-controlled and encrypted at rest.
Abuse Protection
- Per-user rate limiting: No user receives more than 30 notifications per hour (configurable). This protects against buggy producer services that accidentally fire thousands of notifications to one user.
- Per-producer rate limiting: Each internal service has a quota. Prevents a single service from monopolizing the pipeline.
- Spam prevention for user-triggered notifications: If notifications are triggered by user actions (e.g., "someone liked your photo"), an attacker could spam likes to flood someone's inbox. Mitigation: batch similar notifications ("5 people liked your photo") and apply per-type-per-user rate limits.
- Unsubscribe compliance: All marketing emails must include an unsubscribe link (CAN-SPAM, GDPR). The notification system enforces this at the template level -- marketing email templates are required to include {{unsubscribe_url}}.
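Enforcing the placeholder at template-save time means a non-compliant campaign can never be created in the first place. A minimal validation sketch (function name and category strings are illustrative):

```python
import re

# Matches the {{unsubscribe_url}} template placeholder, allowing inner whitespace.
UNSUBSCRIBE_PLACEHOLDER = re.compile(r"\{\{\s*unsubscribe_url\s*\}\}")

def validate_template(template_body, category):
    """Reject marketing email templates that lack the unsubscribe placeholder.

    Called when a template is saved, so the check never has to run on the
    hot send path.
    """
    if category == "marketing" and not UNSUBSCRIBE_PLACEHOLDER.search(template_body):
        raise ValueError("marketing templates must include {{unsubscribe_url}}")
    return True
```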
14. Bottlenecks and Next Steps
Main Bottlenecks and Risks
- Bulk campaign fan-out. Sending 50 million notifications from a single campaign creates a massive spike in the pipeline. Mitigation (already in design): Chunking, separate bulk topic, rate-controlled delivery. Next step: Implement a dedicated campaign scheduler that spreads sends over hours based on optimal delivery times per timezone.
- Provider rate limits and outages. Third-party providers (APNs, FCM, SendGrid, Twilio) have rate limits and occasional outages. Mitigation (already in design): Per-channel queues, circuit breakers, retries. Next step: Support multiple providers per channel (e.g., SendGrid as primary email, SES as fallback). Automatic failover when the primary is down.
- User preference lookup at scale. Every notification requires a preference check (~17K QPS). Mitigation (already in design): Redis cache. Next step: If Redis becomes a bottleneck, batch preference lookups for bulk campaigns (fetch 1,000 preferences in one round-trip instead of 1,000 individual reads).
- Unread count accuracy. Redis counters can drift if increments/decrements are lost (e.g., during a Redis failover). Mitigation (already in design): Nightly reconciliation job. Next step: Use Cassandra's counter columns or a separate counter table for durability, with Redis as a fast read cache.
- Quiet hours across timezones. Checking quiet hours for every notification requires knowing the user's timezone and current local time. For bulk campaigns targeting users across many timezones, this means different delivery times per user. Next step: The bulk scheduler pre-groups users by timezone and schedules per-timezone delivery windows.
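The per-timezone grouping step for the bulk scheduler can be sketched with the standard library alone. This is an assumed shape (the function, the `(user_id, tz_name)` input format, and the 10:00 local send time are illustrative, not from the design above):

```python
from collections import defaultdict
from datetime import datetime, time, timedelta
from zoneinfo import ZoneInfo

def schedule_by_timezone(users, send_local_time=time(10, 0), after_utc=None):
    """Group campaign recipients by timezone; pick one UTC send time per group.

    `users` is an iterable of (user_id, tz_name). Each group is scheduled
    for the next occurrence of `send_local_time` in its local timezone,
    so every user gets the campaign at a reasonable local hour.
    """
    after_utc = after_utc or datetime.now(ZoneInfo("UTC"))
    groups = defaultdict(list)
    for user_id, tz_name in users:
        groups[tz_name].append(user_id)

    schedule = {}
    for tz_name, members in groups.items():
        local_now = after_utc.astimezone(ZoneInfo(tz_name))
        target = local_now.replace(hour=send_local_time.hour,
                                   minute=send_local_time.minute,
                                   second=0, microsecond=0)
        if target <= local_now:          # today's slot already passed
            target += timedelta(days=1)  # send tomorrow instead
        schedule[tz_name] = (target.astimezone(ZoneInfo("UTC")), members)
    return schedule
```

One scheduled job per timezone group (a few hundred at most) then feeds chunks into the bulk topic, instead of computing a delivery time per user.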
Design Summary
| Aspect | Decision | Key Trade-off |
|---|---|---|
| Architecture | Event-driven pipeline: API -> Kafka -> Processing Workers -> Channel Queues -> Senders | Fully async, decoupled; producers never wait for delivery. Slight delay (2-5s) is acceptable. |
| Channel isolation | Separate queues and worker pools per channel (push, email, SMS, in-app) | Independent scaling and failure isolation; more operational complexity |
| Delivery guarantee | At-least-once via Kafka offset management and retries | No lost notifications; rare duplicates are acceptable |
| In-app storage | Cassandra partitioned by user_id | High write throughput for inbox; limited query flexibility |
| User preferences | PostgreSQL + Redis cache | Strong consistency on write; fast reads from cache for processing |
| Bulk campaigns | Chunked fan-out with rate-controlled delivery on a separate pipeline | Prevents campaigns from starving transactional notifications |
This design is built around one core principle: the notification system is an async pipeline, not a synchronous API.
Producers fire and forget; the pipeline handles preference checking, template rendering, channel routing, delivery, and retry logic entirely in the background.
By using Kafka as the central bus, separating channels into independent queues, and caching aggressively for the high-frequency lookups (preferences, device tokens), the system reliably delivers a billion notifications per day. It absorbs provider outages, traffic spikes, and bulk campaigns gracefully, without impacting the latency of transactional notifications.