## Batch Processing
Batch processing handles large volumes of data in scheduled, finite jobs.
You collect data over a period (hours, days), process it all at once, and produce results. There is no expectation of real-time output.
A nightly job that computes daily revenue, a weekly pipeline that rebuilds a recommendation model, and a monthly report that summarizes user activity are all examples of batch processing.
**MapReduce Paradigm**
MapReduce is the programming model that made large-scale batch processing practical. It was published by Google in 2004 and changed how the industry thinks about processing data that is too large to fit on a single machine.
The model has two phases.
The Map phase takes an input dataset, splits it into chunks, and processes each chunk independently on a different machine. Each mapper reads its chunk and emits key-value pairs.
The Reduce phase collects all values associated with the same key and combines them into a final result.
Consider counting word frequency across a billion web pages.
The Map phase distributes the pages across 1,000 machines. Each mapper reads its assigned pages and emits pairs like ("the", 1), ("kubernetes", 1), ("design", 1) for every word it encounters.
The framework groups all values by key: "the" → [1, 1, 1, 1, ... millions of times]. The Reduce phase sums each key's values: "the" → 4,827,391. The final output is a complete word frequency count across the entire corpus.
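The two phases can be sketched in plain Python. This is a toy single-machine stand-in: the framework's real work (distributing chunks across machines, shuffling and grouping intermediate pairs, restarting failed tasks) is replaced here by a list comprehension and a dictionary.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Each mapper emits a (word, 1) pair for every word in its chunk.
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(mapped_pairs):
    # The framework groups values by key; the reducer sums each group.
    counts = defaultdict(int)
    for word, count in mapped_pairs:
        counts[word] += count
    return dict(counts)

# In a real cluster, each document would be mapped on a different machine.
documents = ["the cat sat", "the dog sat", "the cat ran"]
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = reduce_phase(mapped)
# word_counts["the"] == 3, word_counts["sat"] == 2
```

The key property survives even in miniature: the map function sees one chunk at a time and the reduce function sees one key at a time, which is what lets the real framework parallelize both phases freely.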
MapReduce's power is in its simplicity. You write two functions (map and reduce), and the framework handles distributing data across machines, scheduling tasks, handling failures (restarting mappers that crash), and sorting and grouping intermediate results. Individual machines can fail and the job continues without data loss.
The limitation of MapReduce is that many real-world jobs require multiple passes over the data, and each pass writes intermediate results to disk.
A pipeline that joins two datasets, filters, aggregates, and sorts might require four or five sequential MapReduce jobs, each reading from and writing to HDFS. The disk I/O between stages makes complex pipelines slow.
**Apache Hadoop Ecosystem**
Hadoop is the open-source implementation of Google's MapReduce and Google File System papers. It consists of two core components: HDFS (Hadoop Distributed File System) for storing data across a cluster, and MapReduce for processing that data.
Around these core components, an ecosystem of tools grew to address different needs.
* Hive provides a SQL-like query language (HiveQL) over data stored in HDFS. Analysts who know SQL can query massive datasets without writing MapReduce code. Under the hood, Hive translates queries into MapReduce (or Spark or Tez) jobs.
* Pig provides a scripting language (Pig Latin) for data transformation pipelines. It sits between raw MapReduce code and SQL, offering more flexibility than Hive for complex data flows.
* YARN (Yet Another Resource Negotiator) is the resource management layer that schedules and allocates cluster resources for different processing frameworks (MapReduce, Spark, Flink) running on the same Hadoop cluster.
* HBase is a wide-column store (covered in Part II, Lesson 2) that runs on top of HDFS, providing random read/write access to data stored in the Hadoop ecosystem.
The Hadoop ecosystem dominated big data processing from roughly 2006 to 2015. It still runs in many large organizations, but Apache Spark has largely replaced MapReduce as the processing engine of choice for new projects because of its dramatically better performance.
**Apache Spark for Large-Scale Processing**
Spark addresses MapReduce's biggest weakness: disk I/O between stages.
Instead of writing intermediate results to disk after each Map and Reduce phase, Spark keeps data in memory (as Resilient Distributed Datasets, or RDDs) and passes it directly between processing stages.
For iterative algorithms that pass over data multiple times (like machine learning training), Spark can be 10 to 100 times faster than MapReduce.
Spark provides a rich API beyond simple map and reduce operations. You can filter, join, group, sort, aggregate, and window data using a DataFrame API that feels similar to working with a table in pandas or SQL.
Spark SQL lets you write standard SQL queries that execute on the Spark engine.
The Spark ecosystem includes Spark SQL for structured data queries, Spark Streaming (now Structured Streaming) for stream processing (covered in section 2.2), MLlib for distributed machine learning, and GraphX for graph processing.
Spark runs on various cluster managers: YARN (running alongside Hadoop), Kubernetes, its own standalone scheduler, or managed cloud services (AWS EMR, Databricks, Google Dataproc).
Many organizations that adopted Hadoop have migrated their processing jobs from MapReduce to Spark while keeping HDFS (or replacing it with S3) as the storage layer.
| Framework | Processing Model | Speed | Ease of Use | Best For |
|---|---|---|---|---|
| MapReduce | Disk-based, two-phase | Slower (disk I/O between stages) | Low (Java-centric, verbose) | Legacy jobs, simple batch ETL |
| Spark | In-memory, multi-stage DAG | 10-100x faster for iterative jobs | High (Python, Scala, SQL APIs) | Complex batch, ML training, SQL analytics |
**ETL Pipelines and Data Warehousing**
ETL stands for Extract, Transform, Load. It is the process of pulling data from multiple source systems (databases, APIs, logs, third-party services), transforming it into a consistent format (cleaning, normalizing, joining, aggregating), and loading it into a destination system (usually a data warehouse) for analysis.
A typical ETL pipeline for an e-commerce platform might extract order data from PostgreSQL, user data from MongoDB, and marketing data from a third-party API every night.
The transform phase joins orders with user profiles, calculates revenue per region, and formats timestamps consistently.
The load phase writes the results into a data warehouse like Snowflake, BigQuery, or Redshift, where analysts query it for dashboards and reports.
Data warehouses are databases optimized for analytical queries (OLAP, Online Analytical Processing) rather than transactional workloads (OLTP). They use columnar storage, storing all values of a single column together, which makes aggregation queries ("sum of revenue by month") dramatically faster because the query engine reads only the relevant columns instead of entire rows.
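The columnar advantage can be shown in miniature. In the row layout below, summing one column forces the scan to touch every field of every record; in the column layout, the same sum reads a single contiguous array. This is a toy sketch of the data layouts, not a real storage engine.

```python
# Row-oriented: each record stores all of its fields together.
rows = [
    {"order_id": 1, "region": "EU", "revenue": 120.0},
    {"order_id": 2, "region": "US", "revenue": 80.0},
    {"order_id": 3, "region": "EU", "revenue": 50.0},
]

# Column-oriented: all values of one column are stored contiguously.
columns = {
    "order_id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "revenue": [120.0, 80.0, 50.0],
}

# Row store: the aggregation walks every record and picks out one field.
row_total = sum(r["revenue"] for r in rows)

# Column store: the aggregation reads only the one array it needs.
col_total = sum(columns["revenue"])

assert row_total == col_total == 250.0
```

At warehouse scale the difference is not just fields skipped: contiguous columns of one type also compress far better, so the engine reads fewer bytes from disk as well.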
| Data Warehouse | Provider | Strengths |
|---|---|---|
| Snowflake | Snowflake Inc. | Separates storage and compute, scales independently |
| BigQuery | Google Cloud | Serverless, pay-per-query, massive scale |
| Redshift | AWS | Tight AWS integration, mature, cost-effective |
| ClickHouse | Open source | Extremely fast aggregations, real-time analytics |
Modern pipelines often use ELT (Extract, Load, Transform) instead of ETL.
Raw data is loaded into the warehouse first, and transformations happen inside the warehouse using SQL. This simplifies the pipeline because transformations use the warehouse's powerful query engine rather than external processing frameworks.
Tools like dbt (data build tool) have popularized this approach.
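A minimal ELT sketch, using Python's built-in sqlite3 as a stand-in for a real warehouse (the table and column names are illustrative assumptions): raw extracts are loaded first, then a SQL statement inside the "warehouse" builds the analytical table, which is roughly what a dbt model does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for Snowflake/BigQuery

# Load: raw extracts land in the warehouse untransformed.
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, user_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE raw_users (user_id INTEGER, region TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
                 [(1, 10, 120.0), (2, 11, 80.0), (3, 10, 50.0)])
conn.executemany("INSERT INTO raw_users VALUES (?, ?)",
                 [(10, "EU"), (11, "US")])

# Transform: SQL running inside the warehouse builds the analytical table.
conn.execute("""
    CREATE TABLE revenue_by_region AS
    SELECT u.region, SUM(o.amount) AS revenue
    FROM raw_orders o JOIN raw_users u ON o.user_id = u.user_id
    GROUP BY u.region
""")
result = dict(conn.execute("SELECT region, revenue FROM revenue_by_region"))
# result == {"EU": 170.0, "US": 80.0}
```

Because the transform is plain SQL over tables already in the warehouse, it can be versioned, tested, and re-run against the raw tables at any time, which is the core appeal of ELT over external transformation code.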
**Interview-Style Question**
> Q: Your company has transaction data in PostgreSQL, user behavior logs in S3, and marketing campaign data in a third-party API. Leadership wants a daily dashboard showing revenue by marketing channel. How do you build this?
> A: Build an ELT pipeline. Extract: a nightly job pulls the previous day's transactions from PostgreSQL, copies the relevant S3 log files, and calls the marketing API for campaign data. Load: all three datasets are loaded as raw tables into a data warehouse (Snowflake or BigQuery). Transform: dbt models join transactions with user behavior logs (to attribute each purchase to a user session) and with marketing campaign data (to attribute sessions to campaigns). The final transformed table contains revenue broken down by marketing channel, ready for the dashboard. An orchestration tool like Airflow schedules the pipeline nightly and handles retries if any step fails. The dashboard queries the warehouse directly.
_ETL Pipeline Architecture_
**KEY TAKEAWAYS**
* MapReduce splits large-scale processing into Map (process chunks independently) and Reduce (combine results by key). It handles failures gracefully but is slow due to disk I/O between stages.
* Hadoop provides the open-source ecosystem for distributed storage (HDFS) and processing (MapReduce, Hive, Pig). It is mature but largely succeeded by Spark for new workloads.
* Spark keeps data in memory between stages, making it 10-100x faster than MapReduce for iterative and complex jobs. It supports SQL, machine learning, and streaming.
* ETL pipelines extract data from sources, transform it into a consistent format, and load it into a data warehouse. Modern approaches often use ELT, loading raw data first and transforming inside the warehouse.
* Data warehouses (Snowflake, BigQuery, Redshift) use columnar storage optimized for analytical queries. They are the destination for processed data consumed by dashboards and reports.
## Stream Processing
Batch processing handles data that has already been collected.
Stream processing handles data as it arrives, event by event, in real time.
The distinction matters whenever timeliness does.
A fraud detection system that runs as a nightly batch job catches fraud the next day.
A stream processor catches it in milliseconds.
**Real-Time vs. Near-Real-Time Processing**
True real-time processing means each event is processed individually the moment it arrives. The latency between an event occurring and the system reacting is measured in milliseconds. Apache Flink and Kafka Streams process events this way.
Near-real-time processing collects events into small batches (micro-batches) and processes each batch together. The latency is measured in seconds to minutes.
Spark Structured Streaming operates in this mode, with configurable batch intervals as short as 100 milliseconds.
For most applications, the difference between 50 milliseconds (true real-time) and 2 seconds (near-real-time) is invisible to users.
Near-real-time is sufficient for dashboards, alerting, analytics, and most notification systems.
True real-time matters for fraud detection (where milliseconds determine whether a fraudulent transaction is blocked), algorithmic trading, real-time bidding in ad auctions, and industrial process control.
**Apache Kafka Streams, Apache Flink, Apache Storm**
These are the three major stream processing frameworks (Kafka Streams and Flink were introduced in Chapter II).
Here is a deeper comparison of their architectures and strengths.
Kafka Streams is a Java/Kotlin library, not a standalone cluster. You embed it in your application, and it processes data from Kafka topics.
It scales by running multiple instances of your application, each handling a subset of Kafka partitions.
Because it is a library, there is no separate cluster to deploy and manage.
This makes Kafka Streams the simplest option when your data is already in Kafka and your processing needs are moderate (filtering, transforming, aggregating, joining streams).
Apache Flink is a distributed stream processing cluster. It provides the most powerful event processing capabilities: event-time semantics (processing events based on when they actually happened, not when they arrived), exactly-once state management (the system guarantees no data is lost or duplicated even during failures), complex event processing (detecting patterns across sequences of events), and sophisticated windowing (tumbling, sliding, session, and custom windows).
Flink handles both bounded (batch) and unbounded (stream) datasets with the same API.
Apache Storm was one of the earliest stream processing frameworks. It processes events one at a time with low latency.
Storm's architecture uses spouts (data sources) and bolts (processing steps) connected in a topology.
While Storm pioneered real-time stream processing, it has been largely superseded by Flink and Kafka Streams for new projects.
Storm lacks some of the advanced features (event-time processing, exactly-once guarantees) that Flink provides.
| Framework | Architecture | Processing Model | State Management | Best For |
|---|---|---|---|---|
| Kafka Streams | Library (in your app) | True streaming | Local state stores backed by Kafka | Moderate processing within Kafka ecosystem |
| Flink | Distributed cluster | True streaming | Distributed, exactly-once snapshots | Complex event processing, event-time logic |
| Storm | Distributed cluster | True streaming (per-event) | Limited | Legacy real-time pipelines |
| Spark Streaming | Micro-batch on Spark | Near-real-time | RDD/DataFrame state | Teams already using Spark |
**Windowing Strategies: Tumbling, Sliding, Session**
Stream processing often needs to aggregate events over time periods. "Count the number of page views in the last 5 minutes." "Compute average transaction value per hour." "Detect when a user has been inactive for 30 minutes."
Windows define how events are grouped into finite chunks for aggregation.
Tumbling windows divide time into fixed, non-overlapping intervals. A 5-minute tumbling window creates buckets: 10:00-10:05, 10:05-10:10, 10:10-10:15. Each event falls into exactly one bucket.
At the end of each window, the aggregation is emitted (total page views from 10:00 to 10:05 = 1,247). Tumbling windows are the simplest and most common windowing strategy.
Sliding windows overlap. A 5-minute sliding window that slides every 1 minute creates buckets: 10:00-10:05, 10:01-10:06, 10:02-10:07.
A single event can appear in up to 5 buckets.
Sliding windows give a smoother view of trends because each window overlaps with its neighbors. They are useful for moving averages and detecting gradual changes.
Session windows are defined by inactivity gaps rather than fixed time.
A session window with a 30-minute gap groups all events from a user that are separated by less than 30 minutes of inactivity.
If a user browses your site from 10:00 to 10:20, leaves, and returns at 11:15, that creates two session windows: one from 10:00 to 10:20 and another starting at 11:15. Session windows are essential for user behavior analysis, billing systems, and activity tracking.
| Window Type | Overlap? | Defined By | Use Case |
|---|---|---|---|
| Tumbling | No | Fixed time intervals | Hourly counts, daily aggregates |
| Sliding | Yes | Interval + slide period | Moving averages, trend detection |
| Session | No | Inactivity gap | User sessions, activity tracking |
_Windowing Strategy_
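The tumbling and session strategies above can be sketched in plain Python. Timestamps are minutes since midnight, and the 5-minute window and 30-minute gap are the same parameters used in the examples; this ignores event-time complications like late or out-of-order events, which a real stream processor must handle.

```python
from collections import Counter

def tumbling_counts(event_times, window_minutes=5):
    # Assign each event to exactly one fixed, non-overlapping bucket,
    # keyed by the bucket's start time.
    return Counter((t // window_minutes) * window_minutes for t in event_times)

def session_windows(event_times, gap_minutes=30):
    # Start a new session whenever the gap since the previous event
    # reaches the inactivity threshold.
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] < gap_minutes:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

page_views = [600, 601, 604, 607, 612, 675]  # minutes since midnight
print(tumbling_counts(page_views))   # buckets starting at 600, 605, 610, 675
print(session_windows(page_views))   # two sessions: 600-612 and 675
```

A sliding window would assign each event to several overlapping buckets instead of one, which is why a 5-minute window sliding every minute places a single event in up to 5 buckets.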
**Lambda Architecture vs. Kappa Architecture**
As organizations built both batch and stream processing systems, two architectural patterns emerged for combining them.
Lambda architecture runs two parallel processing paths. The batch layer processes the complete historical dataset on a schedule (every few hours or daily) and produces accurate, comprehensive results.
The speed layer (stream processor) handles recent data in real time but with potentially less accuracy.
A serving layer merges results from both paths to provide a complete, up-to-date view.
Lambda architecture guarantees correctness (the batch layer reprocesses everything periodically, correcting any errors from the speed layer) and low latency (the speed layer provides immediate results).
The trade-off is complexity: you maintain two codebases doing similar work in different frameworks, and the merging logic adds another layer of code.
Kappa architecture simplifies Lambda by eliminating the batch layer entirely. All data is processed through a single stream processing pipeline.
If you need to reprocess historical data (to fix a bug or apply a new algorithm), you replay the data through the same streaming pipeline from the beginning of the event log (which Kafka retains).
Kappa architecture is simpler because there is one codebase and one processing path. It works well when your stream processing framework is powerful enough to handle both real-time and historical processing.
The prerequisite is that your event log (Kafka) retains enough history for reprocessing, and that your stream processor (Flink, Kafka Streams) can handle the throughput of replaying weeks or months of data.
| Architecture | Processing Paths | Complexity | Correctness | Best For |
|---|---|---|---|---|
| Lambda | Batch + Stream (two paths) | High (two codebases, merge logic) | High (batch corrects stream errors) | When batch and stream logic differ significantly |
| Kappa | Stream only (one path) | Lower (one codebase) | Depends on stream processor reliability | When stream processing handles all use cases |
Most new systems favor Kappa architecture because modern stream processors (Flink in particular) are reliable enough to handle both real-time and reprocessing workloads.
Lambda architecture remains relevant in organizations that have existing batch infrastructure and cannot easily migrate.
**Interview-Style Question**
> Q: Your platform needs to show real-time metrics (orders per minute, revenue per hour) on a dashboard and also produce accurate daily and monthly reports. Should you use Lambda or Kappa architecture?
> A: Start with Kappa. Use Flink or Kafka Streams to process the order event stream in real time, computing per-minute and per-hour aggregations and writing them to a time-series database for the dashboard. For daily and monthly reports, either aggregate the same stream data into a data warehouse as it flows through the pipeline, or replay the Kafka topic (with sufficient retention) through the same Flink job with daily and monthly windows. This single-path approach avoids maintaining separate batch and stream codebases. If you later discover that the daily reports need complex joins with data sources not available in the stream (like third-party marketing data), you can add a batch ETL pipeline for those specific reports without adopting full Lambda architecture for everything.
**KEY TAKEAWAYS**
* True real-time processing handles events individually in milliseconds (Flink, Kafka Streams). Near-real-time processes micro-batches in seconds (Spark Streaming).
* Kafka Streams is a library for moderate stream processing. Flink is a distributed cluster for complex event processing. Storm is legacy.
* Tumbling windows are fixed and non-overlapping. Sliding windows overlap for smoother trends. Session windows group by user inactivity.
* Lambda architecture runs parallel batch and stream paths for correctness and low latency but adds complexity. Kappa simplifies to a single stream path with replay for reprocessing.
* Modern stream processors (Flink) are powerful enough to handle both real-time and reprocessing, making Kappa the preferred choice for most new systems.
## Analytics & Monitoring
A system you cannot observe is a system you cannot fix.
Analytics and monitoring give you visibility into what your system is doing, how it is performing, and when something is going wrong.
This section covers the infrastructure that every production system needs, and that every system design interview expects you to consider.
**Analytics System Architecture**
An analytics system collects data about how your product is used and transforms it into insights. The architecture follows a standard pattern: collection, storage, processing, and visualization.
Collection happens through event tracking. Your application emits events for user actions: page views, button clicks, purchases, searches, errors. These events flow to a collection service (which might be a Kafka topic, a managed service like Segment or Amplitude, or a custom API) that buffers and routes them.
Storage depends on query patterns. Raw events go to a data lake (S3, GCS) for long-term retention and batch processing. Aggregated metrics go to a data warehouse (BigQuery, Snowflake) for SQL-based analysis. Real-time metrics go to a time-series database (InfluxDB, TimescaleDB) for dashboards.
Processing transforms raw events into meaningful metrics. Batch jobs compute daily and monthly aggregates. Stream processors compute real-time metrics. dbt models transform raw warehouse data into analytical tables.
Visualization presents the data through dashboards (Grafana, Tableau, Looker, Metabase), ad-hoc query tools, and automated reports.
**Metrics Collection and Aggregation**
Metrics are numerical measurements of your system's behavior, collected at regular intervals: CPU utilization at 72%, request latency at 45ms, error rate at 0.3%, queue depth at 1,247 messages.
Metrics collection follows one of two models.
Push-based collection means each service sends its metrics to a central collector at regular intervals. StatsD and Telegraf follow this model. The service is responsible for pushing.
Pull-based collection means a central system periodically scrapes metrics from each service's metrics endpoint. Prometheus follows this model. The collector is responsible for pulling.
| Model | How It Works | Pros | Cons |
|---|---|---|---|
| Push | Services send metrics to collector | Works behind firewalls, immediate delivery | Collector can be overwhelmed, harder to discover services |
| Pull | Collector scrapes service endpoints | Collector controls rate, easy to add targets | Requires network access to services, slight delay |
Prometheus has become the de facto standard for infrastructure and application metrics. It scrapes HTTP endpoints (typically `/metrics`) from your services, stores the data in its time-series database, and supports a powerful query language (PromQL) for aggregation and alerting.
Aggregation reduces the volume of metric data.
Instead of storing every individual latency measurement (millions per hour), you aggregate into percentiles: p50 (median), p95, p99, and p99.9.
These percentiles tell you what a typical user experiences (p50), what most users experience (p95), and what the worst-off users experience (p99). The p99 is especially critical because it captures tail latency: the slow requests that affect a small but real percentage of users.
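Percentiles can be computed from raw latency samples with a simple nearest-rank sketch. Real metrics systems use streaming estimators (histogram buckets, HDR histograms, t-digest) rather than sorting every sample, but the idea is the same.

```python
def percentile(samples, p):
    # Nearest-rank percentile: sort the samples, then index into the
    # ordered list at the rank corresponding to p percent.
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

# 100 fast requests plus a handful of slow outliers.
latencies_ms = [40 + i % 20 for i in range(100)] + [800, 950, 1200]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
# The mean barely moves with three outliers, but p99 lands on them:
# this is why percentiles, not averages, describe the tail.
```

Note how the three slow requests are invisible at p50 but dominate p99; an alert on average latency would miss them entirely.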
**Time-Series Data Storage and Querying**
Time-series data is any data indexed primarily by time. Metrics, stock prices, IoT sensor readings, and application performance data are all time-series. Time-series databases are optimized for this access pattern.
InfluxDB is a purpose-built time-series database with its own query language (Flux) and built-in retention policies (automatically delete data older than 30 days). It handles high write throughput for metrics ingestion and efficient time-range queries for dashboards.
TimescaleDB extends PostgreSQL with time-series optimizations (hypertables, automatic partitioning by time, continuous aggregates). The advantage is full SQL compatibility: your team does not need to learn a new query language, and you can join time-series data with relational data in the same database.
Prometheus stores its own metrics in a custom time-series format optimized for the pull-based scraping model. It is excellent for short-term metrics (typically 15 to 30 days of retention). For long-term storage, Prometheus data is often forwarded to Thanos, Cortex, or VictoriaMetrics, which provide durable, scalable storage.
Time-series databases share common optimizations: data is partitioned by time (recent data is in hot storage, older data is compressed and moved to cold storage), high cardinality dimensions (like unique user IDs) can be expensive and need careful management, and downsampling reduces storage costs by keeping fine-grained data for recent periods and coarser aggregates for older periods.
**Log Aggregation and Centralized Logging (ELK/EFK Stack)**
When you have 50 microservices running across hundreds of containers, debugging a problem by SSH-ing into individual machines and reading log files is impossible. Centralized logging collects logs from every service into a single searchable system.
The ELK stack (Elasticsearch, Logstash, Kibana) is the most widely deployed centralized logging solution.
Logstash (or its lighter alternative, Filebeat) collects log data from services, parses and transforms it (extracting timestamps, log levels, request IDs, error messages), and ships it to Elasticsearch.
Elasticsearch stores the logs and provides full-text search across them. You can search for all logs containing a specific error message, filter by service name, or correlate logs by request ID across multiple services.
Kibana is the visualization layer. It provides a search interface for exploring logs, dashboards for log-based metrics (error counts over time, log volume by service), and alerting on log patterns.
The EFK stack replaces Logstash with Fluentd (or its lighter version, Fluent Bit).
Fluentd is a unified logging layer that collects logs from many sources using a plugin architecture. It is the standard log collector in Kubernetes environments.
Practical log management tips: use structured logging (JSON format) instead of unstructured text.
Include a correlation ID (request ID) in every log entry so you can trace a request across all services it touches. Set log retention policies (keep detailed logs for 7 to 30 days, keep aggregated logs longer).
Use log levels appropriately: DEBUG for development, INFO for normal operations, WARN for recoverable issues, ERROR for failures that need attention.
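The structured-logging and correlation-ID tips above can be sketched with only the standard library. Field names like `request_id` and `service` are illustrative conventions, not a fixed schema; real services would typically route this through their logging framework rather than `print`.

```python
import json
import uuid

def log_json(level, message, **fields):
    # Emit one JSON object per line so the ELK/EFK pipeline can parse
    # fields (level, service, request_id) without regex gymnastics.
    record = {"level": level, "message": message, **fields}
    line = json.dumps(record)
    print(line)
    return line

# The correlation ID is generated once at the edge and passed to every
# downstream service, so logs from one request can be joined in Kibana.
request_id = str(uuid.uuid4())
log_json("INFO", "order received", request_id=request_id, service="order-service")
log_json("ERROR", "payment declined", request_id=request_id,
         service="payment-service", error_code="card_expired")
```

Searching the log store for that one `request_id` then returns the full story of the request across both services, which is exactly what unstructured text logs make painful.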
**Distributed Tracing (Jaeger, Zipkin, OpenTelemetry)**
Logs tell you what happened on a single service.
Distributed tracing tells you what happened across all services for a single request.
When a user's request takes 3 seconds instead of 300 milliseconds, tracing shows you which service in the chain is responsible.
A trace represents the complete journey of a request through your system. It consists of spans, where each span represents a unit of work in one service.
A trace for an order placement might include spans for the API gateway (10ms), the order service (50ms), the payment service (200ms), the inventory service (30ms), and the database query within the inventory service (15ms).
Spans nest to show parent-child relationships, and sibling spans may run in sequence or in parallel.
Jaeger (developed by Uber, now a CNCF project) is the most popular open-source distributed tracing system. It collects traces, stores them, and provides a UI for searching and visualizing request flows.
Zipkin (originally from Twitter) provides similar capabilities with a different architecture. Both Jaeger and Zipkin supported the OpenTracing standard, which has since been superseded by OpenTelemetry.
OpenTelemetry is the current industry standard for instrumenting applications. It provides a unified API and SDK for collecting traces, metrics, and logs. OpenTelemetry replaced the older OpenTracing and OpenCensus projects. Applications instrument their code with OpenTelemetry, and the data can be exported to any compatible backend: Jaeger, Zipkin, Datadog, New Relic, or others.
Implementing distributed tracing requires propagating a trace context (trace ID and span ID) through every service call.
When Service A calls Service B, it passes the trace context in an HTTP header. Service B creates a child span linked to A's span. This chaining connects all spans into a single trace.
Most modern frameworks and service meshes inject this context automatically.
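Trace-context propagation can be sketched as follows. The header names and ID formats here are loose simplifications of the W3C Trace Context idea, and the services are hypothetical; OpenTelemetry SDKs do this for you in practice.

```python
import uuid

def start_span(service, headers=None):
    # Reuse the incoming trace ID if one arrived; otherwise this is the
    # root span and a fresh trace begins here.
    incoming = headers or {}
    trace_id = incoming.get("trace-id") or uuid.uuid4().hex
    return {
        "service": service,
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": incoming.get("span-id"),  # links child to caller
    }

def outgoing_headers(span):
    # Propagate the context on every outbound call (e.g. HTTP headers)
    # so the callee creates a child span in the same trace.
    return {"trace-id": span["trace_id"], "span-id": span["span_id"]}

gateway = start_span("api-gateway")                             # root span
orders = start_span("order-service", outgoing_headers(gateway))
payments = start_span("payment-service", outgoing_headers(orders))

# All three spans share one trace_id; parent_id links them into a tree.
assert gateway["trace_id"] == orders["trace_id"] == payments["trace_id"]
assert payments["parent_id"] == orders["span_id"]
```

The tracing backend reassembles all spans carrying the same trace ID into one tree, which is what the Jaeger UI renders as the request's waterfall view.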
**Alerting and On-Call Systems**
Monitoring data is useless unless it triggers action when something goes wrong. Alerting systems evaluate metric conditions and notify the right people when thresholds are breached.
Good alerts have four properties.
Actionable: the alert tells you something you can actually fix. "CPU at 95% on server-3" is actionable. "Disk usage increased by 0.1%" is noise.
Urgent: the alert fires only for conditions that need immediate attention. If it can wait until morning, it should not page someone at 3 AM.
Specific: the alert tells you where the problem is, not just that a problem exists. "Payment service p99 latency exceeded 2 seconds" is specific. "Something is slow" is not.
Low false-positive rate: if an alert fires regularly without a real problem, the on-call engineer learns to ignore it, and eventually ignores a real incident.
Alert routing sends notifications to the right channel based on severity.
Critical alerts (system down, data loss risk) page the on-call engineer via PagerDuty, Opsgenie, or VictorOps. Warning alerts go to a Slack channel. Informational alerts go to a dashboard.
On-call rotation ensures someone is always responsible for responding to alerts. Teams rotate the on-call responsibility weekly.
The on-call engineer has runbooks for common issues, escalation paths for unfamiliar problems, and the authority to make decisions during incidents.
**Dashboards and Observability Best Practices**
Observability is the broader discipline that encompasses metrics, logs, and traces.
A system is observable if you can understand its internal state by examining its external outputs. The three pillars of observability are metrics (what is happening numerically), logs (what happened in detail), and traces (how a request flowed through the system).
Dashboard design should follow a hierarchy.
The top-level dashboard shows the health of the entire system at a glance: overall request rate, error rate, and latency percentiles.
Drilling down, each service has its own dashboard showing its specific metrics: request throughput, error rate, CPU/memory utilization, database query latency, cache hit ratio, and queue depth.
Drilling further, individual dashboards for specific subsystems (database performance, cache performance, queue performance) provide fine-grained detail for debugging.
Grafana is the standard tool for building dashboards. It connects to Prometheus for metrics, Elasticsearch for logs, and Jaeger for traces, providing a unified interface for all three observability pillars.
RED method (Rate, Errors, Duration) is a practical framework for service-level monitoring. For every service, track: the rate of requests (throughput), the number of errors (failures), and the duration of requests (latency distribution). If all three are within normal ranges, the service is healthy.
USE method (Utilization, Saturation, Errors) applies to infrastructure resources. For every resource (CPU, memory, disk, network), track: utilization (how busy it is), saturation (how much work is queued waiting), and errors (hardware or software failures).
High utilization with low saturation means the resource is busy but coping.
High saturation means the resource is overloaded and requests are waiting.
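The RED numbers can be derived from a window of request records, as in this toy sketch (the record fields and the 500-and-above error convention are assumptions; a real system would pull these from Prometheus counters and histograms).

```python
def red_metrics(requests, window_seconds):
    # Rate: requests per second over the observation window.
    rate = len(requests) / window_seconds
    # Errors: count of failed requests (here, HTTP 5xx).
    errors = sum(1 for r in requests if r["status"] >= 500)
    # Duration: the latency distribution; a real dashboard would show
    # full percentiles, this sketch reports the median.
    durations = sorted(r["duration_ms"] for r in requests)
    return {
        "rate_per_s": rate,
        "errors": errors,
        "p50_ms": durations[len(durations) // 2],
    }

window = [
    {"status": 200, "duration_ms": 45},
    {"status": 200, "duration_ms": 60},
    {"status": 503, "duration_ms": 2000},
]
metrics = red_metrics(window, window_seconds=60)
```

If rate, errors, and duration all sit inside their normal bands, the service is healthy; the USE method asks the analogous three questions of each underlying resource.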
**Beginner Mistake to Avoid**
New engineers often instrument their system with hundreds of metrics and dozens of dashboards, then never look at any of them.
A dashboard that nobody checks is the same as no dashboard.
Start with the RED metrics for your three most critical services and the USE metrics for your database and cache.
Set up alerts for those metrics.
Only add more metrics and dashboards when a specific debugging scenario reveals a gap in your visibility. Fewer, well-understood dashboards are infinitely more valuable than a wall of charts that nobody can interpret.
**Interview-Style Question**
> Q: A user reports that your application is intermittently slow. Sometimes requests complete in 100ms, but occasionally they take 5 seconds. How do you use your observability infrastructure to find the cause?
> A: Start with the metrics dashboard. Check p99 latency for the user-facing service. If p99 is high but p50 is normal, the problem affects a small percentage of requests. Check if the latency spikes correlate with specific times, specific endpoints, or specific downstream services. Next, use distributed tracing. Find traces for requests that took 5 seconds and compare them with traces for normal 100ms requests. The slow traces will show which span is consuming the extra time. Maybe the payment service span takes 4.8 seconds in slow requests but 20ms in normal ones. Now you know where the problem is. Drill into the payment service's logs, filtering by the trace IDs of slow requests. The logs might reveal a specific database query that occasionally runs without an index (a query plan change), an external API call that times out and retries, or garbage collection pauses. The combination of metrics (identifying the pattern), traces (isolating the component), and logs (finding the root cause) is how you systematically diagnose intermittent performance problems.
_Three Pillars of Observability_
**KEY TAKEAWAYS**
* Analytics systems follow a pipeline: collect events, store raw and aggregated data, process with batch or stream jobs, and visualize with dashboards.
* Prometheus is the standard for metrics collection (pull-based). Use percentiles (p50, p95, p99) instead of averages to understand latency distribution.
* Time-series databases (InfluxDB, TimescaleDB) optimize for time-indexed data with high write throughput, time-range queries, and automatic downsampling.
* Centralized logging (ELK/EFK) collects logs from all services into a searchable system. Use structured logging (JSON) and correlation IDs (request IDs) to trace requests across services.
* Distributed tracing (Jaeger, OpenTelemetry) shows how a single request flows through multiple services. It is essential for diagnosing latency issues in microservices.
* Good alerts are actionable, urgent, specific, and have low false-positive rates. Use the RED method (Rate, Errors, Duration) for services and the USE method (Utilization, Saturation, Errors) for infrastructure.
* Start with fewer, well-understood dashboards and alerts for your most critical services. Expand coverage based on actual debugging needs, not theoretical completeness.