Data Lake Architecture
A data lake is a centralized repository that stores all of an organization's data in its raw, unprocessed form.
Structured data from databases, semi-structured data from JSON logs and APIs, and unstructured data like images, videos, and documents all live in the same place.
The data is stored as-is, with no schema imposed at write time.
Structure is applied later, when the data is read and analyzed.
How Data Lakes Work
Data flows into the lake from multiple sources: application databases, event streams, third-party APIs, IoT devices, log files, and file uploads. It is stored in a cheap, scalable object store (AWS S3, Google Cloud Storage, Azure Data Lake Storage).
The raw data is organized by source, date, and type using a folder-like key structure: s3://datalake/raw/orders/2026/04/03/orders_20260403.parquet.
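The partitioned key layout above can be generated with a small helper. This is an illustrative sketch (the bucket name and prefix follow the example path, not any real convention):

```python
from datetime import date

def raw_zone_key(source: str, batch_date: date, filename: str) -> str:
    """Build a raw-zone object key partitioned by source and date,
    mirroring the folder-like layout shown above. Bucket and prefix
    are illustrative."""
    return (
        f"s3://datalake/raw/{source}/"
        f"{batch_date.year:04d}/{batch_date.month:02d}/{batch_date.day:02d}/"
        f"{filename}"
    )

key = raw_zone_key("orders", date(2026, 4, 3), "orders_20260403.parquet")
print(key)  # s3://datalake/raw/orders/2026/04/03/orders_20260403.parquet
```

Date-based partitioning like this lets query engines prune whole prefixes when a query filters on date, instead of scanning the entire lake.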
Data in the lake is typically organized into zones that represent stages of processing.
The raw zone (also called the bronze layer) contains data exactly as it arrived from the source.
Nothing is modified, filtered, or transformed. This zone serves as the immutable historical record.
The cleaned zone (silver layer) contains data that has been validated, deduplicated, standardized, and merged.
Schema is applied, data types are corrected, and obvious errors are removed.
The curated zone (gold layer) contains data that has been transformed, aggregated, and organized for specific analytical use cases. This is what analysts and data scientists query directly.
Why Data Lakes Exist
Before data lakes, organizations loaded only the data they knew they needed into a data warehouse, with a predefined schema.
If a new analysis required data that was not in the warehouse, someone had to go back to the source systems, build a new ETL pipeline, and load the data. This process took weeks.
Data lakes flip the approach. Store everything first, figure out what you need later.
When a data scientist wants to analyze user behavior alongside weather data and marketing campaign data, all three datasets are already in the lake. They apply schema and transformations at query time.
Data Lake Challenges
The biggest risk with data lakes is that they become data swamps: a dumping ground of disorganized, undocumented, untrusted data that nobody can use effectively.
Without governance (metadata catalogs, data quality checks, access controls, documentation), a data lake degrades from a valuable asset to an expensive storage bill.
Data lakes also lack the performance optimizations of databases.
Querying raw files in S3 is slower than querying a columnar database. This is why the lakehouse pattern (covered below) emerged, adding database-like performance to lake storage.
Data Warehouse vs. Data Lake vs. Data Lakehouse
These three architectures represent different philosophies for managing analytical data.
Data Warehouse
A data warehouse stores structured, processed data optimized for analytical queries.
Data is cleaned, transformed, and loaded into predefined schemas (star schema, snowflake schema) before it enters the warehouse.
The schema is defined upfront (schema-on-write). This makes queries fast and reliable because the data is already organized for analysis.
Data warehouses (Snowflake, BigQuery, Redshift) use columnar storage and query optimization techniques that make aggregation queries (SUM, AVG, COUNT, GROUP BY) extremely fast. They are the right choice when your analytical workloads involve structured data with known query patterns.
The limitation is rigidity.
Adding a new data source or changing the schema requires modifying the ETL pipeline and potentially rebuilding tables.
Unstructured data (images, videos, raw text) does not fit into warehouse schemas.
Data Lake
A data lake stores raw data in any format without predefined schema (schema-on-read). It handles structured, semi-structured, and unstructured data. It is cheap (object storage costs a fraction of warehouse storage) and flexible (no schema changes needed to add new data).
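The schema-on-write vs. schema-on-read distinction can be shown side by side. This is a minimal sketch with a made-up two-field schema: the warehouse path validates and coerces before storing, while the lake path stores raw JSON as-is and applies structure only at read time.

```python
import json

SCHEMA = {"order_id": int, "amount": float}  # illustrative schema

def write_to_warehouse(record: dict) -> dict:
    """Schema-on-write: validate and coerce before storing.
    Bad data is rejected at ingestion time."""
    for field, ftype in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        record[field] = ftype(record[field])  # coerce now, or fail now
    return record

def read_from_lake(raw_line: str) -> dict:
    """Schema-on-read: the raw JSON was stored untouched;
    structure is applied here, at query time."""
    record = json.loads(raw_line)
    return {f: t(record[f]) for f, t in SCHEMA.items() if f in record}

stored = write_to_warehouse({"order_id": "17", "amount": "99.5"})
parsed = read_from_lake('{"order_id": 17, "amount": 99.5, "extra": "kept raw"}')
```

Note that the lake keeps the `extra` field in storage even though this particular read ignores it: a future analysis can apply a different schema to the same raw data.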
The limitation is query performance and usability.
Querying raw files is slower than querying a warehouse.
Without schema enforcement, data quality can degrade.
Analysts accustomed to SQL and structured tables find raw lake data harder to work with.
Data Lakehouse
A data lakehouse combines the flexibility of a data lake with the performance and management features of a data warehouse.
Data is stored in open file formats (Parquet, ORC) on cheap object storage (like a data lake), but a metadata and transaction layer adds warehouse-like capabilities: ACID transactions, schema enforcement, time travel (querying data as of a past point in time), and indexing.
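Time travel falls out naturally when commits append immutable snapshots instead of overwriting data. The toy class below illustrates the concept only; real table formats like Delta Lake and Iceberg track file-level metadata rather than whole-table copies.

```python
class VersionedTable:
    """Conceptual sketch of time travel: each commit appends an
    immutable snapshot, so any past version stays queryable."""

    def __init__(self):
        self._versions = []  # list of {primary_key: row} snapshots

    def commit(self, rows: dict) -> int:
        self._versions.append(dict(rows))  # copy = immutable snapshot
        return len(self._versions) - 1     # version number of this commit

    def read(self, version=None) -> dict:
        # None means "latest"; an integer reads the table as of that commit
        return self._versions[-1 if version is None else version]

t = VersionedTable()
t.commit({"order-1": {"status": "pending"}})
t.commit({"order-1": {"status": "shipped"}})
```

Querying `t.read(0)` returns the table as it looked at the first commit, which is exactly what "querying data as of a past point in time" means.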
Delta Lake (by Databricks), Apache Iceberg (used by Netflix, Apple, and others), and Apache Hudi are the three major lakehouse table formats. They sit on top of object storage and provide the metadata layer that turns a data lake into a lakehouse.
The lakehouse pattern is increasingly the default for new data architectures because it eliminates the need to maintain both a lake and a warehouse.
You store data once (in the lake), apply lakehouse metadata for governance and performance, and query it with warehouse-level speed.
| Aspect | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Data format | Structured only | Any format | Any format (with schema enforcement option) |
| Schema | Schema-on-write (predefined) | Schema-on-read (flexible) | Schema-on-write or read (configurable) |
| Storage cost | Higher (proprietary formats) | Lowest (object storage) | Low (object storage) |
| Query performance | Fastest (optimized engine) | Slowest (raw file scan) | Fast (indexing, caching, optimization) |
| ACID transactions | Yes | No | Yes |
| Best for | BI dashboards, SQL analytics | ML training data, raw data archival | Unified analytics and ML on one platform |
Interview-Style Question
Q: Your company currently has a data warehouse for business analytics and a separate data lake for the data science team. Maintaining both is expensive and creates data inconsistency. How would you consolidate?
A: Migrate to a lakehouse architecture. Store all data in a single object store (S3) using an open table format like Delta Lake or Apache Iceberg. The business analytics team queries the curated (gold) layer using SQL through a query engine like Spark SQL, Trino, or the Databricks SQL warehouse, getting performance comparable to their current warehouse. The data science team accesses the raw and cleaned layers directly for ML training, using Spark or pandas, without needing a separate copy of the data. ACID transactions from the lakehouse format ensure data consistency. Schema enforcement on the gold layer protects analysts from data quality issues. Time travel enables debugging and auditing. The storage cost drops because data lives on S3 instead of proprietary warehouse storage. The engineering overhead drops because you maintain one data platform instead of two.
Data Pipeline Orchestration: Airflow, Dagster, Prefect
Data pipelines are not single scripts. They are complex workflows with dependencies, schedules, error handling, and retry logic.
A daily revenue report might depend on: extracting order data from PostgreSQL, extracting payment data from Stripe's API, extracting refund data from a separate service, joining all three datasets, computing aggregates, loading results into the warehouse, and sending an email summary.
If step 3 fails, steps 4 through 7 should not run.
If step 5 fails, you should not re-run steps 1-4 on retry.
Orchestration tools manage this complexity.
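The core behavior — run tasks in dependency order, retry failures, skip everything downstream of a failure — can be sketched with a toy scheduler. This is a stdlib-only illustration of the idea, not how Airflow or any real orchestrator is implemented; the task names are hypothetical.

```python
import time

def run_pipeline(tasks, deps, retries=3, delay=0.0):
    """Toy orchestrator: run each task once all upstreams succeed,
    retry failures, and skip tasks whose upstreams failed or were skipped."""
    status = {}
    pending = list(tasks)
    while pending:
        for name in list(pending):
            upstream = [status.get(d) for d in deps.get(name, [])]
            if any(s in ("failed", "skipped") for s in upstream):
                status[name] = "skipped"       # never run downstream of a failure
                pending.remove(name)
            elif all(s == "success" for s in upstream):
                for _ in range(retries):
                    try:
                        tasks[name]()
                        status[name] = "success"
                        break
                    except Exception:
                        time.sleep(delay)      # back off before retrying
                else:
                    status[name] = "failed"    # retries exhausted
                pending.remove(name)
    return status

def stripe_down():
    raise RuntimeError("payment API unavailable")

tasks = {
    "extract_orders": lambda: None,
    "extract_payments": stripe_down,  # this extraction always fails
    "join": lambda: None,
    "load": lambda: None,
}
deps = {"join": ["extract_orders", "extract_payments"], "load": ["join"]}
result = run_pipeline(tasks, deps, retries=2)
# extract_orders succeeds; extract_payments fails; join and load are skipped
```

Note that a successful task (`extract_orders`) is not re-run when a sibling fails, which is exactly the "do not re-run steps 1-4 on retry" requirement above.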
Apache Airflow
Airflow (originally built at Airbnb, now an Apache project) is the most widely deployed pipeline orchestrator.
You define pipelines as DAGs (Directed Acyclic Graphs) in Python. Each node in the DAG is a task (a Python function, a SQL query, a Spark job, a call to an external API).
Edges define dependencies.
Airflow provides scheduling (run this DAG every day at midnight), dependency management (task B runs only after task A succeeds), retry logic (retry task C three times with 5-minute intervals), alerting (notify Slack if any task fails), backfilling (re-run the pipeline for a historical date range), and a web UI for monitoring DAG status, task logs, and execution history.
Airflow's weakness is its execution model.
The Airflow scheduler is a single process that can become a bottleneck with thousands of DAGs.
Dynamic DAGs (where the structure changes at runtime) are possible but cumbersome.
And Airflow's programming model mixes pipeline definition with execution logic, making testing harder.
Dagster
Dagster (by Dagster Labs, formerly Elementl) is a newer orchestrator designed to address Airflow's shortcomings. Its core concept is the software-defined asset: instead of defining tasks that run in order, you define the data assets your pipeline produces and the dependencies between them.
A Dagster pipeline for the revenue report would define assets: raw_orders, raw_payments, raw_refunds, joined_data, revenue_aggregates, and report_email. Dagster infers the execution order from the asset dependencies.
It provides built-in data quality checks (validating that raw_orders has the expected schema before downstream assets run), a rich development experience (local testing, type checking, IDE support), and a modern UI with asset lineage visualization.
Dagster is gaining adoption for new projects, especially teams that value software engineering practices (testing, typing, modularity) in their data pipelines.
Prefect
Prefect positions itself as a simpler alternative to Airflow. It uses Python decorators to define tasks and flows, and it handles scheduling, retries, and monitoring through a managed cloud service (Prefect Cloud) or a self-hosted server.
Prefect's strengths are its minimal boilerplate (a flow is a decorated Python function), its handling of dynamic workflows (loops, conditionals within flows), and its hybrid execution model (the orchestrator runs in the cloud but tasks execute in your infrastructure).
| Orchestrator | Architecture | Strengths | Best For |
|---|---|---|---|
| Airflow | Self-hosted or managed (MWAA, Astronomer) | Largest community, most integrations, battle-tested | Large, established data teams |
| Dagster | Self-hosted or managed (Dagster Cloud) | Asset-based model, data quality, developer experience | Teams valuing software engineering practices |
| Prefect | Hybrid (cloud orchestrator, local execution) | Simplicity, dynamic workflows, minimal boilerplate | Smaller teams, Python-first workflows |
Change Data Capture (CDC)
Change Data Capture captures every insert, update, and delete that happens in a database and publishes those changes as events.
CDC turns your database into an event stream without modifying your application code.
Why CDC Matters
Many system design patterns require knowing what changed in a database.
Keeping a search index in sync with the database (when a product's price changes, update Elasticsearch).
Populating a data lake with fresh data from production databases.
Implementing the outbox pattern by capturing outbox table inserts from the transaction log.
Replicating data between databases or between a database and a cache.
Without CDC, you either poll the database periodically (which is inefficient, adds load, and misses changes between polls) or modify your application to publish events alongside database writes (which couples your application to the event infrastructure and risks inconsistency if the write succeeds but the event publish fails).
How CDC Works
The most reliable CDC approach reads the database's transaction log (also called the write-ahead log, WAL, or binlog).
Every change the database commits is recorded in this log.
A CDC connector reads the log, extracts the changes, and publishes them to a message broker (typically Kafka).
Debezium is the most widely used open-source CDC platform. It provides connectors for PostgreSQL (reading the WAL), MySQL (reading the binlog), MongoDB (reading the oplog), SQL Server, Oracle, and others.
Debezium runs as a Kafka Connect connector, publishing change events to Kafka topics with one topic per database table.
A Debezium change event for a row update includes the table name, the primary key, the before values (the row before the change), and the after values (the row after the change).
This enables consumers to understand exactly what changed and react accordingly.
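A consumer's view of such an event can be sketched as follows. The envelope below is a simplified, hand-written approximation of Debezium's before/after structure (real events carry additional metadata such as timestamps and source details), and the row values are made up:

```python
import json

# Simplified Debezium-style change event for an UPDATE ("op": "u")
# on the orders table. Values are illustrative.
event = json.loads("""
{
  "source": {"table": "orders"},
  "op": "u",
  "before": {"id": 42, "status": "pending", "amount": 99.5},
  "after":  {"id": 42, "status": "shipped", "amount": 99.5}
}
""")

def changed_fields(evt):
    """Diff before/after so a consumer reacts only to what changed."""
    before, after = evt["before"], evt["after"]
    return {k: (before.get(k), v) for k, v in after.items()
            if before.get(k) != v}

diff = changed_fields(event)
print(diff)  # {'status': ('pending', 'shipped')}
```

A search-sync consumer, for example, could use such a diff to skip index updates when none of the searchable fields changed.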
CDC Use Cases
Search index synchronization: A Kafka consumer reads CDC events from the products table and updates the Elasticsearch index in near-real-time. When a product's price changes in PostgreSQL, the change appears in search results within seconds.
Data lake ingestion: A Kafka consumer reads CDC events from all production tables and writes them to the data lake. The lake always has a near-real-time copy of the production data without any load on the production database (CDC reads the log, not the live tables).
Cache invalidation: A consumer reads CDC events and invalidates or updates the corresponding cache entries in Redis. This provides reliable cache invalidation without modifying application code.
Cross-service data synchronization: In a microservices architecture with database-per-service, CDC lets one service's data changes propagate to other services through events without introducing direct database coupling.
Data Quality and Data Governance
As data systems grow, the quality and trustworthiness of data become critical concerns.
A dashboard showing incorrect revenue numbers because of a data quality issue can lead to bad business decisions.
A data pipeline that exposes personally identifiable information to unauthorized users violates regulations.
Data Quality
Data quality measures whether data is accurate, complete, consistent, timely, and valid.
Validation checks run at every stage of the pipeline. Schema validation ensures data matches the expected structure (correct column names, correct data types). Range validation ensures values fall within expected bounds (order amounts should be positive, dates should not be in the future). Completeness validation ensures required fields are not null. Referential validation ensures foreign key relationships hold (every order references a valid customer).
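Hand-rolled versions of these checks look like the sketch below (the field names and thresholds are illustrative; quality frameworks express the same ideas declaratively):

```python
def check_quality(rows):
    """Run completeness, type, and range checks on a batch of order
    records; return a list of human-readable violations."""
    errors = []
    for i, row in enumerate(rows):
        if row.get("customer_id") is None:                   # completeness
            errors.append(f"row {i}: customer_id is null")
        if not isinstance(row.get("amount"), (int, float)):  # schema / type
            errors.append(f"row {i}: amount is not numeric")
        elif row["amount"] <= 0:                             # range
            errors.append(f"row {i}: amount must be positive")
    return errors

good = {"customer_id": 7, "amount": 25.0}
bad = {"customer_id": None, "amount": -3}
errors = check_quality([good, bad])  # two violations from the bad row
```

In a real pipeline, a non-empty error list would halt the downstream stages rather than let bad data propagate into the curated zone.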
Data quality frameworks like Great Expectations (open source), Soda (open source and managed), and dbt tests (built into the dbt transformation tool) define quality expectations as code. You write assertions like "the orders table should have no null customer_id values" and "the daily revenue should not deviate by more than 20% from the 7-day average." These assertions run automatically in the pipeline and halt processing if they fail.
Data observability platforms like Monte Carlo, Bigeye, and Anomalo continuously monitor data quality across your entire pipeline. They detect anomalies automatically (a sudden drop in row count, an unexpected schema change, a spike in null values) without requiring you to define every check manually.
Data Governance
Data governance establishes policies and processes for managing data across the organization: who can access what data, how data is classified, how long data is retained, and how data lineage is tracked.
Data catalogs (AWS Glue Data Catalog, Apache Atlas, DataHub, Amundsen) provide a searchable inventory of all datasets in the organization. Each dataset is documented with its schema, owner, description, classification (PII, confidential, public), and lineage (which pipeline produced it, which datasets it depends on).
Access control ensures that sensitive data is accessible only to authorized users and services. Column-level security (hiding salary columns from non-HR users) and row-level security (showing each team only their own department's data) are common patterns. Lakehouse formats and data warehouse platforms support fine-grained access control natively.
Data lineage tracks the complete journey of data from source to consumption. When an analyst questions a number in a dashboard, lineage shows which table it came from, which transformation produced it, which source system the raw data originated from, and when each step last ran. This is essential for debugging data issues and for compliance audits ("show me where this customer's data was used").
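At its core, lineage is a graph walk over "dataset X was produced from datasets Y and Z" edges. A minimal sketch, with hypothetical dataset names:

```python
def upstream_lineage(dataset, produced_from):
    """Return every dataset that `dataset` transitively depends on.
    `produced_from` maps each dataset to its direct inputs."""
    seen, stack = set(), list(produced_from.get(dataset, []))
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(produced_from.get(d, []))  # walk further upstream
    return seen

lineage = {
    "revenue_dashboard": ["revenue_aggregates"],
    "revenue_aggregates": ["cleaned_orders"],
    "cleaned_orders": ["raw_orders"],
}
sources = upstream_lineage("revenue_dashboard", lineage)
# {'revenue_aggregates', 'cleaned_orders', 'raw_orders'}
```

The same graph, traversed in the opposite direction, answers the compliance question "which dashboards does this customer's raw data feed into?"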
Multi-Tenancy in Data Systems
Multi-tenancy means a single system serves multiple customers (tenants), each with their own data that must be isolated from other tenants. A SaaS platform where Company A and Company B both use the same application but must never see each other's data is a multi-tenant system.
Isolation Models
Shared tables with tenant ID. All tenants' data lives in the same tables. Every table has a tenant_id column, and every query includes WHERE tenant_id = ?. This is the simplest and cheapest model: one database, one set of tables, one application deployment. The risk is accidental data leakage if a query forgets the tenant filter. A missing WHERE clause exposes all tenants' data. Enforce tenant filtering at the ORM or middleware level so application developers cannot accidentally skip it.
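One way to enforce the filter at a single choke point is to wrap the connection so that tenant scoping is injected into every query rather than written by hand. A minimal sqlite3 sketch (table layout and tenant names are hypothetical):

```python
import sqlite3

class TenantScopedDB:
    """Middleware-level enforcement sketch: the tenant filter is added
    in one place, so callers cannot accidentally omit it."""

    def __init__(self, conn, tenant_id):
        self.conn, self.tenant_id = conn, tenant_id

    def fetch_orders(self):
        # tenant_id is injected here; callers never write the WHERE clause
        cur = self.conn.execute(
            "SELECT id, item FROM orders WHERE tenant_id = ?",
            (self.tenant_id,))
        return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, tenant_id TEXT, item TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "acme", "widget"), (2, "globex", "gadget")])

acme_orders = TenantScopedDB(conn, "acme").fetch_orders()
# only acme's row is visible: [(1, 'widget')]
```

Production systems achieve the same effect through ORM scopes, query middleware, or database-level row security, but the principle is identical: the filter lives in one audited place.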
Shared database, separate schemas. Each tenant gets its own schema (namespace) within the same database. Tenant A's data is in tenant_a.orders, Tenant B's in tenant_b.orders. This provides stronger isolation (queries only access one schema) and allows per-tenant schema customization. The cost is managing potentially thousands of schemas in one database.
Separate databases per tenant. Each tenant gets a completely independent database. This provides the strongest isolation, the easiest compliance with data residency requirements (Tenant A's database can be in EU-West while Tenant B's is in US-East), and independent scaling per tenant. The cost is operational complexity: managing thousands of database instances, running migrations across all of them, and routing each request to the correct database.
| Model | Isolation | Cost | Operational Complexity | Best For |
|---|---|---|---|---|
| Shared tables + tenant ID | Weakest (application-level) | Lowest | Lowest | Small SaaS, low data sensitivity |
| Shared DB, separate schemas | Moderate (schema-level) | Moderate | Moderate | Mid-size SaaS, moderate isolation needs |
| Separate databases | Strongest (database-level) | Highest | Highest | Enterprise SaaS, regulated industries, data residency |
Multi-Tenancy in Data Pipelines
Data pipelines in multi-tenant systems must maintain tenant isolation throughout the processing chain.
Raw data ingestion must tag every record with a tenant ID.
Transformations must never join data across tenants.
Output datasets must enforce tenant-level access control. Dashboards and reports must filter by tenant.
The lakehouse pattern supports multi-tenancy through row-level and column-level security in the metadata layer. Each tenant's users see only their own data when querying the same physical tables.
OLTP vs. OLAP Systems
The distinction between OLTP and OLAP defines two fundamentally different approaches to data storage and querying.
Understanding this distinction informs database selection, schema design, and infrastructure decisions throughout system design.
OLTP (Online Transaction Processing)
OLTP systems handle the day-to-day operational workload of an application: processing orders, updating user profiles, managing inventory, recording payments. Each operation touches a small number of rows (insert one order, update one user's email, decrement one product's stock).
OLTP characteristics include high concurrency (thousands of simultaneous users), low latency (each operation completes in milliseconds), row-oriented storage (data is stored and retrieved row by row), normalized schemas (3NF to minimize redundancy and protect data integrity), and ACID transactions (ensuring correctness under concurrent operations).
PostgreSQL, MySQL, SQL Server, and Oracle are classic OLTP databases.
DynamoDB, MongoDB, and Cassandra serve OLTP workloads in the NoSQL space.
OLAP (Online Analytical Processing)
OLAP systems handle analytical queries: aggregating, summarizing, and analyzing large volumes of data.
"Total revenue by product category for each region in 2025."
"Average order value per customer segment over the last 12 months."
Each query scans millions or billions of rows, aggregating them into a small result set.
OLAP characteristics include lower concurrency (dozens to hundreds of simultaneous analysts, not thousands of users), higher latency tolerance (queries take seconds to minutes, not milliseconds), columnar storage (data is stored column by column, making aggregation on specific columns extremely fast), denormalized schemas (star schema, wide tables to minimize joins and speed up scans), and eventual consistency (analytical data can be minutes or hours stale without business impact).
Snowflake, BigQuery, Redshift, ClickHouse, and DuckDB are OLAP databases.
They use columnar storage and query optimization techniques that make aggregation queries 10 to 100 times faster than the same queries on an OLTP database.
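The storage-layout difference behind that speedup can be illustrated with plain Python lists. This toy sketch only shows the access pattern, not the compression and vectorization that make real columnar engines fast:

```python
# Row-oriented: each record is stored together, so an aggregate
# over one field still touches every field of every row.
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 20.0},
    {"id": 3, "region": "EU", "amount": 30.0},
]
row_sum = sum(r["amount"] for r in rows)

# Column-oriented: each column is stored contiguously, so summing
# one column reads only that column's values.
columns = {
    "id": [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "amount": [10.0, 20.0, 30.0],
}
col_sum = sum(columns["amount"])  # never touches id or region
```

Both layouts produce the same answer, but at billions of rows the columnar version reads a small fraction of the bytes, and values within a column compress far better than mixed-type rows.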
Why They Cannot Be the Same System
Running OLAP queries on an OLTP database degrades operational performance.
A heavy analytical query that scans 50 million rows competes for CPU, memory, and I/O with the thousands of short transactional queries that users depend on.
The analytical query slows down, and more critically, the transactional queries slow down too. Users experience latency spikes while the report runs.
Running OLTP workloads on an OLAP database is equally problematic.
OLAP databases are optimized for bulk reads, not for the high-concurrency, low-latency single-row operations that OLTP demands.
Inserting one row at a time into a columnar database is inefficient because the database needs to update each column's storage separately.
The standard pattern is to separate them: OLTP databases handle the operational workload, and data is replicated (through ETL, CDC, or event streaming) to an OLAP database for analytics.
| Aspect | OLTP | OLAP |
|---|---|---|
| Primary operations | Insert, update, delete (single rows) | Select with aggregation (millions of rows) |
| Query latency | Milliseconds | Seconds to minutes |
| Concurrency | Thousands of users | Dozens of analysts |
| Storage format | Row-oriented | Column-oriented |
| Schema | Normalized (3NF) | Denormalized (star schema) |
| Consistency | Strong (ACID) | Eventual (minutes to hours stale) |
| Examples | PostgreSQL, MySQL, DynamoDB | Snowflake, BigQuery, ClickHouse |
Beginner Mistake to Avoid
New engineers sometimes try to serve both operational and analytical workloads from a single database.
"PostgreSQL can handle analytics, just run the report query during off-peak hours."
This works until it does not.
The report query that took 30 seconds with 1 million rows takes 30 minutes with 100 million rows and locks tables that the application needs.
By the time you realize the database cannot serve both workloads, your analytics team is blocked and your users are experiencing degraded performance.
Separate OLTP and OLAP from the start if your data volume is growing and your analytics needs are non-trivial.
Interview-Style Question
Q: Your e-commerce platform uses PostgreSQL for all operations and analytics. With 500 million orders and growing, analytical queries are slowing down the production database. How do you fix this?
A: Separate the OLTP and OLAP workloads. Keep PostgreSQL as the OLTP database for operational traffic (order creation, user updates, inventory management). Set up a CDC pipeline (Debezium) that captures all changes from PostgreSQL's WAL and publishes them to Kafka. A consumer writes these changes to a lakehouse (Delta Lake on S3) or directly to an OLAP database (ClickHouse, Snowflake, or BigQuery). Analysts and dashboards query the OLAP layer instead of PostgreSQL. The CDC pipeline adds only seconds of latency, so the analytical data is near-real-time. For the most time-sensitive analytics (live order monitoring), a stream processing pipeline (Flink) computes real-time aggregates from the Kafka stream and writes to a dashboard-specific time-series database. The result: PostgreSQL handles only the transactional workload it is designed for, and analytical queries run against a system optimized for them, with zero impact on production performance.
KEY TAKEAWAYS
- Data lakes store raw data in any format on cheap object storage. They are flexible but risk becoming data swamps without governance. Organize lakes into raw, cleaned, and curated zones.
- Data warehouses store structured, optimized data for fast analytical queries. Data lakehouses combine lake flexibility with warehouse performance using open table formats (Delta Lake, Iceberg).
- Pipeline orchestrators (Airflow, Dagster, Prefect) manage the scheduling, dependencies, retries, and monitoring of complex data workflows. Dagster's asset-based model is the modern approach.
- CDC captures database changes from the transaction log and publishes them as events. Debezium with Kafka is the standard stack. CDC enables search sync, lake ingestion, cache invalidation, and cross-service data flow.
- Data quality must be validated at every pipeline stage. Use frameworks (Great Expectations, dbt tests) for automated checks and observability platforms for anomaly detection.
- Multi-tenancy isolation ranges from shared tables with tenant IDs (cheapest, weakest isolation) to separate databases per tenant (most expensive, strongest isolation). Choose based on your security requirements and customer expectations.
- OLTP handles transactional operations (fast, concurrent, row-oriented). OLAP handles analytical queries (aggregation, columnar, millions of rows). Separate them early to prevent analytical workloads from degrading production performance.