Data Lake Architecture: From Swamp to Lakehouse
A data lake starts as a flexible storage layer. It becomes a swamp when nobody can find anything, nothing is reliable, and the answer to every data question is "it's somewhere in S3." Here is how to build one that stays navigable.
The data lake concept is simple: store all your data in object storage (S3, GCS, Azure Blob) in open formats, and query it with whatever compute engine you need. The reality is that without discipline around organization, table formats, and metadata management, a data lake degrades into a graveyard of files where schema has been lost, updates are impossible, and queries scan the entire bucket to answer basic questions.
This guide covers the foundational patterns that keep data lakes functional: folder structure, open table formats, partitioning, catalogs, and the lakehouse architecture that has become the standard for modern data platforms.
Why Data Lakes Become Swamps
The swamp pattern is consistent: it starts with ungoverned ingestion. Teams dump data into S3 without consistent naming conventions, folder structures, or format requirements. CSV files coexist with Parquet. Some folders have dates in the path, others do not. Schema is undocumented and changes silently when source systems change.
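Ingestion discipline is enforceable in code: validate object keys against the naming convention before (or as) they land. A minimal sketch, assuming a hypothetical convention of `raw/<source>/<dataset>/year=YYYY/month=MM/day=DD/<file>.parquet`; both the convention and the check are illustrative, not a standard:

```python
import re

# Hypothetical raw-zone key convention (illustrative, not a standard):
#   raw/<source>/<dataset>/year=YYYY/month=MM/day=DD/<file>.parquet
RAW_PATH_PATTERN = re.compile(
    r"^raw/[a-z0-9_]+/[a-z0-9_]+/"
    r"year=\d{4}/month=\d{2}/day=\d{2}/"
    r"[\w.-]+\.parquet$"
)

def is_valid_raw_path(key: str) -> bool:
    """Return True if an object key follows the raw-zone convention."""
    return RAW_PATH_PATTERN.match(key) is not None

# Conforming key passes; mixed case, ad-hoc dates, and CSVs are rejected
print(is_valid_raw_path("raw/stripe/charges/year=2026/month=03/day=27/part-0001.parquet"))
print(is_valid_raw_path("raw/Stripe/charges/2026-03-27/dump.csv"))
```

A check like this can run in a Lambda on `ObjectCreated` events or as a CI gate on ingestion jobs; the point is that the convention is executable, not just documented.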
The second phase is the query problem. Without a catalog, consumers have to know the exact path to every dataset. Without table format metadata, every query must list and read across entire prefixes just to determine which files are relevant. Performance is unpredictable and slow.
The third phase is the update problem. Parquet files are immutable: individual records cannot be updated or deleted in place. When a source system corrects a historical record, the only option is to rewrite entire partitions. When GDPR deletion requests arrive, fulfilling them requires custom tooling on top of plain files.
Open table formats (Apache Iceberg, Delta Lake) solve all of these problems. They are the reason the lakehouse architecture has emerged as the dominant pattern.
Open Table Formats: Iceberg and Delta Lake
Apache Iceberg and Delta Lake add a metadata layer on top of Parquet files that enables ACID transactions, schema evolution, time travel, and efficient query planning. They turn a folder of files into a proper table with the reliability guarantees of a traditional database.
# Apache Iceberg with PySpark
from pyspark.sql import SparkSession
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.type", "glue")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-data-lake/warehouse/")
    .getOrCreate()
)
# Create an Iceberg table
spark.sql("""
CREATE TABLE IF NOT EXISTS glue.analytics.fct_orders (
order_id STRING,
customer_id STRING,
amount DECIMAL(10,2),
order_date DATE,
status STRING
)
USING iceberg
PARTITIONED BY (days(order_date))
LOCATION 's3://my-data-lake/warehouse/analytics/fct_orders/'
""")
# ACID upsert — updates existing rows, inserts new ones
spark.sql("""
MERGE INTO glue.analytics.fct_orders t
USING updates u ON t.order_id = u.order_id
WHEN MATCHED THEN UPDATE SET t.status = u.status
WHEN NOT MATCHED THEN INSERT *
""")
# Time travel — query historical state
spark.sql("""
SELECT * FROM glue.analytics.fct_orders
TIMESTAMP AS OF '2026-03-01 00:00:00'
WHERE order_date = '2026-03-01'
""")

Iceberg maintains a metadata tree: table metadata files point to manifest lists, which point to manifest files, which describe the actual Parquet data files. This enables partition pruning at planning time without scanning any data files, schema evolution without rewriting data, and point-in-time queries by selecting the appropriate snapshot.
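The pruning mechanism can be sketched in miniature: each manifest entry records a data file's partition value and per-column bounds, so the planner can discard files before touching storage. A toy model of the idea; the class and field names here are illustrative, not Iceberg's actual metadata structures:

```python
from dataclasses import dataclass
from datetime import date

# Toy model of Iceberg-style file pruning; names are illustrative,
# not Iceberg's real metadata classes.
@dataclass
class ManifestEntry:
    path: str
    partition_day: date   # partition value recorded at write time
    min_amount: float     # per-file column bounds from file statistics
    max_amount: float

manifest = [
    ManifestEntry("s3://lake/fct_orders/d1/a.parquet", date(2026, 3, 1), 5.0, 90.0),
    ManifestEntry("s3://lake/fct_orders/d1/b.parquet", date(2026, 3, 1), 120.0, 900.0),
    ManifestEntry("s3://lake/fct_orders/d2/c.parquet", date(2026, 3, 2), 10.0, 50.0),
]

def plan(day: date, min_amount: float) -> list[str]:
    """Select only files whose partition and stats can satisfy the predicate."""
    return [
        e.path for e in manifest
        if e.partition_day == day and e.max_amount >= min_amount
    ]

# WHERE order_date = '2026-03-01' AND amount >= 100 reads one file, not three
print(plan(date(2026, 3, 1), 100.0))
```

The real format adds snapshot isolation, delete files, and null/value counts on top, but the planning-time skip is the core of why queries stop scanning the whole bucket.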
Delta Lake (originated at Databricks, now a Linux Foundation project) provides similar capabilities via a transaction log approach. The choice between Iceberg and Delta Lake is primarily a function of which compute engines you use: Iceberg has broader cross-engine support (Spark, Trino, Flink, DuckDB), while Delta Lake is tightly integrated with Databricks and has excellent Spark performance.
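For comparison, enabling Delta Lake in Spark is a similarly small amount of configuration. A `spark-defaults.conf` sketch; the warehouse path is a placeholder, and the extension and catalog class names are the standard ones from the Delta Lake Spark integration:

```properties
spark.sql.extensions            io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.warehouse.dir         s3://my-data-lake/warehouse/
```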
Folder Structure and Naming Conventions
Even with open table formats, folder structure matters for human navigability and for the non-table data that still lives as raw files.
s3://my-data-lake/
raw/ # Source data, immutable after ingestion
salesforce/
accounts/
year=2026/month=03/day=27/
part-0001.parquet
stripe/
charges/
year=2026/month=03/day=27/
warehouse/ # Iceberg/Delta managed tables
analytics/
fct_orders/ # Iceberg table files (managed by engine)
dim_customers/
staging/
stg_salesforce__accounts/
scratch/ # Temporary processing, auto-deleted after 30d
user_rkirsch/
  archive/       # Data beyond retention window

The raw zone uses Hive-style partitioning (year=/month=/day=) for compatibility with the widest range of query engines. The warehouse zone is managed by the Iceberg catalog and should not be manipulated directly. Naming conventions with double underscores for source-scoped tables match dbt conventions and make lineage clearer.
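Zero-padding and key layout are easy to get subtly wrong across ingestion jobs (`month=3` vs `month=03` produce different partitions), so it helps to centralize the prefix construction. A small illustrative helper, not a library API:

```python
from datetime import date

def raw_partition_prefix(source: str, dataset: str, d: date) -> str:
    """Build a Hive-style raw-zone prefix (illustrative helper, not a library API).

    Zero-pads month and day so every job produces identical partition keys.
    """
    return (
        f"raw/{source}/{dataset}/"
        f"year={d.year:04d}/month={d.month:02d}/day={d.day:02d}/"
    )

print(raw_partition_prefix("stripe", "charges", date(2026, 3, 27)))
# raw/stripe/charges/year=2026/month=03/day=27/
```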
Data Catalog: Making the Lake Navigable
A data catalog registers table schemas, ownership, documentation, and lineage in a searchable interface. Without a catalog, the lake is only navigable by the people who put data into it.
AWS Glue Data Catalog is the native AWS option: it integrates directly with Athena, EMR, and Glue ETL jobs. It stores table schemas and partition metadata, enabling efficient query planning without scanning entire prefixes.
Apache Hive Metastore is the traditional option for Spark-based data lakes. Iceberg and Delta Lake both support Hive Metastore as a catalog backend.
For data discovery and documentation beyond raw schema metadata, tools like DataHub, OpenMetadata, or dbt docs provide richer search, lineage visualization, and business context. These run alongside the technical catalog (Glue/Hive) rather than replacing it.
The Lakehouse: Marrying Lake and Warehouse
The lakehouse architecture uses open table formats on object storage as the storage foundation, with one or more compute engines providing SQL analytics, ML workloads, and streaming processing from the same data.
# One dataset, multiple consumers
s3://data-lake/warehouse/analytics/fct_orders/
        ↓ Iceberg metadata layer
Trino     → SQL analytics, ad-hoc queries
Spark     → Batch ETL, ML feature engineering
Flink     → Streaming reads/writes
DuckDB    → Local development, small queries
Snowflake → External table (read via Iceberg REST catalog)
Athena    → Serverless SQL for occasional large scans
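Pointing an engine at the shared tables is typically a few lines of configuration. For Trino, a catalog file such as `etc/catalog/lake.properties` might look like the following sketch (the catalog file name and region are placeholders; the connector and catalog-type properties are Trino's Iceberg connector settings):

```properties
connector.name=iceberg
iceberg.catalog.type=glue
hive.metastore.glue.region=us-east-1
```

Each additional engine gets its own small configuration against the same catalog, which is what lets the diagram above stay a single copy of the data.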
The key properties of a well-functioning lakehouse: all tables use Iceberg or Delta, a central catalog provides discovery and schema management, compute engines are chosen based on workload characteristics (streaming vs. batch vs. ad-hoc), and the raw zone is kept immutable for auditability and reprocessing.
The lakehouse has largely replaced both the pure data warehouse (expensive, vendor-specific) and the pure data lake (cheap but unusable) as the architectural target for new data platform builds. The combination of open formats, flexible compute, and reasonable operational overhead makes it the right choice for most organizations building a new platform today.