Data Lake Architecture Pattern

Build scalable data lakes on AWS using S3, Glue, Lake Formation, and Athena for advanced analytics.


Overview

A data lake is a centralized repository that allows you to store all of your structured and unstructured data at any scale. Data can be stored as-is, without having to structure it first, and then analyzed with different types of tools, from interactive SQL queries to big data processing and machine learning.

Architecture

graph TB
    subgraph "Data Sources"
        S1[IoT Devices]
        S2[Applications]
        S3[Databases]
        S4[APIs]
        S5[Files]
    end
    subgraph "Ingestion Layer"
        K[Kinesis Firehose]
        DMS[Database Migration Service]
        TF[Transfer Family]
        DS[DataSync]
    end
    subgraph "Storage Layer - Data Lake"
        S3Raw[S3 Raw Data Zone]
        S3Clean[S3 Cleaned Data Zone]
        S3Curated[S3 Curated Data Zone]
    end
    subgraph "Processing Layer"
        Glue[AWS Glue ETL]
        EMR[EMR Cluster]
        Lambda[Lambda Functions]
    end
    subgraph "Catalog & Governance"
        GC[Glue Catalog]
        LF[Lake Formation]
    end
    subgraph "Analytics & ML"
        A[Athena]
        RS[Redshift Spectrum]
        SM[SageMaker]
        QS[QuickSight]
    end

    S1 --> K
    S2 --> K
    S3 --> DMS
    S4 --> Lambda
    S5 --> TF
    K --> S3Raw
    DMS --> S3Raw
    TF --> S3Raw
    DS --> S3Raw
    Lambda --> S3Raw
    S3Raw --> Glue
    Glue --> S3Clean
    S3Clean --> EMR
    EMR --> S3Curated
    S3Raw --> GC
    S3Clean --> GC
    S3Curated --> GC
    GC --> LF
    LF --> A
    LF --> RS
    LF --> SM
    A --> QS
    RS --> QS

    style S3Raw fill:#f9f
    style S3Clean fill:#9f9
    style S3Curated fill:#99f

Data Lake Zones

Raw Zone (Bronze)

The landing zone for all incoming data in its original format.

Characteristics:

  • Data stored as-is
  • No transformations applied
  • Historical archive
  • Immutable storage

AWS Services:

  • S3 with appropriate lifecycle policies
  • Versioning enabled
  • Server-side encryption
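
A minimal boto3 sketch of how the raw-zone settings above might be applied; the bucket name is hypothetical, and in practice this configuration would usually live in infrastructure-as-code rather than ad-hoc API calls:

# Sketch: baseline configuration for a hypothetical raw-zone bucket.
import boto3

s3 = boto3.client("s3")
RAW_BUCKET = "my-data-lake-raw"  # hypothetical bucket name

# Versioning protects the immutable raw archive against overwrites and deletes
s3.put_bucket_versioning(
    Bucket=RAW_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Default server-side encryption (SSE-S3) for all new objects
s3.put_bucket_encryption(
    Bucket=RAW_BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    },
)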

Cleaned Zone (Silver)

Data that has been validated, cleansed, and standardized.

Characteristics:

  • Schema validation
  • Data quality checks
  • Deduplication
  • Format standardization (Parquet, ORC)

Processing:

# Example AWS Glue job for data cleaning
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read from raw zone
raw_data = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="raw_table"
)

# Apply transformations
cleaned_data = raw_data.filter(
    f=lambda x: x["status"] == "active"
).drop_fields(['temp_field'])

# Write to cleaned zone
glueContext.write_dynamic_frame.from_options(
    frame=cleaned_data,
    connection_type="s3",
    connection_options={
        "path": "s3://my-bucket/cleaned/",
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet"
)

job.commit()

Curated Zone (Gold)

Business-ready data optimized for analytics and reporting.

Characteristics:

  • Aggregated data
  • Business logic applied
  • Optimized for query performance
  • Partitioned and indexed
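
As a sketch of how cleaned data might be promoted into this zone, the PySpark snippet below aggregates a hypothetical cleaned events dataset into a partitioned, Parquet-formatted curated table. The S3 paths, column names, and aggregation logic are assumptions, not a prescribed pipeline:

# Sketch: aggregate cleaned data into a business-ready curated dataset.
# Paths, layout, and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("build-curated-daily-metrics").getOrCreate()

cleaned = spark.read.parquet("s3://my-bucket/cleaned/events/")

# Apply business logic: daily event counts and unique users per event type
daily_metrics = (
    cleaned
    .groupBy("year", "month", "day", "event_type")
    .agg(
        F.count("*").alias("event_count"),
        F.countDistinct("user_id").alias("unique_users"),
    )
)

# Partitioned, columnar output keeps Athena / Redshift Spectrum scans cheap
(
    daily_metrics.write
    .mode("overwrite")
    .partitionBy("year", "month", "day")
    .parquet("s3://my-bucket/curated/daily_metrics/")
)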

Key Components

AWS Lake Formation

Lake Formation simplifies data lake setup and management.

Features:

| Feature | Description | Benefit |
| --- | --- | --- |
| Data Ingestion | Automated data loading | Reduced complexity |
| Security | Fine-grained access control | Enhanced security |
| Cataloging | Automated metadata management | Improved discoverability |
| Data Governance | Centralized policies | Compliance |
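
For example, fine-grained read access on a curated table can be granted through the Lake Formation API. A minimal sketch; the role ARN, database, and table names are placeholders:

# Sketch: grant read access on a curated table to an analyst role.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/data-analyst"
    },
    Resource={
        "Table": {
            "DatabaseName": "curated_db",
            "Name": "user_events",
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=[],
)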

AWS Glue

Serverless ETL service for data preparation.

// Example: Glue job configuration with CDK
import * as glue from 'aws-cdk-lib/aws-glue';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as iam from 'aws-cdk-lib/aws-iam';

const glueJob = new glue.CfnJob(this, 'DataCleaningJob', {
  name: 'data-cleaning-job',
  role: glueRole.roleArn,
  command: {
    name: 'glueetl',
    pythonVersion: '3',
    scriptLocation: 's3://my-bucket/scripts/clean_data.py'
  },
  defaultArguments: {
    '--job-language': 'python',
    '--job-bookmark-option': 'job-bookmark-enable',
    '--enable-metrics': 'true',
    '--enable-continuous-cloudwatch-log': 'true'
  },
  glueVersion: '4.0',
  maxRetries: 1,
  timeout: 60,
  workerType: 'G.1X',
  numberOfWorkers: 10
});

Data Ingestion Patterns

Batch Ingestion

sequenceDiagram
    participant Source
    participant S3
    participant Glue
    participant Catalog
    participant Athena

    Source->>S3: Upload batch files
    S3->>Glue: Trigger crawler
    Glue->>Catalog: Update schema
    Catalog-->>Athena: Schema available
    Athena->>S3: Query data
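
One common way to wire up the "trigger crawler" step is an S3 event notification that invokes a small Lambda function. A sketch, assuming a hypothetical crawler name and an ObjectCreated notification on the raw bucket:

# Sketch: Lambda handler that starts a Glue crawler when batch files land in S3.
import boto3

glue = boto3.client("glue")
CRAWLER_NAME = "raw-zone-crawler"  # hypothetical crawler name

def handler(event, context):
    # Log the objects that triggered this invocation
    for record in event.get("Records", []):
        print("New object landed:", record["s3"]["object"]["key"])

    try:
        glue.start_crawler(Name=CRAWLER_NAME)
    except glue.exceptions.CrawlerRunningException:
        # The crawler is already processing a previous batch; nothing to do.
        pass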

Real-time Ingestion

sequenceDiagram
    participant IoT
    participant Kinesis
    participant Firehose
    participant S3
    participant Lambda

    IoT->>Kinesis: Stream events
    Kinesis->>Firehose: Deliver stream
    Firehose->>S3: Write batches
    S3->>Lambda: Trigger processing
    Lambda->>S3: Write processed data
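
On the producer side, devices or applications push events into the stream and Firehose handles buffering and batch writes to the raw zone. A minimal sketch; the delivery stream name and payload are hypothetical:

# Sketch: send a JSON event to a Kinesis Data Firehose delivery stream
# that buffers and writes batches into the raw zone in S3.
import json
import boto3

firehose = boto3.client("firehose")

event = {"device_id": "sensor-42", "temperature": 21.7, "ts": "2025-01-01T12:00:00Z"}

firehose.put_record(
    DeliveryStreamName="iot-events-to-raw",  # hypothetical stream name
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)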

Query Patterns

Amazon Athena

-- Example: Analyzing user behavior
-- (assumes event_date is a partition column on curated.user_events,
--  so the filter also prunes partitions)
SELECT
    user_id,
    COUNT(*) as event_count,
    COUNT(DISTINCT session_id) as sessions,
    DATE_TRUNC('day', event_timestamp) as event_date
FROM
    curated.user_events
WHERE
    event_date >= DATE '2025-01-01'
    AND event_type = 'page_view'
GROUP BY
    user_id,
    DATE_TRUNC('day', event_timestamp)
HAVING
    COUNT(*) > 10
ORDER BY
    event_count DESC
LIMIT 100;
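
Queries like this can also be issued programmatically. A sketch using the Athena API from Python; the database name and results location are assumptions:

# Sketch: run an Athena query from Python and wait for it to finish.
import time
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM user_events WHERE event_date >= DATE '2025-01-01'",
    QueryExecutionContext={"Database": "curated"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows)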

Redshift Spectrum

-- Example: Joining data lake with warehouse
SELECT
    dw.customer_id,
    dw.customer_name,
    COUNT(DISTINCT dl.order_id) as total_orders,
    SUM(dl.order_amount) as total_revenue
FROM
    warehouse.customers dw
INNER JOIN
    spectrum.orders dl ON dw.customer_id = dl.customer_id
WHERE
    dl.order_date >= '2025-01-01'
GROUP BY
    dw.customer_id,
    dw.customer_name
ORDER BY
    total_revenue DESC;

Security Best Practices

Access Control Layers

graph TD
    A[User Request] --> B{IAM Authentication}
    B -->|Authenticated| C{Lake Formation Permissions}
    B -->|Failed| Z[Deny Access]
    C -->|Authorized| D{S3 Bucket Policy}
    C -->|Denied| Z
    D -->|Allowed| E{Resource Policy}
    D -->|Denied| Z
    E -->|Granted| F[Access Data]
    E -->|Denied| Z

    style B fill:#ffe0b2
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#e1bee7
    style F fill:#c5e1a5
    style Z fill:#ffcdd2

Encryption Strategy

  1. At Rest

    • S3 bucket encryption (SSE-S3, SSE-KMS)
    • RDS encryption
    • EBS encryption
  2. In Transit

    • TLS/SSL for all connections
    • VPC endpoints for private connectivity
    • Certificate management with ACM
  3. Application Level

    • Field-level encryption for sensitive data
    • Tokenization for PII
    • Key rotation policies
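
A sketch of how the at-rest and in-transit controls above could be applied to a lake bucket with boto3; the bucket name and KMS key ARN are placeholders:

# Sketch: enforce SSE-KMS at rest and TLS-only access in transit
# on a hypothetical data lake bucket.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-data-lake-curated"  # placeholder bucket name
KMS_KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/REPLACE-ME"  # placeholder

# At rest: default SSE-KMS, with S3 Bucket Keys to reduce KMS request costs
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": KMS_KEY_ARN,
                },
                "BucketKeyEnabled": True,
            }
        ]
    },
)

# In transit: deny any request that does not use TLS
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{BUCKET}", f"arn:aws:s3:::{BUCKET}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))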

Data Governance

Data Quality Framework

# AWS Glue Data Quality example
def validate_data_quality(df, rules):
    """Validate a DataFrame against a set of quality rules."""
    quality_checks = {
        'completeness': check_completeness(df, rules['required_fields']),
        'accuracy': check_accuracy(df, rules['validation_rules']),
        'consistency': check_consistency(df, rules['consistency_rules']),
        'timeliness': check_timeliness(df, rules['freshness_threshold'])
    }
    return quality_checks

def check_completeness(df, required_fields):
    """Check for null values in required fields."""
    results = {}
    total_count = df.count()
    for field in required_fields:
        null_count = df.filter(df[field].isNull()).count()
        completeness_rate = 1 - (null_count / total_count) if total_count else 0.0
        results[field] = {
            'completeness': completeness_rate,
            'passed': completeness_rate >= 0.95
        }
    return results

# check_accuracy, check_consistency, and check_timeliness follow the same
# pattern: evaluate each rule against the DataFrame and return pass/fail results.

Cost Optimization

Storage Tiering

| Zone | Storage Class | Retention | Cost/GB/Month |
| --- | --- | --- | --- |
| Raw | S3 Standard | 30 days | $0.023 |
| Raw Archive | S3 Glacier | 7 years | $0.004 |
| Cleaned | S3 Standard-IA | 90 days | $0.0125 |
| Curated | S3 Standard | Indefinite | $0.023 |

Prices are approximate list prices and vary by region; check current S3 pricing when planning capacity.
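
The tiering above maps directly onto S3 lifecycle rules. One possible reading of the table, sketched with boto3 and assuming a single hypothetical bucket with raw/ and cleaned/ prefixes (adjust to however your zones are actually laid out):

# Sketch: lifecycle rules approximating the tiering table above.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-to-glacier",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                # Keep raw data in S3 Standard for 30 days, then archive
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                # Expire after roughly 7 years of retention
                "Expiration": {"Days": 2555},
            },
            {
                "ID": "cleaned-tiering",
                "Filter": {"Prefix": "cleaned/"},
                "Status": "Enabled",
                # Move cleaned data to Standard-IA, then drop it once the
                # curated zone has superseded it (it can be rebuilt from raw)
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                "Expiration": {"Days": 90},
            },
        ]
    },
)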

Processing Optimization

  • Partitioning: Reduce data scanned
  • File Format: Use columnar formats (Parquet, ORC)
  • Compression: Enable compression (Snappy, Gzip)
  • Glue Jobs: Right-size worker types and counts
  • Athena: Use result reuse and query optimization

Monitoring and Observability

Key Metrics

graph LR
    A[CloudWatch Metrics] --> B[Data Ingestion Rate]
    A --> C[Query Performance]
    A --> D[ETL Job Duration]
    A --> E[Storage Usage]
    A --> F[Data Quality Score]
    B --> G[CloudWatch Dashboard]
    C --> G
    D --> G
    E --> G
    F --> G
    G --> H[SNS Alerts]
    G --> I[Lambda Actions]

    style G fill:#e3f2fd
    style H fill:#ffcdd2
    style I fill:#c5e1a5
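
Custom metrics such as a data quality score can be published alongside the built-in service metrics and alarmed on. A sketch; the namespace, metric name, threshold, and SNS topic ARN are all hypothetical:

# Sketch: publish a custom data-quality metric and alarm on it.
import boto3

cloudwatch = boto3.client("cloudwatch")

# Published at the end of an ETL run (e.g., from the data quality checks above)
cloudwatch.put_metric_data(
    Namespace="DataLake/Quality",
    MetricData=[
        {
            "MetricName": "CompletenessScore",
            "Dimensions": [{"Name": "Dataset", "Value": "user_events"}],
            "Value": 0.97,
            "Unit": "None",
        }
    ],
)

# Alert when the score drops below the agreed threshold
cloudwatch.put_metric_alarm(
    AlarmName="user-events-completeness-low",
    Namespace="DataLake/Quality",
    MetricName="CompletenessScore",
    Dimensions=[{"Name": "Dataset", "Value": "user_events"}],
    Statistic="Average",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.95,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-quality-alerts"],
)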

Use Cases

1. Customer 360 Analytics

Combine data from multiple sources to create a unified customer view.

2. IoT Data Processing

Collect, store, and analyze IoT sensor data at scale.

3. Log Analytics

Centralize and analyze application and infrastructure logs.

4. Machine Learning

Train ML models on large datasets with SageMaker.

5. Data Archival

Long-term storage of historical data for compliance.

Implementation Checklist

  • Define data lake zones and policies
  • Set up S3 buckets with encryption
  • Configure Lake Formation permissions
  • Create Glue crawlers and ETL jobs
  • Set up data quality checks
  • Configure monitoring and alerting
  • Implement data lifecycle policies
  • Document data catalog
  • Train users on query tools
  • Establish data governance processes

Last updated: Dec 15, 2025