
AWS Universal Architecture Design & Decision Framework

A Domain-Agnostic Reference Guide for Analyzing, Decomposing, and Implementing Any Cloud System on AWS


Executive Summary

This guide provides a universal, repeatable methodology for transforming any business requirement, technical problem, or system constraint into a correct, scalable, and cost-efficient AWS architecture. It is domain-agnostic and works for financial systems, consumer apps, data platforms, AI/ML systems, real-time workloads, batch processes, legacy migrations, and unknown system types.

The guide functions as:

  • A system design handbook for architects and engineers
  • An enterprise architecture playbook for organizations
  • A teaching and onboarding reference for cloud teams
  • A foundation for automating AWS architectural decisions

Section 1: Universal Requirement Decomposition Framework

Any architectural problem—regardless of domain—must be decomposed across 10 critical dimensions. This framework ensures no aspect of your system is overlooked.

1.1 Business Intent

Purpose: Understand the "why" before the "what."

Guiding Questions:

  • What is the core business value this system delivers?
  • What are the primary revenue drivers or cost reduction targets?
  • Who are the end users (internal employees, external customers, partners)?
  • What is the go-to-market timeline and business phase (MVP, scaling, mature)?
  • Are there regulatory, compliance, or governance mandates?
  • What are the competitive pressures or market differentiators?

Decision Heuristic: Business intent drives all architectural priorities. A fast market entry (MVP) justifies serverless and managed services over optimal scaling. A mission-critical financial system prioritizes reliability and auditability over speed-to-market.

Output: A single-paragraph "business thesis" that frames all downstream decisions.


1.2 User & System Actors

Purpose: Identify who interacts with the system and how.

Guiding Questions:

  • Who are the primary users? (internal ops, external customers, partners, machines)
  • How many concurrent users? (10, 10k, 1M+)
  • What is the geographic distribution? (single region, multi-region, global)
  • Are there bots, APIs, or automated integrators?
  • What are the access patterns? (browser, mobile, API, batch, event-driven)
  • Are there downstream systems that depend on this?

Decision Heuristic:

  • Large distributed user base → CDN, global load balancing, edge computing.
  • Batch/job consumers → Event-driven, Step Functions orchestration.
  • API integrators → API Gateway, rate limiting, schema validation.
  • Internal tools only → Simplified networking, reduced durability requirements.

Output: Actor matrix (type, count, geography, interaction pattern, SLA expectations).


1.3 Data Characteristics

Purpose: Classify the Volume, Velocity, Variety, and Sensitivity of data flowing through the system.

Volume

  • Gigabytes → RDS, DynamoDB, S3
  • Terabytes → Redshift, EMR, Athena on S3
  • Petabytes → Data lakes, distributed processing (Spark/Flink on EMR)

Velocity

  • Batch (hourly/daily) → Glue jobs, Lambda scheduled, batch processing
  • Near-real-time (seconds) → Kinesis, MSK, Lambda streaming
  • Real-time (milliseconds) → Kinesis with parallel consumers, DynamoDB Streams, SQS
  • Streaming (continuous) → Kinesis Data Analytics, Flink, Kafka

Variety

  • Structured (SQL schemas) → RDS, Aurora, DynamoDB
  • Semi-structured (JSON, Avro, Parquet) → S3 + Athena, Glue Data Catalog
  • Unstructured (images, video, logs) → S3, OpenSearch for text search
  • Mixed → Lake Formation for unified governance

Sensitivity (Data Classification)

  • Public → No encryption, standard S3 access
  • Internal → Standard encryption, IAM access control
  • Confidential → KMS encryption, VPC isolation, audit logging
  • PII/Regulated (HIPAA, PCI, GDPR) → Encryption, access auditing, data retention policies, DLP tools

Decision Heuristic:

  • High volume + high sensitivity → Use AWS Glue for data classification, Lake Formation for fine-grained access control.
  • High velocity + high variety → Kinesis or MSK for ingestion; Lambda + DynamoDB for processing.
  • Mixed sensitivity → Separate data layers by classification; encrypt all; audit all access.

Output: Data catalog (volume, velocity, variety, sensitivity, retention, lineage).


1.4 Workload Type

Purpose: Classify the synchronicity and scheduling of work.

Synchronous Workloads

  • User waits for response
  • Low latency required (milliseconds to seconds)
  • Examples: web requests, API calls, real-time queries
  • AWS Services: Lambda, API Gateway, ECS, RDS, DynamoDB, ElastiCache
  • Scaling: Auto-scale based on concurrent requests; cold starts matter

Asynchronous Workloads

  • User does not wait; work happens later
  • Moderate latency acceptable (seconds to hours)
  • Examples: email notifications, data processing, report generation
  • AWS Services: SQS, SNS, EventBridge, Step Functions, Glue, Batch, EMR
  • Scaling: Auto-scale workers; decouple producers from consumers

Batch Workloads

  • Large volumes of data processed at scheduled intervals
  • Latency can be hours or days
  • Examples: ETL, data analytics, ML training, backups
  • AWS Services: Glue, Batch, EMR, Lambda scheduled, Redshift, SageMaker
  • Scaling: Fixed or dynamic parallelization; cost-optimized

Streaming Workloads

  • Continuous or near-continuous data flow
  • Low latency per event (milliseconds to seconds)
  • Examples: sensor data, clickstreams, logs, financial ticks
  • AWS Services: Kinesis, MSK, Lambda, DynamoDB Streams, EventBridge
  • Scaling: Partition by shard; auto-scale shards

Decision Heuristic:

  • Synchronous + low latency → Lambda (if occasional cold starts of roughly 100ms-1s are acceptable) or ECS (predictable, warm latency).
  • Synchronous + consistent load → ECS, EC2 (warm instances).
  • Asynchronous + bursty → SQS with Lambda, EventBridge rules.
  • Batch + scheduled → Glue, Batch, or Lambda scheduled.
  • Streaming + high throughput → Kinesis or MSK with parallel consumers.

Output: Workload classification (sync/async, latency target, throughput, scheduling).


1.5 Traffic & Scale Patterns

Purpose: Understand load profile, growth trajectory, and burst capacity.

Guiding Questions:

  • What is baseline traffic? (requests/sec, MB/sec)
  • What is peak traffic? (seasonal, event-driven, predictable or not)
  • What is growth rate? (stable, linear, exponential)
  • Are there time-zone or geographic spikes?
  • What is acceptable latency during peak load?
  • Can your system gracefully degrade, or must it handle all traffic?

Decision Heuristic:

  • Stable, predictable load → Reserved Instances, provisioned capacity (RDS, Redshift, Kinesis), Savings Plans.
  • Bursty, unpredictable load → Serverless (Lambda, Fargate), on-demand (DynamoDB), auto-scaling.
  • Rapid growth (0 → scale) → Serverless initially; migrate to provisioned as load stabilizes.
  • Time-zone dependent → Multi-region or scheduled scaling.
  • Graceful degradation required → Circuit breakers, load shedding, read replicas for read-heavy workloads.

Output: Traffic profile (baseline, peak, growth, patterns, cost/performance targets).


1.6 Availability & Durability Targets

Purpose: Define Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each component.

RTO = How long can the system be down before it impacts business?

  • Critical (< 1 hour) → Multi-AZ, multi-region failover, hot standby
  • High (1-4 hours) → Multi-AZ failover, automated recovery
  • Medium (4-24 hours) → Single AZ with automated backups, manual recovery
  • Low (> 24 hours) → Manual recovery acceptable; development/test environments

RPO = How much data loss is acceptable?

  • Critical (zero data loss) → Synchronous replication, read replicas, event sourcing
  • High (minutes) → Asynchronous replication, hourly backups
  • Medium (hours) → Daily snapshots
  • Low (days) → Weekly or monthly backups

Decision Heuristic:

  • RTO < 1 hour, RPO zero → Multi-AZ Aurora with read replicas, DynamoDB global tables, cross-region failover.
  • RTO 1-4 hours, RPO hours → Multi-AZ RDS with automated backups, S3 cross-region replication.
  • RTO > 1 day, RPO > 1 day → Single AZ with backups; manual recovery acceptable.

AWS Service Mapping:

  • High availability: Aurora, DynamoDB, Lambda, ALB/NLB
  • Disaster recovery: S3 cross-region replication, RDS backups, Backup service, Route 53 health checks
  • Durability: S3 (11 nines), EBS snapshots, Data Lifecycle Manager

Output: Availability matrix (component, RTO, RPO, strategy, cost implications).


1.7 Security & Compliance Needs

Purpose: Identify confidentiality, integrity, availability requirements and regulatory constraints.

Guiding Questions:

  • What data is handled? (PII, PHI, financial, trade secrets)
  • What are regulatory requirements? (HIPAA, PCI-DSS, GDPR, SOX, CCPA)
  • Must data stay in-region? (data residency, data sovereignty)
  • Who can access what data? (role-based, attribute-based)
  • Is encryption required? (in-transit, at-rest, key management)
  • Are audit logs and compliance reports required?
  • Are there penetration testing or security assessment requirements?

Decision Heuristic:

  • PII/GDPR → Encryption (KMS), access auditing, data classification, DLP (Macie), deletion policies
  • HIPAA/PHI → Encryption, audit logging, VPC isolation, Business Associates Agreement (BAA)
  • PCI-DSS → Network isolation, encryption, access logging, vulnerability scanning (Inspector)
  • Financial/SOX → Immutable audit logs (CloudTrail), change controls, segregation of duties
  • Data residency required → Single-region architecture, explicit region selection
  • High-security environment → VPC isolation, GuardDuty, WAF, Network Firewall

Output: Security posture (data classification, regulatory drivers, encryption, access control, audit requirements).


1.8 Cost Sensitivity

Purpose: Define cost constraints and optimization priorities.

Guiding Questions:

  • Is this a cost-driven or feature-driven project?
  • What is the acceptable cost range per month, per user, per transaction?
  • Are there cost allocation/showback requirements?
  • Can you commit to reserved capacity, or is on-demand necessary?
  • Is CapEx or OpEx preferred?
  • What cost-optimization tools (CloudWatch, Cost Explorer, Trusted Advisor) are available?

Decision Heuristic:

  • Startup/MVP → On-demand, serverless, minimize fixed costs.
  • Scaling (stable load) → Mix reserved + on-demand; commit 70% of baseline traffic.
  • Cost-sensitive operations → Reserved Instances (up to 75% off), Savings Plans (65% off, more flexible), spot instances (up to 90% off, interruptible workloads).
  • Variable workloads → Serverless (Lambda, Fargate, Athena), DynamoDB on-demand.
  • Long-term projects (3+ years) → 3-year reserved instances provide best ROI.

Output: Cost model (budget, cost per unit, commitment level, optimization targets).


1.9 Operational Complexity

Purpose: Assess team capability and operational overhead tolerance.

Guiding Questions:

  • What is the team's cloud maturity? (beginners, intermediate, advanced)
  • Do you have DevOps, SRE, or platform engineering capabilities?
  • Can the team manage Kubernetes, on-premises infrastructure, or complex integrations?
  • What is the tolerance for manual operational tasks?
  • Are there compliance/audit requirements for infrastructure change management?

Decision Heuristic:

  • Small team, low maturity → Serverless (Lambda, Fargate), managed services (RDS, OpenSearch), minimal custom code.
  • Intermediate maturity → ECS with CloudFormation, RDS Aurora, Glue data pipelines.
  • Advanced maturity + complex requirements → EKS, self-managed databases, custom orchestration.
  • Compliance-heavy environments → Infrastructure as Code (Terraform, CDK), automated deployments, audit trails.

Output: Operational profile (team size, skills, tolerance for complexity, tooling requirements).


1.10 Change Frequency & Extensibility

Purpose: Design for evolution and minimize rework.

Guiding Questions:

  • How often do requirements change? (weekly, monthly, quarterly)
  • Will the system need to integrate with new services or data sources?
  • Is the architecture greenfield (new) or brownfield (existing)?
  • How modular and decoupled should components be?
  • What is the acceptable technical debt?

Decision Heuristic:

  • High change frequency → Microservices, event-driven, loose coupling; use Step Functions, EventBridge for orchestration.
  • Stable requirements → Monolith is acceptable; focus on reliability and scaling.
  • Must integrate with many systems → Event-driven, API-first, API Gateway for versioning.
  • Greenfield + flexible → Start with serverless; migrate to provisioned services as load/patterns stabilize.
  • Brownfield + legacy → Strangler Fig pattern; migrate incrementally; keep old and new running in parallel.

Output: Architecture roadmap (current state, evolution path, integration points, refactoring windows).


Section 2: Workload Classification Engine

Every real-world problem maps to one or more of these generic workload archetypes. Understanding which archetype(s) your system belongs to unlocks the correct service selection.

2.1 Request/Response Systems (Synchronous)

Characteristics:

  • User or system waits for response
  • Latency-sensitive (< 1 second to minutes)
  • Throughput variable or bursty
  • State management required (sessions, user context)

Real-World Examples:

  • Web applications, APIs, mobile backends
  • Search engines, recommendation systems
  • Checkout flows, payment processing
  • Real-time dashboards, monitoring systems

Core AWS Services:

  • Compute: Lambda (cold start sensitive), ECS/Fargate (warm, predictable), EC2 (control, long-running)
  • Network: API Gateway (REST/WebSocket), ALB (internal routing), CloudFront (caching)
  • Database: RDS/Aurora (ACID, sessions), DynamoDB (key-value, sessions), ElastiCache (ephemeral cache)
  • Observability: CloudWatch, X-Ray (trace requests end-to-end)

Architecture Pattern:

Client → CloudFront (cache) → API Gateway → Lambda/ECS/EC2 → RDS/DynamoDB/Cache
         ↓                                                          ↓
         Static assets (S3)                                   Logs to CloudWatch

Key Design Decisions:

  1. Compute choice:

    • Lambda: Best for I/O-bound, variable load; worst for long-running or CPU-intensive work
    • ECS: Best for containerized, moderate load; good cold-start latency
    • EC2: Best for high CPU, long-running, or specialized requirements
  2. Caching strategy:

    • CloudFront: Cache static assets and cacheable API responses globally
    • ElastiCache: Cache database queries, sessions, expensive computations
  3. Database choice:

    • RDS/Aurora: Relational data, complex queries, ACID guarantees
    • DynamoDB: Key-value lookups, horizontal scaling, variable load
    • Hybrid: Both (RDS for transactions, DynamoDB for sessions/cache)
  4. Load balancing:

    • ALB (Layer 7): Route based on path, hostname, headers (microservices)
    • NLB (Layer 4): Route based on IP protocol, port; extreme throughput/low latency
    • API Gateway: REST/GraphQL APIs, rate limiting, schema validation
  5. Scaling and resilience:

    • Auto Scaling Groups (ECS, EC2) for gradual scaling
    • Lambda: Automatic, scales to 1000s of concurrent requests
    • Circuit breakers (Step Functions) to prevent cascading failures
    • Read replicas (RDS, DynamoDB) for read-heavy workloads
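
Putting these decisions together, the sketch below shows one minimal way to implement the pattern with API Gateway (proxy integration), Lambda, and DynamoDB. The table name, key schema, and route are hypothetical, and error handling is reduced to the essentials.

import json
import os
import boto3

# Hypothetical sessions table, keyed on "session_id", supplied via environment variable
TABLE_NAME = os.environ.get("SESSIONS_TABLE", "sessions")
table = boto3.resource("dynamodb").Table(TABLE_NAME)

def handler(event, context):
    """API Gateway (proxy integration) -> Lambda -> DynamoDB key-value lookup."""
    session_id = (event.get("pathParameters") or {}).get("id")
    if not session_id:
        return {"statusCode": 400, "body": json.dumps({"error": "missing id"})}

    # Single-item lookup; DynamoDB scales horizontally with request volume
    result = table.get_item(Key={"session_id": session_id})
    item = result.get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}

    return {"statusCode": 200, "body": json.dumps(item, default=str)}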

Well-Architected Alignment:

  • Operational Excellence: Observability (CloudWatch metrics, logs, X-Ray traces); automated deployments (CodeDeploy)
  • Security: IAM roles, VPC security groups, encryption in transit (HTTPS), KMS at-rest
  • Reliability: Multi-AZ deployment, auto-scaling, health checks, failover
  • Performance: Caching (CloudFront, ElastiCache), connection pooling, query optimization
  • Cost Optimization: Right-size instances, use Savings Plans, cache frequently accessed data
  • Sustainability: Use managed services to avoid idle servers; consolidate workloads

Cost Model:

  • Lambda: $0.0000002 per request + $0.0000166667 per GB-second (pay for actual use)
  • ECS Fargate: $0.04695 per vCPU-hour + $0.00519 per GB-hour (pay for provisioned capacity)
  • RDS: ~$0.017/hour (db.t3.micro) to $10+/hour (large instances) + storage + backups
  • Break-even: Lambda vs. ECS ≈ 10-100 requests/sec depending on execution time; ECS cheaper above this
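
To make the break-even intuition concrete, here is a rough back-of-the-envelope comparison. It is a sketch only: it assumes a 128 ms average duration, 1 GB of memory, an always-on 1 vCPU / 2 GB Fargate task, and the list prices quoted above; real break-even points depend heavily on duration, memory, and utilization.

# Rough Lambda vs. always-on Fargate monthly cost comparison (illustrative only)
SECONDS_PER_MONTH = 730 * 3600

def lambda_monthly_cost(req_per_sec, duration_s=0.128, memory_gb=1.0):
    requests = req_per_sec * SECONDS_PER_MONTH
    request_cost = requests * 0.0000002                               # $0.20 per 1M requests
    compute_cost = requests * duration_s * memory_gb * 0.0000166667   # per GB-second
    return request_cost + compute_cost

def fargate_monthly_cost(vcpu=1.0, memory_gb=2.0):
    return (vcpu * 0.04695 + memory_gb * 0.00519) * 730               # always-on task

for rps in (1, 10, 50, 100):
    print(rps, round(lambda_monthly_cost(rps), 2), round(fargate_monthly_cost(), 2))

At these assumptions Lambda overtakes the always-on Fargate task somewhere around 10 requests/sec, consistent with the range above.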

2.2 Event-Driven Systems (Asynchronous, Decoupled)

Characteristics:

  • Components communicate via events, not direct calls
  • Loose coupling; easy to add consumers
  • No waiting for response
  • Throughput and latency can vary independently

Real-World Examples:

  • Order processing (order → payment → fulfillment → notification)
  • Data pipeline orchestration (ingest → transform → load → analyze)
  • Multi-tenant SaaS platforms (user action → event → multiple subscribers)
  • IoT systems (sensor → ingestion → storage → analytics → alerting)

Core AWS Services:

  • Event sources: S3, RDS, EventBridge, Kinesis, SNS, SQS, Lambda, DynamoDB Streams
  • Event routers: EventBridge (pub/sub with routing), SNS (fanout), SQS (queue), Kinesis (streaming)
  • Event processors: Lambda, Step Functions, ECS, Fargate, Batch
  • Event storage: DynamoDB, S3, EventBridge Archive

Architecture Pattern:

Source (S3, API, DDB) → EventBridge/SNS/SQS → Lambda/ECS/Batch → Target (DB, S3, Email)
                                          ↓
                              Multiple independent consumers
                                    (fanout)

Key Design Decisions:

  1. Event source:

    • EventBridge: Most flexible; supports 100+ AWS services and custom apps; best for rule-based routing
    • SNS: Simple pub/sub; less configuration; no replay; good for fanout
    • SQS: Queue-based; replay via DLQ; guarantees delivery; good for buffering
    • Kinesis: Streaming; ordered; replay; good for real-time analytics
    • DynamoDB Streams: Change data capture; triggers Lambda; good for change-driven workflows
  2. Fanout vs. queue:

    • Fanout (SNS, EventBridge): One event → multiple independent consumers; each gets a copy
    • Queue (SQS): One event → one consumer (processes and deletes); good for load leveling
  3. Error handling:

    • Idempotence: Events may be delivered multiple times; ensure consumers can handle duplicates
    • Dead-letter queues (DLQ): SQS/SNS DLQs capture failed messages for replay/inspection
    • Retry policies: Exponential backoff; max retries; circuit breaker pattern
  4. Orchestration vs. choreography:

    • Choreography: Each service subscribes to events and emits its own (decoupled, but implicit flows)
    • Orchestration: Central coordinator (Step Functions) directs workflow; explicit, auditable
  5. Consistency model:

    • Eventual consistency: Events propagate asynchronously; ordering may not be guaranteed
    • At-least-once delivery: Event may be delivered multiple times; idempotence required
    • Exactly-once semantics (hard): Use idempotent IDs + database writes with deduplication
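
As an illustration of at-least-once delivery paired with an idempotent consumer, the sketch below uses a DynamoDB conditional write as a deduplication guard. The table name, the producer-supplied event_id field, and the SQS trigger are assumptions for the example.

import json
import boto3
from botocore.exceptions import ClientError

# Hypothetical deduplication table with primary key "event_id"
dedup_table = boto3.resource("dynamodb").Table("processed-events")

def handler(event, context):
    """SQS-triggered Lambda that applies each message's side effects at most once."""
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        event_id = body["event_id"]          # producer-supplied idempotency key
        try:
            # Conditional write fails if this event_id was already recorded
            dedup_table.put_item(
                Item={"event_id": event_id},
                ConditionExpression="attribute_not_exists(event_id)",
            )
        except ClientError as err:
            if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                continue                      # duplicate delivery; skip side effects
            raise                             # other failures surface for retry / DLQ

        process(body)                         # business logic placeholder

def process(body):
    print("processing", body)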

Well-Architected Alignment:

  • Operational Excellence: Event tracing (CloudWatch logs, X-Ray), dead-letter queues for debugging, alerts
  • Security: IAM policies for publish/subscribe, encryption at-rest (SNS, SQS, EventBridge)
  • Reliability: Durable queues (SQS, EventBridge); retries; DLQs for failure handling; multi-AZ
  • Performance: EventBridge scales to very high event rates (default throughput quotas can be raised); Kinesis scales by shard; fanout is near-instant
  • Cost Optimization: Pay only for events processed; SQS cheaper than Kinesis for non-streaming; use Kinesis on-demand
  • Sustainability: Event-driven can consolidate workloads; avoid polling

Cost Model:

  • EventBridge: $1.00 per 1M custom events published
  • SNS: $0.50 per 1M publishes
  • SQS: $0.40 per 1M requests (includes sends, receives, deletes)
  • Kinesis on-demand: $0.08 per GB ingested plus a small per-stream-hour charge
  • Kinesis provisioned: $0.015 per shard-hour (~$0.36 per shard-day)

2.3 Stream Processing (Continuous, Low-Latency Analytics)

Characteristics:

  • Data arrives continuously at high velocity
  • Process and react within seconds or milliseconds
  • Maintain windowed state (e.g., moving averages)
  • Throughput scales by partitioning

Real-World Examples:

  • Real-time fraud detection, financial trading
  • Clickstream analysis, A/B testing
  • Sensor monitoring, anomaly detection
  • Log processing, security analytics

Core AWS Services:

  • Ingestion: Kinesis Data Streams, MSK (Kafka), EventBridge
  • Processing: Kinesis Data Analytics (SQL), Lambda (event-by-event), Flink/Spark on EMR, Kafka Streams
  • State storage: DynamoDB, ElastiCache, Kinesis (windowing)
  • Output: DynamoDB, RDS, S3, Redshift, OpenSearch, Lambda targets

Architecture Pattern:

Source (API, sensor) → Kinesis/MSK → Analytics/Processing → DynamoDB/S3/Redshift
                                            ↓
                                    Real-time alerts/dashboards

Key Design Decisions:

  1. Ingestion service:

    • Kinesis: AWS-native, scales by shard, low-level control, good for AWS ecosystem
    • MSK (Kafka): Open-source, more operational burden, richer ecosystem, good for multi-cloud
    • EventBridge: Rule-based routing; not ideal for high-throughput streaming; best for low-frequency events
  2. Processing framework:

    • Kinesis Data Analytics: SQL-based, managed, good for windowing and aggregations
    • Lambda: Simple transformation; event-by-event processing; cold start may impact throughput
    • Flink/Spark on EMR: Complex stateful processing, machine learning, but requires cluster management
  3. Partitioning strategy:

    • Partition by customer ID, user ID, or geographic region to parallelize processing
    • Avoid hot partitions (one shard falling behind)
    • Auto-scale shards based on throughput metrics
  4. Windowing and state:

    • Tumbling windows: Fixed intervals (1-minute aggregates)
    • Sliding windows: Overlapping intervals (10-second windows, every 1 second)
    • Session windows: Grouped by inactivity (user session breaks when idle > 30 min)
    • Use DynamoDB or ElastiCache for windowed state
  5. Latency vs. throughput:

    • Low latency (< 1 second): Lambda + Kinesis, minimal batching
    • Higher throughput (100k+/sec): Flink, larger batches, higher latency acceptable
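
The sketch below illustrates one of these options: a Kinesis-triggered Lambda that aggregates each batch into one-minute tumbling windows stored in DynamoDB. Table and field names are hypothetical, and retried batches would need additional idempotence handling in a real deployment.

import base64
import json
from collections import Counter
import boto3

# Hypothetical table keyed on (entity_id, window_start) holding 1-minute tumbling windows
counts_table = boto3.resource("dynamodb").Table("minute-counts")

def handler(event, context):
    """Kinesis-triggered Lambda: per-batch aggregation into tumbling windows."""
    counts = Counter()
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Bucket each event into the minute it arrived in
        window_start = (int(record["kinesis"]["approximateArrivalTimestamp"]) // 60) * 60
        counts[(payload["entity_id"], window_start)] += 1

    for (entity_id, window_start), n in counts.items():
        # Atomic counter update; a retried batch would double-count without extra guards
        counts_table.update_item(
            Key={"entity_id": entity_id, "window_start": window_start},
            UpdateExpression="ADD event_count :n",
            ExpressionAttributeValues={":n": n},
        )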

Well-Architected Alignment:

  • Operational Excellence: Monitoring lag (Kinesis iterator age), consumer latency, throughput metrics
  • Security: Encryption (TLS in-transit, KMS at-rest), IAM for producer/consumer, audit logging
  • Reliability: Auto-scaling shards; replay from the stream (Kinesis default retention is 24 hours, extendable to 365 days; archive to S3 for longer-term replay)
  • Performance: Partition hot spots avoided, optimal shard count, batch size tuning
  • Cost Optimization: On-demand Kinesis for variable workloads; reserved for baseline; consider Kafka if high volume
  • Sustainability: Consolidate stream processing on shared clusters; avoid idle shards

Cost Model:

  • Kinesis provisioned: $0.015 per shard-hour (each shard supports roughly 1 MB/sec ingest and 2 MB/sec read)
  • Kinesis on-demand: $0.08 per GB ingested plus a per-stream-hour charge (no upfront commitment)
  • Flink on EMR: EMR per-instance service fee (~$0.07–$0.10 per instance-hour) + underlying EC2 compute costs
  • Lambda stream consumers: $0.0000002 per invocation + GB-second of compute; concurrency scales with the number of shards

2.4 Batch Processing (Scheduled, Offline, High-Volume)

Characteristics:

  • Large volumes of data processed at scheduled intervals
  • Latency acceptable (hours, days)
  • Cost-optimized (spot instances, off-peak scheduling)
  • Often part of data pipelines

Real-World Examples:

  • ETL jobs (extract, transform, load)
  • Daily/weekly analytics, reporting
  • ML model training, batch scoring
  • Data cleanup, archival, backups
  • Rendering, video transcoding

Core AWS Services:

  • Scheduling/Orchestration: Step Functions, Lambda scheduled, EventBridge rules, Glue workflows
  • Compute: Batch, Glue, EMR, Lambda (small jobs), SageMaker (ML)
  • Storage: S3 (input/output), RDS (data source), Redshift (output)
  • Distributed Processing: Spark (EMR), Flink (EMR), Hadoop (EMR), Glue (Spark-based)

Architecture Pattern:

S3/RDS (source) → Glue/Batch/EMR (transform) → S3/Redshift/RDS (output) → Analytics/Reporting
                                   ↓
                         Scheduled by Step Functions
                         Failure handling via DLQ/SNS

Key Design Decisions:

  1. Compute platform:

    • Glue: AWS-native, serverless, pay-per-DPU-second, good for AWS-centric pipelines
    • Batch: Managed job queue, auto-scaling EC2, good for large compute jobs, spot instances
    • EMR: Full Hadoop/Spark cluster, complex transformations, good for data science
    • Lambda: Small, short-lived jobs; minimal overhead; limited timeout (15 min)
  2. Scheduling:

    • Step Functions: Visual workflows, error handling, automatic retries, good for multi-step pipelines
    • EventBridge rules: Cron expressions, simple triggering, good for scheduled tasks
    • Glue workflows: Built-in job orchestration, trigger on job completion
    • Apache Airflow (MWAA): Complex DAGs, good for data engineers
  3. Data partitioning:

    • Partition input data by date, customer, or region for parallel processing
    • Output partitioned for efficient querying (Athena, Redshift spectrum)
    • Use S3 object keys intelligently (e.g., s3://bucket/year/month/day/hour/)
  4. Cost optimization:

    • Use spot instances on Batch/EMR (up to 70% savings); interruptions acceptable
    • Schedule batch jobs in off-peak hours (e.g., 2-6 AM)
    • Delete intermediate data; keep only final output
    • Use Glue over EMR for small/medium jobs (cheaper); EMR for complex workloads
  5. Failure handling:

    • Retry logic in Step Functions (exponential backoff, max retries)
    • Dead-letter queues for failed jobs
    • SNS notifications on failure
    • Idempotent job design (can re-run without side effects)
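
As a small example, a (hypothetical) nightly Glue job can be started and monitored with a few boto3 calls; passing the date partition as a job argument keeps re-runs idempotent. In a real pipeline the guide's own recommendation is Step Functions rather than this kind of polling loop.

import time
import boto3

glue = boto3.client("glue")

def run_glue_job(job_name, run_date):
    """Start a hypothetical Glue ETL job for one date partition and wait for completion."""
    run = glue.start_job_run(
        JobName=job_name,
        Arguments={"--run_date": run_date},   # partition argument makes re-runs idempotent
    )
    run_id = run["JobRunId"]

    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED"):
            return state
        time.sleep(30)                        # prefer Step Functions over polling in production

print(run_glue_job("nightly-etl", "2025/01/15"))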

Well-Architected Alignment:

  • Operational Excellence: Job monitoring, retry policies, automated error handling, alerts
  • Security: Data encryption (S3, KMS), IAM roles for job execution, audit logs
  • Reliability: Multi-step workflows with automatic retries, DLQs, checkpointing for restartable jobs
  • Performance: Parallel processing, data partitioning, right-size instances
  • Cost Optimization: Spot instances, off-peak scheduling, delete interim data, managed services (Glue vs. EMR)
  • Sustainability: Batch when possible (consolidate compute); avoid idle clusters; use managed services

Cost Model:

  • Glue: $0.44 per DPU-hour; Spark jobs require a minimum of 2 DPUs (each DPU = 4 vCPU, 16 GB memory)
  • Batch: EC2 spot instances (70% off on-demand) + networking; pay only during execution
  • EMR: $0.07–$0.10 per instance-hour (depends on instance type) + spot discounts
  • Lambda: $0.0000002 per request + $0.0000166667 per GB-second (cheap for small jobs; inefficient for large workloads)

2.5 Long-Running Workflows (State Machines, Sagas)

Characteristics:

  • Multi-step processes with conditional logic and retries
  • Steps may take minutes, hours, or days
  • State must be persisted; can be paused/resumed
  • Distributed, multi-service coordination

Real-World Examples:

  • Loan approval workflows (underwriting → approval → funding)
  • Supply chain tracking (order → warehouse → shipment → delivery)
  • ML training pipelines (data prep → training → evaluation → deployment)
  • Multi-step approval processes (draft → review → approve → execute)

Core AWS Services:

  • Orchestration: Step Functions (standard or express), Apache Airflow (MWAA)
  • Persistence: DynamoDB (state), SQS/SNS (communication), S3 (large payloads)
  • Compute: Lambda, ECS, Batch (individual steps)
  • Monitoring: CloudWatch, X-Ray (trace execution paths)

Architecture Pattern:

Trigger (API, S3, EventBridge) → Step Functions State Machine
                                  ├── Step 1: Validate (Lambda)
                                  ├── Step 2: Process (ECS/Batch)
                                  ├── Step 3: Notify (SNS)
                                  └── Error handling: Retry, DLQ, Alert

Key Design Decisions:

  1. Step Functions flavor:

    • Standard: ≤ 1 year duration, exactly-once workflow execution, good for long-lived business workflows
    • Express: ≤ 5 minutes, at-least-once execution, high throughput, good for high-volume event processing
  2. Saga pattern (distributed transactions):

    • Choreography: Each step emits events; next steps listen (decoupled but implicit)
    • Orchestration: Central coordinator (Step Functions) directs all steps (explicit, auditable, easier to debug)
    • Compensation: On failure, "undo" previous steps (e.g., reverse payment if fulfillment fails)
  3. Wait strategies:

    • Callback pattern: Step Functions pauses on a task token until the external system calls back (SendTaskSuccess/SendTaskFailure); the token is typically stashed in DynamoDB or passed along via EventBridge
    • Polling: Step Functions checks status periodically (less efficient)
    • Event-driven: External system sends event (SNS/SQS) to notify completion
  4. Error handling:

    • Automatic retries (exponential backoff, max attempts)
    • Catch and handle specific errors (e.g., timeout → default value; validation failure → alert)
    • Dead-letter queues for unrecoverable failures
    • Manual intervention path if needed
  5. State management:

    • Use DynamoDB to persist workflow state
    • Store large payloads in S3; reference by key
    • Work around the Step Functions payload size limit (256 KB per state input/output) by storing large payloads in S3 and passing references
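
A minimal orchestration-style saga can be expressed directly in Amazon States Language; the sketch below builds the definition as a Python dict and registers it with boto3. All ARNs, function names, and the retry/compensation policy are placeholders, not a prescribed workflow.

import json
import boto3

# Two business steps plus a compensation step ("undo" the charge on failure)
definition = {
    "StartAt": "ChargePayment",
    "States": {
        "ChargePayment": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-payment",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                       "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2.0}],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "RefundPayment"}],
            "Next": "FulfillOrder",
        },
        "FulfillOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:fulfill-order",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "RefundPayment"}],
            "End": True,
        },
        "RefundPayment": {  # compensation step
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:refund-payment",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="order-saga",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/order-saga-role",  # placeholder execution role
)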

Well-Architected Alignment:

  • Operational Excellence: Visual workflow monitoring, CloudWatch logs for each step, alerts on failure
  • Security: IAM roles per step (least privilege), encryption of state data
  • Reliability: Automatic retries, compensation logic, no data loss (persisted state)
  • Performance: Parallel step execution, avoid blocking waits
  • Cost Optimization: Pay per state transition; consolidate steps where possible; use spot instances in Batch steps
  • Sustainability: Avoid polling; use event-driven waits

Cost Model:

  • Step Functions Standard: $0.000025 per state transition ($25 per 1M transitions)
  • Step Functions Express: $0.000001667 per invocation + GB-second (useful for high-volume, short workflows)
  • DynamoDB for state: on-demand pricing is roughly $1.25 per 1M write request units and $0.25 per 1M read request units

2.6 Data-Intensive Platforms (Data Lakes, Warehouses, Analytics)

Characteristics:

  • Large volume of diverse data sources
  • Multiple consumers (analytics, ML, real-time queries)
  • Governed access; compliance and lineage important
  • Optimized for both ingestion and querying

Real-World Examples:

  • Data lakes (centralized repository for all data)
  • Data warehouses (curated, optimized for BI queries)
  • Data mesh (domain-oriented, decentralized ownership)
  • ML feature stores

Core AWS Services:

  • Ingestion: Glue, Kinesis, MSK, DMS (database migration)
  • Storage: S3 (bronze/silver/gold layers), Redshift (warehouse)
  • Processing: Glue (ETL), EMR (Spark/Flink), Athena (SQL on S3)
  • Governance: Lake Formation (access control), Glue Data Catalog (metadata)
  • Analytics: Redshift, Athena, QuickSight, SageMaker

Architecture Pattern (Modern Data Lakehouse):

Raw Data (S3 Bronze)
    ↓
Glue/EMR (Transform)
    ↓
Curated Data (S3 Silver/Gold)
    ↓
↙           ↓           ↘
Redshift   Athena      OpenSearch
(OLAP)     (Ad-hoc)    (Search)
    ↓           ↓           ↓
BI Tools     Data Apps    Search Apps
(QuickSight)  (Jupyter)    (Kibana)

Key Design Decisions:

  1. Data organization (medallion architecture):

    • Bronze: Raw data as-is (from source systems)
    • Silver: Cleaned, validated, deduplicated; schema applied
    • Gold: Business-ready; aggregated; optimized for use cases
    • Use S3 partitioning intelligently; Glue crawlers auto-discover
  2. Ingestion pattern:

    • Batch: Scheduled Glue jobs; good for daily or less frequent loads
    • Streaming: Kinesis → Glue streaming job → S3 (continuous updates)
    • Change Data Capture (CDC): DMS → S3 (real-time sync from source databases)
  3. Governance and security:

    • Lake Formation: Centralized permissions; grant access to databases/tables; tag-based controls
    • Data Catalog: Glue Data Catalog for metadata; search/lineage
    • Encryption: S3-SSE, KMS for sensitive data; Glue connection encryption for DB credentials
    • Compliance: S3 versioning, Object Lock for immutability; access logging
  4. Query optimization:

    • Redshift: Fast queries; supports complex joins; good for BI teams; requires provisioning
    • Athena: Serverless SQL on S3; slower but no provisioning; good for ad-hoc queries
    • Partitioning: Partition by date, customer, region; Athena scans only needed partitions (cost ↓)
    • File format: Parquet (columnar) better than CSV; Glue auto-converts
  5. Cost optimization:

    • Use Athena on-demand for variable queries; no idle infrastructure
    • Use Redshift Spectrum to query S3 without loading (data stays in S3)
    • Partition and compress; use columnar formats (Parquet, ORC)
    • Delete old data (TTL, Intelligent-Tiering); archive to Glacier if needed
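
As an example of how partition pruning keeps Athena costs down, the query below touches only a single day of a hypothetical partitioned Parquet table (database, table, and results bucket are illustrative):

import boto3

athena = boto3.client("athena")

# Query only one day's partitions so Athena scans (and bills for) a fraction of the data
response = athena.start_query_execution(
    QueryString="""
        SELECT customer_id, SUM(amount) AS total
        FROM gold.orders                      -- hypothetical Parquet table in the Glue catalog
        WHERE year = '2025' AND month = '01' AND day = '15'
        GROUP BY customer_id
    """,
    QueryExecutionContext={"Database": "gold"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])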

Well-Architected Alignment:

  • Operational Excellence: Crawlers for auto-discovery, Data Catalog for searchability, Glue workflows for orchestration
  • Security: Lake Formation permissions, encryption, audit logging, data classification
  • Reliability: Data versioning (S3), immutability (S3 Object Lock), backup (cross-region replication)
  • Performance: Partitioning, columnar formats, Spectrum for external queries, Redshift for complex OLAP
  • Cost Optimization: Athena pay-per-query, Redshift reserved for predictable workloads, S3 Intelligent-Tiering
  • Sustainability: Consolidate analytics on shared platforms; avoid silos; centralized data reduces duplication

Cost Model:

  • S3: $0.023 per GB/month (standard); $0.0125 (infrequent access)
  • Athena: $6.25 per TB scanned (queries only scan partitions needed)
  • Glue: $0.44 per DPU-hour
  • Redshift: ~$0.25/hour (dc2.large); ~$1.09/hour (ra3.xlplus); RA3 managed storage billed separately
  • Lake Formation: $1 per million metadata requests; ~$0 for permissions if using Glue

2.7 AI/ML & Agentic Systems

Characteristics:

  • Data-driven decision making and automation
  • Model training, evaluation, serving pipelines
  • Real-time inference or batch scoring
  • Emerging: Agentic AI (agents that reason and take actions)

Real-World Examples:

  • Recommendation systems, personalization
  • Fraud detection, anomaly detection
  • Natural language processing (chatbots, document analysis)
  • Computer vision (image classification, object detection)
  • Agentic AI: Autonomous agents that interact with systems, query databases, make decisions

Core AWS Services:

  • ML Platform: SageMaker (end-to-end), Bedrock (foundation models, agents)
  • Model Training: SageMaker Training, Batch, EMR (Spark/PySpark)
  • Feature Store: SageMaker Feature Store (managed; offline/online feature serving)
  • Model Registry: SageMaker Model Registry (version control, approval workflow)
  • Inference: SageMaker endpoints (real-time), SageMaker Batch Transform (offline), Lambda@Edge (edge ML)
  • Agentic AI: Bedrock Agents, Amazon Q (enterprise search/automation), Step Functions (orchestration)
  • Tools/APIs: Lambda (tool execution), API Gateway (external integrations), DynamoDB (state)
  • Data: S3 (training data), RDS/DynamoDB (features, state), OpenSearch (vector search for RAG)

Architecture Pattern (Traditional ML):

Data → SageMaker Training (S3 input) → Model Registry
                                    ↓
                         SageMaker Endpoint (real-time inference)
                         or
                         SageMaker Batch Transform (offline scoring)
                                    ↓
                         Application / BI Tool

Architecture Pattern (Agentic AI):

User Query → Bedrock Agent (orchestration) → Foundation Model (reasoning)
                                ↓
                        Tool Selection & Execution
                    ├── Lambda (query database)
                    ├── API Gateway (call external services)
                    ├── DynamoDB (read/write state)
                    └── OpenSearch (semantic search, RAG)
                                ↓
                        Response to User

Key Design Decisions:

  1. Foundation Model (Bedrock Agents):

    • Use Claude, Llama, Mistral, etc. via Bedrock (no provisioning)
    • Custom fine-tuned models (SageMaker) for domain-specific tasks
    • Prompt engineering and in-context learning for performance
  2. Agentic system architecture:

    • Tools: Lambda functions exposing APIs (query database, update CRM, send email)
    • Knowledge bases: OpenSearch or Bedrock Knowledge Base (vector embeddings for RAG)
    • Memory: DynamoDB for conversation history, user state, execution traces
    • Orchestration: Bedrock Agents handle reasoning; Step Functions for complex multi-step workflows
  3. Feature engineering and serving:

    • Offline: Glue/Spark to compute features in batch; store in S3
    • Online: SageMaker Feature Store for low-latency feature retrieval during inference
    • Real-time: Compute on-the-fly in inference code if lightweight; Lambda acceptable
  4. Model training pipeline:

    • Data preparation: Glue (structured), SageMaker Processing (custom code)
    • Training: SageMaker Training (managed) or EMR (Spark MLlib)
    • Evaluation: SageMaker Experiments (track metrics); model registry for approval
    • Deployment: SageMaker Endpoints (auto-scaling) or Lambda container images
  5. Inference patterns:

    • Real-time (< 1 sec): SageMaker endpoints (auto-scaling), Lambda (smaller models)
    • Near-real-time (seconds): Lambda + concurrent invocations, ECS
    • Batch (hours): SageMaker Batch Transform, Batch, Glue
    • Edge: Lambda@Edge, Greengrass for on-device inference
  6. RAG (Retrieval-Augmented Generation):

    • Knowledge base: OpenSearch with vector embeddings or Bedrock Knowledge Base
    • Retrieval: Semantic similarity search on user query
    • Generation: Agent passes retrieved context to LLM for response
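
The generation step of a RAG flow can be as small as a single Bedrock call once retrieval has produced candidate passages. The sketch below uses the Bedrock Converse API; the model ID, prompt shape, and inference settings are assumptions, and the retrieval step (OpenSearch or a Bedrock Knowledge Base) is assumed to have happened upstream.

import boto3

bedrock = boto3.client("bedrock-runtime")

def answer_with_context(question, retrieved_passages,
                        model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """RAG generation step: pass retrieved context plus the user question to the model."""
    context_block = "\n\n".join(retrieved_passages)
    prompt = f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"

    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]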

Well-Architected Alignment:

  • Operational Excellence: Experiment tracking (MLflow in SageMaker), model registry, automated retraining, A/B testing
  • Security: Encryption (KMS for S3 training data), IAM roles, PII redaction in prompts (for agents), audit logging
  • Reliability: Model versioning, canary deployments (gradual rollout), monitoring for model drift
  • Performance: Feature store for low-latency retrieval, batching for throughput, multi-GPU training
  • Cost Optimization: Use Bedrock for inference (pay-per-invocation) vs. SageMaker endpoint provisioning; spot training
  • Sustainability: Batch inference for non-urgent tasks; consolidate models; avoid redundant retraining

Cost Model:

  • Bedrock: $0.001–$0.015 per 1K input tokens (Claude Sonnet: $0.003/1K in, $0.015/1K out)
  • SageMaker endpoint (ml.m5.large): $0.0864/hour + data transfer
  • SageMaker Training (ml.p3.2xlarge with GPU): $3.06/hour
  • SageMaker Feature Store: $0.40 per million requests (online); $0.002 per GB/month (offline)
  • Bedrock Knowledge Bases: embedding cost of roughly $0.0001 per 1K input tokens with the Titan Text Embeddings model, plus the cost of the vector store

2.8 Edge, IoT, and Real-Time Communication

Characteristics:

  • Data originates at edge devices (sensors, mobile, on-prem)
  • Local processing and decision-making required (low latency)
  • Intermittent or unreliable connectivity
  • Scale: thousands to millions of devices

Real-World Examples:

  • Predictive maintenance (machine sensors → local ML → cloud)
  • Smart home automation
  • Mobile offline-first apps
  • Autonomous vehicles (local processing + cloud sync)
  • Real-time collaboration (low-latency WebSockets)

Core AWS Services:

  • Edge Compute: IoT Greengrass (on-device), Lambda@Edge (CloudFront), Outposts (on-prem)
  • IoT Connectivity: IoT Core (MQTT), IoT Wireless
  • Local Storage: Greengrass local resource access
  • Cloud Sync: DynamoDB global tables (eventual consistency), AppSync (real-time GraphQL)
  • Real-time APIs: API Gateway WebSocket, AppSync
  • Analytics: Kinesis, Timestream (time-series), OpenSearch

Architecture Pattern (IoT):

Sensors → Greengrass (local processing) → IoT Core (MQTT) → Kinesis/DynamoDB → Analytics
                                               ↓
                                        Local decision-making
                                        (low latency)

Architecture Pattern (Real-Time Collaboration):

Client (WebSocket) → API Gateway (WS) → Lambda (connect/message/disconnect)
                                             ↓
                                        DynamoDB (active connections)
                                        SNS/SQS (broadcast)
                                             ↓
                                        API Gateway (push to clients)

Key Design Decisions:

  1. Edge vs. cloud processing:

    • Local ML inference: Greengrass for low-latency decisions; models updated from cloud
    • Cloud processing: IoT Core → Kinesis → Lambda for aggregation and advanced analytics
    • Hybrid: Train in cloud; inference at edge; feedback loop for retraining
  2. Connectivity:

    • Always-on: IoT Core MQTT (publish/subscribe), reliable connection
    • Intermittent: Greengrass syncs when connected; local queue until reconnection
    • Low bandwidth: Compress data; send only deltas; batching
  3. Real-time communication:

    • WebSocket (API Gateway): Two-way, low-latency, good for < 100k concurrent connections
    • AppSync: GraphQL, built-in subscriptions, automatic connection management
    • Kinesis (event stream): High-throughput, ordered by partition, good for > 1M events/sec
  4. Data durability at edge:

    • Store locally in Greengrass; sync to cloud when connectivity restored
    • Use DynamoDB global tables for eventual consistency (multi-region)
    • S3 event notifications for file-based data
  5. Security:

    • IoT Core certificate-based auth; Greengrass runs as daemon
    • TLS encryption; end-to-end if needed
    • Least-privilege IAM roles
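
Tying the real-time collaboration pattern together, the broadcast side can be a small function that reads active connection IDs from DynamoDB and pushes to each one through the API Gateway Management API. The table name, endpoint URL, and cleanup behavior below are illustrative.

import json
import boto3

# Hypothetical table of active WebSocket connection IDs and the API Gateway WS endpoint
connections = boto3.resource("dynamodb").Table("ws-connections")
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://abc123.execute-api.us-east-1.amazonaws.com/prod",  # placeholder
)

def broadcast(message):
    """Push a message to every connected client; drop connection IDs that have gone away."""
    payload = json.dumps(message).encode("utf-8")
    for item in connections.scan()["Items"]:          # fine for small fleets; paginate at scale
        connection_id = item["connection_id"]
        try:
            apigw.post_to_connection(ConnectionId=connection_id, Data=payload)
        except apigw.exceptions.GoneException:
            connections.delete_item(Key={"connection_id": connection_id})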

Well-Architected Alignment:

  • Operational Excellence: Device shadowing (Greengrass), fleet-wide updates, telemetry
  • Security: Certificate rotation, encrypted communication, local data at-rest encryption
  • Reliability: Local queue on disconnect, eventual sync, multi-region global tables
  • Performance: Local ML inference (milliseconds), batched cloud sync, Kinesis for ordering
  • Cost Optimization: IoT Core MQTT cheaper than constant cloud calls; Greengrass reduces cloud traffic
  • Sustainability: Local processing reduces cloud load; edge devices only send necessary data

Cost Model:

  • IoT Core: $1.00 per million messages
  • IoT Greengrass core device: $1.00/month per device
  • API Gateway WebSocket: $0.35 per million messages
  • Timestream: $0.30 per million writes + $0.01 per million query units

2.9 Hybrid & Multi-Account Systems

Characteristics:

  • Systems span on-premises, AWS, or multiple AWS accounts/regions
  • Data and workloads need to integrate seamlessly
  • Governance and cost attribution across boundaries
  • Network connectivity maintained reliably

Real-World Examples:

  • Large enterprises with on-premises datacenters migrating to cloud
  • Multi-tenant SaaS (separate AWS account per customer)
  • Regulated industries (separate account for prod, dev, compliance)
  • Global companies (separate regions for data residency)

Core AWS Services:

  • Connectivity: AWS Direct Connect (dedicated network), VPN (encrypted tunnel), Transit Gateway (hub-and-spoke)
  • Multi-Account Mgmt: AWS Organizations, Control Tower, Security Hub, Config
  • Cross-Account Access: IAM roles (assume role across accounts), Resource-based policies
  • Data Sync: DataSync (on-prem ↔ S3), DMS (database replication), S3 cross-account replication
  • Governance: Lake Formation cross-account access, Tag policies, SCPs (Service Control Policies)

Architecture Pattern (Hybrid Cloud):

On-Premises ← Direct Connect / VPN → AWS
    ↓                                    ↓
Data Center            VPC + Private subnets
    ↓                                    ↓
Legacy Apps         Modern Apps (EC2/Lambda)
    ↓                                    ↓
Database ← DataSync/DMS → RDS/DynamoDB
    ↓                                    ↓
Monitoring ← EventBridge → CloudWatch

Architecture Pattern (Multi-Account SaaS):

AWS Organizations
├── Master (billing, security)
├── Prod Account 1 (Tenant A)
├── Prod Account 2 (Tenant B)
├── Dev Account
└── Security/Logging Account

Cross-Account:
- Assume role (Tenant A's apps assume role in Tenant A's account)
- Log aggregation (each account sends logs to central Security account)
- Data sharing (Lake Formation: Tenant A shares data with Tenant B via central account)

Key Design Decisions:

  1. Connectivity:

    • Direct Connect: Dedicated network; consistent bandwidth; good for large data transfers
    • VPN: Encrypted tunnel; supports multiple connections; good for backup/failover
    • Transit Gateway: Hub-and-spoke; simplifies multi-VPC and on-prem connectivity
  2. Cross-account access:

    • Role assumption: Service A in Account 1 assumes role in Account 2 to access resource
    • Resource-based policies: S3 bucket allows Account 2 principal to access
    • Federation: Active Directory users assume AWS roles via SAML/OIDC
  3. Data consistency:

    • Synchronous replication (DMS): Real-time sync; good for transactional databases
    • Asynchronous replication (S3 cross-account, cross-region): Eventual consistency; lower latency impact
    • Event-based sync: EventBridge rules replicate data; decoupled, scalable
  4. Governance:

    • Organizations: Centrally managed accounts; consolidated billing; SCPs for guardrails
    • Control Tower: Baseline accounts with security/compliance guardrails
    • Lake Formation: Centralized data lake with cross-account access; tag-based permissions
  5. Cost allocation:

    • Organization consolidates billing; cost tags for chargebacks
    • Cost anomaly detection per account
    • Separate AWS accounts per business unit or customer (clean billing, security isolation)
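
Cross-account access in code usually comes down to one STS call. The sketch below assumes a role in a hypothetical tenant account and returns a boto3 client bound to the temporary credentials; the role ARN and bucket name are placeholders, and the target role's trust policy must allow the calling principal.

import boto3

def client_in_other_account(service, role_arn, session_name="cross-account-access"):
    """Assume a role in another account and return a boto3 client that uses it."""
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=3600,
    )["Credentials"]
    return boto3.client(
        service,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# Example: read a bucket owned by the tenant account (ARN and bucket name are illustrative)
s3 = client_in_other_account("s3", "arn:aws:iam::210987654321:role/tenant-read-only")
print(s3.list_objects_v2(Bucket="tenant-a-data", MaxKeys=5).get("KeyCount"))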

Well-Architected Alignment:

  • Operational Excellence: Centralized logging (Security account), Config for compliance, Systems Manager for patching
  • Security: Least-privilege cross-account roles, encryption in-transit (Direct Connect, VPN), audit trails per account
  • Reliability: Multi-path connectivity (Direct Connect + VPN), cross-account backups, replication
  • Performance: Direct Connect for consistent bandwidth; Transit Gateway for simplified routing
  • Cost Optimization: Reserved capacity per account; consolidated billing for discounts; shutdown unused accounts
  • Sustainability: Consolidate under-utilized accounts; turn off dev accounts when not in use

Cost Model:

  • Direct Connect: $0.30 per hour + $0.02 per GB outbound (cheaper than data transfer for high volumes)
  • VPN: $0.05 per hour + standard data transfer charges
  • Transit Gateway: $0.05 per hour + $0.02 per GB processed through TGW

Section 3: AWS Service Decision Matrix (Core Framework)

For every major AWS service category, this section provides structured guidance on when to use, when NOT to use, tradeoffs, cost model, scaling behavior, and operational burden.

3.1 Compute Services

AWS Lambda

When to Use:

  • Synchronous request/response (API calls, webhooks)
  • Asynchronous event processing (S3 triggers, SQS, SNS, EventBridge)
  • Short-duration workloads (< 15 minutes)
  • Variable or bursty load (scale from 0 to 1000s automatically)
  • Cost-sensitive low-frequency tasks

When NOT to Use:

  • Long-running processes (> 15 min timeout; use ECS or EC2)
  • CPU-intensive workloads without strict latency constraints (EC2 cheaper)
  • Stateful applications with persistent connections (ECS/EC2)
  • Workloads with consistent, predictable high throughput (ECS/EC2 with Savings Plans cheaper)
  • Real-time latency-critical (< 100ms consistently; ECS/EC2 warm; Lambda cold start ~1-2 sec)

Tradeoffs:

  • Pros: No infrastructure to manage; auto-scales; pay only for execution; fast deployments
  • Cons: Cold starts (100ms-2s), 15-min timeout, 10GB max memory, vendor lock-in, difficult debugging

Cost Model:

  • Pricing: $0.0000002 per request + $0.0000166667 per GB-second
  • Example: 1M requests, 1GB RAM, 100ms duration = $0.2 (requests) + $1.67 (compute) = ~$1.87/month
  • Free tier: 1M requests + 400k GB-seconds/month

Scaling Behavior:

  • Auto-scales from 0 to 1000 concurrent executions (soft limit; request increase)
  • Reserved concurrency for baseline; provisioned concurrency for predictable warm starts
  • Burst: scales rapidly to an initial burst limit, then adds further concurrency at a fixed per-minute rate (exact limits are region-dependent and periodically raised by AWS)

Operational Burden:

  • Low: No patching, scaling, or infrastructure management
  • Debugging: CloudWatch logs, X-Ray tracing, local SAM testing
  • Versioning: Built-in; easy blue-green deployments

Well-Architected Mapping:

  • Cost: Ideal for variable workloads; reserved concurrency for baseline (cheaper for high consistent load)
  • Performance: Cold start sensitive; provisioned concurrency adds cost but ensures warm
  • Reliability: Automatically retried on transient failures; DLQ for async failures
  • Security: IAM role per function; no SSH access

Amazon ECS (Elastic Container Service)

When to Use:

  • Containerized applications (Docker, non-Kubernetes)
  • Moderate to high load (100s to 1000s requests/sec)
  • Long-running services (> 15 min)
  • Need for quick startup (< 1 sec) and predictable latency
  • Mixed workloads on shared cluster

When NOT to Use:

  • Variable, bursty workloads (Lambda more cost-effective for low utilization)
  • Complex orchestration (Kubernetes / EKS recommended)
  • Minimal infrastructure (Lambda/API Gateway less operational burden)

Tradeoffs:

  • Pros: Fast startup, warm containers, good for stateful services, mixed workload consolidation, less overhead than EKS
  • Cons: Requires container registry (ECR), task definitions, service scaling configuration; no built-in persistent volumes (use EFS)

Cost Model:

  • Fargate (serverless): $0.04695 per vCPU-hour + $0.00519 per GB-hour
  • Example: 1 vCPU, 2 GB, always-on = ($0.0470 + $0.0104)/hour ≈ $0.057/hour ≈ $42/month
  • EC2 launch type (self-managed): pay for the underlying EC2 instances; better ROI at sustained high utilization

Scaling Behavior:

  • Auto Scaling Group scales EC2 instances; ECS scheduler places tasks
  • Fargate scales near-instantly; EC2 scaling depends on instance launch time (~1-2 min)
  • Target tracking (CPU, memory); step scaling for complex rules

Operational Burden:

  • Moderate: Manage task definitions, scaling policies, blue-green deployments
  • Monitoring: CloudWatch metrics, container logs
  • Patching: Update task definitions and redeploy; ECS handles rolling updates

Well-Architected Mapping:

  • Cost: Fargate for simple, variable workloads; EC2 for predictable, high-utilization
  • Performance: No cold start; warm containers; can be more efficient than Lambda for sustained load
  • Reliability: Auto-restart failed tasks; service health checks; distributed across AZs
  • Security: IAM task role; container-level isolation

Amazon EKS (Elastic Kubernetes Service)

When to Use:

  • Complex microservices ecosystems (10s-100s of services)
  • Kubernetes-native tooling and expertise available
  • Advanced orchestration needs (canary deployments, traffic shifting, service mesh)
  • Workloads already containerized in Kubernetes

When NOT to Use:

  • Small teams without Kubernetes expertise (ECS simpler)
  • Simple applications (Lambda or ECS sufficient)
  • Minimal operational overhead desired

Tradeoffs:

  • Pros: Powerful orchestration, extensive ecosystem (Istio, Helm, Prometheus), vendor-agnostic (portable to other clouds)
  • Cons: Operational complexity; requires cluster management (node updates, networking, security); higher learning curve

Cost Model:

  • EKS cluster: $0.10 per hour (control plane) + EC2 or Fargate for worker nodes
  • Full cost: $73/month (cluster) + compute costs
  • Good for: 10+ services; high utilization; can amortize control plane cost

Scaling Behavior:

  • Cluster autoscaler adds/removes nodes based on pod resource requests
  • Horizontal Pod Autoscaler (HPA) scales pod replicas by CPU/memory/custom metrics
  • Complex multi-dimensional scaling (per-deployment, per-namespace)

Operational Burden:

  • High: Cluster upgrades, node patching, networking (CNI plugins), monitoring (add-ons)
  • Requires dedicated SRE/platform team for large-scale

Well-Architected Mapping:

  • Cost: Only justified for complex workloads; simpler apps should use ECS
  • Performance: Fine-grained control; service mesh for advanced routing
  • Reliability: Self-healing, automated failover, rolling updates
  • Security: RBAC, network policies, pod security policies

Amazon EC2 (Elastic Compute Cloud)

When to Use:

  • Legacy applications requiring specific OS or drivers
  • High CPU/memory workloads (compute-intensive)
  • Persistent connections (WebSockets, SSH, persistent databases)
  • Need for dedicated hardware or licensing

When NOT to Use:

  • Stateless, short-lived workloads (Lambda cheaper)
  • Variable load without auto-scaling configured
  • Greenfield modern applications (serverless preferred)

Tradeoffs:

  • Pros: Full control, any OS/software, persistent storage (EBS), good for sustained workloads
  • Cons: Operational burden (patching, scaling, security), must manage capacity, higher baseline cost

Cost Model:

  • On-demand: $0.0116/hour (t3.micro) to $10+/hour (large instances)
  • Reserved (1-year): ~40% discount; 3-year: ~65% discount
  • Spot: Up to 90% off on-demand; interruption risk
  • Good for: Baseline capacity (reserved); variable load (on-demand + spot)

Scaling Behavior:

  • Auto Scaling Group scales by launch time (~1-2 min), not instant
  • Gradual scaling (step scaling, target tracking) for stability
  • Spot fleet for cost optimization

Operational Burden:

  • High: OS patching, security groups, IAM roles, monitoring, backups (EBS snapshots)
  • AMI management for consistent deployments

Well-Architected Mapping:

  • Cost: Optimize with Reserved Instances + Savings Plans; use Savings Plans for flexibility across instance families
  • Performance: Predictable performance; tune instance type for workload
  • Reliability: Multi-AZ ASG; EBS volumes for persistence; snapshots for backup
  • Security: Security groups, IAM instance roles, encrypted EBS

AWS Batch

When to Use:

  • Large-scale batch processing (1000s of parallel jobs)
  • Cost-optimized batch workloads (use spot instances)
  • Scheduled jobs (daily, weekly ETL)
  • Distributed processing without Spark/Hadoop complexity

When NOT to Use:

  • Real-time or interactive workloads (ECS, Lambda)
  • Complex data transformations (Glue, EMR)

Tradeoffs:

  • Pros: Managed job queue, auto-scaling, spot instances for cost, simple job definitions
  • Cons: Not real-time; latency for job startup; limited monitoring compared to ECS

Cost Model:

  • Compute: Pay for EC2/Fargate underlying (Batch orchestration free)
  • Spot instances: 70% off on-demand
  • Example: 1,000 jobs × 1 hour each on m5.large spot (~70% off the ~$0.096/hour on-demand rate, i.e. ~$0.03/hour) ≈ 1,000 instance-hours × $0.03 ≈ $30

Scaling Behavior:

  • Managed job queue with auto-scaling
  • Jobs queued; Batch launches instances on demand
  • Scale-down after job completion (if no backlog)

Operational Burden:

  • Low-moderate: Define job definitions; submit jobs; Batch manages the rest
  • Monitoring: CloudWatch metrics, job logs

Well-Architected Mapping:

  • Cost: Excellent for batch workloads; spot instances reduce cost significantly
  • Reliability: Automatic retries on failure; DLQs for failed jobs
  • Performance: Parallel execution scales linearly

AWS Elastic Beanstalk

When to Use:

  • Simple web applications or APIs
  • Team comfortable with code, not infrastructure
  • Want managed platform without container complexity

When NOT to Use:

  • Complex microservices architectures (EKS)
  • Highly customized infrastructure
  • Need for fine-grained infrastructure control

Tradeoffs:

  • Pros: Managed; simple deployments (git push or CLI); handles scaling, load balancing, monitoring
  • Cons: Less control than ECS; less powerful than EKS; can be "magic" (hard to debug)

Cost Model:

  • Same as underlying EC2 + ALB + RDS (if used)
  • No additional Beanstalk fee
  • Good for: Simple apps where operational simplicity outweighs cost

Scaling Behavior:

  • Auto Scaling Group scales EC2 instances
  • Health checks monitor instances

Operational Burden:

  • Very Low: Push code; Beanstalk handles deployment, scaling, logging

Well-Architected Mapping:

  • Operational Excellence: Managed deployments; built-in monitoring; environment cloning
  • Cost: Transparent cost; same as self-managed
  • Performance: Good for small to medium workloads

3.2 Storage Services

Amazon S3 (Simple Storage Service)

When to Use:

  • Object storage (files, media, logs, backups, data lake)
  • Static website hosting
  • Archive (Glacier tiers)
  • Durability requirement (11 nines)

When NOT to Use:

  • Block storage (use EBS)
  • File system access patterns (use EFS)
  • Database (use RDS, DynamoDB)
  • Real-time read/write latency (milliseconds)

Tradeoffs:

  • Pros: Infinitely scalable, highly durable, cheap at scale, multi-region replication, lifecycle policies
  • Cons: Not a file system (overwrites replace the whole object; no partial updates), first-byte latency ~100-200ms, IAM/bucket-policy complexity

Cost Model:

  • Standard: $0.023 per GB/month
  • Intelligent-Tiering: $0.0125 per GB/month (auto-moves to cheaper tiers based on access)
  • Glacier Instant: $0.004 per GB/month (instant retrieval)
  • Glacier Flexible: $0.0036 per GB/month (1-12 hour retrieval)
  • Requests: $0.005 per 1k PUT/COPY/POST/LIST, $0.0004 per 1k GET
  • Data transfer out: $0.09 per GB (in-region free)
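
To illustrate the tiering above, a minimal boto3 lifecycle sketch (bucket name, prefix, and day thresholds are illustrative assumptions):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; transitions mirror the tiers priced above.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-archive",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 730},  # delete after two years
            }
        ]
    },
)
```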

Scaling Behavior:

  • Unlimited storage; auto-scales
  • Throughput: 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix
  • For higher throughput, spread objects across multiple key prefixes (each prefix scales independently)

Operational Burden:

  • Very Low: Fully managed; no provisioning, scaling, or patching
  • Configuration: Bucket policies, CORS, versioning, lifecycle, replication

Well-Architected Mapping:

  • Cost: Incredibly cheap at scale; Intelligent-Tiering for unknown access patterns; Glacier for archival
  • Reliability: 11 nines durability; cross-region replication for disaster recovery; versioning for data protection
  • Security: Bucket policies, ACLs, encryption (SSE-S3, SSE-KMS), Block Public Access, access logging
  • Sustainability: Intelligent-Tiering reduces carbon footprint; archive old data to Glacier

Amazon EBS (Elastic Block Store)

When to Use:

  • Block storage for EC2 instances (root volume, data volumes)
  • Persistent storage for databases
  • High-performance workloads (SSD gp3, io2)

When NOT to Use:

  • Object storage (use S3)
  • Shared file system (use EFS)
  • Archival (use Glacier)

Tradeoffs:

  • Pros: Block-level, high performance, snapshots for backup, encryption
  • Cons: Must be attached to EC2 instance; limited to single AZ (unless snapshot-replicated)

Cost Model:

  • gp3 (general-purpose SSD): $0.08 per GB/month (includes 3k IOPS, 125 MB/s throughput)
  • io2 (high I/O SSD): $0.125 per GB/month; $0.065 per IOPS provisioned
  • st1 (throughput-optimized HDD): $0.045 per GB/month
  • Snapshots: $0.05 per GB/month (incremental)
  • Example: 100 GB gp3 = $8/month

Scaling Behavior:

  • Size is set at provisioning; volumes can be grown online but never shrunk (migrate to a smaller volume instead)
  • IOPS scale independently of size (gp3: up to 16k IOPS)

Operational Burden:

  • Low: Fully managed; snapshots are automated if configured
  • Monitoring: Volume metrics in CloudWatch

Well-Architected Mapping:

  • Reliability: Snapshots for backup; replicate snapshots for cross-AZ recovery
  • Performance: Choose instance-store for max IOPS (no persistence); EBS for balance
  • Cost: gp3 cheaper than gp2 at same performance; delete unused snapshots

Amazon EFS (Elastic File System)

When to Use:

  • Shared file system for multiple EC2 instances
  • NFS access required
  • Scaling file system across AZs

When NOT to Use:

  • High-performance (EBS/instance-store faster)
  • Archive (use S3)
  • Windows (use FSx for Windows File Server)

Tradeoffs:

  • Pros: Elastic (grow/shrink), multi-AZ, scalable across instances
  • Cons: Latency higher than EBS; NFS protocol overhead; more expensive

Cost Model:

  • Standard: $0.30 per GB/month
  • One Zone: $0.16 per GB/month (single AZ, roughly half the cost of Standard)
  • Provisioned throughput: billed per MB/s-month when provisioned above the bursting baseline
  • Example: 100 GB multi-AZ = $30/month

Scaling Behavior:

  • Auto-scales; no provisioning needed
  • Throughput: Bursting (up to 500 MB/s for 100 GB); provisioned for higher sustained

Operational Burden:

  • Low: Fully managed; no patching or replication

Well-Architected Mapping:

  • Reliability: Data replicated across AZs automatically
  • Performance: Throughput mode for parallelized workloads
  • Cost: More expensive than S3 for archive; less expensive than maintaining NFS servers

Amazon FSx for Windows File Server / Lustre

When to Use:

  • Windows-native file sharing (SMB/CIFS)
  • Lustre (high-performance computing, machine learning)

When NOT to Use:

  • NFS needed (use EFS)
  • General object storage (use S3)

Tradeoffs:

  • Pros: Fully managed; Windows-native; high performance
  • Cons: More expensive; less flexible than self-managed file servers

Cost Model:

  • Windows File Server: billed per GB-month of provisioned storage (SSD or HDD) plus provisioned throughput capacity
  • Lustre: billed per GB-month of provisioned storage; rate varies by deployment type (scratch vs. persistent)

Well-Architected Mapping:

  • Performance: Low-latency file access; good for Windows environments
  • Cost: Justified only for Windows workloads needing shared files

3.3 Database Services

Amazon RDS (Relational Database Service)

When to Use:

  • Structured data with complex queries (SQL)
  • ACID transactions required
  • Existing relational database workloads (MySQL, PostgreSQL, Oracle, SQL Server, MariaDB)
  • < 100 TB data

When NOT to Use:

  • NoSQL access patterns (key-value; use DynamoDB)
  • Unstructured data (use S3)
  • Extreme scale (> 100 TB; use Redshift or specialized database)
  • Real-time analytics (use Redshift, Athena, or OpenSearch)

Tradeoffs:

  • Pros: Managed, automated backups, Multi-AZ failover, read replicas, encryption
  • Cons: Limited to single-master writes (though read replicas help); must right-size; managed backups limited to 35 days

Cost Model:

  • Instance: $0.017/hour (db.t3.micro) to $10+/hour (db.r6g.16xlarge)
  • Storage: $0.10 per GB/month (gp2 SSD); $0.12 (io1 SSD)
  • Backups: First backup free; additional storage $0.10 per GB/month
  • Data transfer: $0.01 per GB (out-of-region)
  • Example: db.t3.small (small app) = ~$50/month + storage

Scaling Behavior:

  • Vertical scaling (change instance type; requires downtime or read-replica promotion)
  • Read replicas for horizontal read scaling (async replication; eventual consistency)
  • Auto Scaling for storage (grow up to max)

Operational Burden:

  • Low-moderate: Backups, upgrades managed; must monitor CPU/disk; schema management
  • Multi-AZ automatic failover for HA

Well-Architected Mapping:

  • Reliability: Multi-AZ for automatic failover (2x cost); read replicas for DR and scaling
  • Performance: Connection pooling; query optimization; appropriate indexes
  • Cost: Right-size instance type; use Reserved Instances (1-year: ~31% off, 3-year: ~43% off); gp2 → gp3 for cost reduction
  • Security: Encryption at-rest (KMS), in-transit (SSL); IAM database authentication; encrypted backups

Amazon Aurora (MySQL/PostgreSQL-compatible)

When to Use:

  • High-throughput, low-latency SQL workloads
  • Mission-critical applications requiring high availability
  • Need for read scaling (15 read replicas)
  • Up to 128 TB storage

When NOT to Use:

  • Simple applications (RDS enough)
  • Cost-sensitive simple workloads (Aurora instance-hours typically run ~20% above equivalent RDS)
  • Oracle-licensed (RDS Oracle only)

Tradeoffs:

  • Pros: 5x faster than MySQL, 3x faster than PostgreSQL; 15 read replicas; auto-scaling storage; distributed architecture
  • Cons: More expensive per instance-hour; single-writer is the standard topology (multi-writer support is limited); Aurora Serverless can add cold-start/scale-up latency

Cost Model:

  • DB instance: ~$0.04/hour (db.t3.small) to ~$8+/hour (db.r6g.16xlarge); roughly 20% above comparable RDS per instance-hour
  • Storage: $0.10 per GB/month (only pay for used; auto-scaling)
  • Example: 100 GB, db.r6g.large (1 writer + 2 readers) ≈ $570/month plus I/O charges (better ROI at scale)

Scaling Behavior:

  • Read replicas scale independently (up to 15)
  • Auto Scaling Aurora Serverless (aurora-mysql) for variable load; pause when idle
  • Storage auto-scales; no disk-full risk

Operational Burden:

  • Very Low: Managed; replication automatic; backups automated (35-day retention)
  • Multi-AZ automatic; cross-region read replica option

Well-Architected Mapping:

  • Reliability: Multi-AZ primary + read replicas; automatic failover; cross-region DR
  • Performance: 5x throughput gains; up to 100k write capacity; 500k read capacity
  • Cost: Higher upfront; lower per-transaction cost at scale; Serverless variant for variable loads
  • Sustainability: Shared compute for read replicas (efficient); auto-pause for low utilization

Amazon DynamoDB

When to Use:

  • Key-value, document, or time-series data
  • Predictable access patterns (query by partition key)
  • Variable or bursty load (on-demand pricing)
  • Millisecond latency required
  • Serverless architecture

When NOT to Use:

  • Complex queries across unrelated attributes (RDS, Redshift)
  • Relational integrity (RDS)
  • Full-text search (OpenSearch)
  • Transactions spanning many items (DynamoDB transactions are capped at 100 items per request)

Tradeoffs:

  • Pros: Fully managed, auto-scales, millisecond latency, serverless, global tables
  • Cons: Query flexibility limited (must know partition key); eventual consistency for global tables; complex secondary indexes

Cost Model:

  • Provisioned: Read: $0.00013 per RCU-hour; Write: $0.00065 per WCU-hour
  • On-demand: $1.25 per 1M write requests; $0.25 per 1M read requests
  • Storage: $0.25 per GB/month
  • Example: 100 GB, on-demand, 100k write + 500k read/month = $0.125 + $0.125 + $25 = $25.25/month
  • Breakeven: Provisioned cheaper at > 1M writes/day or > 3M reads/day

Scaling Behavior:

  • Provisioned: Set RCU/WCU; auto-scales up within limits; scales down with delay (conservative)
  • On-demand: Scales instantly; no capacity planning; higher latency at extreme peak
  • Global Tables: Replicate to any region; cross-region replication is asynchronous and eventually consistent

Operational Burden:

  • Very Low: Fully managed; no patching, replication, or backups to configure
  • Monitoring: Consumed capacity, throttling, latency in CloudWatch

Well-Architected Mapping:

  • Cost: On-demand for variable; provisioned for predictable; reserved capacity for a stable baseline
  • Reliability: Automatic backups (Point-in-Time Recovery); Global Tables for multi-region
  • Performance: Millisecond latency; partition key design critical to avoid hot partitions
  • Security: Encryption at-rest (KMS), IAM fine-grained access, TTL for automatic deletion
  • Sustainability: Managed service; consolidate workloads
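
As a sketch of the partition-key-centric access pattern, assuming a hypothetical orders table keyed on customer_id (partition key) and order_date (sort key):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # hypothetical table name

# Query by partition key plus a sort-key range -- the access pattern
# the table's key schema was designed around.
resp = table.query(
    KeyConditionExpression=Key("customer_id").eq("c-1001")
    & Key("order_date").begins_with("2025-12"),
    Limit=50,
)
for item in resp["Items"]:
    print(item["order_id"], item["total"])
```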

Amazon Redshift

When to Use:

  • Data warehousing (TB-PB scale)
  • Complex OLAP queries (star schema, large joins)
  • BI/analytics workloads
  • Time-series analytics (rollups, trends)

When NOT to Use:

  • OLTP/transactional (RDS, DynamoDB)
  • Ad-hoc queries (Athena cheaper)
  • Real-time ingestion (Kinesis better)

Tradeoffs:

  • Pros: Petabyte-scale, fast complex queries, mature BI integration, Spectrum for querying S3
  • Cons: Requires provisioning and management; node failures degrade or interrupt the cluster until the node is replaced (multi-node clusters recover automatically)

Cost Model:

  • dc2.large: ~$0.25/hour (~$180/month per node)
  • ra3.xlplus (managed storage): ~$1.09/hour + ~$0.024 per GB/month for managed storage
  • Example: single-node dc2.large for a small, sustained warehouse ≈ $180/month

Scaling Behavior:

  • Vertical (resize cluster type) or horizontal (add nodes)
  • Resizing requires cluster downtime or elastic resize (newer, no downtime)

Operational Burden:

  • Moderate: Cluster maintenance, node replacement, VACUUM/ANALYZE
  • Spectrum: Query S3 without loading (larger effective warehouse without cluster growth)

Well-Architected Mapping:

  • Cost: Fixed monthly cost; break-even against Athena at ~1 TB queries/month; reserved instances available
  • Performance: Complex OLAP queries; star schema optimization; Spectrum for external data
  • Reliability: Snapshots for backup; cross-region snapshot copy for DR
  • Security: Encryption at-rest, in-transit; IAM; audit logging (via CloudWatch/S3)

Amazon OpenSearch (formerly Elasticsearch)

When to Use:

  • Full-text search (logs, documents)
  • Time-series analytics (logs with timestamps)
  • Real-time dashboards (Kibana visualization)
  • Vector search (embeddings for ML, RAG)

When NOT to Use:

  • Traditional OLAP (Redshift)
  • Transactional (RDS)
  • Pure archival (S3, Glacier)

Tradeoffs:

  • Pros: Fast full-text search, powerful aggregations, real-time visualization, vector search for ML
  • Cons: Requires cluster; complex cluster configuration (node types, shard allocation); eventual consistency

Cost Model:

  • Single node (t3.small): $0.123/hour (~$90/month)
  • Multi-node (3 data nodes, r5.large): $1.26/hour ($920/month)
  • Storage: EBS volumes billed separately at standard rates; size limits scale with node type
  • Example: Small logging cluster (3 nodes) ~$900/month + data retention

Scaling Behavior:

  • Horizontal scaling (add nodes); shard allocation balances data
  • Auto-scaling available but requires careful configuration
  • Manual index rollover for time-series (daily indices)

Operational Burden:

  • Moderate: Shard allocation, index management, cluster health monitoring
  • Snapshot repository for backup (S3)

Well-Architected Mapping:

  • Cost: Fixed; justified for search/logging volume > 1TB/month
  • Performance: Real-time search; aggregations on large datasets
  • Reliability: Snapshots to S3; replica shards for HA; cross-region replication
  • Security: Encryption, IAM, fine-grained access control via plugins

Amazon Timestream

Timestream (purpose-built for time-series data):

When to Use:

  • Metrics, sensor data, stock prices (time-series)
  • High-volume, append-only workloads
  • Automatic retention policies

When NOT to Use:

  • General analytics (Redshift, Athena)
  • Complex queries (OpenSearch better for logs)

Cost Model: ~$0.50 per 1M writes (1 KB each); queries billed per GB of data scanned (~$0.01/GB)


3.4 Integration & Messaging Services

Amazon SQS (Simple Queue Service)

When to Use:

  • Async decoupling (producer → queue → consumer)
  • Buffer burst traffic (queue absorbs spikes)
  • Reliable delivery guarantee
  • Work scheduling (Lambda pulling messages)

When NOT to Use:

  • Real-time pub/sub (SNS)
  • Complex routing (EventBridge)
  • Immediate delivery (SNS faster)

Tradeoffs:

  • Pros: Durable queue, at-least-once delivery, DLQ for failed messages, simple FIFO option
  • Cons: Polling model (not push); no broad fanout (one consumer per message)

Cost Model:

  • Standard SQS: $0.40 per 1M requests
  • FIFO: $0.50 per 1M requests + deduplication/group messaging
  • Example: 1M messages/month = $0.40 (essentially free at small scale)

Scaling Behavior:

  • Unlimited message count; auto-scales
  • Message retention: 1 minute to 14 days (default 4 days, configurable)
  • Consumers poll (or Lambda Event Source Mapping triggers Lambda)

Operational Burden:

  • Very Low: Fully managed; configure retention, visibility timeout, DLQ

Well-Architected Mapping:

  • Reliability: Durable queue; DLQ for failed messages; at-least-once delivery (idempotent processing required)
  • Cost: Extremely cheap; pay per 1M requests
  • Performance: Polling latency 0-20 sec (depends on ReceiveMessageWaitTimeSeconds)
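
A minimal boto3 consumer sketch showing long polling and at-least-once handling (the queue URL and processing logic are placeholders):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # placeholder

def process(body: str) -> None:
    print("processing", body)  # stand-in for idempotent business logic

# Long polling (WaitTimeSeconds up to 20) cuts empty receives and request cost.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    process(msg["Body"])
    # Delete only after successful processing; otherwise the message
    # reappears once the visibility timeout expires.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```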

Amazon SNS (Simple Notification Service)

When to Use:

  • Fanout (one message → many consumers)
  • Pub/Sub pattern
  • Notifications (email, SMS, push)
  • Integration with SQS (SNS → SQS for durability)

When NOT to Use:

  • Ordered delivery (FIFO not as robust as SQS FIFO)
  • Complex routing (EventBridge)
  • Message history/replay needed (Kinesis)

Tradeoffs:

  • Pros: Simple pub/sub, instant delivery, fanout, many targets
  • Cons: No message history; no durability guarantees (best-effort delivery); no replay

Cost Model:

  • $0.50 per 1M publishes
  • Notifications: $0.02 per SMS, variable for email/HTTP
  • Example: 1M publishes → $0.50/month; negligible at small scale

Scaling Behavior:

  • Unlimited subscribers; auto-scales
  • Delivery attempts: 3 (exponential backoff)

Operational Burden:

  • Very Low: Fully managed; configure subscriptions, filters

Well-Architected Mapping:

  • Reliability: Couple with SQS for durability (SNS → SQS → Consumer)
  • Cost: Very cheap
  • Performance: Instant delivery
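
A small boto3 sketch of the SNS → SQS pairing described above (ARNs are placeholders; the queue access policy granting sns.amazonaws.com SendMessage is omitted):

```python
import json

import boto3

sns = boto3.client("sns")

# Placeholder ARNs; topic and queue are assumed to exist already.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:order-events"
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:order-events-audit"

# Subscribe the queue to the topic so every published event lands durably in SQS.
sns.subscribe(
    TopicArn=TOPIC_ARN,
    Protocol="sqs",
    Endpoint=QUEUE_ARN,
    Attributes={"RawMessageDelivery": "true"},
)

# Publish once; every subscribed endpoint receives a copy (fanout).
sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps({"order_id": "o-42", "status": "PAID"}))
```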

Amazon EventBridge

When to Use:

  • Event-driven architecture (100+ event sources)
  • Rule-based routing (complex filtering)
  • Multi-target fanout (90+ targets)
  • Integration with AWS services and SaaS apps
  • Event replay and archiving

When NOT to use:

  • Simple queue (SQS)
  • Just fanout (SNS)
  • High-throughput streaming (Kinesis)

Tradeoffs:

  • Pros: Flexible routing, extensive targets, schema registry, event replay
  • Cons: More complex than SNS/SQS; slightly higher latency; limited throughput vs. Kinesis

Cost Model:

  • $1.00 per 1M custom events published (events emitted by AWS services are free)
  • Archive retention: $0.023 per GB/month
  • Example: 1M custom events/month = $1.00 (negligible)

Scaling Behavior:

  • Scales to thousands of events per second per account (soft limits that can be raised)
  • Auto-scales

Operational Burden:

  • Low: Define rules; manage targets
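
A minimal boto3 sketch of a rule plus target on the default event bus (the event source, pattern, and Lambda ARN are illustrative assumptions; the target function also needs a resource policy allowing events.amazonaws.com, omitted here):

```python
import json

import boto3

events = boto3.client("events")

# Rule matching a hypothetical custom event, with content-based filtering.
events.put_rule(
    Name="order-placed-to-fulfillment",
    EventPattern=json.dumps({
        "source": ["com.example.orders"],
        "detail-type": ["OrderPlaced"],
        "detail": {"total": [{"numeric": [">", 100]}]},
    }),
    State="ENABLED",
)

# Route matching events to a target (placeholder Lambda ARN).
events.put_targets(
    Rule="order-placed-to-fulfillment",
    Targets=[{
        "Id": "fulfillment-fn",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:fulfillment",
    }],
)
```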

Well-Architected Mapping:

  • Reliability: Event replay from archive; DLQ for failed targets
  • Cost: Extremely cheap
  • Performance: ~50-100ms latency; Kinesis faster for streaming

Amazon Kinesis Data Streams

When to Use:

  • Streaming (continuous data flow)
  • Ordered by partition (shard)
  • Real-time processing (seconds latency)
  • 24-hour message history (replay)
  • High throughput (100k+ events/sec)

When NOT to Use:

  • Simple queuing (SQS/SNS)
  • Low-volume events (EventBridge cheaper)
  • Long message history (Kafka/MSK)

Tradeoffs:

  • Pros: Real-time streaming, partitioned for ordering/parallelization, replay capability
  • Cons: Requires shard management or on-demand pricing; higher cost than SQS

Cost Model:

  • Provisioned: ~$0.015 per shard-hour + ~$0.014 per 1M PUT payload units (25 KB each)
  • On-demand: ~$0.04 per stream-hour + ~$0.08 per GB ingested
  • Example: 1M events/hour (1 KB each) ≈ 1 GB/hour; on-demand ≈ $29 (stream-hours) + $58 (ingest) ≈ $90/month; the same load fits one provisioned shard ≈ $11 + $10 in PUT units ≈ $21/month

Scaling Behavior:

  • Provisioned: Auto-scaling by shard count
  • On-demand: Scales instantly
  • Throughput: 1 MB/sec per shard (provisioned); billed per GB on-demand

Operational Burden:

  • Low-moderate: Shard management, consumer lag monitoring, partition key design
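
A brief boto3 producer sketch; the stream name is a placeholder, and the high-cardinality partition key is what spreads records across shards:

```python
import json
import random

import boto3

kinesis = boto3.client("kinesis")

# The partition key determines the shard, so use a high-cardinality value
# (e.g., device ID) to avoid hot shards.
event = {"device_id": f"sensor-{random.randint(1, 10_000)}", "temp_c": 21.7}
kinesis.put_record(
    StreamName="telemetry-stream",  # hypothetical stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["device_id"],
)
```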

Well-Architected Mapping:

  • Cost: Provisioned for baseline; on-demand for variable
  • Performance: Real-time streaming, 24-hour replay, partition ordering
  • Reliability: Replicated across AZs; consumer checkpointing for fault tolerance

Amazon MSK (Managed Streaming for Apache Kafka)

When to Use:

  • Kafka ecosystem expertise available
  • Cross-cloud/on-prem Kafka integration needed
  • Complex streaming logic (Kafka Streams, Flink)
  • Larger throughput than Kinesis economically viable

When NOT to Use:

  • AWS-only workloads (Kinesis simpler)
  • Low-volume (EventBridge, SQS cheaper)
  • Need for hands-off (Kinesis more managed)

Tradeoffs:

  • Pros: Kafka ecosystem, portability, higher throughput at scale
  • Cons: More operational burden; cluster management; higher baseline cost

Cost Model:

  • Broker cost: ~$0.21 per hour per kafka.m5.large broker (3 brokers ≈ $460/month baseline)
  • Storage: $0.10 per GB/month
  • Data transfer: $0.02 per GB (cross-region / out)
  • Example: 3 brokers, 100 GB storage ≈ $460 + $10 = $470/month

Scaling Behavior:

  • Broker count and storage size can be scaled
  • Storage auto-scaling is available; changing broker count or type is a cluster update operation

Operational Burden:

  • High: Broker configuration, topic management, consumer group coordination, monitoring

Well-Architected Mapping:

  • Cost: Fixed baseline; good for high-volume/long-term commitments
  • Performance: High throughput; complex streaming logic
  • Reliability: Multi-AZ; broker redundancy; replication factor

AWS Step Functions

When to Use:

  • Multi-step workflows with conditional logic
  • Long-running processes (minutes to days)
  • Error handling and retries needed
  • Visual workflow definition
  • Orchestrating async services

When NOT to Use:

  • Simple task execution (Lambda sufficient)
  • Strict real-time (latency implications)

Tradeoffs:

  • Pros: Visual workflow, error handling, retry policies, state persistence, waiting
  • Cons: State transitions cost money; additional latency per step; limited native support for some use cases

Cost Model:

  • Standard: $0.000025 per state transition
  • Express: $1.00 per 1M requests ($0.000001 per invocation) + ~$0.00001667 per GB-second of duration
  • Example: 1M workflows, 10 steps = 10M transitions = $250/month

Scaling Behavior:

  • Millions of concurrent executions
  • Automatic

Operational Burden:

  • Low: Define state machine in JSON; Step Functions handles orchestration
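
A minimal sketch of defining and creating a Standard state machine with boto3, assuming a hypothetical Lambda task and execution role:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Minimal Amazon States Language definition: one Lambda task with retries.
definition = {
    "StartAt": "ProcessOrder",
    "States": {
        "ProcessOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-order",
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3, "BackoffRate": 2.0}],
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="order-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-exec",  # placeholder role
    type="STANDARD",  # use "EXPRESS" for high-volume, short-lived workflows
)
```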

Well-Architected Mapping:

  • Reliability: Automatic retries, catch/throw errors, compensation logic
  • Cost: Extremely cheap for state transitions; Express mode for high-volume
  • Performance: Step execution ~0.5-2 sec latency; suited for async workflows, not real-time

3.5 Networking Services

Amazon VPC (Virtual Private Cloud)

When to Use:

  • Almost every workload (default)
  • Network isolation required
  • Custom IP addressing
  • Network ACLs, security groups

When NOT to Use:

  • Rarely; a VPC is foundational for almost every deployment

Architecture Pattern:

  • Public subnet: IGW for internet; ALB, NAT Gateway
  • Private subnet: No direct internet; NAT Gateway for outbound
  • Database subnet: Isolated; only accessible from app tier

Scaling Behavior:

  • Elastic; auto-scales
  • VPC can have max 5 CIDR blocks

Operational Burden:

  • Moderate: Design subnets, security groups, route tables, NAT Gateway (hourly cost)

Well-Architected Mapping:

  • Security: Network isolation; security groups per tier; NACLs for stateless filtering
  • Reliability: Multi-AZ subnets; redundant NAT Gateways
  • Cost: Free VPC; pay for NAT Gateway ($32/month per AZ), data transfer

Application Load Balancer (ALB) / Network Load Balancer (NLB)

ALB (Layer 7):

  • When to use: HTTP/HTTPS, microservices (path/hostname routing), WebSocket
  • Cost: ~$16/month base (us-east-1) + $0.008 per LCU-hour (load balancer capacity unit)
  • Latency: adds single-digit to tens of milliseconds

NLB (Layer 4):

  • When to use: Extreme throughput (millions of connections/sec), ultra-low latency, non-HTTP protocols (TCP, UDP)
  • Cost: ~$16/month base + $0.006 per NLCU-hour
  • Latency: adds sub-millisecond to low single-digit milliseconds

Well-Architected Mapping:

  • Reliability: Health checks; automatic failover; cross-AZ
  • Performance: ALB for flexibility; NLB for extreme throughput
  • Cost: ~$16/month baseline per load balancer plus capacity units; share across multiple services if possible

Amazon API Gateway

When to Use:

  • REST, HTTP, or WebSocket APIs (for GraphQL, use AWS AppSync)
  • Rate limiting, authentication (API keys, OAuth)
  • Request/response transformation
  • Caching

When NOT to Use:

  • Internal service-to-service (VPC Endpoints)
  • Extreme throughput (NLB better)

Cost Model:

  • $3.50 per 1M requests (regional)
  • $0.60 per 1M for WebSocket API
  • Data transfer: $0.09 per GB out

Scaling Behavior:

  • Auto-scales; no provisioning

Operational Burden:

  • Low: Define API, configure integrations, set up throttling

Well-Architected Mapping:

  • Security: API keys, OAuth, request validation, WAF integration
  • Reliability: Throttling prevents cascading failures
  • Performance: Caching reduces backend load
  • Cost: Extremely cheap; $3.50 per 1M requests (~$3.50/month for 1M requests)

Amazon CloudFront

When to Use:

  • Distribute static content globally
  • Cache API responses (cache headers)
  • DDoS protection (Shield Standard included)
  • Origin Shield for cache efficiency

When NOT to Use:

  • Single-region, low-traffic workloads

Cost Model:

  • $0.085 per GB (varies by region, USA cheapest)
  • Request: $0.01 per 10k

Scaling Behavior:

  • Auto-scales globally; no provisioning

Operational Burden:

  • Very Low: Configure origin, cache behavior, invalidation

Well-Architected Mapping:

  • Performance: Global latency reduction (users served from nearest edge); significant for global audiences
  • Cost: Minimal request cost; high if transferring large GB (offset by origin load reduction)
  • Security: DDoS protection, WAF, Origin Shield for burst traffic
  • Sustainability: Reduced origin load; distributed edge computing

AWS Transit Gateway

When to Use:

  • Hub-and-spoke connectivity (multiple VPCs, on-prem)
  • Simplified multi-VPC architecture
  • On-premises integration via Direct Connect

When NOT to use:

  • Single VPC (unnecessary)

Cost Model:

  • $0.05 per hour (~$36/month)
  • $0.02 per GB processed

Well-Architected Mapping:

  • Reliability: Centralized connectivity; simplified failover
  • Cost: Justified for 3+ VPCs or multi-region

AWS Direct Connect

When to Use:

  • Dedicated network from on-premises to AWS
  • Consistent bandwidth
  • Large data transfers (cheaper than internet data transfer)

When NOT to Use:

  • Small, intermittent connections (VPN sufficient)

Cost Model:

  • $0.30 per hour (~$218/month)
  • $0.02 per GB output

Well-Architected Mapping:

  • Reliability: Dedicated connection; predictable performance
  • Cost: Justified for > 10 TB/month transfers

3.6 Security Services

AWS IAM (Identity & Access Management)

When to use: Always

  • Every workload needs IAM roles and policies
  • Fine-grained permissions (least privilege)
  • Cross-account access via roles

Cost Model: Free

Well-Architected Mapping:

  • Security: Least privilege; assume role model; remove console access

AWS KMS (Key Management Service)

When to use:

  • Encrypt sensitive data (PII, financial, secrets)
  • At-rest encryption (S3, RDS, EBS)

Cost Model:

  • $1.00 per month per key
  • $0.03 per 10k requests

Well-Architected Mapping:

  • Security: Encryption; audit trail (CloudTrail); key rotation
  • Compliance: Required for regulated data (PII, HIPAA, PCI)

AWS Secrets Manager

When to use:

  • Store database credentials, API keys, tokens
  • Automatic rotation

Cost Model:

  • $0.40 per secret per month
  • $0.05 per 10k API calls

Well-Architected Mapping:

  • Security: No hardcoded credentials; rotation; audit
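
A short boto3 sketch of reading a secret at runtime instead of hardcoding credentials (the secret name and its JSON keys are assumptions):

```python
import json

import boto3

secrets = boto3.client("secretsmanager")

# Hypothetical secret holding database credentials as JSON.
value = secrets.get_secret_value(SecretId="prod/orders-db")
creds = json.loads(value["SecretString"])

# Build the connection string at runtime; nothing sensitive lives in config files.
dsn = f"postgresql://{creds['username']}:{creds['password']}@{creds['host']}:5432/orders"
print("connection string assembled for host", creds["host"])
```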

Amazon GuardDuty

When to use:

  • Threat detection
  • Continuous monitoring for malicious activity
  • Integration with Security Hub

Cost Model:

  • CloudTrail management events: ~$4.00 per 1M events analyzed (tiered); VPC Flow Logs and DNS logs billed per GB

Well-Architected Mapping:

  • Security: Automated threat detection; alerts; findings

AWS WAF (Web Application Firewall)

When to use:

  • Protect web applications from attacks (SQL injection, XSS, bot attacks)
  • Integration with CloudFront, ALB, API Gateway

Cost Model:

  • $5.00 per web ACL per month + $1.00 per rule or rule group per month
  • $0.60 per 1M requests

Well-Architected Mapping:

  • Security: Application-layer protection; rate limiting; IP blocking

3.7 Observability Services

Amazon CloudWatch

When to use: Every workload

  • Metrics (CPU, memory, custom)
  • Logs (application, system)
  • Alarms (trigger auto-scaling, SNS)
  • Dashboards (visualization)

Cost Model:

  • Logs: $0.50 per GB ingested; $0.03 per GB stored
  • Metrics: $0.30 per custom metric per month
  • Alarms: $0.10 per alarm per month

Well-Architected Mapping:

  • Operational Excellence: Observability; alerts; dashboards
  • Reliability: Alarms trigger auto-scaling; identify bottlenecks
  • Cost: Identify over-provisioned resources; optimize
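
For illustration, a boto3 sketch of an alarm on ALB target 5xx errors wired to an SNS topic (resource names, ARNs, and thresholds are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when target 5xx responses exceed 10/minute for 5 consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-5xx-rate",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/orders-alb/abc123"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)
```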

AWS X-Ray

When to use:

  • Distributed tracing across microservices
  • Identify latency bottlenecks
  • Understand service dependencies

Cost Model:

  • $5.00 per 1M recorded traces
  • $0.50 per 1M retrieved traces

Well-Architected Mapping:

  • Operational Excellence: Trace requests end-to-end; identify latency
  • Reliability: Understand failure propagation

Amazon Managed Prometheus / Grafana

When to use:

  • Kubernetes metrics (via Prometheus scraping)
  • Long-term metric storage
  • Custom visualization (Grafana)

Cost Model:

  • Prometheus: $0.90 per 1M ingested samples
  • Grafana: $9.00 per active editor/admin user per month ($5.00 per viewer)

Well-Architected Mapping:

  • Operational Excellence: Kubernetes-native monitoring
  • Cost: For EKS workloads


Section 4: Architecture Pattern Library (Domain-Independent)

Every system, regardless of domain, is composed of these reusable patterns.

4.1 CRUD Backend

Definition: Create, Read, Update, Delete operations on a data model.

Pattern:

Client (mobile, web, API) → API Gateway → Lambda → RDS/DynamoDB

When to use: Always (fundamental pattern)

Well-Architected:

  • Security: IAM roles; request validation; encryption
  • Reliability: Error handling; retry logic; transaction handling
  • Performance: Caching (ElastiCache); query optimization; connection pooling

Cost Optimization:

  • Use DynamoDB for variable load
  • RDS with read replicas for read-heavy
  • Cache frequently accessed data
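
A minimal Lambda (Python) handler sketch for this pattern behind an API Gateway proxy integration, assuming a hypothetical DynamoDB table with an id partition key:

```python
import json
import os
import uuid

import boto3

# Hypothetical table name supplied via environment; the execution role
# needs dynamodb:PutItem and dynamodb:GetItem on it.
TABLE = boto3.resource("dynamodb").Table(os.environ.get("TABLE_NAME", "items"))

def handler(event, context):
    """Minimal API Gateway proxy handler covering Create and Read."""
    method = event.get("httpMethod", "GET")
    if method == "POST":
        item = json.loads(event["body"])
        item["id"] = str(uuid.uuid4())
        TABLE.put_item(Item=item)
        return {"statusCode": 201, "body": json.dumps({"id": item["id"]})}
    item_id = event["pathParameters"]["id"]
    resp = TABLE.get_item(Key={"id": item_id})
    if "Item" not in resp:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 200, "body": json.dumps(resp["Item"], default=str)}
```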

4.2 Event-Driven Orchestration

Definition: Components communicate via asynchronous events; central coordinator (Step Functions) directs workflow.

Pattern:

Trigger → Step Functions → [Lambda1, ECS2, Batch3, Lambda4] → EventBridge → Notifications
                                      ↓
                           DynamoDB (state storage)

When to use: Multi-step business workflows (order processing, loan approvals, ML pipelines)

Well-Architected:

  • Reliability: Automatic retries; compensation logic; DLQ for failures
  • Operational Excellence: CloudWatch logs per step; alerts on failure
  • Cost: Pay per state transition; extremely cheap

4.3 Saga Pattern (Distributed Transactions)

Definition: Long-running transaction across multiple services; compensating transactions for rollback.

Pattern:

Service A (order) → EventBridge → Service B (payment) → EventBridge → Service C (fulfillment)
                                          ↓
                                    If payment fails:
                                    Compensate Service A (cancel order)

When to use: Multi-service transactions without distributed locks

Well-Architected:

  • Reliability: Compensation logic; idempotent operations
  • Operational Excellence: Audit trail (events); replayable

4.4 Fan-Out / Fan-In

Definition: One event triggers multiple parallel processors; results aggregated.

Pattern (fan-out):

SNS/EventBridge → Lambda1 (send email)
                → Lambda2 (update metrics)
                → Lambda3 (trigger analytics)

Pattern (fan-in):

Lambda1 \
Lambda2  → Step Functions → Lambda4 (aggregate results)
Lambda3 /

When to use: Parallel processing; decoupled consumers

Well-Architected:

  • Reliability: Fan-out branches fail independently; fan-in aggregates results
  • Performance: Parallel execution reduces latency

4.5 Data Lakehouse (Medallion Architecture)

Definition: Multi-tier data organization (bronze → silver → gold)

Pattern:

Source (API, DB) → S3 Bronze (raw) → Glue (transform) → S3 Silver (cleaned)
                                                             ↓
                                                      Lake Formation (govern)
                                                             ↓
                                                    S3 Gold (curated)
                                                             ↓
                                            Redshift/Athena/QuickSight

When to use: Centralized data platform with multi-consumer access

Well-Architected:

  • Operational Excellence: Glue Data Catalog; Lake Formation permissions
  • Security: Encryption; access control; audit logging
  • Cost: S3 + Athena for ad-hoc; Redshift for BI; Glacier for archive

4.6 CQRS (Command Query Responsibility Segregation)

Definition: Separate read and write models; write commands to event store; query from read-optimized views.

Pattern:

Write Path:             Read Path:
Command → Aggregate → Event Store → Denormalize → Read View
                                                        ↓
                                                    Query (fast, optimized)

When to use: Complex domain logic; audit requirements; multiple read views

Well-Architected:

  • Reliability: Event sourcing provides audit trail; replay for recovery
  • Performance: Read model optimized for queries
  • Cost: DynamoDB for event store + read model; cheap at scale

4.7 Command Pipeline (Batch Job Chain)

Definition: Sequential batch jobs; each transforms data and passes to next

Pattern:

Schedule → Glue1 (extract) → S3 (temp) → Glue2 (transform) → S3 (output) → Redshift (load)
                                                   ↓
                                            Step Functions (orchestration)

When to use: ETL pipelines; scheduled data processing

Well-Architected:

  • Operational Excellence: Step Functions for orchestration; error handling
  • Cost: Glue on-demand for variable workloads; schedule off-peak
  • Reliability: Checkpointing for restartable jobs; DLQ for failed steps

4.8 Agent-Tool Execution (Agentic AI)

Definition: AI agent reasons over user query; selects and executes tools; iterates until complete.

Pattern:

User Query → Bedrock Agent (reasoning) → Tool Selection
                                             ├── Lambda1 (query DB)
                                             ├── Lambda2 (call API)
                                             ├── DynamoDB (read state)
                                             └── OpenSearch (semantic search)
                                                     ↓
                                            Feedback loop (iterate if needed)
                                                     ↓
                                            Final response to user

When to use: Autonomous automation; conversational AI; decision support

Well-Architected:

  • Security: Tool access control; query validation; PII redaction
  • Reliability: Fallback tools; error recovery
  • Observability: Trace tool calls; audit agent decisions

4.9 Streaming Ingestion (Real-Time Data Pipeline)

Definition: Continuous data ingestion; real-time processing; multiple sinks.

Pattern:

Sensors/APIs → Kinesis → Analytics (windowed agg) → DynamoDB (current state)
                                 ↓
                            Lambda (enrichment)
                                 ↓
                    Multiple sinks:
                    ├── DynamoDB (dashboard)
                    ├── S3 (cold storage)
                    ├── OpenSearch (search/alerting)
                    └── SNS (alerts)

When to use: Real-time analytics; alerting; monitoring

Well-Architected:

  • Performance: Partition by shard; parallel processing
  • Cost: On-demand Kinesis for variable; provisioned for baseline
  • Reliability: Consumer checkpointing; DLQ for failed events

4.10 Multi-Region Active-Active

Definition: Same workload deployed in multiple regions; users routed to nearest; data replicated.

Pattern:

Global User → Route 53 (geolocation routing) → Region 1 (ALB → ECS → RDS Aurora Global)
                                              → Region 2 (ALB → ECS → RDS Aurora Global)
                                              → Region 3 (ALB → ECS → RDS Aurora Global)

When to use: Global applications; disaster recovery; low-latency for globally distributed users

Well-Architected:

  • Reliability: Automatic failover; regional disaster recovery
  • Performance: Users served from nearest region
  • Cost: 3x infrastructure cost; justified for critical, global workloads
  • Sustainability: Distributed load


Section 5: AWS Well-Architected Integration (Mandatory)

Every architectural decision must align with the six pillars of the AWS Well-Architected Framework. This section maps decisions to pillars and provides measurable indicators.

Pillar Alignment Template

For every major decision (service choice, architecture pattern, data design), answer:

  1. Operational Excellence: Can teams operate this? Is it observable? Are procedures clear?
  2. Security: Are data and access protected? Is compliance addressed?
  3. Reliability: Can it recover from failure? What is RTO/RPO?
  4. Performance Efficiency: Does it meet latency/throughput targets? Is it optimized?
  5. Cost Optimization: Is it cost-effective? Are there cheaper alternatives?
  6. Sustainability: Does it minimize energy/carbon? Is it resource-efficient?

5.1 Operational Excellence Pillar

Design Principles:

  • Organize teams around business outcomes
  • Implement observability for actionable insights
  • Safely automate where possible
  • Make frequent, small, reversible changes
  • Refine operations procedures frequently
  • Anticipate failure
  • Learn from operational events
  • Use managed services

Key Questions:

  • Are teams organized to own their systems end-to-end?
  • Is every component observable (metrics, logs, traces)?
  • Are deployments automated and safe (canary, blue-green)?
  • Can operators quickly diagnose and respond to issues?
  • Are runbooks and procedures documented and tested?

Best Practices by Service:

| Service | Observability | Automation | Disaster Response |
|---|---|---|---|
| Lambda | CloudWatch Logs, X-Ray | SAM, CDK, CodePipeline | DLQ, retries, reserved concurrency |
| ECS | CloudWatch metrics, Container Insights | CodeDeploy, ECS task placement | Service health checks, auto-restart |
| RDS | Enhanced monitoring, Performance Insights, CloudWatch | AWS Config, automated backups | Multi-AZ failover, read replicas |
| DynamoDB | CloudWatch metrics, X-Ray tracing, TTL monitoring | Point-in-time recovery, backup service | Global tables, on-demand scaling |
| S3 | Access logging, CloudTrail, CloudWatch metrics | Lifecycle policies, inventory, replication | Versioning, cross-region replication, Glacier |
| Kinesis | Iterator age, consumer lag, CloudWatch metrics | Auto-scaling shards, Lambda event mapping | Shard backup via S3, 24-hour retention |

Measurable Indicators:

  • MTTR (Mean Time To Recovery): Target < 15 min for P1 incidents
  • Change failure rate: < 10% of deployments cause incidents
  • Deployment frequency: Daily or more
  • On-call fatigue: < 1 page/week per engineer

Anti-Patterns:

  • ❌ Manual deployments; no version control
  • ❌ No monitoring; discovering issues from customers
  • ❌ Monolithic applications; all-or-nothing deployments
  • ❌ Undocumented procedures; tribal knowledge

5.2 Security Pillar

Design Principles:

  • Implement strong identity foundation (least privilege)
  • Maintain traceability (audit all actions)
  • Protect data in transit and at-rest
  • Detect and investigate security events
  • Protect infrastructure
  • Prepare for security events

Key Questions:

  • Are all principals (users, roles, services) authenticated?
  • Are permissions least-privilege (does role have exactly needed permissions)?
  • Is all data encrypted (transit TLS, at-rest KMS)?
  • Is access logged and audited?
  • Are data classification and sensitivity levels defined?
  • Are compliance requirements met (HIPAA, PCI, GDPR)?

Best Practices by Tier:

| Layer | Best Practice | Implementation |
|---|---|---|
| Identity & Access | Least privilege, role-based access | IAM policies, assume roles across accounts, MFA for console |
| Infrastructure | Network isolation, security groups | VPC, security groups (stateful), NACLs (stateless), WAF for web |
| Data Protection | Encrypt all data | KMS at-rest (S3, RDS, EBS), TLS in-transit, Secrets Manager for credentials |
| Detection | Monitor and alert on anomalies | CloudTrail (API audit), GuardDuty (threats), Config (compliance), Security Hub (aggregation) |
| Incident Response | Automated and manual playbooks | CloudWatch alarms → SNS → Lambda/email; IAM roles for responders |

Measurable Indicators:

  • 100% of data encrypted at-rest and in-transit
  • All API calls logged to CloudTrail
  • Zero exposed credentials (Secrets Manager, no hardcoding)
  • Compliance audit pass rate: 100% on critical controls
  • Incident detection time: < 5 min for automated, < 1 hour for manual

Anti-Patterns:

  • ❌ Hardcoded credentials in code/config
  • ❌ Public S3 buckets (unless intentional)
  • ❌ Over-permissive IAM roles (e.g., AdministratorAccess)
  • ❌ No encryption; no audit logging
  • ❌ Manual credential rotation

5.3 Reliability Pillar

Design Principles:

  • Automatically recover from failure
  • Test failure scenarios
  • Stop guessing capacity
  • Manage change via automation

Key Questions:

  • Can the system recover automatically from failures?
  • Have failure scenarios been tested (chaos engineering)?
  • Is capacity provisioned to handle peaks without manual intervention?
  • Are changes made safely (automated rollback)?

Best Practices by Scenario:

| Scenario | Strategy | Implementation |
|---|---|---|
| Service Failure | Auto-restart, circuit breaker | ECS health checks, ALB target deregistration, Step Functions retries |
| Database Failure | Multi-AZ failover, read replicas | Aurora Multi-AZ, RDS Multi-AZ, DynamoDB auto-replication |
| Data Loss | Backups, point-in-time recovery | RDS automated backups, S3 versioning, DynamoDB PITR, DMS (CDC) |
| Region Failure | Disaster recovery, multi-region | Cross-region replication (S3, RDS, DynamoDB global tables), Route 53 failover |
| Capacity Overload | Auto-scaling, circuit breakers | ASG, Lambda concurrency, SQS queue buffering, Step Functions error handling |

RTO & RPO by Workload:

| Workload Criticality | RTO | RPO | Strategy |
|---|---|---|---|
| Critical (financial, healthcare) | < 1 hour | Zero data loss | Multi-AZ + multi-region, synchronous replication, event sourcing |
| High (revenue-impacting) | 1-4 hours | < 1 hour | Multi-AZ, async replication, hourly backups |
| Medium (operational) | 4-24 hours | < 1 day | Single AZ, daily snapshots |
| Low (dev/test) | > 1 day | Not applicable | Backups acceptable; manual recovery OK |

Measurable Indicators:

  • Availability: 99.9% for critical, 99% for high
  • MTTR: < 5 min for automated recovery
  • Unplanned downtime: < 43 min/month for 99.9%
  • Recovery test success: 100% of DR scenarios annually
  • Change rollback success: < 5 min

Anti-Patterns:

  • ❌ Single points of failure (single AZ, single instance)
  • ❌ No backups or unverified recovery
  • ❌ Manual scaling (capacity runs out during spikes)
  • ❌ No circuit breakers (cascading failures)
  • ❌ Changes without automated rollback

5.4 Performance Efficiency Pillar

Design Principles:

  • Democratize advanced technologies
  • Go global in minutes (CloudFront)
  • Use serverless for variable workloads
  • Experiment often
  • Mechanical sympathy (align tech to workload)

Key Questions:

  • Does the system meet latency targets?
  • Is throughput optimized for the workload?
  • Are expensive operations (queries, compute) optimized?
  • Is caching used where beneficial?

Optimization by Service:

| Service | Key Optimizations | Measurements |
|---|---|---|
| API Gateway | Caching, throttling, request validation | p99 latency < 200ms |
| Lambda | Provisioned concurrency, memory tuning, connection reuse | Cold start < 100ms, warm < 50ms |
| RDS/Aurora | Read replicas, indexes, query optimization, connection pooling | Query p95 < 100ms, throughput per core |
| DynamoDB | Partition key design, GSI, DAX cache, batch operations | p99 < 10ms, no hot partitions |
| S3 | Multipart upload, batch operations, Transfer Acceleration, Intelligent-Tiering | Upload throughput, object retrieval latency |
| CloudFront | Origin Shield, cache headers, compression, HTTP/2 | p99 latency < 100ms globally |

Performance by Workload:

| Workload Type | Latency Target | Optimization |
|---|---|---|
| Synchronous (API) | p99 < 200ms | Caching, query optimization, parallel requests |
| Real-time | p99 < 100ms | Local cache (ElastiCache), connection reuse, batch operations |
| Batch | Throughput optimized | Parallel processing, partitioning, appropriate instance size |
| Streaming | Sub-second processing | Partition key design, Kinesis shards, parallel Lambda invocation |

Measurable Indicators:

  • p99 latency: < target threshold
  • Throughput: Scales linearly with resources (no bottlenecks)
  • Resource utilization: 60-80% for optimal cost/performance

Anti-Patterns:

  • ❌ N+1 queries (repeated DB calls in loops)
  • ❌ No caching (repeated expensive operations)
  • ❌ Synchronous processing (blocking calls)
  • ❌ Single-threaded/single-shard processing (can't parallelize)
  • ❌ Unoptimized queries (full table scans)

5.5 Cost Optimization Pillar

Design Principles:

  • Implement Cloud Financial Management
  • Measure and attribute expenditure
  • Stop paying for under-utilized resources
  • Analyze and optimize over time
  • Use managed services to reduce operational cost

Key Questions:

  • Is the current spend justified by business value?
  • Are there cheaper service alternatives?
  • Are discounts being used (Reserved Instances, Savings Plans)?
  • Is capacity right-sized?
  • Are unused resources being cleaned up?

Cost Optimization by Service:

| Service | Cost Driver | Optimization Strategy |
|---|---|---|
| Lambda | Requests + GB-seconds | On-demand for variable; consolidate functions if baseline traffic |
| ECS | EC2 instance hours | Fargate for variable; EC2 + Savings Plans for stable |
| RDS | Instance-hours + storage | Right-size instances; use Reserved Instances (1-yr: ~31% off, 3-yr: ~43% off); read replicas for read-heavy |
| DynamoDB | Provisioned RCU/WCU or on-demand | On-demand for unpredictable; provisioned for baseline; consider reserved capacity |
| S3 | Storage + requests + transfer | Intelligent-Tiering for unknown access; Glacier for archive; delete unused data |
| Redshift | Instance-hours + storage | Reserved instances; use Spectrum for external data; pause during low traffic |
| Kinesis | Shard-hours or GB ingested | On-demand for variable; provisioned for baseline; batch and compress |

Pricing Model Selection:

| Workload Pattern | Recommended | Cost Savings |
|---|---|---|
| Predictable, high utilization | Reserved Instances (3-year) | 65% off on-demand |
| Predictable, multi-service baseline | Savings Plans (3-year) | 60-65% off on-demand; flexible across services |
| Variable, bursty | On-demand or serverless | No commitment; higher per-unit cost |
| Batch, interruptible | Spot instances | 70% off on-demand |

Cost Monitoring & Attribution:

  • Cost tags per application, team, cost center
  • Budget alerts in Cost Explorer
  • Savings Plans recommendations (automated)
  • Trusted Advisor for cost optimization opportunities

Measurable Indicators:

  • Cost per transaction: Trending down quarter-over-quarter
  • Utilization: EC2 CPU 60-80%; under-utilized instances identified and removed
  • Discount penetration: > 70% of compute cost on Reserved/Savings Plans
  • Monthly savings from optimization: Documented and tracked

Anti-Patterns:

  • ❌ Large reserved instance commitment for unproven workloads
  • ❌ Oversized instances (paying for unused capacity)
  • ❌ Running dev/test environments 24/7
  • ❌ Keeping data in expensive storage (not using Intelligent-Tiering)
  • ❌ No cost allocation or visibility

5.6 Sustainability Pillar

Design Principles:

  • Understand your sustainability impact
  • Establish goals and measure impact
  • Maximize utilization (reduce waste)
  • Adopt efficient hardware and architecture
  • Use managed services (AWS optimizes for efficiency)
  • Reduce downstream impact (minimize data transfer)

Key Sustainability Decisions:

| Decision | High-Impact Option | Impact |
|---|---|---|
| Compute selection | Serverless + managed services | No idle infrastructure; AWS amortizes overhead |
| Regional placement | Use AWS Regions with renewable energy | Check AWS Sustainability Report; PPA-backed regions |
| Data storage | S3 Intelligent-Tiering → Glacier | Reduces storage footprint; archive old data |
| Instance types | Graviton, Trainium (AWS-built chips) | Higher energy efficiency than x86 |
| Architecture | Batch processing, off-peak scheduling | Consolidate; avoid running 24/7 if not needed |
| Data transfer | Minimize inter-region/public data transfer | CloudFront for global distribution; VPC Endpoints for internal |

Measurable Indicators:

  • Kilograms CO2 per transaction: Trending down
  • Instances with < 20% utilization: Identified and consolidated
  • Data stored in cold tiers (Glacier): Percentage of total
  • Energy efficiency score (AWS Carbon Intelligence): Tracking vs. industry baseline

Anti-Patterns:

  • ❌ Running compute 24/7 (especially development/test)
  • ❌ No data lifecycle policies (keeping hot storage forever)
  • ❌ Using x86 instances when Graviton available
  • ❌ Not considering region sustainability impact

Section 6: Cost & Scale Modeling Framework

6.1 Fixed vs. Variable Cost Analysis

Fixed Cost (per month, regardless of usage):

  • NAT Gateway: $32/month per AZ
  • ALB: ~$16/month base (plus LCU usage)
  • Redshift cluster: from ~$0.25/hour per node (single dc2.large ≈ $180/month)
  • RDS instance: $50-$1000+/month (instance-based)

Variable Cost (per unit of usage):

  • Lambda: $0.0000002 per request + $0.0000166667 per GB-second
  • SQS: $0.40 per 1M requests
  • S3: $0.023 per GB/month + $0.0004 per 1k requests
  • Data transfer out: $0.09 per GB

6.2 Cost at Different Scales

Scenario: Build an API backend

Low Scale (100 req/sec, < 1 TB data):

  • Lambda: ~$52/month (≈260M requests) + ~$54/month (compute at 128 MB × 100 ms) ≈ $106/month ✓ Cheapest
  • ECS Fargate: ~$36/month per always-on task (1 vCPU, 2 GB); a small HA service (2 tasks + ALB + NAT) ≈ $150/month
  • RDS t3.micro: $17/month

Medium Scale (1000 req/sec, 100 GB data):

  • Lambda + DynamoDB: $40 (requests) + $1000 (compute) + $50 (DB) = $1090
  • ECS + RDS: $1400 (compute) + $200 (RDS t3.small) = $1600
  • Winner: Lambda still cheaper (less infrastructure overhead)

High Scale (10,000 req/sec, 1 TB data):

  • Lambda + DynamoDB: $400 + $10,000 + $500 = $10,900
  • ECS + RDS Aurora: $4000 (compute) + $1000 (Aurora) + $1000 (storage) = $6000 ✓ Cheaper
  • Winner: ECS + provisioned services (fixed cost amortized over high traffic)

6.3 Break-Even Analysis

Lambda vs. ECS (Fargate) Example (approximate us-east-1 prices):

  • Lambda: $0.0000002 per request + $0.0000166667 per GB-second

  • For a 1 GB Lambda with 100 ms execution:

    • Cost per request: $0.0000002 + (1 GB × $0.0000166667 × 0.1 s) ≈ $0.00000187
    • 1M requests ≈ $1.87
  • Fargate (1 vCPU, 2 GB): $0.04048 + 2 × $0.004445 ≈ $0.0494/hour ≈ $36/month per always-on task

  • At 100 ms per request, one 1-vCPU task handles ~10 req/sec = 864k req/day ≈ 26M req/month

  • Cost per request at full utilization: $36 / 26M ≈ $0.0000014

Breakeven: roughly 7 sustained requests/sec per vCPU-sized task (~70% utilization). Below that, Lambda's pay-per-use wins; above it, the always-on container wins, and the gap grows with fleet size. The exact crossover shifts with memory size, execution time, and how spiky the traffic is.
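
The same arithmetic as a small Python sketch, using the approximate unit prices above:

```python
# Reproduces the break-even arithmetic above (approximate us-east-1 prices).
LAMBDA_REQUEST = 0.0000002          # $ per request
LAMBDA_GB_SECOND = 0.0000166667     # $ per GB-second
FARGATE_VCPU_HOUR = 0.04048         # $ per vCPU-hour
FARGATE_GB_HOUR = 0.004445          # $ per GB of memory per hour

# Per-request Lambda cost: 1 GB memory, 100 ms execution.
lambda_per_request = LAMBDA_REQUEST + 1.0 * LAMBDA_GB_SECOND * 0.1

# Always-on Fargate task: 1 vCPU, 2 GB.
fargate_per_hour = FARGATE_VCPU_HOUR + 2 * FARGATE_GB_HOUR
fargate_per_month = fargate_per_hour * 730

# Sustained request rate at which Lambda spend equals one always-on task.
break_even_rps = fargate_per_hour / (lambda_per_request * 3600)

print(f"Lambda per 1M requests: ${lambda_per_request * 1e6:,.2f}")
print(f"Fargate task per month: ${fargate_per_month:,.2f}")
print(f"Break-even: ~{break_even_rps:.1f} sustained req/sec per task")
```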

6.4 Data Gravity Analysis

Question: Should data stay in one region or be replicated?

Factors:

  • Data transfer cost out: $0.09 per GB (public internet)
  • S3 cross-region replication: no extra replication fee, but you pay inter-region transfer plus storage in the destination region
  • RDS Multi-AZ: replication traffic is free, but the standby roughly doubles instance cost
  • RDS cross-region: Paid data transfer ($0.02/GB)

Decision Matrix:

  • Single region, high data volume (100 TB+) and distributed users: CloudFront for distribution (caching layer)
  • Multi-region for disaster recovery: Use cross-region replication; amortize cost over disaster scenarios
  • Multi-region active-active: High cost; only for mission-critical global workloads

6.5 Cost Optimization Levers

Ordered by impact:

  1. Right-sizing (biggest impact): Reduce instance size; use Intelligent-Tiering

    • Potential savings: 30-50% if currently over-provisioned
  2. Commitment discounts (Savings Plans, Reserved Instances): 40-65% off

    • Potential savings: 40-65% of compute cost
    • Requirement: 70%+ predictable baseline
  3. Reserved capacity (DynamoDB, Redshift): 50%+ off

    • Potential savings: 50% of baseline database cost
    • Requirement: Stable, known baseline
  4. Spot instances (for interruptible workloads): 70% off

    • Potential savings: 70% of batch compute cost
    • Trade-off: Interruption risk acceptable
  5. Architecture changes (Lambda vs ECS, S3 Intelligent-Tiering): 20-50% off

    • Potential savings: Varies by workload; biggest impact for right-sizing to service tier
  6. Data transfer optimization: 20-30% off data costs

    • Use CloudFront for global distribution; keep data local; compress

Section 7: Evolution & Change Strategy

7.1 MVP → Scale

MVP Phase (0-6 months, 1-100 users):

  • Use serverless (Lambda, Fargate, DynamoDB on-demand)
  • Minimize operational burden
  • Fast iteration; don't optimize prematurely
  • Architecture: API Gateway → Lambda → DynamoDB
  • Cost: ~$50-500/month

Scale Phase (6+ months, 1k-1M users):

  • Identify bottlenecks; migrate to provisioned services as needed
  • Add caching (CloudFront, ElastiCache)
  • Optimize costs: Reserved Instances, Savings Plans
  • Architecture: API Gateway → ECS/Lambda → RDS Aurora with read replicas → CloudFront
  • Cost: $1k-10k/month

Mature Phase (2+ years, 1M+ users):

  • Multi-region active-active for resilience
  • Advanced caching and CDN
  • Dedicated infrastructure; right-sized instances
  • Cost: $10k+/month

7.2 Monolith → Microservices

Phase 1: Decompose (0-3 months)

  • Identify service boundaries (domain-driven design)
  • Build event-driven orchestration (EventBridge, Step Functions, SNS/SQS)
  • Keep monolith and microservices running in parallel

Phase 2: Strangle Fig (3-12 months)

  • Gradually route traffic to microservices
  • Retire monolith modules as traffic shifts
  • Maintain backward compatibility

Phase 3: Mature (12+ months)

  • All traffic on microservices
  • Optimize service communication (caching, circuit breakers)
  • Consider service mesh (Istio) if > 20 services

7.3 Single-Region → Multi-Region

Phase 1: Failover (0-3 months)

  • Set up cross-region backup/restore
  • Automate backup and restore testing
  • RTO: hours; manual failover

Phase 2: Active-Passive (3-6 months)

  • Set up read replicas in secondary region
  • Automate failover via Route 53 health checks
  • RTO: < 5 min; automatic

Phase 3: Active-Active (6-12 months)

  • Data replicated bidirectionally (Aurora Global, DynamoDB Global Tables)
  • Load balanced across regions (Route 53 geolocation)
  • RTO: < 1 min; automatic; full active workload in both regions

7.4 Manual → Fully Automated

Infrastructure as Code (Week 0-2):

  • CloudFormation, Terraform, or CDK
  • Version-control infrastructure
  • Enable reproducible deployments
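
A minimal AWS CDK (Python, v2) sketch of defining a versioned, encrypted bucket as code; the stack and construct IDs are illustrative:

```python
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Versioned, encrypted bucket defined as code and tracked in version control.
        s3.Bucket(
            self,
            "RawDataBucket",
            versioned=True,
            encryption=s3.BucketEncryption.S3_MANAGED,
            removal_policy=RemovalPolicy.RETAIN,
        )

app = App()
DataStack(app, "data-stack")
app.synth()
```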

CI/CD Pipeline (Week 2-4):

  • GitHub/CodeCommit → CodeBuild → CodeDeploy
  • Automated tests before deployment
  • Blue-green or canary deployments

Operations Automation (Week 4-8):

  • CloudWatch alarms → SNS → Lambda (auto-remediation)
  • Patch automation (SSM Patch Manager)
  • Infrastructure health monitoring (Config, GuardDuty)

Observability (Week 8-12):

  • CloudWatch Logs Insights for querying
  • X-Ray for distributed tracing
  • Custom dashboards; alerts on SLO breaches

7.5 Static Systems → AI-Driven

Phase 1: Monitoring & Insights (0-3 months)

  • Collect metrics, logs, traces
  • Identify anomalies (CloudWatch Anomaly Detection)
  • Manual decision-making based on data

Phase 2: Basic Automation (3-6 months)

  • Auto-scaling rules based on metrics
  • CloudWatch alarms trigger Lambda for remediation
  • Chatbots for simple queries (Lex)

Phase 3: Intelligent Decision-Making (6-12 months)

  • ML models predict optimal resource allocation
  • Optimization recommendations (Cost Optimization Hub)
  • Agentic AI for complex decision workflows

Phase 4: Autonomous Operations (12+ months)

  • Agents autonomously execute remediation
  • Human-in-the-loop for high-impact decisions
  • Continuous learning from outcomes

Section 8: Decision Playbooks & Checklists

8.1 Universal Architecture Decision Checklist

Phase 1: Requirements Gathering (Week 1)

  • Define business intent and success metrics
  • Identify users, actors, and access patterns
  • Classify data (volume, velocity, variety, sensitivity)
  • Determine workload type (sync/async, batch/streaming)
  • Estimate traffic patterns and scale requirements
  • Define availability targets (RTO, RPO)
  • Document security and compliance needs
  • Set cost constraints
  • Assess team capability and operational tolerance
  • Plan for extensibility and evolution

Output: Requirement document (1-2 pages)


Phase 2: Service Selection (Week 2)

Compute:

  • Lambda, ECS, EKS, EC2, or Batch?
  • Decision factor: Scale, latency, operational burden
  • Cost analysis: Provisioned vs. on-demand

Storage:

  • S3 (objects), EBS (block), EFS (file), RDS/DynamoDB (DB)?
  • Durability and availability targets met?
  • Encryption, compliance requirements met?

Database:

  • RDS (relational), DynamoDB (key-value), Redshift (warehouse), OpenSearch (search)?
  • Access patterns validated against service?
  • Scaling behavior acceptable?

Integration:

  • Lambda functions, API Gateway, EventBridge, SQS/SNS, Kinesis, Step Functions?
  • Coupling/decoupling adequate?
  • Error handling strategy defined?

Output: Service decision matrix (1 page)


Phase 3: Architecture Design (Week 3)

  • Draw high-level architecture (compute → storage → database)
  • Identify data flows (sync vs. async)
  • Map to architecture patterns (CRUD, event-driven, streaming, etc.)
  • Define failure scenarios and recovery strategy
  • Calculate cost at baseline, peak, and 2x peak scale
  • Identify cost optimization opportunities
  • Review against Well-Architected pillars

Output: Architecture diagram + Well-Architected scorecard


Phase 4: Implementation & Deployment (Weeks 4-6)

  • Code infrastructure (CloudFormation, Terraform, CDK)
  • Implement logging, monitoring, alerting (CloudWatch, X-Ray)
  • Set up CI/CD pipeline (CodePipeline, CodeBuild, CodeDeploy)
  • Write runbooks for operational procedures
  • Test failure scenarios (chaos engineering)
  • Performance test at 2x expected peak load
  • Security audit and penetration testing
  • Compliance validation (HIPAA, PCI, etc. if applicable)

Output: Deployment checklist, runbooks, test results


Phase 5: Go-Live (Week 7)

  • Production readiness review (deployment, security, operations)
  • Gradual rollout (5% → 25% → 50% → 100% traffic)
  • Monitor golden signals (latency, error rate, throughput)
  • Alert thresholds defined and tested
  • Incident response team briefed and on-call
  • Backup/disaster recovery procedures verified

Output: Incident response playbook, monitoring dashboard


Phase 6: Optimization & Learning (Week 8+)

  • Review cost monthly; identify optimizations
  • Analyze performance metrics; optimize hot paths
  • Conduct retrospectives on incidents
  • Update architecture based on learnings
  • Refine automation and runbooks

Output: Quarterly optimization report


8.2 Service Selection Flowchart

Compute Decision Tree:

"How long does the job run?"
├─ < 15 min
│  └─ "Variable or bursty load?"
│     ├─ Yes → Lambda
│     └─ No → "Latency critical?"
│        ├─ Yes → ECS (Fargate, warm)
│        └─ No → EC2 (if always-on cheaper)
├─ 15 min - 1 hour
│  └─ Batch (for batch jobs) or ECS (for services)
└─ > 1 hour
   └─ "Distributed processing needed?"
      ├─ Yes → EMR (Spark/Hadoop)
      └─ No → EC2, ECS, or SageMaker

Database Decision Tree:

"What type of queries?"
├─ SQL, complex joins, transactions
│  └─ RDS (MySQL, PostgreSQL) or Aurora (high throughput)
├─ Key-value, documents, real-time
│  └─ DynamoDB (provisioned for baseline, on-demand for variable)
├─ Large-scale analytics, BI
│  └─ Redshift (petabyte-scale OLAP)
├─ Full-text search, time-series logs
│  └─ OpenSearch
└─ Graph, time-series, other
   └─ Neptune (graph), Timestream (time-series), Keyspaces (Cassandra)

8.3 Failure Scenario Modeling

Template: For each critical component, document:

  1. Component: e.g., RDS database, API Gateway, Lambda function
  2. Failure Mode: e.g., instance crash, network partition, application error
  3. Impact: e.g., "500 errors for 5 min, 1000 failed requests"
  4. Detection: e.g., "CloudWatch alarm: 5xx error rate > 1%"
  5. Recovery Time: e.g., "< 1 min (automatic failover)"
  6. Prevention: e.g., "Multi-AZ, read replicas, automated health checks"

Example Scenarios:

| Component | Failure | Impact | Detection | Recovery | Prevention |
|---|---|---|---|---|---|
| RDS (primary) | Instance crash | DB unavailable | CloudWatch metrics | Auto-failover to standby (~1 min) | Multi-AZ |
| Lambda function | Code error | 500 responses | CloudWatch error metrics, X-Ray | Automatic retry; DLQ for inspection | Unit tests, canary deployment |
| API Gateway | DDoS attack | Request throttling | CloudWatch request count | Auto-scaling, WAF, Shield | WAF rules, rate limiting |
| S3 bucket | Objects accidentally deleted | Data loss | CloudWatch metrics drop | Restore from versioning or backup | Versioning enabled, lifecycle policies |

8.4 Security Threat Modeling

Threat Model Template (STRIDE):

| Threat Category | Threat | Mitigation | Implementation |
|---|---|---|---|
| S - Spoofing | Attacker impersonates API caller | Strong authentication, API key validation | API Gateway API keys, IAM roles, OAuth |
| T - Tampering | Attacker modifies data in transit | Encryption, integrity checks | TLS, HMAC, message signing |
| R - Repudiation | Attacker denies action | Audit logging, immutable records | CloudTrail, DynamoDB event sourcing |
| I - Information Disclosure | Attacker accesses sensitive data | Encryption, access control | KMS, IAM, data classification |
| D - Denial of Service | Attacker floods system | Rate limiting, auto-scaling, WAF | API Gateway throttling, WAF rules, Shield |
| E - Elevation of Privilege | Attacker gains higher access | Least privilege, MFA, role separation | IAM policies, MFA, role assumption logs |

Section 9: Production Deployment Patterns

9.1 Blue-Green Deployment

Definition: Run two identical production environments (blue, green); switch traffic to green after validation.

Benefits & Trade-offs:

  • Zero-downtime deployments
  • Easy rollback (switch back to blue)
  • Thorough testing in the production environment before it receives live traffic
  • Trade-off: 2x infrastructure cost during the (brief) deployment window

Implementation:

Users → ALB → Blue (v1) [current]
           → Green (v2) [standby, being deployed]

Test Green thoroughly; if successful:
ALB → Green (v2) [becomes current]
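
On AWS, the switch itself can be as small as repointing the ALB listener's default action from the blue target group to the green one. A minimal boto3 sketch, assuming the listener and target group ARNs are already known (placeholders below):

import boto3

elbv2 = boto3.client("elbv2")

def switch_to_green(listener_arn: str, green_target_group_arn: str) -> None:
    """Point the listener's default action at the green target group.
    Rolling back is the same call with the blue target group's ARN."""
    elbv2.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[{
            "Type": "forward",
            "TargetGroupArn": green_target_group_arn,
        }],
    )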

9.2 Canary Deployment

Definition: Gradually route traffic to the new version; roll back if the error rate exceeds a threshold.

Benefits:

  • Low-risk; easy rollback
  • Real user traffic tests new code
  • Immediate detection of issues
  • Cost: Minimal additional infrastructure

Implementation:

Minute 0: 5% traffic → v2
Minute 5: 25% → v2 (if error rate normal)
Minute 10: 50% → v2
Minute 15: 100% → v2

If the error rate exceeds the threshold at any step: roll back (route 0% of traffic to v2)
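
For a Lambda-based API, one way to drive this schedule is weighted alias routing. The sketch below is illustrative only: it assumes an alias named "live", an already-published new version, and an external error-rate check between steps (not shown).

import time
import boto3

lam = boto3.client("lambda")

def shift_canary(function_name: str, new_version: str, alias: str = "live") -> None:
    for weight in (0.05, 0.25, 0.50):
        lam.update_alias(
            FunctionName=function_name,
            Name=alias,
            RoutingConfig={"AdditionalVersionWeights": {new_version: weight}},
        )
        time.sleep(300)  # observe the error rate for ~5 minutes (check not shown)
        # Rolling back at any step: update_alias with RoutingConfig={} so all
        # traffic stays on the current version.
    # Promote: send 100% of traffic to the new version and clear the split
    lam.update_alias(
        FunctionName=function_name,
        Name=alias,
        FunctionVersion=new_version,
        RoutingConfig={},
    )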

9.3 Feature Flags

Definition: Deploy code without enabling features; toggle features on/off without redeployment.

Benefits:

  • Decouple deployment from release
  • Kill switches for problematic features
  • Gradual feature rollout

Implementation:

if feature_flags.get("new_checkout_flow"):
    ...  # New code path
else:
    ...  # Old code path
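
In practice the flag values usually come from a managed store rather than hard-coded configuration. A hedged sketch using AWS AppConfig feature flags; the application, environment, and profile identifiers, and the flag name, are illustrative assumptions.

import json
import boto3

client = boto3.client("appconfigdata")

# Illustrative identifiers for an AppConfig feature-flag profile
session = client.start_configuration_session(
    ApplicationIdentifier="checkout-app",
    EnvironmentIdentifier="prod",
    ConfigurationProfileIdentifier="feature-flags",
)
response = client.get_latest_configuration(
    ConfigurationToken=session["InitialConfigurationToken"],
)
feature_flags = json.loads(response["Configuration"].read())

if feature_flags.get("new_checkout_flow", {}).get("enabled"):
    ...  # new code path
else:
    ...  # old code path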

Section 10: AWS Service Selection Quick Reference

| Use Case | Primary Service | Alternative | Trade-off |
|---|---|---|---|
| Static website | S3 + CloudFront | API Gateway + Lambda | Simpler (S3) vs. dynamic (Lambda) |
| REST API | API Gateway + Lambda | ECS + ALB | Serverless (Lambda) vs. control (ECS) |
| Microservices | ECS + ALB + SQS/SNS | EKS | Simplicity (ECS) vs. power (EKS) |
| Database (SQL) | RDS Aurora | RDS PostgreSQL | Performance/scale (Aurora) vs. cost (basic RDS) |
| Real-time database | DynamoDB | RDS | Milliseconds (DynamoDB) vs. complex queries (RDS) |
| Data warehouse | Redshift | Athena | Complex queries (Redshift) vs. ad-hoc (Athena) |
| Log analysis | OpenSearch | CloudWatch Logs Insights | Powerful search (OpenSearch) vs. simple (CloudWatch) |
| Batch processing | Glue or Batch | EMR | Simplicity (Glue) vs. control (EMR) |
| Streaming | Kinesis | MSK | AWS-native (Kinesis) vs. portable (MSK) |
| Event routing | EventBridge | SNS + SQS | Rich routing (EventBridge) vs. simple (SNS/SQS) |
| Workflow orchestration | Step Functions | Apache Airflow (MWAA) | Simplicity (Step Functions) vs. power (Airflow) |
| ML model training | SageMaker | EC2 + Jupyter | Managed (SageMaker) vs. DIY (EC2) |
| LLM applications | Bedrock | SageMaker + custom models | Ease (Bedrock) vs. control (SageMaker) |

Conclusion

This universal framework enables architects to:

  1. Decompose any problem into 10 fundamental dimensions
  2. Classify workloads into generic archetypes (request/response, event-driven, streaming, batch, workflows, data platforms, AI/ML, edge, hybrid)
  3. Select AWS services with explicit decision criteria, tradeoffs, cost models, and scaling behavior
  4. Design architectures using proven patterns (CRUD, event-driven, saga, fan-out/fan-in, data lakehouse, CQRS, streaming, multi-region)
  5. Align with Well-Architected pillars (operational excellence, security, reliability, performance, cost, sustainability) with measurable indicators
  6. Model costs and scale across different workload profiles and identify breakeven points
  7. Plan evolution from MVP to scale, monolith to microservices, single-region to multi-region
  8. Execute with confidence using playbooks, checklists, and deployment patterns

This guide is domain-agnostic and applies to financial systems, consumer apps, data platforms, AI/ML systems, real-time systems, batch workloads, legacy migrations, greenfield and brownfield architectures.

Use this as a reference handbook for system design, an enterprise architecture playbook for organizations, a teaching and onboarding reference for cloud teams, and a foundation for automating AWS architectural decisions.


Version: 1.0 (December 2024)
Last Updated: December 15, 2024
Framework Alignment: AWS Well-Architected Framework (June 2024)
