AWS Universal Architecture Guide
Comprehensive domain-agnostic framework for analyzing, decomposing, and implementing any cloud system on AWS with best practices.
AWS Universal Architecture Design & Decision Framework
A Domain-Agnostic Reference Guide for Analyzing, Decomposing, and Implementing Any Cloud System on AWS
Executive Summary
This guide provides a universal, repeatable methodology for transforming any business requirement, technical problem, or system constraint into a correct, scalable, and cost-efficient AWS architecture. It is domain-agnostic and works for financial systems, consumer apps, data platforms, AI/ML systems, real-time workloads, batch processes, legacy migrations, and unknown system types.
The guide functions as:
- A system design handbook for architects and engineers
- An enterprise architecture playbook for organizations
- A teaching and onboarding reference for cloud teams
- A foundation for automating AWS architectural decisions
Section 1: Universal Requirement Decomposition Framework
Any architectural problem—regardless of domain—must be decomposed across 10 critical dimensions. This framework ensures no aspect of your system is overlooked.
1.1 Business Intent
Purpose: Understand the "why" before the "what."
Guiding Questions:
- What is the core business value this system delivers?
- What are the primary revenue drivers or cost reduction targets?
- Who are the end users (internal employees, external customers, partners)?
- What is the go-to-market timeline and business phase (MVP, scaling, mature)?
- Are there regulatory, compliance, or governance mandates?
- What are the competitive pressures or market differentiators?
Decision Heuristic: Business intent drives all architectural priorities. A fast market entry (MVP) justifies serverless and managed services over optimal scaling. A mission-critical financial system prioritizes reliability and auditability over speed-to-market.
Output: A single-paragraph "business thesis" that frames all downstream decisions.
1.2 User & System Actors
Purpose: Identify who interacts with the system and how.
Guiding Questions:
- Who are the primary users? (internal ops, external customers, partners, machines)
- How many concurrent users? (10, 10k, 1M+)
- What is the geographic distribution? (single region, multi-region, global)
- Are there bots, APIs, or automated integrators?
- What are the access patterns? (browser, mobile, API, batch, event-driven)
- Are there downstream systems that depend on this?
Decision Heuristic:
- Large distributed user base → CDN, global load balancing, edge computing.
- Batch/job consumers → Event-driven, Step Functions orchestration.
- API integrators → API Gateway, rate limiting, schema validation.
- Internal tools only → Simplified networking, reduced durability requirements.
Output: Actor matrix (type, count, geography, interaction pattern, SLA expectations).
1.3 Data Characteristics
Purpose: Classify the Volume, Velocity, Variety, and Sensitivity of data flowing through the system.
Volume
- Gigabytes → RDS, DynamoDB, S3
- Terabytes → Redshift, EMR, Athena on S3
- Petabytes → Data lakes, distributed processing (Spark/Flink on EMR)
Velocity
- Batch (hourly/daily) → Glue jobs, Lambda scheduled, batch processing
- Near-real-time (seconds) → Kinesis, MSK, Lambda streaming
- Real-time (milliseconds) → Kinesis with parallel consumers, DynamoDB Streams, SQS
- Streaming (continuous) → Kinesis Data Analytics, Flink, Kafka
Variety
- Structured (SQL schemas) → RDS, Aurora, DynamoDB
- Semi-structured (JSON, Avro, Parquet) → S3 + Athena, Glue Data Catalog
- Unstructured (images, video, logs) → S3, OpenSearch for text search
- Mixed → Lake Formation for unified governance
Sensitivity (Data Classification)
- Public → No encryption, standard S3 access
- Internal → Standard encryption, IAM access control
- Confidential → KMS encryption, VPC isolation, audit logging
- PII/Regulated (HIPAA, PCI, GDPR) → Encryption, access auditing, data retention policies, DLP tools
Decision Heuristic:
- High volume + high sensitivity → Use AWS Glue for data classification, Lake Formation for fine-grained access control.
- High velocity + high variety → Kinesis or MSK for ingestion; Lambda + DynamoDB for processing.
- Mixed sensitivity → Separate data layers by classification; encrypt all; audit all access.
Output: Data catalog (volume, velocity, variety, sensitivity, retention, lineage).
1.4 Workload Type
Purpose: Classify the synchronicity and scheduling of work.
Synchronous Workloads
- User waits for response
- Low latency required (milliseconds to seconds)
- Examples: web requests, API calls, real-time queries
- AWS Services: Lambda, API Gateway, ECS, RDS, DynamoDB, ElastiCache
- Scaling: Auto-scale based on concurrent requests; cold starts matter
Asynchronous Workloads
- User does not wait; work happens later
- Moderate latency acceptable (seconds to hours)
- Examples: email notifications, data processing, report generation
- AWS Services: SQS, SNS, EventBridge, Step Functions, Glue, Batch, EMR
- Scaling: Auto-scale workers; decouple producers from consumers
Batch Workloads
- Large volumes of data processed at scheduled intervals
- Latency can be hours or days
- Examples: ETL, data analytics, ML training, backups
- AWS Services: Glue, Batch, EMR, Lambda scheduled, Redshift, SageMaker
- Scaling: Fixed or dynamic parallelization; cost-optimized
Streaming Workloads
- Continuous or near-continuous data flow
- Low latency per event (milliseconds to seconds)
- Examples: sensor data, clickstreams, logs, financial ticks
- AWS Services: Kinesis, MSK, Lambda, DynamoDB Streams, EventBridge
- Scaling: Partition by shard; auto-scale shards
Decision Heuristic:
- Synchronous + low latency → Lambda (watch cold starts; provisioned concurrency keeps functions warm) or ECS (predictable latency).
- Synchronous + consistent load → ECS, EC2 (warm instances).
- Asynchronous + bursty → SQS with Lambda, EventBridge rules.
- Batch + scheduled → Glue, Batch, or Lambda scheduled.
- Streaming + high throughput → Kinesis or MSK with parallel consumers.
Output: Workload classification (sync/async, latency target, throughput, scheduling).
1.5 Traffic & Scale Patterns
Purpose: Understand load profile, growth trajectory, and burst capacity.
Guiding Questions:
- What is baseline traffic? (requests/sec, MB/sec)
- What is peak traffic? (seasonal, event-driven, predictable or not)
- What is growth rate? (stable, linear, exponential)
- Are there time-zone or geographic spikes?
- What is acceptable latency during peak load?
- Can your system gracefully degrade, or must it handle all traffic?
Decision Heuristic:
- Stable, predictable load → Reserved Instances, provisioned capacity (RDS, Redshift, Kinesis), Savings Plans.
- Bursty, unpredictable load → Serverless (Lambda, Fargate), on-demand (DynamoDB), auto-scaling.
- Rapid growth (0 → scale) → Serverless initially; migrate to provisioned as load stabilizes.
- Time-zone dependent → Multi-region or scheduled scaling.
- Graceful degradation required → Circuit breakers, queue shedding, read replicas for read-heavy workloads.
Output: Traffic profile (baseline, peak, growth, patterns, cost/performance targets).
1.6 Availability & Durability Targets
Purpose: Define Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each component.
RTO = How long can the system be down before it impacts business?
- Critical (< 1 hour) → Multi-AZ, multi-region failover, hot standby
- High (1-4 hours) → Multi-AZ failover, automated recovery
- Medium (4-24 hours) → Single AZ with automated backups, manual recovery
- Low (> 24 hours) → Manual recovery acceptable; development/test environments
RPO = How much data loss is acceptable?
- Critical (zero data loss) → Synchronous replication, read replicas, event sourcing
- High (minutes) → Asynchronous replication, hourly backups
- Medium (hours) → Daily snapshots
- Low (days) → Weekly or monthly backups
Decision Heuristic:
- RTO < 1 hour, RPO zero → Multi-AZ Aurora with read replicas, DynamoDB global tables, cross-region failover.
- RTO 1-4 hours, RPO hours → Multi-AZ RDS with automated backups, S3 cross-region replication.
- RTO > 1 day, RPO > 1 day → Single AZ with backups; manual recovery acceptable.
AWS Service Mapping:
- High availability: Aurora, DynamoDB, Lambda, ALB/NLB
- Disaster recovery: S3 cross-region replication, RDS backups, Backup service, Route 53 health checks
- Durability: S3 (11 nines), EBS snapshots, Data Lifecycle Manager
Output: Availability matrix (component, RTO, RPO, strategy, cost implications).
1.7 Security & Compliance Needs
Purpose: Identify confidentiality, integrity, availability requirements and regulatory constraints.
Guiding Questions:
- What data is handled? (PII, PHI, financial, trade secrets)
- What are regulatory requirements? (HIPAA, PCI-DSS, GDPR, SOX, CCPA)
- Must data stay in-region? (data residency, data sovereignty)
- Who can access what data? (role-based, attribute-based)
- Is encryption required? (in-transit, at-rest, key management)
- Are audit logs and compliance reports required?
- Are there penetration testing or security assessment requirements?
Decision Heuristic:
- PII/GDPR → Encryption (KMS), access auditing, data classification, DLP (Macie), deletion policies
- HIPAA/PHI → Encryption, audit logging, VPC isolation, Business Associates Agreement (BAA)
- PCI-DSS → Network isolation, encryption, access logging, vulnerability scanning (Inspector)
- Financial/SOX → Immutable audit logs (CloudTrail), change controls, segregation of duties
- Data residency required → Single-region architecture, explicit region selection
- High-security environment → VPC isolation, GuardDuty, WAF, Network Firewall
Output: Security posture (data classification, regulatory drivers, encryption, access control, audit requirements).
1.8 Cost Sensitivity
Purpose: Define cost constraints and optimization priorities.
Guiding Questions:
- Is this a cost-driven or feature-driven project?
- What is the acceptable cost range per month, per user, per transaction?
- Are there cost allocation/showback requirements?
- Can you commit to reserved capacity, or is on-demand necessary?
- Is CapEx or OpEx preferred?
- What cost-optimization tools (CloudWatch, Cost Explorer, Trusted Advisor) are available?
Decision Heuristic:
- Startup/MVP → On-demand, serverless, minimize fixed costs.
- Scaling (stable load) → Mix reserved + on-demand; commit 70% of baseline traffic.
- Cost-sensitive operations → Reserved Instances (up to 75% off), Savings Plans (65% off, more flexible), spot instances (up to 90% off, interruptible workloads).
- Variable workloads → Serverless (Lambda, Fargate, Athena), DynamoDB on-demand.
- Long-term projects (3+ years) → 3-year reserved instances provide best ROI.
Output: Cost model (budget, cost per unit, commitment level, optimization targets).
1.9 Operational Complexity
Purpose: Assess team capability and operational overhead tolerance.
Guiding Questions:
- What is the team's cloud maturity? (beginners, intermediate, advanced)
- Do you have DevOps, SRE, or platform engineering capabilities?
- Can the team manage Kubernetes, on-premises infrastructure, or complex integrations?
- What is the tolerance for manual operational tasks?
- Are there compliance/audit requirements for infrastructure change management?
Decision Heuristic:
- Small team, low maturity → Serverless (Lambda, Fargate), managed services (RDS, OpenSearch), minimal custom code.
- Intermediate maturity → ECS with CloudFormation, RDS Aurora, Glue data pipelines.
- Advanced maturity + complex requirements → EKS, self-managed databases, custom orchestration.
- Compliance-heavy environments → Infrastructure as Code (Terraform, CDK), automated deployments, audit trails.
Output: Operational profile (team size, skills, tolerance for complexity, tooling requirements).
1.10 Change Frequency & Extensibility
Purpose: Design for evolution and minimize rework.
Guiding Questions:
- How often do requirements change? (weekly, monthly, quarterly)
- Will the system need to integrate with new services or data sources?
- Is the architecture greenfield (new) or brownfield (existing)?
- How modular and decoupled should components be?
- What is the acceptable technical debt?
Decision Heuristic:
- High change frequency → Microservices, event-driven, loose coupling; use Step Functions, EventBridge for orchestration.
- Stable requirements → Monolith is acceptable; focus on reliability and scaling.
- Must integrate with many systems → Event-driven, API-first, API Gateway for versioning.
- Greenfield + flexible → Start with serverless; migrate to provisioned services as load/patterns stabilize.
- Brownfield + legacy → Strangler Fig pattern; migrate incrementally; keep old and new running in parallel.
Output: Architecture roadmap (current state, evolution path, integration points, refactoring windows).
Section 2: Workload Classification Engine
Every real-world problem maps to one or more of these generic workload archetypes. Understanding which archetype(s) your system belongs to unlocks the correct service selection.
2.1 Request/Response Systems (Synchronous)
Characteristics:
- User or system waits for response
- Latency-sensitive (< 1 second to minutes)
- Throughput variable or bursty
- State management required (sessions, user context)
Real-World Examples:
- Web applications, APIs, mobile backends
- Search engines, recommendation systems
- Checkout flows, payment processing
- Real-time dashboards, monitoring systems
Core AWS Services:
- Compute: Lambda (cold start sensitive), ECS/Fargate (warm, predictable), EC2 (control, long-running)
- Network: API Gateway (REST/WebSocket), ALB (internal routing), CloudFront (caching)
- Database: RDS/Aurora (ACID, sessions), DynamoDB (key-value, sessions), ElastiCache (ephemeral cache)
- Observability: CloudWatch, X-Ray (trace requests end-to-end)
Architecture Pattern:
Client → CloudFront (cache) → API Gateway → Lambda/ECS/EC2 → RDS/DynamoDB/Cache
- Static assets served from S3 behind CloudFront
- Application logs flow to CloudWatch
Key Design Decisions:
Compute choice:
- Lambda: Best for I/O-bound, variable load; worst for long-running or CPU-intensive work
- ECS: Best for containerized, moderate load; good cold-start latency
- EC2: Best for high CPU, long-running, or specialized requirements
Caching strategy:
- CloudFront: Cache static assets and cacheable API responses globally
- ElastiCache: Cache database queries, sessions, expensive computations
Database choice:
- RDS/Aurora: Relational data, complex queries, ACID guarantees
- DynamoDB: Key-value lookups, horizontal scaling, variable load
- Hybrid: Both (RDS for transactions, DynamoDB for sessions/cache)
Load balancing:
- ALB (Layer 7): Route based on path, hostname, headers (microservices)
- NLB (Layer 4): Route based on IP protocol, port; extreme throughput/low latency
- API Gateway: REST/GraphQL APIs, rate limiting, schema validation
Scaling and resilience:
- Auto Scaling Groups (ECS, EC2) for gradual scaling
- Lambda: Automatic, scales to 1000s of concurrent requests
- Circuit breakers (Step Functions) to prevent cascading failures
- Read replicas (RDS) or DAX/global tables (DynamoDB) for read-heavy workloads
Well-Architected Alignment:
- Operational Excellence: Observability (CloudWatch metrics, logs, X-Ray traces); automated deployments (CodeDeploy)
- Security: IAM roles, VPC security groups, encryption in transit (HTTPS), KMS at-rest
- Reliability: Multi-AZ deployment, auto-scaling, health checks, failover
- Performance: Caching (CloudFront, ElastiCache), connection pooling, query optimization
- Cost Optimization: Right-size instances, use Savings Plans, cache frequently accessed data
- Sustainability: Use managed services to avoid idle servers; consolidate workloads
Cost Model:
- Lambda: $0.0000002 per request + $0.0000166667 per GB-second (pay for actual use)
- ECS Fargate: $0.04695 per vCPU-hour + $0.00519 per GB-hour (pay for provisioned capacity)
- RDS: ~$0.017/hour (db.t3.micro) to $10+/hour (large instances) + storage + backups
- Break-even: Lambda vs. ECS ≈ 10-100 requests/sec depending on execution time; ECS cheaper above this
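To make the break-even concrete, here is a minimal back-of-the-envelope sketch using the illustrative rates above (prices vary by region and change over time; check current AWS pricing before relying on the crossover point):

```python
# Rough monthly cost comparison: Lambda vs. an always-on Fargate task.
# Rates below are the illustrative numbers quoted in this guide.

LAMBDA_PER_REQUEST = 0.0000002        # $ per request
LAMBDA_PER_GB_SECOND = 0.0000166667   # $ per GB-second
FARGATE_VCPU_HOUR = 0.04695           # $ per vCPU-hour
FARGATE_GB_HOUR = 0.00519             # $ per GB-hour
HOURS_PER_MONTH = 730

def lambda_monthly(req_per_sec: float, mem_gb: float, dur_sec: float) -> float:
    requests = req_per_sec * 3600 * HOURS_PER_MONTH
    return (requests * LAMBDA_PER_REQUEST
            + requests * dur_sec * mem_gb * LAMBDA_PER_GB_SECOND)

def fargate_monthly(vcpu: float, mem_gb: float, tasks: int = 1) -> float:
    return tasks * HOURS_PER_MONTH * (vcpu * FARGATE_VCPU_HOUR + mem_gb * FARGATE_GB_HOUR)

for rps in (1, 10, 50, 100):
    print(f"{rps:>4} req/s  lambda=${lambda_monthly(rps, 1, 0.1):8.2f}"
          f"  fargate=${fargate_monthly(1, 2):8.2f}")
```

With 1 GB memory and 100 ms per request, the crossover against a 1 vCPU / 2 GB Fargate task lands near 10 requests/second, consistent with the 10-100 req/s range quoted above.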
2.2 Event-Driven Systems (Asynchronous, Decoupled)
Characteristics:
- Components communicate via events, not direct calls
- Loose coupling; easy to add consumers
- No waiting for response
- Throughput and latency can vary independently
Real-World Examples:
- Order processing (order → payment → fulfillment → notification)
- Data pipeline orchestration (ingest → transform → load → analyze)
- Multi-tenant SaaS platforms (user action → event → multiple subscribers)
- IoT systems (sensor → ingestion → storage → analytics → alerting)
Core AWS Services:
- Event sources: S3, RDS, EventBridge, Kinesis, SNS, SQS, Lambda, DynamoDB Streams
- Event routers: EventBridge (pub/sub with routing), SNS (fanout), SQS (queue), Kinesis (streaming)
- Event processors: Lambda, Step Functions, ECS, Fargate, Batch
- Event storage: DynamoDB, S3, EventBridge Archive
Architecture Pattern:
Source (S3, API, DDB) → EventBridge/SNS/SQS → Lambda/ECS/Batch → Target (DB, S3, Email)
- Fanout: one event can feed multiple independent consumers
Key Design Decisions:
Event source:
- EventBridge: Most flexible; supports 100+ AWS services and custom apps; best for rule-based routing
- SNS: Simple pub/sub; less configuration; no replay; good for fanout
- SQS: Queue-based; replay via DLQ; guarantees delivery; good for buffering
- Kinesis: Streaming; ordered; replay; good for real-time analytics
- DynamoDB Streams: Change data capture; triggers Lambda; good for change-driven workflows
Fanout vs. queue:
- Fanout (SNS, EventBridge): One event → multiple independent consumers; each gets a copy
- Queue (SQS): One event → one consumer (processes and deletes); good for load leveling
Error handling:
- Idempotence: Events may be delivered multiple times; ensure consumers can handle duplicates
- Dead-letter queues (DLQ): SQS/SNS DLQs capture failed messages for replay/inspection
- Retry policies: Exponential backoff; max retries; circuit breaker pattern
Orchestration vs. choreography:
- Choreography: Each service subscribes to events and emits its own (decoupled, but implicit flows)
- Orchestration: Central coordinator (Step Functions) directs workflow; explicit, auditable
Consistency model:
- Eventual consistency: Events propagate asynchronously; ordering may not be guaranteed
- At-least-once delivery: Event may be delivered multiple times; idempotence required
- Exactly-once semantics (hard): Use idempotent IDs + database writes with deduplication (a minimal sketch follows this list)
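As referenced above, a minimal idempotent-consumer sketch. It assumes a hypothetical DynamoDB table named processed-events with partition key event_id; the conditional write makes duplicate deliveries a no-op:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
TABLE = "processed-events"  # hypothetical dedup table, partition key "event_id"

def process(payload: dict) -> None:
    ...  # business logic placeholder

def handle_event(event_id: str, payload: dict) -> None:
    try:
        # The conditional write succeeds only the first time this event_id is seen.
        dynamodb.put_item(
            TableName=TABLE,
            Item={"event_id": {"S": event_id}},
            ConditionExpression="attribute_not_exists(event_id)",
        )
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return  # duplicate delivery: already processed, skip
        raise  # other errors should surface so the message is retried / sent to the DLQ
    process(payload)
```

A TTL attribute on the dedup table keeps it from growing without bound.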
Well-Architected Alignment:
- Operational Excellence: Event tracing (CloudWatch logs, X-Ray), dead-letter queues for debugging, alerts
- Security: IAM policies for publish/subscribe, encryption at-rest (SNS, SQS, EventBridge)
- Reliability: Durable queues (SQS, EventBridge); retries; DLQs for failure handling; multi-AZ
- Performance: EventBridge rules scale to millions/sec; Kinesis scales by partition; fanout is instant
- Cost Optimization: Pay only for events processed; SQS cheaper than Kinesis for non-streaming; use Kinesis on-demand
- Sustainability: Event-driven can consolidate workloads; avoid polling
Cost Model:
- EventBridge: ~$1.00 per 1M custom events published
- SNS: $0.50 per 1M publishes
- SQS: $0.40 per 1M requests (includes sends, receives, deletes)
- Kinesis on-demand: ~$0.08 per GB ingested plus a per-stream hourly charge
- Kinesis provisioned: ~$0.015 per shard-hour (≈$0.36 per shard-day)
2.3 Stream Processing (Continuous, Low-Latency Analytics)
Characteristics:
- Data arrives continuously at high velocity
- Process and react within seconds or milliseconds
- Maintain windowed state (e.g., moving averages)
- Throughput scales by partitioning
Real-World Examples:
- Real-time fraud detection, financial trading
- Clickstream analysis, A/B testing
- Sensor monitoring, anomaly detection
- Log processing, security analytics
Core AWS Services:
- Ingestion: Kinesis Data Streams, MSK (Kafka), EventBridge
- Processing: Kinesis Data Analytics (SQL), Lambda (event-by-event), Flink/Spark on EMR, Kafka Streams
- State storage: DynamoDB, ElastiCache, Kinesis (windowing)
- Output: DynamoDB, RDS, S3, Redshift, OpenSearch, Lambda targets
Architecture Pattern:
Source (API, sensor) → Kinesis/MSK → Analytics/Processing → DynamoDB/S3/Redshift
- Side path: real-time alerts and dashboards
Key Design Decisions:
Ingestion service:
- Kinesis: AWS-native, scales by shard, low-level control, good for AWS ecosystem
- MSK (Kafka): Open-source, more operational burden, richer ecosystem, good for multi-cloud
- EventBridge: Rule-based routing; not ideal for high-throughput streaming; best for low-frequency events
Processing framework:
- Kinesis Data Analytics: SQL-based, managed, good for windowing and aggregations
- Lambda: Simple transformation; event-by-event processing; cold start may impact throughput
- Flink/Spark on EMR: Complex stateful processing, machine learning, but requires cluster management
Partitioning strategy:
- Partition by customer ID, user ID, or geographic region to parallelize processing
- Avoid hot partitions (one shard falling behind)
- Auto-scale shards based on throughput metrics
Windowing and state:
- Tumbling windows: Fixed intervals (1-minute aggregates)
- Sliding windows: Overlapping intervals (10-second windows, every 1 second)
- Session windows: Grouped by inactivity (user session breaks when idle > 30 min)
- Use DynamoDB or ElastiCache for windowed state (see the sketch after this list)
Latency vs. throughput:
- Low latency (< 1 second): Lambda + Kinesis, minimal batching
- Higher throughput (100k+/sec): Flink, larger batches, higher latency acceptable
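As referenced in the windowing item above, a minimal tumbling-window sketch in plain Python. It shows the per-key, fixed-interval aggregation that a Kinesis Data Analytics SQL window or a Flink job would perform; the keys and timestamps are toy values:

```python
# Tumbling-window aggregation: count events per key in fixed 60-second windows.
from collections import defaultdict

WINDOW_SECONDS = 60

def window_start(ts: float) -> int:
    # Align each timestamp to the start of its window.
    return int(ts // WINDOW_SECONDS) * WINDOW_SECONDS

def aggregate(events):  # events: iterable of (timestamp, key)
    counts = defaultdict(int)
    for ts, key in events:
        counts[(window_start(ts), key)] += 1
    return dict(counts)

events = [(0, "user-a"), (10, "user-a"), (65, "user-b"), (70, "user-a")]
print(aggregate(events))
# {(0, 'user-a'): 2, (60, 'user-b'): 1, (60, 'user-a'): 1}
```

In a real stream processor the window state would live in DynamoDB, ElastiCache, or the framework's checkpointed state rather than in process memory.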
Well-Architected Alignment:
- Operational Excellence: Monitoring lag (Kinesis iterator age), consumer latency, throughput metrics
- Security: Encryption (TLS in-transit, KMS at-rest), IAM for producer/consumer, audit logging
- Reliability: Auto-scaling shards; replay from the stream (Kinesis retains data 24 hours by default, extendable to 365 days; back up to S3 for longer-term replay)
- Performance: Partition hot spots avoided, optimal shard count, batch size tuning
- Cost Optimization: On-demand Kinesis for variable workloads; reserved for baseline; consider Kafka if high volume
- Sustainability: Consolidate stream processing on shared clusters; avoid idle shards
Cost Model:
- Kinesis provisioned: ~$0.015 per shard-hour; each shard handles 1 MB/s (1,000 records/s) ingest and 2 MB/s egress
- Kinesis on-demand: ~$0.08 per GB ingested plus a per-stream hourly charge (no upfront commitment)
- Managed Flink (Kinesis Data Analytics): ~$0.11 per KPU-hour; Flink on EMR instead pays the EMR fee plus EC2 compute
- Lambda streaming: $0.0000002 per request + GB-second compute; concurrency scales with the number of shards/partitions
2.4 Batch Processing (Scheduled, Offline, High-Volume)
Characteristics:
- Large volumes of data processed at scheduled intervals
- Latency acceptable (hours, days)
- Cost-optimized (spot instances, off-peak scheduling)
- Often part of data pipelines
Real-World Examples:
- ETL jobs (extract, transform, load)
- Daily/weekly analytics, reporting
- ML model training, batch scoring
- Data cleanup, archival, backups
- Rendering, video transcoding
Core AWS Services:
- Scheduling/Orchestration: Step Functions, Lambda scheduled, EventBridge rules, Glue workflows
- Compute: Batch, Glue, EMR, Lambda (small jobs), SageMaker (ML)
- Storage: S3 (input/output), RDS (data source), Redshift (output)
- Distributed Processing: Spark (EMR), Flink (EMR), Hadoop (EMR), Glue (Spark-based)
Architecture Pattern:
S3/RDS (source) → Glue/Batch/EMR (transform) → S3/Redshift/RDS (output) → Analytics/Reporting
- Scheduled by Step Functions
- Failure handling via DLQ/SNS
Key Design Decisions:
Compute platform:
- Glue: AWS-native, serverless, pay-per-DPU-second, good for AWS-centric pipelines
- Batch: Managed job queue, auto-scaling EC2, good for large compute jobs, spot instances
- EMR: Full Hadoop/Spark cluster, complex transformations, good for data science
- Lambda: Small, short-lived jobs; minimal overhead; limited timeout (15 min)
Scheduling:
- Step Functions: Visual workflows, error handling, automatic retries, good for multi-step pipelines
- EventBridge rules: Cron expressions, simple triggering, good for scheduled tasks
- Glue workflows: Built-in job orchestration, trigger on job completion
- Apache Airflow (MWAA): Complex DAGs, good for data engineers
Data partitioning:
- Partition input data by date, customer, or region for parallel processing
- Output partitioned for efficient querying (Athena, Redshift spectrum)
- Use S3 object keys intelligently (e.g., s3://bucket/year/month/day/hour/); a partition-key sketch follows this list
Cost optimization:
- Use spot instances on Batch/EMR (up to 70% savings); interruptions acceptable
- Schedule batch jobs in off-peak hours (e.g., 2-6 AM)
- Delete intermediate data; keep only final output
- Use Glue over EMR for small/medium jobs (cheaper); EMR for complex workloads
Failure handling:
- Retry logic in Step Functions (exponential backoff, max retries)
- Dead-letter queues for failed jobs
- SNS notifications on failure
- Idempotent job design (can re-run without side effects)
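As referenced in the data-partitioning item above, a short boto3 sketch of Hive-style date partitioning; the bucket name is hypothetical:

```python
# Write batch output under date-partitioned S3 keys so Athena / Redshift
# Spectrum can prune partitions at query time.
import boto3
from datetime import datetime, timezone

s3 = boto3.client("s3")
BUCKET = "example-batch-output"  # hypothetical bucket

def partitioned_key(prefix: str, ts: datetime, filename: str) -> str:
    # Hive-style layout: prefix/year=2024/month=05/day=17/filename
    return f"{prefix}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/{filename}"

now = datetime.now(timezone.utc)
key = partitioned_key("sales", now, "part-0000.parquet")
s3.put_object(Bucket=BUCKET, Key=key, Body=b"...")  # payload elided

# Downstream, list just one day's partition instead of the whole prefix:
resp = s3.list_objects_v2(
    Bucket=BUCKET,
    Prefix=f"sales/year={now:%Y}/month={now:%m}/day={now:%d}/",
)
```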
Well-Architected Alignment:
- Operational Excellence: Job monitoring, retry policies, automated error handling, alerts
- Security: Data encryption (S3, KMS), IAM roles for job execution, audit logs
- Reliability: Multi-step workflows with automatic retries, DLQs, checkpointing for restartable jobs
- Performance: Parallel processing, data partitioning, right-size instances
- Cost Optimization: Spot instances, off-peak scheduling, delete interim data, managed services (Glue vs. EMR)
- Sustainability: Batch when possible (consolidate compute); avoid idle clusters; use managed services
Cost Model:
- Glue: $0.44 per DPU-hour; a DPU provides 4 vCPUs and 16 GB of memory (Spark jobs have a 2-DPU minimum)
- Batch: EC2 spot instances (70% off on-demand) + networking; pay only during execution
- EMR: a per-instance-hour service fee (varies by instance type) on top of EC2 compute; spot discounts apply to the EC2 portion
- Lambda: $0.0000002 per request + $0.0000166667 per GB-second (cheap for small jobs; inefficient for large workloads)
2.5 Long-Running Workflows (State Machines, Sagas)
Characteristics:
- Multi-step processes with conditional logic and retries
- Steps may take minutes, hours, or days
- State must be persisted; can be paused/resumed
- Distributed, multi-service coordination
Real-World Examples:
- Loan approval workflows (underwriting → approval → funding)
- Supply chain tracking (order → warehouse → shipment → delivery)
- ML training pipelines (data prep → training → evaluation → deployment)
- Multi-step approval processes (draft → review → approve → execute)
Core AWS Services:
- Orchestration: Step Functions (standard or express), Apache Airflow (MWAA)
- Persistence: DynamoDB (state), SQS/SNS (communication), S3 (large payloads)
- Compute: Lambda, ECS, Batch (individual steps)
- Monitoring: CloudWatch, X-Ray (trace execution paths)
Architecture Pattern:
Trigger (API, S3, EventBridge) → Step Functions State Machine
- Step 1: Validate (Lambda)
- Step 2: Process (ECS/Batch)
- Step 3: Notify (SNS)
- Error handling: retry, DLQ, alert
Key Design Decisions:
Step Functions flavor:
- Standard: ≤ 1 year duration, exactly-once workflow execution, good for business workflows
- Express: ≤ 5 minutes, at-least-once execution, high throughput, good for event processing
Saga pattern (distributed transactions):
- Choreography: Each step emits events; next steps listen (decoupled but implicit)
- Orchestration: Central coordinator (Step Functions) directs all steps (explicit, auditable, easier to debug)
- Compensation: On failure, "undo" previous steps (e.g., reverse payment if fulfillment fails)
Wait strategies:
- Callback pattern: Step Functions pauses on a task token (waitForTaskToken) until the external system calls back with the result
- Polling: Step Functions checks status periodically (less efficient)
- Event-driven: External system sends event (SNS/SQS) to notify completion
Error handling:
- Automatic retries (exponential backoff, max attempts)
- Catch and handle specific errors (e.g., timeout → default value; validation failure → alert)
- Dead-letter queues for unrecoverable failures
- Manual intervention path if needed
State management:
- Use DynamoDB to persist workflow state
- Store large payloads in S3; reference by key
- Stay under the Step Functions input/output size limit (256 KB) by passing S3 references (see the sketch after this list)
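As referenced above, a sketch of the S3 offloading pattern: the state machine carries only a small reference while the payload lives in S3. Bucket and key names are hypothetical:

```python
# Keep Step Functions state small: store large payloads in S3, pass a reference.
import json
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "example-workflow-payloads"  # hypothetical bucket

def offload(payload: dict) -> dict:
    """Store a large payload in S3; return a small reference for the state machine."""
    key = f"payloads/{uuid.uuid4()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(payload).encode())
    return {"payload_s3_key": key}

def load(reference: dict) -> dict:
    """Resolve the reference back into the full payload inside a task Lambda."""
    obj = s3.get_object(Bucket=BUCKET, Key=reference["payload_s3_key"])
    return json.loads(obj["Body"].read())
```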
Well-Architected Alignment:
- Operational Excellence: Visual workflow monitoring, CloudWatch logs for each step, alerts on failure
- Security: IAM roles per step (least privilege), encryption of state data
- Reliability: Automatic retries, compensation logic, no data loss (persisted state)
- Performance: Parallel step execution, avoid blocking waits
- Cost Optimization: Pay per state transition; consolidate steps where possible; use spot instances in Batch steps
- Sustainability: Avoid polling; use event-driven waits
Cost Model:
- Step Functions Standard: $0.000025 per state transition (1M = $25/month)
- Step Functions Express: $0.000001667 per invocation + GB-second (useful for high-volume, short workflows)
- DynamoDB for state: on-demand ~$1.25 per 1M write request units and ~$0.25 per 1M read request units
2.6 Data-Intensive Platforms (Data Lakes, Warehouses, Analytics)
Characteristics:
- Large volume of diverse data sources
- Multiple consumers (analytics, ML, real-time queries)
- Governed access; compliance and lineage important
- Optimized for both ingestion and querying
Real-World Examples:
- Data lakes (centralized repository for all data)
- Data warehouses (curated, optimized for BI queries)
- Data mesh (domain-oriented, decentralized ownership)
- ML feature stores
Core AWS Services:
- Ingestion: Glue, Kinesis, MSK, DMS (database migration)
- Storage: S3 (bronze/silver/gold layers), Redshift (warehouse)
- Processing: Glue (ETL), EMR (Spark/Flink), Athena (SQL on S3)
- Governance: Lake Formation (access control), Glue Data Catalog (metadata)
- Analytics: Redshift, Athena, QuickSight, SageMaker
Architecture Pattern (Modern Data Lakehouse):
Raw Data (S3 Bronze) → Glue/EMR (transform) → Curated Data (S3 Silver/Gold), consumed by:
- Redshift (OLAP) → BI tools (QuickSight)
- Athena (ad-hoc SQL) → data apps (Jupyter)
- OpenSearch (search) → search apps (Kibana/Dashboards)
Key Design Decisions:
Data organization (medallion architecture):
- Bronze: Raw data as-is (from source systems)
- Silver: Cleaned, validated, deduplicated; schema applied
- Gold: Business-ready; aggregated; optimized for use cases
- Use S3 partitioning intelligently; Glue crawlers auto-discover
Ingestion pattern:
- Batch: Scheduled Glue jobs; good for daily or less frequent loads
- Streaming: Kinesis → Glue streaming job → S3 (continuous updates)
- Change Data Capture (CDC): DMS → S3 (real-time sync from source databases)
Governance and security:
- Lake Formation: Centralized permissions; grant access to databases/tables; tag-based controls
- Data Catalog: Glue Data Catalog for metadata; search/lineage
- Encryption: S3-SSE, KMS for sensitive data; Glue connection encryption for DB credentials
- Compliance: S3 versioning, Object Lock for immutability; access logging
Query optimization:
- Redshift: Fast queries; supports complex joins; good for BI teams; requires provisioning
- Athena: Serverless SQL on S3; slower but no provisioning; good for ad-hoc queries
- Partitioning: Partition by date, customer, region; Athena scans only the needed partitions (cost ↓; see the query sketch after this list)
- File format: Parquet (columnar) better than CSV; Glue auto-converts
Cost optimization:
- Use Athena on-demand for variable queries; no idle infrastructure
- Use Redshift Spectrum to query S3 without loading (data stays in S3)
- Partition and compress; use columnar formats (Parquet, ORC)
- Delete old data (TTL, Intelligent-Tiering); archive to Glacier if needed
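As referenced in the partitioning item above, a boto3 sketch of a partition-pruned Athena query; database, table, and result-bucket names are hypothetical:

```python
# Run a partition-pruned Athena query. The WHERE clause on partition
# columns is what limits the data scanned (and therefore the cost).
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="""
        SELECT customer_id, SUM(amount) AS total
        FROM sales
        WHERE year = '2024' AND month = '05'   -- partition columns: prune the scan
        GROUP BY customer_id
    """,
    QueryExecutionContext={"Database": "analytics"},                 # hypothetical
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(resp["QueryExecutionId"])  # poll get_query_execution for completion
```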
Well-Architected Alignment:
- Operational Excellence: Crawlers for auto-discovery, Data Catalog for searchability, Glue workflows for orchestration
- Security: Lake Formation permissions, encryption, audit logging, data classification
- Reliability: Data versioning (S3), immutability (S3 Object Lock), backup (cross-region replication)
- Performance: Partitioning, columnar formats, Spectrum for external queries, Redshift for complex OLAP
- Cost Optimization: Athena pay-per-query, Redshift reserved for predictable workloads, S3 Intelligent-Tiering
- Sustainability: Consolidate analytics on shared platforms; avoid silos; centralized data reduces duplication
Cost Model:
- S3: $0.023 per GB/month (standard); $0.0125 (infrequent access)
- Athena: $5 per TB scanned (queries scan only the partitions they need)
- Glue: $0.44 per DPU-hour
- Redshift: ~$0.25/hour (dc2.large); ~$1.09/hour (ra3.xlplus); RA3 storage billed separately
- Lake Formation: permissions are free; the underlying Glue Data Catalog charges ~$1 per million requests beyond the free tier
2.7 AI/ML & Agentic Systems
Characteristics:
- Data-driven decision making and automation
- Model training, evaluation, serving pipelines
- Real-time inference or batch scoring
- Emerging: Agentic AI (agents that reason and take actions)
Real-World Examples:
- Recommendation systems, personalization
- Fraud detection, anomaly detection
- Natural language processing (chatbots, document analysis)
- Computer vision (image classification, object detection)
- Agentic AI: Autonomous agents that interact with systems, query databases, make decisions
Core AWS Services:
- ML Platform: SageMaker (end-to-end), Bedrock (foundation models, agents)
- Model Training: SageMaker Training, Batch, EMR (Spark/PySpark)
- Feature Store: SageMaker Feature Store (managed; offline/online feature serving)
- Model Registry: SageMaker Model Registry (version control, approval workflow)
- Inference: SageMaker endpoints (real-time), SageMaker Batch Transform (offline), Lambda@Edge (edge ML)
- Agentic AI: Bedrock Agents, Amazon Q (enterprise search/automation), Step Functions (orchestration)
- Tools/APIs: Lambda (tool execution), API Gateway (external integrations), DynamoDB (state)
- Data: S3 (training data), RDS/DynamoDB (features, state), OpenSearch (vector search for RAG)
Architecture Pattern (Traditional ML):
Data (S3) → SageMaker Training → Model Registry → SageMaker Endpoint (real-time inference) or SageMaker Batch Transform (offline scoring) → Application / BI tool
Architecture Pattern (Agentic AI):
User query → Bedrock Agent (orchestration) → Foundation model (reasoning) → tool selection & execution → response to user
Tools the agent can invoke:
- Lambda (query database)
- API Gateway (call external services)
- DynamoDB (read/write state)
- OpenSearch (semantic search, RAG)
Key Design Decisions:
Foundation Model (Bedrock Agents):
- Use Claude, Llama, Mistral, etc. via Bedrock (no provisioning)
- Custom fine-tuned models (SageMaker) for domain-specific tasks
- Prompt engineering and in-context learning for performance
Agentic system architecture:
- Tools: Lambda functions exposing APIs (query database, update CRM, send email)
- Knowledge bases: OpenSearch or Bedrock Knowledge Base (vector embeddings for RAG)
- Memory: DynamoDB for conversation history, user state, execution traces
- Orchestration: Bedrock Agents handle reasoning; Step Functions for complex multi-step workflows
Feature engineering and serving:
- Offline: Glue/Spark to compute features in batch; store in S3
- Online: SageMaker Feature Store for low-latency feature retrieval during inference
- Real-time: Compute on-the-fly in inference code if lightweight; Lambda acceptable
Model training pipeline:
- Data preparation: Glue (structured), SageMaker Processing (custom code)
- Training: SageMaker Training (managed) or EMR (Spark MLlib)
- Evaluation: SageMaker Experiments (track metrics); model registry for approval
- Deployment: SageMaker Endpoints (auto-scaling) or Lambda container images
Inference patterns:
- Real-time (< 1 sec): SageMaker endpoints (auto-scaling), Lambda (smaller models)
- Near-real-time (seconds): Lambda + concurrent invocations, ECS
- Batch (hours): SageMaker Batch Transform, Batch, Glue
- Edge: Lambda@Edge, Greengrass for on-device inference
RAG (Retrieval-Augmented Generation):
- Knowledge base: OpenSearch with vector embeddings or Bedrock Knowledge Base
- Retrieval: Semantic similarity search on user query
- Generation: Agent passes retrieved context to the LLM for the response (a retrieval sketch follows this list)
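As referenced above, a minimal sketch of the retrieval step in plain Python. A real system would get embeddings from a model (e.g., via Bedrock) and store vectors in OpenSearch or a Bedrock Knowledge Base; the documents and vectors below are toy values, and the final LLM call is elided:

```python
# Retrieval step of a RAG pipeline: cosine similarity over pre-computed embeddings.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical document store: (text, embedding) pairs.
docs = [
    ("Refund policy: 30 days", [0.9, 0.1, 0.0]),
    ("Shipping takes 3-5 days", [0.1, 0.8, 0.2]),
]

def retrieve(query_embedding, k=1):
    return sorted(docs, key=lambda d: cosine(query_embedding, d[1]), reverse=True)[:k]

context = retrieve([0.85, 0.2, 0.05])
print(context)
# Next step: pass `context` plus the user question to the foundation model.
```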
Well-Architected Alignment:
- Operational Excellence: Experiment tracking (MLflow in SageMaker), model registry, automated retraining, A/B testing
- Security: Encryption (KMS for S3 training data), IAM roles, PII redaction in prompts (for agents), audit logging
- Reliability: Model versioning, canary deployments (gradual rollout), monitoring for model drift
- Performance: Feature store for low-latency retrieval, batching for throughput, multi-GPU training
- Cost Optimization: Use Bedrock for inference (pay-per-invocation) vs. SageMaker endpoint provisioning; spot training
- Sustainability: Batch inference for non-urgent tasks; consolidate models; avoid redundant retraining
Cost Model:
- Bedrock: $0.001–$0.015 per 1K input tokens (Claude Sonnet: $0.003/1K in, $0.015/1K out)
- SageMaker endpoint (ml.m5.large): $0.0864/hour + data transfer
- SageMaker Training (ml.p3.2xlarge with GPU): $3.06/hour
- SageMaker Feature Store: $0.40 per million requests (online); $0.002 per GB/month (offline)
- Bedrock Knowledge Base: embedding costs are per-token (Titan Text Embeddings is roughly $0.0001 per 1K input tokens), plus the cost of the vector store
2.8 Edge, IoT, and Real-Time Communication
Characteristics:
- Data originates at edge devices (sensors, mobile, on-prem)
- Local processing and decision-making required (low latency)
- Intermittent or unreliable connectivity
- Scale: thousands to millions of devices
Real-World Examples:
- Predictive maintenance (machine sensors → local ML → cloud)
- Smart home automation
- Mobile offline-first apps
- Autonomous vehicles (local processing + cloud sync)
- Real-time collaboration (low-latency WebSockets)
Core AWS Services:
- Edge Compute: IoT Greengrass (on-device), Lambda@Edge (CloudFront), Outposts (on-prem)
- IoT Connectivity: IoT Core (MQTT), IoT Wireless
- Local Storage: Greengrass local resource access
- Cloud Sync: DynamoDB global tables (eventual consistency), AppSync (real-time GraphQL)
- Real-time APIs: API Gateway WebSocket, AppSync
- Analytics: Kinesis, Timestream (time-series), OpenSearch
Architecture Pattern (IoT):
Sensors → Greengrass (local processing) → IoT Core (MQTT) → Kinesis/DynamoDB → Analytics
- Local decision-making happens at the edge (low latency)
Architecture Pattern (Real-Time Collaboration):
Client (WebSocket) → API Gateway (WebSocket API) → Lambda (connect/message/disconnect handlers)
- DynamoDB tracks active connection IDs
- SNS/SQS fans out broadcast messages
- API Gateway management API pushes messages back to clients
Key Design Decisions:
Edge vs. cloud processing:
- Local ML inference: Greengrass for low-latency decisions; models updated from cloud
- Cloud processing: IoT Core → Kinesis → Lambda for aggregation and advanced analytics
- Hybrid: Train in cloud; inference at edge; feedback loop for retraining
Connectivity:
- Always-on: IoT Core MQTT (publish/subscribe), reliable connection
- Intermittent: Greengrass syncs when connected; local queue until reconnection
- Low bandwidth: Compress data; send only deltas; batching
Real-time communication:
- WebSocket (API Gateway): Two-way, low-latency, good for < 100k concurrent connections (see the handler sketch after this list)
- AppSync: GraphQL, built-in subscriptions, automatic connection management
- Kinesis (event stream): High-throughput, ordered by partition, good for > 1M events/sec
Data durability at edge:
- Store locally in Greengrass; sync to cloud when connectivity restored
- Use DynamoDB global tables for eventual consistency (multi-region)
- S3 event notifications for file-based data
Security:
- IoT Core certificate-based auth; Greengrass runs as daemon
- TLS encryption; end-to-end if needed
- Least-privilege IAM roles
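As referenced in the WebSocket item above, a sketch of the Lambda handlers from the pattern diagram. The DynamoDB table name and API endpoint are hypothetical:

```python
# WebSocket pattern: track connections in DynamoDB, push via the
# API Gateway management API.
import boto3

ddb = boto3.client("dynamodb")
TABLE = "ws-connections"  # hypothetical table, partition key "connection_id"

def on_connect(event, context):
    ddb.put_item(TableName=TABLE,
                 Item={"connection_id": {"S": event["requestContext"]["connectionId"]}})
    return {"statusCode": 200}

def on_disconnect(event, context):
    ddb.delete_item(TableName=TABLE,
                    Key={"connection_id": {"S": event["requestContext"]["connectionId"]}})
    return {"statusCode": 200}

def broadcast(message: bytes, endpoint_url: str):
    # endpoint_url looks like https://{api-id}.execute-api.{region}.amazonaws.com/{stage}
    api = boto3.client("apigatewaymanagementapi", endpoint_url=endpoint_url)
    for item in ddb.scan(TableName=TABLE)["Items"]:
        # Production code should catch GoneException and delete stale connections.
        api.post_to_connection(ConnectionId=item["connection_id"]["S"], Data=message)
```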
Well-Architected Alignment:
- Operational Excellence: Device shadowing (Greengrass), fleet-wide updates, telemetry
- Security: Certificate rotation, encrypted communication, local data at-rest encryption
- Reliability: Local queue on disconnect, eventual sync, multi-region global tables
- Performance: Local ML inference (milliseconds), batched cloud sync, Kinesis for ordering
- Cost Optimization: IoT Core MQTT cheaper than constant cloud calls; Greengrass reduces cloud traffic
- Sustainability: Local processing reduces cloud load; edge devices only send necessary data
Cost Model:
- IoT Core: $1.00 per million messages
- IoT Greengrass: ~$0.16/month per active core device
- API Gateway WebSocket: ~$1.00 per million messages + $0.25 per million connection-minutes
- Timestream: $0.30 per million writes + $0.01 per million query units
2.9 Hybrid & Multi-Account Systems
Characteristics:
- Systems span on-premises, AWS, or multiple AWS accounts/regions
- Data and workloads need to integrate seamlessly
- Governance and cost attribution across boundaries
- Network connectivity maintained reliably
Real-World Examples:
- Large enterprises with on-premises datacenters migrating to cloud
- Multi-tenant SaaS (separate AWS account per customer)
- Regulated industries (separate account for prod, dev, compliance)
- Global companies (separate regions for data residency)
Core AWS Services:
- Connectivity: AWS Direct Connect (dedicated network), VPN (encrypted tunnel), Transit Gateway (hub-and-spoke)
- Multi-Account Mgmt: AWS Organizations, Control Tower, Security Hub, Config
- Cross-Account Access: IAM roles (assume role across accounts), Resource-based policies
- Data Sync: DataSync (on-prem ↔ S3), DMS (database replication), S3 cross-account replication
- Governance: Lake Formation cross-account access, Tag policies, SCPs (Service Control Policies)
Architecture Pattern (Hybrid Cloud):
On-Premises ← Direct Connect / VPN → AWS
- Data center (legacy apps) ↔ VPC with private subnets (modern apps on EC2/Lambda)
- On-prem database ← DataSync/DMS → RDS/DynamoDB
- On-prem monitoring ← EventBridge → CloudWatch
Architecture Pattern (Multi-Account SaaS):
AWS Organizations
- Master (billing, security)
- Prod Account 1 (Tenant A)
- Prod Account 2 (Tenant B)
- Dev Account
- Security/Logging Account
Cross-account patterns:
- Assume role (Tenant A's apps assume a role in Tenant A's account)
- Log aggregation (each account sends logs to the central Security account)
- Data sharing (Lake Formation: Tenant A shares data with Tenant B via the central account)
Key Design Decisions:
Connectivity:
- Direct Connect: Dedicated network; consistent bandwidth; good for large data transfers
- VPN: Encrypted tunnel; supports multiple connections; good for backup/failover
- Transit Gateway: Hub-and-spoke; simplifies multi-VPC and on-prem connectivity
Cross-account access:
- Role assumption: Service A in Account 1 assumes a role in Account 2 to access the resource (see the sketch after this list)
- Resource-based policies: S3 bucket allows Account 2 principal to access
- Federation: Active Directory users assume AWS roles via SAML/OIDC
Data consistency:
- Synchronous replication (DMS): Real-time sync; good for transactional databases
- Asynchronous replication (S3 cross-account, cross-region): Eventual consistency; lower latency impact
- Event-based sync: EventBridge rules replicate data; decoupled, scalable
Governance:
- Organizations: Centrally managed accounts; consolidated billing; SCPs for guardrails
- Control Tower: Baseline accounts with security/compliance guardrails
- Lake Formation: Centralized data lake with cross-account access; tag-based permissions
Cost allocation:
- Organization consolidates billing; cost tags for chargebacks
- Cost anomaly detection per account
- Separate AWS accounts per business unit or customer (clean billing, security isolation)
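As referenced in the role-assumption item above, a boto3 sketch of cross-account access via STS; the role ARN and account ID are hypothetical, and the role must trust the calling account and grant only the needed permissions:

```python
# Cross-account access via STS AssumeRole.
import boto3

sts = boto3.client("sts")
resp = sts.assume_role(
    RoleArn="arn:aws:iam::222222222222:role/tenant-a-access",  # hypothetical
    RoleSessionName="cross-account-demo",
)
creds = resp["Credentials"]

# Use the temporary credentials to act inside the target account.
s3_in_target = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print(s3_in_target.list_buckets()["Buckets"])
```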
Well-Architected Alignment:
- Operational Excellence: Centralized logging (Security account), Config for compliance, Systems Manager for patching
- Security: Least-privilege cross-account roles, encryption in-transit (Direct Connect, VPN), audit trails per account
- Reliability: Multi-path connectivity (Direct Connect + VPN), cross-account backups, replication
- Performance: Direct Connect for consistent bandwidth; Transit Gateway for simplified routing
- Cost Optimization: Reserved capacity per account; consolidated billing for discounts; shutdown unused accounts
- Sustainability: Consolidate under-utilized accounts; turn off dev accounts when not in use
Cost Model:
- Direct Connect: $0.30 per hour + $0.02 per GB outbound (cheaper than data transfer for high volumes)
- VPN: $0.05 per hour + standard data transfer charges
- Transit Gateway: $0.05 per hour + $0.02 per GB processed through TGW
Section 3: AWS Service Decision Matrix (Core Framework)
For every major AWS service category, this section provides structured guidance on when to use, when NOT to use, tradeoffs, cost model, scaling behavior, and operational burden.
3.1 Compute Services
AWS Lambda
When to Use:
- Synchronous request/response (API calls, webhooks)
- Asynchronous event processing (S3 triggers, SQS, SNS, EventBridge)
- Short-duration workloads (< 15 minutes)
- Variable or bursty load (scale from 0 to 1000s automatically)
- Cost-sensitive low-frequency tasks
When NOT to Use:
- Long-running processes (> 15 min timeout; use ECS or EC2)
- CPU-intensive workloads without strict latency constraints (EC2 cheaper)
- Stateful applications with persistent connections (ECS/EC2)
- Workloads with consistent, predictable high throughput (ECS/EC2 with Savings Plans cheaper)
- Real-time latency-critical paths (consistent < 100ms; ECS/EC2 stay warm, while Lambda cold starts add roughly 100ms-2s)
Tradeoffs:
- Pros: No infrastructure to manage; auto-scales; pay only for execution; fast deployments
- Cons: Cold starts (100ms-2s), 15-min timeout, 10GB max memory, vendor lock-in, difficult debugging
Cost Model:
- Pricing: $0.0000002 per request + $0.0000166667 per GB-second
- Example: 1M requests, 1GB RAM, 100ms duration = $0.2 (requests) + $1.67 (compute) = ~$1.87/month
- Free tier: 1M requests + 400k GB-seconds/month
Scaling Behavior:
- Auto-scales from 0 to 1000 concurrent executions (soft limit; request increase)
- Reserved concurrency for baseline; provisioned concurrency for predictable warm starts
- Burst: scaling rate is account- and region-dependent; Lambda currently adds up to 1,000 concurrent executions every 10 seconds per function
Operational Burden:
- Low: No patching, scaling, or infrastructure management
- Debugging: CloudWatch logs, X-Ray tracing, local SAM testing
- Versioning: Built-in; easy blue-green deployments
Well-Architected Mapping:
- Cost: Ideal for variable workloads; reserved concurrency for baseline (cheaper for high consistent load)
- Performance: Cold start sensitive; provisioned concurrency adds cost but ensures warm
- Reliability: Automatically retried on transient failures; DLQ for async failures
- Security: IAM role per function; no SSH access
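A sketch of an SQS-triggered handler that reports partial batch failures, so only the failed messages are retried and eventually routed to the DLQ. It assumes ReportBatchItemFailures is enabled on the event source mapping; the process function is a placeholder:

```python
# Lambda SQS handler with partial batch failure reporting.
import json

def process(message: dict) -> None:
    ...  # business logic placeholder

def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process(json.loads(record["body"]))
        except Exception:
            # Only this message is retried; the rest of the batch succeeds.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```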
Amazon ECS (Elastic Container Service)
When to Use:
- Containerized applications (Docker, non-Kubernetes)
- Moderate to high load (100s to 1000s requests/sec)
- Long-running services (> 15 min)
- Need for quick startup (< 1 sec) and predictable latency
- Mixed workloads on shared cluster
When NOT to Use:
- Variable, bursty workloads (Lambda more cost-effective for low utilization)
- Complex orchestration (Kubernetes / EKS recommended)
- Minimal infrastructure (Lambda/API Gateway less operational burden)
Tradeoffs:
- Pros: Fast startup, warm containers, good for stateful services, mixed workload consolidation, less overhead than EKS
- Cons: Requires container registry (ECR), task definitions, service scaling configuration; no built-in persistent volumes (use EFS)
Cost Model:
- Fargate (serverless): $0.04695 per vCPU-hour + $0.00519 per GB-hour
- Example: 1 vCPU, 2GB, always-on = ($0.04695 + 2 × $0.00519) ≈ $0.0573/hour ≈ $42/month
- EC2 (self-managed): Pay for EC2 instance (larger volumes, better ROI with high utilization)
Scaling Behavior:
- Auto Scaling Group scales EC2 instances; ECS scheduler places tasks
- Fargate scales near-instantly; EC2 scaling depends on instance launch time (~1-2 min)
- Target tracking (CPU, memory); step scaling for complex rules
Operational Burden:
- Moderate: Manage task definitions, scaling policies, blue-green deployments
- Monitoring: CloudWatch metrics, container logs
- Patching: Update task definitions and redeploy; ECS handles rolling updates
Well-Architected Mapping:
- Cost: Fargate for simple, variable workloads; EC2 for predictable, high-utilization
- Performance: No cold start; warm containers; can be more efficient than Lambda for sustained load
- Reliability: Auto-restart failed tasks; service health checks; distributed across AZs
- Security: IAM task role; container-level isolation
Amazon EKS (Elastic Kubernetes Service)
When to Use:
- Complex microservices ecosystems (10s-100s of services)
- Kubernetes-native tooling and expertise available
- Advanced orchestration needs (canary deployments, traffic shifting, service mesh)
- Workloads already containerized in Kubernetes
When NOT to Use:
- Small teams without Kubernetes expertise (ECS simpler)
- Simple applications (Lambda or ECS sufficient)
- Minimal operational overhead desired
Tradeoffs:
- Pros: Powerful orchestration, extensive ecosystem (Istio, Helm, Prometheus), vendor-agnostic (portable to other clouds)
- Cons: Operational complexity; requires cluster management (node updates, networking, security); higher learning curve
Cost Model:
- EKS cluster: $0.10 per hour (control plane) + EC2 or Fargate for worker nodes
- Full cost: $73/month (cluster) + compute costs
- Good for: 10+ services; high utilization; can amortize control plane cost
Scaling Behavior:
- Cluster autoscaler adds/removes nodes based on pod resource requests
- Horizontal Pod Autoscaler (HPA) scales pod replicas by CPU/memory/custom metrics
- Complex multi-dimensional scaling (per-deployment, per-namespace)
Operational Burden:
- High: Cluster upgrades, node patching, networking (CNI plugins), monitoring (add-ons)
- Requires dedicated SRE/platform team for large-scale
Well-Architected Mapping:
- Cost: Only justified for complex workloads; simpler apps should use ECS
- Performance: Fine-grained control; service mesh for advanced routing
- Reliability: Self-healing, automated failover, rolling updates
- Security: RBAC, network policies, pod security policies
Amazon EC2 (Elastic Compute Cloud)
When to Use:
- Legacy applications requiring specific OS or drivers
- High CPU/memory workloads (compute-intensive)
- Persistent connections (WebSockets, SSH, persistent databases)
- Need for dedicated hardware or licensing
When NOT to Use:
- Stateless, short-lived workloads (Lambda cheaper)
- Variable load without auto-scaling configured
- Greenfield modern applications (serverless preferred)
Tradeoffs:
- Pros: Full control, any OS/software, persistent storage (EBS), good for sustained workloads
- Cons: Operational burden (patching, scaling, security), must manage capacity, higher baseline cost
Cost Model:
- On-demand: $0.0116/hour (t3.micro) to $10+/hour (large instances)
- Reserved (1-year): ~40% discount; 3-year: ~65% discount
- Spot: Up to 90% off on-demand; interruption risk
- Good for: Baseline capacity (reserved); variable load (on-demand + spot)
Scaling Behavior:
- Auto Scaling Group scales by launch time (~1-2 min), not instant
- Gradual scaling (step scaling, target tracking) for stability
- Spot fleet for cost optimization
Operational Burden:
- High: OS patching, security groups, IAM roles, monitoring, backups (EBS snapshots)
- AMI management for consistent deployments
Well-Architected Mapping:
- Cost: Optimize with Reserved Instances + Savings Plans; use Savings Plans for flexibility across instance families
- Performance: Predictable performance; tune instance type for workload
- Reliability: Multi-AZ ASG; EBS volumes for persistence; snapshots for backup
- Security: Security groups, IAM instance roles, encrypted EBS
AWS Batch
When to Use:
- Large-scale batch processing (1000s of parallel jobs)
- Cost-optimized batch workloads (use spot instances)
- Scheduled jobs (daily, weekly ETL)
- Distributed processing without Spark/Hadoop complexity
When NOT to Use:
- Real-time or interactive workloads (ECS, Lambda)
- Complex data transformations (Glue, EMR)
Tradeoffs:
- Pros: Managed job queue, auto-scaling, spot instances for cost, simple job definitions
- Cons: Not real-time; latency for job startup; limited monitoring compared to ECS
Cost Model:
- Compute: Pay for EC2/Fargate underlying (Batch orchestration free)
- Spot instances: 70% off on-demand
- Example: 1,000 jobs × 1 hour each = 1,000 instance-hours; m5.large spot at ~$0.03/hour (≈70% below the $0.096 on-demand rate) ≈ $30 total
Scaling Behavior:
- Managed job queue with auto-scaling
- Jobs queued; Batch launches instances on demand
- Scale-down after job completion (if no backlog)
Operational Burden:
- Low-moderate: Define job definitions; submit jobs; Batch manages the rest
- Monitoring: CloudWatch metrics, job logs
Well-Architected Mapping:
- Cost: Excellent for batch workloads; spot instances reduce cost significantly
- Reliability: Automatic retries on failure; DLQs for failed jobs
- Performance: Parallel execution scales linearly
AWS Elastic Beanstalk
When to Use:
- Simple web applications or APIs
- Team comfortable with code, not infrastructure
- Want managed platform without container complexity
When NOT to Use:
- Complex microservices architectures (EKS)
- Highly customized infrastructure
- Needing fine-grained control
Tradeoffs:
- Pros: Managed; simple deployments (git push or CLI); handles scaling, load balancing, monitoring
- Cons: Less control than ECS; less powerful than EKS; can be "magic" (hard to debug)
Cost Model:
- Same as underlying EC2 + ALB + RDS (if used)
- No additional Beanstalk fee
- Good for: Simple apps where operational simplicity outweighs cost
Scaling Behavior:
- Auto Scaling Group scales EC2 instances
- Health checks monitor instances
Operational Burden:
- Very Low: Push code; Beanstalk handles deployment, scaling, logging
Well-Architected Mapping:
- Operational Excellence: Managed deployments; built-in monitoring; environment cloning
- Cost: Transparent cost; same as self-managed
- Performance: Good for small to medium workloads
3.2 Storage Services
Amazon S3 (Simple Storage Service)
When to Use:
- Object storage (files, media, logs, backups, data lake)
- Static website hosting
- Archive (Glacier tiers)
- Durability requirement (11 nines)
When NOT to Use:
- Block storage (use EBS)
- File system access patterns (use EFS)
- Database (use RDS, DynamoDB)
- Real-time read/write latency (milliseconds)
Tradeoffs:
- Pros: Infinitely scalable, highly durable, cheap at scale, multi-region replication, lifecycle policies
- Cons: Not a file system; per-request latency ~100-200ms; complex IAM/bucket-policy model (note: S3 has offered strong read-after-write consistency in all regions since December 2020)
Cost Model:
- Standard: $0.023 per GB/month
- Intelligent-Tiering: Standard-equivalent price for frequently accessed data, dropping toward ~$0.0125 per GB/month as objects move to infrequent tiers (small per-object monitoring fee applies)
- Glacier Instant: $0.004 per GB/month (instant retrieval)
- Glacier Flexible: $0.0036 per GB/month (1-12 hour retrieval)
- Requests: $0.005 per 1k PUT/COPY/POST, $0.0004 per 1k GET
- Data transfer out: $0.09 per GB (in-region free)
Scaling Behavior:
- Unlimited storage; auto-scales
- Throughput: 3,500 PUT/COPY/POST/DELETE per second per prefix; 5,500 GET per prefix
- For higher throughput, use different key prefixes (randomize first characters)
Operational Burden:
- Very Low: Fully managed; no provisioning, scaling, or patching
- Configuration: Bucket policies, CORS, versioning, lifecycle, replication
Well-Architected Mapping:
- Cost: Incredibly cheap at scale; Intelligent-Tiering for unknown access patterns; Glacier for archival
- Reliability: 11 nines durability; cross-region replication for disaster recovery; versioning for data protection
- Security: Bucket policies, ACLs, encryption (SSE-S3, SSE-KMS), Block Public Access, access logging
- Sustainability: Intelligent-Tiering reduces carbon footprint; archive old data to Glacier
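A boto3 sketch of the lifecycle guidance above: transition objects to cheaper tiers and then expire them. The bucket name, prefix, and day thresholds are illustrative:

```python
# Apply a lifecycle rule: tier objects down, then expire them.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-log-bucket",  # hypothetical
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-then-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }]
    },
)
```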
Amazon EBS (Elastic Block Store)
When to Use:
- Block storage for EC2 instances (root volume, data volumes)
- Persistent storage for databases
- High-performance workloads (SSD gp3, io2)
When NOT to Use:
- Object storage (use S3)
- Shared file system (use EFS)
- Archival (use Glacier)
Tradeoffs:
- Pros: Block-level, high performance, snapshots for backup, encryption
- Cons: Must be attached to EC2 instance; limited to single AZ (unless snapshot-replicated)
Cost Model:
- gp3 (general-purpose SSD): $0.08 per GB/month (includes 3k IOPS, 125 MB/s throughput)
- io2 (high I/O SSD): $0.125 per GB/month; $0.065 per IOPS provisioned
- st1 (throughput-optimized HDD): $0.045 per GB/month
- Snapshots: $0.05 per GB/month (incremental)
- Example: 100 GB gp3 = $8/month
Scaling Behavior:
- Size is fixed at provisioning and can be increased online, but never decreased (shrinking requires migrating data to a new, smaller volume)
- IOPS scale independently of size (gp3: up to 16k IOPS)
Operational Burden:
- Low: Fully managed; snapshots are automated if configured
- Monitoring: Volume metrics in CloudWatch
Well-Architected Mapping:
- Reliability: Snapshots for backup; replicate snapshots for cross-AZ recovery
- Performance: Choose instance-store for max IOPS (no persistence); EBS for balance
- Cost: gp3 cheaper than gp2 at same performance; delete unused snapshots
Amazon EFS (Elastic File System)
When to Use:
- Shared file system for multiple EC2 instances
- NFS access required
- Scaling file system across AZs
When NOT to Use:
- High-performance (EBS/instance-store faster)
- Archive (use S3)
- Windows (use FSx for Windows File Server)
Tradeoffs:
- Pros: Elastic (grow/shrink), multi-AZ, scalable across instances
- Cons: Latency higher than EBS; NFS protocol overhead; more expensive
Cost Model:
- Standard: $0.30 per GB/month
- One Zone: $0.16 per GB/month (single AZ, roughly half the cost of Standard)
- Provisioned throughput: ~$6.00 per MB/s-month (when bursting throughput is insufficient)
- Example: 100 GB multi-AZ = $30/month
Scaling Behavior:
- Auto-scales; no provisioning needed
- Throughput: Bursting (up to 500 MB/s for 100 GB); provisioned for higher sustained
Operational Burden:
- Low: Fully managed; no patching or replication
Well-Architected Mapping:
- Reliability: Data replicated across AZs automatically
- Performance: Throughput mode for parallelized workloads
- Cost: More expensive than S3 for archive; less expensive than maintaining NFS servers
Amazon FSx for Windows File Server / Lustre
When to Use:
- Windows-native file sharing (SMB/CIFS)
- Lustre (high-performance computing, machine learning)
When NOT to Use:
- NFS needed (use EFS)
- General object storage (use S3)
Tradeoffs:
- Pros: Fully managed; Windows-native; high performance
- Cons: More expensive; less flexible than self-managed file servers
Cost Model:
- Windows File Server: ~$0.13 per GB/month (SSD storage)
- Lustre: ~$0.14-$0.21 per GB/month depending on deployment type
Well-Architected Mapping:
- Performance: Low-latency file access; good for Windows environments
- Cost: Justified only for Windows workloads needing shared files
3.3 Database Services
Amazon RDS (Relational Database Service)
When to Use:
- Structured data with complex queries (SQL)
- ACID transactions required
- Existing relational database workloads (MySQL, PostgreSQL, Oracle, SQL Server, MariaDB)
- < 100 TB data
When NOT to Use:
- NoSQL access patterns (key-value; use DynamoDB)
- Unstructured data (use S3)
- Extreme scale (> 100 TB; use Redshift or specialized database)
- Real-time analytics (use Redshift, Athena, or OpenSearch)
Tradeoffs:
- Pros: Managed, automated backups, Multi-AZ failover, read replicas, encryption
- Cons: Limited to single-master writes (though read replicas help); must right-size; managed backups limited to 35 days
Cost Model:
- Instance: $0.017/hour (db.t3.micro) to $10+/hour (db.r6g.16xlarge)
- Storage: $0.10 per GB/month (gp2 SSD); $0.12 (io1 SSD)
- Backups: First backup free; additional storage $0.10 per GB/month
- Data transfer: $0.01 per GB (out-of-region)
- Example: db.t3.small (small app) = ~$50/month + storage
Scaling Behavior:
- Vertical scaling (change instance type; requires downtime or read-replica promotion)
- Read replicas for horizontal read scaling (async replication; eventual consistency)
- Auto Scaling for storage (grow up to max)
Operational Burden:
- Low-moderate: Backups, upgrades managed; must monitor CPU/disk; schema management
- Multi-AZ automatic failover for HA
Well-Architected Mapping:
- Reliability: Multi-AZ for automatic failover (2x cost); read replicas for DR and scaling
- Performance: Connection pooling; query optimization; appropriate indexes
- Cost: Right-size instance type; use Reserved Instances (1-year: ~31% off, 3-year: ~43% off); gp2 → gp3 for cost reduction
- Security: Encryption at-rest (KMS), in-transit (SSL); IAM database authentication; encrypted backups
Amazon Aurora (MySQL/PostgreSQL-compatible)
When to Use:
- High-throughput, low-latency SQL workloads
- Mission-critical applications requiring high availability
- Need for read scaling (15 read replicas)
- Up to 128 TB storage
When NOT to Use:
- Simple applications (RDS enough)
- Cost-sensitive (Aurora 2-3x RDS cost)
- Oracle workloads (Aurora is MySQL/PostgreSQL-compatible only; use RDS for Oracle)
Tradeoffs:
- Pros: 5x faster than MySQL, 3x faster than PostgreSQL; 15 read replicas; auto-scaling storage; distributed architecture
- Cons: More expensive; multi-master is a niche option limited to specific Aurora MySQL versions; Aurora Serverless has cold starts
Cost Model:
- DB instance: $0.15/hour (db.t3.small) to $3.26/hour (db.r6g.16xlarge); per-hour rates run roughly 20% above comparable RDS MySQL instances
- Storage: $0.10 per GB/month (only pay for used; auto-scaling)
- Example: 100 GB, db.r6g.large (1 writer + 2 readers) = ~$250/month (better ROI for scale)
Scaling Behavior:
- Read replicas scale independently (up to 15)
- Aurora Serverless scales for variable load; v1 (aurora-mysql) can pause when idle
- Storage auto-scales; no disk-full risk
Operational Burden:
- Very Low: Managed; replication automatic; backups automated (35-day retention)
- Multi-AZ automatic; cross-region read replica option
Well-Architected Mapping:
- Reliability: Multi-AZ primary + read replicas; automatic failover; cross-region DR
- Performance: 5x throughput gains; up to 100k write capacity; 500k read capacity
- Cost: Higher upfront; lower per-transaction cost at scale; Serverless variant for variable loads
- Sustainability: Shared compute for read replicas (efficient); auto-pause for low utilization
Amazon DynamoDB
When to Use:
- Key-value, document, or time-series data
- Predictable access patterns (query by partition key)
- Variable or bursty load (on-demand pricing)
- Millisecond latency required
- Serverless architecture
When NOT to Use:
- Complex queries across unrelated attributes (RDS, Redshift)
- Relational integrity (RDS)
- Full-text search (OpenSearch)
- Multi-item transactions beyond DynamoDB's limits (TransactWriteItems supports up to 100 items per transaction)
Tradeoffs:
- Pros: Fully managed, auto-scales, millisecond latency, serverless, global tables
- Cons: Query flexibility limited (must know partition key); eventual consistency for global tables; complex secondary indexes
Cost Model:
- Provisioned: Read: $0.00013 per RCU-hour; Write: $0.00065 per WCU-hour
- On-demand: $1.25 per 1M write requests; $0.25 per 1M read requests
- Storage: $0.25 per GB/month
- Example: 100 GB, on-demand, 100k write + 500k read/month = $0.125 + $0.125 + $25 = $25.25/month
- Breakeven: Provisioned cheaper at > 1M writes/day or > 3M reads/day
Scaling Behavior:
- Provisioned: Set RCU/WCU; auto-scales up within limits; scales down with delay (conservative)
- On-demand: Scales instantly; no capacity planning; can throttle briefly if traffic suddenly exceeds about double the previous peak
- Global Tables: Replicate to any region; cross-region replication is eventually consistent (last-writer-wins conflict resolution)
Operational Burden:
- Very Low: Fully managed; no patching, replication, or backups to configure
- Monitoring: Consumed capacity, throttling, latency in CloudWatch
Well-Architected Mapping:
- Cost: On-demand for variable; provisioned for predictable; reserved capacity for stable baselines
- Reliability: Automatic backups (Point-in-Time Recovery); Global Tables for multi-region
- Performance: Millisecond latency; partition key design critical to avoid hot partitions
- Security: Encryption at-rest (KMS), IAM fine-grained access, TTL for automatic deletion
- Sustainability: Managed service; consolidate workloads
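To make the partition-key guidance concrete, here is a minimal boto3 sketch against a hypothetical orders table with customer_id as partition key and order_date as sort key; querying by partition key is what keeps reads at single-digit milliseconds.

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")  # hypothetical table

# Fetch a customer's 2024 orders, newest first. The partition key confines
# the query to one item collection, so cost and latency stay predictable.
response = table.query(
    KeyConditionExpression=(
        Key("customer_id").eq("c-123") & Key("order_date").begins_with("2024-")
    ),
    ScanIndexForward=False,  # descending sort-key order
    Limit=20,
)
for item in response["Items"]:
    print(item["order_date"], item.get("total"))
```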
Amazon Redshift
When to Use:
- Data warehousing (TB-PB scale)
- Complex OLAP queries (star schema, large joins)
- BI/analytics workloads
- Time-series analytics (rollups, trends)
When NOT to Use:
- OLTP/transactional (RDS, DynamoDB)
- Ad-hoc queries (Athena cheaper)
- Real-time ingestion (Kinesis better)
Tradeoffs:
- Pros: Petabyte-scale, fast complex queries, mature BI integration, Spectrum for querying S3
- Cons: Requires provisioning and management; node failures can degrade or pause the cluster while replacements come online
Cost Model:
- dc2.large: $1.26/hour (~$900/month)
- ra3.xlplus (managed storage): $4.02/hour + $0.008 per GB/month for managed storage
- Example: 100 GB dc2.large = ~$900/month (good for sustained warehouse workload)
Scaling Behavior:
- Vertical (resize cluster type) or horizontal (add nodes)
- Resizing: classic resize requires downtime; elastic resize completes in minutes with only a brief pause
Operational Burden:
- Moderate: Cluster maintenance, node replacement, VACUUM/ANALYZE
- Spectrum: Query S3 without loading (larger effective warehouse without cluster growth)
Well-Architected Mapping:
- Cost: Fixed monthly cost; break-even against Athena at ~1 TB queries/month; reserved instances available
- Performance: Complex OLAP queries; star schema optimization; Spectrum for external data
- Reliability: Snapshots for backup; cross-region snapshot copy for DR
- Security: Encryption at-rest, in-transit; IAM; audit logging (via CloudWatch/S3)
Amazon OpenSearch (formerly Elasticsearch)
When to Use:
- Full-text search (logs, documents)
- Time-series analytics (logs with timestamps)
- Real-time dashboards (Kibana visualization)
- Vector search (embeddings for ML, RAG)
When NOT to Use:
- Traditional OLAP (Redshift)
- Transactional (RDS)
- Pure archival (S3, Glacier)
Tradeoffs:
- Pros: Fast full-text search, powerful aggregations, real-time visualization, vector search for ML
- Cons: Requires cluster; complex cluster configuration (node types, shard allocation); eventual consistency
Cost Model:
- Single node (t3.small): $0.123/hour (~$90/month)
- Multi-node (3 data nodes, r5.large): $1.26/hour (~$920/month)
- Storage: Included in node cost; scales with node type
- Example: Small logging cluster (3 nodes) ~$900/month + data retention
Scaling Behavior:
- Horizontal scaling (add nodes); shard allocation balances data
- Auto-scaling available but requires careful configuration
- Manual index rollover for time-series (daily indices)
Operational Burden:
- Moderate: Shard allocation, index management, cluster health monitoring
- Snapshot repository for backup (S3)
Well-Architected Mapping:
- Cost: Fixed; justified for search/logging volume > 1TB/month
- Performance: Real-time search; aggregations on large datasets
- Reliability: Snapshots to S3; replica shards for HA; cross-region replication
- Security: Encryption, IAM, fine-grained access control via plugins
Amazon Timestream (time-series database)
When to Use:
- Metrics, sensor data, stock prices (time-series)
- High-volume, append-only workloads
- Automatic retention policies
When NOT to Use:
- General analytics (Redshift, Athena)
- Complex queries (OpenSearch better for logs)
Cost Model: ~$0.50 per million 1 KB writes; ~$0.01 per GB scanned by queries
3.4 Integration & Messaging Services
Amazon SQS (Simple Queue Service)
When to Use:
- Async decoupling (producer → queue → consumer)
- Buffer burst traffic (queue absorbs spikes)
- Reliable delivery guarantee
- Work scheduling (Lambda pulling messages)
When NOT to Use:
- Real-time pub/sub (SNS)
- Complex routing (EventBridge)
- Immediate delivery (SNS faster)
Tradeoffs:
- Pros: Durable queue, at-least-once delivery, DLQ for failed messages, simple FIFO option
- Cons: Polling model (not push); no broad fanout (one consumer per message)
Cost Model:
- Standard SQS: $0.40 per 1M requests
- FIFO: $0.50 per 1M requests; adds exactly-once deduplication and message-group ordering
- Example: 1M messages/month = $0.40 (essentially free at small scale)
Scaling Behavior:
- Unlimited message count; auto-scales
- Message retention: 1 minute to 14 days (default 4 days)
- Consumers poll (or Lambda Event Source Mapping triggers Lambda)
Operational Burden:
- Very Low: Fully managed; configure retention, visibility timeout, DLQ
Well-Architected Mapping:
- Reliability: Durable queue; DLQ for failed messages; at-least-once delivery (idempotent processing required)
- Cost: Extremely cheap; pay per 1M requests
- Performance: Polling latency 0-20 sec (depends on ReceiveMessageWaitTimeSeconds)
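A minimal consumer sketch showing the long-polling behavior noted above; the queue URL is a placeholder, and processing must be idempotent because delivery is at-least-once.

```python
import json
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # placeholder

while True:
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,    # long polling: fewer empty receives, lower cost
        VisibilityTimeout=60,  # seconds to process before redelivery
    )
    for msg in resp.get("Messages", []):
        payload = json.loads(msg["Body"])
        # ... idempotent processing here ...
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```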
Amazon SNS (Simple Notification Service)
When to Use:
- Fanout (one message → many consumers)
- Pub/Sub pattern
- Notifications (email, SMS, push)
- Integration with SQS (SNS → SQS for durability)
When NOT to Use:
- Ordered delivery (FIFO not as robust as SQS FIFO)
- Complex routing (EventBridge)
- Message history/replay needed (Kinesis)
Tradeoffs:
- Pros: Simple pub/sub, instant delivery, fanout, many targets
- Cons: No message history; no durability guarantees (best-effort delivery); no replay
Cost Model:
- $0.50 per 1M publishes
- Notifications: SMS priced per message (varies by country); email/HTTP delivery variable
- Example: 1M publishes → $0.50/month; negligible at small scale
Scaling Behavior:
- Unlimited subscribers; auto-scales
- Delivery retries with exponential backoff (retry policy varies by protocol; HTTP endpoints default to a few attempts)
Operational Burden:
- Very Low: Fully managed; configure subscriptions, filters
Well-Architected Mapping:
- Reliability: Couple with SQS for durability (SNS → SQS → Consumer)
- Cost: Very cheap
- Performance: Instant delivery
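A short sketch of the SNS → SQS durability pattern recommended above, assuming a hypothetical topic and queue; the queue's access policy granting sns.amazonaws.com permission to send is omitted for brevity.

```python
import boto3

sns = boto3.client("sns")

# Fan out the topic to a queue so messages survive consumer downtime.
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:order-events",       # placeholder
    Protocol="sqs",
    Endpoint="arn:aws:sqs:us-east-1:123456789012:order-email-queue",  # placeholder
    Attributes={"RawMessageDelivery": "true"},  # skip the SNS JSON envelope
)
```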
Amazon EventBridge
When to Use:
- Event-driven architecture (100+ event sources)
- Rule-based routing (complex filtering)
- Multi-target fanout (up to 5 targets per rule, across dozens of target types)
- Integration with AWS services and SaaS apps
- Event replay and archiving
When NOT to use:
- Simple queue (SQS)
- Just fanout (SNS)
- High-throughput streaming (Kinesis)
Tradeoffs:
- Pros: Flexible routing, extensive targets, schema registry, event replay
- Cons: More complex than SNS/SQS; slightly higher latency; limited throughput vs. Kinesis
Cost Model:
- $0.35 per 1M events
- Archive retention: $0.023 per GB/month
- Example: 1M events/month = $0.35 (negligible)
Scaling Behavior:
- Scales to thousands of events per second per account by default (soft limits can be raised)
- Auto-scales
Operational Burden:
- Low: Define rules; manage targets
Well-Architected Mapping:
- Reliability: Event replay from archive; DLQ for failed targets
- Cost: Extremely cheap
- Performance: ~50-100ms latency; Kinesis faster for streaming
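A minimal sketch of rule-based routing, assuming a hypothetical custom event source and Lambda target; the pattern syntax, including numeric matching, is standard EventBridge content filtering.

```python
import json
import boto3

events = boto3.client("events")

# Route only "order.created" events with amount > 100 to a notifier.
events.put_rule(
    Name="large-orders",
    EventPattern=json.dumps({
        "source": ["com.example.orders"],  # hypothetical source
        "detail-type": ["order.created"],
        "detail": {"amount": [{"numeric": [">", 100]}]},
    }),
    State="ENABLED",
)
events.put_targets(
    Rule="large-orders",
    Targets=[{
        "Id": "notify",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:notify",  # placeholder
    }],
)
```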
Amazon Kinesis Data Streams
When to Use:
- Streaming (continuous data flow)
- Ordered by partition (shard)
- Real-time processing (seconds latency)
- 24-hour message history by default (replay; extendable up to 365 days)
- High throughput (100k+ events/sec)
When NOT to Use:
- Simple queuing (SQS/SNS)
- Low-volume events (EventBridge cheaper)
- Long message history (Kafka/MSK)
Tradeoffs:
- Pros: Real-time streaming, partitioned for ordering/parallelization, replay capability
- Cons: Requires shard management or on-demand pricing; higher cost than SQS
Cost Model:
- Provisioned: $0.015 per shard-hour; $0.014 per 1M PUT payload units
- On-demand: $0.047 per GB ingested; $0.315 per 1M GetRecords
- Example: 1M events/hour (1 KB each) on-demand = 1 GB/hour × $0.047 ≈ $35/month; provisioned (1 shard ≈ $11/month plus PUT units) is cheaper for this steady load
Scaling Behavior:
- Provisioned: Auto-scaling by shard count
- On-demand: Scales instantly
- Throughput: 1 MB/sec ingest and 2 MB/sec egress per shard (provisioned); billed per GB on-demand
Operational Burden:
- Low-moderate: Shard management, consumer lag monitoring, partition key design
Well-Architected Mapping:
- Cost: Provisioned for baseline; on-demand for variable
- Performance: Real-time streaming, 24-hour replay, partition ordering
- Reliability: Replicated across AZs; consumer checkpointing for fault tolerance
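A minimal producer sketch; the stream name is a placeholder, and the high-cardinality partition key (device ID) illustrates how to avoid the hot-shard problem mentioned above.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

record = {"device_id": "d-42", "temp_c": 21.7}
kinesis.put_record(
    StreamName="sensor-events",        # placeholder stream
    Data=json.dumps(record).encode(),
    PartitionKey=record["device_id"],  # spreads load across shards
)
```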
Amazon MSK (Managed Streaming for Apache Kafka)
When to Use:
- Kafka ecosystem expertise available
- Cross-cloud/on-prem Kafka integration needed
- Complex streaming logic (Kafka Streams, Flink)
- Larger throughput than Kinesis economically viable
When NOT to Use:
- AWS-only workloads (Kinesis simpler)
- Low-volume (EventBridge, SQS cheaper)
- Need for hands-off (Kinesis more managed)
Tradeoffs:
- Pros: Kafka ecosystem, portability, higher throughput at scale
- Cons: More operational burden; cluster management; higher baseline cost
Cost Model:
- Broker cost: $0.15 per hour per broker (3 brokers ≈ $330/month baseline)
- Storage: $0.10 per GB/month
- Data transfer: $0.02 per GB (out-region)
- Example: 3 brokers, 100 GB storage ≈ $330 + $10 = $340/month
Scaling Behavior:
- Broker count and storage size scales
- Storage auto-scaling available; broker count changes are a manual (but online) operation
Operational Burden:
- High: Broker configuration, topic management, consumer group coordination, monitoring
Well-Architected Mapping:
- Cost: Fixed baseline; good for high-volume/long-term commitments
- Performance: High throughput; complex streaming logic
- Reliability: Multi-AZ; broker redundancy; replication factor
AWS Step Functions
When to Use:
- Multi-step workflows with conditional logic
- Long-running processes (minutes to days)
- Error handling and retries needed
- Visual workflow definition
- Orchestrating async services
When NOT to Use:
- Simple task execution (Lambda sufficient)
- Strict real-time (latency implications)
Tradeoffs:
- Pros: Visual workflow, error handling, retry policies, state persistence, waiting
- Cons: State transitions cost money; additional latency per step; limited native support for some use cases
Cost Model:
- Standard: $0.000025 per state transition
- Express: $0.000001667 per invocation + $0.000000208 per GB-second
- Example: 1M workflows, 10 steps = 10M transitions = $250/month
Scaling Behavior:
- Millions of concurrent executions
- Automatic
Operational Burden:
- Low: Define state machine in JSON; Step Functions handles orchestration
Well-Architected Mapping:
- Reliability: Automatic retries, catch/throw errors, compensation logic
- Cost: Extremely cheap for state transitions; Express mode for high-volume
- Performance: Step execution ~0.5-2 sec latency; suited for async workflows, not real-time
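A minimal sketch of a two-step state machine with the retry/catch behavior described above, expressed in Amazon States Language and registered via boto3; the function ARNs and execution role are placeholders.

```python
import json
import boto3

definition = {
    "StartAt": "Validate",
    "States": {
        "Validate": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate",
            "Retry": [{
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2.0,
            }],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Failed"}],
            "Next": "Fulfill",
        },
        "Fulfill": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:fulfill",
            "End": True,
        },
        "Failed": {"Type": "Fail"},
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="order-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/sfn-exec",  # placeholder role
)
```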
3.5 Networking Services
Amazon VPC (Virtual Private Cloud)
When to Use:
- Almost every workload (default)
- Network isolation required
- Custom IP addressing
- Network ACLs, security groups
When NOT to Use:
- Rarely; VPC is foundational (only fully managed serverless and edge services run outside a VPC you own)
Architecture Pattern:
- Public subnet: IGW for internet; ALB, NAT Gateway
- Private subnet: No direct internet; NAT Gateway for outbound
- Database subnet: Isolated; only accessible from app tier
Scaling Behavior:
- Elastic; auto-scales
- VPC can have max 5 CIDR blocks by default (quota can be raised)
Operational Burden:
- Moderate: Design subnets, security groups, route tables, NAT Gateway (hourly cost)
Well-Architected Mapping:
- Security: Network isolation; security groups per tier; NACLs for stateless filtering
- Reliability: Multi-AZ subnets; redundant NAT Gateways
- Cost: Free VPC; pay for NAT Gateway ($32/month per AZ), data transfer
Application Load Balancer (ALB) / Network Load Balancer (NLB)
ALB (Layer 7):
- When to use: HTTP/HTTPS, microservices (path/hostname routing), WebSocket
- Cost: $22/month base + $0.006 per LCU (load balancer capacity unit)
- Latency: adds a few milliseconds of overhead
NLB (Layer 4):
- When to use: Extreme throughput (millions of requests/sec), ultra-low latency, non-HTTP protocols (TCP, UDP)
- Cost: $32/month base + $0.006 per LCU (higher pricing)
- Latency: sub-millisecond overhead
Well-Architected Mapping:
- Reliability: Health checks; automatic failover; cross-AZ
- Performance: ALB for flexibility; NLB for extreme throughput
- Cost: $22-32/month baseline; shared across multiple services if possible
Amazon API Gateway
When to Use:
- REST or GraphQL APIs
- Rate limiting, authentication (API keys, OAuth)
- Request/response transformation
- Caching
When NOT to Use:
- Internal service-to-service (VPC Endpoints)
- Extreme throughput (NLB better)
Cost Model:
- $3.50 per 1M requests (regional)
- $0.60 per 1M for WebSocket API
- Data transfer: $0.09 per GB out
Scaling Behavior:
- Auto-scales; no provisioning
Operational Burden:
- Low: Define API, configure integrations, set up throttling
Well-Architected Mapping:
- Security: API keys, OAuth, request validation, WAF integration
- Reliability: Throttling prevents cascading failures
- Performance: Caching reduces backend load
- Cost: Pay-per-request; 1M requests ≈ $3.50/month
Amazon CloudFront
When to Use:
- Distribute static content globally
- Cache API responses (cache headers)
- DDoS protection (Shield Standard included)
- Origin Shield for cache efficiency
When NOT to Use:
- Single-region, low-traffic workloads
Cost Model:
- $0.085 per GB (varies by region, USA cheapest)
- Request: $0.01 per 10k
Scaling Behavior:
- Auto-scales globally; no provisioning
Operational Burden:
- Very Low: Configure origin, cache behavior, invalidation
Well-Architected Mapping:
- Performance: Global latency reduction (users served from nearest edge); significant for global audiences
- Cost: Minimal request cost; high if transferring large GB (offset by origin load reduction)
- Security: DDoS protection, WAF, Origin Shield for burst traffic
- Sustainability: Reduced origin load; distributed edge computing
AWS Transit Gateway
When to Use:
- Hub-and-spoke connectivity (multiple VPCs, on-prem)
- Simplified multi-VPC architecture
- On-premises integration via Direct Connect
When NOT to use:
- Single VPC (unnecessary)
Cost Model:
- $0.05 per hour (~$36/month)
- $0.02 per GB processed
Well-Architected Mapping:
- Reliability: Centralized connectivity; simplified failover
- Cost: Justified for 3+ VPCs or multi-region
AWS Direct Connect
When to Use:
- Dedicated network from on-premises to AWS
- Consistent bandwidth
- Large data transfers (cheaper than internet data transfer)
When NOT to Use:
- Small, intermittent connections (VPN sufficient)
Cost Model:
- $0.30 per hour (~$218/month)
- $0.02 per GB output
Well-Architected Mapping:
- Reliability: Dedicated connection; predictable performance
- Cost: Justified for > 10 TB/month transfers
3.6 Security Services
AWS IAM (Identity & Access Management)
When to use: Always
- Every workload needs IAM roles and policies
- Fine-grained permissions (least privilege)
- Cross-account access via roles
Cost Model: Free
Well-Architected Mapping:
- Security: Least privilege; assume role model; remove console access
AWS KMS (Key Management Service)
When to use:
- Encrypt sensitive data (PII, financial, secrets)
- At-rest encryption (S3, RDS, EBS)
Cost Model:
- $1.00 per month per key
- $0.03 per 10k requests
Well-Architected Mapping:
- Security: Encryption; audit trail (CloudTrail); key rotation
- Compliance: Required for regulated data (PII, HIPAA, PCI)
AWS Secrets Manager
When to use:
- Store database credentials, API keys, tokens
- Automatic rotation
Cost Model:
- $0.40 per secret per month
- $0.05 per 10k API calls
Well-Architected Mapping:
- Security: No hardcoded credentials; rotation; audit
Amazon GuardDuty
When to use:
- Threat detection
- Continuous monitoring for malicious activity
- Integration with Security Hub
Cost Model:
- $1.00 per 1M events analyzed per month
Well-Architected Mapping:
- Security: Automated threat detection; alerts; findings
AWS WAF (Web Application Firewall)
When to use:
- Protect web applications from attacks (SQL injection, XSS, bot attacks)
- Integration with CloudFront, ALB, API Gateway
Cost Model:
- $5.00 per web ACL plus $1.00 per rule per month
- $0.60 per 1M requests
Well-Architected Mapping:
- Security: Application-layer protection; rate limiting; IP blocking
3.7 Observability Services
Amazon CloudWatch
When to use: Every workload
- Metrics (CPU, memory, custom)
- Logs (application, system)
- Alarms (trigger auto-scaling, SNS)
- Dashboards (visualization)
Cost Model:
- Logs: $0.50 per GB ingested; $0.03 per GB stored
- Metrics: $0.30 per custom metric per month
- Alarms: $0.10 per alarm per month
Well-Architected Mapping:
- Operational Excellence: Observability; alerts; dashboards
- Reliability: Alarms trigger auto-scaling; identify bottlenecks
- Cost: Identify over-provisioned resources; optimize
AWS X-Ray
When to use:
- Distributed tracing across microservices
- Identify latency bottlenecks
- Understand service dependencies
Cost Model:
- $5.00 per 1M recorded traces
- $0.50 per 1M retrieved traces
Well-Architected Mapping:
- Operational Excellence: Trace requests end-to-end; identify latency
- Reliability: Understand failure propagation
Amazon Managed Prometheus / Grafana
When to use:
- Kubernetes metrics (via Prometheus scraping)
- Long-term metric storage
- Custom visualization (Grafana)
Cost Model:
- Prometheus: $0.90 per 10M ingested samples (first pricing tier)
- Grafana: $9.00 per active editor per month (viewers less)
Well-Architected Mapping:
- Operational Excellence: Kubernetes-native monitoring
- Cost: Justified mainly for EKS and other Prometheus-native workloads
Section 4: Architecture Pattern Library (Domain-Independent)
Every system, regardless of domain, is composed of these reusable patterns.
4.1 CRUD Backend
Definition: Create, Read, Update, Delete operations on a data model.
Pattern:
Client (mobile, web, API) → API Gateway → Lambda → RDS/DynamoDB
When to use: Always (fundamental pattern)
Well-Architected:
- Security: IAM roles; request validation; encryption
- Reliability: Error handling; retry logic; transaction handling
- Performance: Caching (ElastiCache); query optimization; connection pooling
Cost Optimization:
- Use DynamoDB for variable load
- RDS with read replicas for read-heavy
- Cache frequently accessed data
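A minimal Lambda handler sketch of this pattern, assuming an API Gateway proxy integration and a hypothetical DynamoDB items table keyed by id.

```python
import json
import boto3

table = boto3.resource("dynamodb").Table("items")  # hypothetical table

def handler(event, context):
    method = event["httpMethod"]
    if method == "POST":  # Create
        item = json.loads(event["body"])
        table.put_item(Item=item)
        return {"statusCode": 201, "body": json.dumps(item)}
    if method == "GET":   # Read
        resp = table.get_item(Key={"id": event["pathParameters"]["id"]})
        return {"statusCode": 200, "body": json.dumps(resp.get("Item", {}))}
    return {"statusCode": 405, "body": "method not allowed"}
```

Update and Delete follow the same shape with update_item and delete_item.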
4.2 Event-Driven Orchestration
Definition: Components communicate via asynchronous events; central coordinator (Step Functions) directs workflow.
Pattern:
```
Trigger → Step Functions → [Lambda1, ECS2, Batch3, Lambda4] → EventBridge → Notifications
                 ↓
        DynamoDB (state storage)
```
When to use: Multi-step business workflows (order processing, loan approvals, ML pipelines)
Well-Architected:
- Reliability: Automatic retries; compensation logic; DLQ for failures
- Operational Excellence: CloudWatch logs per step; alerts on failure
- Cost: Pay per state transition; extremely cheap
4.3 Saga Pattern (Distributed Transactions)
Definition: Long-running transaction across multiple services; compensating transactions for rollback.
Pattern:
```
Service A (order) → EventBridge → Service B (payment) → EventBridge → Service C (fulfillment)
        ↓
If payment fails: Compensate Service A (cancel order)
```
When to use: Multi-service transactions without distributed locks
Well-Architected:
- Reliability: Compensation logic; idempotent operations
- Operational Excellence: Audit trail (events); replayable
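A language-level sketch of the compensation idea (deliberately not tied to a specific AWS API): each step pairs an action with a compensator, and completed steps are undone in reverse order when a later step fails.

```python
def run_saga(steps):
    """steps: list of (action, compensate) callables."""
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()  # compensators must be idempotent
        raise

run_saga([
    (lambda: print("create order"), lambda: print("cancel order")),
    (lambda: print("charge card"),  lambda: print("refund card")),
    (lambda: print("ship package"), lambda: print("recall shipment")),
])
```

In the AWS pattern above, the compensators would be Lambda functions invoked via EventBridge or a Step Functions Catch branch.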
4.4 Fan-Out / Fan-In
Definition: One event triggers multiple parallel processors; results aggregated.
Pattern (fan-out):
```
SNS/EventBridge → Lambda1 (send email)
               → Lambda2 (update metrics)
               → Lambda3 (trigger analytics)
```
Pattern (fan-in):
```
Lambda1 \
Lambda2  → Step Functions → Lambda4 (aggregate results)
Lambda3 /
```
When to use: Parallel processing; decoupled consumers
Well-Architected:
- Reliability: Fan-out branches run independently; fan-in aggregates results (handle partial failures)
- Performance: Parallel execution reduces latency
4.5 Data Lakehouse (Medallion Architecture)
Definition: Multi-tier data organization (bronze → silver → gold)
Pattern:
```
Source (API, DB) → S3 Bronze (raw) → Glue (transform) → S3 Silver (cleaned)
        ↓
Lake Formation (govern)
        ↓
S3 Gold (curated)
        ↓
Redshift / Athena / QuickSight
```
When to use: Centralized data platform with multi-consumer access
Well-Architected:
- Operational Excellence: Glue Data Catalog; Lake Formation permissions
- Security: Encryption; access control; audit logging
- Cost: S3 + Athena for ad-hoc; Redshift for BI; Glacier for archive
4.6 CQRS (Command Query Responsibility Segregation)
Definition: Separate read and write models; write commands to event store; query from read-optimized views.
Pattern:
```
Write Path: Command → Aggregate → Event Store
Read Path:  Event Store → Denormalize → Read View → Query (fast, optimized)
```
When to use: Complex domain logic; audit requirements; multiple read views
Well-Architected:
- Reliability: Event sourcing provides audit trail; replay for recovery
- Performance: Read model optimized for queries
- Cost: DynamoDB for event store + read model; cheap at scale
4.7 Command Pipeline (Batch Job Chain)
Definition: Sequential batch jobs; each transforms data and passes to next
Pattern:
```
Schedule → Glue1 (extract) → S3 (temp) → Glue2 (transform) → S3 (output) → Redshift (load)
        ↓
Step Functions (orchestration)
```
When to use: ETL pipelines; scheduled data processing
Well-Architected:
- Operational Excellence: Step Functions for orchestration; error handling
- Cost: Glue on-demand for variable workloads; schedule off-peak
- Reliability: Checkpointing for restartable jobs; DLQ for failed steps
4.8 Agent-Tool Execution (Agentic AI)
Definition: AI agent reasons over user query; selects and executes tools; iterates until complete.
Pattern:
```
User Query → Bedrock Agent (reasoning) → Tool Selection
    ├── Lambda1 (query DB)
    ├── Lambda2 (call API)
    ├── DynamoDB (read state)
    └── OpenSearch (semantic search)
        ↓
Feedback loop (iterate if needed)
        ↓
Final response to user
```
When to use: Autonomous automation; conversational AI; decision support
Well-Architected:
- Security: Tool access control; query validation; PII redaction
- Reliability: Fallback tools; error recovery
- Observability: Trace tool calls; audit agent decisions
4.9 Streaming Ingestion (Real-Time Data Pipeline)
Definition: Continuous data ingestion; real-time processing; multiple sinks.
Pattern:
```
Sensors/APIs → Kinesis → Analytics (windowed agg) → DynamoDB (current state)
        ↓
Lambda (enrichment)
        ↓
Multiple sinks:
    ├── DynamoDB (dashboard)
    ├── S3 (cold storage)
    ├── OpenSearch (search/alerting)
    └── SNS (alerts)
```
When to use: Real-time analytics; alerting; monitoring
Well-Architected:
- Performance: Partition by shard; parallel processing
- Cost: On-demand Kinesis for variable; provisioned for baseline
- Reliability: Consumer checkpointing; DLQ for failed events
4.10 Multi-Region Active-Active
Definition: Same workload deployed in multiple regions; users routed to nearest; data replicated.
Pattern:
```
Global User → Route 53 (geolocation routing)
    → Region 1 (ALB → ECS → RDS Aurora Global)
    → Region 2 (ALB → ECS → RDS Aurora Global)
    → Region 3 (ALB → ECS → RDS Aurora Global)
```
When to use: Global applications; disaster recovery; low-latency for globally distributed users
Well-Architected:
- Reliability: Automatic failover; regional disaster recovery
- Performance: Users served from nearest region
- Cost: 3x infrastructure cost; justified for critical, global workloads
- Sustainability: Distributed load
Section 5: AWS Well-Architected Integration (Mandatory)
Every architectural decision must align with the six pillars of the AWS Well-Architected Framework. This section maps decisions to pillars and provides measurable indicators.
Pillar Alignment Template
For every major decision (service choice, architecture pattern, data design), answer:
- Operational Excellence: Can teams operate this? Is it observable? Are procedures clear?
- Security: Are data and access protected? Is compliance addressed?
- Reliability: Can it recover from failure? What is RTO/RPO?
- Performance Efficiency: Does it meet latency/throughput targets? Is it optimized?
- Cost Optimization: Is it cost-effective? Are there cheaper alternatives?
- Sustainability: Does it minimize energy/carbon? Is it resource-efficient?
5.1 Operational Excellence Pillar
Design Principles:
- Organize teams around business outcomes
- Implement observability for actionable insights
- Safely automate where possible
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from operational events
- Use managed services
Key Questions:
- Are teams organized to own their systems end-to-end?
- Is every component observable (metrics, logs, traces)?
- Are deployments automated and safe (canary, blue-green)?
- Can operators quickly diagnose and respond to issues?
- Are runbooks and procedures documented and tested?
Best Practices by Service:
| Service | Observability | Automation | Disaster Response |
|---|---|---|---|
| Lambda | CloudWatch Logs, X-Ray | SAM, CDK, CodePipeline | DLQ, retries, reserved concurrency |
| ECS | CloudWatch metrics, Container Insights | CodeDeploy, ECS task placement | Service health checks, auto-restart |
| RDS | Enhanced monitoring, Performance Insights, CloudWatch | AWS Config, automated backups | Multi-AZ failover, read replicas |
| DynamoDB | CloudWatch metrics, X-Ray tracing, TTL monitoring | Point-in-time recovery, backup service | Global tables, on-demand scaling |
| S3 | Access logging, CloudTrail, CloudWatch metrics | Lifecycle policies, inventory, replication | Versioning, cross-region replication, Glacier |
| Kinesis | Iterator age, consumer lag, CloudWatch metrics | Auto-scaling shards, Lambda event mapping | Shard backup via S3, 24-hour retention |
Measurable Indicators:
- MTTR (Mean Time To Recovery): Target < 15 min for P1 incidents
- Change failure rate: < 10% of deployments cause incidents
- Deployment frequency: Daily or more
- On-call fatigue: < 1 page/week per engineer
Anti-Patterns:
- ❌ Manual deployments; no version control
- ❌ No monitoring; discovering issues from customers
- ❌ Monolithic applications; all-or-nothing deployments
- ❌ Undocumented procedures; tribal knowledge
5.2 Security Pillar
Design Principles:
- Implement strong identity foundation (least privilege)
- Maintain traceability (audit all actions)
- Protect data in transit and at-rest
- Detect and investigate security events
- Protect infrastructure
- Prepare for security events
Key Questions:
- Are all principals (users, roles, services) authenticated?
- Are permissions least-privilege (does role have exactly needed permissions)?
- Is all data encrypted (transit TLS, at-rest KMS)?
- Is access logged and audited?
- Are data classification and sensitivity levels defined?
- Are compliance requirements met (HIPAA, PCI, GDPR)?
Best Practices by Tier:
| Layer | Best Practice | Implementation |
|---|---|---|
| Identity & Access | Least privilege, role-based access | IAM policies, assume roles across accounts, MFA for console |
| Infrastructure | Network isolation, security groups | VPC, security groups (stateful), NACLs (stateless), WAF for web |
| Data Protection | Encrypt all data | KMS at-rest (S3, RDS, EBS), TLS in-transit, Secrets Manager for credentials |
| Detection | Monitor and alert on anomalies | CloudTrail (API audit), GuardDuty (threats), Config (compliance), Security Hub (aggregation) |
| Incident Response | Automated and manual playbooks | CloudWatch alarms → SNS → Lambda/email; IAM roles for responders |
Measurable Indicators:
- 100% of data encrypted at-rest and in-transit
- All API calls logged to CloudTrail
- Zero exposed credentials (Secrets Manager, no hardcoding)
- Compliance audit pass rate: 100% on critical controls
- Incident detection time: < 5 min for automated, < 1 hour for manual
Anti-Patterns:
- ❌ Hardcoded credentials in code/config
- ❌ Public S3 buckets (unless intentional)
- ❌ Over-permissive IAM roles (e.g., AdministratorAccess)
- ❌ No encryption; no audit logging
- ❌ Manual credential rotation
5.3 Reliability Pillar
Design Principles:
- Automatically recover from failure
- Test failure scenarios
- Stop guessing capacity
- Manage change via automation
Key Questions:
- Can the system recover automatically from failures?
- Have failure scenarios been tested (chaos engineering)?
- Is capacity provisioned to handle peaks without manual intervention?
- Are changes made safely (automated rollback)?
Best Practices by Scenario:
| Scenario | Strategy | Implementation |
|---|---|---|
| Service Failure | Auto-restart, circuit breaker | ECS health checks, ALB target deregistration, Step Functions retries |
| Database Failure | Multi-AZ failover, read replicas | Aurora Multi-AZ, RDS Multi-AZ, DynamoDB auto-replication |
| Data Loss | Backups, point-in-time recovery | RDS automated backups, S3 versioning, DynamoDB PITR, DMS (CDC) |
| Region Failure | Disaster recovery, multi-region | Cross-region replication (S3, RDS, DynamoDB global tables), Route 53 failover |
| Capacity Overload | Auto-scaling, circuit breakers | ASG, Lambda concurrency, SQS queue buffering, Step Functions error handling |
RTO & RPO by Workload:
| Workload Criticality | RTO | RPO | Strategy |
|---|---|---|---|
| Critical (financial, healthcare) | < 1 hour | Zero data loss | Multi-AZ + multi-region, synchronous replication, event sourcing |
| High (revenue-impacting) | 1-4 hours | < 1 hour | Multi-AZ, async replication, hourly backups |
| Medium (operational) | 4-24 hours | < 1 day | Single AZ, daily snapshots |
| Low (dev/test) | > 1 day | Not applicable | Backups acceptable; manual recovery OK |
Measurable Indicators:
- Availability: 99.9% for critical, 99% for high
- MTTR: < 5 min for automated recovery
- Unplanned downtime: < 43 min/month for 99.9%
- Recovery test success: 100% of DR scenarios annually
- Change rollback success: < 5 min
Anti-Patterns:
- ❌ Single points of failure (single AZ, single instance)
- ❌ No backups or unverified recovery
- ❌ Manual scaling (capacity runs out during spikes)
- ❌ No circuit breakers (cascading failures)
- ❌ Changes without automated rollback
5.4 Performance Efficiency Pillar
Design Principles:
- Democratize advanced technologies
- Go global in minutes (CloudFront)
- Use serverless for variable workloads
- Experiment often
- Mechanical sympathy (align tech to workload)
Key Questions:
- Does the system meet latency targets?
- Is throughput optimized for the workload?
- Are expensive operations (queries, compute) optimized?
- Is caching used where beneficial?
Optimization by Service:
| Service | Key Optimizations | Measurements |
|---|---|---|
| API Gateway | Caching, throttling, request validation | p99 latency < 200ms |
| Lambda | Provisioned concurrency, memory tuning, connection reuse | Cold start < 100ms, warm < 50ms |
| RDS/Aurora | Read replicas, indexes, query optimization, connection pooling | Query p95 < 100ms, throughput per core |
| DynamoDB | Partition key design, GSI, DAX cache, batch operations | p99 < 10ms, no hot partitions |
| S3 | Multipart upload, batch operations, Transfer Acceleration, Intelligent-Tiering | Upload throughput, object retrieval latency |
| CloudFront | Origin Shield, cache headers, compression, HTTP/2 | p99 latency < 100ms globally |
Performance by Workload:
| Workload Type | Latency Target | Optimization |
|---|---|---|
| Synchronous (API) | p99 < 200ms | Caching, query optimization, parallel requests |
| Real-time | p99 < 100ms | Local cache (ElastiCache), connection reuse, batch operations |
| Batch | Throughput optimized | Parallel processing, partitioning, appropriate instance size |
| Streaming | Sub-second processing | Partition key design, Kinesis shards, parallel Lambda invocation |
Measurable Indicators:
- p99 latency: < target threshold
- Throughput: Scales linearly with resources (no bottlenecks)
- Resource utilization: 60-80% for optimal cost/performance
Anti-Patterns:
- ❌ N+1 queries (repeated DB calls in loops)
- ❌ No caching (repeated expensive operations)
- ❌ Synchronous processing (blocking calls)
- ❌ Single-threaded/single-shard processing (can't parallelize)
- ❌ Unoptimized queries (full table scans)
5.5 Cost Optimization Pillar
Design Principles:
- Implement Cloud Financial Management
- Measure and attribute expenditure
- Stop paying for under-utilized resources
- Analyze and optimize over time
- Use managed services to reduce operational cost
Key Questions:
- Is the current spend justified by business value?
- Are there cheaper service alternatives?
- Are discounts being used (Reserved Instances, Savings Plans)?
- Is capacity right-sized?
- Are unused resources being cleaned up?
Cost Optimization by Service:
| Service | Cost Driver | Optimization Strategy |
|---|---|---|
| Lambda | Requests + GB-seconds | On-demand for variable; consolidate functions if baseline traffic |
| ECS | EC2 instance hours | Fargate for variable; EC2 + Savings Plans for stable |
| RDS | Instance-hours + storage | Right-size instances; use Reserved Instances (1-yr: ~31% off, 3-yr: ~43% off); read replicas for read-heavy |
| DynamoDB | Provisioned RCU/WCU or on-demand | On-demand for unpredictable; provisioned for baseline; consider reserve capacity |
| S3 | Storage + requests + transfer | Intelligent-Tiering for unknown access; Glacier for archive; delete unused data |
| Redshift | Instance-hours + storage | Reserved instances; use Spectrum for external data; pause during low traffic |
| Kinesis | Shard-hours or GB ingested | On-demand for variable; provisioned for baseline; batch and compress |
Pricing Model Selection:
| Workload Pattern | Recommended | Cost Savings |
|---|---|---|
| Predictable, high utilization | Reserved Instances (3-year) | 65% off on-demand |
| Predictable, multi-service baseline | Savings Plans (3-year) | 60-65% off on-demand; flexible across services |
| Variable, bursty | On-demand or serverless | No commitment; higher per-unit cost |
| Batch, interruptible | Spot instances | 70% off on-demand |
Cost Monitoring & Attribution:
- Cost tags per application, team, cost center
- Budget alerts in Cost Explorer
- Savings Plans recommendations (automated)
- Trusted Advisor for cost optimization opportunities
Measurable Indicators:
- Cost per transaction: Trending down quarter-over-quarter
- Utilization: EC2 CPU 60-80%; under-utilized instances identified and removed
- Discount penetration: > 70% of compute cost on Reserved/Savings Plans
- Monthly savings from optimization: Documented and tracked
Anti-Patterns:
- ❌ Large reserved instance commitment for unproven workloads
- ❌ Oversized instances (paying for unused capacity)
- ❌ Running dev/test environments 24/7
- ❌ Keeping data in expensive storage (not using Intelligent-Tiering)
- ❌ No cost allocation or visibility
5.6 Sustainability Pillar
Design Principles:
- Understand your sustainability impact
- Establish goals and measure impact
- Maximize utilization (reduce waste)
- Adopt efficient hardware and architecture
- Use managed services (AWS optimizes for efficiency)
- Reduce downstream impact (minimize data transfer)
Key Sustainability Decisions:
| Decision | High-Impact Option | Impact |
|---|---|---|
| Compute selection | Serverless + managed services | No idle infrastructure; AWS amortizes overhead |
| Regional placement | Use AWS Regions with renewable energy | Check AWS Sustainability Report; PPA-backed regions |
| Data storage | S3 Intelligent-Tiering → Glacier | Reduces storage footprint; archive old data |
| Instance types | Graviton, Trainium (AWS-built chips) | Higher energy efficiency than x86 |
| Architecture | Batch processing, off-peak scheduling | Consolidate; avoid running 24/7 if not needed |
| Data transfer | Minimize inter-region/public data transfer | CloudFront for global distribution; VPC Endpoints for internal |
Measurable Indicators:
- Kilograms CO2 per transaction: Trending down
- Instances with < 20% utilization: Identified and consolidated
- Data stored in cold tiers (Glacier): Percentage of total
- Energy efficiency score (AWS Carbon Intelligence): Tracking vs. industry baseline
Anti-Patterns:
- ❌ Running compute 24/7 (especially development/test)
- ❌ No data lifecycle policies (keeping hot storage forever)
- ❌ Using x86 instances when Graviton available
- ❌ Not considering region sustainability impact
Section 6: Cost & Scale Modeling Framework
6.1 Fixed vs. Variable Cost Analysis
Fixed Cost (per month, regardless of usage):
- NAT Gateway: $32/month per AZ
- ALB: $22/month base
- Redshift cluster: $1.26/hour (minimum)
- RDS instance: $50-$1000+/month (instance-based)
Variable Cost (per unit of usage):
- Lambda: $0.0000002 per request + $0.0000166667 per GB-second
- SQS: $0.40 per 1M requests
- S3: $0.023 per GB/month + $0.0004 per 1k requests
- Data transfer out: $0.09 per GB
6.2 Cost at Different Scales
Scenario: Build an API backend (illustrative monthly figures)
Low Scale (100 req/sec, < 1 TB data):
- Lambda + DynamoDB: ~$104/month (requests + compute) ✓ Cheapest
- ECS Fargate: $350/month (1 vCPU, 2 GB always-on) + RDS db.t3.micro ($17/month) ≈ $367/month
Medium Scale (1000 req/sec, 100 GB data):
- Lambda + DynamoDB: $40 (requests) + $1000 (compute) + $50 (DB) = $1090
- ECS + RDS: $1400 (compute) + $200 (RDS t3.small) = $1600
- Winner: Lambda still cheaper (less infrastructure overhead)
High Scale (10,000 req/sec, 1 TB data):
- Lambda + DynamoDB: $400 + $10,000 + $500 = $10,900
- ECS + RDS Aurora: $4000 (compute) + $1000 (Aurora) + $1000 (storage) = $6000 ✓ Cheaper
- Winner: ECS + provisioned services (fixed cost amortized over high traffic)
6.3 Break-Even Analysis
Lambda vs. ECS Example:
- Lambda: $0.0000002 per request + $0.0000166667 per GB-second
- For a 128 MB Lambda with 100 ms average execution:
  - Cost per request: $0.0000002 + (0.125 GB × 0.1 s × $0.0000166667) ≈ $0.00000041
  - 1M requests ≈ $0.41
- ECS Fargate (1 vCPU, 2 GB): $0.04695 + $0.00519 = $0.05214/hour ≈ $38/month per always-on task
- Unlike Lambda, a container serves many requests concurrently: at 100 ms per request and ~50 concurrent requests per task, one task sustains ~500 req/sec ≈ 1.3B requests/month, i.e. ~$0.00000003 per request
Breakeven: Lambda is cheaper for low or bursty traffic (you pay nothing when idle); a busy Fargate task wins once sustained load keeps it utilized — with these assumptions, at a few tens of requests per second per task-equivalent. The crossover is highly sensitive to memory size, execution time, and achievable concurrency.
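The same arithmetic as a small script, so the break-even can be recomputed as prices or assumptions change; all constants and the concurrency figure are illustrative.

```python
LAMBDA_REQ = 0.0000002      # $ per request
LAMBDA_GBS = 0.0000166667   # $ per GB-second
FARGATE_HOURLY = 0.05214    # $ per hour, 1 vCPU / 2 GB (from above)

def lambda_monthly(rps, mem_gb=0.125, dur_s=0.1):
    requests = rps * 3600 * 24 * 30
    return requests * (LAMBDA_REQ + mem_gb * dur_s * LAMBDA_GBS)

def fargate_monthly(rps, dur_s=0.1, concurrency=50):
    per_task_rps = concurrency / dur_s       # 500 req/sec per task
    tasks = max(1, -(-rps // per_task_rps))  # ceiling division
    return tasks * FARGATE_HOURLY * 24 * 30

for rps in (10, 50, 500, 2000):
    print(f"{rps:>5} req/sec: Lambda ${lambda_monthly(rps):,.0f}"
          f" vs Fargate ${fargate_monthly(rps):,.0f}/month")
```

With these assumptions the crossover lands in the tens of requests per second; raising memory or execution time moves it lower, and bursty traffic moves it higher.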
6.4 Data Gravity Analysis
Question: Should data stay in one region or be replicated?
Factors:
- Data transfer cost out: $0.09 per GB (public internet)
- S3 cross-region replication: pay per GB replicated plus storage in the destination region
- RDS Multi-AZ: no cross-AZ data transfer charge (but roughly 2x instance cost)
- RDS cross-region: Paid data transfer ($0.02/GB)
Decision Matrix:
- Single region, high data volume (100 TB+) and distributed users: CloudFront for distribution (caching layer)
- Multi-region for disaster recovery: Use cross-region replication; amortize cost over disaster scenarios
- Multi-region active-active: High cost; only for mission-critical global workloads
6.5 Cost Optimization Levers
Ordered by impact:
1. Right-sizing (biggest impact): Reduce instance size; use Intelligent-Tiering
   - Potential savings: 30-50% if currently over-provisioned
2. Commitment discounts (Savings Plans, Reserved Instances): 40-65% off
   - Potential savings: 40-65% of compute cost
   - Requirement: 70%+ predictable baseline
3. Reserved capacity (DynamoDB, Redshift): 50%+ off
   - Potential savings: 50% of baseline database cost
   - Requirement: Stable, known baseline
4. Spot instances (for interruptible workloads): 70% off
   - Potential savings: 70% of batch compute cost
   - Trade-off: Interruption risk acceptable
5. Architecture changes (Lambda vs. ECS, S3 Intelligent-Tiering): 20-50% off
   - Potential savings: Varies by workload; biggest impact from right-sizing to the appropriate service tier
6. Data transfer optimization: 20-30% off data costs
   - Use CloudFront for global distribution; keep data local; compress
Section 7: Evolution & Change Strategy
7.1 MVP → Scale
MVP Phase (0-6 months, 1-100 users):
- Use serverless (Lambda, Fargate, DynamoDB on-demand)
- Minimize operational burden
- Fast iteration; don't optimize prematurely
- Architecture: API Gateway → Lambda → DynamoDB
- Cost: ~$50-500/month
Scale Phase (6+ months, 1k-1M users):
- Identify bottlenecks; migrate to provisioned services as needed
- Add caching (CloudFront, ElastiCache)
- Optimize costs: Reserved Instances, Savings Plans
- Architecture: API Gateway → ECS/Lambda → RDS Aurora with read replicas → CloudFront
- Cost: $1k-10k/month
Mature Phase (2+ years, 1M+ users):
- Multi-region active-active for resilience
- Advanced caching and CDN
- Dedicated infrastructure; right-sized instances
- Cost: $10k+/month
7.2 Monolith → Microservices
Phase 1: Decompose (0-3 months)
- Identify service boundaries (domain-driven design)
- Build event-driven orchestration (EventBridge, Step Functions, SNS/SQS)
- Keep monolith and microservices running in parallel
Phase 2: Strangler Fig (3-12 months)
- Gradually route traffic to microservices
- Retire monolith modules as traffic shifts
- Maintain backward compatibility
Phase 3: Mature (12+ months)
- All traffic on microservices
- Optimize service communication (caching, circuit breakers)
- Consider service mesh (Istio) if > 20 services
7.3 Single-Region → Multi-Region
Phase 1: Failover (0-3 months)
- Set up cross-region backup/restore
- Automate backup and restore testing
- RTO: hours; manual failover
Phase 2: Active-Passive (3-6 months)
- Set up read replicas in secondary region
- Automate failover via Route 53 health checks
- RTO: < 5 min; automatic
Phase 3: Active-Active (6-12 months)
- Data replicated bidirectionally (Aurora Global, DynamoDB Global Tables)
- Load balanced across regions (Route 53 geolocation)
- RTO: < 1 min; automatic; full active workload in both regions
7.4 Manual → Fully Automated
Infrastructure as Code (Week 0-2):
- CloudFormation, Terraform, or CDK
- Version-control infrastructure
- Enable reproducible deployments
CI/CD Pipeline (Week 2-4):
- GitHub/CodeCommit → CodeBuild → CodeDeploy
- Automated tests before deployment
- Blue-green or canary deployments
Operations Automation (Week 4-8):
- CloudWatch alarms → SNS → Lambda (auto-remediation)
- Patch automation (SSM Patch Manager)
- Infrastructure health monitoring (Config, GuardDuty)
Observability (Week 8-12):
- CloudWatch Logs Insights for querying
- X-Ray for distributed tracing
- Custom dashboards; alerts on SLO breaches
7.5 Static Systems → AI-Driven
Phase 1: Monitoring & Insights (0-3 months)
- Collect metrics, logs, traces
- Identify anomalies (CloudWatch Anomaly Detection)
- Manual decision-making based on data
Phase 2: Basic Automation (3-6 months)
- Auto-scaling rules based on metrics
- CloudWatch alarms trigger Lambda for remediation
- Chatbots for simple queries (Lex)
Phase 3: Intelligent Decision-Making (6-12 months)
- ML models predict optimal resource allocation
- Optimization recommendations (Cost Optimization Hub)
- Agentic AI for complex decision workflows
Phase 4: Autonomous Operations (12+ months)
- Agents autonomously execute remediation
- Human-in-the-loop for high-impact decisions
- Continuous learning from outcomes
Section 8: Decision Playbooks & Checklists
8.1 Universal Architecture Decision Checklist
Phase 1: Requirements Gathering (Week 1)
- Define business intent and success metrics
- Identify users, actors, and access patterns
- Classify data (volume, velocity, variety, sensitivity)
- Determine workload type (sync/async, batch/streaming)
- Estimate traffic patterns and scale requirements
- Define availability targets (RTO, RPO)
- Document security and compliance needs
- Set cost constraints
- Assess team capability and operational tolerance
- Plan for extensibility and evolution
Output: Requirement document (1-2 pages)
Phase 2: Service Selection (Week 2)
Compute:
- Lambda, ECS, EKS, EC2, or Batch?
- Decision factor: Scale, latency, operational burden
- Cost analysis: Provisioned vs. on-demand
Storage:
- S3 (objects), EBS (block), EFS (file), RDS/DynamoDB (DB)?
- Durability and availability targets met?
- Encryption, compliance requirements met?
Database:
- RDS (relational), DynamoDB (key-value), Redshift (warehouse), OpenSearch (search)?
- Access patterns validated against service?
- Scaling behavior acceptable?
Integration:
- Lambda functions, API Gateway, EventBridge, SQS/SNS, Kinesis, Step Functions?
- Coupling/decoupling adequate?
- Error handling strategy defined?
Output: Service decision matrix (1 page)
Phase 3: Architecture Design (Week 3)
- Draw high-level architecture (compute → storage → database)
- Identify data flows (sync vs. async)
- Map to architecture patterns (CRUD, event-driven, streaming, etc.)
- Define failure scenarios and recovery strategy
- Calculate cost at baseline, peak, and 2x peak scale
- Identify cost optimization opportunities
- Review against Well-Architected pillars
Output: Architecture diagram + Well-Architected scorecard
Phase 4: Implementation & Deployment (Weeks 4-6)
- Code infrastructure (CloudFormation, Terraform, CDK)
- Implement logging, monitoring, alerting (CloudWatch, X-Ray)
- Set up CI/CD pipeline (CodePipeline, CodeBuild, CodeDeploy)
- Write runbooks for operational procedures
- Test failure scenarios (chaos engineering)
- Performance test at 2x expected peak load
- Security audit and penetration testing
- Compliance validation (HIPAA, PCI, etc. if applicable)
Output: Deployment checklist, runbooks, test results
Phase 5: Go-Live (Week 7)
- Production readiness review (deployment, security, operations)
- Gradual rollout (5% → 25% → 50% → 100% traffic)
- Monitor golden signals (latency, error rate, throughput)
- Alert thresholds defined and tested
- Incident response team briefed and on-call
- Backup/disaster recovery procedures verified
Output: Incident response playbook, monitoring dashboard
Phase 6: Optimization & Learning (Week 8+)
- Review cost monthly; identify optimizations
- Analyze performance metrics; optimize hot paths
- Conduct retrospectives on incidents
- Update architecture based on learnings
- Refine automation and runbooks
Output: Quarterly optimization report
8.2 Service Selection Flowchart
Compute Decision Tree:
"How long does the job run?" ├─ < 15 min │ └─ "Variable or bursty load?" │ ├─ Yes → Lambda │ └─ No → "Latency critical?" │ ├─ Yes → ECS (Fargate, warm) │ └─ No → EC2 (if always-on cheaper) ├─ 15 min - 1 hour │ └─ Batch (for batch jobs) or ECS (for services) └─ > 1 hour └─ "Distributed processing needed?" ├─ Yes → EMR (Spark/Hadoop) └─ No → EC2, ECS, or SageMaker
Database Decision Tree:
"What type of queries?" ├─ SQL, complex joins, transactions │ └─ RDS (MySQL, PostgreSQL) or Aurora (high throughput) ├─ Key-value, documents, real-time │ └─ DynamoDB (provisioned for baseline, on-demand for variable) ├─ Large-scale analytics, BI │ └─ Redshift (petabyte-scale OLAP) ├─ Full-text search, time-series logs │ └─ OpenSearch └─ Graph, time-series, other └─ Neptune (graph), Timestream (time-series), Keyspaces (Cassandra)
8.3 Failure Scenario Modeling
Template: For each critical component, document:
- Component: e.g., RDS database, API Gateway, Lambda function
- Failure Mode: e.g., instance crash, network partition, application error
- Impact: e.g., "500 errors for 5 min, 1000 failed requests"
- Detection: e.g., "CloudWatch alarm: 5xx error rate > 1%"
- Recovery Time: e.g., "< 1 min (automatic failover)"
- Prevention: e.g., "Multi-AZ, read replicas, automated health checks"
Example Scenarios:
| Component | Failure | Impact | Detection | Recovery | Prevention |
|---|---|---|---|---|---|
| RDS (primary) | Instance crash | DB unavailable | CloudWatch metrics | Auto-failover to read replica (1 min) | Multi-AZ |
| Lambda function | Code error | 500 responses | CloudWatch error metrics, X-Ray | Automatic retry; DLQ for inspection | Unit tests, canary deployment |
| API Gateway | DDoS attack | Request throttling | CloudWatch request count | Auto-scaling, WAF, Shield | WAF rules, rate limiting |
| S3 bucket | Accidentally deleted | Data loss | CloudWatch metrics drop | Restore from versioning or backup | Versioning enabled, lifecycle policies |
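As one concrete detection example from the table above, a boto3 sketch that creates the 5xx-rate alarm; the API name and SNS topic are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the average API Gateway 5XXError rate exceeds 1% for 5 minutes.
cloudwatch.put_metric_alarm(
    AlarmName="api-5xx-rate",
    Namespace="AWS/ApiGateway",
    MetricName="5XXError",
    Dimensions=[{"Name": "ApiName", "Value": "example-api"}],  # placeholder
    Statistic="Average",  # 5XXError averages to an error rate
    Period=60,
    EvaluationPeriods=5,
    Threshold=0.01,       # 1% of requests
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder
)
```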
8.4 Security Threat Modeling
Threat Model Template (STRIDE):
| Threat Category | Threat | Mitigation | Implementation |
|---|---|---|---|
| S - Spoofing | Attacker impersonates API caller | Strong authentication, API key validation | API Gateway API keys, IAM roles, OAuth |
| T - Tampering | Attacker modifies data in transit | Encryption, integrity checks | TLS, HMAC, message signing |
| R - Repudiation | Attacker denies action | Audit logging, immutable records | CloudTrail, DynamoDB event sourcing |
| I - Information Disclosure | Attacker accesses sensitive data | Encryption, access control | KMS, IAM, data classification |
| D - Denial of Service | Attacker floods system | Rate limiting, auto-scaling, WAF | API Gateway throttling, WAF rules, Shield |
| E - Elevation of Privilege | Attacker gains higher access | Least privilege, MFA, role separation | IAM policies, MFA, role assumption logs |
Section 9: Production Deployment Patterns
9.1 Blue-Green Deployment
Definition: Run two identical production environments (blue, green); switch traffic to green after validation.
Benefits:
- Zero-downtime deployments
- Easy rollback (switch back to blue)
- Thorough testing in production environment before traffic
- Cost: 2x infrastructure during deployment (brief window)
Implementation:
```
Users → ALB → Blue (v1)  [current]
        Green (v2)       [standby, being deployed]

Test Green thoroughly; if successful:
Users → ALB → Green (v2) [becomes current]
```
9.2 Canary Deployment
Definition: Gradually route traffic to new version; rollback if error rate exceeds threshold.
Benefits:
- Low-risk; easy rollback
- Real user traffic tests new code
- Immediate detection of issues
- Cost: Minimal additional infrastructure
Implementation:
```
Minute 0:  5% traffic → v2
Minute 5:  25% → v2 (if error rate normal)
Minute 10: 50% → v2
Minute 15: 100% → v2

If error rate exceeds threshold at any step: roll back to 0% on v2
```
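A schematic sketch of that schedule; shift_traffic() and error_rate() are hypothetical hooks into your load balancer (e.g., weighted target groups) and metrics backend.

```python
import time

STEPS = [5, 25, 50, 100]  # percent of traffic on v2
ERROR_THRESHOLD = 0.01    # 1% error rate triggers rollback

def canary_rollout(shift_traffic, error_rate, bake_seconds=300):
    for pct in STEPS:
        shift_traffic(pct)        # hypothetical: set v2 traffic weight
        time.sleep(bake_seconds)  # let real traffic bake at this step
        if error_rate() > ERROR_THRESHOLD:
            shift_traffic(0)      # roll back: all traffic to v1
            raise RuntimeError(f"canary failed at {pct}% traffic")
    # reaching here means v2 now serves 100% of traffic
```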
9.3 Feature Flags
Definition: Deploy code without enabling features; toggle features on/off without redeployment.
Benefits:
- Decouple deployment from release
- Kill switches for problematic features
- Gradual feature rollout
Implementation:
```python
# feature_flags is a hypothetical flag store (e.g., AWS AppConfig or a
# DynamoDB-backed lookup)
if feature_flags.get("new_checkout_flow"):
    ...  # New code path
else:
    ...  # Old code path
```
Section 10: AWS Service Selection Quick Reference
| Use Case | Primary Service | Alternative | Trade-off |
|---|---|---|---|
| Static website | S3 + CloudFront | API Gateway + Lambda | Simpler (S3) vs. dynamic (Lambda) |
| REST API | API Gateway + Lambda | ECS + ALB | Serverless (Lambda) vs. control (ECS) |
| Microservices | ECS + ALB + SQS/SNS | EKS | Simplicity (ECS) vs. power (EKS) |
| Database (SQL) | RDS Aurora | RDS PostgreSQL | Performance/scale (Aurora) vs. cost (basic RDS) |
| Real-time database | DynamoDB | RDS | Milliseconds (DynamoDB) vs. complex queries (RDS) |
| Data warehouse | Redshift | Athena | Complex queries (Redshift) vs. ad-hoc (Athena) |
| Log analysis | OpenSearch | CloudWatch Logs Insights | Powerful search (OpenSearch) vs. simple (CloudWatch) |
| Batch processing | Glue or Batch | EMR | Simplicity (Glue) vs. control (EMR) |
| Streaming | Kinesis | MSK | AWS-native (Kinesis) vs. portable (MSK) |
| Event routing | EventBridge | SNS + SQS | Rich routing (EventBridge) vs. simple (SNS/SQS) |
| Workflow orchestration | Step Functions | Apache Airflow (MWAA) | Simplicity (Step Functions) vs. power (Airflow) |
| ML model training | SageMaker | EC2 + Jupyter | Managed (SageMaker) vs. DIY (EC2) |
| LLM applications | Bedrock | SageMaker + custom models | Ease (Bedrock) vs. control (SageMaker) |
Conclusion
This universal framework enables architects to:
- Decompose any problem into 10 fundamental dimensions
- Classify workloads into generic archetypes (request/response, event-driven, streaming, batch, workflows, data platforms, AI/ML, edge, hybrid)
- Select AWS services with explicit decision criteria, tradeoffs, cost models, and scaling behavior
- Design architectures using proven patterns (CRUD, event-driven, saga, fan-out/fan-in, data lakehouse, CQRS, streaming, multi-region)
- Align with Well-Architected pillars (operational excellence, security, reliability, performance, cost, sustainability) with measurable indicators
- Model costs and scale across different workload profiles and identify breakeven points
- Plan evolution from MVP to scale, monolith to microservices, single-region to multi-region
- Execute with confidence using playbooks, checklists, and deployment patterns
This guide is domain-agnostic and applies to financial systems, consumer apps, data platforms, AI/ML systems, real-time systems, batch workloads, legacy migrations, greenfield and brownfield architectures.
Use this as a reference handbook for system design, an enterprise architecture playbook for organizations, a teaching and onboarding reference for cloud teams, and a foundation for automating AWS architectural decisions.
Version: 1.0 (December 2024)
Last Updated: December 15, 2024
Framework Alignment: AWS Well-Architected Framework (June 2024)