
AWS Universal Architecture Design & Decision Framework

A Domain-Agnostic Reference Guide for Analyzing, Decomposing, and Implementing Any Cloud System on AWS


Executive Summary

This guide provides a universal, repeatable methodology for transforming any business requirement, technical problem, or system constraint into a correct, scalable, and cost-efficient AWS architecture. It is domain-agnostic and works for financial systems, consumer apps, data platforms, AI/ML systems, real-time workloads, batch processes, legacy migrations, and unknown system types.

The guide functions as:

  • A system design handbook for architects and engineers
  • An enterprise architecture playbook for organizations
  • A teaching and onboarding reference for cloud teams
  • A foundation for automating AWS architectural decisions

Section 1: Universal Requirement Decomposition Framework

Any architectural problem—regardless of domain—must be decomposed across 10 critical dimensions. This framework ensures no aspect of your system is overlooked.

1.1 Business Intent

Purpose: Understand the "why" before the "what."

Guiding Questions:

  • What is the core business value this system delivers?
  • What are the primary revenue drivers or cost reduction targets?
  • Who are the end users (internal employees, external customers, partners)?
  • What is the go-to-market timeline and business phase (MVP, scaling, mature)?
  • Are there regulatory, compliance, or governance mandates?
  • What are the competitive pressures or market differentiators?

Decision Heuristic: Business intent drives all architectural priorities. A fast market entry (MVP) justifies serverless and managed services over optimal scaling. A mission-critical financial system prioritizes reliability and auditability over speed-to-market.

Output: A single-paragraph "business thesis" that frames all downstream decisions.


1.2 User & System Actors

Purpose: Identify who interacts with the system and how.

Guiding Questions:

  • Who are the primary users? (internal ops, external customers, partners, machines)
  • How many concurrent users? (10, 10k, 1M+)
  • What is the geographic distribution? (single region, multi-region, global)
  • Are there bots, APIs, or automated integrators?
  • What are the access patterns? (browser, mobile, API, batch, event-driven)
  • Are there downstream systems that depend on this?

Decision Heuristic:

  • Large distributed user base → CDN, global load balancing, edge computing.
  • Batch/job consumers → Event-driven, Step Functions orchestration.
  • API integrators → API Gateway, rate limiting, schema validation.
  • Internal tools only → Simplified networking, reduced durability requirements.

Output: Actor matrix (type, count, geography, interaction pattern, SLA expectations).


1.3 Data Characteristics

Purpose: Classify the Volume, Velocity, Variety, and Sensitivity of data flowing through the system.

Volume

  • Gigabytes → RDS, DynamoDB, S3
  • Terabytes → Redshift, EMR, Athena on S3
  • Petabytes → Data lakes, distributed processing (Spark/Flink on EMR)

Velocity

  • Batch (hourly/daily) → Glue jobs, Lambda scheduled, batch processing
  • Near-real-time (seconds) → Kinesis, MSK, Lambda streaming
  • Real-time (milliseconds) → Kinesis with parallel consumers, DynamoDB Streams, SQS
  • Streaming (continuous) → Kinesis Data Analytics, Flink, Kafka

Variety

  • Structured (SQL schemas) → RDS, Aurora, DynamoDB
  • Semi-structured (JSON, Avro, Parquet) → S3 + Athena, Glue Data Catalog
  • Unstructured (images, video, logs) → S3, OpenSearch for text search
  • Mixed → Lake Formation for unified governance

Sensitivity (Data Classification)

  • Public → No encryption, standard S3 access
  • Internal → Standard encryption, IAM access control
  • Confidential → KMS encryption, VPC isolation, audit logging
  • PII/Regulated (HIPAA, PCI, GDPR) → Encryption, access auditing, data retention policies, DLP tools

Decision Heuristic:

  • High volume + high sensitivity → Use AWS Glue for data classification, Lake Formation for fine-grained access control.
  • High velocity + high variety → Kinesis or MSK for ingestion; Lambda + DynamoDB for processing.
  • Mixed sensitivity → Separate data layers by classification; encrypt all; audit all access.

Output: Data catalog (volume, velocity, variety, sensitivity, retention, lineage).


1.4 Workload Type

Purpose: Classify the synchronicity and scheduling of work.

Synchronous Workloads

  • User waits for response
  • Low latency required (milliseconds to seconds)
  • Examples: web requests, API calls, real-time queries
  • AWS Services: Lambda, API Gateway, ECS, RDS, DynamoDB, ElastiCache
  • Scaling: Auto-scale based on concurrent requests; cold starts matter

Asynchronous Workloads

  • User does not wait; work happens later
  • Moderate latency acceptable (seconds to hours)
  • Examples: email notifications, data processing, report generation
  • AWS Services: SQS, SNS, EventBridge, Step Functions, Glue, Batch, EMR
  • Scaling: Auto-scale workers; decouple producers from consumers

Batch Workloads

  • Large volumes of data processed at scheduled intervals
  • Latency can be hours or days
  • Examples: ETL, data analytics, ML training, backups
  • AWS Services: Glue, Batch, EMR, Lambda scheduled, Redshift, SageMaker
  • Scaling: Fixed or dynamic parallelization; cost-optimized

Streaming Workloads

  • Continuous or near-continuous data flow
  • Low latency per event (milliseconds to seconds)
  • Examples: sensor data, clickstreams, logs, financial ticks
  • AWS Services: Kinesis, MSK, Lambda, DynamoDB Streams, EventBridge
  • Scaling: Partition by shard; auto-scale shards

Decision Heuristic:

  • Synchronous + low latency → Lambda (if occasional cold starts of roughly 100ms-1s are acceptable) or ECS (predictable, warm latency).
  • Synchronous + consistent load → ECS, EC2 (warm instances).
  • Asynchronous + bursty → SQS with Lambda, EventBridge rules.
  • Batch + scheduled → Glue, Batch, or Lambda scheduled.
  • Streaming + high throughput → Kinesis or MSK with parallel consumers.

Output: Workload classification (sync/async, latency target, throughput, scheduling).


1.5 Traffic & Scale Patterns

Purpose: Understand load profile, growth trajectory, and burst capacity.

Guiding Questions:

  • What is baseline traffic? (requests/sec, MB/sec)
  • What is peak traffic? (seasonal, event-driven, predictable or not)
  • What is growth rate? (stable, linear, exponential)
  • Are there time-zone or geographic spikes?
  • What is acceptable latency during peak load?
  • Can your system gracefully degrade, or must it handle all traffic?

Decision Heuristic:

  • Stable, predictable load → Reserved Instances, provisioned capacity (RDS, Redshift, Kinesis), Savings Plans.
  • Bursty, unpredictable load → Serverless (Lambda, Fargate), on-demand (DynamoDB), auto-scaling.
  • Rapid growth (0 → scale) → Serverless initially; migrate to provisioned as load stabilizes.
  • Time-zone dependent → Multi-region or scheduled scaling.
  • Graceful degradation required → Circuit breakers, load shedding, read replicas for read-heavy workloads.

Output: Traffic profile (baseline, peak, growth, patterns, cost/performance targets).


1.6 Availability & Durability Targets

Purpose: Define Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each component.

RTO = How long can the system be down before it impacts business?

  • Critical (< 1 hour) → Multi-AZ, multi-region failover, hot standby
  • High (1-4 hours) → Multi-AZ failover, automated recovery
  • Medium (4-24 hours) → Single AZ with automated backups, manual recovery
  • Low (> 24 hours) → Manual recovery acceptable; development/test environments

RPO = How much data loss is acceptable?

  • Critical (zero data loss) → Synchronous replication, read replicas, event sourcing
  • High (minutes) → Asynchronous replication, hourly backups
  • Medium (hours) → Daily snapshots
  • Low (days) → Weekly or monthly backups

Decision Heuristic:

  • RTO < 1 hour, RPO zero → Multi-AZ Aurora with read replicas, DynamoDB global tables, cross-region failover.
  • RTO 1-4 hours, RPO hours → Multi-AZ RDS with automated backups, S3 cross-region replication.
  • RTO > 1 day, RPO > 1 day → Single AZ with backups; manual recovery acceptable.

AWS Service Mapping:

  • High availability: Aurora, DynamoDB, Lambda, ALB/NLB
  • Disaster recovery: S3 cross-region replication, RDS backups, Backup service, Route 53 health checks
  • Durability: S3 (11 nines), EBS snapshots, Data Lifecycle Manager

Output: Availability matrix (component, RTO, RPO, strategy, cost implications).


1.7 Security & Compliance Needs

Purpose: Identify confidentiality, integrity, availability requirements and regulatory constraints.

Guiding Questions:

  • What data is handled? (PII, PHI, financial, trade secrets)
  • What are regulatory requirements? (HIPAA, PCI-DSS, GDPR, SOX, CCPA)
  • Must data stay in-region? (data residency, data sovereignty)
  • Who can access what data? (role-based, attribute-based)
  • Is encryption required? (in-transit, at-rest, key management)
  • Are audit logs and compliance reports required?
  • Are there penetration testing or security assessment requirements?

Decision Heuristic:

  • PII/GDPR → Encryption (KMS), access auditing, data classification, DLP (Macie), deletion policies
  • HIPAA/PHI → Encryption, audit logging, VPC isolation, Business Associates Agreement (BAA)
  • PCI-DSS → Network isolation, encryption, access logging, vulnerability scanning (Inspector)
  • Financial/SOX → Immutable audit logs (CloudTrail), change controls, segregation of duties
  • Data residency required → Single-region architecture, explicit region selection
  • High-security environment → VPC isolation, GuardDuty, WAF, Network Firewall

Output: Security posture (data classification, regulatory drivers, encryption, access control, audit requirements).


1.8 Cost Sensitivity

Purpose: Define cost constraints and optimization priorities.

Guiding Questions:

  • Is this a cost-driven or feature-driven project?
  • What is the acceptable cost range per month, per user, per transaction?
  • Are there cost allocation/showback requirements?
  • Can you commit to reserved capacity, or is on-demand necessary?
  • Is CapEx or OpEx preferred?
  • What cost-optimization tools (CloudWatch, Cost Explorer, Trusted Advisor) are available?

Decision Heuristic:

  • Startup/MVP → On-demand, serverless, minimize fixed costs.
  • Scaling (stable load) → Mix reserved + on-demand; commit 70% of baseline traffic.
  • Cost-sensitive operations → Reserved Instances (up to 75% off), Savings Plans (65% off, more flexible), spot instances (up to 90% off, interruptible workloads).
  • Variable workloads → Serverless (Lambda, Fargate, Athena), DynamoDB on-demand.
  • Long-term projects (3+ years) → 3-year reserved instances provide best ROI.

Output: Cost model (budget, cost per unit, commitment level, optimization targets).


1.9 Operational Complexity

Purpose: Assess team capability and operational overhead tolerance.

Guiding Questions:

  • What is the team's cloud maturity? (beginners, intermediate, advanced)
  • Do you have DevOps, SRE, or platform engineering capabilities?
  • Can the team manage Kubernetes, on-premises infrastructure, or complex integrations?
  • What is the tolerance for manual operational tasks?
  • Are there compliance/audit requirements for infrastructure change management?

Decision Heuristic:

  • Small team, low maturity → Serverless (Lambda, Fargate), managed services (RDS, OpenSearch), minimal custom code.
  • Intermediate maturity → ECS with CloudFormation, RDS Aurora, Glue data pipelines.
  • Advanced maturity + complex requirements → EKS, self-managed databases, custom orchestration.
  • Compliance-heavy environments → Infrastructure as Code (Terraform, CDK), automated deployments, audit trails.

Output: Operational profile (team size, skills, tolerance for complexity, tooling requirements).


1.10 Change Frequency & Extensibility

Purpose: Design for evolution and minimize rework.

Guiding Questions:

  • How often do requirements change? (weekly, monthly, quarterly)
  • Will the system need to integrate with new services or data sources?
  • Is the architecture greenfield (new) or brownfield (existing)?
  • How modular and decoupled should components be?
  • What is the acceptable technical debt?

Decision Heuristic:

  • High change frequency → Microservices, event-driven, loose coupling; use Step Functions, EventBridge for orchestration.
  • Stable requirements → Monolith is acceptable; focus on reliability and scaling.
  • Must integrate with many systems → Event-driven, API-first, API Gateway for versioning.
  • Greenfield + flexible → Start with serverless; migrate to provisioned services as load/patterns stabilize.
  • Brownfield + legacy → Strangler Fig pattern; migrate incrementally; keep old and new running in parallel.

Output: Architecture roadmap (current state, evolution path, integration points, refactoring windows).


Section 2: Workload Classification Engine

Every real-world problem maps to one or more of these generic workload archetypes. Understanding which archetype(s) your system belongs to unlocks the correct service selection.

2.1 Request/Response Systems (Synchronous)

Characteristics:

  • User or system waits for response
  • Latency-sensitive (< 1 second to minutes)
  • Throughput variable or bursty
  • State management required (sessions, user context)

Real-World Examples:

  • Web applications, APIs, mobile backends
  • Search engines, recommendation systems
  • Checkout flows, payment processing
  • Real-time dashboards, monitoring systems

Core AWS Services:

  • Compute: Lambda (cold start sensitive), ECS/Fargate (warm, predictable), EC2 (control, long-running)
  • Network: API Gateway (REST/WebSocket), ALB (internal routing), CloudFront (caching)
  • Database: RDS/Aurora (ACID, sessions), DynamoDB (key-value, sessions), ElastiCache (ephemeral cache)
  • Observability: CloudWatch, X-Ray (trace requests end-to-end)

Architecture Pattern:

Client → CloudFront (cache) → API Gateway → Lambda/ECS/EC2 → RDS/DynamoDB/Cache
         ↓                                                          ↓
         Static assets (S3)                                   Logs to CloudWatch

Key Design Decisions:

  1. Compute choice:

    • Lambda: Best for I/O-bound, variable load; worst for long-running or CPU-intensive work
    • ECS: Best for containerized, moderate load; good cold-start latency
    • EC2: Best for high CPU, long-running, or specialized requirements
  2. Caching strategy:

    • CloudFront: Cache static assets and cacheable API responses globally
    • ElastiCache: Cache database queries, sessions, expensive computations
  3. Database choice:

    • RDS/Aurora: Relational data, complex queries, ACID guarantees
    • DynamoDB: Key-value lookups, horizontal scaling, variable load
    • Hybrid: Both (RDS for transactions, DynamoDB for sessions/cache)
  4. Load balancing:

    • ALB (Layer 7): Route based on path, hostname, headers (microservices)
    • NLB (Layer 4): Route based on IP protocol, port; extreme throughput/low latency
    • API Gateway: REST/GraphQL APIs, rate limiting, schema validation
  5. Scaling and resilience:

    • Auto Scaling Groups (ECS, EC2) for gradual scaling
    • Lambda: Automatic, scales to 1000s of concurrent requests
    • Circuit breakers (Step Functions) to prevent cascading failures
    • Read replicas (RDS, DynamoDB) for read-heavy workloads
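
Putting these decisions together, the sketch below shows one minimal way to implement the pattern with API Gateway (proxy integration), Lambda, and DynamoDB. The table name, key schema, and route are hypothetical, and error handling is reduced to the essentials.

import json
import os
import boto3

# Hypothetical sessions table, keyed on "session_id", supplied via environment variable
TABLE_NAME = os.environ.get("SESSIONS_TABLE", "sessions")
table = boto3.resource("dynamodb").Table(TABLE_NAME)

def handler(event, context):
    """API Gateway (proxy integration) -> Lambda -> DynamoDB key-value lookup."""
    session_id = (event.get("pathParameters") or {}).get("id")
    if not session_id:
        return {"statusCode": 400, "body": json.dumps({"error": "missing id"})}

    # Single-item lookup; DynamoDB scales horizontally with request volume
    result = table.get_item(Key={"session_id": session_id})
    item = result.get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}

    return {"statusCode": 200, "body": json.dumps(item, default=str)}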

Well-Architected Alignment:

  • Operational Excellence: Observability (CloudWatch metrics, logs, X-Ray traces); automated deployments (CodeDeploy)
  • Security: IAM roles, VPC security groups, encryption in transit (HTTPS), KMS at-rest
  • Reliability: Multi-AZ deployment, auto-scaling, health checks, failover
  • Performance: Caching (CloudFront, ElastiCache), connection pooling, query optimization
  • Cost Optimization: Right-size instances, use Savings Plans, cache frequently accessed data
  • Sustainability: Use managed services to avoid idle servers; consolidate workloads

Cost Model:

  • Lambda: $0.0000002 per request + $0.0000166667 per GB-second (pay for actual use)
  • ECS Fargate: $0.04695 per vCPU-hour + $0.00519 per GB-hour (pay for provisioned capacity)
  • RDS: ~$0.017/hour (db.t3.micro) to $10+/hour (large instances) + storage + backups
  • Break-even: Lambda vs. ECS ≈ 10-100 requests/sec depending on execution time; ECS cheaper above this
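
To make the break-even intuition concrete, here is a rough back-of-the-envelope comparison. It is a sketch only: it assumes a 128 ms average duration, 1 GB of memory, an always-on 1 vCPU / 2 GB Fargate task, and the list prices quoted above; real break-even points depend heavily on duration, memory, and utilization.

# Rough Lambda vs. always-on Fargate monthly cost comparison (illustrative only)
SECONDS_PER_MONTH = 730 * 3600

def lambda_monthly_cost(req_per_sec, duration_s=0.128, memory_gb=1.0):
    requests = req_per_sec * SECONDS_PER_MONTH
    request_cost = requests * 0.0000002                               # $0.20 per 1M requests
    compute_cost = requests * duration_s * memory_gb * 0.0000166667   # per GB-second
    return request_cost + compute_cost

def fargate_monthly_cost(vcpu=1.0, memory_gb=2.0):
    return (vcpu * 0.04695 + memory_gb * 0.00519) * 730               # always-on task

for rps in (1, 10, 50, 100):
    print(rps, round(lambda_monthly_cost(rps), 2), round(fargate_monthly_cost(), 2))

At these assumptions Lambda overtakes the always-on Fargate task somewhere around 10 requests/sec, consistent with the range above.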

2.2 Event-Driven Systems (Asynchronous, Decoupled)

Characteristics:

  • Components communicate via events, not direct calls
  • Loose coupling; easy to add consumers
  • No waiting for response
  • Throughput and latency can vary independently

Real-World Examples:

  • Order processing (order → payment → fulfillment → notification)
  • Data pipeline orchestration (ingest → transform → load → analyze)
  • Multi-tenant SaaS platforms (user action → event → multiple subscribers)
  • IoT systems (sensor → ingestion → storage → analytics → alerting)

Core AWS Services:

  • Event sources: S3, RDS, EventBridge, Kinesis, SNS, SQS, Lambda, DynamoDB Streams
  • Event routers: EventBridge (pub/sub with routing), SNS (fanout), SQS (queue), Kinesis (streaming)
  • Event processors: Lambda, Step Functions, ECS, Fargate, Batch
  • Event storage: DynamoDB, S3, EventBridge Archive

Architecture Pattern:

Source (S3, API, DDB) → EventBridge/SNS/SQS → Lambda/ECS/Batch → Target (DB, S3, Email)
                                          ↓
                              Multiple independent consumers
                                    (fanout)

Key Design Decisions:

  1. Event source:

    • EventBridge: Most flexible; supports 100+ AWS services and custom apps; best for rule-based routing
    • SNS: Simple pub/sub; less configuration; no replay; good for fanout
    • SQS: Queue-based; replay via DLQ; guarantees delivery; good for buffering
    • Kinesis: Streaming; ordered; replay; good for real-time analytics
    • DynamoDB Streams: Change data capture; triggers Lambda; good for change-driven workflows
  2. Fanout vs. queue:

    • Fanout (SNS, EventBridge): One event → multiple independent consumers; each gets a copy
    • Queue (SQS): One event → one consumer (processes and deletes); good for load leveling
  3. Error handling:

    • Idempotence: Events may be delivered multiple times; ensure consumers can handle duplicates
    • Dead-letter queues (DLQ): SQS/SNS DLQs capture failed messages for replay/inspection
    • Retry policies: Exponential backoff; max retries; circuit breaker pattern
  4. Orchestration vs. choreography:

    • Choreography: Each service subscribes to events and emits its own (decoupled, but implicit flows)
    • Orchestration: Central coordinator (Step Functions) directs workflow; explicit, auditable
  5. Consistency model:

    • Eventual consistency: Events propagate asynchronously; ordering may not be guaranteed
    • At-least-once delivery: Event may be delivered multiple times; idempotence required
    • Exactly-once semantics (hard): Use idempotent IDs + database writes with deduplication
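
As an illustration of at-least-once delivery paired with an idempotent consumer, the sketch below uses a DynamoDB conditional write as a deduplication guard. The table name, the producer-supplied event_id field, and the SQS trigger are assumptions for the example.

import json
import boto3
from botocore.exceptions import ClientError

# Hypothetical deduplication table with primary key "event_id"
dedup_table = boto3.resource("dynamodb").Table("processed-events")

def handler(event, context):
    """SQS-triggered Lambda that applies each message's side effects at most once."""
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        event_id = body["event_id"]          # producer-supplied idempotency key
        try:
            # Conditional write fails if this event_id was already recorded
            dedup_table.put_item(
                Item={"event_id": event_id},
                ConditionExpression="attribute_not_exists(event_id)",
            )
        except ClientError as err:
            if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                continue                      # duplicate delivery; skip side effects
            raise                             # other failures surface for retry / DLQ

        process(body)                         # business logic placeholder

def process(body):
    print("processing", body)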

Well-Architected Alignment:

  • Operational Excellence: Event tracing (CloudWatch logs, X-Ray), dead-letter queues for debugging, alerts
  • Security: IAM policies for publish/subscribe, encryption at-rest (SNS, SQS, EventBridge)
  • Reliability: Durable queues (SQS, EventBridge); retries; DLQs for failure handling; multi-AZ
  • Performance: EventBridge scales to very high event rates (default throughput quotas can be raised); Kinesis scales by shard; fanout is near-instant
  • Cost Optimization: Pay only for events processed; SQS cheaper than Kinesis for non-streaming; use Kinesis on-demand
  • Sustainability: Event-driven can consolidate workloads; avoid polling

Cost Model:

  • EventBridge: $1.00 per 1M custom events published
  • SNS: $0.50 per 1M publishes
  • SQS: $0.40 per 1M requests (includes sends, receives, deletes)
  • Kinesis on-demand: $0.08 per GB ingested plus a small per-stream-hour charge
  • Kinesis provisioned: $0.015 per shard-hour (~$0.36 per shard-day)

2.3 Stream Processing (Continuous, Low-Latency Analytics)

Characteristics:

  • Data arrives continuously at high velocity
  • Process and react within seconds or milliseconds
  • Maintain windowed state (e.g., moving averages)
  • Throughput scales by partitioning

Real-World Examples:

  • Real-time fraud detection, financial trading
  • Clickstream analysis, A/B testing
  • Sensor monitoring, anomaly detection
  • Log processing, security analytics

Core AWS Services:

  • Ingestion: Kinesis Data Streams, MSK (Kafka), EventBridge
  • Processing: Kinesis Data Analytics (SQL), Lambda (event-by-event), Flink/Spark on EMR, Kafka Streams
  • State storage: DynamoDB, ElastiCache, Kinesis (windowing)
  • Output: DynamoDB, RDS, S3, Redshift, OpenSearch, Lambda targets

Architecture Pattern:

Source (API, sensor) → Kinesis/MSK → Analytics/Processing → DynamoDB/S3/Redshift
                                            ↓
                                    Real-time alerts/dashboards

Key Design Decisions:

  1. Ingestion service:

    • Kinesis: AWS-native, scales by shard, low-level control, good for AWS ecosystem
    • MSK (Kafka): Open-source, more operational burden, richer ecosystem, good for multi-cloud
    • EventBridge: Rule-based routing; not ideal for high-throughput streaming; best for low-frequency events
  2. Processing framework:

    • Kinesis Data Analytics: SQL-based, managed, good for windowing and aggregations
    • Lambda: Simple transformation; event-by-event processing; cold start may impact throughput
    • Flink/Spark on EMR: Complex stateful processing, machine learning, but requires cluster management
  3. Partitioning strategy:

    • Partition by customer ID, user ID, or geographic region to parallelize processing
    • Avoid hot partitions (one shard falling behind)
    • Auto-scale shards based on throughput metrics
  4. Windowing and state:

    • Tumbling windows: Fixed intervals (1-minute aggregates)
    • Sliding windows: Overlapping intervals (10-second windows, every 1 second)
    • Session windows: Grouped by inactivity (user session breaks when idle > 30 min)
    • Use DynamoDB or ElastiCache for windowed state
  5. Latency vs. throughput:

    • Low latency (< 1 second): Lambda + Kinesis, minimal batching
    • Higher throughput (100k+/sec): Flink, larger batches, higher latency acceptable
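
The sketch below illustrates one of these options: a Kinesis-triggered Lambda that aggregates each batch into one-minute tumbling windows stored in DynamoDB. Table and field names are hypothetical, and retried batches would need additional idempotence handling in a real deployment.

import base64
import json
from collections import Counter
import boto3

# Hypothetical table keyed on (entity_id, window_start) holding 1-minute tumbling windows
counts_table = boto3.resource("dynamodb").Table("minute-counts")

def handler(event, context):
    """Kinesis-triggered Lambda: per-batch aggregation into tumbling windows."""
    counts = Counter()
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Bucket each event into the minute it arrived in
        window_start = (int(record["kinesis"]["approximateArrivalTimestamp"]) // 60) * 60
        counts[(payload["entity_id"], window_start)] += 1

    for (entity_id, window_start), n in counts.items():
        # Atomic counter update; a retried batch would double-count without extra guards
        counts_table.update_item(
            Key={"entity_id": entity_id, "window_start": window_start},
            UpdateExpression="ADD event_count :n",
            ExpressionAttributeValues={":n": n},
        )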

Well-Architected Alignment:

  • Operational Excellence: Monitoring lag (Kinesis iterator age), consumer latency, throughput metrics
  • Security: Encryption (TLS in-transit, KMS at-rest), IAM for producer/consumer, audit logging
  • Reliability: Auto-scaling shards; replay from the stream (Kinesis default retention is 24 hours, extendable to 365 days; archive to S3 for longer-term replay)
  • Performance: Partition hot spots avoided, optimal shard count, batch size tuning
  • Cost Optimization: On-demand Kinesis for variable workloads; reserved for baseline; consider Kafka if high volume
  • Sustainability: Consolidate stream processing on shared clusters; avoid idle shards

Cost Model:

  • Kinesis provisioned: $0.015 per shard-hour (each shard supports roughly 1 MB/sec ingest and 2 MB/sec read)
  • Kinesis on-demand: $0.08 per GB ingested plus a per-stream-hour charge (no upfront commitment)
  • Flink on EMR: EMR per-instance service fee (~$0.07–$0.10 per instance-hour) + underlying EC2 compute costs
  • Lambda stream consumers: $0.0000002 per invocation + GB-second of compute; concurrency scales with the number of shards

2.4 Batch Processing (Scheduled, Offline, High-Volume)

Characteristics:

  • Large volumes of data processed at scheduled intervals
  • Latency acceptable (hours, days)
  • Cost-optimized (spot instances, off-peak scheduling)
  • Often part of data pipelines

Real-World Examples:

  • ETL jobs (extract, transform, load)
  • Daily/weekly analytics, reporting
  • ML model training, batch scoring
  • Data cleanup, archival, backups
  • Rendering, video transcoding

Core AWS Services:

  • Scheduling/Orchestration: Step Functions, Lambda scheduled, EventBridge rules, Glue workflows
  • Compute: Batch, Glue, EMR, Lambda (small jobs), SageMaker (ML)
  • Storage: S3 (input/output), RDS (data source), Redshift (output)
  • Distributed Processing: Spark (EMR), Flink (EMR), Hadoop (EMR), Glue (Spark-based)

Architecture Pattern:

S3/RDS (source) → Glue/Batch/EMR (transform) → S3/Redshift/RDS (output) → Analytics/Reporting
                                   ↓
                         Scheduled by Step Functions
                         Failure handling via DLQ/SNS

Key Design Decisions:

  1. Compute platform:

    • Glue: AWS-native, serverless, pay-per-DPU-second, good for AWS-centric pipelines
    • Batch: Managed job queue, auto-scaling EC2, good for large compute jobs, spot instances
    • EMR: Full Hadoop/Spark cluster, complex transformations, good for data science
    • Lambda: Small, short-lived jobs; minimal overhead; limited timeout (15 min)
  2. Scheduling:

    • Step Functions: Visual workflows, error handling, automatic retries, good for multi-step pipelines
    • EventBridge rules: Cron expressions, simple triggering, good for scheduled tasks
    • Glue workflows: Built-in job orchestration, trigger on job completion
    • Apache Airflow (MWAA): Complex DAGs, good for data engineers
  3. Data partitioning:

    • Partition input data by date, customer, or region for parallel processing
    • Output partitioned for efficient querying (Athena, Redshift spectrum)
    • Use S3 object keys intelligently (e.g., s3://bucket/year/month/day/hour/)
  4. Cost optimization:

    • Use spot instances on Batch/EMR (up to 70% savings); interruptions acceptable
    • Schedule batch jobs in off-peak hours (e.g., 2-6 AM)
    • Delete intermediate data; keep only final output
    • Use Glue over EMR for small/medium jobs (cheaper); EMR for complex workloads
  5. Failure handling:

    • Retry logic in Step Functions (exponential backoff, max retries)
    • Dead-letter queues for failed jobs
    • SNS notifications on failure
    • Idempotent job design (can re-run without side effects)
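
As a small example, a (hypothetical) nightly Glue job can be started and monitored with a few boto3 calls; passing the date partition as a job argument keeps re-runs idempotent. In a real pipeline the guide's own recommendation is Step Functions rather than this kind of polling loop.

import time
import boto3

glue = boto3.client("glue")

def run_glue_job(job_name, run_date):
    """Start a hypothetical Glue ETL job for one date partition and wait for completion."""
    run = glue.start_job_run(
        JobName=job_name,
        Arguments={"--run_date": run_date},   # partition argument makes re-runs idempotent
    )
    run_id = run["JobRunId"]

    while True:
        state = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]["JobRunState"]
        if state in ("SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED"):
            return state
        time.sleep(30)                        # prefer Step Functions over polling in production

print(run_glue_job("nightly-etl", "2025/01/15"))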

Well-Architected Alignment:

  • Operational Excellence: Job monitoring, retry policies, automated error handling, alerts
  • Security: Data encryption (S3, KMS), IAM roles for job execution, audit logs
  • Reliability: Multi-step workflows with automatic retries, DLQs, checkpointing for restartable jobs
  • Performance: Parallel processing, data partitioning, right-size instances
  • Cost Optimization: Spot instances, off-peak scheduling, delete interim data, managed services (Glue vs. EMR)
  • Sustainability: Batch when possible (consolidate compute); avoid idle clusters; use managed services

Cost Model:

  • Glue: $0.44 per DPU-hour; Spark jobs require a minimum of 2 DPUs (each DPU = 4 vCPU, 16 GB memory)
  • Batch: EC2 spot instances (70% off on-demand) + networking; pay only during execution
  • EMR: $0.07–$0.10 per instance-hour (depends on instance type) + spot discounts
  • Lambda: $0.0000002 per request + $0.0000166667 per GB-second (cheap for small jobs; inefficient for large workloads)

2.5 Long-Running Workflows (State Machines, Sagas)

Characteristics:

  • Multi-step processes with conditional logic and retries
  • Steps may take minutes, hours, or days
  • State must be persisted; can be paused/resumed
  • Distributed, multi-service coordination

Real-World Examples:

  • Loan approval workflows (underwriting → approval → funding)
  • Supply chain tracking (order → warehouse → shipment → delivery)
  • ML training pipelines (data prep → training → evaluation → deployment)
  • Multi-step approval processes (draft → review → approve → execute)

Core AWS Services:

  • Orchestration: Step Functions (standard or express), Apache Airflow (MWAA)
  • Persistence: DynamoDB (state), SQS/SNS (communication), S3 (large payloads)
  • Compute: Lambda, ECS, Batch (individual steps)
  • Monitoring: CloudWatch, X-Ray (trace execution paths)

Architecture Pattern:

Trigger (API, S3, EventBridge) → Step Functions State Machine
                                  ├── Step 1: Validate (Lambda)
                                  ├── Step 2: Process (ECS/Batch)
                                  ├── Step 3: Notify (SNS)
                                  └── Error handling: Retry, DLQ, Alert

Key Design Decisions:

  1. Step Functions flavor:

    • Standard: ≤ 1 year duration, exactly-once workflow execution, good for long-lived business workflows
    • Express: ≤ 5 minutes, at-least-once execution, high throughput, good for high-volume event processing
  2. Saga pattern (distributed transactions):

    • Choreography: Each step emits events; next steps listen (decoupled but implicit)
    • Orchestration: Central coordinator (Step Functions) directs all steps (explicit, auditable, easier to debug)
    • Compensation: On failure, "undo" previous steps (e.g., reverse payment if fulfillment fails)
  3. Wait strategies:

    • Callback pattern: Step Functions pauses on a task token until the external system calls back (SendTaskSuccess/SendTaskFailure); the token is typically stashed in DynamoDB or passed along via EventBridge
    • Polling: Step Functions checks status periodically (less efficient)
    • Event-driven: External system sends event (SNS/SQS) to notify completion
  4. Error handling:

    • Automatic retries (exponential backoff, max attempts)
    • Catch and handle specific errors (e.g., timeout → default value; validation failure → alert)
    • Dead-letter queues for unrecoverable failures
    • Manual intervention path if needed
  5. State management:

    • Use DynamoDB to persist workflow state
    • Store large payloads in S3; reference by key
    • Work around the Step Functions payload size limit (256 KB per state input/output) by storing large payloads in S3 and passing references
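
A minimal orchestration-style saga can be expressed directly in Amazon States Language; the sketch below builds the definition as a Python dict and registers it with boto3. All ARNs, function names, and the retry/compensation policy are placeholders, not a prescribed workflow.

import json
import boto3

# Two business steps plus a compensation step ("undo" the charge on failure)
definition = {
    "StartAt": "ChargePayment",
    "States": {
        "ChargePayment": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-payment",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"],
                       "IntervalSeconds": 2, "MaxAttempts": 3, "BackoffRate": 2.0}],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "RefundPayment"}],
            "Next": "FulfillOrder",
        },
        "FulfillOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:fulfill-order",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "RefundPayment"}],
            "End": True,
        },
        "RefundPayment": {  # compensation step
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:refund-payment",
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="order-saga",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/order-saga-role",  # placeholder execution role
)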

Well-Architected Alignment:

  • Operational Excellence: Visual workflow monitoring, CloudWatch logs for each step, alerts on failure
  • Security: IAM roles per step (least privilege), encryption of state data
  • Reliability: Automatic retries, compensation logic, no data loss (persisted state)
  • Performance: Parallel step execution, avoid blocking waits
  • Cost Optimization: Pay per state transition; consolidate steps where possible; use spot instances in Batch steps
  • Sustainability: Avoid polling; use event-driven waits

Cost Model:

  • Step Functions Standard: $0.000025 per state transition ($25 per 1M transitions)
  • Step Functions Express: $0.000001667 per invocation + GB-second (useful for high-volume, short workflows)
  • DynamoDB for state: on-demand pricing is roughly $1.25 per 1M write request units and $0.25 per 1M read request units

2.6 Data-Intensive Platforms (Data Lakes, Warehouses, Analytics)

Characteristics:

  • Large volume of diverse data sources
  • Multiple consumers (analytics, ML, real-time queries)
  • Governed access; compliance and lineage important
  • Optimized for both ingestion and querying

Real-World Examples:

  • Data lakes (centralized repository for all data)
  • Data warehouses (curated, optimized for BI queries)
  • Data mesh (domain-oriented, decentralized ownership)
  • ML feature stores

Core AWS Services:

  • Ingestion: Glue, Kinesis, MSK, DMS (database migration)
  • Storage: S3 (bronze/silver/gold layers), Redshift (warehouse)
  • Processing: Glue (ETL), EMR (Spark/Flink), Athena (SQL on S3)
  • Governance: Lake Formation (access control), Glue Data Catalog (metadata)
  • Analytics: Redshift, Athena, QuickSight, SageMaker

Architecture Pattern (Modern Data Lakehouse):

Raw Data (S3 Bronze)
    ↓
Glue/EMR (Transform)
    ↓
Curated Data (S3 Silver/Gold)
    ↓
↙           ↓           ↘
Redshift   Athena      OpenSearch
(OLAP)     (Ad-hoc)    (Search)
    ↓           ↓           ↓
BI Tools     Data Apps    Search Apps
(QuickSight)  (Jupyter)    (Kibana)

Key Design Decisions:

  1. Data organization (medallion architecture):

    • Bronze: Raw data as-is (from source systems)
    • Silver: Cleaned, validated, deduplicated; schema applied
    • Gold: Business-ready; aggregated; optimized for use cases
    • Use S3 partitioning intelligently; Glue crawlers auto-discover
  2. Ingestion pattern:

    • Batch: Scheduled Glue jobs; good for daily or less frequent loads
    • Streaming: Kinesis → Glue streaming job → S3 (continuous updates)
    • Change Data Capture (CDC): DMS → S3 (real-time sync from source databases)
  3. Governance and security:

    • Lake Formation: Centralized permissions; grant access to databases/tables; tag-based controls
    • Data Catalog: Glue Data Catalog for metadata; search/lineage
    • Encryption: S3-SSE, KMS for sensitive data; Glue connection encryption for DB credentials
    • Compliance: S3 versioning, Object Lock for immutability; access logging
  4. Query optimization:

    • Redshift: Fast queries; supports complex joins; good for BI teams; requires provisioning
    • Athena: Serverless SQL on S3; slower but no provisioning; good for ad-hoc queries
    • Partitioning: Partition by date, customer, region; Athena scans only needed partitions (cost ↓)
    • File format: Parquet (columnar) better than CSV; Glue auto-converts
  5. Cost optimization:

    • Use Athena on-demand for variable queries; no idle infrastructure
    • Use Redshift Spectrum to query S3 without loading (data stays in S3)
    • Partition and compress; use columnar formats (Parquet, ORC)
    • Delete old data (TTL, Intelligent-Tiering); archive to Glacier if needed
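
As an example of how partition pruning keeps Athena costs down, the query below touches only a single day of a hypothetical partitioned Parquet table (database, table, and results bucket are illustrative):

import boto3

athena = boto3.client("athena")

# Query only one day's partitions so Athena scans (and bills for) a fraction of the data
response = athena.start_query_execution(
    QueryString="""
        SELECT customer_id, SUM(amount) AS total
        FROM gold.orders                      -- hypothetical Parquet table in the Glue catalog
        WHERE year = '2025' AND month = '01' AND day = '15'
        GROUP BY customer_id
    """,
    QueryExecutionContext={"Database": "gold"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])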

Well-Architected Alignment:

  • Operational Excellence: Crawlers for auto-discovery, Data Catalog for searchability, Glue workflows for orchestration
  • Security: Lake Formation permissions, encryption, audit logging, data classification
  • Reliability: Data versioning (S3), immutability (S3 Object Lock), backup (cross-region replication)
  • Performance: Partitioning, columnar formats, Spectrum for external queries, Redshift for complex OLAP
  • Cost Optimization: Athena pay-per-query, Redshift reserved for predictable workloads, S3 Intelligent-Tiering
  • Sustainability: Consolidate analytics on shared platforms; avoid silos; centralized data reduces duplication

Cost Model:

  • S3: $0.023 per GB/month (standard); $0.0125 (infrequent access)
  • Athena: $6.25 per TB scanned (queries only scan partitions needed)
  • Glue: $0.44 per DPU-hour
  • Redshift: ~$0.25/hour (dc2.large); ~$1.09/hour (ra3.xlplus); RA3 managed storage billed separately
  • Lake Formation: $1 per million metadata requests; ~$0 for permissions if using Glue

2.7 AI/ML & Agentic Systems

Characteristics:

  • Data-driven decision making and automation
  • Model training, evaluation, serving pipelines
  • Real-time inference or batch scoring
  • Emerging: Agentic AI (agents that reason and take actions)

Real-World Examples:

  • Recommendation systems, personalization
  • Fraud detection, anomaly detection
  • Natural language processing (chatbots, document analysis)
  • Computer vision (image classification, object detection)
  • Agentic AI: Autonomous agents that interact with systems, query databases, make decisions

Core AWS Services:

  • ML Platform: SageMaker (end-to-end), Bedrock (foundation models, agents)
  • Model Training: SageMaker Training, Batch, EMR (Spark/PySpark)
  • Feature Store: SageMaker Feature Store (managed; offline/online feature serving)
  • Model Registry: SageMaker Model Registry (version control, approval workflow)
  • Inference: SageMaker endpoints (real-time), SageMaker Batch Transform (offline), Lambda@Edge (edge ML)
  • Agentic AI: Bedrock Agents, Amazon Q (enterprise search/automation), Step Functions (orchestration)
  • Tools/APIs: Lambda (tool execution), API Gateway (external integrations), DynamoDB (state)
  • Data: S3 (training data), RDS/DynamoDB (features, state), OpenSearch (vector search for RAG)

Architecture Pattern (Traditional ML):

Data → SageMaker Training (S3 input) → Model Registry
                                    ↓
                         SageMaker Endpoint (real-time inference)
                         or
                         SageMaker Batch Transform (offline scoring)
                                    ↓
                         Application / BI Tool

Architecture Pattern (Agentic AI):

User Query → Bedrock Agent (orchestration) → Foundation Model (reasoning)
                                ↓
                        Tool Selection & Execution
                    ├── Lambda (query database)
                    ├── API Gateway (call external services)
                    ├── DynamoDB (read/write state)
                    └── OpenSearch (semantic search, RAG)
                                ↓
                        Response to User

Key Design Decisions:

  1. Foundation Model (Bedrock Agents):

    • Use Claude, Llama, Mistral, etc. via Bedrock (no provisioning)
    • Custom fine-tuned models (SageMaker) for domain-specific tasks
    • Prompt engineering and in-context learning for performance
  2. Agentic system architecture:

    • Tools: Lambda functions exposing APIs (query database, update CRM, send email)
    • Knowledge bases: OpenSearch or Bedrock Knowledge Base (vector embeddings for RAG)
    • Memory: DynamoDB for conversation history, user state, execution traces
    • Orchestration: Bedrock Agents handle reasoning; Step Functions for complex multi-step workflows
  3. Feature engineering and serving:

    • Offline: Glue/Spark to compute features in batch; store in S3
    • Online: SageMaker Feature Store for low-latency feature retrieval during inference
    • Real-time: Compute on-the-fly in inference code if lightweight; Lambda acceptable
  4. Model training pipeline:

    • Data preparation: Glue (structured), SageMaker Processing (custom code)
    • Training: SageMaker Training (managed) or EMR (Spark MLlib)
    • Evaluation: SageMaker Experiments (track metrics); model registry for approval
    • Deployment: SageMaker Endpoints (auto-scaling) or Lambda container images
  5. Inference patterns:

    • Real-time (< 1 sec): SageMaker endpoints (auto-scaling), Lambda (smaller models)
    • Near-real-time (seconds): Lambda + concurrent invocations, ECS
    • Batch (hours): SageMaker Batch Transform, Batch, Glue
    • Edge: Lambda@Edge, Greengrass for on-device inference
  6. RAG (Retrieval-Augmented Generation):

    • Knowledge base: OpenSearch with vector embeddings or Bedrock Knowledge Base
    • Retrieval: Semantic similarity search on user query
    • Generation: Agent passes retrieved context to LLM for response
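
The generation step of a RAG flow can be as small as a single Bedrock call once retrieval has produced candidate passages. The sketch below uses the Bedrock Converse API; the model ID, prompt shape, and inference settings are assumptions, and the retrieval step (OpenSearch or a Bedrock Knowledge Base) is assumed to have happened upstream.

import boto3

bedrock = boto3.client("bedrock-runtime")

def answer_with_context(question, retrieved_passages,
                        model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """RAG generation step: pass retrieved context plus the user question to the model."""
    context_block = "\n\n".join(retrieved_passages)
    prompt = f"Answer using only this context:\n{context_block}\n\nQuestion: {question}"

    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]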

Well-Architected Alignment:

  • Operational Excellence: Experiment tracking (MLflow in SageMaker), model registry, automated retraining, A/B testing
  • Security: Encryption (KMS for S3 training data), IAM roles, PII redaction in prompts (for agents), audit logging
  • Reliability: Model versioning, canary deployments (gradual rollout), monitoring for model drift
  • Performance: Feature store for low-latency retrieval, batching for throughput, multi-GPU training
  • Cost Optimization: Use Bedrock for inference (pay-per-invocation) vs. SageMaker endpoint provisioning; spot training
  • Sustainability: Batch inference for non-urgent tasks; consolidate models; avoid redundant retraining

Cost Model:

  • Bedrock: $0.001–$0.015 per 1K input tokens (Claude Sonnet: $0.003/1K in, $0.015/1K out)
  • SageMaker endpoint (ml.m5.large): $0.0864/hour + data transfer
  • SageMaker Training (ml.p3.2xlarge with GPU): $3.06/hour
  • SageMaker Feature Store: $0.40 per million requests (online); $0.002 per GB/month (offline)
  • Bedrock Knowledge Bases: embedding cost of roughly $0.0001 per 1K input tokens with the Titan Text Embeddings model, plus the cost of the vector store

2.8 Edge, IoT, and Real-Time Communication

Characteristics:

  • Data originates at edge devices (sensors, mobile, on-prem)
  • Local processing and decision-making required (low latency)
  • Intermittent or unreliable connectivity
  • Scale: thousands to millions of devices

Real-World Examples:

  • Predictive maintenance (machine sensors → local ML → cloud)
  • Smart home automation
  • Mobile offline-first apps
  • Autonomous vehicles (local processing + cloud sync)
  • Real-time collaboration (low-latency WebSockets)

Core AWS Services:

  • Edge Compute: IoT Greengrass (on-device), Lambda@Edge (CloudFront), Outposts (on-prem)
  • IoT Connectivity: IoT Core (MQTT), IoT Wireless
  • Local Storage: Greengrass local resource access
  • Cloud Sync: DynamoDB global tables (eventual consistency), AppSync (real-time GraphQL)
  • Real-time APIs: API Gateway WebSocket, AppSync
  • Analytics: Kinesis, Timestream (time-series), OpenSearch

Architecture Pattern (IoT):

Sensors → Greengrass (local processing) → IoT Core (MQTT) → Kinesis/DynamoDB → Analytics
                                               ↓
                                        Local decision-making
                                        (low latency)

Architecture Pattern (Real-Time Collaboration):

Client (WebSocket) → API Gateway (WS) → Lambda (connect/message/disconnect)
                                             ↓
                                        DynamoDB (active connections)
                                        SNS/SQS (broadcast)
                                             ↓
                                        API Gateway (push to clients)

Key Design Decisions:

  1. Edge vs. cloud processing:

    • Local ML inference: Greengrass for low-latency decisions; models updated from cloud
    • Cloud processing: IoT Core → Kinesis → Lambda for aggregation and advanced analytics
    • Hybrid: Train in cloud; inference at edge; feedback loop for retraining
  2. Connectivity:

    • Always-on: IoT Core MQTT (publish/subscribe), reliable connection
    • Intermittent: Greengrass syncs when connected; local queue until reconnection
    • Low bandwidth: Compress data; send only deltas; batching
  3. Real-time communication:

    • WebSocket (API Gateway): Two-way, low-latency, good for < 100k concurrent connections
    • AppSync: GraphQL, built-in subscriptions, automatic connection management
    • Kinesis (event stream): High-throughput, ordered by partition, good for > 1M events/sec
  4. Data durability at edge:

    • Store locally in Greengrass; sync to cloud when connectivity restored
    • Use DynamoDB global tables for eventual consistency (multi-region)
    • S3 event notifications for file-based data
  5. Security:

    • IoT Core certificate-based auth; Greengrass runs as daemon
    • TLS encryption; end-to-end if needed
    • Least-privilege IAM roles
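
Tying the real-time collaboration pattern together, the broadcast side can be a small function that reads active connection IDs from DynamoDB and pushes to each one through the API Gateway Management API. The table name, endpoint URL, and cleanup behavior below are illustrative.

import json
import boto3

# Hypothetical table of active WebSocket connection IDs and the API Gateway WS endpoint
connections = boto3.resource("dynamodb").Table("ws-connections")
apigw = boto3.client(
    "apigatewaymanagementapi",
    endpoint_url="https://abc123.execute-api.us-east-1.amazonaws.com/prod",  # placeholder
)

def broadcast(message):
    """Push a message to every connected client; drop connection IDs that have gone away."""
    payload = json.dumps(message).encode("utf-8")
    for item in connections.scan()["Items"]:          # fine for small fleets; paginate at scale
        connection_id = item["connection_id"]
        try:
            apigw.post_to_connection(ConnectionId=connection_id, Data=payload)
        except apigw.exceptions.GoneException:
            connections.delete_item(Key={"connection_id": connection_id})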

Well-Architected Alignment:

  • Operational Excellence: Device shadowing (Greengrass), fleet-wide updates, telemetry
  • Security: Certificate rotation, encrypted communication, local data at-rest encryption
  • Reliability: Local queue on disconnect, eventual sync, multi-region global tables
  • Performance: Local ML inference (milliseconds), batched cloud sync, Kinesis for ordering
  • Cost Optimization: IoT Core MQTT cheaper than constant cloud calls; Greengrass reduces cloud traffic
  • Sustainability: Local processing reduces cloud load; edge devices only send necessary data

Cost Model:

  • IoT Core: $1.00 per million messages
  • IoT Greengrass core device: $1.00/month per device
  • API Gateway WebSocket: $0.35 per million messages
  • Timestream: $0.30 per million writes + $0.01 per million query units

2.9 Hybrid & Multi-Account Systems

Characteristics:

  • Systems span on-premises, AWS, or multiple AWS accounts/regions
  • Data and workloads need to integrate seamlessly
  • Governance and cost attribution across boundaries
  • Network connectivity maintained reliably

Real-World Examples:

  • Large enterprises with on-premises datacenters migrating to cloud
  • Multi-tenant SaaS (separate AWS account per customer)
  • Regulated industries (separate account for prod, dev, compliance)
  • Global companies (separate regions for data residency)

Core AWS Services:

  • Connectivity: AWS Direct Connect (dedicated network), VPN (encrypted tunnel), Transit Gateway (hub-and-spoke)
  • Multi-Account Mgmt: AWS Organizations, Control Tower, Security Hub, Config
  • Cross-Account Access: IAM roles (assume role across accounts), Resource-based policies
  • Data Sync: DataSync (on-prem ↔ S3), DMS (database replication), S3 cross-account replication
  • Governance: Lake Formation cross-account access, Tag policies, SCPs (Service Control Policies)

Architecture Pattern (Hybrid Cloud):

On-Premises ← Direct Connect / VPN → AWS
    ↓                                    ↓
Data Center            VPC + Private subnets
    ↓                                    ↓
Legacy Apps         Modern Apps (EC2/Lambda)
    ↓                                    ↓
Database ← DataSync/DMS → RDS/DynamoDB
    ↓                                    ↓
Monitoring ← EventBridge → CloudWatch

Architecture Pattern (Multi-Account SaaS):

AWS Organizations
├── Master (billing, security)
├── Prod Account 1 (Tenant A)
├── Prod Account 2 (Tenant B)
├── Dev Account
└── Security/Logging Account

Cross-Account:
- Assume role (Tenant A's apps assume role in Tenant A's account)
- Log aggregation (each account sends logs to central Security account)
- Data sharing (Lake Formation: Tenant A shares data with Tenant B via central account)

Key Design Decisions:

  1. Connectivity:

    • Direct Connect: Dedicated network; consistent bandwidth; good for large data transfers
    • VPN: Encrypted tunnel; supports multiple connections; good for backup/failover
    • Transit Gateway: Hub-and-spoke; simplifies multi-VPC and on-prem connectivity
  2. Cross-account access:

    • Role assumption: Service A in Account 1 assumes role in Account 2 to access resource
    • Resource-based policies: S3 bucket allows Account 2 principal to access
    • Federation: Active Directory users assume AWS roles via SAML/OIDC
  3. Data consistency:

    • Synchronous replication (DMS): Real-time sync; good for transactional databases
    • Asynchronous replication (S3 cross-account, cross-region): Eventual consistency; lower latency impact
    • Event-based sync: EventBridge rules replicate data; decoupled, scalable
  4. Governance:

    • Organizations: Centrally managed accounts; consolidated billing; SCPs for guardrails
    • Control Tower: Baseline accounts with security/compliance guardrails
    • Lake Formation: Centralized data lake with cross-account access; tag-based permissions
  5. Cost allocation:

    • Organization consolidates billing; cost tags for chargebacks
    • Cost anomaly detection per account
    • Separate AWS accounts per business unit or customer (clean billing, security isolation)
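
Cross-account access in code usually comes down to one STS call. The sketch below assumes a role in a hypothetical tenant account and returns a boto3 client bound to the temporary credentials; the role ARN and bucket name are placeholders, and the target role's trust policy must allow the calling principal.

import boto3

def client_in_other_account(service, role_arn, session_name="cross-account-access"):
    """Assume a role in another account and return a boto3 client that uses it."""
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,
        DurationSeconds=3600,
    )["Credentials"]
    return boto3.client(
        service,
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# Example: read a bucket owned by the tenant account (ARN and bucket name are illustrative)
s3 = client_in_other_account("s3", "arn:aws:iam::210987654321:role/tenant-read-only")
print(s3.list_objects_v2(Bucket="tenant-a-data", MaxKeys=5).get("KeyCount"))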

Well-Architected Alignment:

  • Operational Excellence: Centralized logging (Security account), Config for compliance, Systems Manager for patching
  • Security: Least-privilege cross-account roles, encryption in-transit (Direct Connect, VPN), audit trails per account
  • Reliability: Multi-path connectivity (Direct Connect + VPN), cross-account backups, replication
  • Performance: Direct Connect for consistent bandwidth; Transit Gateway for simplified routing
  • Cost Optimization: Reserved capacity per account; consolidated billing for discounts; shutdown unused accounts
  • Sustainability: Consolidate under-utilized accounts; turn off dev accounts when not in use

Cost Model:

  • Direct Connect: $0.30 per hour + $0.02 per GB outbound (cheaper than data transfer for high volumes)
  • VPN: $0.05 per hour + standard data transfer charges
  • Transit Gateway: $0.05 per hour + $0.02 per GB processed through TGW

Section 3: AWS Service Decision Matrix (Core Framework)

For every major AWS service category, this section provides structured guidance on when to use, when NOT to use, tradeoffs, cost model, scaling behavior, and operational burden.

3.1 Compute Services

AWS Lambda

When to Use:

  • Synchronous request/response (API calls, webhooks)
  • Asynchronous event processing (S3 triggers, SQS, SNS, EventBridge)
  • Short-duration workloads (< 15 minutes)
  • Variable or bursty load (scale from 0 to 1000s automatically)
  • Cost-sensitive low-frequency tasks

When NOT to Use:

  • Long-running processes (> 15 min timeout; use ECS or EC2)
  • CPU-intensive workloads without strict latency constraints (EC2 cheaper)
  • Stateful applications with persistent connections (ECS/EC2)
  • Workloads with consistent, predictable high throughput (ECS/EC2 with Savings Plans cheaper)
  • Real-time latency-critical (< 100ms consistently; ECS/EC2 warm; Lambda cold start ~1-2 sec)

Tradeoffs:

  • Pros: No infrastructure to manage; auto-scales; pay only for execution; fast deployments
  • Cons: Cold starts (100ms-2s), 15-min timeout, 10GB max memory, vendor lock-in, difficult debugging

Cost Model:

  • Pricing: $0.0000002 per request + $0.0000166667 per GB-second
  • Example: 1M requests, 1GB RAM, 100ms duration = $0.2 (requests) + $1.67 (compute) = ~$1.87/month
  • Free tier: 1M requests + 400k GB-seconds/month

Scaling Behavior:

  • Auto-scales from 0 to 1000 concurrent executions (soft limit; request increase)
  • Reserved concurrency for baseline; provisioned concurrency for predictable warm starts
  • Burst: scales rapidly to an initial burst limit, then adds further concurrency at a fixed per-minute rate (exact limits are region-dependent and periodically raised by AWS)

Operational Burden:

  • Low: No patching, scaling, or infrastructure management
  • Debugging: CloudWatch logs, X-Ray tracing, local SAM testing
  • Versioning: Built-in; easy blue-green deployments

Well-Architected Mapping:

  • Cost: Ideal for variable workloads; reserved concurrency for baseline (cheaper for high consistent load)
  • Performance: Cold start sensitive; provisioned concurrency adds cost but ensures warm
  • Reliability: Automatically retried on transient failures; DLQ for async failures
  • Security: IAM role per function; no SSH access

Amazon ECS (Elastic Container Service)

When to Use:

  • Containerized applications (Docker, non-Kubernetes)
  • Moderate to high load (100s to 1000s requests/sec)
  • Long-running services (> 15 min)
  • Need for quick startup (< 1 sec) and predictable latency
  • Mixed workloads on shared cluster

When NOT to Use:

  • Variable, bursty workloads (Lambda more cost-effective for low utilization)
  • Complex orchestration (Kubernetes / EKS recommended)
  • Minimal infrastructure (Lambda/API Gateway less operational burden)

Tradeoffs:

  • Pros: Fast startup, warm containers, good for stateful services, mixed workload consolidation, less overhead than EKS
  • Cons: Requires container registry (ECR), task definitions, service scaling configuration; no built-in persistent volumes (use EFS)

Cost Model:

  • Fargate (serverless): $0.04695 per vCPU-hour + $0.00519 per GB-hour
  • Example: 1 vCPU, 2 GB, always-on = ($0.0470 + $0.0104)/hour ≈ $0.057/hour ≈ $42/month
  • EC2 launch type (self-managed): pay for the underlying EC2 instances; better ROI at sustained high utilization

Scaling Behavior:

  • Auto Scaling Group scales EC2 instances; ECS scheduler places tasks
  • Fargate scales near-instantly; EC2 scaling depends on instance launch time (~1-2 min)
  • Target tracking (CPU, memory); step scaling for complex rules

Operational Burden:

  • Moderate: Manage task definitions, scaling policies, blue-green deployments
  • Monitoring: CloudWatch metrics, container logs
  • Patching: Update task definitions and redeploy; ECS handles rolling updates

Well-Architected Mapping:

  • Cost: Fargate for simple, variable workloads; EC2 for predictable, high-utilization
  • Performance: No cold start; warm containers; can be more efficient than Lambda for sustained load
  • Reliability: Auto-restart failed tasks; service health checks; distributed across AZs
  • Security: IAM task role; container-level isolation

Amazon EKS (Elastic Kubernetes Service)

When to Use:

  • Complex microservices ecosystems (10s-100s of services)
  • Kubernetes-native tooling and expertise available
  • Advanced orchestration needs (canary deployments, traffic shifting, service mesh)
  • Workloads already containerized in Kubernetes

When NOT to Use:

  • Small teams without Kubernetes expertise (ECS simpler)
  • Simple applications (Lambda or ECS sufficient)
  • Minimal operational overhead desired

Tradeoffs:

  • Pros: Powerful orchestration, extensive ecosystem (Istio, Helm, Prometheus), vendor-agnostic (portable to other clouds)
  • Cons: Operational complexity; requires cluster management (node updates, networking, security); higher learning curve

Cost Model:

  • EKS cluster: $0.10 per hour (control plane) + EC2 or Fargate for worker nodes
  • Full cost: $73/month (cluster) + compute costs
  • Good for: 10+ services; high utilization; can amortize control plane cost

Scaling Behavior:

  • Cluster autoscaler adds/removes nodes based on pod resource requests
  • Horizontal Pod Autoscaler (HPA) scales pod replicas by CPU/memory/custom metrics
  • Complex multi-dimensional scaling (per-deployment, per-namespace)

Operational Burden:

  • High: Cluster upgrades, node patching, networking (CNI plugins), monitoring (add-ons)
  • Requires dedicated SRE/platform team for large-scale

Well-Architected Mapping:

  • Cost: Only justified for complex workloads; simpler apps should use ECS
  • Performance: Fine-grained control; service mesh for advanced routing
  • Reliability: Self-healing, automated failover, rolling updates
  • Security: RBAC, network policies, pod security policies

Amazon EC2 (Elastic Compute Cloud)

When to Use:

  • Legacy applications requiring specific OS or drivers
  • High CPU/memory workloads (compute-intensive)
  • Persistent connections (WebSockets, SSH, persistent databases)
  • Need for dedicated hardware or licensing

When NOT to Use:

  • Stateless, short-lived workloads (Lambda cheaper)
  • Variable load without auto-scaling configured
  • Greenfield modern applications (serverless preferred)

Tradeoffs:

  • Pros: Full control, any OS/software, persistent storage (EBS), good for sustained workloads
  • Cons: Operational burden (patching, scaling, security), must manage capacity, higher baseline cost

Cost Model:

  • On-demand: $0.0116/hour (t3.micro) to $10+/hour (large instances)
  • Reserved (1-year): ~40% discount; 3-year: ~65% discount
  • Spot: Up to 90% off on-demand; interruption risk
  • Good for: Baseline capacity (reserved); variable load (on-demand + spot)

Scaling Behavior:

  • Auto Scaling Group scales by launch time (~1-2 min), not instant
  • Gradual scaling (step scaling, target tracking) for stability
  • Spot fleet for cost optimization

Operational Burden:

  • High: OS patching, security groups, IAM roles, monitoring, backups (EBS snapshots)
  • AMI management for consistent deployments

Well-Architected Mapping:

  • Cost: Optimize with Reserved Instances + Savings Plans; use Savings Plans for flexibility across instance families
  • Performance: Predictable performance; tune instance type for workload
  • Reliability: Multi-AZ ASG; EBS volumes for persistence; snapshots for backup
  • Security: Security groups, IAM instance roles, encrypted EBS

AWS Batch

When to Use:

  • Large-scale batch processing (1000s of parallel jobs)
  • Cost-optimized batch workloads (use spot instances)
  • Scheduled jobs (daily, weekly ETL)
  • Distributed processing without Spark/Hadoop complexity

When NOT to Use:

  • Real-time or interactive workloads (ECS, Lambda)
  • Complex data transformations (Glue, EMR)

Tradeoffs:

  • Pros: Managed job queue, auto-scaling, spot instances for cost, simple job definitions
  • Cons: Not real-time; latency for job startup; limited monitoring compared to ECS

Cost Model:

  • Compute: Pay for EC2/Fargate underlying (Batch orchestration free)
  • Spot instances: 70% off on-demand
  • Example: 1,000 jobs × 1 hour each on m5.large spot (~70% off the ~$0.096/hour on-demand rate, i.e. ~$0.03/hour) ≈ 1,000 instance-hours × $0.03 ≈ $30

Scaling Behavior:

  • Managed job queue with auto-scaling
  • Jobs queued; Batch launches instances on demand
  • Scale-down after job completion (if no backlog)

Operational Burden:

  • Low-moderate: Define job definitions; submit jobs; Batch manages the rest
  • Monitoring: CloudWatch metrics, job logs

Well-Architected Mapping:

  • Cost: Excellent for batch workloads; spot instances reduce cost significantly
  • Reliability: Automatic retries on failure; DLQs for failed jobs
  • Performance: Parallel execution scales linearly

AWS Elastic Beanstalk

When to Use:

  • Simple web applications or APIs
  • Team comfortable with code, not infrastructure
  • Want managed platform without container complexity

When NOT to Use:

  • Complex microservices architectures (EKS)
  • Highly customized infrastructure
  • Need for fine-grained infrastructure control

Tradeoffs:

  • Pros: Managed; simple deployments (git push or CLI); handles scaling, load balancing, monitoring
  • Cons: Less control than ECS; less powerful than EKS; can be "magic" (hard to debug)

Cost Model:

  • Same as underlying EC2 + ALB + RDS (if used)
  • No additional Beanstalk fee
  • Good for: Simple apps where operational simplicity outweighs cost

Scaling Behavior:

  • Auto Scaling Group scales EC2 instances
  • Health checks monitor instances

Operational Burden:

  • Very Low: Push code; Beanstalk handles deployment, scaling, logging

Well-Architected Mapping:

  • Operational Excellence: Managed deployments; built-in monitoring; environment cloning
  • Cost: Transparent cost; same as self-managed
  • Performance: Good for small to medium workloads

3.2 Storage Services

Amazon S3 (Simple Storage Service)

When to Use:

  • Object storage (files, media, logs, backups, data lake)
  • Static website hosting
  • Archive (Glacier tiers)
  • Durability requirement (11 nines)

When NOT to Use:

  • Block storage (use EBS)
  • File system access patterns (use EFS)
  • Database (use RDS, DynamoDB)
  • Real-time read/write latency (milliseconds)

Tradeoffs:

  • Pros: Infinitely scalable, highly durable, cheap at scale, multi-region replication, lifecycle policies
  • Cons: Not a file system (overwrites replace the whole object; no partial updates), first-byte latency ~100-200ms, IAM/bucket-policy complexity

Cost Model:

  • Standard: $0.023 per GB/month
  • Intelligent-Tiering: $0.0125 per GB/month (auto-moves to cheaper tiers based on access)
  • Glacier Instant: $0.004 per GB/month (instant retrieval)
  • Glacier Flexible: $0.0036 per GB/month (1-12 hour retrieval)
  • Requests: $0.005 per 1k PUT/COPY/POST/LIST, $0.0004 per 1k GET
  • Data transfer out: $0.09 per GB (in-region free)
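
To illustrate the tiering above, a minimal boto3 lifecycle sketch (bucket name, prefix, and day thresholds are illustrative assumptions):

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; transitions mirror the tiers priced above.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-then-archive",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 180, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 730},  # delete after two years
            }
        ]
    },
)
```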

Scaling Behavior:

  • Unlimited storage; auto-scales
  • Throughput: 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix
  • For higher throughput, spread objects across multiple key prefixes (each prefix scales independently)

Operational Burden:

  • Very Low: Fully managed; no provisioning, scaling, or patching
  • Configuration: Bucket policies, CORS, versioning, lifecycle, replication

Well-Architected Mapping:

  • Cost: Incredibly cheap at scale; Intelligent-Tiering for unknown access patterns; Glacier for archival
  • Reliability: 11 nines durability; cross-region replication for disaster recovery; versioning for data protection
  • Security: Bucket policies, ACLs, encryption (SSE-S3, SSE-KMS), Block Public Access, access logging
  • Sustainability: Intelligent-Tiering reduces carbon footprint; archive old data to Glacier

Amazon EBS (Elastic Block Store)

When to Use:

  • Block storage for EC2 instances (root volume, data volumes)
  • Persistent storage for databases
  • High-performance workloads (SSD gp3, io2)

When NOT to Use:

  • Object storage (use S3)
  • Shared file system (use EFS)
  • Archival (use Glacier)

Tradeoffs:

  • Pros: Block-level, high performance, snapshots for backup, encryption
  • Cons: Must be attached to EC2 instance; limited to single AZ (unless snapshot-replicated)

Cost Model:

  • gp3 (general-purpose SSD): $0.08 per GB/month (includes 3k IOPS, 125 MB/s throughput)
  • io2 (high I/O SSD): $0.125 per GB/month; $0.065 per IOPS provisioned
  • st1 (throughput-optimized HDD): $0.045 per GB/month
  • Snapshots: $0.05 per GB/month (incremental)
  • Example: 100 GB gp3 = $8/month

Scaling Behavior:

  • Size is set at provisioning; volumes can be grown online but never shrunk (migrate to a smaller volume instead)
  • IOPS scale independently of size (gp3: up to 16k IOPS)

Operational Burden:

  • Low: Fully managed; snapshots are automated if configured
  • Monitoring: Volume metrics in CloudWatch

Well-Architected Mapping:

  • Reliability: Snapshots for backup; replicate snapshots for cross-AZ recovery
  • Performance: Choose instance-store for max IOPS (no persistence); EBS for balance
  • Cost: gp3 cheaper than gp2 at same performance; delete unused snapshots

Amazon EFS (Elastic File System)

When to Use:

  • Shared file system for multiple EC2 instances
  • NFS access required
  • Scaling file system across AZs

When NOT to Use:

  • High-performance (EBS/instance-store faster)
  • Archive (use S3)
  • Windows (use FSx for Windows File Server)

Tradeoffs:

  • Pros: Elastic (grow/shrink), multi-AZ, scalable across instances
  • Cons: Latency higher than EBS; NFS protocol overhead; more expensive

Cost Model:

  • Standard: $0.30 per GB/month
  • One Zone: $0.16 per GB/month (single AZ, roughly half the cost of Standard)
  • Provisioned throughput: billed per MB/s-month when provisioned above the bursting baseline
  • Example: 100 GB multi-AZ = $30/month

Scaling Behavior:

  • Auto-scales; no provisioning needed
  • Throughput: Bursting (up to 500 MB/s for 100 GB); provisioned for higher sustained

Operational Burden:

  • Low: Fully managed; no patching or replication

Well-Architected Mapping:

  • Reliability: Data replicated across AZs automatically
  • Performance: Throughput mode for parallelized workloads
  • Cost: More expensive than S3 for archive; less expensive than maintaining NFS servers

Amazon FSx for Windows File Server / Lustre

When to Use:

  • Windows-native file sharing (SMB/CIFS)
  • Lustre (high-performance computing, machine learning)

When NOT to Use:

  • NFS needed (use EFS)
  • General object storage (use S3)

Tradeoffs:

  • Pros: Fully managed; Windows-native; high performance
  • Cons: More expensive; less flexible than self-managed file servers

Cost Model:

  • Windows File Server: billed per GB-month of provisioned storage (SSD or HDD) plus provisioned throughput capacity
  • Lustre: billed per GB-month of provisioned storage; rate varies by deployment type (scratch vs. persistent)

Well-Architected Mapping:

  • Performance: Low-latency file access; good for Windows environments
  • Cost: Justified only for Windows workloads needing shared files

3.3 Database Services

Amazon RDS (Relational Database Service)

When to Use:

  • Structured data with complex queries (SQL)
  • ACID transactions required
  • Existing relational database workloads (MySQL, PostgreSQL, Oracle, SQL Server, MariaDB)
  • < 100 TB data

When NOT to Use:

  • NoSQL access patterns (key-value; use DynamoDB)
  • Unstructured data (use S3)
  • Extreme scale (> 100 TB; use Redshift or specialized database)
  • Real-time analytics (use Redshift, Athena, or OpenSearch)

Tradeoffs:

  • Pros: Managed, automated backups, Multi-AZ failover, read replicas, encryption
  • Cons: Limited to single-master writes (though read replicas help); must right-size; managed backups limited to 35 days

Cost Model:

  • Instance: $0.017/hour (db.t3.micro) to $10+/hour (db.r6g.16xlarge)
  • Storage: $0.10 per GB/month (gp2 SSD); $0.12 (io1 SSD)
  • Backups: First backup free; additional storage $0.10 per GB/month
  • Data transfer: $0.01 per GB (out-of-region)
  • Example: db.t3.small (small app) = ~$50/month + storage

Scaling Behavior:

  • Vertical scaling (change instance type; requires downtime or read-replica promotion)
  • Read replicas for horizontal read scaling (async replication; eventual consistency)
  • Auto Scaling for storage (grow up to max)

Operational Burden:

  • Low-moderate: Backups, upgrades managed; must monitor CPU/disk; schema management
  • Multi-AZ automatic failover for HA

Well-Architected Mapping:

  • Reliability: Multi-AZ for automatic failover (2x cost); read replicas for DR and scaling
  • Performance: Connection pooling; query optimization; appropriate indexes
  • Cost: Right-size instance type; use Reserved Instances (1-year: ~31% off, 3-year: ~43% off); gp2 → gp3 for cost reduction
  • Security: Encryption at-rest (KMS), in-transit (SSL); IAM database authentication; encrypted backups

Amazon Aurora (MySQL/PostgreSQL-compatible)

When to Use:

  • High-throughput, low-latency SQL workloads
  • Mission-critical applications requiring high availability
  • Need for read scaling (15 read replicas)
  • Up to 128 TB storage

When NOT to Use:

  • Simple applications (RDS enough)
  • Cost-sensitive simple workloads (Aurora instance-hours typically run ~20% above equivalent RDS)
  • Oracle-licensed (RDS Oracle only)

Tradeoffs:

  • Pros: 5x faster than MySQL, 3x faster than PostgreSQL; 15 read replicas; auto-scaling storage; distributed architecture
  • Cons: More expensive per instance-hour; single-writer is the standard topology (multi-writer support is limited); Aurora Serverless can add cold-start/scale-up latency

Cost Model:

  • DB instance: ~$0.04/hour (db.t3.small) to ~$8+/hour (db.r6g.16xlarge); roughly 20% above comparable RDS per instance-hour
  • Storage: $0.10 per GB/month (only pay for used; auto-scaling)
  • Example: 100 GB, db.r6g.large (1 writer + 2 readers) ≈ $570/month plus I/O charges (better ROI at scale)

Scaling Behavior:

  • Read replicas scale independently (up to 15)
  • Auto Scaling Aurora Serverless (aurora-mysql) for variable load; pause when idle
  • Storage auto-scales; no disk-full risk

Operational Burden:

  • Very Low: Managed; replication automatic; backups automated (35-day retention)
  • Multi-AZ automatic; cross-region read replica option

Well-Architected Mapping:

  • Reliability: Multi-AZ primary + read replicas; automatic failover; cross-region DR
  • Performance: 5x throughput gains; up to 100k write capacity; 500k read capacity
  • Cost: Higher upfront; lower per-transaction cost at scale; Serverless variant for variable loads
  • Sustainability: Shared compute for read replicas (efficient); auto-pause for low utilization

Amazon DynamoDB

When to Use:

  • Key-value, document, or time-series data
  • Predictable access patterns (query by partition key)
  • Variable or bursty load (on-demand pricing)
  • Millisecond latency required
  • Serverless architecture

When NOT to Use:

  • Complex queries across unrelated attributes (RDS, Redshift)
  • Relational integrity (RDS)
  • Full-text search (OpenSearch)
  • Transactions spanning many items (DynamoDB transactions are capped at 100 items per request)

Tradeoffs:

  • Pros: Fully managed, auto-scales, millisecond latency, serverless, global tables
  • Cons: Query flexibility limited (must know partition key); eventual consistency for global tables; complex secondary indexes

Cost Model:

  • Provisioned: Read: $0.00013 per RCU-hour; Write: $0.00065 per WCU-hour
  • On-demand: $1.25 per 1M write requests; $0.25 per 1M read requests
  • Storage: $0.25 per GB/month
  • Example: 100 GB, on-demand, 100k write + 500k read/month = $0.125 + $0.125 + $25 = $25.25/month
  • Breakeven: Provisioned cheaper at > 1M writes/day or > 3M reads/day

Scaling Behavior:

  • Provisioned: Set RCU/WCU; auto-scales up within limits; scales down with delay (conservative)
  • On-demand: Scales instantly; no capacity planning; higher latency at extreme peak
  • Global Tables: Replicate to any region; cross-region replication is asynchronous and eventually consistent

Operational Burden:

  • Very Low: Fully managed; no patching, replication, or backups to configure
  • Monitoring: Consumed capacity, throttling, latency in CloudWatch

Well-Architected Mapping:

  • Cost: On-demand for variable; provisioned for predictable; reserved capacity for a stable baseline
  • Reliability: Automatic backups (Point-in-Time Recovery); Global Tables for multi-region
  • Performance: Millisecond latency; partition key design critical to avoid hot partitions
  • Security: Encryption at-rest (KMS), IAM fine-grained access, TTL for automatic deletion
  • Sustainability: Managed service; consolidate workloads
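
As a sketch of the partition-key-centric access pattern, assuming a hypothetical orders table keyed on customer_id (partition key) and order_date (sort key):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # hypothetical table name

# Query by partition key plus a sort-key range -- the access pattern
# the table's key schema was designed around.
resp = table.query(
    KeyConditionExpression=Key("customer_id").eq("c-1001")
    & Key("order_date").begins_with("2025-12"),
    Limit=50,
)
for item in resp["Items"]:
    print(item["order_id"], item["total"])
```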

Amazon Redshift

When to Use:

  • Data warehousing (TB-PB scale)
  • Complex OLAP queries (star schema, large joins)
  • BI/analytics workloads
  • Time-series analytics (rollups, trends)

When NOT to Use:

  • OLTP/transactional (RDS, DynamoDB)
  • Ad-hoc queries (Athena cheaper)
  • Real-time ingestion (Kinesis better)

Tradeoffs:

  • Pros: Petabyte-scale, fast complex queries, mature BI integration, Spectrum for querying S3
  • Cons: Requires provisioning and management; node failures degrade or interrupt the cluster until the node is replaced (multi-node clusters recover automatically)

Cost Model:

  • dc2.large: ~$0.25/hour (~$180/month per node)
  • ra3.xlplus (managed storage): ~$1.09/hour + ~$0.024 per GB/month for managed storage
  • Example: single-node dc2.large for a small, sustained warehouse ≈ $180/month

Scaling Behavior:

  • Vertical (resize cluster type) or horizontal (add nodes)
  • Resizing requires cluster downtime or elastic resize (newer, no downtime)

Operational Burden:

  • Moderate: Cluster maintenance, node replacement, VACUUM/ANALYZE
  • Spectrum: Query S3 without loading (larger effective warehouse without cluster growth)

Well-Architected Mapping:

  • Cost: Fixed monthly cost; break-even against Athena at ~1 TB queries/month; reserved instances available
  • Performance: Complex OLAP queries; star schema optimization; Spectrum for external data
  • Reliability: Snapshots for backup; cross-region snapshot copy for DR
  • Security: Encryption at-rest, in-transit; IAM; audit logging (via CloudWatch/S3)

Amazon OpenSearch (formerly Elasticsearch)

When to Use:

  • Full-text search (logs, documents)
  • Time-series analytics (logs with timestamps)
  • Real-time dashboards (Kibana visualization)
  • Vector search (embeddings for ML, RAG)

When NOT to Use:

  • Traditional OLAP (Redshift)
  • Transactional (RDS)
  • Pure archival (S3, Glacier)

Tradeoffs:

  • Pros: Fast full-text search, powerful aggregations, real-time visualization, vector search for ML
  • Cons: Requires cluster; complex cluster configuration (node types, shard allocation); eventual consistency

Cost Model:

  • Single node (t3.small): $0.123/hour (~$90/month)
  • Multi-node (3 data nodes, r5.large): $1.26/hour ($920/month)
  • Storage: EBS volumes billed separately at standard rates; size limits scale with node type
  • Example: Small logging cluster (3 nodes) ~$900/month + data retention

Scaling Behavior:

  • Horizontal scaling (add nodes); shard allocation balances data
  • Auto-scaling available but requires careful configuration
  • Manual index rollover for time-series (daily indices)

Operational Burden:

  • Moderate: Shard allocation, index management, cluster health monitoring
  • Snapshot repository for backup (S3)

Well-Architected Mapping:

  • Cost: Fixed; justified for search/logging volume > 1TB/month
  • Performance: Real-time search; aggregations on large datasets
  • Reliability: Snapshots to S3; replica shards for HA; cross-region replication
  • Security: Encryption, IAM, fine-grained access control via plugins

Amazon Timestream

Timestream (purpose-built for time-series data):

When to Use:

  • Metrics, sensor data, stock prices (time-series)
  • High-volume, append-only workloads
  • Automatic retention policies

When NOT to Use:

  • General analytics (Redshift, Athena)
  • Complex queries (OpenSearch better for logs)

Cost Model: ~$0.50 per 1M writes (1 KB each); queries billed per GB of data scanned (~$0.01/GB)


3.4 Integration & Messaging Services

Amazon SQS (Simple Queue Service)

When to Use:

  • Async decoupling (producer → queue → consumer)
  • Buffer burst traffic (queue absorbs spikes)
  • Reliable delivery guarantee
  • Work scheduling (Lambda pulling messages)

When NOT to Use:

  • Real-time pub/sub (SNS)
  • Complex routing (EventBridge)
  • Immediate delivery (SNS faster)

Tradeoffs:

  • Pros: Durable queue, at-least-once delivery, DLQ for failed messages, simple FIFO option
  • Cons: Polling model (not push); no broad fanout (one consumer per message)

Cost Model:

  • Standard SQS: $0.40 per 1M requests
  • FIFO: $0.50 per 1M requests + deduplication/group messaging
  • Example: 1M messages/month = $0.40 (essentially free at small scale)

Scaling Behavior:

  • Unlimited message count; auto-scales
  • Message retention: 1 minute to 14 days (default 4 days, configurable)
  • Consumers poll (or Lambda Event Source Mapping triggers Lambda)

Operational Burden:

  • Very Low: Fully managed; configure retention, visibility timeout, DLQ

Well-Architected Mapping:

  • Reliability: Durable queue; DLQ for failed messages; at-least-once delivery (idempotent processing required)
  • Cost: Extremely cheap; pay per 1M requests
  • Performance: Polling latency 0-20 sec (depends on ReceiveMessageWaitTimeSeconds)
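
A minimal boto3 consumer sketch showing long polling and at-least-once handling (the queue URL and processing logic are placeholders):

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # placeholder

def process(body: str) -> None:
    print("processing", body)  # stand-in for idempotent business logic

# Long polling (WaitTimeSeconds up to 20) cuts empty receives and request cost.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
for msg in resp.get("Messages", []):
    process(msg["Body"])
    # Delete only after successful processing; otherwise the message
    # reappears once the visibility timeout expires.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```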

Amazon SNS (Simple Notification Service)

When to Use:

  • Fanout (one message → many consumers)
  • Pub/Sub pattern
  • Notifications (email, SMS, push)
  • Integration with SQS (SNS → SQS for durability)

When NOT to Use:

  • Ordered delivery (FIFO not as robust as SQS FIFO)
  • Complex routing (EventBridge)
  • Message history/replay needed (Kinesis)

Tradeoffs:

  • Pros: Simple pub/sub, instant delivery, fanout, many targets
  • Cons: No message history; no durability guarantees (best-effort delivery); no replay

Cost Model:

  • $0.50 per 1M publishes
  • Notifications: $0.02 per SMS, variable for email/HTTP
  • Example: 1M publishes → $0.50/month; negligible at small scale

Scaling Behavior:

  • Unlimited subscribers; auto-scales
  • Delivery attempts: 3 (exponential backoff)

Operational Burden:

  • Very Low: Fully managed; configure subscriptions, filters

Well-Architected Mapping:

  • Reliability: Couple with SQS for durability (SNS → SQS → Consumer)
  • Cost: Very cheap
  • Performance: Instant delivery
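
A small boto3 sketch of the SNS → SQS pairing described above (ARNs are placeholders; the queue access policy granting sns.amazonaws.com SendMessage is omitted):

```python
import json

import boto3

sns = boto3.client("sns")

# Placeholder ARNs; topic and queue are assumed to exist already.
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:order-events"
QUEUE_ARN = "arn:aws:sqs:us-east-1:123456789012:order-events-audit"

# Subscribe the queue to the topic so every published event lands durably in SQS.
sns.subscribe(
    TopicArn=TOPIC_ARN,
    Protocol="sqs",
    Endpoint=QUEUE_ARN,
    Attributes={"RawMessageDelivery": "true"},
)

# Publish once; every subscribed endpoint receives a copy (fanout).
sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps({"order_id": "o-42", "status": "PAID"}))
```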

Amazon EventBridge

When to Use:

  • Event-driven architecture (100+ event sources)
  • Rule-based routing (complex filtering)
  • Multi-target fanout (90+ targets)
  • Integration with AWS services and SaaS apps
  • Event replay and archiving

When NOT to use:

  • Simple queue (SQS)
  • Just fanout (SNS)
  • High-throughput streaming (Kinesis)

Tradeoffs:

  • Pros: Flexible routing, extensive targets, schema registry, event replay
  • Cons: More complex than SNS/SQS; slightly higher latency; limited throughput vs. Kinesis

Cost Model:

  • $1.00 per 1M custom events published (events emitted by AWS services are free)
  • Archive retention: $0.023 per GB/month
  • Example: 1M custom events/month = $1.00 (negligible)

Scaling Behavior:

  • Scales to thousands of events per second per account (soft limits that can be raised)
  • Auto-scales

Operational Burden:

  • Low: Define rules; manage targets
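
A minimal boto3 sketch of a rule plus target on the default event bus (the event source, pattern, and Lambda ARN are illustrative assumptions; the target function also needs a resource policy allowing events.amazonaws.com, omitted here):

```python
import json

import boto3

events = boto3.client("events")

# Rule matching a hypothetical custom event, with content-based filtering.
events.put_rule(
    Name="order-placed-to-fulfillment",
    EventPattern=json.dumps({
        "source": ["com.example.orders"],
        "detail-type": ["OrderPlaced"],
        "detail": {"total": [{"numeric": [">", 100]}]},
    }),
    State="ENABLED",
)

# Route matching events to a target (placeholder Lambda ARN).
events.put_targets(
    Rule="order-placed-to-fulfillment",
    Targets=[{
        "Id": "fulfillment-fn",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:fulfillment",
    }],
)
```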

Well-Architected Mapping:

  • Reliability: Event replay from archive; DLQ for failed targets
  • Cost: Extremely cheap
  • Performance: ~50-100ms latency; Kinesis faster for streaming

Amazon Kinesis Data Streams

When to Use:

  • Streaming (continuous data flow)
  • Ordered by partition (shard)
  • Real-time processing (seconds latency)
  • 24-hour message history (replay)
  • High throughput (100k+ events/sec)

When NOT to Use:

  • Simple queuing (SQS/SNS)
  • Low-volume events (EventBridge cheaper)
  • Long message history (Kafka/MSK)

Tradeoffs:

  • Pros: Real-time streaming, partitioned for ordering/parallelization, replay capability
  • Cons: Requires shard management or on-demand pricing; higher cost than SQS

Cost Model:

  • Provisioned: ~$0.015 per shard-hour + ~$0.014 per 1M PUT payload units (25 KB each)
  • On-demand: ~$0.04 per stream-hour + ~$0.08 per GB ingested
  • Example: 1M events/hour (1 KB each) ≈ 1 GB/hour; on-demand ≈ $29 (stream-hours) + $58 (ingest) ≈ $90/month; the same load fits one provisioned shard ≈ $11 + $10 in PUT units ≈ $21/month

Scaling Behavior:

  • Provisioned: Auto-scaling by shard count
  • On-demand: Scales instantly
  • Throughput: 1 MB/sec per shard (provisioned); billed per GB on-demand

Operational Burden:

  • Low-moderate: Shard management, consumer lag monitoring, partition key design
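
A brief boto3 producer sketch; the stream name is a placeholder, and the high-cardinality partition key is what spreads records across shards:

```python
import json
import random

import boto3

kinesis = boto3.client("kinesis")

# The partition key determines the shard, so use a high-cardinality value
# (e.g., device ID) to avoid hot shards.
event = {"device_id": f"sensor-{random.randint(1, 10_000)}", "temp_c": 21.7}
kinesis.put_record(
    StreamName="telemetry-stream",  # hypothetical stream
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["device_id"],
)
```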

Well-Architected Mapping:

  • Cost: Provisioned for baseline; on-demand for variable
  • Performance: Real-time streaming, 24-hour replay, partition ordering
  • Reliability: Replicated across AZs; consumer checkpointing for fault tolerance

Amazon MSK (Managed Streaming for Apache Kafka)

When to Use:

  • Kafka ecosystem expertise available
  • Cross-cloud/on-prem Kafka integration needed
  • Complex streaming logic (Kafka Streams, Flink)
  • Larger throughput than Kinesis economically viable

When NOT to Use:

  • AWS-only workloads (Kinesis simpler)
  • Low-volume (EventBridge, SQS cheaper)
  • Need for hands-off (Kinesis more managed)

Tradeoffs:

  • Pros: Kafka ecosystem, portability, higher throughput at scale
  • Cons: More operational burden; cluster management; higher baseline cost

Cost Model:

  • Broker cost: ~$0.21 per hour per kafka.m5.large broker (3 brokers ≈ $460/month baseline)
  • Storage: $0.10 per GB/month
  • Data transfer: $0.02 per GB (cross-region / out)
  • Example: 3 brokers, 100 GB storage ≈ $460 + $10 = $470/month

Scaling Behavior:

  • Broker count and storage size can be scaled
  • Storage auto-scaling is available; changing broker count or type is a cluster update operation

Operational Burden:

  • High: Broker configuration, topic management, consumer group coordination, monitoring

Well-Architected Mapping:

  • Cost: Fixed baseline; good for high-volume/long-term commitments
  • Performance: High throughput; complex streaming logic
  • Reliability: Multi-AZ; broker redundancy; replication factor

AWS Step Functions

When to Use:

  • Multi-step workflows with conditional logic
  • Long-running processes (minutes to days)
  • Error handling and retries needed
  • Visual workflow definition
  • Orchestrating async services

When NOT to Use:

  • Simple task execution (Lambda sufficient)
  • Strict real-time (latency implications)

Tradeoffs:

  • Pros: Visual workflow, error handling, retry policies, state persistence, waiting
  • Cons: State transitions cost money; additional latency per step; limited native support for some use cases

Cost Model:

  • Standard: $0.000025 per state transition
  • Express: $1.00 per 1M requests ($0.000001 per invocation) + ~$0.00001667 per GB-second of duration
  • Example: 1M workflows, 10 steps = 10M transitions = $250/month

Scaling Behavior:

  • Millions of concurrent executions
  • Automatic

Operational Burden:

  • Low: Define state machine in JSON; Step Functions handles orchestration
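
A minimal sketch of defining and creating a Standard state machine with boto3, assuming a hypothetical Lambda task and execution role:

```python
import json

import boto3

sfn = boto3.client("stepfunctions")

# Minimal Amazon States Language definition: one Lambda task with retries.
definition = {
    "StartAt": "ProcessOrder",
    "States": {
        "ProcessOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-order",
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 3, "BackoffRate": 2.0}],
            "End": True,
        }
    },
}

sfn.create_state_machine(
    name="order-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-exec",  # placeholder role
    type="STANDARD",  # use "EXPRESS" for high-volume, short-lived workflows
)
```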

Well-Architected Mapping:

  • Reliability: Automatic retries, catch/throw errors, compensation logic
  • Cost: Extremely cheap for state transitions; Express mode for high-volume
  • Performance: Step execution ~0.5-2 sec latency; suited for async workflows, not real-time

3.5 Networking Services

Amazon VPC (Virtual Private Cloud)

When to Use:

  • Almost every workload (default)
  • Network isolation required
  • Custom IP addressing
  • Network ACLs, security groups

When NOT to Use:

  • Rarely; a VPC is foundational for almost every deployment

Architecture Pattern:

  • Public subnet: IGW for internet; ALB, NAT Gateway
  • Private subnet: No direct internet; NAT Gateway for outbound
  • Database subnet: Isolated; only accessible from app tier

Scaling Behavior:

  • Elastic; auto-scales
  • VPC can have max 5 CIDR blocks

Operational Burden:

  • Moderate: Design subnets, security groups, route tables, NAT Gateway (hourly cost)

Well-Architected Mapping:

  • Security: Network isolation; security groups per tier; NACLs for stateless filtering
  • Reliability: Multi-AZ subnets; redundant NAT Gateways
  • Cost: Free VPC; pay for NAT Gateway ($32/month per AZ), data transfer

Application Load Balancer (ALB) / Network Load Balancer (NLB)

ALB (Layer 7):

  • When to use: HTTP/HTTPS, microservices (path/hostname routing), WebSocket
  • Cost: ~$16/month base (us-east-1) + $0.008 per LCU-hour (load balancer capacity unit)
  • Latency: adds single-digit to tens of milliseconds

NLB (Layer 4):

  • When to use: Extreme throughput (millions of connections/sec), ultra-low latency, non-HTTP protocols (TCP, UDP)
  • Cost: ~$16/month base + $0.006 per NLCU-hour
  • Latency: adds sub-millisecond to low single-digit milliseconds

Well-Architected Mapping:

  • Reliability: Health checks; automatic failover; cross-AZ
  • Performance: ALB for flexibility; NLB for extreme throughput
  • Cost: ~$16/month baseline per load balancer plus capacity units; share across multiple services if possible

Amazon API Gateway

When to Use:

  • REST, HTTP, or WebSocket APIs (for GraphQL, use AWS AppSync)
  • Rate limiting, authentication (API keys, OAuth)
  • Request/response transformation
  • Caching

When NOT to Use:

  • Internal service-to-service (VPC Endpoints)
  • Extreme throughput (NLB better)

Cost Model:

  • $3.50 per 1M requests (regional)
  • $0.60 per 1M for WebSocket API
  • Data transfer: $0.09 per GB out

Scaling Behavior:

  • Auto-scales; no provisioning

Operational Burden:

  • Low: Define API, configure integrations, set up throttling

Well-Architected Mapping:

  • Security: API keys, OAuth, request validation, WAF integration
  • Reliability: Throttling prevents cascading failures
  • Performance: Caching reduces backend load
  • Cost: Extremely cheap; $3.50 per 1M requests (~$3.50/month for 1M requests)

Amazon CloudFront

When to Use:

  • Distribute static content globally
  • Cache API responses (cache headers)
  • DDoS protection (Shield Standard included)
  • Origin Shield for cache efficiency

When NOT to Use:

  • Single-region, low-traffic workloads

Cost Model:

  • $0.085 per GB (varies by region, USA cheapest)
  • Request: $0.01 per 10k

Scaling Behavior:

  • Auto-scales globally; no provisioning

Operational Burden:

  • Very Low: Configure origin, cache behavior, invalidation

Well-Architected Mapping:

  • Performance: Global latency reduction (users served from nearest edge); significant for global audiences
  • Cost: Minimal request cost; high if transferring large GB (offset by origin load reduction)
  • Security: DDoS protection, WAF, Origin Shield for burst traffic
  • Sustainability: Reduced origin load; distributed edge computing

AWS Transit Gateway

When to Use:

  • Hub-and-spoke connectivity (multiple VPCs, on-prem)
  • Simplified multi-VPC architecture
  • On-premises integration via Direct Connect

When NOT to use:

  • Single VPC (unnecessary)

Cost Model:

  • $0.05 per hour (~$36/month)
  • $0.02 per GB processed

Well-Architected Mapping:

  • Reliability: Centralized connectivity; simplified failover
  • Cost: Justified for 3+ VPCs or multi-region

AWS Direct Connect

When to Use:

  • Dedicated network from on-premises to AWS
  • Consistent bandwidth
  • Large data transfers (cheaper than internet data transfer)

When NOT to Use:

  • Small, intermittent connections (VPN sufficient)

Cost Model:

  • $0.30 per hour (~$218/month)
  • $0.02 per GB output

Well-Architected Mapping:

  • Reliability: Dedicated connection; predictable performance
  • Cost: Justified for > 10 TB/month transfers

3.6 Security Services

AWS IAM (Identity & Access Management)

When to use: Always

  • Every workload needs IAM roles and policies
  • Fine-grained permissions (least privilege)
  • Cross-account access via roles

Cost Model: Free

Well-Architected Mapping:

  • Security: Least privilege; assume role model; remove console access

AWS KMS (Key Management Service)

When to use:

  • Encrypt sensitive data (PII, financial, secrets)
  • At-rest encryption (S3, RDS, EBS)

Cost Model:

  • $1.00 per month per key
  • $0.03 per 10k requests

Well-Architected Mapping:

  • Security: Encryption; audit trail (CloudTrail); key rotation
  • Compliance: Required for regulated data (PII, HIPAA, PCI)

AWS Secrets Manager

When to use:

  • Store database credentials, API keys, tokens
  • Automatic rotation

Cost Model:

  • $0.40 per secret per month
  • $0.05 per 10k API calls

Well-Architected Mapping:

  • Security: No hardcoded credentials; rotation; audit
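
A short boto3 sketch of reading a secret at runtime instead of hardcoding credentials (the secret name and its JSON keys are assumptions):

```python
import json

import boto3

secrets = boto3.client("secretsmanager")

# Hypothetical secret holding database credentials as JSON.
value = secrets.get_secret_value(SecretId="prod/orders-db")
creds = json.loads(value["SecretString"])

# Build the connection string at runtime; nothing sensitive lives in config files.
dsn = f"postgresql://{creds['username']}:{creds['password']}@{creds['host']}:5432/orders"
print("connection string assembled for host", creds["host"])
```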

Amazon GuardDuty

When to use:

  • Threat detection
  • Continuous monitoring for malicious activity
  • Integration with Security Hub

Cost Model:

  • CloudTrail management events: ~$4.00 per 1M events analyzed (tiered); VPC Flow Logs and DNS logs billed per GB

Well-Architected Mapping:

  • Security: Automated threat detection; alerts; findings

AWS WAF (Web Application Firewall)

When to use:

  • Protect web applications from attacks (SQL injection, XSS, bot attacks)
  • Integration with CloudFront, ALB, API Gateway

Cost Model:

  • $5.00 per web ACL per month + $1.00 per rule or rule group per month
  • $0.60 per 1M requests

Well-Architected Mapping:

  • Security: Application-layer protection; rate limiting; IP blocking

3.7 Observability Services

Amazon CloudWatch

When to use: Every workload

  • Metrics (CPU, memory, custom)
  • Logs (application, system)
  • Alarms (trigger auto-scaling, SNS)
  • Dashboards (visualization)

Cost Model:

  • Logs: $0.50 per GB ingested; $0.03 per GB stored
  • Metrics: $0.30 per custom metric per month
  • Alarms: $0.10 per alarm per month

Well-Architected Mapping:

  • Operational Excellence: Observability; alerts; dashboards
  • Reliability: Alarms trigger auto-scaling; identify bottlenecks
  • Cost: Identify over-provisioned resources; optimize
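
For illustration, a boto3 sketch of an alarm on ALB target 5xx errors wired to an SNS topic (resource names, ARNs, and thresholds are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when target 5xx responses exceed 10/minute for 5 consecutive minutes.
cloudwatch.put_metric_alarm(
    AlarmName="orders-api-5xx-rate",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/orders-alb/abc123"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)
```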

AWS X-Ray

When to use:

  • Distributed tracing across microservices
  • Identify latency bottlenecks
  • Understand service dependencies

Cost Model:

  • $5.00 per 1M recorded traces
  • $0.50 per 1M retrieved traces

Well-Architected Mapping:

  • Operational Excellence: Trace requests end-to-end; identify latency
  • Reliability: Understand failure propagation

Amazon Managed Prometheus / Grafana

When to use:

  • Kubernetes metrics (via Prometheus scraping)
  • Long-term metric storage
  • Custom visualization (Grafana)

Cost Model:

  • Prometheus: $0.90 per 1M ingested samples
  • Grafana: $9.00 per active editor/admin user per month ($5.00 per viewer)

Well-Architected Mapping:

  • Operational Excellence: Kubernetes-native monitoring
  • Cost: For EKS workloads


Section 4: Architecture Pattern Library (Domain-Independent)

Every system, regardless of domain, is composed of these reusable patterns.

4.1 CRUD Backend

Definition: Create, Read, Update, Delete operations on a data model.

Pattern:

Client (mobile, web, API) → API Gateway → Lambda → RDS/DynamoDB

When to use: Always (fundamental pattern)

Well-Architected:

  • Security: IAM roles; request validation; encryption
  • Reliability: Error handling; retry logic; transaction handling
  • Performance: Caching (ElastiCache); query optimization; connection pooling

Cost Optimization:

  • Use DynamoDB for variable load
  • RDS with read replicas for read-heavy
  • Cache frequently accessed data
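
A minimal Lambda (Python) handler sketch for this pattern behind an API Gateway proxy integration, assuming a hypothetical DynamoDB table with an id partition key:

```python
import json
import os
import uuid

import boto3

# Hypothetical table name supplied via environment; the execution role
# needs dynamodb:PutItem and dynamodb:GetItem on it.
TABLE = boto3.resource("dynamodb").Table(os.environ.get("TABLE_NAME", "items"))

def handler(event, context):
    """Minimal API Gateway proxy handler covering Create and Read."""
    method = event.get("httpMethod", "GET")
    if method == "POST":
        item = json.loads(event["body"])
        item["id"] = str(uuid.uuid4())
        TABLE.put_item(Item=item)
        return {"statusCode": 201, "body": json.dumps({"id": item["id"]})}
    item_id = event["pathParameters"]["id"]
    resp = TABLE.get_item(Key={"id": item_id})
    if "Item" not in resp:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 200, "body": json.dumps(resp["Item"], default=str)}
```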

4.2 Event-Driven Orchestration

Definition: Components communicate via asynchronous events; central coordinator (Step Functions) directs workflow.

Pattern:

Trigger → Step Functions → [Lambda1, ECS2, Batch3, Lambda4] → EventBridge → Notifications
                                      ↓
                           DynamoDB (state storage)

When to use: Multi-step business workflows (order processing, loan approvals, ML pipelines)

Well-Architected:

  • Reliability: Automatic retries; compensation logic; DLQ for failures
  • Operational Excellence: CloudWatch logs per step; alerts on failure
  • Cost: Pay per state transition; extremely cheap

4.3 Saga Pattern (Distributed Transactions)

Definition: Long-running transaction across multiple services; compensating transactions for rollback.

Pattern:

Service A (order) → EventBridge → Service B (payment) → EventBridge → Service C (fulfillment)
                                          ↓
                                    If payment fails:
                                    Compensate Service A (cancel order)

When to use: Multi-service transactions without distributed locks

Well-Architected:

  • Reliability: Compensation logic; idempotent operations
  • Operational Excellence: Audit trail (events); replayable

4.4 Fan-Out / Fan-In

Definition: One event triggers multiple parallel processors; results aggregated.

Pattern (fan-out):

SNS/EventBridge → Lambda1 (send email)
                → Lambda2 (update metrics)
                → Lambda3 (trigger analytics)

Pattern (fan-in):

Lambda1 \
Lambda2  → Step Functions → Lambda4 (aggregate results)
Lambda3 /

When to use: Parallel processing; decoupled consumers

Well-Architected:

  • Reliability: Fan-out branches fail independently; fan-in aggregates results
  • Performance: Parallel execution reduces latency

4.5 Data Lakehouse (Medallion Architecture)

Definition: Multi-tier data organization (bronze → silver → gold)

Pattern:

Source (API, DB) → S3 Bronze (raw) → Glue (transform) → S3 Silver (cleaned)
                                                             ↓
                                                      Lake Formation (govern)
                                                             ↓
                                                    S3 Gold (curated)
                                                             ↓
                                            Redshift/Athena/QuickSight

When to use: Centralized data platform with multi-consumer access

Well-Architected:

  • Operational Excellence: Glue Data Catalog; Lake Formation permissions
  • Security: Encryption; access control; audit logging
  • Cost: S3 + Athena for ad-hoc; Redshift for BI; Glacier for archive

4.6 CQRS (Command Query Responsibility Segregation)

Definition: Separate read and write models; write commands to event store; query from read-optimized views.

Pattern:

Write Path:             Read Path:
Command → Aggregate → Event Store → Denormalize → Read View
                                                        ↓
                                                    Query (fast, optimized)

When to use: Complex domain logic; audit requirements; multiple read views

Well-Architected:

  • Reliability: Event sourcing provides audit trail; replay for recovery
  • Performance: Read model optimized for queries
  • Cost: DynamoDB for event store + read model; cheap at scale

4.7 Command Pipeline (Batch Job Chain)

Definition: Sequential batch jobs; each transforms data and passes to next

Pattern:

Schedule → Glue1 (extract) → S3 (temp) → Glue2 (transform) → S3 (output) → Redshift (load)
                                                   ↓
                                            Step Functions (orchestration)

When to use: ETL pipelines; scheduled data processing

Well-Architected:

  • Operational Excellence: Step Functions for orchestration; error handling
  • Cost: Glue on-demand for variable workloads; schedule off-peak
  • Reliability: Checkpointing for restartable jobs; DLQ for failed steps

4.8 Agent-Tool Execution (Agentic AI)

Definition: AI agent reasons over user query; selects and executes tools; iterates until complete.

Pattern:

User Query → Bedrock Agent (reasoning) → Tool Selection
                                             ├── Lambda1 (query DB)
                                             ├── Lambda2 (call API)
                                             ├── DynamoDB (read state)
                                             └── OpenSearch (semantic search)
                                                     ↓
                                            Feedback loop (iterate if needed)
                                                     ↓
                                            Final response to user

When to use: Autonomous automation; conversational AI; decision support

Well-Architected:

  • Security: Tool access control; query validation; PII redaction
  • Reliability: Fallback tools; error recovery
  • Observability: Trace tool calls; audit agent decisions

4.9 Streaming Ingestion (Real-Time Data Pipeline)

Definition: Continuous data ingestion; real-time processing; multiple sinks.

Pattern:

Sensors/APIs → Kinesis → Analytics (windowed agg) → DynamoDB (current state)
                                 ↓
                            Lambda (enrichment)
                                 ↓
                    Multiple sinks:
                    ├── DynamoDB (dashboard)
                    ├── S3 (cold storage)
                    ├── OpenSearch (search/alerting)
                    └── SNS (alerts)

When to use: Real-time analytics; alerting; monitoring

Well-Architected:

  • Performance: Partition by shard; parallel processing
  • Cost: On-demand Kinesis for variable; provisioned for baseline
  • Reliability: Consumer checkpointing; DLQ for failed events

4.10 Multi-Region Active-Active

Definition: Same workload deployed in multiple regions; users routed to nearest; data replicated.

Pattern:

Global User → Route 53 (geolocation routing) → Region 1 (ALB → ECS → RDS Aurora Global)
                                              → Region 2 (ALB → ECS → RDS Aurora Global)
                                              → Region 3 (ALB → ECS → RDS Aurora Global)

When to use: Global applications; disaster recovery; low-latency for globally distributed users

Well-Architected:

  • Reliability: Automatic failover; regional disaster recovery
  • Performance: Users served from nearest region
  • Cost: 3x infrastructure cost; justified for critical, global workloads
  • Sustainability: Distributed load


Section 5: AWS Well-Architected Integration (Mandatory)

Every architectural decision must align with the six pillars of the AWS Well-Architected Framework. This section maps decisions to pillars and provides measurable indicators.

Pillar Alignment Template

For every major decision (service choice, architecture pattern, data design), answer:

  1. Operational Excellence: Can teams operate this? Is it observable? Are procedures clear?
  2. Security: Are data and access protected? Is compliance addressed?
  3. Reliability: Can it recover from failure? What is RTO/RPO?
  4. Performance Efficiency: Does it meet latency/throughput targets? Is it optimized?
  5. Cost Optimization: Is it cost-effective? Are there cheaper alternatives?
  6. Sustainability: Does it minimize energy/carbon? Is it resource-efficient?

5.1 Operational Excellence Pillar

Design Principles:

  • Organize teams around business outcomes
  • Implement observability for actionable insights
  • Safely automate where possible
  • Make frequent, small, reversible changes
  • Refine operations procedures frequently
  • Anticipate failure
  • Learn from operational events
  • Use managed services

Key Questions:

  • Are teams organized to own their systems end-to-end?
  • Is every component observable (metrics, logs, traces)?
  • Are deployments automated and safe (canary, blue-green)?
  • Can operators quickly diagnose and respond to issues?
  • Are runbooks and procedures documented and tested?

Best Practices by Service:

| Service | Observability | Automation | Disaster Response |
|---|---|---|---|
| Lambda | CloudWatch Logs, X-Ray | SAM, CDK, CodePipeline | DLQ, retries, reserved concurrency |
| ECS | CloudWatch metrics, Container Insights | CodeDeploy, ECS task placement | Service health checks, auto-restart |
| RDS | Enhanced monitoring, Performance Insights, CloudWatch | AWS Config, automated backups | Multi-AZ failover, read replicas |
| DynamoDB | CloudWatch metrics, X-Ray tracing, TTL monitoring | Point-in-time recovery, backup service | Global tables, on-demand scaling |
| S3 | Access logging, CloudTrail, CloudWatch metrics | Lifecycle policies, inventory, replication | Versioning, cross-region replication, Glacier |
| Kinesis | Iterator age, consumer lag, CloudWatch metrics | Auto-scaling shards, Lambda event mapping | Shard backup via S3, 24-hour retention |

Measurable Indicators:

  • MTTR (Mean Time To Recovery): Target < 15 min for P1 incidents
  • Change failure rate: < 10% of deployments cause incidents
  • Deployment frequency: Daily or more
  • On-call fatigue: < 1 page/week per engineer

Anti-Patterns:

  • ❌ Manual deployments; no version control
  • ❌ No monitoring; discovering issues from customers
  • ❌ Monolithic applications; all-or-nothing deployments
  • ❌ Undocumented procedures; tribal knowledge

5.2 Security Pillar

Design Principles:

  • Implement strong identity foundation (least privilege)
  • Maintain traceability (audit all actions)
  • Protect data in transit and at-rest
  • Detect and investigate security events
  • Protect infrastructure
  • Prepare for security events

Key Questions:

  • Are all principals (users, roles, services) authenticated?
  • Are permissions least-privilege (does role have exactly needed permissions)?
  • Is all data encrypted (transit TLS, at-rest KMS)?
  • Is access logged and audited?
  • Are data classification and sensitivity levels defined?
  • Are compliance requirements met (HIPAA, PCI, GDPR)?

Best Practices by Tier:

| Layer | Best Practice | Implementation |
|---|---|---|
| Identity & Access | Least privilege, role-based access | IAM policies, assume roles across accounts, MFA for console |
| Infrastructure | Network isolation, security groups | VPC, security groups (stateful), NACLs (stateless), WAF for web |
| Data Protection | Encrypt all data | KMS at-rest (S3, RDS, EBS), TLS in-transit, Secrets Manager for credentials |
| Detection | Monitor and alert on anomalies | CloudTrail (API audit), GuardDuty (threats), Config (compliance), Security Hub (aggregation) |
| Incident Response | Automated and manual playbooks | CloudWatch alarms → SNS → Lambda/email; IAM roles for responders |

Measurable Indicators:

  • 100% of data encrypted at-rest and in-transit
  • All API calls logged to CloudTrail
  • Zero exposed credentials (Secrets Manager, no hardcoding)
  • Compliance audit pass rate: 100% on critical controls
  • Incident detection time: < 5 min for automated, < 1 hour for manual

Anti-Patterns:

  • ❌ Hardcoded credentials in code/config
  • ❌ Public S3 buckets (unless intentional)
  • ❌ Over-permissive IAM roles (e.g., AdministratorAccess)
  • ❌ No encryption; no audit logging
  • ❌ Manual credential rotation

5.3 Reliability Pillar

Design Principles:

  • Automatically recover from failure
  • Test failure scenarios
  • Stop guessing capacity
  • Manage change via automation

Key Questions:

  • Can the system recover automatically from failures?
  • Have failure scenarios been tested (chaos engineering)?
  • Is capacity provisioned to handle peaks without manual intervention?
  • Are changes made safely (automated rollback)?

Best Practices by Scenario:

| Scenario | Strategy | Implementation |
|---|---|---|
| Service Failure | Auto-restart, circuit breaker | ECS health checks, ALB target deregistration, Step Functions retries |
| Database Failure | Multi-AZ failover, read replicas | Aurora Multi-AZ, RDS Multi-AZ, DynamoDB auto-replication |
| Data Loss | Backups, point-in-time recovery | RDS automated backups, S3 versioning, DynamoDB PITR, DMS (CDC) |
| Region Failure | Disaster recovery, multi-region | Cross-region replication (S3, RDS, DynamoDB global tables), Route 53 failover |
| Capacity Overload | Auto-scaling, circuit breakers | ASG, Lambda concurrency, SQS queue buffering, Step Functions error handling |

RTO & RPO by Workload:

| Workload Criticality | RTO | RPO | Strategy |
|---|---|---|---|
| Critical (financial, healthcare) | < 1 hour | Zero data loss | Multi-AZ + multi-region, synchronous replication, event sourcing |
| High (revenue-impacting) | 1-4 hours | < 1 hour | Multi-AZ, async replication, hourly backups |
| Medium (operational) | 4-24 hours | < 1 day | Single AZ, daily snapshots |
| Low (dev/test) | > 1 day | Not applicable | Backups acceptable; manual recovery OK |

Measurable Indicators:

  • Availability: 99.9% for critical, 99% for high
  • MTTR: < 5 min for automated recovery
  • Unplanned downtime: < 43 min/month for 99.9%
  • Recovery test success: 100% of DR scenarios annually
  • Change rollback success: < 5 min

Anti-Patterns:

  • ❌ Single points of failure (single AZ, single instance)
  • ❌ No backups or unverified recovery
  • ❌ Manual scaling (capacity runs out during spikes)
  • ❌ No circuit breakers (cascading failures)
  • ❌ Changes without automated rollback

5.4 Performance Efficiency Pillar

Design Principles:

  • Democratize advanced technologies
  • Go global in minutes (CloudFront)
  • Use serverless for variable workloads
  • Experiment often
  • Mechanical sympathy (align tech to workload)

Key Questions:

  • Does the system meet latency targets?
  • Is throughput optimized for the workload?
  • Are expensive operations (queries, compute) optimized?
  • Is caching used where beneficial?

Optimization by Service:

| Service | Key Optimizations | Measurements |
|---|---|---|
| API Gateway | Caching, throttling, request validation | p99 latency < 200ms |
| Lambda | Provisioned concurrency, memory tuning, connection reuse | Cold start < 100ms, warm < 50ms |
| RDS/Aurora | Read replicas, indexes, query optimization, connection pooling | Query p95 < 100ms, throughput per core |
| DynamoDB | Partition key design, GSI, DAX cache, batch operations | p99 < 10ms, no hot partitions |
| S3 | Multipart upload, batch operations, Transfer Acceleration, Intelligent-Tiering | Upload throughput, object retrieval latency |
| CloudFront | Origin Shield, cache headers, compression, HTTP/2 | p99 latency < 100ms globally |

Performance by Workload:

| Workload Type | Latency Target | Optimization |
|---|---|---|
| Synchronous (API) | p99 < 200ms | Caching, query optimization, parallel requests |
| Real-time | p99 < 100ms | Local cache (ElastiCache), connection reuse, batch operations |
| Batch | Throughput optimized | Parallel processing, partitioning, appropriate instance size |
| Streaming | Sub-second processing | Partition key design, Kinesis shards, parallel Lambda invocation |

Measurable Indicators:

  • p99 latency: < target threshold
  • Throughput: Scales linearly with resources (no bottlenecks)
  • Resource utilization: 60-80% for optimal cost/performance

Anti-Patterns:

  • ❌ N+1 queries (repeated DB calls in loops)
  • ❌ No caching (repeated expensive operations)
  • ❌ Synchronous processing (blocking calls)
  • ❌ Single-threaded/single-shard processing (can't parallelize)
  • ❌ Unoptimized queries (full table scans)

5.5 Cost Optimization Pillar

Design Principles:

  • Implement Cloud Financial Management
  • Measure and attribute expenditure
  • Stop paying for under-utilized resources
  • Analyze and optimize over time
  • Use managed services to reduce operational cost

Key Questions:

  • Is the current spend justified by business value?
  • Are there cheaper service alternatives?
  • Are discounts being used (Reserved Instances, Savings Plans)?
  • Is capacity right-sized?
  • Are unused resources being cleaned up?

Cost Optimization by Service:

| Service | Cost Driver | Optimization Strategy |
|---|---|---|
| Lambda | Requests + GB-seconds | On-demand for variable; consolidate functions if baseline traffic |
| ECS | EC2 instance hours | Fargate for variable; EC2 + Savings Plans for stable |
| RDS | Instance-hours + storage | Right-size instances; use Reserved Instances (1-yr: ~31% off, 3-yr: ~43% off); read replicas for read-heavy |
| DynamoDB | Provisioned RCU/WCU or on-demand | On-demand for unpredictable; provisioned for baseline; consider reserved capacity |
| S3 | Storage + requests + transfer | Intelligent-Tiering for unknown access; Glacier for archive; delete unused data |
| Redshift | Instance-hours + storage | Reserved instances; use Spectrum for external data; pause during low traffic |
| Kinesis | Shard-hours or GB ingested | On-demand for variable; provisioned for baseline; batch and compress |

Pricing Model Selection:

| Workload Pattern | Recommended | Cost Savings |
|---|---|---|
| Predictable, high utilization | Reserved Instances (3-year) | 65% off on-demand |
| Predictable, multi-service baseline | Savings Plans (3-year) | 60-65% off on-demand; flexible across services |
| Variable, bursty | On-demand or serverless | No commitment; higher per-unit cost |
| Batch, interruptible | Spot instances | 70% off on-demand |

Cost Monitoring & Attribution:

  • Cost tags per application, team, cost center
  • Budget alerts in Cost Explorer
  • Savings Plans recommendations (automated)
  • Trusted Advisor for cost optimization opportunities

Measurable Indicators:

  • Cost per transaction: Trending down quarter-over-quarter
  • Utilization: EC2 CPU 60-80%; under-utilized instances identified and removed
  • Discount penetration: > 70% of compute cost on Reserved/Savings Plans
  • Monthly savings from optimization: Documented and tracked

Anti-Patterns:

  • ❌ Large reserved instance commitment for unproven workloads
  • ❌ Oversized instances (paying for unused capacity)
  • ❌ Running dev/test environments 24/7
  • ❌ Keeping data in expensive storage (not using Intelligent-Tiering)
  • ❌ No cost allocation or visibility

5.6 Sustainability Pillar

Design Principles:

  • Understand your sustainability impact
  • Establish goals and measure impact
  • Maximize utilization (reduce waste)
  • Adopt efficient hardware and architecture
  • Use managed services (AWS optimizes for efficiency)
  • Reduce downstream impact (minimize data transfer)

Key Sustainability Decisions:

| Decision | High-Impact Option | Impact |
|---|---|---|
| Compute selection | Serverless + managed services | No idle infrastructure; AWS amortizes overhead |
| Regional placement | Use AWS Regions with renewable energy | Check AWS Sustainability Report; PPA-backed regions |
| Data storage | S3 Intelligent-Tiering → Glacier | Reduces storage footprint; archive old data |
| Instance types | Graviton, Trainium (AWS-built chips) | Higher energy efficiency than x86 |
| Architecture | Batch processing, off-peak scheduling | Consolidate; avoid running 24/7 if not needed |
| Data transfer | Minimize inter-region/public data transfer | CloudFront for global distribution; VPC Endpoints for internal |

Measurable Indicators:

  • Kilograms CO2 per transaction: Trending down
  • Instances with < 20% utilization: Identified and consolidated
  • Data stored in cold tiers (Glacier): Percentage of total
  • Energy efficiency score (AWS Carbon Intelligence): Tracking vs. industry baseline

Anti-Patterns:

  • ❌ Running compute 24/7 (especially development/test)
  • ❌ No data lifecycle policies (keeping hot storage forever)
  • ❌ Using x86 instances when Graviton available
  • ❌ Not considering region sustainability impact

Section 6: Cost & Scale Modeling Framework

6.1 Fixed vs. Variable Cost Analysis

Fixed Cost (per month, regardless of usage):

  • NAT Gateway: $32/month per AZ
  • ALB: ~$16/month base (plus LCU usage)
  • Redshift cluster: from ~$0.25/hour per node (single dc2.large ≈ $180/month)
  • RDS instance: $50-$1000+/month (instance-based)

Variable Cost (per unit of usage):

  • Lambda: $0.0000002 per request + $0.0000166667 per GB-second
  • SQS: $0.40 per 1M requests
  • S3: $0.023 per GB/month + $0.0004 per 1k requests
  • Data transfer out: $0.09 per GB

6.2 Cost at Different Scales

Scenario: Build an API backend

Low Scale (100 req/sec, < 1 TB data):

  • Lambda: ~$52/month (≈260M requests) + ~$54/month (compute at 128 MB × 100 ms) ≈ $106/month ✓ Cheapest
  • ECS Fargate: ~$36/month per always-on task (1 vCPU, 2 GB); a small HA service (2 tasks + ALB + NAT) ≈ $150/month
  • RDS t3.micro: $17/month

Medium Scale (1000 req/sec, 100 GB data):

  • Lambda + DynamoDB: $40 (requests) + $1000 (compute) + $50 (DB) = $1090
  • ECS + RDS: $1400 (compute) + $200 (RDS t3.small) = $1600
  • Winner: Lambda still cheaper (less infrastructure overhead)

High Scale (10,000 req/sec, 1 TB data):

  • Lambda + DynamoDB: $400 + $10,000 + $500 = $10,900
  • ECS + RDS Aurora: $4000 (compute) + $1000 (Aurora) + $1000 (storage) = $6000 ✓ Cheaper
  • Winner: ECS + provisioned services (fixed cost amortized over high traffic)

6.3 Break-Even Analysis

Lambda vs. ECS (Fargate) Example (approximate us-east-1 prices):

  • Lambda: $0.0000002 per request + $0.0000166667 per GB-second

  • For a 1 GB Lambda with 100 ms execution:

    • Cost per request: $0.0000002 + (1 GB × $0.0000166667 × 0.1 s) ≈ $0.00000187
    • 1M requests ≈ $1.87
  • Fargate (1 vCPU, 2 GB): $0.04048 + 2 × $0.004445 ≈ $0.0494/hour ≈ $36/month per always-on task

  • At 100 ms per request, one 1-vCPU task handles ~10 req/sec = 864k req/day ≈ 26M req/month

  • Cost per request at full utilization: $36 / 26M ≈ $0.0000014

Breakeven: roughly 7 sustained requests/sec per vCPU-sized task (~70% utilization). Below that, Lambda's pay-per-use wins; above it, the always-on container wins, and the gap grows with fleet size. The exact crossover shifts with memory size, execution time, and how spiky the traffic is.
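
The same arithmetic as a small Python sketch, using the approximate unit prices above:

```python
# Reproduces the break-even arithmetic above (approximate us-east-1 prices).
LAMBDA_REQUEST = 0.0000002          # $ per request
LAMBDA_GB_SECOND = 0.0000166667     # $ per GB-second
FARGATE_VCPU_HOUR = 0.04048         # $ per vCPU-hour
FARGATE_GB_HOUR = 0.004445          # $ per GB of memory per hour

# Per-request Lambda cost: 1 GB memory, 100 ms execution.
lambda_per_request = LAMBDA_REQUEST + 1.0 * LAMBDA_GB_SECOND * 0.1

# Always-on Fargate task: 1 vCPU, 2 GB.
fargate_per_hour = FARGATE_VCPU_HOUR + 2 * FARGATE_GB_HOUR
fargate_per_month = fargate_per_hour * 730

# Sustained request rate at which Lambda spend equals one always-on task.
break_even_rps = fargate_per_hour / (lambda_per_request * 3600)

print(f"Lambda per 1M requests: ${lambda_per_request * 1e6:,.2f}")
print(f"Fargate task per month: ${fargate_per_month:,.2f}")
print(f"Break-even: ~{break_even_rps:.1f} sustained req/sec per task")
```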

6.4 Data Gravity Analysis

Question: Should data stay in one region or be replicated?

Factors:

  • Data transfer cost out: $0.09 per GB (public internet)
  • S3 cross-region replication: no extra replication fee, but you pay inter-region transfer plus storage in the destination region
  • RDS Multi-AZ: replication traffic is free, but the standby roughly doubles instance cost
  • RDS cross-region: Paid data transfer ($0.02/GB)

Decision Matrix:

  • Single region, high data volume (100 TB+) and distributed users: CloudFront for distribution (caching layer)
  • Multi-region for disaster recovery: Use cross-region replication; amortize cost over disaster scenarios
  • Multi-region active-active: High cost; only for mission-critical global workloads

6.5 Cost Optimization Levers

Ordered by impact:

  1. Right-sizing (biggest impact): Reduce instance size; use Intelligent-Tiering

    • Potential savings: 30-50% if currently over-provisioned
  2. Commitment discounts (Savings Plans, Reserved Instances): 40-65% off

    • Potential savings: 40-65% of compute cost
    • Requirement: 70%+ predictable baseline
  3. Reserved capacity (DynamoDB, Redshift): 50%+ off

    • Potential savings: 50% of baseline database cost
    • Requirement: Stable, known baseline
  4. Spot instances (for interruptible workloads): 70% off

    • Potential savings: 70% of batch compute cost
    • Trade-off: Interruption risk acceptable
  5. Architecture changes (Lambda vs ECS, S3 Intelligent-Tiering): 20-50% off

    • Potential savings: Varies by workload; biggest impact for right-sizing to service tier
  6. Data transfer optimization: 20-30% off data costs

    • Use CloudFront for global distribution; keep data local; compress

Section 7: Evolution & Change Strategy

7.1 MVP → Scale

MVP Phase (0-6 months, 1-100 users):

  • Use serverless (Lambda, Fargate, DynamoDB on-demand)
  • Minimize operational burden
  • Fast iteration; don't optimize prematurely
  • Architecture: API Gateway → Lambda → DynamoDB
  • Cost: ~$50-500/month

Scale Phase (6+ months, 1k-1M users):

  • Identify bottlenecks; migrate to provisioned services as needed
  • Add caching (CloudFront, ElastiCache)
  • Optimize costs: Reserved Instances, Savings Plans
  • Architecture: API Gateway → ECS/Lambda → RDS Aurora with read replicas → CloudFront
  • Cost: $1k-10k/month

Mature Phase (2+ years, 1M+ users):

  • Multi-region active-active for resilience
  • Advanced caching and CDN
  • Dedicated infrastructure; right-sized instances
  • Cost: $10k+/month

7.2 Monolith → Microservices

Phase 1: Decompose (0-3 months)

  • Identify service boundaries (domain-driven design)
  • Build event-driven orchestration (EventBridge, Step Functions, SNS/SQS)
  • Keep monolith and microservices running in parallel

Phase 2: Strangle Fig (3-12 months)

  • Gradually route traffic to microservices
  • Retire monolith modules as traffic shifts
  • Maintain backward compatibility

Phase 3: Mature (12+ months)

  • All traffic on microservices
  • Optimize service communication (caching, circuit breakers)
  • Consider service mesh (Istio) if > 20 services

7.3 Single-Region → Multi-Region

Phase 1: Failover (0-3 months)

  • Set up cross-region backup/restore
  • Automate backup and restore testing
  • RTO: hours; manual failover

Phase 2: Active-Passive (3-6 months)

  • Set up read replicas in secondary region
  • Automate failover via Route 53 health checks
  • RTO: < 5 min; automatic

Phase 3: Active-Active (6-12 months)

  • Data replicated bidirectionally (Aurora Global, DynamoDB Global Tables)
  • Load balanced across regions (Route 53 geolocation)
  • RTO: < 1 min; automatic; full active workload in both regions

7.4 Manual → Fully Automated

Infrastructure as Code (Week 0-2):

  • CloudFormation, Terraform, or CDK
  • Version-control infrastructure
  • Enable reproducible deployments
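
A minimal AWS CDK (Python, v2) sketch of defining a versioned, encrypted bucket as code; the stack and construct IDs are illustrative:

```python
from aws_cdk import App, RemovalPolicy, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # Versioned, encrypted bucket defined as code and tracked in version control.
        s3.Bucket(
            self,
            "RawDataBucket",
            versioned=True,
            encryption=s3.BucketEncryption.S3_MANAGED,
            removal_policy=RemovalPolicy.RETAIN,
        )

app = App()
DataStack(app, "data-stack")
app.synth()
```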

CI/CD Pipeline (Week 2-4):

  • GitHub/CodeCommit → CodeBuild → CodeDeploy
  • Automated tests before deployment
  • Blue-green or canary deployments

Operations Automation (Week 4-8):

  • CloudWatch alarms → SNS → Lambda (auto-remediation)
  • Patch automation (SSM Patch Manager)
  • Infrastructure health monitoring (Config, GuardDuty)

Observability (Week 8-12):

  • CloudWatch Logs Insights for querying
  • X-Ray for distributed tracing
  • Custom dashboards; alerts on SLO breaches

7.5 Static Systems → AI-Driven

Phase 1: Monitoring & Insights (0-3 months)

  • Collect metrics, logs, traces
  • Identify anomalies (CloudWatch Anomaly Detection)
  • Manual decision-making based on data

Phase 2: Basic Automation (3-6 months)

  • Auto-scaling rules based on metrics
  • CloudWatch alarms trigger Lambda for remediation
  • Chatbots for simple queries (Lex)

Phase 3: Intelligent Decision-Making (6-12 months)

  • ML models predict optimal resource allocation
  • Optimization recommendations (Cost Optimization Hub)
  • Agentic AI for complex decision workflows

Phase 4: Autonomous Operations (12+ months)

  • Agents autonomously execute remediation
  • Human-in-the-loop for high-impact decisions
  • Continuous learning from outcomes

Section 8: Decision Playbooks & Checklists

8.1 Universal Architecture Decision Checklist

Phase 1: Requirements Gathering (Week 1)

  • Define business intent and success metrics
  • Identify users, actors, and access patterns
  • Classify data (volume, velocity, variety, sensitivity)
  • Determine workload type (sync/async, batch/streaming)
  • Estimate traffic patterns and scale requirements
  • Define availability targets (RTO, RPO)
  • Document security and compliance needs
  • Set cost constraints
  • Assess team capability and operational tolerance
  • Plan for extensibility and evolution

Output: Requirement document (1-2 pages)


Phase 2: Service Selection (Week 2)

Compute:

  • Lambda, ECS, EKS, EC2, or Batch?
  • Decision factor: Scale, latency, operational burden
  • Cost analysis: Provisioned vs. on-demand

Storage:

  • S3 (objects), EBS (block), EFS (file), RDS/DynamoDB (DB)?
  • Durability and availability targets met?
  • Encryption, compliance requirements met?

Database:

  • RDS (relational), DynamoDB (key-value), Redshift (warehouse), OpenSearch (search)?
  • Access patterns validated against service?
  • Scaling behavior acceptable?

Integration:

  • Lambda functions, API Gateway, EventBridge, SQS/SNS, Kinesis, Step Functions?
  • Coupling/decoupling adequate?
  • Error handling strategy defined?

Output: Service decision matrix (1 page)


Phase 3: Architecture Design (Week 3)

  • Draw high-level architecture (compute → storage → database)
  • Identify data flows (sync vs. async)
  • Map to architecture patterns (CRUD, event-driven, streaming, etc.)
  • Define failure scenarios and recovery strategy
  • Calculate cost at baseline, peak, and 2x peak scale
  • Identify cost optimization opportunities
  • Review against Well-Architected pillars

Output: Architecture diagram + Well-Architected scorecard


Phase 4: Implementation & Deployment (Weeks 4-6)

  • Code infrastructure (CloudFormation, Terraform, CDK)
  • Implement logging, monitoring, alerting (CloudWatch, X-Ray)
  • Set up CI/CD pipeline (CodePipeline, CodeBuild, CodeDeploy)
  • Write runbooks for operational procedures
  • Test failure scenarios (chaos engineering)
  • Performance test at 2x expected peak load
  • Security audit and penetration testing
  • Compliance validation (HIPAA, PCI, etc. if applicable)

Output: Deployment checklist, runbooks, test results


Phase 5: Go-Live (Week 7)

  • Production readiness review (deployment, security, operations)
  • Gradual rollout (5% → 25% → 50% → 100% traffic)
  • Monitor golden signals (latency, error rate, throughput)
  • Alert thresholds defined and tested
  • Incident response team briefed and on-call
  • Backup/disaster recovery procedures verified

Output: Incident response playbook, monitoring dashboard


Phase 6: Optimization & Learning (Week 8+)

  • Review cost monthly; identify optimizations
  • Analyze performance metrics; optimize hot paths
  • Conduct retrospectives on incidents
  • Update architecture based on learnings
  • Refine automation and runbooks

Output: Quarterly optimization report


8.2 Service Selection Flowchart

Compute Decision Tree:

"How long does the job run?"
├─ < 15 min
│  └─ "Variable or bursty load?"
│     ├─ Yes → Lambda
│     └─ No → "Latency critical?"
│        ├─ Yes → ECS (Fargate, warm)
│        └─ No → EC2 (if always-on cheaper)
├─ 15 min - 1 hour
│  └─ Batch (for batch jobs) or ECS (for services)
└─ > 1 hour
   └─ "Distributed processing needed?"
      ├─ Yes → EMR (Spark/Hadoop)
      └─ No → EC2, ECS, or SageMaker

Database Decision Tree:

"What type of queries?"
├─ SQL, complex joins, transactions
│  └─ RDS (MySQL, PostgreSQL) or Aurora (high throughput)
├─ Key-value, documents, real-time
│  └─ DynamoDB (provisioned for baseline, on-demand for variable)
├─ Large-scale analytics, BI
│  └─ Redshift (petabyte-scale OLAP)
├─ Full-text search, time-series logs
│  └─ OpenSearch
└─ Graph, time-series, other
   └─ Neptune (graph), Timestream (time-series), Keyspaces (Cassandra)

8.3 Failure Scenario Modeling

Template: For each critical component, document:

  1. Component: e.g., RDS database, API Gateway, Lambda function
  2. Failure Mode: e.g., instance crash, network partition, application error
  3. Impact: e.g., "500 errors for 5 min, 1000 failed requests"
  4. Detection: e.g., "CloudWatch alarm: 5xx error rate > 1%"
  5. Recovery Time: e.g., "< 1 min (automatic failover)"
  6. Prevention: e.g., "Multi-AZ, read replicas, automated health checks"

Example Scenarios:

| Component | Failure | Impact | Detection | Recovery | Prevention |
|---|---|---|---|---|---|
| RDS (primary) | Instance crash | DB unavailable | CloudWatch metrics | Auto-failover to standby (~1 min) | Multi-AZ |
| Lambda function | Code error | 500 responses | CloudWatch error metrics, X-Ray | Automatic retry; DLQ for inspection | Unit tests, canary deployment |
| API Gateway | DDoS attack | Request throttling | CloudWatch request count | Auto-scaling, WAF, Shield | WAF rules, rate limiting |
| S3 bucket | Objects accidentally deleted | Data loss | CloudWatch metrics drop | Restore from versioning or backup | Versioning enabled, lifecycle policies |

8.4 Security Threat Modeling

Threat Model Template (STRIDE):

| Threat Category | Threat | Mitigation | Implementation |
|---|---|---|---|
| S - Spoofing | Attacker impersonates API caller | Strong authentication, API key validation | API Gateway API keys, IAM roles, OAuth |
| T - Tampering | Attacker modifies data in transit | Encryption, integrity checks | TLS, HMAC, message signing |
| R - Repudiation | Attacker denies action | Audit logging, immutable records | CloudTrail, DynamoDB event sourcing |
| I - Information Disclosure | Attacker accesses sensitive data | Encryption, access control | KMS, IAM, data classification |
| D - Denial of Service | Attacker floods system | Rate limiting, auto-scaling, WAF | API Gateway throttling, WAF rules, Shield |
| E - Elevation of Privilege | Attacker gains higher access | Least privilege, MFA, role separation | IAM policies, MFA, role assumption logs |

Section 9: Production Deployment Patterns

9.1 Blue-Green Deployment

Definition: Run two identical production environments (blue, green); switch traffic to green after validation.

Benefits & Trade-offs:

  • Zero-downtime deployments
  • Easy rollback (switch back to blue)
  • Thorough testing in the production environment before it receives live traffic
  • Trade-off: 2x infrastructure cost during the (brief) deployment window

Implementation:

Users → ALB → Blue (v1) [current]
           → Green (v2) [standby, being deployed]

Test Green thoroughly; if successful:
ALB → Green (v2) [becomes current]
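
On AWS, the switch itself can be as small as repointing the ALB listener's default action from the blue target group to the green one. A minimal boto3 sketch, assuming the listener and target group ARNs are already known (placeholders below):

import boto3

elbv2 = boto3.client("elbv2")

def switch_to_green(listener_arn: str, green_target_group_arn: str) -> None:
    """Point the listener's default action at the green target group.
    Rolling back is the same call with the blue target group's ARN."""
    elbv2.modify_listener(
        ListenerArn=listener_arn,
        DefaultActions=[{
            "Type": "forward",
            "TargetGroupArn": green_target_group_arn,
        }],
    )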

9.2 Canary Deployment

Definition: Gradually route traffic to the new version; roll back if the error rate exceeds a threshold.

Benefits:

  • Low-risk; easy rollback
  • Real user traffic tests new code
  • Immediate detection of issues
  • Cost: Minimal additional infrastructure

Implementation:

Minute 0: 5% traffic → v2
Minute 5: 25% → v2 (if error rate normal)
Minute 10: 50% → v2
Minute 15: 100% → v2

If the error rate exceeds the threshold at any step: roll back (route 0% of traffic to v2)
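
For a Lambda-based API, one way to drive this schedule is weighted alias routing. The sketch below is illustrative only: it assumes an alias named "live", an already-published new version, and an external error-rate check between steps (not shown).

import time
import boto3

lam = boto3.client("lambda")

def shift_canary(function_name: str, new_version: str, alias: str = "live") -> None:
    for weight in (0.05, 0.25, 0.50):
        lam.update_alias(
            FunctionName=function_name,
            Name=alias,
            RoutingConfig={"AdditionalVersionWeights": {new_version: weight}},
        )
        time.sleep(300)  # observe the error rate for ~5 minutes (check not shown)
        # Rolling back at any step: update_alias with RoutingConfig={} so all
        # traffic stays on the current version.
    # Promote: send 100% of traffic to the new version and clear the split
    lam.update_alias(
        FunctionName=function_name,
        Name=alias,
        FunctionVersion=new_version,
        RoutingConfig={},
    )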

9.3 Feature Flags

Definition: Deploy code without enabling features; toggle features on/off without redeployment.

Benefits:

  • Decouple deployment from release
  • Kill switches for problematic features
  • Gradual feature rollout

Implementation:

if feature_flags.get("new_checkout_flow"):
    ...  # New code path
else:
    ...  # Old code path
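
In practice the flag values usually come from a managed store rather than hard-coded configuration. A hedged sketch using AWS AppConfig feature flags; the application, environment, and profile identifiers, and the flag name, are illustrative assumptions.

import json
import boto3

client = boto3.client("appconfigdata")

# Illustrative identifiers for an AppConfig feature-flag profile
session = client.start_configuration_session(
    ApplicationIdentifier="checkout-app",
    EnvironmentIdentifier="prod",
    ConfigurationProfileIdentifier="feature-flags",
)
response = client.get_latest_configuration(
    ConfigurationToken=session["InitialConfigurationToken"],
)
feature_flags = json.loads(response["Configuration"].read())

if feature_flags.get("new_checkout_flow", {}).get("enabled"):
    ...  # new code path
else:
    ...  # old code path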

Section 10: AWS Service Selection Quick Reference

| Use Case | Primary Service | Alternative | Trade-off |
|---|---|---|---|
| Static website | S3 + CloudFront | API Gateway + Lambda | Simpler (S3) vs. dynamic (Lambda) |
| REST API | API Gateway + Lambda | ECS + ALB | Serverless (Lambda) vs. control (ECS) |
| Microservices | ECS + ALB + SQS/SNS | EKS | Simplicity (ECS) vs. power (EKS) |
| Database (SQL) | RDS Aurora | RDS PostgreSQL | Performance/scale (Aurora) vs. cost (basic RDS) |
| Real-time database | DynamoDB | RDS | Milliseconds (DynamoDB) vs. complex queries (RDS) |
| Data warehouse | Redshift | Athena | Complex queries (Redshift) vs. ad-hoc (Athena) |
| Log analysis | OpenSearch | CloudWatch Logs Insights | Powerful search (OpenSearch) vs. simple (CloudWatch) |
| Batch processing | Glue or Batch | EMR | Simplicity (Glue) vs. control (EMR) |
| Streaming | Kinesis | MSK | AWS-native (Kinesis) vs. portable (MSK) |
| Event routing | EventBridge | SNS + SQS | Rich routing (EventBridge) vs. simple (SNS/SQS) |
| Workflow orchestration | Step Functions | Apache Airflow (MWAA) | Simplicity (Step Functions) vs. power (Airflow) |
| ML model training | SageMaker | EC2 + Jupyter | Managed (SageMaker) vs. DIY (EC2) |
| LLM applications | Bedrock | SageMaker + custom models | Ease (Bedrock) vs. control (SageMaker) |

Conclusion

This universal framework enables architects to:

  1. Decompose any problem into 10 fundamental dimensions
  2. Classify workloads into generic archetypes (request/response, event-driven, streaming, batch, workflows, data platforms, AI/ML, edge, hybrid)
  3. Select AWS services with explicit decision criteria, tradeoffs, cost models, and scaling behavior
  4. Design architectures using proven patterns (CRUD, event-driven, saga, fan-out/fan-in, data lakehouse, CQRS, streaming, multi-region)
  5. Align with Well-Architected pillars (operational excellence, security, reliability, performance, cost, sustainability) with measurable indicators
  6. Model costs and scale across different workload profiles and identify breakeven points
  7. Plan evolution from MVP to scale, monolith to microservices, single-region to multi-region
  8. Execute with confidence using playbooks, checklists, and deployment patterns

This guide is domain-agnostic and applies to financial systems, consumer apps, data platforms, AI/ML systems, real-time systems, batch workloads, legacy migrations, greenfield and brownfield architectures.

Use this as a reference handbook for system design, an enterprise architecture playbook for organizations, a teaching and onboarding reference for cloud teams, and a foundation for automating AWS architectural decisions.


Version: 1.0 (December 2024)
Last Updated: December 15, 2024
Framework Alignment: AWS Well-Architected Framework (June 2024)
