AWS Quick Reference Guide
Rapid decision-making guide with quick reference checklists, decision trees, and cost estimation formulas for AWS architecture.
AWS Universal Architecture Framework - Quick Reference Guide
For rapid decision-making and architecture reviews
Part A: The 10-Dimension Rapid Assessment Checklist
Use this checklist when you have 30 minutes to understand a problem and make initial architectural decisions. A sketch for capturing the ten outputs as a single record follows the checklist.
1. Business Intent (5 min)
- What's the core value? (revenue driver, cost saving, risk mitigation)
- Go-to-market timeline? (MVP weeks, scale months, mature years)
- User base size? (10, 1k, 1M+)
- Regulatory constraints? (HIPAA, PCI-DSS, GDPR, SOX, none)
- Risk tolerance? (experimental, acceptable, mission-critical)
Output: 1-sentence business thesis
2. User & System Actors (3 min)
- Users: How many? Which regions? Which devices/clients?
- Concurrent users during peak? (10, 100, 1k, 10k+)
- API integrations? (B2B partners, internal systems, mobile SDKs)
- Automation integrators? (batch jobs, webhooks, event-driven)
Output: Actor matrix (type, count, geography, concurrency)
3. Data Characteristics (5 min)
- Volume: GB? TB? PB?
- Velocity: Batch (daily/hourly), streaming (continuous), real-time (milliseconds)?
- Variety: SQL, JSON, images, time-series, unstructured?
- Sensitivity: Public, internal, confidential, PII/regulated?
- Retention: Transactional (months), historical (years), archive (indefinite)?
Output: Data classification (volume, velocity, variety, sensitivity)
4. Workload Type (3 min)
- Synchronous? (API calls, user waits)
- Asynchronous? (queues, events, background jobs)
- Batch? (scheduled, high volume)
- Streaming? (continuous, low-latency per event)
- Long-running? (workflows, multi-step processes)
Output: Primary + secondary workload archetypes
5. Traffic & Scale (3 min)
- Baseline requests/sec? (1, 10, 100, 1k, 10k+)
- Peak requests/sec? (2x, 10x, 100x baseline)
- Data transfer? (MB/s, GB/s)
- Burst frequency? (never, daily, hourly, continuous)
- Growth rate? (stable, linear, exponential)
Output: Traffic profile (baseline, peak, growth trajectory)
6. Availability & Durability (3 min)
- RTO (Recovery Time Objective): Hours, minutes, seconds?
- RPO (Recovery Point Objective): Days, hours, zero data loss?
- Criticality: Development, non-critical, critical, mission-critical?
- Failover: Manual, automatic, multi-region?
- Data protection: Snapshots, replicas, event sourcing?
Output: Availability matrix (component, RTO, RPO, strategy)
7. Security & Compliance (3 min)
- Data classification? (public, internal, confidential, PII)
- Compliance frameworks? (HIPAA, PCI-DSS, GDPR, SOX, FedRAMP, none)
- Encryption required? (at-rest, in-transit, both, none)
- Access control model? (role-based, attribute-based, IP-restricted)
- Audit/logging: None, basic, comprehensive?
Output: Security posture summary
8. Cost Sensitivity (2 min)
- Budget: Unconstrained, $100/month, $1k/month, $10k+/month?
- Cost model preference: CapEx, OpEx, or variable?
- Commitment level: PAYG, 1-year, 3-year?
- Cost optimization priority: Low, medium, high?
Output: Cost constraints and optimization targets
9. Operational Complexity (2 min)
- Team maturity: Beginners, intermediate, advanced?
- DevOps/SRE capability: None, basic, strong?
- Operational burden tolerance: Very low, low, moderate?
- Tools and processes: Existing, need to build, greenfield?
Output: Operational profile and staffing implications
10. Extensibility & Evolution (2 min)
- Change frequency: Rarely, quarterly, monthly, weekly?
- Integration needs: None, few, many?
- Architecture path: Monolith OK, need microservices?
- Technology lock-in tolerance: High, medium, low?
Output: Evolution roadmap outline
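The ten "Output" lines above can travel with the design as one lightweight record. A minimal Python sketch (3.9+), assuming nothing beyond the checklist itself; all field names and example values are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class RapidAssessment:
    """One record per 30-minute assessment; each field holds one dimension's 'Output'."""
    business_thesis: str                                            # 1. Business intent
    actor_matrix: list[dict] = field(default_factory=list)          # 2. type, count, geography, concurrency
    data_profile: dict = field(default_factory=dict)                # 3. volume, velocity, variety, sensitivity
    workload_archetypes: list[str] = field(default_factory=list)    # 4. primary + secondary
    traffic_profile: dict = field(default_factory=dict)             # 5. baseline, peak, growth
    availability_matrix: list[dict] = field(default_factory=list)   # 6. component, RTO, RPO, strategy
    security_posture: str = ""                                      # 7. posture summary
    cost_constraints: dict = field(default_factory=dict)            # 8. budget, commitment, optimization priority
    operational_profile: str = ""                                   # 9. team maturity, burden tolerance
    evolution_notes: str = ""                                       # 10. roadmap outline

# Example usage (hypothetical values):
assessment = RapidAssessment(
    business_thesis="Self-serve reporting portal to cut analyst turnaround from days to minutes",
    traffic_profile={"baseline_rps": 50, "peak_rps": 500, "growth": "linear"},
)
```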
Part B: Service Selection Quick Decision Trees
Compute Selection
"How long does work run?" ├─ < 15 min │ └─ "Variable load?" │ ├─ Yes → Lambda (cost optimized) │ └─ No → "Latency critical?" │ ├─ Yes → ECS Fargate │ └─ No → EC2 (if baseline cost justified) ├─ 15 min - 1 hour │ └─ "Batch or service?" │ ├─ Batch → Batch or EMR │ └─ Service → ECS or EKS └─ > 1 hour └─ EMR (distributed) or EC2 (standalone)
Cost Rule of Thumb:
- Lambda: Best for < 100 req/sec, unpredictable
- ECS: Best for 100-1000 req/sec, moderate peaks
- EC2 Reserved: Best for > 1000 req/sec, predictable
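The tree and cost rules above can be encoded as a small helper so the choice is repeatable in design reviews; the same approach works for the database, storage, and integration trees that follow. A sketch only, with thresholds mirroring the rules of thumb rather than hard limits:

```python
def select_compute(task_minutes: float, variable_load: bool,
                   latency_critical: bool, is_batch: bool) -> str:
    """Mirror of the compute decision tree above; returns a starting point, not a mandate."""
    if task_minutes < 15:
        if variable_load:
            return "Lambda (cost optimized for spiky, short tasks)"
        return "ECS Fargate" if latency_critical else "EC2 (if baseline cost justified)"
    if task_minutes <= 60:
        return "AWS Batch or EMR" if is_batch else "ECS or EKS"
    return "EMR (distributed) or EC2 (standalone)"

def compute_rule_of_thumb(req_per_sec: float, predictable: bool) -> str:
    """Rough cost guidance from the rules of thumb above."""
    if req_per_sec < 100 and not predictable:
        return "Lambda"
    if req_per_sec <= 1000:
        return "ECS"
    # Above ~1000 req/sec, reservations pay off only when traffic is predictable.
    return "EC2 Reserved" if predictable else "ECS"

print(select_compute(task_minutes=5, variable_load=True, latency_critical=False, is_batch=False))
# -> Lambda (cost optimized for spiky, short tasks)
```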
Database Selection
"Access pattern?" ├─ Complex queries (joins, aggregations) │ └─ "Scale (TB)?" │ ├─ < 100 GB → RDS │ └─ > 100 GB → Aurora or Redshift ├─ Key-value lookups, real-time │ └─ DynamoDB ├─ Full-text search, logs │ └─ OpenSearch ├─ Time-series metrics │ └─ Timestream └─ Graph relationships └─ Neptune
Cost Rule of Thumb:
- DynamoDB on-demand: Best for variable traffic (0 baseline cost)
- RDS with Savings Plans: Best for predictable SQL workloads
- Redshift: Best for data warehouse (> 1TB, complex BI queries)
Storage Selection
"Data type?" ├─ Objects (files, media, archives) │ └─ S3 ├─ Block storage (EC2 volumes) │ └─ EBS ├─ Shared filesystem │ └─ "Windows required?" │ ├─ Yes → FSx for Windows │ └─ No → EFS └─ Data lake (structured + unstructured) └─ S3 with Lake Formation
Integration Selection
"Communication pattern?" ├─ One-to-many (fanout) │ └─ "Message history needed?" │ ├─ Yes → EventBridge with archive │ └─ No → SNS (simple) ├─ One-to-one (queue) │ └─ "Ordering critical?" │ ├─ Yes → SQS FIFO │ └─ No → SQS Standard ├─ Streaming (continuous, ordered by shard) │ └─ Kinesis ├─ Complex routing (rules, multiple conditions) │ └─ EventBridge └─ Multi-step workflow └─ Step Functions
Part C: Well-Architected Review Checklist (60 min)
Operational Excellence (12 min)
- Are teams organized by business outcome (not technology)?
- Can any team member quickly explain the system?
- Is every component observable? (metrics, logs, traces)
- Are deployments automated and safe? (blue-green, canary)
- Do runbooks exist for common issues?
- MTTR < 15 min for P1 incidents?
- Incident postmortems conducted and improvements tracked?
Scoring: Count checks. 6-7 = Excellent; 4-5 = Good; < 4 = Needs improvement
Security (12 min)
- All principals (users, roles, services) authenticated?
- IAM policies follow least-privilege (no wildcard actions or resources)?
- All data encrypted at-rest (KMS) and in-transit (TLS)?
- All API calls logged to CloudTrail?
- Data classification defined and enforced?
- Secrets (DB passwords, API keys) in Secrets Manager, not code?
- PII/sensitive data protected per regulation?
Scoring: 6-7 = Excellent; 4-5 = Good; < 4 = Needs improvement
Reliability (12 min)
- Can any single service/component fail without total outage?
- Are failure recovery steps tested quarterly?
- RTO/RPO targets defined and met?
- Backups automated and verified?
- Multi-AZ deployment for critical components?
- Circuit breakers and retry logic in place?
- Graceful degradation if capacity exceeded?
Scoring: 6-7 = Excellent; 4-5 = Good; < 4 = Needs improvement
Performance Efficiency (12 min)
- API latency p99 < target (e.g., 200ms)?
- Caching used for expensive operations? (CloudFront, ElastiCache)
- Database queries optimized? (indexes, query plans reviewed)
- Async patterns used where synchronous not required?
- Compute resources right-sized? (not overprovisioned)
- Network optimized? (local processing, VPC Endpoints for AWS services)
- No N+1 queries or polling loops?
Scoring: 6-7 = Excellent; 4-5 = Good; < 4 = Needs improvement
Cost Optimization (12 min)
- Current monthly cost documented and justified?
- Cost per transaction calculated? (trending down?)
- 70%+ of compute cost on Reserved/Savings Plans?
- Right-sizing recommendations from Trusted Advisor implemented?
- Unused resources (unattached volumes, stopped instances, idle databases) cleaned up?
- Data in cheaper tiers? (Glacier for archive, Intelligent-Tiering for unknown access)
- Cost allocation tags applied? (business unit, application, environment; see the query sketch after this pillar)
Scoring: 6-7 = Excellent; 4-5 = Good; < 4 = Needs improvement
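Several of these checks (cost per application, tag coverage, anomalous spend) can be pulled straight from Cost Explorer rather than compiled by hand. A minimal boto3 sketch; the tag key `application` and the date range are assumptions to adapt, and the caller needs `ce:GetCostAndUsage` permission:

```python
import boto3

# Cost Explorer: last month's unblended cost grouped by the 'application' cost allocation tag.
ce = boto3.client("ce")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-11-01", "End": "2024-12-01"},  # assumed reporting window
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "application"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    tag_value = group["Keys"][0]   # e.g. "application$checkout-api"; an empty value means untagged spend
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{tag_value or 'UNTAGGED'}: ${amount:,.2f}")
```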
Sustainability (8 min)
- Managed services used where possible? (no idle infrastructure)
- Auto-scaling configured? (no over-provisioned capacity)
- Data stored in efficient tiers? (lifecycle policies active)
- No unnecessary data copies or inter-region transfers?
- Instance types modern? (Graviton, Trainium considered)
Scoring: 4-5 = Excellent; 2-3 = Good; < 2 = Needs improvement
Overall Score: convert each pillar to a percentage (checks passed ÷ checks available × 100), then average the six pillars: (OE + Security + Reliability + Performance + Cost + Sustainability) / 6
- 90-100: Well-architected; ready for production
- 75-89: Good; address gaps in lower-scoring pillars
- 60-74: Needs improvement; prioritize security, reliability, cost
- < 60: Critical issues; address before production
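A small helper that applies the scoring rule above: each pillar's check count becomes a percentage of its available checks, and the six percentages are averaged. A sketch, not an official scoring tool; the pillar totals match the checklists in this part:

```python
# Checks available per pillar, matching Part C above.
PILLAR_TOTALS = {
    "operational_excellence": 7,
    "security": 7,
    "reliability": 7,
    "performance": 7,
    "cost": 7,
    "sustainability": 5,
}

def overall_score(checks_passed: dict[str, int]) -> float:
    """Average of per-pillar percentages; 90+ is well-architected, <60 is critical (bands above)."""
    percentages = [100.0 * checks_passed[p] / total for p, total in PILLAR_TOTALS.items()]
    return sum(percentages) / len(percentages)

# Example: strong on security and reliability, weak on cost and sustainability (hypothetical review).
score = overall_score({
    "operational_excellence": 5, "security": 7, "reliability": 6,
    "performance": 5, "cost": 3, "sustainability": 2,
})
print(f"{score:.0f}")  # ~69 -> "Needs improvement" (60-74 band)
```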
Part D: Cost Estimation Quick Formulas
Lambda
Cost = (Requests × 0.0000002) + (GB-seconds × 0.0000166667)
Example: 1M requests/month, 1 GB memory, 100 ms average duration
= (1,000,000 × $0.0000002) + (1,000,000 × 0.1 s × 1 GB × $0.0000166667)
= $0.20 + $1.67
= ~$1.87/month
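The same arithmetic as a reusable sketch, using the rates quoted above (ignores the free tier and provisioned concurrency):

```python
def lambda_monthly_cost(requests: int, memory_gb: float, avg_duration_sec: float) -> float:
    """Request charge plus GB-second charge, using the rates above."""
    request_cost = requests * 0.0000002                  # $0.20 per 1M requests
    gb_seconds = requests * avg_duration_sec * memory_gb
    compute_cost = gb_seconds * 0.0000166667             # per GB-second
    return request_cost + compute_cost

print(f"${lambda_monthly_cost(1_000_000, 1.0, 0.1):.2f}")  # ~$1.87 for the example above
```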
DynamoDB
On-Demand Cost = (Write request units ÷ 1M × $1.25) + (Read request units ÷ 1M × $0.25) + Storage
Example: 100k writes, 1M reads/month
= (0.1 × $1.25) + (1 × $0.25)
= $0.125 + $0.25
= ~$0.38/month plus storage (cheap at this volume!)
Provisioned Cost = (WCU × $0.00065/hour × 730 hours) + (RCU × $0.00013/hour × 730 hours) + (Storage GB × $0.25)
Example: 10 WCU, 100 RCU, 100 GB
= (10 × $0.00065 × 730) + (100 × $0.00013 × 730) + (100 × $0.25)
= $4.75 + $9.49 + $25
= ~$39/month (more than on-demand at this low traffic, but far cheaper per request at sustained high utilization)
Breakeven: on-demand usually wins for spiky or low-utilization traffic; provisioned wins once load is steady and keeps the provisioned capacity well utilized. Run both formulas against your actual traffic to find the crossover.
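Both pricing modes as a sketch using the rates above (standard table class, us-east-1; verify current prices, since DynamoDB rates have changed over time):

```python
def dynamodb_on_demand(writes: int, reads: int, storage_gb: float = 0) -> float:
    """On-demand: pay per request unit plus storage."""
    return (writes / 1e6) * 1.25 + (reads / 1e6) * 0.25 + storage_gb * 0.25

def dynamodb_provisioned(wcu: int, rcu: int, storage_gb: float = 0) -> float:
    """Provisioned: pay per capacity-unit-hour plus storage (730 hours/month)."""
    return wcu * 0.00065 * 730 + rcu * 0.00013 * 730 + storage_gb * 0.25

print(f"${dynamodb_on_demand(100_000, 1_000_000):.2f}")   # ~$0.38 (example above, no storage)
print(f"${dynamodb_provisioned(10, 100, 100):.2f}")       # ~$39 (example above)
```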
RDS
Cost = (Instance-hours × instance_rate) + (Storage GB × $0.10/month) + Backups
Example: db.t3.small (~$0.066/hour), 100 GB storage
= (730 hours × $0.066) + (100 × $0.10)
= 48.18 + 10
= ~$58/month
Redshift
Cost = (Nodes × node_rate/hour × 730 hours) + managed storage (RA3 node types only; dc2 nodes include local SSD storage)
Example: 2 dc2.large nodes (~$0.25/hour each)
= (2 × $0.25 × 730)
= ~$365/month
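RDS and Redshift share the same shape: instance-hours times an hourly rate, plus storage where it is billed separately. A generic sketch with the example rates above (actual rates vary by engine, node type, and region):

```python
HOURS_PER_MONTH = 730

def instance_monthly_cost(hourly_rate: float, nodes: int = 1,
                          storage_gb: float = 0, storage_rate: float = 0.10) -> float:
    """Generic instance-hour estimate: nodes x hourly rate x 730 hours, plus storage."""
    return nodes * hourly_rate * HOURS_PER_MONTH + storage_gb * storage_rate

print(f"${instance_monthly_cost(0.066, storage_gb=100):.2f}")  # RDS db.t3.small example, ~$58
print(f"${instance_monthly_cost(0.25, nodes=2):.2f}")          # Redshift 2x dc2.large, ~$365
```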
S3
Cost = (Storage GB × tier_rate) + (Requests × per-1,000-request rate) + (Data transfer out GB × $0.09)
Example: 1,000 GB Standard, 1M GET requests/month, 100 GB outbound
= (1,000 × $0.023) + (1,000 × $0.0004 per 1,000 GETs) + (100 × $0.09)
= $23 + $0.40 + $9
= ~$32.40/month
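The S3 example as a sketch (Standard tier, GET requests, internet egress; other storage classes, request types, and transfer paths carry different rates):

```python
def s3_monthly_cost(storage_gb: float, get_requests: int, egress_gb: float) -> float:
    """Storage + GET requests + data transfer out, using the rates above."""
    storage = storage_gb * 0.023                  # S3 Standard, per GB-month
    requests = (get_requests / 1000) * 0.0004     # GETs priced per 1,000 requests
    transfer = egress_gb * 0.09                   # internet egress per GB
    return storage + requests + transfer

print(f"${s3_monthly_cost(1000, 1_000_000, 100):.2f}")  # ~$32.40 for the example above
```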
Part E: Decision Documentation Template
Use this template for every major architectural decision.
```
# Architecture Decision Record (ADR)

## Title: [Service Choice or Pattern Adoption]

### Status: Proposed | Accepted | Deprecated

### Context
- Business intent: [What problem are we solving?]
- Scale requirements: [Traffic, data volume, growth]
- Constraints: [Budget, compliance, team skill, timeline]

### Decision
We will use **[AWS Service(s)]** to [implement pattern/solve problem].

### Rationale
1. Functional requirement alignment: [How does it solve the problem?]
2. Non-functional trade-offs: [Cost, latency, operational burden, scalability]
3. Alternatives considered: [Why not Lambda/ECS/RDS/etc.?]
4. Cost analysis: [$X/month at baseline, $Y/month at 2x peak]
5. Well-Architected alignment:
   - **Operational Excellence**: [Observability, automation, procedures] → ✓ / ✗
   - **Security**: [Encryption, access control, auditing] → ✓ / ✗
   - **Reliability**: [Failover, backups, RTO/RPO] → ✓ / ✗
   - **Performance**: [Latency, throughput, caching] → ✓ / ✗
   - **Cost**: [Cost-effective at scale?] → ✓ / ✗
   - **Sustainability**: [Efficient resource use?] → ✓ / ✗

### Consequences
- Positive: [Faster time-to-market, lower cost, better scalability]
- Negative: [Higher operational burden, vendor lock-in, cold starts]
- Risks: [Single-region dependency, cache invalidation complexity]
- Mitigation: [Add multi-AZ, circuit breaker, monitoring]

### Evolution Path
1. MVP (Months 0-6): Use this service as-is
2. Scale (Months 6-12): Migrate to [alternative] if [specific metrics exceed thresholds]
3. Mature (Year 2+): Consider [next-level optimization]

### Approval
- Reviewed by: [Architect name]
- Approved by: [Tech lead, manager]
- Date: [YYYY-MM-DD]
```
Part F: Production Readiness Checklist
Before going live, verify:
Code & Deployment (Week before launch)
- Code reviewed and merged to main
- Unit tests pass (> 80% coverage)
- Integration tests pass
- CI/CD pipeline builds and deploys successfully
- Secrets (DB passwords, API keys) in AWS Secrets Manager, not code
- Blue-green or canary deployment tested
- Rollback procedure documented and tested
Infrastructure & Security (Week before launch)
- All infrastructure as code (CloudFormation/Terraform/CDK)
- Security group rules follow least-privilege
- Data encrypted at-rest (KMS) and in-transit (TLS)
- IAM roles follow least-privilege (no wildcards)
- Secrets Manager configured for credential rotation
- VPC design reviewed (public/private subnets correct)
- CloudTrail enabled for audit logging (see the verification sketch after this checklist)
- WAF configured for web services
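Some of these items can be verified with a short script rather than by hand. A sketch that confirms CloudTrail trails exist and are actively logging; it assumes credentials with `cloudtrail:DescribeTrails` and `cloudtrail:GetTrailStatus`:

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Every trail should exist, be multi-region where required, and be actively logging.
trails = cloudtrail.describe_trails()["trailList"]
if not trails:
    print("FAIL: no CloudTrail trails configured")

for trail in trails:
    status = cloudtrail.get_trail_status(Name=trail["TrailARN"])
    state = "OK" if status["IsLogging"] else "FAIL: logging disabled"
    print(f"{trail['Name']} (multi-region={trail.get('IsMultiRegionTrail', False)}): {state}")
```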
Monitoring & Alerting (Week before launch)
- CloudWatch dashboards created
- Key metrics identified (latency p99, error rate, throughput)
- Alarms configured for the following (see the example alarm sketch after this checklist):
- Error rate > 1%
- Latency p99 > acceptable threshold
- CPU > 80%
- Database connection pool > 80%
- Auto-scaling triggered
- On-call team assigned
- Escalation procedure documented
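One of the alarms above as a concrete boto3 sketch: p99 latency on an ALB target. The load balancer dimension, threshold, and SNS topic ARN are placeholders to replace with your own values:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when p99 response time stays above 200 ms for 3 consecutive minutes (example threshold).
cloudwatch.put_metric_alarm(
    AlarmName="api-latency-p99-high",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}],  # placeholder
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0.2,                      # seconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # placeholder SNS topic
)
```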
Data & Backups (Week before launch)
- Backup strategy defined and tested (RTO/RPO verified)
- Multi-AZ replication enabled
- Point-in-time recovery tested
- Cross-region disaster recovery plan documented
- Data retention policies configured
Documentation & Runbooks (Week before launch)
- Architecture diagram created and shared
- API documentation complete (OpenAPI/Swagger)
- Runbooks for common operational procedures:
- Incident response
- Scaling up/down
- Failover procedures
- Database migration
- Deployment procedures documented
- Rollback procedures documented
Load Testing (3 days before launch)
- Load test at 2x expected peak traffic
- Latency p99 < target
- No errors under load
- Auto-scaling triggers correctly
- Circuit breakers prevent cascading failures
Security Audit (3 days before launch)
- Penetration testing completed (if applicable)
- OWASP Top 10 vulnerabilities checked
- IAM permissions reviewed
- No hardcoded secrets in code
- Compliance requirements (HIPAA, PCI, GDPR) met
Final Sign-off (Day of launch)
- Business owner approves launch
- Tech lead approves launch
- Security team approves launch
- On-call team briefed
- Incident response plan activated
- Gradual rollout plan confirmed (5% → 25% → 50% → 100%)
Part G: Service Cost Comparison Matrix (Monthly Estimates)
| Use Case | Lambda | ECS Fargate | EC2 (On-Demand) | EC2 (Reserved) | RDS | DynamoDB |
|---|---|---|---|---|---|---|
| API Backend (100 req/sec) | $300 | $600 | $800 | $300 | $60 | $50 |
| API Backend (1000 req/sec) | $2,500 | $1,200 | $2,400 | $900 | $200 | $400 |
| API Backend (10000 req/sec) | $25,000 | $6,000 | $12,000 | $4,000 | $800 | $2,000 |
| Batch Job (daily) | $20 | $200 | $500 | $100 | N/A | N/A |
| Streaming (1M events/day) | $100 | $400 | Varies | Varies | N/A | $50 |
| Data Warehouse (100GB) | N/A | N/A | N/A | N/A | $100 | N/A |
| Data Lake (1TB) | N/A | N/A | N/A | N/A | N/A | N/A |
Note: Estimates are rough; actual costs depend on specific implementation details. Use AWS Pricing Calculator for accuracy.
Key Takeaways
- Business intent first: Let business drivers (speed to market, cost, reliability) guide architecture
- Measure before optimizing: Don't over-engineer; prove the requirements justify the complexity
- Use managed services by default: Reduce operational burden; AWS optimizes for efficiency
- Right-size continuously: Monitor utilization; resize monthly to match actual demand
- Design for failure: Test recovery scenarios; prepare runbooks before incidents
- Align with Well-Architected: Security and operational excellence are non-negotiable
- Document decisions: ADRs help teams understand rationale and avoid repeated debates
- Plan for evolution: MVP → Scale → Mature path provides clarity for multi-year roadmap
- Monitor costs relentlessly: Cost anomalies reveal inefficiencies
- Learn from incidents: Postmortems and blameless cultures drive continuous improvement
Last Updated: December 15, 2024
Framework Version: AWS Well-Architected (June 2024)