AWS Quick Reference Guide
Rapid decision-making guide with quick reference checklists, decision trees, and cost estimation formulas for AWS architecture.
AWS Universal Architecture Framework - Quick Reference Guide
For rapid decision-making and architecture reviews
Part A: The 10-Dimension Rapid Assessment Checklist
Use this checklist when you have 30 minutes to understand a problem and make initial architectural decisions. A sketch for capturing the ten outputs as a single record follows the checklist.
1. Business Intent (5 min)
- What's the core value? (revenue driver, cost saving, risk mitigation)
- Go-to-market timeline? (MVP weeks, scale months, mature years)
- User base size? (10, 1k, 1M+)
- Regulatory constraints? (HIPAA, PCI-DSS, GDPR, SOX, none)
- Risk tolerance? (experimental, acceptable, mission-critical)
Output: 1-sentence business thesis
2. User & System Actors (3 min)
- Users: How many? Which regions? Which devices/clients?
- Concurrent users during peak? (10, 100, 1k, 10k+)
- API integrations? (B2B partners, internal systems, mobile SDKs)
- Automation integrators? (batch jobs, webhooks, event-driven)
Output: Actor matrix (type, count, geography, concurrency)
3. Data Characteristics (5 min)
- Volume: GB? TB? PB?
- Velocity: Batch (daily/hourly), streaming (continuous), real-time (milliseconds)?
- Variety: SQL, JSON, images, time-series, unstructured?
- Sensitivity: Public, internal, confidential, PII/regulated?
- Retention: Transactional (months), historical (years), archive (indefinite)?
Output: Data classification (volume, velocity, variety, sensitivity)
4. Workload Type (3 min)
- Synchronous? (API calls, user waits)
- Asynchronous? (queues, events, background jobs)
- Batch? (scheduled, high volume)
- Streaming? (continuous, low-latency per event)
- Long-running? (workflows, multi-step processes)
Output: Primary + secondary workload archetypes
5. Traffic & Scale (3 min)
- Baseline requests/sec? (1, 10, 100, 1k, 10k+)
- Peak requests/sec? (2x, 10x, 100x baseline)
- Data transfer? (MB/s, GB/s)
- Burst frequency? (never, daily, hourly, continuous)
- Growth rate? (stable, linear, exponential)
Output: Traffic profile (baseline, peak, growth trajectory)
6. Availability & Durability (3 min)
- RTO (Recovery Time Objective): Hours, minutes, seconds?
- RPO (Recovery Point Objective): Days, hours, zero data loss?
- Criticality: Development, non-critical, critical, mission-critical?
- Failover: Manual, automatic, multi-region?
- Data protection: Snapshots, replicas, event sourcing?
Output: Availability matrix (component, RTO, RPO, strategy)
7. Security & Compliance (3 min)
- Data classification? (public, internal, confidential, PII)
- Compliance frameworks? (HIPAA, PCI-DSS, GDPR, SOX, FedRAMP, none)
- Encryption required? (at-rest, in-transit, both, none)
- Access control model? (role-based, attribute-based, IP-restricted)
- Audit/logging: None, basic, comprehensive?
Output: Security posture summary
8. Cost Sensitivity (2 min)
- Budget: Unconstrained, $100/month, $1k/month, $10k+/month?
- Cost model preference: CapEx, OpEx, or variable?
- Commitment level: PAYG, 1-year, 3-year?
- Cost optimization priority: Low, medium, high?
Output: Cost constraints and optimization targets
9. Operational Complexity (2 min)
- Team maturity: Beginners, intermediate, advanced?
- DevOps/SRE capability: None, basic, strong?
- Operational burden tolerance: Very low, low, moderate?
- Tools and processes: Existing, need to build, greenfield?
Output: Operational profile and staffing implications
10. Extensibility & Evolution (2 min)
- Change frequency: Rarely, quarterly, monthly, weekly?
- Integration needs: None, few, many?
- Architecture path: Monolith OK, need microservices?
- Technology lock-in tolerance: High, medium, low?
Output: Evolution roadmap outline
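The ten "Output" lines above can travel with the design as one lightweight record. A minimal Python sketch (3.9+), assuming nothing beyond the checklist itself; all field names and example values are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass
class RapidAssessment:
    """One record per 30-minute assessment; each field holds one dimension's 'Output'."""
    business_thesis: str                                            # 1. Business intent
    actor_matrix: list[dict] = field(default_factory=list)          # 2. type, count, geography, concurrency
    data_profile: dict = field(default_factory=dict)                # 3. volume, velocity, variety, sensitivity
    workload_archetypes: list[str] = field(default_factory=list)    # 4. primary + secondary
    traffic_profile: dict = field(default_factory=dict)             # 5. baseline, peak, growth
    availability_matrix: list[dict] = field(default_factory=list)   # 6. component, RTO, RPO, strategy
    security_posture: str = ""                                      # 7. posture summary
    cost_constraints: dict = field(default_factory=dict)            # 8. budget, commitment, optimization priority
    operational_profile: str = ""                                   # 9. team maturity, burden tolerance
    evolution_notes: str = ""                                       # 10. roadmap outline

# Example usage (hypothetical values):
assessment = RapidAssessment(
    business_thesis="Self-serve reporting portal to cut analyst turnaround from days to minutes",
    traffic_profile={"baseline_rps": 50, "peak_rps": 500, "growth": "linear"},
)
```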
Part B: Service Selection Quick Decision Trees
Compute Selection
"How long does work run?" ├─ < 15 min │ └─ "Variable load?" │ ├─ Yes → Lambda (cost optimized) │ └─ No → "Latency critical?" │ ├─ Yes → ECS Fargate │ └─ No → EC2 (if baseline cost justified) ├─ 15 min - 1 hour │ └─ "Batch or service?" │ ├─ Batch → Batch or EMR │ └─ Service → ECS or EKS └─ > 1 hour └─ EMR (distributed) or EC2 (standalone)
Cost Rule of Thumb:
- Lambda: Best for < 100 req/sec, unpredictable
- ECS: Best for 100-1000 req/sec, moderate peaks
- EC2 Reserved: Best for > 1000 req/sec, predictable
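The tree and cost rules above can be encoded as a small helper so the choice is repeatable in design reviews; the same approach works for the database, storage, and integration trees that follow. A sketch only, with thresholds mirroring the rules of thumb rather than hard limits:

```python
def select_compute(task_minutes: float, variable_load: bool,
                   latency_critical: bool, is_batch: bool) -> str:
    """Mirror of the compute decision tree above; returns a starting point, not a mandate."""
    if task_minutes < 15:
        if variable_load:
            return "Lambda (cost optimized for spiky, short tasks)"
        return "ECS Fargate" if latency_critical else "EC2 (if baseline cost justified)"
    if task_minutes <= 60:
        return "AWS Batch or EMR" if is_batch else "ECS or EKS"
    return "EMR (distributed) or EC2 (standalone)"

def compute_rule_of_thumb(req_per_sec: float, predictable: bool) -> str:
    """Rough cost guidance from the rules of thumb above."""
    if req_per_sec < 100 and not predictable:
        return "Lambda"
    if req_per_sec <= 1000:
        return "ECS"
    # Above ~1000 req/sec, reservations pay off only when traffic is predictable.
    return "EC2 Reserved" if predictable else "ECS"

print(select_compute(task_minutes=5, variable_load=True, latency_critical=False, is_batch=False))
# -> Lambda (cost optimized for spiky, short tasks)
```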
Database Selection
"Access pattern?" ├─ Complex queries (joins, aggregations) │ └─ "Scale (TB)?" │ ├─ < 100 GB → RDS │ └─ > 100 GB → Aurora or Redshift ├─ Key-value lookups, real-time │ └─ DynamoDB ├─ Full-text search, logs │ └─ OpenSearch ├─ Time-series metrics │ └─ Timestream └─ Graph relationships └─ Neptune
Cost Rule of Thumb:
- DynamoDB on-demand: Best for variable traffic (0 baseline cost)
- RDS with Savings Plans: Best for predictable SQL workloads
- Redshift: Best for data warehouse (> 1TB, complex BI queries)
Storage Selection
"Data type?" ├─ Objects (files, media, archives) │ └─ S3 ├─ Block storage (EC2 volumes) │ └─ EBS ├─ Shared filesystem │ └─ "Windows required?" │ ├─ Yes → FSx for Windows │ └─ No → EFS └─ Data lake (structured + unstructured) └─ S3 with Lake Formation
Integration Selection
"Communication pattern?" ├─ One-to-many (fanout) │ └─ "Message history needed?" │ ├─ Yes → EventBridge with archive │ └─ No → SNS (simple) ├─ One-to-one (queue) │ └─ "Ordering critical?" │ ├─ Yes → SQS FIFO │ └─ No → SQS Standard ├─ Streaming (continuous, ordered by shard) │ └─ Kinesis ├─ Complex routing (rules, multiple conditions) │ └─ EventBridge └─ Multi-step workflow └─ Step Functions
Part C: Well-Architected Review Checklist (60 min)
Operational Excellence (12 min)
- Are teams organized by business outcome (not technology)?
- Can any team member quickly explain the system?
- Is every component observable? (metrics, logs, traces)
- Are deployments automated and safe? (blue-green, canary)
- Do runbooks exist for common issues?
- MTTR < 15 min for P1 incidents?
- Incident postmortems conducted and improvements tracked?
Scoring: Count checks. 6-7 = Excellent; 4-5 = Good; < 4 = Needs improvement
Security (12 min)
- All principals (users, roles, services) authenticated?
- IAM policies follow least-privilege (no wildcard actions or resources)?
- All data encrypted at-rest (KMS) and in-transit (TLS)?
- All API calls logged to CloudTrail?
- Data classification defined and enforced?
- Secrets (DB passwords, API keys) in Secrets Manager, not code?
- PII/sensitive data protected per regulation?
Scoring: 6-7 = Excellent; 4-5 = Good; < 4 = Needs improvement
Reliability (12 min)
- Can any single service/component fail without total outage?
- Are failure recovery steps tested quarterly?
- RTO/RPO targets defined and met?
- Backups automated and verified?
- Multi-AZ deployment for critical components?
- Circuit breakers and retry logic in place?
- Graceful degradation if capacity exceeded?
Scoring: 6-7 = Excellent; 4-5 = Good; < 4 = Needs improvement
Performance Efficiency (12 min)
- API latency p99 < target (e.g., 200ms)?
- Caching used for expensive operations? (CloudFront, ElastiCache)
- Database queries optimized? (indexes, query plans reviewed)
- Async patterns used where synchronous not required?
- Compute resources right-sized? (not overprovisioned)
- Network optimized? (local processing, VPC Endpoints for AWS services)
- No N+1 queries or polling loops?
Scoring: 6-7 = Excellent; 4-5 = Good; < 4 = Needs improvement
Cost Optimization (12 min)
- Current monthly cost documented and justified?
- Cost per transaction calculated? (trending down?)
- 70%+ of compute cost on Reserved/Savings Plans?
- Right-sizing recommendations from Trusted Advisor implemented?
- Unused resources (unattached volumes, stopped instances, idle databases) cleaned up?
- Data in cheaper tiers? (Glacier for archive, Intelligent-Tiering for unknown access)
- Cost allocation tags applied? (business unit, application, environment; see the query sketch after this pillar)
Scoring: 6-7 = Excellent; 4-5 = Good; < 4 = Needs improvement
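Several of these checks (cost per application, tag coverage, anomalous spend) can be pulled straight from Cost Explorer rather than compiled by hand. A minimal boto3 sketch; the tag key `application` and the date range are assumptions to adapt, and the caller needs `ce:GetCostAndUsage` permission:

```python
import boto3

# Cost Explorer: last month's unblended cost grouped by the 'application' cost allocation tag.
ce = boto3.client("ce")
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-11-01", "End": "2024-12-01"},  # assumed reporting window
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "TAG", "Key": "application"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    tag_value = group["Keys"][0]   # e.g. "application$checkout-api"; an empty value means untagged spend
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{tag_value or 'UNTAGGED'}: ${amount:,.2f}")
```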
Sustainability (8 min)
- Managed services used where possible? (no idle infrastructure)
- Auto-scaling configured? (no over-provisioned capacity)
- Data stored in efficient tiers? (lifecycle policies active)
- No unnecessary data copies or inter-region transfers?
- Instance types modern? (Graviton, Trainium considered)
Scoring: 4-5 = Excellent; 2-3 = Good; < 2 = Needs improvement
Overall Score: convert each pillar to a percentage (checks passed ÷ checks available × 100), then average the six pillars: (OE + Security + Reliability + Performance + Cost + Sustainability) / 6
- 90-100: Well-architected; ready for production
- 75-89: Good; address gaps in lower-scoring pillars
- 60-74: Needs improvement; prioritize security, reliability, cost
- < 60: Critical issues; address before production
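A small helper that applies the scoring rule above: each pillar's check count becomes a percentage of its available checks, and the six percentages are averaged. A sketch, not an official scoring tool; the pillar totals match the checklists in this part:

```python
# Checks available per pillar, matching Part C above.
PILLAR_TOTALS = {
    "operational_excellence": 7,
    "security": 7,
    "reliability": 7,
    "performance": 7,
    "cost": 7,
    "sustainability": 5,
}

def overall_score(checks_passed: dict[str, int]) -> float:
    """Average of per-pillar percentages; 90+ is well-architected, <60 is critical (bands above)."""
    percentages = [100.0 * checks_passed[p] / total for p, total in PILLAR_TOTALS.items()]
    return sum(percentages) / len(percentages)

# Example: strong on security and reliability, weak on cost and sustainability (hypothetical review).
score = overall_score({
    "operational_excellence": 5, "security": 7, "reliability": 6,
    "performance": 5, "cost": 3, "sustainability": 2,
})
print(f"{score:.0f}")  # ~69 -> "Needs improvement" (60-74 band)
```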
Part D: Cost Estimation Quick Formulas
Lambda
Cost = (Requests × 0.0000002) + (GB-seconds × 0.0000166667)
Example: 1M requests/month, 1 GB memory, 100 ms average duration
= (1,000,000 × $0.0000002) + (1,000,000 × 0.1 s × 1 GB × $0.0000166667)
= $0.20 + $1.67
= ~$1.87/month
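The same arithmetic as a reusable sketch, using the rates quoted above (ignores the free tier and provisioned concurrency):

```python
def lambda_monthly_cost(requests: int, memory_gb: float, avg_duration_sec: float) -> float:
    """Request charge plus GB-second charge, using the rates above."""
    request_cost = requests * 0.0000002                  # $0.20 per 1M requests
    gb_seconds = requests * avg_duration_sec * memory_gb
    compute_cost = gb_seconds * 0.0000166667             # per GB-second
    return request_cost + compute_cost

print(f"${lambda_monthly_cost(1_000_000, 1.0, 0.1):.2f}")  # ~$1.87 for the example above
```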
DynamoDB
On-Demand Cost = (Write request units ÷ 1M × $1.25) + (Read request units ÷ 1M × $0.25) + Storage
Example: 100k writes, 1M reads/month
= (0.1 × $1.25) + (1 × $0.25)
= $0.125 + $0.25
= ~$0.38/month plus storage (cheap at this volume!)
Provisioned Cost = (WCU × $0.00065/hour × 730 hours) + (RCU × $0.00013/hour × 730 hours) + (Storage GB × $0.25)
Example: 10 WCU, 100 RCU, 100 GB
= (10 × $0.00065 × 730) + (100 × $0.00013 × 730) + (100 × $0.25)
= $4.75 + $9.49 + $25
= ~$39/month (more than on-demand at this low traffic, but far cheaper per request at sustained high utilization)
Breakeven: on-demand usually wins for spiky or low-utilization traffic; provisioned wins once load is steady and keeps the provisioned capacity well utilized. Run both formulas against your actual traffic to find the crossover.
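Both pricing modes as a sketch using the rates above (standard table class, us-east-1; verify current prices, since DynamoDB rates have changed over time):

```python
def dynamodb_on_demand(writes: int, reads: int, storage_gb: float = 0) -> float:
    """On-demand: pay per request unit plus storage."""
    return (writes / 1e6) * 1.25 + (reads / 1e6) * 0.25 + storage_gb * 0.25

def dynamodb_provisioned(wcu: int, rcu: int, storage_gb: float = 0) -> float:
    """Provisioned: pay per capacity-unit-hour plus storage (730 hours/month)."""
    return wcu * 0.00065 * 730 + rcu * 0.00013 * 730 + storage_gb * 0.25

print(f"${dynamodb_on_demand(100_000, 1_000_000):.2f}")   # ~$0.38 (example above, no storage)
print(f"${dynamodb_provisioned(10, 100, 100):.2f}")       # ~$39 (example above)
```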
RDS
Cost = (Instance-hours × instance_rate) + (Storage GB × $0.10/month) + Backups
Example: db.t3.small (~$0.066/hour), 100 GB storage
= (730 hours × $0.066) + (100 × $0.10)
= 48.18 + 10
= ~$58/month
Redshift
Cost = (Nodes × node_rate/hour × 730 hours) + managed storage (RA3 node types only; dc2 nodes include local SSD storage)
Example: 2 dc2.large nodes (~$0.25/hour each)
= (2 × $0.25 × 730)
= ~$365/month
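RDS and Redshift share the same shape: instance-hours times an hourly rate, plus storage where it is billed separately. A generic sketch with the example rates above (actual rates vary by engine, node type, and region):

```python
HOURS_PER_MONTH = 730

def instance_monthly_cost(hourly_rate: float, nodes: int = 1,
                          storage_gb: float = 0, storage_rate: float = 0.10) -> float:
    """Generic instance-hour estimate: nodes x hourly rate x 730 hours, plus storage."""
    return nodes * hourly_rate * HOURS_PER_MONTH + storage_gb * storage_rate

print(f"${instance_monthly_cost(0.066, storage_gb=100):.2f}")  # RDS db.t3.small example, ~$58
print(f"${instance_monthly_cost(0.25, nodes=2):.2f}")          # Redshift 2x dc2.large, ~$365
```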
S3
Cost = (Storage GB × tier_rate) + (Requests × per-1,000-request rate) + (Data transfer out GB × $0.09)
Example: 1,000 GB Standard, 1M GET requests/month, 100 GB outbound
= (1,000 × $0.023) + (1,000 × $0.0004 per 1,000 GETs) + (100 × $0.09)
= $23 + $0.40 + $9
= ~$32.40/month
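The S3 example as a sketch (Standard tier, GET requests, internet egress; other storage classes, request types, and transfer paths carry different rates):

```python
def s3_monthly_cost(storage_gb: float, get_requests: int, egress_gb: float) -> float:
    """Storage + GET requests + data transfer out, using the rates above."""
    storage = storage_gb * 0.023                  # S3 Standard, per GB-month
    requests = (get_requests / 1000) * 0.0004     # GETs priced per 1,000 requests
    transfer = egress_gb * 0.09                   # internet egress per GB
    return storage + requests + transfer

print(f"${s3_monthly_cost(1000, 1_000_000, 100):.2f}")  # ~$32.40 for the example above
```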
Part E: Decision Documentation Template
Use this template for every major architectural decision.
```
# Architecture Decision Record (ADR)

## Title: [Service Choice or Pattern Adoption]

### Status: Proposed | Accepted | Deprecated

### Context
- Business intent: [What problem are we solving?]
- Scale requirements: [Traffic, data volume, growth]
- Constraints: [Budget, compliance, team skill, timeline]

### Decision
We will use **[AWS Service(s)]** to [implement pattern/solve problem].

### Rationale
1. Functional requirement alignment: [How does it solve the problem?]
2. Non-functional trade-offs: [Cost, latency, operational burden, scalability]
3. Alternatives considered: [Why not Lambda/ECS/RDS/etc.?]
4. Cost analysis: [$X/month at baseline, $Y/month at 2x peak]
5. Well-Architected alignment:
   - **Operational Excellence**: [Observability, automation, procedures] → ✓ / ✗
   - **Security**: [Encryption, access control, auditing] → ✓ / ✗
   - **Reliability**: [Failover, backups, RTO/RPO] → ✓ / ✗
   - **Performance**: [Latency, throughput, caching] → ✓ / ✗
   - **Cost**: [Cost-effective at scale?] → ✓ / ✗
   - **Sustainability**: [Efficient resource use?] → ✓ / ✗

### Consequences
- Positive: [Faster time-to-market, lower cost, better scalability]
- Negative: [Higher operational burden, vendor lock-in, cold starts]
- Risks: [Single-region dependency, cache invalidation complexity]
- Mitigation: [Add multi-AZ, circuit breaker, monitoring]

### Evolution Path
1. MVP (Months 0-6): Use this service as-is
2. Scale (Months 6-12): Migrate to [alternative] if [specific metrics exceed thresholds]
3. Mature (Year 2+): Consider [next-level optimization]

### Approval
- Reviewed by: [Architect name]
- Approved by: [Tech lead, manager]
- Date: [YYYY-MM-DD]
```
Part F: Production Readiness Checklist
Before going live, verify:
Code & Deployment (Week before launch)
- Code reviewed and merged to main
- Unit tests pass (> 80% coverage)
- Integration tests pass
- CI/CD pipeline builds and deploys successfully
- Secrets (DB passwords, API keys) in AWS Secrets Manager, not code
- Blue-green or canary deployment tested
- Rollback procedure documented and tested
Infrastructure & Security (Week before launch)
- All infrastructure as code (CloudFormation/Terraform/CDK)
- Security group rules follow least-privilege
- Data encrypted at-rest (KMS) and in-transit (TLS)
- IAM roles follow least-privilege (no wildcards)
- Secrets Manager configured for credential rotation
- VPC design reviewed (public/private subnets correct)
- CloudTrail enabled for audit logging (see the verification sketch after this checklist)
- WAF configured for web services
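Some of these items can be verified with a short script rather than by hand. A sketch that confirms CloudTrail trails exist and are actively logging; it assumes credentials with `cloudtrail:DescribeTrails` and `cloudtrail:GetTrailStatus`:

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Every trail should exist, be multi-region where required, and be actively logging.
trails = cloudtrail.describe_trails()["trailList"]
if not trails:
    print("FAIL: no CloudTrail trails configured")

for trail in trails:
    status = cloudtrail.get_trail_status(Name=trail["TrailARN"])
    state = "OK" if status["IsLogging"] else "FAIL: logging disabled"
    print(f"{trail['Name']} (multi-region={trail.get('IsMultiRegionTrail', False)}): {state}")
```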
Monitoring & Alerting (Week before launch)
- CloudWatch dashboards created
- Key metrics identified (latency p99, error rate, throughput)
- Alarms configured for the following (see the example alarm sketch after this checklist):
- Error rate > 1%
- Latency p99 > acceptable threshold
- CPU > 80%
- Database connection pool > 80%
- Auto-scaling triggered
- On-call team assigned
- Escalation procedure documented
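One of the alarms above as a concrete boto3 sketch: p99 latency on an ALB target. The load balancer dimension, threshold, and SNS topic ARN are placeholders to replace with your own values:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when p99 response time stays above 200 ms for 3 consecutive minutes (example threshold).
cloudwatch.put_metric_alarm(
    AlarmName="api-latency-p99-high",
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/0123456789abcdef"}],  # placeholder
    ExtendedStatistic="p99",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0.2,                      # seconds
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],  # placeholder SNS topic
)
```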
Data & Backups (Week before launch)
- Backup strategy defined and tested (RTO/RPO verified)
- Multi-AZ replication enabled
- Point-in-time recovery tested
- Cross-region disaster recovery plan documented
- Data retention policies configured
Documentation & Runbooks (Week before launch)
- Architecture diagram created and shared
- API documentation complete (OpenAPI/Swagger)
- Runbooks for common operational procedures:
- Incident response
- Scaling up/down
- Failover procedures
- Database migration
- Deployment procedures documented
- Rollback procedures documented
Load Testing (3 days before launch)
- Load test at 2x expected peak traffic
- Latency p99 < target
- No errors under load
- Auto-scaling triggers correctly
- Circuit breakers prevent cascading failures
Security Audit (3 days before launch)
- Penetration testing completed (if applicable)
- OWASP Top 10 vulnerabilities checked
- IAM permissions reviewed
- No hardcoded secrets in code
- Compliance requirements (HIPAA, PCI, GDPR) met
Final Sign-off (Day of launch)
- Business owner approves launch
- Tech lead approves launch
- Security team approves launch
- On-call team briefed
- Incident response plan activated
- Gradual rollout plan confirmed (5% → 25% → 50% → 100%)
Part G: Service Cost Comparison Matrix (Monthly Estimates)
| Use Case | Lambda | ECS Fargate | EC2 (On-Demand) | EC2 (Reserved) | RDS | DynamoDB |
|---|---|---|---|---|---|---|
| API Backend (100 req/sec) | $300 | $600 | $800 | $300 | $60 | $50 |
| API Backend (1000 req/sec) | $2,500 | $1,200 | $2,400 | $900 | $200 | $400 |
| API Backend (10000 req/sec) | $25,000 | $6,000 | $12,000 | $4,000 | $800 | $2,000 |
| Batch Job (daily) | $20 | $200 | $500 | $100 | N/A | N/A |
| Streaming (1M events/day) | $100 | $400 | Varies | Varies | N/A | $50 |
| Data Warehouse (100GB) | N/A | N/A | N/A | N/A | $100 | N/A |
| Data Lake (1TB) | N/A | N/A | N/A | N/A | N/A | N/A |
Note: Estimates are rough; actual costs depend on specific implementation details. Use AWS Pricing Calculator for accuracy.
Key Takeaways
- Business intent first: Let business drivers (speed to market, cost, reliability) guide architecture
- Measure before optimizing: Don't over-engineer; prove the requirements justify the complexity
- Use managed services by default: Reduce operational burden; AWS optimizes for efficiency
- Right-size continuously: Monitor utilization; resize monthly to match actual demand
- Design for failure: Test recovery scenarios; prepare runbooks before incidents
- Align with Well-Architected: Security and operational excellence are non-negotiable
- Document decisions: ADRs help teams understand rationale and avoid repeated debates
- Plan for evolution: MVP → Scale → Mature path provides clarity for multi-year roadmap
- Monitor costs relentlessly: Cost anomalies reveal inefficiencies
- Learn from incidents: Postmortems and blameless cultures drive continuous improvement
Last Updated: December 15, 2024
Framework Version: AWS Well-Architected (June 2024)