malcolm.cloud
AWS Well-Architected Framework Summary
By Malcolm van Staden on 3rd August 2021
This is a bullet-form summary of the AWS Well-Architected Framework. The full framework is available to read on the AWS website.
Operational Excellence
- Design Principles
- Perform operations as code
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from all operational failures
- Best Practices
- Organisation
- How do you determine what your priorities are?
- Evaluate external customer needs
- Evaluate internal customer needs
- Evaluate governance requirements
- Evaluate compliance requirements
- Evaluate threat landscape
- Evaluate trade-offs
- Manage benefits and risks
- How do you structure your organization to support your business outcomes?
- Resources have identified owners
- Processes and procedures have identified owners
- Operations activities have identified owners responsible for their performance
- Team members know what they are responsible for
- Mechanisms exist to identify responsibility and ownership
- Mechanisms exist to request additions, changes, and exceptions
- Responsibilities between teams are predefined or negotiated
- How does your organizational culture support your business outcomes?
- Executive Sponsorship
- Team members are empowered to take action when outcomes are at risk
- Escalation is encouraged
- Communications are timely, clear, and actionable
- Experimentation is encouraged
- Team members are enabled and encouraged to maintain and grow their skill sets
- Resource teams appropriately
- Diverse opinions are encouraged and sought within and across teams
- Prepare
- How do you design your workload so that you can understand its state?
- Implement application telemetry
- Implement and configure workload telemetry
- Implement user activity telemetry
- Implement dependency telemetry
- Implement transaction traceability
- How do you reduce defects, ease remediation, and improve flow into production?
- Use version control
- Test and validate changes
- Use configuration management systems
- Use build and deployment management systems
- Perform patch management
- Share design standards
- Implement practices to improve code quality
- Use multiple environments
- Make frequent, small, reversible changes
- How do you mitigate deployment risks?
- Plan for unsuccessful changes
- Test and validate changes
- Use deployment management systems
- Test using limited deployments
- Deploy using parallel environments
- Deploy frequent, small, reversible changes
- Fully automate integration and deployment
- Automate testing and rollback
- How do you know that you are ready to support a workload?
- Ensure personnel capability
- Ensure consistent review of operational readiness
- Use runbooks to perform procedures
- Use playbooks to investigate issues
- Make informed decisions to deploy systems and changes
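The deployment-risk items above (limited deployments, validating changes, automated rollback) can be pictured with a minimal Python sketch. Everything here is an assumption made for illustration: `staged_rollout` and its `deploy`, `healthy` and `rollback` callables are hypothetical names, not part of any AWS service or of the framework itself.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    deploy: Callable[[int], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Widen exposure stage by stage; roll back on the first failed check.

    All four parameters are hypothetical hooks supplied by the caller.
    """
    for percent in stages:
        deploy(percent)      # send `percent` % of traffic to the new version
        if not healthy():    # validate the change before widening exposure
            rollback()       # automated rollback, no manual step required
            return False
    return True
```

A caller might pass stages such as `[5, 25, 100]`, so that a bad release is caught while it only affects a small share of traffic.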
- Operate
- How do you understand the health of your workload?
- Identify key performance indicators
- Define workload metrics
- Collect and analyse workload metrics
- Establish workload metrics baselines
- Learn expected patterns of activity for workload
- Alert when workload outcomes are at risk
- Alert when workload anomalies are detected
- Validate the achievement of outcomes and the effectiveness of KPIs and metrics
- How do you understand the health of your operations?
- Identify key performance indicators
- Define operations metrics
- Collect and analyse operations metrics
- Establish operations metrics baselines
- Learn the expected patterns of activity for operations
- Alert when operations outcomes are at risk
- Alert when operations anomalies are detected
- Validate the achievement of outcomes and the effectiveness of KPIs and metrics
- How do you manage workload and operations events?
- Use processes for event, incident, and problem management
- Have a process per alert
- Prioritize operational events based on business impact
- Define escalation paths
- Enable push notifications
- Communicate status through dashboards
- Automate responses to events
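To make the "establish baselines" and "alert when anomalies are detected" items more concrete, here is a small Python sketch. It assumes a very simple model (a value is anomalous when it sits more than a chosen number of standard deviations from the baseline mean); the framework does not prescribe this method, and `is_anomalous` is a hypothetical helper.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(baseline: Sequence[float], value: float, threshold: float = 3.0) -> bool:
    """Return True when `value` is more than `threshold` standard
    deviations away from the mean of the baseline samples."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is unexpected.
        return value != mu
    return abs(value - mu) / sigma > threshold
```

In practice the baseline would come from collected workload or operations metrics, and a `True` result would feed the alerting and escalation paths listed above.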
- Evolve
- How do you evolve operations?
- Have a process for continuous improvement
- Perform post-incident analysis
- Implement feedback loops
- Perform Knowledge Management
- Define drivers for improvement
- Validate insights
- Perform operations metrics reviews
- Document and share lessons learned
- Allocate time to make improvements
Security
- Design Principles
- Implement a strong identity foundation
- Enable traceability
- Apply security at all layers
- Automate security best practices
- Protect data in transit and at rest
- Keep people away from data
- Prepare for security events
- Best Practices
- Security
- How do you securely operate your workload?
- Separate workloads using accounts
- Secure AWS account
- Identify and validate control objectives
- Keep up to date with security threats
- Keep up to date with security recommendations
- Automate testing and validation of security controls in pipelines
- Identify and prioritize risks using a threat model
- Evaluate and implement new security services and features regularly
- Identity and Access Management (IAM)
- How do you manage authentication for people and machines?
- Use strong sign-in mechanisms
- Use temporary credentials
- Store and use secrets securely
- Rely on a centralized identity provider
- Audit and rotate credentials periodically
- Leverage user groups and attributes
- How do you manage permissions for people and machines?
- Define access requirements
- Grant least privilege access
- Establish emergency access process
- Reduce permissions continuously
- Define permission guardrails for your organization
- Manage access based on life cycle
- Analyse public and cross account access
- Share resources securely
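As an illustration of "grant least privilege access", the sketch below builds an IAM policy document that allows reading objects under a single S3 prefix and nothing else. The policy grammar (`Version`, `Statement`, `Effect`, `Action`, `Resource`) is standard IAM JSON; the function name, bucket and prefix are hypothetical examples.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> dict:
    """Build a policy that only permits s3:GetObject under one prefix.

    `bucket` and `prefix` are illustrative inputs, not real resources.
    """
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],  # no wildcards, no write actions
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(read_only_prefix_policy("example-bucket", "reports"), indent=2))
```

Starting from a narrow policy like this and widening it only on demand fits the "reduce permissions continuously" item better than starting broad and trimming later.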
- Detection
- How do you detect and investigate security events?
- Configure service and application logging
- Analyse logs, findings, and metrics centrally
- Automate response to events
- Implement actionable security events
- Infrastructure Protection
- How do you protect your network resources?
- Create network layers
- Control traffic at all layers
- Automate network protection
- Implement inspection and protection
- How do you protect your compute resources?
- Perform vulnerability management
- Reduce attack surface
- Implement managed services
- Automate compute protection
- Enable people to perform actions at a distance
- Validate software integrity
- Data Protection
- How do you classify your data?
- Identify the data within your workload
- Define data protection controls
- Automate identification and classification
- Define data lifecycle management
- How do you protect your data at rest?
- Implement secure key management
- Enforce encryption at rest
- Automate data at rest protection
- Enforce access control
- Use mechanisms to keep people away from data
- How do you protect your data in transit?
- Implement secure key and certificate management
- Enforce encryption in transit
- Automate detection of unintended data access
- Authenticate network communications
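On the client side, "enforce encryption in transit" and "authenticate network communications" can be sketched with Python's standard `ssl` module. This is only one small piece of the picture, and the helper name `strict_tls_context` is hypothetical.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Client-side TLS context that verifies the server and refuses old protocols."""
    ctx = ssl.create_default_context()            # certificate and hostname checks are on by default
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # reject anything older than TLS 1.2
    return ctx
```

The returned context can be passed to standard-library clients that accept one, for example `urllib.request.urlopen(url, context=ctx)`.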
- Incident Response
- How do you anticipate, respond to, and recover from incidents?
- Identify key personnel and external resources
- Develop incident management plans
- Prepare forensic capabilities
- Automate containment capability
- Pre-provision access
- Pre-deploy tools
- Run game days
Reliability
- Design Principles
- Automatically recover from failure
- Test recovery procedures
- Scale horizontally to increase aggregate workload availability
- Stop guessing capacity
- Manage change in automation
- Best Practices
- Foundations
- How do you manage service quotas and constraints?
- Be aware of service quotas and constraints
- Manage service quotas across accounts and regions
- Accommodate fixed service quotas and constraints through architecture
- Monitor and manage quotas
- Automate quota management
- Ensure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover
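The failover gap in the last item can be worked through with some simple arithmetic. The sketch below is one reading of it, under two stated assumptions: load is spread evenly across locations, and resources in a failed location still count against the quota while their replacements are being launched. The function name is hypothetical.

```python
import math


def quota_with_failover_headroom(peak_usage: int, locations: int, failed_locations: int = 1) -> int:
    """Quota needed to hold peak usage plus replacements for failed locations.

    Assumes an even spread of resources and that failed resources are
    still counted until they are terminated.
    """
    if not 0 <= failed_locations < locations:
        raise ValueError("failed_locations must be between 0 and locations - 1")
    per_location = math.ceil(peak_usage / locations)
    return peak_usage + per_location * failed_locations
```

For example, a peak of 90 instances over 3 locations with one location failing gives 90 + 30 = 120, so a quota of exactly 90 would block the failover.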
- How do you plan your network topology?
- Use highly available network connectivity for your workload public endpoints
- Provision redundant connectivity between private networks in the cloud and on-premises environments
- Ensure IP subnet allocation accounts for expansion and availability
- Prefer hub-and-spoke topologies over many-to-many mesh
- Enforce non-overlapping private IP address ranges in all private address spaces where they are connected
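Checking for overlapping private address ranges is easy to automate. The sketch below uses Python's standard `ipaddress` module; the function name and the CIDR blocks in the usage note are made-up examples.

```python
from ipaddress import ip_network
from itertools import combinations
from typing import Iterable, List, Tuple


def overlapping_ranges(cidrs: Iterable[str]) -> List[Tuple[str, str]]:
    """Return every pair of CIDR blocks that share at least one address."""
    networks = [ip_network(cidr) for cidr in cidrs]
    return [
        (str(a), str(b))
        for a, b in combinations(networks, 2)
        if a.overlaps(b)
    ]
```

Running it over `["10.0.0.0/16", "10.1.0.0/16", "10.0.128.0/17"]` reports the first and third blocks as a conflict, which is the kind of clash that breaks routing once two networks are peered.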
- Workload Architecture
- How do you design your workload service architecture?
- Choose how to segment your workload
- Build services focused on specific business domains and functionality
- Provide service contracts per API
- How do you design interactions in a distributed system to prevent failures?
- Identify which kind of distributed system is required
- Implement loosely coupled dependencies
- Make all responses idempotent
- Do constant work
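One common way to approach "make all responses idempotent" is an idempotency key supplied by the client. The Python sketch below is a minimal illustration with an in-memory store; the class name and the key scheme are assumptions, and a real service would need a durable, shared store.

```python
from typing import Callable, Dict


class IdempotentHandler:
    """Run an operation at most once per idempotency key and replay its result."""

    def __init__(self, operation: Callable[[dict], dict]) -> None:
        self._operation = operation
        self._results: Dict[str, dict] = {}  # in-memory only, for illustration

    def handle(self, idempotency_key: str, request: dict) -> dict:
        # A repeated key returns the stored response instead of redoing the work,
        # so a client retry cannot trigger the side effect twice.
        if idempotency_key not in self._results:
            self._results[idempotency_key] = self._operation(request)
        return self._results[idempotency_key]
```

With this in place a client can safely retry a timed-out request, as long as it reuses the same key.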
- How do you design interactions in a distributed system to mitigate or withstand failures?
- Implement graceful degradation to transform applicable hard dependencies into soft dependencies
- Throttle requests
- Control and limit retry calls
- Fail fast and limit queues
- Set client timeouts
- Make services stateless where possible
- Implement emergency levers
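"Control and limit retry calls" and "fail fast" are often implemented as a bounded retry loop with exponential backoff and jitter. The sketch below is one such approach, not something the framework mandates; `call_with_retries` and its parameters are hypothetical, and the wrapped operation is assumed to enforce its own client timeout.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    max_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying failures a limited number of times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            # Exponential backoff capped at max_delay, with full jitter so
            # many clients do not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists only so the delay can be replaced in tests. A production version would usually retry only on errors known to be transient.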
- Change Management
- How do you monitor workload resources?
- Monitor all components for the workload (Generation)
- Define and calculate metrics (Aggregation)
- Send notifications (Real-time processing and alarming)
- Automate responses (Real-time processing and alarming)
- Storage and Analytics
- Conduct reviews regularly
- Monitor end-to-end tracing of requests through your system
- How do you design your workload to adapt to changes in demand?
- Use automation when obtaining or scaling resources
- Obtain resources upon detection of impairment to a workload
- Obtain resources upon detection that more resources are needed for a workload
- Load test your workload
- How do you implement change?
- Use runbooks for standard activities such as deployment
- Integrate functional testing as part of your deployment
- Integrate resiliency testing as part of your deployment
- Deploy using immutable infrastructure
- Deploy changes with automation
- Failure Management
- How do you back up data?
- Identify and back up all data that needs to be backed up, or reproduce the data from sources
- Secure and encrypt backups
- Perform data backup automatically
- Perform periodic recovery of the data to verify backup integrity and processes
- How do you use fault isolation to protect your workload?
- Deploy the workload to multiple locations
- Automate recovery for components constrained to a single location
- Use bulkhead architectures to limit scope of impact
- How do you design your workload to withstand component failures?
- Monitor all components of the workload to detect failures
- Fail over to healthy resources
- Automate healing on all layers
- Use static stability to prevent bimodal behaviour
- Send notifications when events impact availability
- How do you test reliability?
- Use playbooks to investigate failures
- Perform post-incident analysis
- Test functional requirements
- Test scaling and performance requirements
- Test resiliency using chaos engineering
- Conduct game days regularly
- How do you plan for disaster recovery (DR)?
- Define recovery objectives for downtime and data loss
- Use defined recovery strategies to meet the recovery objectives
- Test disaster recovery implementation to validate the implementation
- Manage configuration drift at the DR site or region
- Automate recovery
Performance Efficiency
- Design Principles
- Democratize advanced technologies
- Go global in minutes
- Use serverless architectures
- Experiment more often
- Consider mechanical sympathy
- Best Practices
- Selection (Compute, Storage, Database, Network)
- How do you select the best performing architecture?
- Understand the available services and resources
- Define a process for architectural choices
- Factor cost requirements into decisions
- Use policies or reference architectures
- Use guidance from your cloud provider or an appropriate partner
- Benchmark existing workloads
- Load test your workload
- How do you select your compute solution?
- Evaluate the available compute options
- Understand the available compute configuration options
- Collect compute-related metrics
- Determine the required configuration by right-sizing
- Use the available elasticity of resources
- Re-evaluate compute needs based on metrics
- How do you select your storage solution?
- Understand storage characteristics and requirements
- Evaluate available configuration options
- Make decisions based on access patterns and metrics
- How do you select your database solution?
- Understand data characteristics
- Evaluate the available options
- Collect and record database performance metrics
- Choose data storage based on access patterns
- Optimize data storage based on access patterns and metrics
- How do you configure your networking solution?
- Understand how networking impacts performance
- Evaluate available networking features
- Choose appropriately sized dedicated connectivity or VPN for hybrid workloads
- Leverage load-balancing and encryption offloading
- Choose network protocols to improve performance
- Choose your workload’s location based on network requirements
- Optimize network configuration based on metrics
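Right-sizing from collected metrics can be sketched as "pick the smallest option that covers the 95th percentile of demand with some headroom". The Python below is an illustration only: the size names, vCPU figures and 60% utilisation target are invented, and real sizing would weigh memory, network and cost as well.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sample set."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def right_size(cpu_samples: Sequence[float], sizes: Dict[str, float],
               target_utilisation: float = 0.6) -> str:
    """Smallest size whose capacity keeps p95 demand at or below the target utilisation.

    `cpu_samples` are vCPUs actually used; `sizes` maps a hypothetical
    size name to its vCPU count.
    """
    needed = percentile(cpu_samples, 95) / target_utilisation
    for name, vcpus in sorted(sizes.items(), key=lambda kv: kv[1]):
        if vcpus >= needed:
            return name
    raise ValueError("no size is large enough for the observed demand")
```

Re-running the same calculation on fresh metrics at regular intervals covers the "re-evaluate compute needs based on metrics" item.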
- Review
- How do you evolve your workload to take advantage of new releases?
- Stay up-to-date on new resources and services
- Define a process to improve workload performance
- Evolve workload performance over time
- Monitoring
- How do you monitor your resources to ensure they are performing?
- Record performance-related metrics
- Analyse metrics when events or incidents occur
- Establish Key Performance Indicators (KPIs) to measure workload performance
- Use monitoring to generate alarm-based notifications
- Review metrics at regular intervals
- Monitor and alarm proactively
- Trade-offs
- How do you use trade-offs to improve performance?
- Understand the areas where performance is most critical
- Learn about design patterns and services
- Identify how trade-offs impact customers and efficiency
- Measure the impact of performance improvements
- Use various performance-related strategies
Cost Optimisation
- Design Principles
- Implement cloud financial management
- Adopt a consumption model
- Measure overall efficiency
- Stop spending money on undifferentiated heavy lifting
- Analyse and attribute expenditure
- Best Practices
- Practice cloud financial management
- How do you implement cloud financial management?
- Establish a cost optimization function
- Establish a partnership between finance and technology
- Establish cloud budgets and forecasts
- Implement cost awareness in your organizational processes
- Report and notify on cost optimization
- Monitor cost proactively
- Keep up to date with new service releases
- Expenditure and usage awareness
- How do you govern usage?
- Develop policies based on your organization requirements
- Implement goals and targets
- Implement an account structure
- Implement groups and roles
- Implement cost controls
- Track project lifecycle
- How do you monitor usage and cost?
- Configure detailed information sources
- Identify cost attribution categories
- Establish organization metrics
- Configure billing and cost management tools
- Add organization information to cost and usage
- Allocate costs based on workload metrics
- How do you decommission resources?
- Track resources over their lifetime
- Implement a decommissioning process
- Decommission resources
- Decommission resources automatically
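Cost attribution usually comes down to grouping spend by a tag or category. The sketch below shows the idea on plain Python dictionaries; the line-item shape (`cost` and `tags` keys) is an assumption for illustration and is not the format of any AWS billing export.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def cost_by_tag(line_items: Iterable[Mapping], tag_key: str) -> Dict[str, float]:
    """Sum cost per value of `tag_key`; items without the tag land under 'untagged'."""
    totals: Dict[str, float] = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)
```

The size of the `untagged` bucket is a useful metric in its own right, since spend that cannot be attributed cannot be governed.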
- Cost-effective resources
- How do you evaluate cost when you select services?
- Identify organization requirements for cost
- Analyse all components of this workload
- Perform a thorough analysis of each component
- Select software with cost effective licensing
- Select components of this workload to optimize cost in line with organization priorities
- Perform cost analysis for different usage over time
- How do you meet cost targets when you select resource type, size and number?
- Perform cost modelling
- Select resource type and size based on data
- Select resource type and size automatically based on metrics
- How do you use pricing models to reduce cost?
- Perform pricing model analysis
- Implement regions based on cost
- Select third party agreements with cost efficient terms
- Implement pricing models for all components of this workload
- Perform pricing model analysis at the master account level
- How do you plan for data transfer charges?
- Perform data transfer modelling
- Select components to optimize data transfer cost
- Implement services to reduce data transfer costs
- Manage demand and supply resources
- How do you manage demand, and supply resources?
- Perform an analysis on the workload demand
- Implement a buffer or throttle to manage demand
- Supply resources dynamically
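A throttle for managing demand is often built as a token bucket. The following Python sketch is one such implementation under simple assumptions (single process, no locking); the class and parameter names are hypothetical.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow short bursts up to `capacity` while holding the average at `rate` per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill in proportion to the elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller should reject or queue the request
```

Requests that are refused can be rejected outright (a throttle) or placed on a queue for later processing (a buffer), which are the two options the item above names.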
- Optimise over time
- How do you evaluate new services?
- Develop a workload review process
- Review and analyse this workload regularly