malcolm.cloud

AWS Well-Architected Framework Summary

By Malcolm van Staden on 3rd August 2021

This is a bullet-form summary of the AWS Well-Architected Framework; the full framework is available to read here.

AWS Well Architected Framework Summary mind map

Operational Excellence

Design Principles
- Perform operations as code
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from all operational failures
Best Practices
- Organisation
  - How do you determine what your priorities are?
    - Evaluate external customer needs
    - Evaluate internal customer needs
    - Evaluate governance requirements
    - Evaluate compliance requirements
    - Evaluate threat landscape
    - Evaluate trade-offs
    - Manage benefits and risks
  - How do you structure your organization to support your business outcomes?
    - Resources have identified owners
    - Processes and procedures have identified owners
    - Operations activities have identified owners responsible for their performance
    - Team members know what they are responsible for
    - Mechanisms exist to identify responsibility and ownership
    - Mechanisms exist to request additions, changes, and exceptions
    - Responsibilities between teams are predefined or negotiated
  - How does your organizational culture support your business outcomes?
    - Executive Sponsorship
    - Team members are empowered to take action when outcomes are at risk
    - Escalation is encouraged
    - Communications are timely, clear, and actionable
    - Experimentation is encouraged
    - Team members are enabled and encouraged to maintain and grow their skill sets
    - Resource teams appropriately
    - Diverse opinions are encouraged and sought within and across teams
- Prepare
  - How do you design your workload so that you can understand its state?
    - Implement application telemetry
    - Implement and configure workload telemetry
    - Implement user activity telemetry
    - Implement dependency telemetry
    - Implement transaction traceability
  - How do you reduce defects, ease remediation, and improve flow into production?
    - Use version control
    - Test and validate changes
    - Use configuration management systems
    - Use build and deployment management systems
    - Perform patch management
    - Share design standards
    - Implement practices to improve code quality
    - Use multiple environments
    - Make frequent, small, reversible changes
  - How do you mitigate deployment risks?
    - Plan for unsuccessful changes
    - Test and validate changes
    - Use deployment management systems
    - Test using limited deployments
    - Deploy using parallel environments
    - Deploy frequent, small, reversible changes
    - Fully automate integration and deployment
    - Automate testing and rollback
  - How do you know that you are ready to support a workload?
    - Ensure personnel capability
    - Ensure consistent review of operational readiness
    - Use runbooks to perform procedures
    - Use playbooks to investigate issues
    - Make informed decisions to deploy systems and changes
- Operate
  - How do you understand the health of your workload?
    - Identify key performance indicators
    - Define workload metrics
    - Collect and analyse workload metrics
    - Establish workload metrics baselines
    - Learn expected patterns of activity for workload
    - Alert when workload outcomes are at risk
    - Alert when workload anomalies are detected
    - Validate the achievement of outcomes and the effectiveness of KPIs and metrics
  - How do you understand the health of your operations?
    - Identify key performance indicators
    - Define operations metrics
    - Collect and analyse operations metrics
    - Establish operations metrics baselines
    - Learn the expected patterns of activity for operations
    - Alert when operations outcomes are at risk
    - Alert when operations anomalies are detected
    - Validate the achievement of outcomes and the effectiveness of KPIs and metrics
  - How do you manage workload and operations events?
    - Use processes for event, incident, and problem management
    - Have a process per alert
    - Prioritize operational events based on business impact
    - Define escalation paths
    - Enable push notifications
    - Communicate status through dashboards
    - Automate responses to events
- Evolve
  - How do you evolve operations?
    - Have a process for continuous improvement
    - Perform post-incident analysis
    - Implement feedback loops
    - Perform Knowledge Management
    - Define drivers for improvement
    - Validate insights
    - Perform operations metrics reviews
    - Document and share lessons learned
    - Allocate time to make improvements
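
Two of the Operate practices above, establishing workload metrics baselines and alerting when anomalies are detected, can be illustrated with a small Python sketch. This is only one simple way to do it, not something the framework prescribes: the function names, the sample latencies, and the three-standard-deviation threshold are all assumptions made for illustration.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarise historical metric samples as (mean, sample standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, threshold=3.0):
    """Return True when value lies more than `threshold` standard deviations from the mean."""
    average, spread = baseline
    if spread == 0:
        # A perfectly flat history: any different value counts as an anomaly.
        return value != average
    return abs(value - average) / spread > threshold


# Hypothetical request latencies in milliseconds (illustrative data only).
history = [102, 98, 101, 99, 100, 97, 103, 100]
baseline = build_baseline(history)

for latency in (104, 110):
    status = "ALERT" if is_anomalous(latency, baseline) else "ok"
    print(f"latency={latency}ms -> {status}")
```

In practice a managed monitoring service would usually handle this; the sketch only shows the idea of comparing new data points against a learned baseline.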

Security

Design Principles
- Implement a strong identity foundation
- Enable traceability
- Apply security at all layers
- Automate security best practices
- Protect data in transit and at rest
- Keep people away from data
- Prepare for security events
Best Practices
- Security
  - How do you securely operate your workload?
    - Separate workloads using accounts
    - Secure AWS account
    - Identify and validate control objectives
    - Keep up to date with security threats
    - Keep up to date with security recommendations
    - Automate testing and validation of security controls in pipelines
    - Identify and prioritize risks using a threat model
    - Evaluate and implement new security services and features regularly
- Identity and Access Management (IAM)
  - How do you manage authentication for people and machines?
    - Use strong sign-in mechanisms
    - Use temporary credentials
    - Store and use secrets securely
    - Rely on a centralized identity provider
    - Audit and rotate credentials periodically
    - Leverage user groups and attributes
  - How do you manage permissions for people and machines?
    - Define access requirements
    - Grant least privilege access
    - Establish emergency access process
    - Reduce permissions continuously
    - Define permission guardrails for your organization
    - Manage access based on life cycle
    - Analyse public and cross account access
    - Share resources securely
- Detection
  - How do you detect and investigate security events?
    - Configure service and application logging
    - Analyse logs, findings, and metrics centrally
    - Automate response to events
    - Implement actionable security events
- Infrastructure Protection
  - How do you protect your network resources?
    - Create network layers
    - Control traffic at all layers
    - Automate network protection
    - Implement inspection and protection
  - How do you protect your compute resources?
    - Perform vulnerability management
    - Reduce attack surface
    - Implement managed services
    - Automate compute protection
    - Enable people to perform actions at a distance
    - Validate software integrity
- Data Protection
  - How do you classify your data?
    - Identify the data within your workload
    - Define data protection controls
    - Automate identification and classification
    - Define data lifecycle management
  - How do you protect your data at rest?
    - Implement secure key management
    - Enforce encryption at rest
    - Automate data at rest protection
    - Enforce access control
    - Use mechanisms to keep people away from data
  - How do you protect your data in transit?
    - Implement secure key and certificate management
    - Enforce encryption in transit
    - Automate detection of unintended data access
    - Authenticate network communications
- Incident Response
  - How do you anticipate, respond to, and recover from incidents?
    - Identify key personnel and external resources
    - Develop incident management plans
    - Prepare forensic capabilities
    - Automate containment capability
    - Pre-provision access
    - Pre-deploy tools
    - Run game days
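
As a rough illustration of "Grant least privilege access" and "Reduce permissions continuously", the Python sketch below scans an IAM-style policy document for `Allow` statements that use wildcard actions or resources. It is a simplified check under stated assumptions: the helper names and the example policy are hypothetical, and it ignores `NotAction`, `NotResource`, and conditions. Some actions legitimately need `"Resource": "*"`, so a flagged statement is a prompt for review rather than proof of a problem.

```python
def _as_list(value):
    """IAM policy fields may hold a single value or a list; normalise to a list."""
    if value is None:
        return []
    return [value] if isinstance(value, (str, dict)) else list(value)


def find_wildcard_statements(policy):
    """Return identifiers of Allow statements with wildcard actions or resources."""
    flagged = []
    for index, statement in enumerate(_as_list(policy.get("Statement"))):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action"))
        resources = _as_list(statement.get("Resource"))
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        broad_resource = "*" in resources
        if broad_action or broad_resource:
            flagged.append(statement.get("Sid", f"statement-{index}"))
    return flagged


# Hypothetical policy used only to exercise the check.
example_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "ReadReports", "Effect": "Allow",
         "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-reports/*"},
        {"Sid": "AdminEverything", "Effect": "Allow",
         "Action": "*", "Resource": "*"},
        {"Sid": "DenyDelete", "Effect": "Deny",
         "Action": "s3:*", "Resource": "*"},
    ],
}

print(find_wildcard_statements(example_policy))
```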

Reliability

Design Principles
- Automatically recover from failure
- Test recovery procedures
- Scale horizontally to increase aggregate workload availability
- Stop guessing capacity
- Manage change in automation
Best Practices
- Foundations
  - How do you manage service quotas and constraints?
    - Be aware of service quotas and constraints
    - Manage service quotas across accounts and regions
    - Accommodate fixed service quotas and constraints through architecture
    - Monitor and manage quotas
    - Automate quota management
    - Ensure that a sufficient gap exists between the current quotas and the maximum usage to accommodate failover
  - How do you plan your network topology?
    - Use highly available network connectivity for your workload public endpoints
    - Provision redundant connectivity between private networks in the cloud and on-premises environments
    - Ensure IP subnet allocation accounts for expansion and availability
    - Prefer hub-and-spoke topologies over many-to-many mesh
    - Enforce non-overlapping private IP address ranges in all private address spaces where they are connected
- Workload Architecture
  - How do you design your workload service architecture?
    - Choose how to segment your workload
    - Build services focused on specific business domains and functionality
    - Provide service contracts per API
  - How do you design interactions in a distributed system to prevent failures?
    - Identify which kind of distributed system is required
    - Implement loosely coupled dependencies
    - Make all responses idempotent
    - Do constant work
  - How do you design interactions in a distributed system to mitigate or withstand failures?
    - Implement graceful degradation to transform applicable hard dependencies into soft dependencies
    - Throttle requests
    - Control and limit retry calls
    - Fail fast and limit queues
    - Set client timeouts
    - Make services stateless where possible
    - Implement emergency levers
- Change Management
  - How do you monitor workload resources?
    - Monitor all components for the workload (Generation)
    - Define and calculate metrics (Aggregation)
    - Send notifications (Real-time processing and alarming)
    - Automate responses (Real-time processing and alarming)
    - Storage and Analytics
    - Conduct reviews regularly
    - Monitor end-to-end tracing of requests through your system
  - How do you design your workload to adapt to changes in demand?
    - Use automation when obtaining or scaling resources
    - Obtain resources upon detection of impairment to a workload
    - Obtain resources upon detection that more resources are needed for a workload
    - Load test your workload
  - How do you implement change?
    - Use runbooks for standard activities such as deployment
    - Integrate functional testing as part of your deployment
    - Integrate resiliency testing as part of your deployment
    - Deploy using immutable infrastructure
    - Deploy changes with automation
- Failure Management
  - How do you back up data?
    - Identify and back up all data that needs to be backed up, or reproduce the data from sources
    - Secure and encrypt backups
    - Perform data backup automatically
    - Perform periodic recovery of the data to verify backup integrity and processes
  - How do you use fault isolation to protect your workload?
    - Deploy the workload to multiple locations
    - Automate recovery for components constrained to a single location
    - Use bulkhead architectures to limit scope of impact
  - How do you design your workload to withstand component failures?
    - Monitor all components of the workload to detect failures
    - Fail over to healthy resources
    - Automate healing on all layers
    - Use static stability to prevent bimodal behaviour
    - Send notifications when events impact availability
  - How do you test reliability?
    - Use playbooks to investigate failures
    - Perform post-incident analysis
    - Test functional requirements
    - Test scaling and performance requirements
    - Test resiliency using chaos engineering
    - Conduct game days regularly
  - How do you plan for disaster recovery (DR)?
    - Define recovery objectives for downtime and data loss
    - Use defined recovery strategies to meet the recovery objectives
    - Test disaster recovery implementation to validate the implementation
    - Manage configuration drift at the DR site or region
    - Automate recovery
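
"Control and limit retry calls" is commonly implemented with a bounded number of attempts and exponential backoff with jitter. The Python sketch below is a minimal illustration under assumptions: the function name, the default delays, and the injectable `sleep` parameter are hypothetical choices, and a real client would retry only errors known to be transient rather than every exception.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying failures with capped exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            # Simplification: production code should catch only transient error types.
            if attempt == max_attempts:
                raise  # Retry budget exhausted: surface the failure (fail fast).
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries out so clients do not retry in lockstep.
            sleep(random.uniform(0, cap))


def make_flaky(failures):
    """Build a hypothetical operation that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise ConnectionError("simulated transient failure")
        return "ok"

    return operation


# Record the delays instead of really sleeping, so the example runs instantly.
delays = []
result = call_with_retries(make_flaky(2), sleep=delays.append)
print(result, len(delays))
```

Capping both the attempt count and the delay keeps a struggling dependency from being overwhelmed by retries while still bounding how long a caller waits.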

Performance Efficiency

Design Principles
- Democratize advanced technologies
- Go global in minutes
- Use serverless architectures
- Experiment more often
- Consider mechanical sympathy
Best Practices
- Selection (Compute, Storage, Database, Network)
  - How do you select the best performing architecture?
    - Understand the available services and resources
    - Define a process for architectural choices
    - Factor cost requirements into decisions
    - Use policies or reference architectures
    - Use guidance from your cloud provider or an appropriate partner
    - Benchmark existing workloads
    - Load test your workload
  - How do you select your compute solution?
    - Evaluate the available compute options
    - Understand the available compute configuration options
    - Collect compute-related metrics
    - Determine the required configuration by right-sizing
    - Use the available elasticity of resources
    - Re-evaluate compute needs based on metrics
  - How do you select your storage solution?
    - Understand storage characteristics and requirements
    - Evaluate available configuration options
    - Make decisions based on access patterns and metrics
  - How do you select your database solution?
    - Understand data characteristics
    - Evaluate the available options
    - Collect and record database performance metrics
    - Choose data storage based on access patterns
    - Optimize data storage based on access patterns and metrics
  - How do you configure your networking solution?
    - Understand how networking impacts performance
    - Evaluate available networking features
    - Choose appropriately sized dedicated connectivity or VPN for hybrid workloads
    - Leverage load-balancing and encryption offloading
    - Choose network protocols to improve performance
    - Choose your workload’s location based on network requirements
    - Optimize network configuration based on metrics
- Review
  - How do you evolve your workload to take advantage of new releases?
    - Stay up-to-date on new resources and services
    - Define a process to improve workload performance
    - Evolve workload performance over time
- Monitoring
  - How do you monitor your resources to ensure they are performing?
    - Record performance-related metrics
    - Analyse metrics when events or incidents occur
    - Establish Key Performance Indicators (KPIs) to measure workload performance
    - Use monitoring to generate alarm-based notifications
    - Review metrics at regular intervals
    - Monitor and alarm proactively
- Trade-offs
  - How do you use trade-offs to improve performance?
    - Understand the areas where performance is most critical
    - Learn about design patterns and services
    - Identify how trade-offs impact customers and efficiency
    - Measure the impact of performance improvements
    - Use various performance-related strategies
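
The compute practices above ("Collect compute-related metrics", "Determine the required configuration by right-sizing", "Re-evaluate compute needs based on metrics") boil down to comparing observed utilisation against targets. The Python sketch below shows one possible shape for that check. The 20% and 80% thresholds, the use of the 95th percentile, and the function names are assumptions for illustration, not AWS guidance.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def rightsizing_hint(cpu_samples, low=20.0, high=80.0):
    """Suggest 'downsize', 'upsize', or 'keep' from CPU utilisation percentages."""
    p95 = percentile(cpu_samples, 95)
    if p95 < low:
        return "downsize"
    if p95 > high:
        return "upsize"
    return "keep"


# Hypothetical CPU utilisation samples (percent) for three instances.
idle_instance = [5, 8, 10, 12, 15, 9, 7, 11, 6, 14]
busy_instance = [60, 70, 85, 90, 95, 88, 92, 75, 80, 91]
steady_instance = [30, 40, 50, 45, 35]

for name, samples in [("idle", idle_instance),
                      ("busy", busy_instance),
                      ("steady", steady_instance)]:
    print(name, rightsizing_hint(samples))
```

A real right-sizing decision would also consider memory, network, and storage metrics, and a longer observation window.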

Cost Optimisation

Design Principles
- Implement cloud financial management
- Adopt a consumption model
- Measure overall efficiency
- Stop spending money on undifferentiated heavy lifting
- Analyse and attribute expenditure
Best Practices
- Practice cloud financial management
  - How do you implement cloud financial management?
    - Establish a cost optimization function
    - Establish a partnership between finance and technology
    - Establish cloud budgets and forecasts
    - Implement cost awareness in your organizational processes
    - Report and notify on cost optimization
    - Monitor cost proactively
    - Keep up to date with new service releases
- Expenditure and usage awareness
  - How do you govern usage?
    - Develop policies based on your organization requirements
    - Implement goals and targets
    - Implement an account structure
    - Implement groups and roles
    - Implement cost controls
    - Track project lifecycle
  - How do you monitor usage and cost?
    - Configure detailed information sources
    - Identify cost attribution categories
    - Establish organization metrics
    - Configure billing and cost management tools
    - Add organization information to cost and usage
    - Allocate costs based on workload metrics
  - How do you decommission resources?
    - Track resources over their lifetime
    - Implement a decommissioning process
    - Decommission resources
    - Decommission resources automatically
- Cost-effective resources
  - How do you evaluate cost when you select services?
    - Identify organization requirements for cost
    - Analyse all components of this workload
    - Perform a thorough analysis of each component
    - Select software with cost effective licensing
    - Select components of this workload to optimize cost in line with organization priorities
    - Perform cost analysis for different usage over time
  - How do you meet cost targets when you select resource type, size and number?
    - Perform cost modelling
    - Select resource type and size based on data
    - Select resource type and size automatically based on metrics
  - How do you use pricing models to reduce cost?
    - Perform pricing model analysis
    - Implement regions based on cost
    - Select third party agreements with cost efficient terms
    - Implement pricing models for all components of this workload
    - Perform pricing model analysis at the master account level
  - How do you plan for data transfer charges?
    - Perform data transfer modelling
    - Select components to optimize data transfer cost
    - Implement services to reduce data transfer costs
- Manage demand and supply resources
  - How do you manage demand, and supply resources?
    - Perform an analysis on the workload demand
    - Implement a buffer or throttle to manage demand
    - Supply resources dynamically
- Optimise over time
  - How do you evaluate new services?
    - Develop a workload review process
    - Review and analyse this workload regularly
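
"Identify cost attribution categories" and "Add organization information to cost and usage" usually come down to tagging resources and then grouping spend by tag. The Python sketch below shows that grouping on a hypothetical list of line items. The record layout (`resource`, `cost`, `tags`) and the `team` tag key are assumptions for illustration and do not reflect the schema of any AWS billing report.

```python
from collections import defaultdict


def costs_by_tag(line_items, tag_key, untagged_label="untagged"):
    """Sum the cost of each line item under the value of its `tag_key` tag."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, untagged_label)
        totals[owner] += item["cost"]
    return dict(totals)


# Hypothetical monthly line items in an arbitrary currency.
line_items = [
    {"resource": "web-server", "cost": 12.5, "tags": {"team": "storefront"}},
    {"resource": "orders-db", "cost": 7.5, "tags": {"team": "storefront"}},
    {"resource": "etl-job", "cost": 3.25, "tags": {"team": "analytics"}},
    {"resource": "old-volume", "cost": 1.75, "tags": {}},
]

report = costs_by_tag(line_items, "team")
for owner, total in sorted(report.items()):
    print(f"{owner}: {total:.2f}")
```

The `untagged` bucket is worth watching: spend that nobody owns is often a good candidate for the decommissioning practices listed above.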