Cloud Architecture Learning Roadmap
Master cloud architecture from foundational services through advanced multi-cloud strategies, security, and enterprise-scale infrastructure design
Duration: 30 weeks | 3 steps | 35 topics
Career Opportunities
- Cloud Architect
- Cloud Engineer
- Solutions Architect
- Cloud Infrastructure Engineer
- DevOps Architect
- Cloud Security Architect
Step 1: Cloud Fundamentals
Build a solid understanding of cloud computing models, core services across major providers, networking, identity management, and cost control
Time: 8 weeks | Level: beginner
- Cloud Computing Models (IaaS/PaaS/SaaS) (required) — Understand the three fundamental cloud service models and the shared responsibility model that defines security boundaries.
- IaaS provides virtualized infrastructure (compute, storage, networking) with maximum control and responsibility
- PaaS abstracts infrastructure management so developers focus on application code and data
- SaaS delivers complete applications over the internet with the provider managing everything
- The shared responsibility model defines which security tasks belong to the provider versus the customer at each level
- AWS Core Services (EC2, S3, VPC) (required) — Master the foundational AWS services for compute, storage, and networking that form the building blocks of cloud architectures.
- EC2 provides resizable compute capacity with instance types optimized for different workloads (compute, memory, GPU)
- S3 offers virtually unlimited object storage with multiple storage classes for cost optimization based on access patterns
- VPC enables isolated virtual networks with subnets, route tables, and security groups for network segmentation
- Availability Zones within regions provide physical redundancy for high-availability deployments
- Azure Core Services (required) — Learn Microsoft Azure's core compute, storage, and networking offerings and how they map to AWS equivalents for multi-cloud fluency.
- Azure Virtual Machines, App Service, and Azure Functions cover the compute spectrum from IaaS to serverless
- Azure Blob Storage, Disk Storage, and Azure Files address object, block, and file storage needs respectively
- Azure Virtual Networks, NSGs, and Azure Firewall provide layered network security and isolation
- Resource Groups and Management Groups organize Azure resources for access control and billing management
- GCP Core Services (required) — Explore Google Cloud Platform's core infrastructure services and its strengths in data analytics, AI, and Kubernetes-native computing.
- Compute Engine, Cloud Run, and Cloud Functions span the IaaS-to-serverless compute spectrum on GCP
- Cloud Storage provides unified object storage with automatic storage class transitions based on access frequency
- GKE (Google Kubernetes Engine) is an industry-leading managed Kubernetes service with Autopilot mode
- BigQuery offers serverless, petabyte-scale data analytics without infrastructure management
- IAM & Access Management (required) — Implement secure identity and access management using least-privilege policies, roles, federation, and multi-factor authentication.
- Least-privilege principle grants only the minimum permissions required for each user, role, or service
- IAM roles and service accounts enable applications to authenticate to cloud services without embedded credentials
- Federation with identity providers (Okta, Azure AD) enables single sign-on and centralized user management
- MFA and conditional access policies add additional security layers for privileged operations and sensitive resources
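As a concrete illustration of least privilege, the sketch below builds an IAM policy document granting read-only access to a single S3 bucket. The JSON shape (`Version`, `Statement`, `Action`, `Resource`) follows the AWS IAM policy format; the bucket name is a hypothetical placeholder.

```python
import json

def make_readonly_bucket_policy(bucket: str) -> str:
    """Least-privilege policy: list one bucket and read its objects, nothing else."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",       # bucket-level action
            },
            {
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",     # object-level action
            },
        ],
    }
    return json.dumps(policy, indent=2)

# "example-reports" is a placeholder, not a real bucket
print(make_readonly_bucket_policy("example-reports"))
```

Note the split between bucket-level (`s3:ListBucket`) and object-level (`s3:GetObject`) resources; granting `s3:*` on `*` would violate least privilege.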
- Networking in the Cloud (required) — Design cloud network architectures with VPCs, subnets, routing, DNS, and connectivity options for secure, performant applications.
- Public and private subnets separate internet-facing resources from internal backend services
- Security groups (stateful) and NACLs (stateless) provide layered network access control at different granularities
- Route 53, Azure DNS, and Cloud DNS manage domain resolution with health checks and traffic routing policies
- VPN and Direct Connect/ExpressRoute provide secure, private connectivity between on-premises and cloud environments
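The public/private subnet layout above is ultimately CIDR arithmetic. A minimal sketch using Python's standard `ipaddress` module, assuming a hypothetical 10.0.0.0/16 VPC carved into four /24 subnets (a public and a private subnet in each of two availability zones):

```python
import ipaddress

# Hypothetical VPC CIDR; real address plans should leave room for growth.
vpc = ipaddress.ip_network("10.0.0.0/16")

# Take the first four /24 subnets out of the /16.
subnets = list(vpc.subnets(new_prefix=24))[:4]

layout = {
    "public-az1": str(subnets[0]),   # internet-facing, route to an internet gateway
    "public-az2": str(subnets[1]),
    "private-az1": str(subnets[2]),  # backend, egress via NAT only
    "private-az2": str(subnets[3]),
}
for name, cidr in layout.items():
    print(name, cidr)
```

A /16 yields 256 possible /24 subnets, so this plan leaves ample unallocated space for additional tiers or AZs later.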
- Cloud Storage Solutions (recommended) — Choose the right storage service for each workload: object storage, block storage, file storage, and archival with lifecycle management.
- Object storage (S3, Blob, GCS) is ideal for unstructured data like images, backups, and static assets
- Block storage (EBS, Managed Disks) provides high-performance volumes for databases and applications requiring low latency
- Lifecycle policies automatically transition data between storage tiers and expire old objects to optimize costs
- Cross-region replication ensures data durability and availability for disaster recovery scenarios
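A lifecycle policy like the one described above can be sketched as a configuration document. The shape below follows the S3 lifecycle configuration format (as accepted by boto3's `put_bucket_lifecycle_configuration`); the prefix, day counts, and rule ID are illustrative choices, not recommendations.

```python
import json

# Illustrative rule: tier "logs/" objects to infrequent access at 30 days,
# archive them at 90, and delete them at 365.
lifecycle = {
    "Rules": [
        {
            "ID": "tier-and-expire-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},
        }
    ]
}
print(json.dumps(lifecycle, indent=2))
```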
- Billing & Cost Management (recommended) — Monitor, forecast, and optimize cloud spending using budgets, alerts, reserved capacity, and cost allocation best practices.
- Budget alerts and spending anomaly detection prevent unexpected charges from runaway resources
- Reserved Instances and Savings Plans offer 30-72% discounts for predictable, steady-state workloads
- Cost allocation tags enable per-project and per-team billing breakdowns for accountability
- Right-sizing recommendations identify over-provisioned resources that can be downsized without performance impact
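The reserved-capacity trade-off is simple arithmetic. A toy comparison, using made-up hourly rates (not real prices) for an always-on instance:

```python
HOURS_PER_YEAR = 8760

def annual_cost(hourly_rate: float, utilization: float = 1.0) -> float:
    """Yearly cost at a given hourly rate and fraction of hours actually used."""
    return hourly_rate * HOURS_PER_YEAR * utilization

on_demand = annual_cost(0.10)    # illustrative $0.10/hr on-demand rate
reserved = annual_cost(0.062)    # illustrative discounted rate, billed whether used or not

savings_pct = round(100 * (on_demand - reserved) / on_demand)
print(f"on-demand ${on_demand:.0f}/yr, reserved ${reserved:.0f}/yr, saves {savings_pct}%")
```

The key caveat: a reservation is billed regardless of utilization, so the discount only pays off for genuinely steady-state workloads; at low utilization, on-demand (or stopping instances) wins.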
- Cloud CLI Tools (recommended) — Manage cloud resources efficiently from the command line using AWS CLI, Azure CLI, and gcloud for scripting and automation.
- CLI tools enable scriptable, repeatable cloud operations that can be version-controlled and automated
- Named profiles and configuration files manage multiple accounts and regions without credential confusion
- Output formatting options (JSON, table, YAML) allow easy parsing and integration with other tools
- Shell scripts combining CLI commands automate complex multi-step provisioning and teardown workflows
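CLI JSON output is designed to be parsed in scripts. The sketch below filters an abbreviated, hand-written sample of `aws ec2 describe-instances --output json` output (real responses nest the same `Reservations`/`Instances` structure but carry many more fields):

```python
import json

# Abbreviated sample payload; instance IDs are placeholders.
sample = json.loads("""
{"Reservations": [
  {"Instances": [
    {"InstanceId": "i-0abc", "State": {"Name": "running"}},
    {"InstanceId": "i-0def", "State": {"Name": "stopped"}}
  ]}
]}
""")

# Collect the IDs of running instances only.
running = [
    inst["InstanceId"]
    for res in sample["Reservations"]
    for inst in res["Instances"]
    if inst["State"]["Name"] == "running"
]
print(running)  # ['i-0abc']
```

In practice the same filtering is often done inline with `--query` (JMESPath) or piped through `jq`; parsing in a script as above is useful when the result feeds further automation.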
- Cloud Certifications Overview (optional) — Navigate the cloud certification landscape to plan a certification path that validates skills and accelerates career growth.
- Foundational certifications (Cloud Practitioner, AZ-900) validate broad cloud knowledge for any role
- Associate-level certifications (Solutions Architect, Azure Administrator) prove hands-on implementation skills
- Professional/Expert certifications demonstrate advanced architecture and specialization expertise
- On-Premise vs Cloud (optional) — Evaluate the trade-offs between on-premises infrastructure and cloud adoption including cost, scalability, compliance, and operational considerations.
- Cloud eliminates upfront capital expenditure in favor of operational pay-as-you-go pricing
- On-premises may be more cost-effective for highly predictable, sustained workloads at large scale
- Data sovereignty and compliance requirements may mandate specific geographic or infrastructure controls
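The capex-vs-opex trade-off can be made concrete with a toy break-even model. All figures below are illustrative assumptions, not benchmarks:

```python
def cumulative_cost(months: int, capex: float, monthly: float) -> float:
    """Total spend after `months`: one-time capex plus recurring opex."""
    return capex + monthly * months

def break_even_month(capex: float, onprem_monthly: float, cloud_monthly: float) -> int:
    """First month at which on-premises total cost no longer exceeds cloud."""
    month = 1
    while cumulative_cost(month, capex, onprem_monthly) > cloud_monthly * month:
        month += 1
    return month

# Assumed: $120k hardware purchase, $2k/month to operate it,
# vs. $7k/month pay-as-you-go in the cloud for the same workload.
print(break_even_month(120_000, 2_000, 7_000))
```

Under these assumptions on-prem breaks even at month 24; a shorter hardware refresh cycle, staffing costs, or bursty demand would push the break-even point out or eliminate it entirely.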
Step 2: Cloud Architecture Design
Design resilient, scalable cloud architectures using well-architected frameworks, microservices, serverless patterns, and infrastructure as code
Time: 10 weeks | Level: intermediate
- Well-Architected Framework (required) — Apply the AWS Well-Architected Framework's six pillars to evaluate and improve cloud architecture decisions systematically.
- The six pillars are: Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability
- Well-Architected Reviews identify risks and improvement opportunities before they become production incidents
- Trade-offs between pillars (e.g., higher reliability may increase cost) require explicit architectural decisions
- AWS, Azure, and GCP all provide their own well-architected frameworks with shared core principles
- High Availability & Fault Tolerance (required) — Design systems that remain operational during component failures using redundancy, health checks, and automated recovery across availability zones and regions.
- Multi-AZ deployments protect against single data center failures with automatic failover
- Health checks and auto-healing automatically replace unhealthy instances without manual intervention
- Circuit breaker patterns prevent cascading failures by isolating failing dependencies
- RTO (Recovery Time Objective) and RPO (Recovery Point Objective) define acceptable downtime and data loss targets
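The circuit breaker pattern mentioned above can be sketched in a few lines. This is a minimal, single-threaded illustration (real implementations such as resilience4j or Envoy's outlier detection add half-open probing, metrics, and thread safety):

```python
import time

class CircuitBreaker:
    """Fail fast after `max_failures` consecutive errors; retry after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Open: don't even attempt the call, protecting the failing dependency.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit and resets the count
        return result
```

The point is that once a dependency is clearly down, continuing to call it only adds latency and load; failing fast contains the blast radius until the dependency recovers.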
- Scalability Patterns (required) — Implement horizontal and vertical scaling strategies using auto-scaling groups, read replicas, caching layers, and partitioning.
- Horizontal scaling adds more instances to distribute load, while vertical scaling increases individual instance capacity
- Auto-scaling groups dynamically adjust capacity based on CloudWatch metrics, schedules, or predictive scaling
- Caching layers (e.g., ElastiCache running Redis or Memcached) reduce database load and improve response times for read-heavy workloads
- Database sharding and partitioning distribute data across multiple nodes for write scalability
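The core calculation behind target-tracking auto-scaling is simple: scale capacity so the per-instance metric converges on a target value. A simplified sketch (real auto-scaling groups add cooldowns, warm-up periods, and smoothing):

```python
import math

def desired_capacity(current: int, metric: float, target: float,
                     lo: int = 1, hi: int = 20) -> int:
    """Capacity that would bring the per-instance metric back to `target`,
    clamped to the group's configured min/max bounds."""
    desired = math.ceil(current * metric / target)
    return max(lo, min(hi, desired))

# 4 instances at 90% average CPU against a 60% target -> scale out.
print(desired_capacity(4, metric=90.0, target=60.0))  # 6
# Same group idling at 30% -> scale in.
print(desired_capacity(4, metric=30.0, target=60.0))  # 2
```

Rounding up biases the system toward over-provisioning slightly, which is usually the right failure mode: brief excess capacity is cheaper than saturation.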
- Microservices on Cloud (required) — Design and deploy microservices architectures using containers, service discovery, and inter-service communication patterns on cloud platforms.
- Each microservice owns its data store and communicates through well-defined APIs or event streams
- Service discovery (Cloud Map, Consul) enables services to locate each other dynamically in elastic environments
- API versioning and backward compatibility strategies prevent breaking changes from disrupting dependent services
- Distributed tracing (X-Ray, Jaeger) provides end-to-end visibility across service boundaries for debugging
- Serverless Architecture (Lambda, Functions) (required) — Build event-driven applications using serverless compute that automatically scales to zero and charges only for actual execution time.
- Serverless functions execute in response to events (HTTP, queue messages, file uploads) without server management
- Cold starts add latency on first invocation; provisioned concurrency or warm-up strategies mitigate this
- Step Functions and Durable Functions orchestrate complex multi-step workflows across serverless components
- Serverless is cost-effective for sporadic, event-driven workloads but can become expensive for sustained high-throughput
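A serverless function is, at its core, just a handler invoked with an event. The sketch below follows the AWS Lambda handler signature and the API Gateway proxy integration's event/response field names; locally it runs as a plain function:

```python
import json

def handler(event, context=None):
    """Lambda-style handler for an API Gateway proxy event: reads an optional
    ?name= query parameter and returns a JSON greeting."""
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Invoke locally with a hand-built sample event.
print(handler({"queryStringParameters": {"name": "cloud"}}))
```

Because the handler is a pure function of its event, it can be unit-tested without any cloud infrastructure, one of the practical benefits of the model.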
- Database Services (RDS, DynamoDB, CosmosDB) (required) — Select and configure managed database services for relational, NoSQL, and specialized data workloads with high availability and scaling.
- RDS and Cloud SQL provide managed relational databases with automated backups, patching, and Multi-AZ failover
- DynamoDB and Cosmos DB offer serverless NoSQL with single-digit millisecond latency at any scale
- Read replicas and global tables distribute read traffic and provide cross-region data access
- Database selection depends on access patterns: relational for complex queries, NoSQL for key-value and document workloads
- Load Balancing & CDN (recommended) — Distribute traffic across healthy instances with load balancers and accelerate content delivery with CDNs for global performance.
- Application Load Balancers route HTTP/HTTPS traffic with path-based and host-based routing rules
- Network Load Balancers handle millions of connections per second for TCP/UDP workloads with ultra-low latency
- CDNs (CloudFront, Azure CDN, Cloud CDN) cache content at edge locations to reduce latency for global users
- SSL/TLS termination at the load balancer offloads encryption work from backend instances
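The interaction between health checks and traffic distribution can be shown with a toy round-robin balancer (target IPs below are placeholders; real load balancers probe targets actively rather than being told their status):

```python
import itertools

class LoadBalancer:
    """Round-robin over targets, skipping any that failed their health check."""

    def __init__(self, targets):
        self.targets = list(targets)
        self.health = {t: True for t in self.targets}
        self._cycle = itertools.cycle(self.targets)

    def mark(self, target, healthy: bool):
        self.health[target] = healthy

    def pick(self):
        # At most one full pass: if nothing is healthy, surface an error.
        for _ in range(len(self.targets)):
            t = next(self._cycle)
            if self.health[t]:
                return t
        raise RuntimeError("no healthy targets")

lb = LoadBalancer(["10.0.2.11", "10.0.2.12", "10.0.2.13"])
lb.mark("10.0.2.12", False)              # simulate a failed health check
print([lb.pick() for _ in range(4)])     # the unhealthy target never appears
```

Production balancers layer on connection draining, weighted targets, and least-outstanding-request algorithms, but the skip-the-unhealthy core is the same.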
- Message Queues (SQS, EventBridge, Pub/Sub) (recommended) — Decouple services with asynchronous messaging using queues, event buses, and pub/sub systems for resilient event-driven architectures.
- Message queues decouple producers and consumers, allowing independent scaling and fault isolation
- Dead-letter queues capture failed messages for debugging without blocking the main processing pipeline
- EventBridge and Pub/Sub enable fan-out patterns where a single event triggers multiple downstream consumers
- FIFO queues guarantee exactly-once processing and message ordering for order-sensitive workflows
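The dead-letter mechanism is easy to see in miniature. The sketch below simulates at-least-once delivery with a receive-count limit, after which a poison message is parked in the DLQ instead of blocking the pipeline (queue contents and the limit are illustrative):

```python
from collections import deque

def drain(queue: deque, process, max_receives: int = 3):
    """Process every message; redeliver failures until `max_receives`,
    then move them to the dead-letter queue."""
    dlq = []
    receives = {}
    while queue:
        msg = queue.popleft()
        try:
            process(msg)
        except Exception:
            receives[msg] = receives.get(msg, 0) + 1
            if receives[msg] >= max_receives:
                dlq.append(msg)       # give up: park for offline debugging
            else:
                queue.append(msg)     # redeliver on a later pass
    return dlq

def process(msg):
    if msg == "bad-payload":
        raise ValueError("cannot parse")

q = deque(["a", "bad-payload", "b"])
dlq = drain(q, process)
print(dlq)  # ['bad-payload']
```

Without the DLQ, the poison message would be redelivered forever, wasting compute and potentially starving well-formed messages behind it.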
- API Gateway Design (recommended) — Expose backend services through managed API gateways with authentication, rate limiting, caching, and request transformation.
- API Gateways provide a single entry point that handles authentication, throttling, and request routing
- Usage plans and API keys enable monetization and rate limiting for different consumer tiers
- Request and response transformations adapt backend services to client-expected API contracts
- Caching at the gateway layer reduces backend load for frequently accessed, slowly changing data
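Gateway throttling is commonly implemented as a token bucket: a steady refill rate sets the sustained limit, while the bucket capacity allows short bursts. A minimal single-client sketch (the rate and capacity are arbitrary example values):

```python
import time

class TokenBucket:
    """Allow a request if a token is available; refill at `rate` tokens/sec
    up to `capacity` (the permitted burst size)."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, capacity=5)        # 1 req/s sustained, bursts of 5
results = [bucket.allow() for _ in range(6)]
print(results)  # burst of 5 allowed, 6th throttled
```

A real gateway keeps one bucket per API key or client, which is how usage plans give different consumer tiers different limits.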
- Infrastructure as Code (Terraform) (optional) — Define and provision cloud infrastructure declaratively using Terraform for repeatable, version-controlled, multi-cloud deployments.
- Declarative HCL configuration describes desired infrastructure state; Terraform plans and applies changes incrementally
- State files track resource mappings and must be stored remotely (S3, GCS) with locking for team collaboration
- Modules encapsulate reusable infrastructure patterns for consistent provisioning across environments
- Multi-provider support enables managing AWS, Azure, GCP, and third-party resources from a single codebase
- Cloud Networking Advanced (VPC Peering, Transit Gateway) (optional) — Connect multiple VPCs and accounts using peering, transit gateways, and private endpoints for enterprise-scale network architectures.
- VPC Peering provides direct, non-transitive connections between two VPCs with no bandwidth bottleneck
- Transit Gateway acts as a central hub connecting hundreds of VPCs and on-premises networks with simplified routing
- PrivateLink and Private Endpoints access cloud services over private IP addresses without traversing the public internet
- Cost Optimization Strategies (optional) — Apply advanced cost optimization techniques including spot instances, committed use discounts, architecture rightsizing, and waste elimination.
- Spot instances provide up to 90% savings for fault-tolerant, interruptible workloads like batch processing
- Committed use discounts (Savings Plans, Reserved Instances) reduce costs for steady-state production workloads
- Automated scheduling stops non-production resources outside business hours to eliminate idle spend
Step 3: Advanced Cloud Solutions
Architect enterprise-grade cloud solutions with multi-cloud strategies, migration planning, security architecture, and operational excellence at scale
Time: 12 weeks | Level: advanced
- Multi-Cloud Strategy (required) — Design architectures that leverage multiple cloud providers for resilience, vendor flexibility, and best-of-breed service selection.
- Multi-cloud avoids vendor lock-in and enables selecting the best service from each provider for specific workloads
- Abstraction layers (Terraform, Kubernetes) provide portability but add complexity and limit provider-specific features
- Data gravity and egress costs make data placement one of the most critical multi-cloud architecture decisions
- Unified identity and access management across providers requires federation and consistent policy enforcement
- Cloud Migration (6 R's) (required) — Plan and execute cloud migrations using the 6 R's framework: Rehost, Replatform, Refactor, Repurchase, Retire, and Retain.
- Rehosting (lift-and-shift) provides the fastest migration path with minimal code changes but limited cloud optimization
- Refactoring re-architects applications to leverage cloud-native services for maximum scalability and cost efficiency
- Portfolio assessment prioritizes applications by business value, technical complexity, and migration readiness
- Wave planning groups related applications for phased migration with clear dependencies and rollback procedures
- Disaster Recovery & Business Continuity (required) — Design disaster recovery architectures that meet business RTO/RPO requirements using backup, pilot light, warm standby, and multi-region active-active patterns.
- DR strategies range from backup-restore (hours RTO) to multi-region active-active (near-zero RTO) with increasing cost
- Pilot light maintains minimal always-on infrastructure that can scale up rapidly during a disaster event
- Regular DR testing through chaos engineering and gameday exercises validates recovery procedures before real incidents
- Automated runbooks reduce human error and recovery time during high-stress disaster scenarios
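The cost/RTO trade-off above can be captured as a selection table. The RTO thresholds below are illustrative planning heuristics, not prescriptive values:

```python
# Cheapest common DR pattern that still meets a given RTO requirement,
# ordered from most to least aggressive (and most to least expensive).
DR_STRATEGIES = [
    (1.0, "multi-region active-active"),   # near-zero RTO, full duplicate stack
    (60.0, "warm standby"),                # scaled-down copy always running
    (240.0, "pilot light"),                # minimal core kept warm, scale up on failover
    (float("inf"), "backup and restore"),  # hours of RTO, lowest standing cost
]

def choose_strategy(rto_minutes: float) -> str:
    """Pick the least expensive strategy whose worst-case RTO fits the requirement."""
    for max_rto, strategy in DR_STRATEGIES:
        if rto_minutes <= max_rto:
            return strategy
    return "backup and restore"

print(choose_strategy(30))    # warm standby
print(choose_strategy(1440))  # backup and restore
```

In practice the decision also weighs RPO, data replication costs, and regulatory constraints, but anchoring it to the business's stated RTO keeps the conversation honest.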
- Cloud Security Architecture (required) — Design defense-in-depth security architectures with network segmentation, encryption, threat detection, and incident response on cloud platforms.
- Defense in depth layers security controls at network, identity, application, and data levels for comprehensive protection
- Encryption at rest (KMS, managed keys) and in transit (TLS) protects data throughout its lifecycle
- Security services (GuardDuty, Security Center, Security Command Center) provide continuous threat detection and alerting
- Security automation through Infrastructure as Code ensures consistent security baselines across all environments
- Container Orchestration (ECS/EKS/GKE) (required) — Deploy and manage containerized applications at scale using managed Kubernetes services and container orchestration platforms.
- Managed Kubernetes services (EKS, AKS, GKE) handle control plane management, patching, and high availability
- Pods, Deployments, and Services are the core Kubernetes abstractions for running and exposing containerized workloads
- Helm charts package Kubernetes manifests for templated, versioned, and reusable application deployments
- Fargate and Cloud Run provide serverless container execution without managing underlying node infrastructure
- Cloud-Native CI/CD (required) — Build continuous integration and deployment pipelines using cloud-native services for automated building, testing, and releasing of applications.
- Cloud-native CI/CD services (CodePipeline, Azure Pipelines, Cloud Build) integrate tightly with their respective platforms
- Blue/green and canary deployment strategies reduce risk by gradually shifting traffic to new versions
- Container image scanning and policy enforcement in the pipeline prevent deploying vulnerable or non-compliant images
- GitOps (ArgoCD, Flux) uses Git repositories as the single source of truth for declarative infrastructure and application state
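A canary rollout is fundamentally a traffic-shifting schedule. The sketch below generates a linear schedule (step size and interval are example parameters; deployment services typically also define rollback triggers tied to error-rate alarms):

```python
def canary_schedule(step_pct: int = 10, interval_min: int = 5):
    """List of (minutes elapsed, % of traffic on the new version) checkpoints
    for a linear shift ending at 100%."""
    return [
        (i * interval_min, min(100, i * step_pct))
        for i in range(1, 100 // step_pct + 1)
    ]

for minute, pct in canary_schedule(step_pct=25, interval_min=10):
    print(f"t+{minute}min: {pct}% on new version")
```

Between checkpoints, automated health checks on the canary's error rate and latency decide whether to proceed or roll back; the smaller the step, the smaller the blast radius of a bad release.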
- Service Mesh (Istio/App Mesh) (recommended) — Implement service mesh infrastructure for advanced traffic management, observability, and security between microservices without application code changes.
- Sidecar proxies (Envoy) intercept all network traffic between services for transparent policy enforcement
- Traffic management features enable canary releases, traffic mirroring, and fault injection for resilience testing
- Mutual TLS (mTLS) automatically encrypts all service-to-service communication with identity-based authentication
- Distributed tracing and metrics collection provide deep observability into inter-service communication patterns
- Observability & Monitoring (CloudWatch, Cloud Monitoring/Stackdriver) (recommended) — Implement comprehensive observability with metrics, logs, and traces across cloud infrastructure and applications for proactive operations.
- The three pillars of observability (metrics, logs, traces) provide complementary views of system health and behavior
- Custom dashboards aggregate key metrics across services for at-a-glance operational awareness
- Log aggregation and structured logging enable rapid root cause analysis across distributed systems
- SLIs, SLOs, and error budgets quantify service reliability targets and guide operational priorities
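Error budgets fall out of the SLO arithmetically: whatever fraction of the window the SLO does not promise is the budget for downtime. A small sketch over a 30-day window:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability per window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return round((1 - slo) * total_minutes, 1)

print(error_budget_minutes(0.999))   # "three nines": ~43.2 min/month
print(error_budget_minutes(0.9999))  # "four nines": ~4.3 min/month
```

The order-of-magnitude jump per added "nine" is why tightening an SLO is an expensive decision: each extra nine cuts the budget for deployments, maintenance, and incidents by 10x.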
- FinOps & Cloud Governance (recommended) — Establish organizational governance frameworks for cloud spending, compliance policies, resource management, and cross-team accountability.
- FinOps brings financial accountability to cloud spending through collaboration between engineering, finance, and business teams
- Organizational units, SCPs, and guardrails enforce security and compliance policies across multiple accounts
- Tagging strategies and cost allocation enable showback/chargeback models for team-level accountability
- Regular cloud optimization reviews identify waste and reallocate resources based on changing business priorities
- Edge Computing & IoT (optional) — Extend cloud architectures to the edge with IoT device management, edge compute, and data processing closer to the source.
- Edge computing processes data close to the source for low-latency responses and reduced bandwidth costs
- IoT device management platforms handle provisioning, monitoring, and OTA updates for device fleets at scale
- Edge-to-cloud data pipelines filter and aggregate data locally before sending summaries to the cloud
- Hybrid Cloud Architecture (optional) — Design architectures that span on-premises data centers and cloud environments with consistent management, networking, and security.
- Hybrid architectures keep latency-sensitive or compliance-restricted workloads on-premises while leveraging cloud for burst capacity
- Consistent tooling (Azure Arc, Anthos, EKS Anywhere) provides unified management across on-premises and cloud environments
- Dedicated connectivity (Direct Connect, ExpressRoute) provides reliable, low-latency links between on-premises and cloud
- Cloud Compliance & Auditing (optional) — Meet regulatory compliance requirements (HIPAA, PCI-DSS, SOC 2, GDPR) on cloud platforms with automated auditing and evidence collection.
- Cloud providers offer compliance certifications but customers must ensure their own configurations meet requirements
- AWS Config, Azure Policy, and Organization Policies continuously audit resource configurations against compliance rules
- Audit logs (CloudTrail, Activity Log, Audit Logs) provide tamper-proof records of all API actions for forensic analysis
