Skip to content

Latest commit

 

History

History
139 lines (108 loc) · 4.37 KB

File metadata and controls

139 lines (108 loc) · 4.37 KB

Production Readiness Checklist

✅ Security Best Practices

Network Security

  • Private Cluster: private_cluster_enabled = true - No public API endpoint
  • Network Isolation: Hub-spoke architecture with controlled peering
  • NSG Rules: Restrictive inbound rules, only allowing required ports
  • No Public IPs: Nodes have no public IP addresses
  • Network Policies: Azure network policies enabled for pod-to-pod security

Identity & Access

  • Azure AD RBAC: Enabled for cluster access control
  • Workload Identity: Enabled for pod-level Azure authentication
  • Managed Identity: System-assigned identity for cluster operations
  • Key Vault Integration: Secrets managed through Azure Key Vault

Data Protection

  • Encryption at Rest: Managed disks encrypted by default
  • Key Vault Access: Limited permissions (Get, List only)
  • ⚠️ Consider: Enable encryption at host for additional security

✅ High Availability

Cluster Configuration

  • Multi-Node: Minimum 3 system nodes, 4 Spark nodes
  • Auto-scaling: Enabled on both node pools
  • Availability Zones: Will use AZs if available in region
  • Separate Node Pools: System and workload separation

Networking

  • Load Balancer: Standard SKU with 2 outbound IPs
  • ExpressRoute: Redundant connectivity to on-premises
  • ⚠️ Consider: Add Azure Firewall in hub for additional control

✅ Monitoring & Operations

Observability

  • Container Insights: Full monitoring with Log Analytics
  • Log Retention: 30 days configured
  • Resource Tagging: Consistent tags for cost tracking

Maintenance

  • Auto-upgrade: Patch channel for Kubernetes versions
  • Maintenance Window: Sunday 2-6 AM configured
  • Pod Disruption Budgets: Should be configured in workloads

⚠️ Production Improvements Needed

1. Cluster Tier

# Current: sku_tier = "Free"
# Recommended for production:
sku_tier = "Standard"  # Provides uptime SLA

2. Availability Zones

# Add to default_node_pool and spark node pool:
zones = ["1", "2", "3"]

3. Disk Encryption

# Add to both node pools:
enable_host_encryption = true
os_disk_type = "Ephemeral"  # For stateless workloads

4. Backup Strategy

  • Configure Velero or Azure Backup for AKS
  • Implement persistent volume snapshots

5. Network Hardening

# Consider adding:
enable_private_cluster_public_fqdn = false
run_command_enabled = false  # Disable for production

6. Resource Limits

  • Set resource requests/limits on all pods
  • Configure pod security policies/standards

7. Monitoring Enhancements

  • Add Azure Monitor alerts
  • Configure diagnostic settings
  • Implement application performance monitoring

🔧 Deployment Prerequisites

  1. Hub VNet Configuration

    • Ensure hub VNet exists with name: net-eastus-hub
    • Verify ExpressRoute Gateway is deployed
    • Check peering permissions on hub resource group
  2. DNS Configuration

    • Configure DNS forwarders in hub if needed
    • Ensure on-premises DNS can resolve Azure private zones
  3. IP Address Management

    • Verify no IP conflicts with VNet ranges (10.0.1.0/24 - 10.0.4.0/24) and hub network
    • Confirm service CIDR (172.16.0.0/16) does not overlap with any connected networks
    • Document IP allocation for future growth

📋 Pre-deployment Checklist

  • Update hub_resource_group_name variable with actual value
  • Verify ExpressRoute circuit is active
  • Confirm hub VNet name is correct
  • Review and adjust Kubernetes version if needed
  • Plan for workload migration strategy
  • Configure backup solution
  • Set up monitoring alerts
  • Document disaster recovery procedures

🚀 Deployment Commands

# Set variables
export TF_VAR_hub_resource_group_name="your-actual-hub-rg"

# Plan with production settings
terraform plan -var="sku_tier=Standard" -out=tfplan.prod

# Apply
terraform apply tfplan.prod

# Post-deployment
az aks get-credentials --resource-group rg-aks-spark-prod --name aks-spark-cluster
kubectl get nodes

📊 Cost Optimization

  • Current setup uses Standard_D8s_v3 VMs
  • Consider spot instances for non-critical Spark workloads
  • Review autoscaling settings based on actual usage
  • Monitor and optimize outbound data transfer costs