- ✅ Private Cluster: `private_cluster_enabled = true` - no public API endpoint
- ✅ Network Isolation: Hub-spoke architecture with controlled peering
- ✅ NSG Rules: Restrictive inbound rules, only allowing required ports
- ✅ No Public IPs: Nodes have no public IP addresses
- ✅ Network Policies: Azure network policies enabled for pod-to-pod security
- ✅ Azure AD RBAC: Enabled for cluster access control
- ✅ Workload Identity: Enabled for pod-level Azure authentication
- ✅ Managed Identity: System-assigned identity for cluster operations
- ✅ Key Vault Integration: Secrets managed through Azure Key Vault
- ✅ Encryption at Rest: Managed disks encrypted by default
- ✅ Key Vault Access: Limited permissions (Get, List only)
- ⚠️ Consider: Enable encryption at host for additional security
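The settings above map onto the cluster resource roughly as follows. This is a minimal sketch in azurerm 3.x syntax; the resource names, location, and subnet reference are illustrative, not the actual module's:

```hcl
resource "azurerm_kubernetes_cluster" "aks" {
  name                = "aks-spark-cluster"
  location            = "eastus"
  resource_group_name = "rg-aks-spark-prod"
  dns_prefix          = "aks-spark"

  private_cluster_enabled   = true # no public API endpoint
  oidc_issuer_enabled       = true # prerequisite for workload identity
  workload_identity_enabled = true

  identity {
    type = "SystemAssigned" # managed identity for cluster operations
  }

  azure_active_directory_role_based_access_control {
    managed            = true
    azure_rbac_enabled = true # Azure AD RBAC for cluster access
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "azure" # pod-to-pod network policies
  }

  default_node_pool {
    name                   = "system"
    vm_size                = "Standard_D8s_v3"
    node_count             = 3
    enable_host_encryption = true # the encryption-at-host recommendation
    vnet_subnet_id         = azurerm_subnet.aks.id # assumed spoke subnet
  }
}
```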
- ✅ Multi-Node: Minimum 3 system nodes, 4 Spark nodes
- ✅ Auto-scaling: Enabled on both node pools
- ✅ Availability Zones: Will use AZs if available in region
- ✅ Separate Node Pools: System and workload separation
- ✅ Load Balancer: Standard SKU with 2 outbound IPs
- ✅ ExpressRoute: Redundant connectivity to on-premises
- ⚠️ Consider: Add Azure Firewall in the hub for additional control
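As a sketch of the availability settings (azurerm 3.x; the pool name and `max_count` ceiling are illustrative):

```hcl
resource "azurerm_kubernetes_cluster_node_pool" "spark" {
  name                  = "spark"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D8s_v3"
  zones                 = ["1", "2", "3"] # spread across availability zones
  enable_auto_scaling   = true
  min_count             = 4
  max_count             = 10 # illustrative ceiling
}

# Add to the cluster's network_profile:
load_balancer_sku = "standard"
load_balancer_profile {
  managed_outbound_ip_count = 2 # the two outbound IPs
}
```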
- ✅ Container Insights: Full monitoring with Log Analytics
- ✅ Log Retention: 30 days configured
- ✅ Resource Tagging: Consistent tags for cost tracking
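A hedged sketch of the workspace behind Container Insights, with the 30-day retention and cost-tracking tags (names and tag values are illustrative):

```hcl
resource "azurerm_log_analytics_workspace" "aks" {
  name                = "log-aks-spark"
  location            = "eastus"
  resource_group_name = "rg-aks-spark-prod"
  sku                 = "PerGB2018"
  retention_in_days   = 30 # the configured retention

  tags = {
    environment = "prod"
    costcenter  = "data-platform" # illustrative cost-tracking tag
  }
}

# Container Insights is wired up inside azurerm_kubernetes_cluster with:
oms_agent {
  log_analytics_workspace_id = azurerm_log_analytics_workspace.aks.id
}
```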
- ✅ Auto-upgrade: Patch channel for Kubernetes versions
- ✅ Maintenance Window: Sunday 2-6 AM configured
- ✅ Pod Disruption Budgets: Should be configured in workloads
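In azurerm 3.x the upgrade and maintenance settings look like the fragment below; the pod disruption budget uses the Terraform kubernetes provider, with a hypothetical namespace and labels:

```hcl
# Inside azurerm_kubernetes_cluster:
automatic_channel_upgrade = "patch"

maintenance_window {
  allowed {
    day   = "Sunday"
    hours = [2, 3, 4, 5] # the 2-6 AM window
  }
}

# A pod disruption budget for the Spark workloads (hypothetical labels):
resource "kubernetes_pod_disruption_budget_v1" "spark" {
  metadata {
    name      = "spark-executor-pdb"
    namespace = "spark"
  }
  spec {
    min_available = 1
    selector {
      match_labels = {
        app = "spark-executor"
      }
    }
  }
}
```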
```hcl
# Current: sku_tier = "Free"
# Recommended for production:
sku_tier = "Standard" # Provides uptime SLA
```

```hcl
# Add to default_node_pool and the Spark node pool:
zones = ["1", "2", "3"]
```

```hcl
# Add to both node pools:
enable_host_encryption = true
os_disk_type           = "Ephemeral" # For stateless workloads
```

- Configure Velero or Azure Backup for AKS
- Implement persistent volume snapshots
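For the volume-snapshot item, a minimal sketch using the Terraform kubernetes provider and the Azure Disk CSI driver that AKS ships with (the class name is hypothetical, and `kubernetes_manifest` needs a reachable cluster at plan time):

```hcl
resource "kubernetes_manifest" "azure_disk_snapshot_class" {
  manifest = {
    apiVersion = "snapshot.storage.k8s.io/v1"
    kind       = "VolumeSnapshotClass"
    metadata = {
      name = "azuredisk-snapclass" # hypothetical name
    }
    driver         = "disk.csi.azure.com" # AKS built-in Azure Disk CSI driver
    deletionPolicy = "Retain" # keep the backing snapshot if the VolumeSnapshot is deleted
  }
}
```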
```hcl
# Consider adding:
private_cluster_public_fqdn_enabled = false
run_command_enabled                 = false # Disable for production
```

- Set resource requests/limits on all pods
- Configure Pod Security Standards (Pod Security Policies were removed in Kubernetes 1.25)
- Add Azure Monitor alerts
- Configure diagnostic settings
- Implement application performance monitoring
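A sketch of diagnostic settings wired to the same Log Analytics workspace (resource references are illustrative; the categories are standard AKS log categories):

```hcl
resource "azurerm_monitor_diagnostic_setting" "aks" {
  name                       = "aks-diagnostics"
  target_resource_id         = azurerm_kubernetes_cluster.aks.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.aks.id

  enabled_log {
    category = "kube-audit"
  }

  enabled_log {
    category = "cluster-autoscaler"
  }

  metric {
    category = "AllMetrics"
  }
}
```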
- **Hub VNet Configuration**
  - Ensure the hub VNet exists with the name `vnet-eastus-hub`
  - Verify the ExpressRoute Gateway is deployed
  - Check peering permissions on the hub resource group (see the peering sketch after this checklist)
- **DNS Configuration**
  - Configure DNS forwarders in the hub if needed
  - Ensure on-premises DNS can resolve Azure private zones
- **IP Address Management**
  - Verify no IP conflicts between the VNet ranges (10.0.1.0/24 - 10.0.4.0/24) and the hub network
  - Confirm the service CIDR (172.16.0.0/16) does not overlap with any connected networks
  - Document IP allocation for future growth
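A hedged sketch of the spoke-to-hub peering the checklist assumes (the data-source and resource names are illustrative; `use_remote_gateways` requires the hub side to allow gateway transit):

```hcl
data "azurerm_virtual_network" "hub" {
  name                = "vnet-eastus-hub"
  resource_group_name = var.hub_resource_group_name
}

resource "azurerm_virtual_network_peering" "spoke_to_hub" {
  name                      = "peer-spoke-to-hub"
  resource_group_name       = "rg-aks-spark-prod"
  virtual_network_name      = azurerm_virtual_network.spoke.name # assumed spoke VNet
  remote_virtual_network_id = data.azurerm_virtual_network.hub.id
  allow_forwarded_traffic   = true
  use_remote_gateways       = true # route to on-premises via the hub's ExpressRoute gateway
}
```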
- Update the `hub_resource_group_name` variable with the actual value
- Verify the ExpressRoute circuit is active
- Confirm hub VNet name is correct
- Review and adjust Kubernetes version if needed
- Plan for workload migration strategy
- Configure backup solution
- Set up monitoring alerts
- Document disaster recovery procedures
```bash
# Set variables
export TF_VAR_hub_resource_group_name="your-actual-hub-rg"

# Plan with production settings
terraform plan -var="sku_tier=Standard" -out=tfplan.prod

# Apply
terraform apply tfplan.prod

# Post-deployment
az aks get-credentials --resource-group rg-aks-spark-prod --name aks-spark-cluster
kubectl get nodes
```

- Current setup uses Standard_D8s_v3 VMs
- Consider Spot instances for non-critical Spark workloads (see the Spot pool sketch below)
- Review autoscaling settings based on actual usage
- Monitor and optimize outbound data transfer costs
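For the Spot suggestion, a hedged sketch (azurerm 3.x; the pool name, sizing, and ceiling are illustrative):

```hcl
resource "azurerm_kubernetes_cluster_node_pool" "spark_spot" {
  name                  = "sparkspot"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks.id
  vm_size               = "Standard_D8s_v3"
  priority              = "Spot"
  eviction_policy       = "Delete"
  spot_max_price        = -1 # cap at the current on-demand price
  enable_auto_scaling   = true
  min_count             = 0
  max_count             = 4

  # Azure applies this taint to Spot pools; declaring it avoids plan drift
  # and keeps non-tolerant pods off evictable nodes.
  node_taints = ["kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]
}
```

Spark executors would then need a matching toleration, and only interruption-tolerant stages should be scheduled onto this pool.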