Infrastructure Documentation¶
Comprehensive guide to OVES cloud infrastructure, managed entirely through Infrastructure as Code (IaC) using Terraform.
Overview¶
The OVES infrastructure is built on AWS, spanning multiple regions with dedicated development and production environments. All infrastructure is defined, versioned, and deployed using Terraform, ensuring consistency, reproducibility, and easy disaster recovery.
Architecture Overview¶
```
┌────────────────────────────────────────────────────────┐
│                       AWS Cloud                        │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │               US-East-1 (Primary)                │  │
│  │                                                  │  │
│  │  ┌────────────────────┐  ┌────────────────────┐  │  │
│  │  │   Production VPC   │  │  Development VPC   │  │  │
│  │  │                    │  │                    │  │  │
│  │  │  ┌──────────────┐  │  │  ┌──────────────┐  │  │  │
│  │  │  │ EKS Cluster  │  │  │  │ EKS Cluster  │  │  │  │
│  │  │  │    (Prod)    │  │  │  │    (Dev)     │  │  │  │
│  │  │  │              │  │  │  │              │  │  │  │
│  │  │  │ - In-house   │  │  │  │ - Dev Apps   │  │  │  │
│  │  │  │   Apps       │  │  │  │ - 3rd Party  │  │  │  │
│  │  │  │ - Select     │  │  │  │   Services   │  │  │  │
│  │  │  │   3rd Party  │  │  │  │              │  │  │  │
│  │  │  └──────────────┘  │  │  └──────────────┘  │  │  │
│  │  │                    │  │                    │  │  │
│  │  │  ┌──────────────┐  │  │  ┌──────────────┐  │  │  │
│  │  │  │     EC2      │  │  │  │     EC2      │  │  │  │
│  │  │  │  Instances   │  │  │  │  Instances   │  │  │  │
│  │  │  └──────────────┘  │  │  └──────────────┘  │  │  │
│  │  │                    │  │                    │  │  │
│  │  │  ┌──────────────┐  │  │  ┌──────────────┐  │  │  │
│  │  │  │ EBS Volumes  │  │  │  │ EBS Volumes  │  │  │  │
│  │  │  └──────────────┘  │  │  └──────────────┘  │  │  │
│  │  └────────────────────┘  └────────────────────┘  │  │
│  │                                                  │  │
│  │  ┌────────────────────────────────────────────┐  │  │
│  │  │               Shared Services              │  │  │
│  │  │  - S3 Buckets                              │  │  │
│  │  │  - IAM Roles                               │  │  │
│  │  │  - Route53 Hosted Zones                    │  │  │
│  │  │  - CloudWatch                              │  │  │
│  │  └────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────┘  │
│                                                        │
│                      VPC Peering                       │
│                           │                            │
│  ┌────────────────────────▼─────────────────────────┐  │
│  │             EU-Central-1 (Secondary)             │  │
│  │                                                  │  │
│  │  ┌────────────────────────────────────────────┐  │  │
│  │  │          Production Database VPC           │  │  │
│  │  │  - RDS Instances                           │  │  │
│  │  │  - DocumentDB                              │  │  │
│  │  │  - ElastiCache                             │  │  │
│  │  └────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
```
External Services:
- Cloudflare (Primary DNS)
- Route53 (China-accessible domains)
Infrastructure Components¶
1. Kubernetes Clusters¶
We operate two separate EKS (Elastic Kubernetes Service) clusters:
Production Cluster¶
Purpose: Hosts production workloads only
Specifications:

- Region: US-East-1
- Node Groups:
    - General Purpose: t3.large (3-10 nodes, auto-scaling)
    - Compute Optimized: c5.xlarge (2-5 nodes, auto-scaling)
    - Memory Optimized: r5.large (2-4 nodes, auto-scaling)
- Kubernetes Version: 1.28+
- Networking: AWS VPC CNI
- Storage: EBS CSI Driver
Workloads:

- In-house microservices (account, auth, client, thing, etc.)
- Select third-party services (critical to production)
- Production databases (MongoDB, Redis, PostgreSQL)
Access Control:

- RBAC enabled
- IAM roles for service accounts (IRSA)
- Network policies enforced
- Pod security standards (restricted)
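IRSA binds an IAM role to a Kubernetes service account through a single annotation, so pods get AWS credentials without node-level keys. A minimal sketch — the namespace, account ID, and role name are illustrative, not the actual configuration:

```yaml
# Hypothetical service account for a production workload using IRSA
apiVersion: v1
kind: ServiceAccount
metadata:
  name: account-microservice
  namespace: production          # illustrative namespace
  annotations:
    # Role ARN is a placeholder; the real role is created via Terraform
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/account-microservice-role
```

Pods that reference this service account receive temporary credentials scoped to that role's policies.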
Development Cluster¶
Purpose: Hosts development workloads and all third-party services
Specifications:

- Region: US-East-1
- Node Groups:
    - General Purpose: t3.medium (2-8 nodes, auto-scaling)
    - Spot Instances: Mixed (cost optimization)
- Kubernetes Version: 1.28+
- Networking: AWS VPC CNI
- Storage: EBS CSI Driver
Workloads:

- Development versions of microservices
- All third-party services:
    - Grafana
    - Prometheus
    - Elasticsearch
    - InfluxDB
    - Kibana
    - Loki
    - AlertManager
    - Uptime Kuma
    - Checkly
Access Control:

- RBAC enabled
- Relaxed policies for development
- Namespace isolation
2. Terraform Infrastructure as Code¶
All infrastructure is defined in Terraform with separate configurations for dev and prod:
Terraform Structure¶
```
terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       └── backend.tf
├── modules/
│   ├── eks/
│   ├── vpc/
│   ├── iam/
│   ├── s3/
│   ├── ec2/
│   ├── rds/
│   └── networking/
└── shared/
    ├── route53.tf
    └── cloudwatch.tf
```
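An environment's `main.tf` composes the shared modules. A hedged sketch of how that wiring might look — the module arguments, CIDR, and output names are illustrative assumptions, not the actual configuration:

```hcl
# environments/dev/main.tf -- illustrative module composition
module "vpc" {
  source      = "../../modules/vpc"
  environment = "dev"
  cidr_block  = "10.10.0.0/16"   # hypothetical CIDR range
}

module "eks" {
  source          = "../../modules/eks"
  environment     = "dev"
  cluster_version = "1.28"
  # Assumes the vpc module exposes this output
  subnet_ids      = module.vpc.private_subnet_ids
}
```

Keeping environment-specific values in `terraform.tfvars` lets dev and prod share identical module code with different sizing.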
Managed Resources¶
Compute:

- EKS clusters and node groups
- EC2 instances for legacy applications
- Auto Scaling Groups
- Launch templates

Networking:

- VPCs and subnets
- Internet Gateways
- NAT Gateways
- VPC Peering connections
- Security Groups
- Network ACLs

Storage:

- S3 buckets for backups and artifacts
- EBS volumes for stateful workloads
- EFS file systems (if needed)

Database:

- RDS instances (PostgreSQL, MySQL)
- DocumentDB clusters
- ElastiCache (Redis)

IAM:

- IAM roles for services
- IAM policies
- Service accounts
- IRSA (IAM Roles for Service Accounts)

DNS:

- Route53 hosted zones (for China access)
- DNS records

Monitoring:

- CloudWatch log groups
- CloudWatch alarms
- SNS topics for alerts
Deployment Process¶
Development Environment:
```bash
cd terraform/environments/dev
terraform init
terraform plan
terraform apply
```
Production Environment:
```bash
cd terraform/environments/prod
terraform init
terraform plan
terraform apply   # interactive approval prompt; -auto-approve is never used in prod
```
State Management:

- Remote state stored in S3
- State locking with DynamoDB
- Separate state files for dev and prod
- Encrypted at rest
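A `backend.tf` matching this state setup might look like the following sketch. The state bucket name appears in the S3 bucket list later in this document; the state key and DynamoDB table name are illustrative assumptions:

```hcl
# environments/prod/backend.tf -- remote state in S3 with DynamoDB locking
terraform {
  backend "s3" {
    bucket         = "oves-terraform-state"     # shared state bucket
    key            = "prod/terraform.tfstate"   # separate key per environment
    region         = "us-east-1"
    encrypt        = true                       # state encrypted at rest
    dynamodb_table = "terraform-locks"          # hypothetical lock table name
  }
}
```

The per-environment `key` keeps dev and prod state files fully isolated while sharing one bucket and lock table.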
3. Networking¶
Ingress Controllers¶
NGINX Ingress Controller
All HTTP/HTTPS traffic is routed through NGINX Ingress Controller:
Features:

- SSL/TLS termination
- Path-based routing
- Host-based routing
- Rate limiting
- Request/response modification
- WebSocket support
Configuration Example:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: account-microservice
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/rate-limit: "100"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.omnivoltaic.com
      secretName: api-tls
  rules:
    - host: api.omnivoltaic.com
      http:
        paths:
          - path: /account
            pathType: Prefix
            backend:
              service:
                name: account-microservice
                port:
                  number: 3000
```
Load Balancers¶
Application Load Balancer (ALB):

- HTTP/HTTPS traffic
- Managed by AWS Load Balancer Controller
- Automatic SSL certificate management

Network Load Balancer (NLB):

- TCP traffic
- Low latency requirements
- Static IP addresses

HAProxy:

- TCP route forwarding
- Custom routing logic
- Legacy application support
Configuration Example:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: mqtt-broker
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  ports:
    - port: 1883
      targetPort: 1883
      protocol: TCP
      name: mqtt
  selector:
    app: mqtt-broker
```
Certificate Management¶
cert-manager + Let's Encrypt
Automatic SSL/TLS certificate provisioning and renewal:
Features:

- Automatic certificate issuance
- Automatic renewal (30 days before expiry)
- Multiple issuers (Let's Encrypt staging/prod)
- DNS-01 and HTTP-01 challenges
- Wildcard certificate support
Configuration:
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@omnivoltaic.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx
```
DNS Management¶
Cloudflare (Primary):

- Primary DNS provider
- DDoS protection
- CDN capabilities
- DNS analytics
- API-driven management

AWS Route53 (Secondary):

- China-accessible domains
- Lower latency for Asian users
- No VPN required for China access
- Integrated with AWS services
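Since the Route53 zones live in the shared Terraform config, a record for the China-accessible domain can be managed there. A hedged sketch — the subdomain, TTL, and IP are placeholders, not real values:

```hcl
# shared/route53.tf -- illustrative zone and record for China access
resource "aws_route53_zone" "cn" {
  name = "cn.omnivoltaic.com"
}

resource "aws_route53_record" "cn_api" {
  zone_id = aws_route53_zone.cn.zone_id
  name    = "api.cn.omnivoltaic.com"   # hypothetical subdomain
  type    = "A"
  ttl     = 300
  records = ["203.0.113.10"]           # placeholder IP (TEST-NET-3 range)
}
```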
Domain Structure:
Production:
- omnivoltaic.com (Cloudflare)
- api.omnivoltaic.com (Cloudflare)
- *.omnivoltaic.com (Cloudflare)
- cn.omnivoltaic.com (Route53 - China)
Development:
- dev.omnivoltaic.com (Cloudflare)
- *.dev.omnivoltaic.com (Cloudflare)
4. Storage¶
EBS Volumes¶
Purpose: Persistent storage for stateful applications
Use Cases:

- Database storage (MongoDB, PostgreSQL, Redis)
- Application data persistence
- Log storage
- File uploads

Configuration:

- Type: gp3 (General Purpose SSD)
- Size: 100GB - 1TB (auto-scaling enabled)
- IOPS: 3000-16000 (based on workload)
- Encryption: Enabled (AWS KMS)
- Snapshots: Daily automated backups
StorageClass Example:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```
PersistentVolumeClaim Example:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongodb-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 100Gi
```
S3 Buckets¶
Purpose: Object storage for backups, artifacts, and static files
Buckets:
- oves-backups-prod - Database and application backups
- oves-artifacts-prod - Build artifacts and releases
- oves-logs-prod - Log archives
- oves-static-prod - Static assets (images, documents)
- oves-terraform-state - Terraform state files
Features:

- Versioning enabled
- Encryption at rest (AES-256)
- Lifecycle policies for cost optimization
- Cross-region replication (for critical data)
- Access logging
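A lifecycle policy that archives and then expires old objects can be expressed in Terraform. A sketch for the log-archive bucket — the rule name, transition age, and retention period are illustrative assumptions:

```hcl
# Illustrative lifecycle rule for the log-archive bucket
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = "oves-logs-prod"

  rule {
    id     = "archive-then-expire"   # hypothetical rule name
    status = "Enabled"
    filter {}                        # applies to all objects in the bucket

    transition {
      days          = 30
      storage_class = "GLACIER"      # move old logs to cheaper storage
    }

    expiration {
      days = 365                     # hypothetical retention period
    }
  }
}
```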
5. Auto Scaling¶
All services are configured with auto-scaling to handle varying loads:
Horizontal Pod Autoscaler (HPA)¶
Purpose: Automatically scale pods based on CPU/memory usage
Configuration Example:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: account-microservice-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: account-microservice
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
Cluster Autoscaler¶
Purpose: Automatically scale cluster nodes based on pod demands
Features:

- Adds nodes when pods can't be scheduled
- Removes nodes when underutilized
- Respects pod disruption budgets
- Multi-AZ aware

Configuration:

- Scale up threshold: 80% resource utilization
- Scale down threshold: 50% resource utilization
- Scale down delay: 10 minutes
- Min nodes: 2 per node group
- Max nodes: 10 per node group
Vertical Pod Autoscaler (VPA)¶
Purpose: Automatically adjust CPU/memory requests
Use Cases:

- Optimize resource allocation
- Reduce over-provisioning
- Improve cluster utilization
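A VPA object in recommendation-only mode might look like the sketch below. This assumes the VPA operator is installed in the cluster; the object name is illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: account-microservice-vpa   # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: account-microservice
  updatePolicy:
    updateMode: "Off"   # recommend only; avoids fighting the HPA on the same deployment
```

Running with `updateMode: "Off"` surfaces request recommendations without evicting pods, which is the safer starting point when an HPA already scales the same workload.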
6. VPC Peering¶
Purpose: Connect US-East-1 production cluster to EU-Central-1 database
Configuration:

- Requester VPC: US-East-1 Production VPC
- Accepter VPC: EU-Central-1 Database VPC
- CIDR Blocks: Non-overlapping ranges
- Route Tables: Updated for cross-region communication

Use Case:

- Production backend services in US-East-1
- Production databases in EU-Central-1
- Low-latency cross-region communication
- Data residency compliance (EU data stays in EU)

Security:

- Security groups restrict traffic to database ports only
- Network ACLs for additional layer of security
- VPC Flow Logs for traffic monitoring
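Cross-region peering requires a requester resource in one region and an accepter in the other. A hedged Terraform sketch — the VPC IDs and provider alias are placeholders, not real values:

```hcl
# Illustrative cross-region peering: US-East-1 prod VPC -> EU-Central-1 DB VPC
resource "aws_vpc_peering_connection" "prod_to_db" {
  vpc_id      = "vpc-0aaa11112222"   # placeholder: US-East-1 Production VPC
  peer_vpc_id = "vpc-0bbb33334444"   # placeholder: EU-Central-1 Database VPC
  peer_region = "eu-central-1"
}

# The accepter side is created against the peer region's provider
resource "aws_vpc_peering_connection_accepter" "db" {
  provider                  = aws.eu_central_1   # assumes an aliased EU provider
  vpc_peering_connection_id = aws_vpc_peering_connection.prod_to_db.id
  auto_accept               = true
}
```

Route table entries for each VPC's CIDR must still be added on both sides before traffic flows.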
Best Practices¶
Infrastructure as Code¶
- Version Control: All Terraform code in Git
- Code Review: Pull requests required for infrastructure changes
- Testing: Run `terraform plan` before applying
- Modules: Reusable modules for common patterns
- Documentation: Inline comments and README files
Security¶
- Least Privilege: Minimal IAM permissions
- Encryption: Encrypt data at rest and in transit
- Network Segmentation: Use security groups and network policies
- Secrets Management: Never hardcode secrets
- Audit Logging: Enable CloudTrail and VPC Flow Logs
Cost Optimization¶
- Right-Sizing: Use appropriate instance types
- Spot Instances: Use for non-critical workloads
- Auto-Scaling: Scale down during low usage
- Reserved Instances: For predictable workloads
- Storage Lifecycle: Archive old data to cheaper storage
High Availability¶
- Multi-AZ: Deploy across multiple availability zones
- Load Balancing: Distribute traffic across instances
- Health Checks: Automatic failure detection
- Backups: Regular automated backups
- Disaster Recovery: Tested recovery procedures
Troubleshooting¶
Common Issues¶
Cluster Access Issues¶
Symptom: Unable to connect to Kubernetes cluster
Solutions:
1. Update kubeconfig: `aws eks update-kubeconfig --name <cluster-name>`
2. Verify IAM permissions
3. Check security group rules
4. Verify VPN connection (if required)
Storage Issues¶
Symptom: PVC stuck in Pending state
Solutions:
1. Check StorageClass exists: `kubectl get storageclass`
2. Verify EBS CSI driver is running: `kubectl get pods -n kube-system | grep ebs`
3. Check AWS quotas for EBS volumes
4. Review PVC events: `kubectl describe pvc <pvc-name>`
Networking Issues¶
Symptom: Services not accessible
Solutions:
1. Check Ingress configuration: `kubectl get ingress`
2. Verify DNS records in Cloudflare/Route53
3. Check certificate status: `kubectl get certificate`
4. Review NGINX Ingress logs: `kubectl logs -n ingress-nginx <pod-name>`