Infrastructure Documentation¶
Comprehensive guide to OVES cloud infrastructure, managed entirely through Infrastructure as Code (IaC) using Terraform.
Overview¶
The OVES infrastructure is built on AWS, spanning multiple regions with dedicated development and production environments. All infrastructure is defined, versioned, and deployed using Terraform, ensuring consistency, reproducibility, and easy disaster recovery.
Architecture Overview¶
```
┌────────────────────────────────────────────────────────┐
│                       AWS Cloud                        │
│                                                        │
│  ┌──────────────────────────────────────────────────┐  │
│  │               US-East-1 (Primary)                │  │
│  │                                                  │  │
│  │  ┌────────────────────┐  ┌────────────────────┐  │  │
│  │  │   Production VPC   │  │  Development VPC   │  │  │
│  │  │                    │  │                    │  │  │
│  │  │  ┌──────────────┐  │  │  ┌──────────────┐  │  │  │
│  │  │  │ EKS Cluster  │  │  │  │ EKS Cluster  │  │  │  │
│  │  │  │    (Prod)    │  │  │  │    (Dev)     │  │  │  │
│  │  │  │              │  │  │  │              │  │  │  │
│  │  │  │ - In-house   │  │  │  │ - Dev Apps   │  │  │  │
│  │  │  │   Apps       │  │  │  │ - 3rd Party  │  │  │  │
│  │  │  │ - Select     │  │  │  │   Services   │  │  │  │
│  │  │  │   3rd Party  │  │  │  │              │  │  │  │
│  │  │  └──────────────┘  │  │  └──────────────┘  │  │  │
│  │  │                    │  │                    │  │  │
│  │  │  ┌──────────────┐  │  │  ┌──────────────┐  │  │  │
│  │  │  │     EC2      │  │  │  │     EC2      │  │  │  │
│  │  │  │  Instances   │  │  │  │  Instances   │  │  │  │
│  │  │  └──────────────┘  │  │  └──────────────┘  │  │  │
│  │  │                    │  │                    │  │  │
│  │  │  ┌──────────────┐  │  │  ┌──────────────┐  │  │  │
│  │  │  │ EBS Volumes  │  │  │  │ EBS Volumes  │  │  │  │
│  │  │  └──────────────┘  │  │  └──────────────┘  │  │  │
│  │  └────────────────────┘  └────────────────────┘  │  │
│  │                                                  │  │
│  │  ┌────────────────────────────────────────────┐  │  │
│  │  │               Shared Services              │  │  │
│  │  │  - S3 Buckets                              │  │  │
│  │  │  - IAM Roles                               │  │  │
│  │  │  - Route53 Hosted Zones                    │  │  │
│  │  │  - CloudWatch                              │  │  │
│  │  └────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────┘  │
│                                                        │
│                      VPC Peering                       │
│                           │                            │
│  ┌────────────────────────▼─────────────────────────┐  │
│  │             EU-Central-1 (Secondary)             │  │
│  │                                                  │  │
│  │  ┌────────────────────────────────────────────┐  │  │
│  │  │          Production Database VPC           │  │  │
│  │  │  - RDS Instances                           │  │  │
│  │  │  - DocumentDB                              │  │  │
│  │  │  - ElastiCache                             │  │  │
│  │  └────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────┘
```
External Services:
- Cloudflare (Primary DNS)
- Route53 (China-accessible domains)
Infrastructure Components¶
1. Kubernetes Clusters¶
We operate two separate EKS (Elastic Kubernetes Service) clusters:
Production Cluster¶
Purpose: Hosts production workloads only
Specifications:

- Region: US-East-1
- Node Groups:
    - General Purpose: t3.large (3-10 nodes, auto-scaling)
    - Compute Optimized: c5.xlarge (2-5 nodes, auto-scaling)
    - Memory Optimized: r5.large (2-4 nodes, auto-scaling)
- Kubernetes Version: 1.28+
- Networking: AWS VPC CNI
- Storage: EBS CSI Driver
Workloads:

- In-house microservices (account, auth, client, thing, etc.)
- Select third-party services (critical to production)
- Production databases (MongoDB, Redis, PostgreSQL)
Access Control:

- RBAC enabled
- IAM roles for service accounts (IRSA)
- Network policies enforced
- Pod security standards (restricted)
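IRSA binds an IAM role to a Kubernetes service account through a single annotation, so pods get AWS credentials without node-level keys. A minimal sketch — the namespace, account ID, and role name are illustrative, not the actual configuration:

```yaml
# Hypothetical service account for a production workload using IRSA
apiVersion: v1
kind: ServiceAccount
metadata:
  name: account-microservice
  namespace: production          # illustrative namespace
  annotations:
    # Role ARN is a placeholder; the real role is created via Terraform
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/account-microservice-role
```

Pods that reference this service account receive temporary credentials scoped to that role's policies.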
Development Cluster¶
Purpose: Hosts development workloads and all third-party services
Specifications:

- Region: US-East-1
- Node Groups:
    - General Purpose: t3.medium (2-8 nodes, auto-scaling)
    - Spot Instances: Mixed (cost optimization)
- Kubernetes Version: 1.28+
- Networking: AWS VPC CNI
- Storage: EBS CSI Driver
Workloads:

- Development versions of microservices
- All third-party services:
    - Grafana
    - Prometheus
    - Elasticsearch
    - InfluxDB
    - Kibana
    - Loki
    - AlertManager
    - Uptime Kuma
    - Checkly
Access Control:

- RBAC enabled
- Relaxed policies for development
- Namespace isolation
2. Terraform Infrastructure as Code¶
All infrastructure is defined in Terraform with separate configurations for dev and prod:
Terraform Structure¶
```
terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       └── backend.tf
├── modules/
│   ├── eks/
│   ├── vpc/
│   ├── iam/
│   ├── s3/
│   ├── ec2/
│   ├── rds/
│   └── networking/
└── shared/
    ├── route53.tf
    └── cloudwatch.tf
```
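An environment's `main.tf` composes the shared modules. A hedged sketch of how that wiring might look — the module arguments, CIDR, and output names are illustrative assumptions, not the actual configuration:

```hcl
# environments/dev/main.tf -- illustrative module composition
module "vpc" {
  source      = "../../modules/vpc"
  environment = "dev"
  cidr_block  = "10.10.0.0/16"   # hypothetical CIDR range
}

module "eks" {
  source          = "../../modules/eks"
  environment     = "dev"
  cluster_version = "1.28"
  # Assumes the vpc module exposes this output
  subnet_ids      = module.vpc.private_subnet_ids
}
```

Keeping environment-specific values in `terraform.tfvars` lets dev and prod share identical module code with different sizing.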
Managed Resources¶
Compute:

- EKS clusters and node groups
- EC2 instances for legacy applications
- Auto Scaling Groups
- Launch templates

Networking:

- VPCs and subnets
- Internet Gateways
- NAT Gateways
- VPC Peering connections
- Security Groups
- Network ACLs

Storage:

- S3 buckets for backups and artifacts
- EBS volumes for stateful workloads
- EFS file systems (if needed)

Database:

- RDS instances (PostgreSQL, MySQL)
- DocumentDB clusters
- ElastiCache (Redis)

IAM:

- IAM roles for services
- IAM policies
- Service accounts
- IRSA (IAM Roles for Service Accounts)

DNS:

- Route53 hosted zones (for China access)
- DNS records

Monitoring:

- CloudWatch log groups
- CloudWatch alarms
- SNS topics for alerts
Deployment Process¶
Development Environment:
```bash
cd terraform/environments/dev
terraform init
terraform plan
terraform apply
```
Production Environment:
```bash
cd terraform/environments/prod
terraform init
terraform plan
terraform apply   # interactive approval prompt; -auto-approve is never used in prod
```
State Management:

- Remote state stored in S3
- State locking with DynamoDB
- Separate state files for dev and prod
- Encrypted at rest
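A `backend.tf` matching this state setup might look like the following sketch. The state bucket name appears in the S3 bucket list later in this document; the state key and DynamoDB table name are illustrative assumptions:

```hcl
# environments/prod/backend.tf -- remote state in S3 with DynamoDB locking
terraform {
  backend "s3" {
    bucket         = "oves-terraform-state"     # shared state bucket
    key            = "prod/terraform.tfstate"   # separate key per environment
    region         = "us-east-1"
    encrypt        = true                       # state encrypted at rest
    dynamodb_table = "terraform-locks"          # hypothetical lock table name
  }
}
```

The per-environment `key` keeps dev and prod state files fully isolated while sharing one bucket and lock table.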
3. Networking¶
Ingress Controllers¶
NGINX Ingress Controller
All HTTP/HTTPS traffic is routed through NGINX Ingress Controller:
Features:

- SSL/TLS termination
- Path-based routing
- Host-based routing
- Rate limiting
- Request/response modification
- WebSocket support
Configuration Example:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: account-microservice
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/rate-limit: "100"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - api.omnivoltaic.com
      secretName: api-tls
  rules:
    - host: api.omnivoltaic.com
      http:
        paths:
          - path: /account
            pathType: Prefix
            backend:
              service:
                name: account-microservice
                port:
                  number: 3000
```
Load Balancers¶
Application Load Balancer (ALB):

- HTTP/HTTPS traffic
- Managed by AWS Load Balancer Controller
- Automatic SSL certificate management

Network Load Balancer (NLB):

- TCP traffic
- Low latency requirements
- Static IP addresses

HAProxy:

- TCP route forwarding
- Custom routing logic
- Legacy application support
Configuration Example:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: mqtt-broker
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  ports:
    - port: 1883
      targetPort: 1883
      protocol: TCP
      name: mqtt
  selector:
    app: mqtt-broker
```
Certificate Management¶
cert-manager + Let's Encrypt
Automatic SSL/TLS certificate provisioning and renewal:
Features:

- Automatic certificate issuance
- Automatic renewal (30 days before expiry)
- Multiple issuers (Let's Encrypt staging/prod)
- DNS-01 and HTTP-01 challenges
- Wildcard certificate support
Configuration:
```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@omnivoltaic.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - http01:
          ingress:
            class: nginx
```
DNS Management¶
Cloudflare (Primary):

- Primary DNS provider
- DDoS protection
- CDN capabilities
- DNS analytics
- API-driven management

AWS Route53 (Secondary):

- China-accessible domains
- Lower latency for Asian users
- No VPN required for China access
- Integrated with AWS services
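Since the Route53 zones live in the shared Terraform config, a record for the China-accessible domain can be managed there. A hedged sketch — the subdomain, TTL, and IP are placeholders, not real values:

```hcl
# shared/route53.tf -- illustrative zone and record for China access
resource "aws_route53_zone" "cn" {
  name = "cn.omnivoltaic.com"
}

resource "aws_route53_record" "cn_api" {
  zone_id = aws_route53_zone.cn.zone_id
  name    = "api.cn.omnivoltaic.com"   # hypothetical subdomain
  type    = "A"
  ttl     = 300
  records = ["203.0.113.10"]           # placeholder IP (TEST-NET-3 range)
}
```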
Domain Structure:
Production:
- omnivoltaic.com (Cloudflare)
- api.omnivoltaic.com (Cloudflare)
- *.omnivoltaic.com (Cloudflare)
- cn.omnivoltaic.com (Route53 - China)
Development:
- dev.omnivoltaic.com (Cloudflare)
- *.dev.omnivoltaic.com (Cloudflare)
4. Storage¶
EBS Volumes¶
Purpose: Persistent storage for stateful applications
Use Cases:

- Database storage (MongoDB, PostgreSQL, Redis)
- Application data persistence
- Log storage
- File uploads

Configuration:

- Type: gp3 (General Purpose SSD)
- Size: 100GB - 1TB (auto-scaling enabled)
- IOPS: 3000-16000 (based on workload)
- Encryption: Enabled (AWS KMS)
- Snapshots: Daily automated backups
StorageClass Example:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```
PersistentVolumeClaim Example:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongodb-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 100Gi
```
S3 Buckets¶
Purpose: Object storage for backups, artifacts, and static files
Buckets:
- oves-backups-prod - Database and application backups
- oves-artifacts-prod - Build artifacts and releases
- oves-logs-prod - Log archives
- oves-static-prod - Static assets (images, documents)
- oves-terraform-state - Terraform state files
Features:

- Versioning enabled
- Encryption at rest (AES-256)
- Lifecycle policies for cost optimization
- Cross-region replication (for critical data)
- Access logging
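A lifecycle policy that archives and then expires old objects can be expressed in Terraform. A sketch for the log-archive bucket — the rule name, transition age, and retention period are illustrative assumptions:

```hcl
# Illustrative lifecycle rule for the log-archive bucket
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = "oves-logs-prod"

  rule {
    id     = "archive-then-expire"   # hypothetical rule name
    status = "Enabled"
    filter {}                        # applies to all objects in the bucket

    transition {
      days          = 30
      storage_class = "GLACIER"      # move old logs to cheaper storage
    }

    expiration {
      days = 365                     # hypothetical retention period
    }
  }
}
```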
5. Auto Scaling¶
All services are configured with auto-scaling to handle varying loads:
Horizontal Pod Autoscaler (HPA)¶
Purpose: Automatically scale pods based on CPU/memory usage
Configuration Example:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: account-microservice-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: account-microservice
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
Cluster Autoscaler¶
Purpose: Automatically scale cluster nodes based on pod demands
Features:

- Adds nodes when pods can't be scheduled
- Removes nodes when underutilized
- Respects pod disruption budgets
- Multi-AZ aware

Configuration:

- Scale up threshold: 80% resource utilization
- Scale down threshold: 50% resource utilization
- Scale down delay: 10 minutes
- Min nodes: 2 per node group
- Max nodes: 10 per node group
Vertical Pod Autoscaler (VPA)¶
Purpose: Automatically adjust CPU/memory requests
Use Cases:

- Optimize resource allocation
- Reduce over-provisioning
- Improve cluster utilization
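A VPA object in recommendation-only mode might look like the sketch below. This assumes the VPA operator is installed in the cluster; the object name is illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: account-microservice-vpa   # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: account-microservice
  updatePolicy:
    updateMode: "Off"   # recommend only; avoids fighting the HPA on the same deployment
```

Running with `updateMode: "Off"` surfaces request recommendations without evicting pods, which is the safer starting point when an HPA already scales the same workload.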
6. VPC Peering¶
Purpose: Connect US-East-1 production cluster to EU-Central-1 database
Configuration:

- Requester VPC: US-East-1 Production VPC
- Accepter VPC: EU-Central-1 Database VPC
- CIDR Blocks: Non-overlapping ranges
- Route Tables: Updated for cross-region communication

Use Case:

- Production backend services in US-East-1
- Production databases in EU-Central-1
- Low-latency cross-region communication
- Data residency compliance (EU data stays in EU)

Security:

- Security groups restrict traffic to database ports only
- Network ACLs for additional layer of security
- VPC Flow Logs for traffic monitoring
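Cross-region peering requires a requester resource in one region and an accepter in the other. A hedged Terraform sketch — the VPC IDs and provider alias are placeholders, not real values:

```hcl
# Illustrative cross-region peering: US-East-1 prod VPC -> EU-Central-1 DB VPC
resource "aws_vpc_peering_connection" "prod_to_db" {
  vpc_id      = "vpc-0aaa11112222"   # placeholder: US-East-1 Production VPC
  peer_vpc_id = "vpc-0bbb33334444"   # placeholder: EU-Central-1 Database VPC
  peer_region = "eu-central-1"
}

# The accepter side is created against the peer region's provider
resource "aws_vpc_peering_connection_accepter" "db" {
  provider                  = aws.eu_central_1   # assumes an aliased EU provider
  vpc_peering_connection_id = aws_vpc_peering_connection.prod_to_db.id
  auto_accept               = true
}
```

Route table entries for each VPC's CIDR must still be added on both sides before traffic flows.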
Best Practices¶
Infrastructure as Code¶
- Version Control: All Terraform code in Git
- Code Review: Pull requests required for infrastructure changes
- Testing: Run `terraform plan` before applying
- Modules: Reusable modules for common patterns
- Documentation: Inline comments and README files
Security¶
- Least Privilege: Minimal IAM permissions
- Encryption: Encrypt data at rest and in transit
- Network Segmentation: Use security groups and network policies
- Secrets Management: Never hardcode secrets
- Audit Logging: Enable CloudTrail and VPC Flow Logs
Cost Optimization¶
- Right-Sizing: Use appropriate instance types
- Spot Instances: Use for non-critical workloads
- Auto-Scaling: Scale down during low usage
- Reserved Instances: For predictable workloads
- Storage Lifecycle: Archive old data to cheaper storage
High Availability¶
- Multi-AZ: Deploy across multiple availability zones
- Load Balancing: Distribute traffic across instances
- Health Checks: Automatic failure detection
- Backups: Regular automated backups
- Disaster Recovery: Tested recovery procedures
Troubleshooting¶
Common Issues¶
Cluster Access Issues¶
Symptom: Unable to connect to Kubernetes cluster
Solutions:
1. Update kubeconfig: `aws eks update-kubeconfig --name <cluster-name>`
2. Verify IAM permissions
3. Check security group rules
4. Verify VPN connection (if required)
Storage Issues¶
Symptom: PVC stuck in Pending state
Solutions:
1. Check StorageClass exists: `kubectl get storageclass`
2. Verify EBS CSI driver is running: `kubectl get pods -n kube-system | grep ebs`
3. Check AWS quotas for EBS volumes
4. Review PVC events: `kubectl describe pvc <pvc-name>`
Networking Issues¶
Symptom: Services not accessible
Solutions:
1. Check Ingress configuration: `kubectl get ingress`
2. Verify DNS records in Cloudflare/Route53
3. Check certificate status: `kubectl get certificate`
4. Review NGINX Ingress logs: `kubectl logs -n ingress-nginx <pod-name>`