Infrastructure Documentation

Comprehensive guide to OVES cloud infrastructure, managed entirely through Infrastructure as Code (IaC) using Terraform.

Overview

The OVES infrastructure is built on AWS, spanning multiple regions with dedicated development and production environments. All infrastructure is defined, versioned, and deployed using Terraform, ensuring consistency, reproducibility, and easy disaster recovery.

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                          AWS Cloud                                  │
│                                                                     │
│  ┌──────────────────────────────────────────────────────────────┐  │
│  │                    US-East-1 (Primary)                       │  │
│  │                                                              │  │
│  │  ┌────────────────────┐    ┌────────────────────┐           │  │
│  │  │  Production VPC    │    │  Development VPC   │           │  │
│  │  │                    │    │                    │           │  │
│  │  │  ┌──────────────┐  │    │  ┌──────────────┐  │           │  │
│  │  │  │ EKS Cluster  │  │    │  │ EKS Cluster  │  │           │  │
│  │  │  │  (Prod)      │  │    │  │  (Dev)       │  │           │  │
│  │  │  │              │  │    │  │              │  │           │  │
│  │  │  │ - In-house   │  │    │  │ - Dev Apps   │  │           │  │
│  │  │  │   Apps       │  │    │  │ - 3rd Party  │  │           │  │
│  │  │  │ - Select     │  │    │  │   Services   │  │           │  │
│  │  │  │   3rd Party  │  │    │  │              │  │           │  │
│  │  │  └──────────────┘  │    │  └──────────────┘  │           │  │
│  │  │                    │    │                    │           │  │
│  │  │  ┌──────────────┐  │    │  ┌──────────────┐  │           │  │
│  │  │  │ EC2          │  │    │  │ EC2          │  │           │  │
│  │  │  │ Instances    │  │    │  │ Instances    │  │           │  │
│  │  │  └──────────────┘  │    │  └──────────────┘  │           │  │
│  │  │                    │    │                    │           │  │
│  │  │  ┌──────────────┐  │    │  ┌──────────────┐  │           │  │
│  │  │  │ EBS Volumes  │  │    │  │ EBS Volumes  │  │           │  │
│  │  │  └──────────────┘  │    │  └──────────────┘  │           │  │
│  │  └────────────────────┘    └────────────────────┘           │  │
│  │                                                              │  │
│  │  ┌────────────────────────────────────────────────────────┐  │  │
│  │  │              Shared Services                           │  │  │
│  │  │  - S3 Buckets                                          │  │  │
│  │  │  - IAM Roles                                           │  │  │
│  │  │  - Route53 Hosted Zones                                │  │  │
│  │  │  - CloudWatch                                          │  │  │
│  │  └────────────────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
│                          VPC Peering                                │
│                               │                                     │
│  ┌────────────────────────────▼─────────────────────────────────┐  │
│  │                    EU-Central-1 (Secondary)                  │  │
│  │                                                              │  │
│  │  ┌────────────────────────────────────────────────────────┐  │  │
│  │  │  Production Database VPC                               │  │  │
│  │  │  - RDS Instances                                       │  │  │
│  │  │  - DocumentDB                                          │  │  │
│  │  │  - ElastiCache                                         │  │  │
│  │  └────────────────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

External Services:
- Cloudflare (Primary DNS)
- Route53 (China-accessible domains)

Infrastructure Components

1. Kubernetes Clusters

We operate two separate EKS (Elastic Kubernetes Service) clusters:

Production Cluster

Purpose: Hosts production workloads only

Specifications:
- Region: US-East-1
- Node Groups:
  - General Purpose: t3.large (3-10 nodes, auto-scaling)
  - Compute Optimized: c5.xlarge (2-5 nodes, auto-scaling)
  - Memory Optimized: r5.large (2-4 nodes, auto-scaling)
- Kubernetes Version: 1.28+
- Networking: AWS VPC CNI
- Storage: EBS CSI Driver

Workloads:
- In-house microservices (account, auth, client, thing, etc.)
- Select third-party services (critical to production)
- Production databases (MongoDB, Redis, PostgreSQL)

Access Control:
- RBAC enabled
- IAM roles for service accounts (IRSA)
- Network policies enforced
- Pod security standards (restricted)
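
The access controls listed for the production cluster map onto ordinary Kubernetes objects. The following is a minimal sketch, not the actual production configuration: the namespace `backend`, the service account name, and the IAM role ARN are all placeholders chosen for illustration.

```yaml
# Namespace that enforces the "restricted" Pod Security Standard.
apiVersion: v1
kind: Namespace
metadata:
  name: backend  # hypothetical namespace name
  labels:
    pod-security.kubernetes.io/enforce: restricted
---
# Default-deny policy: pods in the namespace accept no ingress traffic
# until another NetworkPolicy explicitly allows it.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: backend
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
# Service account bound to an IAM role through IRSA.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: account-microservice  # hypothetical service account name
  namespace: backend
  annotations:
    # Placeholder ARN; the real role would be created by the Terraform iam module.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/account-microservice
```

NetworkPolicy objects only take effect when the cluster's network plugin enforces them, so this sketch assumes policy enforcement is enabled alongside the AWS VPC CNI.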

Development Cluster

Purpose: Hosts development workloads and all third-party services

Specifications:
- Region: US-East-1
- Node Groups:
  - General Purpose: t3.medium (2-8 nodes, auto-scaling)
  - Spot Instances: Mixed (cost optimization)
- Kubernetes Version: 1.28+
- Networking: AWS VPC CNI
- Storage: EBS CSI Driver

Workloads:
- Development versions of microservices
- All third-party services:
  - Grafana
  - Prometheus
  - Elasticsearch
  - InfluxDB
  - Kibana
  - Loki
  - AlertManager
  - Uptime Kuma
  - Checkly

Access Control:
- RBAC enabled
- Relaxed policies for development
- Namespace isolation
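
To illustrate how a development workload could be steered onto the Spot node group, here is a hedged sketch. It assumes the Spot capacity is provided by an EKS managed node group, which labels its nodes with `eks.amazonaws.com/capacityType`; the Deployment name, labels, and image are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: account-microservice-dev  # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: account-microservice-dev
  template:
    metadata:
      labels:
        app: account-microservice-dev
    spec:
      # Schedule only onto Spot capacity (label set by EKS managed node groups).
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT
      containers:
      - name: app
        image: example.com/account-microservice:dev  # placeholder image
```

Self-managed node groups do not receive this label automatically, so they would need an equivalent custom label for the selector to match.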

2. Terraform Infrastructure as Code

All infrastructure is defined in Terraform with separate configurations for dev and prod:

Terraform Structure

terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       └── backend.tf
├── modules/
│   ├── eks/
│   ├── vpc/
│   ├── iam/
│   ├── s3/
│   ├── ec2/
│   ├── rds/
│   └── networking/
└── shared/
    ├── route53.tf
    └── cloudwatch.tf

Managed Resources

Compute:
- EKS clusters and node groups
- EC2 instances for legacy applications
- Auto Scaling Groups
- Launch templates

Networking:
- VPCs and subnets
- Internet Gateways
- NAT Gateways
- VPC Peering connections
- Security Groups
- Network ACLs

Storage:
- S3 buckets for backups and artifacts
- EBS volumes for stateful workloads
- EFS file systems (if needed)

Database:
- RDS instances (PostgreSQL, MySQL)
- DocumentDB clusters
- ElastiCache (Redis)

IAM:
- IAM roles for services
- IAM policies
- Service accounts
- IRSA (IAM Roles for Service Accounts)

DNS:
- Route53 hosted zones (for China access)
- DNS records

Monitoring:
- CloudWatch log groups
- CloudWatch alarms
- SNS topics for alerts

Deployment Process

Development Environment:

cd terraform/environments/dev
terraform init
terraform plan
terraform apply

Production Environment:

cd terraform/environments/prod
terraform init
terraform plan
terraform apply  # Prompts for manual approval by default; do not pass -auto-approve in production

State Management:
- Remote state stored in S3
- State locking with DynamoDB
- Separate state files for dev and prod
- Encrypted at rest

3. Networking

Ingress Controllers

NGINX Ingress Controller

All HTTP/HTTPS traffic is routed through NGINX Ingress Controller:

Features:
- SSL/TLS termination
- Path-based routing
- Host-based routing
- Rate limiting
- Request/response modification
- WebSocket support

Configuration Example:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: account-microservice
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
    nginx.ingress.kubernetes.io/limit-rps: "100"  # requests per second, per client IP
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.omnivoltaic.com
    secretName: api-tls
  rules:
  - host: api.omnivoltaic.com
    http:
      paths:
      - path: /account
        pathType: Prefix
        backend:
          service:
            name: account-microservice
            port:
              number: 3000
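
The NGINX Ingress Controller proxies WebSocket connections without extra configuration, but its default proxy timeouts of 60 seconds close idle connections. The sketch below shows one way to extend them for a long-lived connection endpoint. The Ingress name, the `/ws` path, and the backing service are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: realtime-gateway  # hypothetical name
  annotations:
    # Keep idle WebSocket connections open for up to one hour.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
spec:
  ingressClassName: nginx
  rules:
  - host: api.omnivoltaic.com
    http:
      paths:
      - path: /ws  # hypothetical WebSocket path
        pathType: Prefix
        backend:
          service:
            name: realtime-gateway  # hypothetical service
            port:
              number: 3000
```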

Load Balancers

Application Load Balancer (ALB):
- HTTP/HTTPS traffic
- Managed by AWS Load Balancer Controller
- Automatic SSL certificate management

Network Load Balancer (NLB):
- TCP traffic
- Low latency requirements
- Static IP addresses

HAProxy:
- TCP route forwarding
- Custom routing logic
- Legacy application support

Configuration Example:

apiVersion: v1
kind: Service
metadata:
  name: mqtt-broker
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  ports:
  - port: 1883
    targetPort: 1883
    protocol: TCP
    name: mqtt
  selector:
    app: mqtt-broker

Certificate Management

cert-manager + Let's Encrypt

Automatic SSL/TLS certificate provisioning and renewal:

Features:
- Automatic certificate issuance
- Automatic renewal (30 days before expiry)
- Multiple issuers (Let's Encrypt staging/prod)
- DNS-01 and HTTP-01 challenges
- Wildcard certificate support

Configuration:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@omnivoltaic.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx
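
The HTTP-01 solver above cannot issue wildcard certificates, because Let's Encrypt only issues wildcards through the DNS-01 challenge. Since Cloudflare is the primary DNS provider, a DNS-01 issuer might look like the sketch below. The issuer name, the Secret holding the Cloudflare API token, and the Certificate are assumptions for illustration, not the deployed configuration.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-dns  # hypothetical issuer name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@omnivoltaic.com
    privateKeySecretRef:
      name: letsencrypt-prod-dns
    solvers:
    - dns01:
        cloudflare:
          apiTokenSecretRef:
            name: cloudflare-api-token  # hypothetical Secret with a Cloudflare API token
            key: api-token
---
# Wildcard certificate requested through the DNS-01 issuer.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-omnivoltaic  # hypothetical name
spec:
  secretName: wildcard-omnivoltaic-tls
  issuerRef:
    name: letsencrypt-prod-dns
    kind: ClusterIssuer
  dnsNames:
  - omnivoltaic.com
  - "*.omnivoltaic.com"
```

For a ClusterIssuer, cert-manager looks up the referenced Secret in its cluster resource namespace, which defaults to `cert-manager`.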

DNS Management

Cloudflare (Primary):
- Primary DNS provider
- DDoS protection
- CDN capabilities
- DNS analytics
- API-driven management

AWS Route53 (Secondary):
- China-accessible domains
- Lower latency for Asian users
- No VPN required for China access
- Integrated with AWS services

Domain Structure:

Production:
- omnivoltaic.com (Cloudflare)
- api.omnivoltaic.com (Cloudflare)
- *.omnivoltaic.com (Cloudflare)
- cn.omnivoltaic.com (Route53 - China)

Development:
- dev.omnivoltaic.com (Cloudflare)
- *.dev.omnivoltaic.com (Cloudflare)

4. Storage

EBS Volumes

Purpose: Persistent storage for stateful applications

Use Cases:
- Database storage (MongoDB, PostgreSQL, Redis)
- Application data persistence
- Log storage
- File uploads

Configuration:
- Type: gp3 (General Purpose SSD)
- Size: 100GB - 1TB (volume expansion enabled)
- IOPS: 3000-16000 (based on workload)
- Encryption: Enabled (AWS KMS)
- Snapshots: Daily automated backups

StorageClass Example:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-sc
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  iops: "3000"
  throughput: "125"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

PersistentVolumeClaim Example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mongodb-data
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 100Gi
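
Because the StorageClass uses `volumeBindingMode: WaitForFirstConsumer`, the claim above stays in the Pending state until a pod that references it is scheduled; only then is the EBS volume created in that pod's availability zone. A minimal sketch of a consuming pod follows. The pod name and image tag are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mongodb-0  # hypothetical pod name
spec:
  containers:
  - name: mongodb
    image: mongo:7  # placeholder image tag
    volumeMounts:
    - name: data
      mountPath: /data/db  # default MongoDB data directory
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: mongodb-data  # the claim defined above
```

In practice a database would more likely run as a StatefulSet with `volumeClaimTemplates`; a bare Pod is used here only to keep the example short.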

S3 Buckets

Purpose: Object storage for backups, artifacts, and static files

Buckets:
- oves-backups-prod: Database and application backups
- oves-artifacts-prod: Build artifacts and releases
- oves-logs-prod: Log archives
- oves-static-prod: Static assets (images, documents)
- oves-terraform-state: Terraform state files

Features:
- Versioning enabled
- Encryption at rest (AES-256)
- Lifecycle policies for cost optimization
- Cross-region replication (for critical data)
- Access logging

5. Auto Scaling

All services are configured with auto-scaling to handle varying loads:

Horizontal Pod Autoscaler (HPA)

Purpose: Automatically scale pods based on CPU/memory usage

Configuration Example:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: account-microservice-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: account-microservice
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
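
Utilization targets are measured against each container's resource requests, so the HPA above can only compute CPU and memory utilization if the target Deployment declares them. The following sketch shows the relevant part of such a Deployment. The request values, the `app` label, and the image are assumptions, not the real settings.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: account-microservice
spec:
  # replicas is left unset so the HPA owns the replica count.
  selector:
    matchLabels:
      app: account-microservice  # assumed pod label
  template:
    metadata:
      labels:
        app: account-microservice
    spec:
      containers:
      - name: account-microservice
        image: example.com/account-microservice:latest  # placeholder image
        ports:
        - containerPort: 3000
        resources:
          requests:
            cpu: 250m      # 70% target = 175m average per pod
            memory: 256Mi  # 80% target = about 205Mi average per pod
          limits:
            memory: 512Mi
```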

Cluster Autoscaler

Purpose: Automatically scale cluster nodes based on pod demands

Features:
- Adds nodes when pods can't be scheduled
- Removes nodes when underutilized
- Respects pod disruption budgets
- Multi-AZ aware

Configuration:
- Scale up threshold: 80% resource utilization
- Scale down threshold: 50% resource utilization
- Scale down delay: 10 minutes
- Min nodes: 2 per node group
- Max nodes: 10 per node group
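
The Cluster Autoscaler can only respect pod disruption budgets that exist. As a hedged example, a budget for the account microservice might look like the following; the `app` label is an assumed pod label and `minAvailable: 1` is an illustrative value.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: account-microservice-pdb  # hypothetical name
spec:
  # Keep at least one pod running while a node is drained for scale-down.
  minAvailable: 1
  selector:
    matchLabels:
      app: account-microservice  # assumed pod label
```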

Vertical Pod Autoscaler (VPA)

Purpose: Automatically adjust CPU/memory requests

Use Cases:
- Optimize resource allocation
- Reduce over-provisioning
- Improve cluster utilization
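
A VPA object is a custom resource installed with the Vertical Pod Autoscaler components. The sketch below runs it in recommendation-only mode, which is a common choice when an HPA already scales the same workload on CPU or memory, since the two controllers would otherwise act on the same signal. The object name and resource bounds are assumptions.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: account-microservice-vpa  # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: account-microservice
  updatePolicy:
    updateMode: "Off"  # only publish recommendations; never evict pods
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        cpu: 100m
        memory: 128Mi
      maxAllowed:
        cpu: "1"
        memory: 1Gi
```

The recommendations can then be read with `kubectl describe vpa account-microservice-vpa` and applied to the Deployment's requests by hand.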

6. VPC Peering

Purpose: Connect US-East-1 production cluster to EU-Central-1 database

Configuration:
- Requester VPC: US-East-1 Production VPC
- Accepter VPC: EU-Central-1 Database VPC
- CIDR Blocks: Non-overlapping ranges
- Route Tables: Updated for cross-region communication

Use Case:
- Production backend services in US-East-1
- Production databases in EU-Central-1
- Low-latency cross-region communication
- Data residency compliance (EU data stays in EU)

Security:
- Security groups restrict traffic to database ports only
- Network ACLs for additional layer of security
- VPC Flow Logs for traffic monitoring
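
The security group restriction operates at the VPC level. The same intent can be mirrored inside the cluster with an egress NetworkPolicy, sketched below under several assumptions: the CIDR `10.20.0.0/16` stands in for the EU-Central-1 Database VPC range, the namespace and pod label are hypothetical, and the ports are the defaults for PostgreSQL, DocumentDB, and Redis.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-egress-to-eu-databases  # hypothetical name
  namespace: backend                  # hypothetical namespace
spec:
  podSelector:
    matchLabels:
      app: account-microservice  # assumed pod label
  policyTypes:
  - Egress
  egress:
  # Database traffic across the peering connection.
  - to:
    - ipBlock:
        cidr: 10.20.0.0/16  # placeholder for the Database VPC CIDR
    ports:
    - protocol: TCP
      port: 5432   # PostgreSQL
    - protocol: TCP
      port: 27017  # DocumentDB
    - protocol: TCP
      port: 6379   # Redis
  # DNS lookups, which an egress policy would otherwise block.
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
```

Once a pod is selected by an egress policy, all other outbound traffic from it is denied, so any further destinations the service needs would have to be added as extra rules.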

Best Practices

Infrastructure as Code

  1. Version Control: All Terraform code in Git
  2. Code Review: Pull requests required for infrastructure changes
  3. Testing: Use terraform plan before applying
  4. Modules: Reusable modules for common patterns
  5. Documentation: Inline comments and README files

Security

  1. Least Privilege: Minimal IAM permissions
  2. Encryption: Encrypt data at rest and in transit
  3. Network Segmentation: Use security groups and network policies
  4. Secrets Management: Never hardcode secrets
  5. Audit Logging: Enable CloudTrail and VPC Flow Logs

Cost Optimization

  1. Right-Sizing: Use appropriate instance types
  2. Spot Instances: Use for non-critical workloads
  3. Auto-Scaling: Scale down during low usage
  4. Reserved Instances: For predictable workloads
  5. Storage Lifecycle: Archive old data to cheaper storage

High Availability

  1. Multi-AZ: Deploy across multiple availability zones
  2. Load Balancing: Distribute traffic across instances
  3. Health Checks: Automatic failure detection
  4. Backups: Regular automated backups
  5. Disaster Recovery: Tested recovery procedures

Troubleshooting

Common Issues

Cluster Access Issues

Symptom: Unable to connect to Kubernetes cluster

Solutions:

  1. Update kubeconfig: aws eks update-kubeconfig --name <cluster-name>
  2. Verify IAM permissions
  3. Check security group rules
  4. Verify VPN connection (if required)

Storage Issues

Symptom: PVC stuck in Pending state

Solutions:

  1. Check StorageClass exists: kubectl get storageclass
  2. Verify EBS CSI driver is running: kubectl get pods -n kube-system | grep ebs
  3. Check AWS quotas for EBS volumes
  4. Review PVC events: kubectl describe pvc <pvc-name>

Networking Issues

Symptom: Services not accessible

Solutions:

  1. Check Ingress configuration: kubectl get ingress
  2. Verify DNS records in Cloudflare/Route53
  3. Check certificate status: kubectl get certificate
  4. Review NGINX Ingress logs: kubectl logs -n ingress-nginx <pod-name>