AWS Infrastructure Documentation

Complete guide to OVES AWS resources, managed via Terraform.

Overview

All OVES infrastructure runs on Amazon Web Services (AWS), managed entirely through Terraform Infrastructure as Code. Resources span multiple regions with separate configurations for development and production environments.

AWS Architecture

┌─────────────────────────────────────────────────────────────────┐
│                         AWS Account                             │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                    US-East-1 (Primary)                   │  │
│  │                                                          │  │
│  │  ┌────────────────┐  ┌────────────────┐                 │  │
│  │  │  Production    │  │  Development   │                 │  │
│  │  │     VPC        │  │      VPC       │                 │  │
│  │  │                │  │                │                 │  │
│  │  │  - EKS Cluster │  │  - EKS Cluster │                 │  │
│  │  │  - EC2 Instances│  │  - EC2 Instances│                 │  │
│  │  │  - RDS (some)  │  │  - RDS (dev)   │                 │  │
│  │  │  - ElastiCache │  │  - ElastiCache │                 │  │
│  │  └────────────────┘  └────────────────┘                 │  │
│  │                                                          │  │
│  │  ┌────────────────────────────────────────────────────┐  │  │
│  │  │           Shared Services                          │  │  │
│  │  │  - S3 Buckets (backups, logs, terraform state)    │  │  │
│  │  │  - IAM Roles & Policies                            │  │  │
│  │  │  - Route53 Hosted Zones                            │  │  │
│  │  │  - CloudWatch Logs & Metrics                       │  │  │
│  │  │  - ECR (deprecated, using ghcr.io now)             │  │  │
│  │  └────────────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                 │
│                      VPC Peering Connection                     │
│                               │                                 │
│  ┌────────────────────────────▼─────────────────────────────┐  │
│  │                  EU-Central-1 (Secondary)                │  │
│  │                                                          │  │
│  │  ┌────────────────────────────────────────────────────┐  │  │
│  │  │         Production Database VPC                    │  │  │
│  │  │  - RDS PostgreSQL (primary production DB)          │  │  │
│  │  │  - DocumentDB (MongoDB compatible)                 │  │  │
│  │  │  - ElastiCache Redis                               │  │  │
│  │  └────────────────────────────────────────────────────┘  │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                 │
│  External DNS: Cloudflare (Primary), Route53 (China access)    │
└─────────────────────────────────────────────────────────────────┘

Core AWS Services

1. EKS (Elastic Kubernetes Service)

Production Cluster (oves-prod):

- Region: us-east-1
- Version: 1.28+
- Node Groups: General (t3.large), Compute (c5.xlarge), Memory (r5.large)
- Managed via Terraform
- Private API endpoint

Development Cluster (oves-dev):

- Region: us-east-1
- Version: 1.28+
- Node Groups: General (t3.medium), Spot instances
- Public API endpoint

Terraform Configuration:

module "eks_prod" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "oves-prod"
  cluster_version = "1.28"

  vpc_id     = module.vpc_prod.vpc_id
  subnet_ids = module.vpc_prod.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["t3.large"]
      min_size       = 3
      max_size       = 10
      desired_size   = 3
    }
    compute = {
      instance_types = ["c5.xlarge"]
      min_size       = 2
      max_size       = 5
      desired_size   = 2
    }
  }

  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent = true
    }
    aws-ebs-csi-driver = {
      most_recent = true
    }
  }
}
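
The development cluster described above has no configuration shown. A minimal sketch of what it might look like with the same module follows; the `module.vpc_dev` name, the `spot` node group name, and all node group sizes are assumptions rather than values taken from the OVES repository.

```hcl
module "eks_dev" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "oves-dev"
  cluster_version = "1.28"

  # The dev cluster exposes a public API endpoint, per the summary above.
  cluster_endpoint_public_access = true

  # Assumed module name for the development VPC.
  vpc_id     = module.vpc_dev.vpc_id
  subnet_ids = module.vpc_dev.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["t3.medium"]
      min_size       = 1 # sizes are placeholders
      max_size       = 3
      desired_size   = 1
    }
    spot = {
      instance_types = ["t3.medium"]
      capacity_type  = "SPOT" # request Spot capacity instead of On-Demand
      min_size       = 0
      max_size       = 3
      desired_size   = 1
    }
  }
}
```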

2. EC2 (Elastic Compute Cloud)

Use Cases:

- Legacy applications not yet containerized
- Services requiring specific OS configurations
- Jump hosts / bastion servers
- CI/CD runners (self-hosted)

Instance Types:

- t3.medium: General purpose workloads
- t3.large: Higher capacity needs
- c5.large: Compute-intensive tasks

Terraform Example:

resource "aws_instance" "app_server" {
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.medium"

  subnet_id              = module.vpc_prod.private_subnets[0]
  vpc_security_group_ids = [aws_security_group.app.id]

  iam_instance_profile = aws_iam_instance_profile.app.name

  user_data = templatefile("${path.module}/user_data.sh", {
    environment = "production"
  })

  tags = {
    Name        = "app-server-prod"
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
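
The instance above references `data.aws_ami.amazon_linux_2`, which is not defined in this guide. One way to define it, assuming the standard x86_64 Amazon Linux 2 HVM image is what is wanted:

```hcl
# Look up the most recent Amazon Linux 2 AMI published by Amazon.
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}
```

Because `most_recent = true` resolves at plan time, a newly published AMI will show up as an instance replacement in the next plan; pin a specific AMI ID if that is undesirable.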

3. VPC (Virtual Private Cloud)

Production VPC:

- CIDR: 10.0.0.0/16
- Availability Zones: 3
- Public Subnets: 3 (10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24)
- Private Subnets: 3 (10.0.11.0/24, 10.0.12.0/24, 10.0.13.0/24)
- NAT Gateways: 3 (one per AZ)
- Internet Gateway: 1

Development VPC:

- CIDR: 10.1.0.0/16
- Availability Zones: 1 (cost optimization)
- Public Subnets: 1
- Private Subnets: 1
- NAT Gateway: 1

Terraform Configuration:

module "vpc_prod" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "oves-prod-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = false
  enable_dns_hostnames = true
  enable_dns_support   = true

  public_subnet_tags = {
    "kubernetes.io/role/elb" = 1
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
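
For comparison, the development VPC summarized above could be sketched with the same module. The module name `vpc_dev`, the VPC name, the choice of `us-east-1a`, and the two subnet CIDRs are assumptions; only the 10.1.0.0/16 range and the single-AZ, single-NAT layout come from the description.

```hcl
module "vpc_dev" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "oves-dev-vpc" # assumed name
  cidr = "10.1.0.0/16"

  # One AZ with one subnet of each type, as listed above.
  # Note: EKS itself requires subnets in at least two AZs, so a cluster
  # in this VPC would need a second subnet pair in another zone.
  azs             = ["us-east-1a"]
  private_subnets = ["10.1.11.0/24"] # assumed CIDR
  public_subnets  = ["10.1.1.0/24"]  # assumed CIDR

  enable_nat_gateway   = true
  single_nat_gateway   = true # one shared NAT Gateway to reduce cost
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Environment = "development"
    ManagedBy   = "terraform"
  }
}
```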

4. RDS (Relational Database Service)

Production Databases:

- PostgreSQL 14.x (primary application database)
- MySQL 8.0 (legacy applications)
- Multi-AZ deployment for high availability
- Automated backups (7-day retention)
- Encryption at rest

Terraform Configuration:

resource "aws_db_instance" "postgres_prod" {
  identifier = "oves-prod-postgres"

  engine         = "postgres"
  engine_version = "14.9"
  instance_class = "db.t3.large"

  allocated_storage     = 100
  max_allocated_storage = 500
  storage_type          = "gp3"
  storage_encrypted     = true

  db_name  = "oves_production"
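  # NOTE: RDS for PostgreSQL treats "admin" as a reserved word and rejects it
  # as a master username; substitute a non-reserved name before applying.
  username = "admin"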
  password = var.db_password  # From Terraform Cloud / Vault

  multi_az               = true
  db_subnet_group_name   = aws_db_subnet_group.prod.name
  vpc_security_group_ids = [aws_security_group.rds.id]

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
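
The instance above depends on `var.db_password` and `aws_db_subnet_group.prod`, neither of which is shown. A minimal sketch of both, assuming the database sits in the private subnets of `module.vpc_prod` (the subnet group name is hypothetical):

```hcl
# Marked sensitive so the value is redacted in plan and apply output.
variable "db_password" {
  description = "Master password for the production PostgreSQL instance"
  type        = string
  sensitive   = true
}

# Restrict the database to private subnets only.
resource "aws_db_subnet_group" "prod" {
  name       = "oves-prod-db" # hypothetical name
  subnet_ids = module.vpc_prod.private_subnets

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
```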

5. S3 (Simple Storage Service)

Buckets:

| Bucket Name          | Purpose          | Versioning | Lifecycle         |
|----------------------|------------------|------------|-------------------|
| oves-backups-prod    | Database backups | Enabled    | 90 days → Glacier |
| oves-logs-prod       | Application logs | Disabled   | 30 days → Delete  |
| oves-terraform-state | Terraform state  | Enabled    | Never delete      |
| oves-artifacts-prod  | Build artifacts  | Enabled    | 180 days → Delete |
| oves-static-prod     | Static assets    | Disabled   | Never delete      |

Terraform Configuration:

resource "aws_s3_bucket" "backups" {
  bucket = "oves-backups-prod"

  tags = {
    Name        = "Production Backups"
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_versioning" "backups" {
  bucket = aws_s3_bucket.backups.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "backups" {
  bucket = aws_s3_bucket.backups.id

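  rule {
    id     = "archive-old-backups"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket.
    # Recent AWS provider versions warn when a rule has no filter or prefix.
    filter {}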

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    expiration {
      days = 365
    }
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "backups" {
  bucket = aws_s3_bucket.backups.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
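
The same pattern covers the other lifecycle policies in the table. As an illustration, the 30-day deletion rule for `oves-logs-prod` might be expressed as follows; the resource names and rule ID are assumptions.

```hcl
resource "aws_s3_bucket" "logs" {
  bucket = "oves-logs-prod"

  tags = {
    Name        = "Production Logs"
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    id     = "expire-old-logs" # hypothetical rule ID
    status = "Enabled"

    # Apply to all objects in the bucket.
    filter {}

    # Delete objects 30 days after creation; no Glacier transition.
    expiration {
      days = 30
    }
  }
}
```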

6. ElastiCache (Redis)

Production Redis:

- Engine: Redis 7.x
- Node Type: cache.t3.medium
- Number of Nodes: 2 (primary + replica)
- Multi-AZ: Enabled
- Encryption: In-transit and at-rest

Terraform Configuration:

resource "aws_elasticache_replication_group" "redis_prod" {
  replication_group_id = "oves-prod-redis"
  # `replication_group_description` was removed in AWS provider v5; use `description`.
  description          = "Production Redis cluster"

  engine         = "redis"
  engine_version = "7.0"
  node_type      = "cache.t3.medium"

  num_cache_clusters         = 2
  automatic_failover_enabled = true
  multi_az_enabled           = true

  subnet_group_name  = aws_elasticache_subnet_group.prod.name
  security_group_ids = [aws_security_group.redis.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = var.redis_auth_token

  snapshot_retention_limit = 5
  snapshot_window          = "03:00-05:00"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
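
The replication group references `aws_security_group.redis` without defining it. A restrictive version, assuming only workloads in the production private subnets need to reach Redis, could look like this (the name prefix is hypothetical):

```hcl
resource "aws_security_group" "redis" {
  name_prefix = "oves-prod-redis-" # hypothetical prefix
  description = "Allow Redis traffic from production private subnets"
  vpc_id      = module.vpc_prod.vpc_id

  # Redis listens on 6379; allow it only from the private subnet ranges.
  ingress {
    description = "Redis from private subnets"
    from_port   = 6379
    to_port     = 6379
    protocol    = "tcp"
    cidr_blocks = module.vpc_prod.private_subnets_cidr_blocks
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
```

No egress rule is declared because the cache nodes do not need to initiate outbound connections in this sketch.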

7. IAM (Identity and Access Management)

Key Roles:

- EKS cluster role
- EKS node group role
- IRSA (IAM Roles for Service Accounts) for pods
- EC2 instance profiles
- Lambda execution roles

Terraform Example (IRSA):

module "irsa_account_microservice" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name = "account-microservice"

  role_policy_arns = {
    policy = aws_iam_policy.account_microservice.arn
  }

  oidc_providers = {
    main = {
      provider_arn               = module.eks_prod.oidc_provider_arn
      namespace_service_accounts = ["production:account-microservice"]
    }
  }
}

resource "aws_iam_policy" "account_microservice" {
  name = "account-microservice-policy"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Resource = "${aws_s3_bucket.backups.arn}/*"
      }
    ]
  })
}
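
The IAM role only takes effect once a Kubernetes service account carries the matching annotation. Assuming the service account is also managed from Terraform through a configured `kubernetes` provider (this guide does not say whether it is), the link could be sketched as:

```hcl
# Namespace and name must match the "production:account-microservice"
# entry in namespace_service_accounts above.
resource "kubernetes_service_account" "account_microservice" {
  metadata {
    name      = "account-microservice"
    namespace = "production"

    annotations = {
      # Tells the EKS pod identity webhook which IAM role to assume.
      "eks.amazonaws.com/role-arn" = module.irsa_account_microservice.iam_role_arn
    }
  }
}
```

Pods that set `serviceAccountName: account-microservice` in the `production` namespace then receive temporary credentials for that role. If service accounts are created by Helm charts or manifests instead, the same annotation applies there.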

8. Route53 (DNS)

Hosted Zones:

- omnivoltaic.com: Primary domain (for China access)
- Internal zones for service discovery

Terraform Configuration:

resource "aws_route53_zone" "main" {
  name = "omnivoltaic.com"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_route53_record" "api_china" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "cn.omnivoltaic.com"
  type    = "A"

  alias {
    name                   = aws_lb.api.dns_name
    zone_id                = aws_lb.api.zone_id
    evaluate_target_health = true
  }
}
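
The internal zones mentioned above are not shown. A private hosted zone is created by attaching a VPC to the zone; in this sketch the zone name is purely hypothetical.

```hcl
# Associating a VPC makes this a private hosted zone, resolvable
# only from inside that VPC.
resource "aws_route53_zone" "internal" {
  name = "internal.omnivoltaic.com" # hypothetical zone name

  vpc {
    vpc_id = module.vpc_prod.vpc_id
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
```

Private zones rely on the `enable_dns_support` and `enable_dns_hostnames` settings already enabled on the production VPC.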

9. CloudWatch

Log Groups:

- /aws/eks/oves-prod/cluster: EKS control plane logs
- /aws/rds/instance/oves-prod-postgres/postgresql: Database logs
- /aws/lambda/*: Lambda function logs
- /aws/ec2/*: EC2 instance logs

Alarms:

- High CPU utilization
- Low disk space
- RDS connection count
- ELB unhealthy targets

Terraform Example:

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "eks-node-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "Average CPU utilization of the general EKS node group above 80% for 10 minutes"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    # Output names follow the terraform-aws-modules/eks v19 module used for oves-prod.
    AutoScalingGroupName = module.eks_prod.eks_managed_node_groups["general"].node_group_autoscaling_group_names[0]
  }
}
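
The alarm above publishes to `aws_sns_topic.alerts`, which is not defined in this guide. The sketch below defines that topic and adds the RDS connection-count alarm from the list; the topic name and the threshold of 100 connections are placeholders, not OVES values.

```hcl
resource "aws_sns_topic" "alerts" {
  name = "oves-prod-alerts" # hypothetical topic name
}

resource "aws_cloudwatch_metric_alarm" "rds_connections" {
  alarm_name          = "rds-high-connection-count"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "DatabaseConnections"
  namespace           = "AWS/RDS"
  period              = 300
  statistic           = "Average"
  threshold           = 100 # placeholder; tune to the instance's max_connections
  alarm_description   = "Average RDS connection count above threshold for 10 minutes"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.postgres_prod.identifier
  }
}
```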

VPC Peering

Purpose: Connect the US-East-1 production cluster to the EU-Central-1 databases.

Configuration:

resource "aws_vpc_peering_connection" "us_to_eu" {
  vpc_id        = module.vpc_prod_us.vpc_id
  peer_vpc_id   = module.vpc_prod_eu.vpc_id
  peer_region   = "eu-central-1"
  auto_accept   = false

  tags = {
    Name = "US-East-1 to EU-Central-1"
  }
}

resource "aws_vpc_peering_connection_accepter" "eu" {
  provider                  = aws.eu
  vpc_peering_connection_id = aws_vpc_peering_connection.us_to_eu.id
  auto_accept               = true
}

resource "aws_route" "us_to_eu" {
  route_table_id            = module.vpc_prod_us.private_route_table_ids[0]
  destination_cidr_block    = module.vpc_prod_eu.vpc_cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.us_to_eu.id
}
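
Two pieces are implied by the configuration above but not shown: the `aws.eu` provider alias, and routes in the opposite direction. With one NAT Gateway per AZ the US VPC also has several private route tables, while the route above covers only the first. The following is a sketch under those assumptions, not the actual OVES configuration; `us_to_eu_all` is meant as a replacement for the single `aws_route.us_to_eu` resource, since keeping both would create a duplicate route.

```hcl
# Provider alias used by the peering accepter and the EU-side routes.
provider "aws" {
  alias  = "eu"
  region = "eu-central-1"
}

# One route per US private route table (replaces aws_route.us_to_eu).
resource "aws_route" "us_to_eu_all" {
  count = length(module.vpc_prod_us.private_route_table_ids)

  route_table_id            = module.vpc_prod_us.private_route_table_ids[count.index]
  destination_cidr_block    = module.vpc_prod_eu.vpc_cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.us_to_eu.id
}

# Return routes so the EU databases can reply to the US cluster.
resource "aws_route" "eu_to_us" {
  provider = aws.eu
  count    = length(module.vpc_prod_eu.private_route_table_ids)

  route_table_id            = module.vpc_prod_eu.private_route_table_ids[count.index]
  destination_cidr_block    = module.vpc_prod_us.vpc_cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.us_to_eu.id
}
```

Peering also requires that the two VPC CIDR ranges do not overlap, and security groups on the EU side must allow traffic from the US CIDR.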

Terraform Structure

terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   ├── backend.tf
│   │   └── outputs.tf
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       ├── backend.tf
│       └── outputs.tf
├── modules/
│   ├── eks/
│   ├── vpc/
│   ├── rds/
│   ├── s3/
│   ├── iam/
│   └── security-groups/
└── shared/
    ├── route53.tf
    └── cloudwatch.tf
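
Each environment's `backend.tf` points Terraform at its remote state. Given the `oves-terraform-state` bucket listed in the S3 section, a production backend might look like the sketch below; the object key and the bucket's region are assumptions.

```hcl
terraform {
  backend "s3" {
    bucket  = "oves-terraform-state"
    key     = "environments/prod/terraform.tfstate" # hypothetical key
    region  = "us-east-1"                           # assumed bucket region
    encrypt = true                                  # server-side encrypt the state object
  }
}
```

Using a distinct `key` per environment keeps dev and prod state separate within the one bucket.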

Cost Optimization

Strategies

  1. Use Spot Instances (Dev cluster)
  2. Right-size Resources (Regular reviews)
  3. S3 Lifecycle Policies (Archive old data)
  4. Reserved Instances (For predictable workloads)
  5. Auto-scaling (Scale down during low usage)
  6. Single NAT Gateway (Dev environment)

Security Best Practices

  1. Encryption: All data encrypted at rest and in transit
  2. IAM: Least privilege access
  3. Security Groups: Restrictive rules
  4. VPC: Private subnets for databases
  5. Secrets: Never in code, use Secrets Manager/Vault
  6. Logging: CloudTrail enabled for audit
  7. MFA: Required for console access
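
As an illustration of item 5, a database password can be read from Secrets Manager at plan time instead of being passed in as a variable. The secret name below is hypothetical.

```hcl
# Read the current version of an existing secret.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "oves/prod/postgres/password" # hypothetical secret name
}

# Expose the value to other configuration without printing it in CLI output.
locals {
  db_password = sensitive(data.aws_secretsmanager_secret_version.db_password.secret_string)
}
```

A resource would then use `password = local.db_password`. The resolved value is still written to the Terraform state, so the state bucket needs the same encryption and access controls as the secret itself.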

Common Operations

Terraform Commands

# Initialize
cd terraform/environments/prod
terraform init

# Plan changes
terraform plan

# Apply changes
terraform apply

# Destroy resources (careful!)
terraform destroy

# View state
terraform state list
terraform state show 'module.eks_prod.aws_eks_cluster.this[0]'

AWS CLI Commands

# List EKS clusters
aws eks list-clusters --region us-east-1

# Describe cluster
aws eks describe-cluster --name oves-prod --region us-east-1

# List EC2 instances
aws ec2 describe-instances --region us-east-1

# List S3 buckets
aws s3 ls

# View CloudWatch logs
aws logs tail /aws/eks/oves-prod/cluster --follow --region us-east-1

Troubleshooting

EKS Cluster Issues

# Check cluster status
aws eks describe-cluster --name oves-prod --region us-east-1

# View node group status
aws eks describe-nodegroup --cluster-name oves-prod --nodegroup-name general --region us-east-1

# Check CloudWatch logs
aws logs tail /aws/eks/oves-prod/cluster --follow --region us-east-1

RDS Connection Issues

# Check RDS status
aws rds describe-db-instances --db-instance-identifier oves-prod-postgres

# Test connectivity
psql -h <endpoint> -U admin -d oves_production

# Check security groups
aws ec2 describe-security-groups --group-ids sg-xxxxx