AWS Infrastructure Documentation¶

Complete guide to OVES AWS resources, managed via Terraform.

Overview¶

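All OVES infrastructure runs on Amazon Web Services (AWS) and is managed entirely through Terraform Infrastructure as Code. Resources span two regions: us-east-1 (primary, hosting the Kubernetes clusters and shared services) and eu-central-1 (production databases). Development and production environments have separate configurations.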

AWS Architecture¶

┌─────────────────────────────────────────────────────────────────┐
│                           AWS Account                           │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    US-East-1 (Primary)                    │  │
│  │                                                           │  │
│  │  ┌──────────────────┐  ┌──────────────────┐               │  │
│  │  │  Production      │  │  Development     │               │  │
│  │  │  VPC             │  │  VPC             │               │  │
│  │  │                  │  │                  │               │  │
│  │  │  - EKS Cluster   │  │  - EKS Cluster   │               │  │
│  │  │  - EC2 Instances │  │  - EC2 Instances │               │  │
│  │  │  - RDS (some)    │  │  - RDS (dev)     │               │  │
│  │  │  - ElastiCache   │  │  - ElastiCache   │               │  │
│  │  └──────────────────┘  └──────────────────┘               │  │
│  │                                                           │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │  Shared Services                                    │  │  │
│  │  │  - S3 Buckets (backups, logs, terraform state)      │  │  │
│  │  │  - IAM Roles & Policies                             │  │  │
│  │  │  - Route53 Hosted Zones                             │  │  │
│  │  │  - CloudWatch Logs & Metrics                        │  │  │
│  │  │  - ECR (deprecated, using ghcr.io now)              │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│                     VPC Peering Connection                      │
│                                │                                │
│  ┌─────────────────────────────▼─────────────────────────────┐  │
│  │                 EU-Central-1 (Secondary)                  │  │
│  │                                                           │  │
│  │  ┌─────────────────────────────────────────────────────┐  │  │
│  │  │  Production Database VPC                            │  │  │
│  │  │  - RDS PostgreSQL (primary production DB)           │  │  │
│  │  │  - DocumentDB (MongoDB-compatible)                  │  │  │
│  │  │  - ElastiCache Redis                                │  │  │
│  │  └─────────────────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  External DNS: Cloudflare (Primary), Route53 (China access)     │
└─────────────────────────────────────────────────────────────────┘

Core AWS Services¶

1. EKS (Elastic Kubernetes Service)¶

Production Cluster (oves-prod):
- Region: us-east-1
- Version: 1.28+
- Node groups: General (t3.large), Compute (c5.xlarge), Memory (r5.large)
- Managed via Terraform
- Private API endpoint

Development Cluster (oves-dev):
- Region: us-east-1
- Version: 1.28+
- Node groups: General (t3.medium), Spot instances
- Public API endpoint

Terraform Configuration:

module "eks_prod" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "oves-prod"
  cluster_version = "1.28"

  vpc_id     = module.vpc_prod.vpc_id
  subnet_ids = module.vpc_prod.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["t3.large"]
      min_size       = 3
      max_size       = 10
      desired_size   = 3
    }
    compute = {
      instance_types = ["c5.xlarge"]
      min_size       = 2
      max_size       = 5
      desired_size   = 2
    }
  }

  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent = true
    }
    aws-ebs-csi-driver = {
      most_recent = true
    }
  }
}
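The development cluster runs on Spot capacity. The sketch below shows how its node group could be declared with the same module. It assumes a `module.vpc_dev` VPC module. The node-group name `spot` and the min/max/desired sizes are hypothetical.

```hcl
module "eks_dev" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "oves-dev"
  cluster_version = "1.28"

  vpc_id     = module.vpc_dev.vpc_id # hypothetical dev VPC module
  subnet_ids = module.vpc_dev.private_subnets

  # The dev cluster exposes a public API endpoint (module v19 defaults to private only)
  cluster_endpoint_public_access = true

  eks_managed_node_groups = {
    # Hypothetical name and sizes; SPOT capacity trades interruption risk for lower cost.
    # Listing several similar instance types usually improves Spot availability.
    spot = {
      capacity_type  = "SPOT"
      instance_types = ["t3.medium"]
      min_size       = 1
      max_size       = 3
      desired_size   = 1
    }
  }
}
```

EKS requires cluster subnets in at least two Availability Zones. A single-AZ development VPC therefore needs a second subnet before this configuration can be applied.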

2. EC2 (Elastic Compute Cloud)¶

Use Cases:
- Legacy applications not yet containerized
- Services requiring specific OS configurations
- Jump hosts / bastion servers
- CI/CD runners (self-hosted)

Instance Types:
- t3.medium: general-purpose workloads
- t3.large: higher capacity needs
- c5.large: compute-intensive tasks

Terraform Example:

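# Latest Amazon Linux 2 (x86_64, HVM) image published by Amazon
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

resource "aws_instance" "app_server" {
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.medium"

  # First private subnet of the production VPC (module.vpc_prod)
  subnet_id              = module.vpc_prod.private_subnets[0]
  vpc_security_group_ids = [aws_security_group.app.id]

  iam_instance_profile = aws_iam_instance_profile.app.name

  user_data = templatefile("${path.module}/user_data.sh", {
    environment = "production"
  })

  tags = {
    Name        = "app-server-prod"
    Environment = "production"
    ManagedBy   = "terraform"
  }
}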
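The instance references `aws_iam_instance_profile.app`, which this page does not define. The sketch below shows one way to define it, under two assumptions: instances are reached through SSM Session Manager (hence the AWS-managed `AmazonSSMManagedInstanceCore` policy), and the role/profile names are hypothetical.

```hcl
# Trust policy that lets EC2 instances assume the role
data "aws_iam_policy_document" "ec2_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "app" {
  name               = "app-server-prod" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.ec2_assume.json
}

# Assumption: instances are managed through SSM Session Manager
resource "aws_iam_role_policy_attachment" "app_ssm" {
  role       = aws_iam_role.app.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_instance_profile" "app" {
  name = "app-server-prod" # hypothetical name
  role = aws_iam_role.app.name
}
```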

3. VPC (Virtual Private Cloud)¶

Production VPC:
- CIDR: 10.0.0.0/16
- Availability Zones: 3
- Public subnets: 3 (10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24)
- Private subnets: 3 (10.0.11.0/24, 10.0.12.0/24, 10.0.13.0/24)
- NAT gateways: 3 (one per AZ)
- Internet gateway: 1

Development VPC:
- CIDR: 10.1.0.0/16
- Availability Zones: 1 (cost optimization)
- Public subnets: 1
- Private subnets: 1
- NAT gateway: 1

Terraform Configuration:

module "vpc_prod" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "oves-prod-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]

  enable_nat_gateway   = true
  single_nat_gateway   = false
  enable_dns_hostnames = true
  enable_dns_support   = true

  public_subnet_tags = {
    "kubernetes.io/role/elb" = 1
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
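The development VPC can be declared with the same module. The sketch below follows the dev settings listed above and assumes subnet CIDRs that mirror the production numbering (10.1.1.0/24 public, 10.1.11.0/24 private). Those two CIDRs and the module name `vpc_dev` are hypothetical.

```hcl
module "vpc_dev" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "oves-dev-vpc"
  cidr = "10.1.0.0/16"

  # One AZ with one public and one private subnet (subnet CIDRs are assumptions)
  azs             = ["us-east-1a"]
  private_subnets = ["10.1.11.0/24"]
  public_subnets  = ["10.1.1.0/24"]

  # A single NAT gateway instead of one per AZ, to save cost
  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Environment = "development"
    ManagedBy   = "terraform"
  }
}
```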

4. RDS (Relational Database Service)¶

Production Databases:
- PostgreSQL 14.x (primary application database)
- MySQL 8.0 (legacy applications)
- Multi-AZ deployment for high availability
- Automated backups (7-day retention)
- Encryption at rest

Terraform Configuration:

resource "aws_db_instance" "postgres_prod" {
  identifier = "oves-prod-postgres"

  engine         = "postgres"
  engine_version = "14.9"
  instance_class = "db.t3.large"

  allocated_storage     = 100
  max_allocated_storage = 500
  storage_type          = "gp3"
  storage_encrypted     = true

  db_name  = "oves_production"
  username = "admin"
  password = var.db_password  # From Terraform Cloud / Vault

  multi_az               = true
  db_subnet_group_name   = aws_db_subnet_group.prod.name
  vpc_security_group_ids = [aws_security_group.rds.id]

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
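The instance above references `aws_db_subnet_group.prod` and `aws_security_group.rds` without showing them. Here is a minimal sketch. It assumes the database lives in the private subnets of `module.vpc_prod` and accepts PostgreSQL traffic only from inside that VPC. The resource names are hypothetical.

```hcl
resource "aws_db_subnet_group" "prod" {
  name       = "oves-prod-postgres"            # hypothetical name
  subnet_ids = module.vpc_prod.private_subnets # assumption: same VPC as the EKS cluster
}

resource "aws_security_group" "rds" {
  name   = "oves-prod-rds" # hypothetical name
  vpc_id = module.vpc_prod.vpc_id

  # Allow PostgreSQL (5432/tcp) only from addresses inside the production VPC
  ingress {
    description = "PostgreSQL from the production VPC"
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_blocks = [module.vpc_prod.vpc_cidr_block]
  }
}
```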

5. S3 (Simple Storage Service)¶

Buckets:

| Bucket Name | Purpose | Versioning | Lifecycle |
|-------------|---------|------------|-----------|
| `oves-backups-prod` | Database backups | Enabled | 90 days → Glacier, deleted after 365 days |
| `oves-logs-prod` | Application logs | Disabled | Deleted after 30 days |
| `oves-terraform-state` | Terraform state | Enabled | Never deleted |
| `oves-artifacts-prod` | Build artifacts | Enabled | Deleted after 180 days |
| `oves-static-prod` | Static assets | Disabled | Never deleted |

Terraform Configuration:

resource "aws_s3_bucket" "backups" {
  bucket = "oves-backups-prod"

  tags = {
    Name        = "Production Backups"
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_versioning" "backups" {
  bucket = aws_s3_bucket.backups.id

  versioning_configuration {
    status = "Enabled"
  }
}

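resource "aws_s3_bucket_lifecycle_configuration" "backups" {
  bucket = aws_s3_bucket.backups.id

  rule {
    id     = "archive-old-backups"
    status = "Enabled"

    # Empty filter: the rule applies to every object in the bucket
    filter {}

    # Move backups to Glacier after 90 days...
    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    # ...and delete them after one year
    expiration {
      days = 365
    }
  }
}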

resource "aws_s3_bucket_server_side_encryption_configuration" "backups" {
  bucket = aws_s3_bucket.backups.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
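The page does not say whether public access is blocked on the backup bucket. For a bucket holding database dumps, a public access block is a common companion to encryption. This is a sketch, not part of the documented configuration:

```hcl
# Reject any ACL or bucket policy that would make backup objects public
resource "aws_s3_bucket_public_access_block" "backups" {
  bucket = aws_s3_bucket.backups.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```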

6. ElastiCache (Redis)¶

Production Redis:
- Engine: Redis 7.x
- Node type: cache.t3.medium
- Number of nodes: 2 (primary + replica)
- Multi-AZ: enabled
- Encryption: in transit and at rest

Terraform Configuration:

resource "aws_elasticache_replication_group" "redis_prod" {
  replication_group_id = "oves-prod-redis"
  # `description` replaces `replication_group_description`, which AWS provider v4
  # deprecated and v5 removed
  description = "Production Redis cluster"

  engine         = "redis"
  engine_version = "7.0"
  node_type      = "cache.t3.medium"

  # One primary plus one replica, with automatic failover across AZs
  num_cache_clusters         = 2
  automatic_failover_enabled = true
  multi_az_enabled           = true

  subnet_group_name  = aws_elasticache_subnet_group.prod.name
  security_group_ids = [aws_security_group.redis.id]

  # auth_token requires transit encryption to be enabled
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = var.redis_auth_token

  snapshot_retention_limit = 5
  snapshot_window          = "03:00-05:00"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
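The replication group references `aws_elasticache_subnet_group.prod` and `aws_security_group.redis`. The sketch below assumes the cache sits in the production VPC's private subnets and accepts Redis traffic (6379/tcp) only from inside that VPC. The resource names are hypothetical.

```hcl
resource "aws_elasticache_subnet_group" "prod" {
  name       = "oves-prod-redis"               # hypothetical name
  subnet_ids = module.vpc_prod.private_subnets # assumption: production VPC private subnets
}

resource "aws_security_group" "redis" {
  name   = "oves-prod-redis" # hypothetical name
  vpc_id = module.vpc_prod.vpc_id

  # Allow Redis only from addresses inside the production VPC
  ingress {
    description = "Redis from the production VPC"
    from_port   = 6379
    to_port     = 6379
    protocol    = "tcp"
    cidr_blocks = [module.vpc_prod.vpc_cidr_block]
  }
}
```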

7. IAM (Identity and Access Management)¶

Key Roles:
- EKS cluster role
- EKS node group role
- IRSA (IAM Roles for Service Accounts) for pods
- EC2 instance profiles
- Lambda execution roles

Terraform Example (IRSA):

module "irsa_account_microservice" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name = "account-microservice"

  role_policy_arns = {
    policy = aws_iam_policy.account_microservice.arn
  }

  oidc_providers = {
    main = {
      provider_arn               = module.eks_prod.oidc_provider_arn
      namespace_service_accounts = ["production:account-microservice"]
    }
  }
}

resource "aws_iam_policy" "account_microservice" {
  name = "account-microservice-policy"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Resource = "${aws_s3_bucket.backups.arn}/*"
      }
    ]
  })
}
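IRSA only takes effect once the Kubernetes service account named in `namespace_service_accounts` carries the role ARN as an annotation. The sketch below creates that service account from Terraform. It assumes the `hashicorp/kubernetes` provider is already configured against the `oves-prod` cluster; the manifests may instead be managed outside Terraform.

```hcl
# Assumes a kubernetes provider configured for the oves-prod cluster
resource "kubernetes_service_account" "account_microservice" {
  metadata {
    name      = "account-microservice"
    namespace = "production"

    annotations = {
      # EKS injects credentials for this role into pods using this service account
      "eks.amazonaws.com/role-arn" = module.irsa_account_microservice.iam_role_arn
    }
  }
}
```

Pods pick up the role only if their spec sets `serviceAccountName: account-microservice`.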

8. Route53 (DNS)¶

Hosted Zones:
- omnivoltaic.com: the primary domain; this Route53 zone serves China access, while Cloudflare remains the primary external DNS
- Internal zones for service discovery

Terraform Configuration:

resource "aws_route53_zone" "main" {
  name = "omnivoltaic.com"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_route53_record" "api_china" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "cn.omnivoltaic.com"
  type    = "A"

  alias {
    name                   = aws_lb.api.dns_name
    zone_id                = aws_lb.api.zone_id
    evaluate_target_health = true
  }
}
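Internal zones for service discovery are private hosted zones: associating a VPC makes the zone resolvable only from inside that VPC. The sketch below shows one such zone. The domain `oves.internal` and the `postgres` record are hypothetical, and the zone relies on the DNS support and hostnames settings already enabled on the production VPC.

```hcl
resource "aws_route53_zone" "internal" {
  name = "oves.internal" # hypothetical internal domain

  # The vpc block makes this a private hosted zone
  vpc {
    vpc_id = module.vpc_prod.vpc_id
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# Hypothetical stable internal name pointing at the RDS endpoint
resource "aws_route53_record" "postgres_internal" {
  zone_id = aws_route53_zone.internal.zone_id
  name    = "postgres.oves.internal"
  type    = "CNAME"
  ttl     = 300
  records = [aws_db_instance.postgres_prod.address]
}
```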

9. CloudWatch¶

Log Groups:
- /aws/eks/oves-prod/cluster: EKS control plane logs
- /aws/rds/instance/oves-prod-postgres/postgresql: database logs
- /aws/lambda/*: Lambda function logs
- /aws/ec2/*: EC2 instance logs

Alarms:
- High CPU utilization
- Low disk space
- RDS connection count
- ELB unhealthy targets

Terraform Example:

resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "eks-node-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "Average CPU of the general EKS node group above 80% for two consecutive 5-minute periods"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    # EKS module v19 exposes managed node groups as eks_managed_node_groups;
    # node_group_resources holds the underlying Auto Scaling group
    AutoScalingGroupName = module.eks_prod.eks_managed_node_groups["general"].node_group_resources[0].autoscaling_groups[0].name
  }
}
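The alarm publishes to `aws_sns_topic.alerts`, which is not shown on this page. A minimal sketch follows. It assumes email delivery; the topic name and address are placeholders.

```hcl
resource "aws_sns_topic" "alerts" {
  name = "oves-prod-alerts" # hypothetical topic name
}

# Email subscriptions stay pending until the recipient confirms them
resource "aws_sns_topic_subscription" "alerts_email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "ops@example.com" # placeholder address
}
```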

VPC Peering¶

Purpose: Connect US-East-1 production cluster to EU-Central-1 databases

Configuration:

resource "aws_vpc_peering_connection" "us_to_eu" {
  vpc_id        = module.vpc_prod_us.vpc_id
  peer_vpc_id   = module.vpc_prod_eu.vpc_id
  peer_region   = "eu-central-1"
  auto_accept   = false

  tags = {
    Name = "US-East-1 to EU-Central-1"
  }
}

resource "aws_vpc_peering_connection_accepter" "eu" {
  provider                  = aws.eu
  vpc_peering_connection_id = aws_vpc_peering_connection.us_to_eu.id
  auto_accept               = true
}

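# One route per private route table (the production VPC has one per AZ),
# so workloads in every AZ can reach the EU databases
resource "aws_route" "us_to_eu" {
  count = length(module.vpc_prod_us.private_route_table_ids)

  route_table_id            = module.vpc_prod_us.private_route_table_ids[count.index]
  destination_cidr_block    = module.vpc_prod_eu.vpc_cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.us_to_eu.id
}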
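The route above only covers traffic from US-East-1 to EU-Central-1. Replies from the databases also need a route back in the EU VPC; without it, connections from the cluster cannot complete. The sketch below assumes the EU databases sit in the private subnets of `module.vpc_prod_eu` (if they use dedicated database subnets, the module's database route tables would be used instead).

```hcl
# Return route in the EU VPC, created with the same provider alias as the accepter
resource "aws_route" "eu_to_us" {
  provider = aws.eu
  # Assumption: the EU databases are in module.vpc_prod_eu's private subnets
  count = length(module.vpc_prod_eu.private_route_table_ids)

  route_table_id         = module.vpc_prod_eu.private_route_table_ids[count.index]
  destination_cidr_block = module.vpc_prod_us.vpc_cidr_block
  # Referencing the accepter makes Terraform create the route after acceptance
  vpc_peering_connection_id = aws_vpc_peering_connection_accepter.eu.id
}
```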

Terraform Structure¶

terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   ├── backend.tf
│   │   └── outputs.tf
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       ├── backend.tf
│       └── outputs.tf
├── modules/
│   ├── eks/
│   ├── vpc/
│   ├── rds/
│   ├── s3/
│   ├── iam/
│   └── security-groups/
└── shared/
    ├── route53.tf
    └── cloudwatch.tf
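Each environment's `backend.tf` points Terraform at the `oves-terraform-state` bucket listed in the S3 section. The sketch below is illustrative only. The state key layout, the bucket's region and the DynamoDB lock table name are assumptions, not documented values.

```hcl
# environments/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "oves-terraform-state"
    key            = "prod/terraform.tfstate" # hypothetical key layout
    region         = "us-east-1"              # assumption: state bucket in the primary region
    encrypt        = true
    dynamodb_table = "oves-terraform-locks"   # hypothetical state-lock table
  }
}
```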

Cost Optimization¶

Strategies¶

- Use Spot instances (dev cluster)
- Right-size resources (regular reviews)
- S3 lifecycle policies (archive old data)
- Reserved Instances (for predictable workloads)
- Auto-scaling (scale down during low usage)
- Single NAT gateway (dev environment)

Security Best Practices¶

- Encryption: all data encrypted at rest and in transit
- IAM: least-privilege access
- Security groups: restrictive rules
- VPC: private subnets for databases
- Secrets: never in code; use Secrets Manager or Vault
- Logging: CloudTrail enabled for audit
- MFA: required for console access
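One way to keep the database password out of code and `.tfvars` files is to read it from Secrets Manager at plan time. The sketch below assumes the secret already exists; its name is hypothetical.

```hcl
# Hypothetical secret, created and rotated outside this configuration
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "oves/prod/postgres-master-password"
}

locals {
  # Pass this to aws_db_instance.password instead of a literal value
  db_password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```

Values read this way are still written to the Terraform state. For that reason the state bucket must stay encrypted and tightly access-controlled.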

Common Operations¶

Terraform Commands¶

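# Initialize
cd terraform/environments/prod
terraform init

# Plan changes and save the plan for review
terraform plan -out=tfplan

# Apply exactly the plan that was reviewed
terraform apply tfplan

# Destroy resources (careful: deletes every resource managed in this environment)
terraform destroy

# View state (the cluster is created inside module.eks_prod)
terraform state list
terraform state show 'module.eks_prod.aws_eks_cluster.this[0]'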

AWS CLI Commands¶

# List EKS clusters
aws eks list-clusters --region us-east-1

# Describe cluster
aws eks describe-cluster --name oves-prod --region us-east-1

# List EC2 instances
aws ec2 describe-instances --region us-east-1

# List S3 buckets
aws s3 ls

# View CloudWatch logs
aws logs tail /aws/eks/oves-prod/cluster --follow

Troubleshooting¶

EKS Cluster Issues¶

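# Check cluster status
aws eks describe-cluster --name oves-prod --region us-east-1 --query cluster.status

# View node group status (the EKS module may add a suffix to node group names, so list them first)
aws eks list-nodegroups --cluster-name oves-prod --region us-east-1
aws eks describe-nodegroup --cluster-name oves-prod --nodegroup-name <nodegroup-name> --region us-east-1

# Check CloudWatch logs
aws logs tail /aws/eks/oves-prod/cluster --follow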

RDS Connection Issues¶

# Check RDS status
aws rds describe-db-instances --db-instance-identifier oves-prod-postgres

# Test connectivity
psql -h <endpoint> -U admin -d oves_production

# Check security groups
aws ec2 describe-security-groups --group-ids sg-xxxxx
