AWS Infrastructure Documentation
Complete guide to OVES AWS resources, managed via Terraform.
Overview
All OVES infrastructure runs on Amazon Web Services (AWS) and is managed entirely through Terraform as Infrastructure as Code. Resources span two regions, us-east-1 (primary) and eu-central-1 (secondary), with separate configurations for the development and production environments.
AWS Architecture
┌─────────────────────────────────────────────────────────────────┐
│                           AWS Account                           │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                   US-East-1 (Primary)                    │   │
│  │                                                          │   │
│  │  ┌────────────────┐    ┌────────────────┐                │   │
│  │  │   Production   │    │  Development   │                │   │
│  │  │      VPC       │    │      VPC       │                │   │
│  │  │                │    │                │                │   │
│  │  │ - EKS Cluster  │    │ - EKS Cluster  │                │   │
│  │  │ - EC2 Instances│    │ - EC2 Instances│                │   │
│  │  │ - RDS (some)   │    │ - RDS (dev)    │                │   │
│  │  │ - ElastiCache  │    │ - ElastiCache  │                │   │
│  │  └────────────────┘    └────────────────┘                │   │
│  │                                                          │   │
│  │  ┌────────────────────────────────────────────────────┐  │   │
│  │  │                  Shared Services                   │  │   │
│  │  │  - S3 Buckets (backups, logs, terraform state)     │  │   │
│  │  │  - IAM Roles & Policies                            │  │   │
│  │  │  - Route53 Hosted Zones                            │  │   │
│  │  │  - CloudWatch Logs & Metrics                       │  │   │
│  │  │  - ECR (deprecated, using ghcr.io now)             │  │   │
│  │  └────────────────────────────────────────────────────┘  │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│                    VPC Peering Connection                       │
│                               │                                 │
│  ┌────────────────────────────▼─────────────────────────────┐   │
│  │                 EU-Central-1 (Secondary)                 │   │
│  │                                                          │   │
│  │  ┌────────────────────────────────────────────────────┐  │   │
│  │  │              Production Database VPC               │  │   │
│  │  │  - RDS PostgreSQL (primary production DB)          │  │   │
│  │  │  - DocumentDB (MongoDB compatible)                 │  │   │
│  │  │  - ElastiCache Redis                               │  │   │
│  │  └────────────────────────────────────────────────────┘  │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  External DNS: Cloudflare (Primary), Route53 (China access)     │
└─────────────────────────────────────────────────────────────────┘
Core AWS Services
1. EKS (Elastic Kubernetes Service)
Production Cluster (oves-prod):
- Region: us-east-1
- Version: 1.28+
- Node Groups: General (t3.large), Compute (c5.xlarge), Memory (r5.large)
- Managed via Terraform
- Private API endpoint
Development Cluster (oves-dev):
- Region: us-east-1
- Version: 1.28+
- Node Groups: General (t3.medium), Spot instances
- Public API endpoint
Terraform Configuration:
module "eks_prod" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "oves-prod"
  cluster_version = "1.28"

  vpc_id     = module.vpc_prod.vpc_id
  subnet_ids = module.vpc_prod.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["t3.large"]
      min_size       = 3
      max_size       = 10
      desired_size   = 3
    }
    compute = {
      instance_types = ["c5.xlarge"]
      min_size       = 2
      max_size       = 5
      desired_size   = 2
    }
  }

  cluster_addons = {
    coredns = {
      most_recent = true
    }
    kube-proxy = {
      most_recent = true
    }
    vpc-cni = {
      most_recent = true
    }
    aws-ebs-csi-driver = {
      most_recent = true
    }
  }
}
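The development cluster is described above but has no configuration shown. A minimal sketch of what it might look like with the same module follows. The module name `vpc_dev`, the node group sizing, and the choice to run the general group on Spot capacity are assumptions, not taken from the OVES repository.

```hcl
# Sketch only: module.vpc_dev and the sizing values are hypothetical.
module "eks_dev" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"

  cluster_name    = "oves-dev"
  cluster_version = "1.28"

  # Dev exposes the API endpoint publicly; the v19 module keeps it
  # private unless this is set.
  cluster_endpoint_public_access = true

  vpc_id = module.vpc_dev.vpc_id # hypothetical module name
  # EKS requires subnets in at least two Availability Zones.
  subnet_ids = module.vpc_dev.private_subnets

  eks_managed_node_groups = {
    general = {
      instance_types = ["t3.medium"]
      capacity_type  = "SPOT" # assumed: Spot capacity for the general group
      min_size       = 1      # hypothetical sizing
      max_size       = 3
      desired_size   = 1
    }
  }
}
```

Spot nodes can be reclaimed at short notice, which is usually acceptable for development workloads but is the reason the production node groups stay on on-demand capacity.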
2. EC2 (Elastic Compute Cloud)
Use Cases:
- Legacy applications not yet containerized
- Services requiring specific OS configurations
- Jump hosts / bastion servers
- CI/CD runners (self-hosted)
Instance Types:
- t3.medium - General purpose workloads
- t3.large - Higher capacity needs
- c5.large - Compute-intensive tasks
Terraform Example:
resource "aws_instance" "app_server" {
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = "t3.medium"

  subnet_id              = module.vpc.private_subnets[0]
  vpc_security_group_ids = [aws_security_group.app.id]
  iam_instance_profile   = aws_iam_instance_profile.app.name

  user_data = templatefile("${path.module}/user_data.sh", {
    environment = "production"
  })

  tags = {
    Name        = "app-server-prod"
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
3. VPC (Virtual Private Cloud)
Production VPC:
- CIDR: 10.0.0.0/16
- Availability Zones: 3
- Public Subnets: 3 (10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24)
- Private Subnets: 3 (10.0.11.0/24, 10.0.12.0/24, 10.0.13.0/24)
- NAT Gateways: 3 (one per AZ)
- Internet Gateway: 1
Development VPC:
- CIDR: 10.1.0.0/16
- Availability Zones: 1 (cost optimization)
- Public Subnets: 1
- Private Subnets: 1
- NAT Gateway: 1
Terraform Configuration:
module "vpc_prod" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "oves-prod-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.11.0/24", "10.0.12.0/24", "10.0.13.0/24"]
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]

  enable_nat_gateway = true
  single_nat_gateway = false

  enable_dns_hostnames = true
  enable_dns_support   = true

  public_subnet_tags = {
    "kubernetes.io/role/elb" = 1
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = 1
  }

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
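For comparison, the single-AZ development VPC listed above could be expressed with the same module roughly as follows. The VPC name, the Availability Zone, and the two subnet CIDRs are hypothetical; only the 10.1.0.0/16 range and the single NAT gateway come from the description.

```hcl
# Sketch only: name, AZ, and subnet CIDRs are hypothetical.
module "vpc_dev" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "oves-dev-vpc" # hypothetical name
  cidr = "10.1.0.0/16"

  azs             = ["us-east-1a"]   # assumed AZ
  private_subnets = ["10.1.11.0/24"] # hypothetical CIDR
  public_subnets  = ["10.1.1.0/24"]  # hypothetical CIDR

  # One shared NAT gateway instead of one per AZ, to reduce cost.
  enable_nat_gateway = true
  single_nat_gateway = true

  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Environment = "development"
    ManagedBy   = "terraform"
  }
}
```

The trade-off is availability: with a single NAT gateway, an outage in its Availability Zone cuts outbound internet access for every private subnet.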
4. RDS (Relational Database Service)
Production Databases:
- PostgreSQL 14.x (primary application database)
- MySQL 8.0 (legacy applications)
- Multi-AZ deployment for high availability
- Automated backups (7-day retention)
- Encryption at rest
Terraform Configuration:
resource "aws_db_instance" "postgres_prod" {
  identifier     = "oves-prod-postgres"
  engine         = "postgres"
  engine_version = "14.9"
  instance_class = "db.t3.large"

  allocated_storage     = 100
  max_allocated_storage = 500 # enables storage autoscaling up to 500 GiB
  storage_type          = "gp3"
  storage_encrypted     = true

  db_name = "oves_production"
  # NOTE: RDS for PostgreSQL treats "admin" as a reserved word;
  # confirm the actual master username before applying.
  username = "admin"
  password = var.db_password # From Terraform Cloud / Vault

  multi_az               = true
  db_subnet_group_name   = aws_db_subnet_group.prod.name
  vpc_security_group_ids = [aws_security_group.rds.id]

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  enabled_cloudwatch_logs_exports = ["postgresql", "upgrade"]

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
5. S3 (Simple Storage Service)
Buckets:
| Bucket Name | Purpose | Versioning | Lifecycle |
|---|---|---|---|
| oves-backups-prod | Database backups | Enabled | 90 days → Glacier |
| oves-logs-prod | Application logs | Disabled | 30 days → Delete |
| oves-terraform-state | Terraform state | Enabled | Never delete |
| oves-artifacts-prod | Build artifacts | Enabled | 180 days → Delete |
| oves-static-prod | Static assets | Disabled | Never delete |
Terraform Configuration:
resource "aws_s3_bucket" "backups" {
  bucket = "oves-backups-prod"

  tags = {
    Name        = "Production Backups"
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_s3_bucket_versioning" "backups" {
  bucket = aws_s3_bucket.backups.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_lifecycle_configuration" "backups" {
  bucket = aws_s3_bucket.backups.id

  rule {
    id     = "archive-old-backups"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket.
    filter {}

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    expiration {
      days = 365
    }
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "backups" {
  bucket = aws_s3_bucket.backups.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
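The other rows in the bucket table follow the same pattern. As an illustration, the "30 days → Delete" policy for `oves-logs-prod` might be written like this; the resource labels and the rule id are hypothetical.

```hcl
# Sketch only: resource labels and the rule id are hypothetical.
resource "aws_s3_bucket" "logs" {
  bucket = "oves-logs-prod"
}

resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    id     = "expire-old-logs"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket.
    filter {}

    # Delete objects 30 days after creation.
    expiration {
      days = 30
    }
  }
}
```

Because versioning is disabled on this bucket, a plain `expiration` is enough; a versioned bucket would also need a `noncurrent_version_expiration` block to remove old versions.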
6. ElastiCache (Redis)
Production Redis:
- Engine: Redis 7.x
- Node Type: cache.t3.medium
- Number of Nodes: 2 (primary + replica)
- Multi-AZ: Enabled
- Encryption: In-transit and at-rest
Terraform Configuration:
resource "aws_elasticache_replication_group" "redis_prod" {
  replication_group_id = "oves-prod-redis"
  # "description" replaces "replication_group_description", which was
  # removed in AWS provider v5.
  description = "Production Redis cluster"

  engine             = "redis"
  engine_version     = "7.0"
  node_type          = "cache.t3.medium"
  num_cache_clusters = 2

  automatic_failover_enabled = true
  multi_az_enabled           = true

  subnet_group_name  = aws_elasticache_subnet_group.prod.name
  security_group_ids = [aws_security_group.redis.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = var.redis_auth_token

  snapshot_retention_limit = 5
  snapshot_window          = "03:00-05:00"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}
7. IAM (Identity and Access Management)
Key Roles:
- EKS cluster role
- EKS node group role
- IRSA (IAM Roles for Service Accounts) for pods
- EC2 instance profiles
- Lambda execution roles
Terraform Example (IRSA):
module "irsa_account_microservice" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
  version = "~> 5.0"

  role_name = "account-microservice"

  role_policy_arns = {
    policy = aws_iam_policy.account_microservice.arn
  }

  oidc_providers = {
    main = {
      provider_arn               = module.eks_prod.oidc_provider_arn
      namespace_service_accounts = ["production:account-microservice"]
    }
  }
}

resource "aws_iam_policy" "account_microservice" {
  name = "account-microservice-policy"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Resource = "${aws_s3_bucket.backups.arn}/*"
      }
    ]
  })
}
8. Route53 (DNS)
Hosted Zones:
- omnivoltaic.com - Primary domain (for China access)
- Internal zones for service discovery
Terraform Configuration:
resource "aws_route53_zone" "main" {
  name = "omnivoltaic.com"

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_route53_record" "api_china" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "cn.omnivoltaic.com"
  type    = "A"

  alias {
    name                   = aws_lb.api.dns_name
    zone_id                = aws_lb.api.zone_id
    evaluate_target_health = true
  }
}
9. CloudWatch
Log Groups:
- /aws/eks/oves-prod/cluster - EKS control plane logs
- /aws/rds/instance/oves-prod-postgres/postgresql - Database logs
- /aws/lambda/* - Lambda function logs
- /aws/ec2/* - EC2 instance logs
Alarms:
- High CPU utilization
- Low disk space
- RDS connection count
- ELB unhealthy targets
Terraform Example:
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "eks-node-high-cpu"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "Average CPU utilization of the general EKS node group exceeds 80%"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    # EKS module v19 exposes each managed node group's Auto Scaling group names.
    AutoScalingGroupName = module.eks_prod.eks_managed_node_groups["general"].node_group_autoscaling_group_names[0]
  }
}
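The "RDS connection count" alarm from the list above can be defined the same way. The following is a sketch under assumptions: the alarm name and the threshold of 200 are hypothetical, and the threshold should be tuned to the instance's actual `max_connections`.

```hcl
# Sketch only: alarm name and threshold are hypothetical.
resource "aws_cloudwatch_metric_alarm" "rds_connections" {
  alarm_name          = "rds-high-connection-count"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "DatabaseConnections"
  namespace           = "AWS/RDS"
  period              = 300
  statistic           = "Average"
  threshold           = 200 # hypothetical; tune to max_connections
  alarm_description   = "RDS connection count is above the expected ceiling"
  alarm_actions       = [aws_sns_topic.alerts.arn]

  dimensions = {
    DBInstanceIdentifier = aws_db_instance.postgres_prod.identifier
  }
}
```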
VPC Peering
Purpose: Connect US-East-1 production cluster to EU-Central-1 databases
Configuration:
resource "aws_vpc_peering_connection" "us_to_eu" {
  vpc_id      = module.vpc_prod_us.vpc_id
  peer_vpc_id = module.vpc_prod_eu.vpc_id
  peer_region = "eu-central-1"
  auto_accept = false

  tags = {
    Name = "US-East-1 to EU-Central-1"
  }
}

resource "aws_vpc_peering_connection_accepter" "eu" {
  provider                  = aws.eu
  vpc_peering_connection_id = aws_vpc_peering_connection.us_to_eu.id
  auto_accept               = true
}

resource "aws_route" "us_to_eu" {
  route_table_id            = module.vpc_prod_us.private_route_table_ids[0]
  destination_cidr_block    = module.vpc_prod_eu.vpc_cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.us_to_eu.id
}
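The configuration above references an `aws.eu` provider alias and only routes traffic from the US side. A peering connection needs routes in both directions, so a minimal sketch of the two missing pieces might look like this. It assumes `module.vpc_prod_eu` exposes `private_route_table_ids` in the same way as the US module.

```hcl
# Provider alias used by the accepter and by EU-side resources.
provider "aws" {
  alias  = "eu"
  region = "eu-central-1"
}

# Return route so the EU database VPC can reply to the US VPC.
# Assumes module.vpc_prod_eu exposes private_route_table_ids.
resource "aws_route" "eu_to_us" {
  provider                  = aws.eu
  route_table_id            = module.vpc_prod_eu.private_route_table_ids[0]
  destination_cidr_block    = module.vpc_prod_us.vpc_cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.us_to_eu.id
}
```

Both route resources touch only the first private route table. With one NAT gateway per AZ, the VPC module likely creates one private route table per AZ, and each of them would need the same route. Security groups on both sides must also allow the peer VPC's CIDR range.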
Terraform Structure
terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── terraform.tfvars
│   │   ├── backend.tf
│   │   └── outputs.tf
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       ├── terraform.tfvars
│       ├── backend.tf
│       └── outputs.tf
├── modules/
│   ├── eks/
│   ├── vpc/
│   ├── rds/
│   ├── s3/
│   ├── iam/
│   └── security-groups/
└── shared/
    ├── route53.tf
    └── cloudwatch.tf
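Each environment keeps its own `backend.tf`. A plausible version for production, pointing at the `oves-terraform-state` bucket listed in the S3 section, is sketched below. The state key layout and the bucket region are assumptions.

```hcl
# environments/prod/backend.tf (sketch; key and region are assumptions)
terraform {
  backend "s3" {
    bucket  = "oves-terraform-state"
    key     = "prod/terraform.tfstate" # hypothetical key layout
    region  = "us-east-1"              # assumed from the Shared Services region
    encrypt = true
  }
}
```

Using a distinct `key` per environment keeps the dev and prod state files separate inside the same bucket.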
Cost Optimization
Strategies
- Use Spot Instances (Dev cluster)
- Right-size Resources (Regular reviews)
- S3 Lifecycle Policies (Archive old data)
- Reserved Instances (For predictable workloads)
- Auto-scaling (Scale down during low usage)
- Single NAT Gateway (Dev environment)
Security Best Practices
- Encryption: All data encrypted at rest and in transit
- IAM: Least privilege access
- Security Groups: Restrictive rules
- VPC: Private subnets for databases
- Secrets: Never in code, use Secrets Manager/Vault
- Logging: CloudTrail enabled for audit
- MFA: Required for console access
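As an illustration of the "Secrets: Never in code" rule, a database password can be read from AWS Secrets Manager at plan time instead of being written into `terraform.tfvars`. This is a minimal sketch; the secret name is hypothetical.

```hcl
# Sketch only: the secret name is hypothetical.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "oves/prod/postgres/password"
}

locals {
  # Reference as local.db_password wherever the password is needed,
  # for example in an aws_db_instance "password" argument.
  db_password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```

The resolved value is still written to the Terraform state, so the state bucket must remain encrypted and access-restricted.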
Common Operations
Terraform Commands
# Initialize
cd terraform/environments/prod
terraform init
# Plan changes
terraform plan
# Apply changes
terraform apply
# Destroy resources (careful!)
terraform destroy
# View state
terraform state list
terraform state show 'module.eks_prod.aws_eks_cluster.this[0]'
AWS CLI Commands
# List EKS clusters
aws eks list-clusters --region us-east-1
# Describe cluster
aws eks describe-cluster --name oves-prod --region us-east-1
# List EC2 instances
aws ec2 describe-instances --region us-east-1
# List S3 buckets
aws s3 ls
# View CloudWatch logs
aws logs tail /aws/eks/oves-prod/cluster --follow
Troubleshooting
EKS Cluster Issues
# Check cluster status
aws eks describe-cluster --name oves-prod --region us-east-1
# View node group status
aws eks describe-nodegroup --cluster-name oves-prod --nodegroup-name general --region us-east-1
# Check CloudWatch logs
aws logs tail /aws/eks/oves-prod/cluster --follow
RDS Connection Issues
# Check RDS status
aws rds describe-db-instances --db-instance-identifier oves-prod-postgres
# Test connectivity
psql -h <endpoint> -U admin -d oves_production
# Check security groups
aws ec2 describe-security-groups --group-ids sg-xxxxx