feat!: Enable autoscaling and install required ancillary tooling to support it (#257)

* feat!: add autoscaling to terraform modules and change variable lookup order for instance sizing

* feat!: add cluster autoscaler to cluster

BREAKING CHANGE: A number of variable defaults are removed and variables renamed for node counts.
danielpanzella authored Oct 9, 2024
1 parent 47b06e1 commit 8d48434
Showing 12 changed files with 274 additions and 65 deletions.
38 changes: 38 additions & 0 deletions README.md
@@ -86,6 +86,22 @@ module "wandb" {

- Run `terraform init` and `terraform apply`

## Cluster Sizing

By default, the Kubernetes instance types, node counts, Redis cluster size, and database instance class are
standardized via the configurations in [deployment-size.tf](./deployment-size.tf) and are selected via the `size`
input variable.

Available sizes are `small`, `medium`, `large`, `xlarge`, and `xxlarge`; the default is `small`.

All the values set via `deployment-size.tf` can be overridden by setting the appropriate input variables.

- `kubernetes_instance_types` - The instance types for the EKS nodes
- `kubernetes_min_nodes_per_az` - The minimum number of nodes in each AZ for the EKS cluster
- `kubernetes_max_nodes_per_az` - The maximum number of nodes in each AZ for the EKS cluster
- `elasticache_node_type` - The node type for the Redis cluster
- `database_instance_class` - The instance class for the database
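For example, a minimal sketch combining a `size` with a single override (the module source and values here are illustrative, not a complete configuration):

```hcl
module "wandb" {
  # Hypothetical module reference; use the source/version from your own configuration.
  source = "wandb/wandb/aws"

  namespace = "wandb"
  size      = "medium"

  # Override just the database class; all other "medium" defaults still apply.
  database_instance_class = "db.r6g.4xlarge"
}
```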

## Examples

We have included documentation and reference examples for additional common
@@ -263,6 +279,28 @@ Upgrades must be executed in step-wise fashion from one version to the next. You

See our upgrade guide [here](./docs/operator-migration/readme.md)

### Upgrading from 4.x -> 5.x

5.0.0 introduced autoscaling for the EKS cluster and made the `size` variable the preferred way to set the cluster size.
Previously, unless the `size` variable was set explicitly, the following variables had defaults of their own:
- `kubernetes_instance_types`
- `kubernetes_node_count`
- `elasticache_node_type`
- `database_instance_class`

The `size` variable now defaults to `small`, and the following variables can be used to partially override the values
set by the `size` variable:
- `kubernetes_instance_types`
- `kubernetes_min_nodes_per_az`
- `kubernetes_max_nodes_per_az`
- `elasticache_node_type`
- `database_instance_class`

For more information on the available sizes, see the [Cluster Sizing](#cluster-sizing) section.

If you do not want the cluster to scale nodes in and out, set `kubernetes_min_nodes_per_az` and
`kubernetes_max_nodes_per_az` to the same value to pin the node count.
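For example, pinning the node count might look like this (a sketch; choose values appropriate to your deployment):

```hcl
# Setting min and max to the same value pins the per-AZ node count,
# so the cluster autoscaler cannot scale nodes in or out.
kubernetes_min_nodes_per_az = 2
kubernetes_max_nodes_per_az = 2
```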

### Upgrading from 3.x -> 4.x

- If egress access for retrieving the wandb/controller image is not available, `terraform apply` may fail.
45 changes: 25 additions & 20 deletions deployment-size.tf
@@ -6,34 +6,39 @@
 locals {
   deployment_size = {
     small = {
-      db            = "db.r6g.large",
-      node_count    = 2,
-      node_instance = "r6i.xlarge"
-      cache         = "cache.m6g.large"
+      db               = "db.r6g.large",
+      min_nodes_per_az = 1,
+      max_nodes_per_az = 2,
+      node_instance    = "r6i.xlarge"
+      cache            = "cache.m6g.large"
     },
     medium = {
-      db            = "db.r6g.xlarge",
-      node_count    = 2,
-      node_instance = "r6i.xlarge"
-      cache         = "cache.m6g.large"
+      db               = "db.r6g.xlarge",
+      min_nodes_per_az = 1,
+      max_nodes_per_az = 2,
+      node_instance    = "r6i.xlarge"
+      cache            = "cache.m6g.large"
     },
     large = {
-      db            = "db.r6g.2xlarge",
-      node_count    = 2,
-      node_instance = "r6i.2xlarge"
-      cache         = "cache.m6g.xlarge"
+      db               = "db.r6g.2xlarge",
+      min_nodes_per_az = 1,
+      max_nodes_per_az = 2,
+      node_instance    = "r6i.2xlarge"
+      cache            = "cache.m6g.xlarge"
     },
     xlarge = {
-      db            = "db.r6g.4xlarge",
-      node_count    = 3,
-      node_instance = "r6i.2xlarge"
-      cache         = "cache.m6g.xlarge"
+      db               = "db.r6g.4xlarge",
+      min_nodes_per_az = 1,
+      max_nodes_per_az = 2,
+      node_instance    = "r6i.2xlarge"
+      cache            = "cache.m6g.xlarge"
     },
     xxlarge = {
-      db            = "db.r6g.8xlarge",
-      node_count    = 3,
-      node_instance = "r6i.4xlarge"
-      cache         = "cache.m6g.2xlarge"
+      db               = "db.r6g.8xlarge",
+      min_nodes_per_az = 1,
+      max_nodes_per_az = 3,
+      node_instance    = "r6i.4xlarge"
+      cache            = "cache.m6g.2xlarge"
     }
   }
 }
22 changes: 14 additions & 8 deletions main.tf
@@ -21,6 +21,11 @@ locals {
   use_external_bucket = var.bucket_name != ""
   s3_kms_key_arn      = local.use_external_bucket || var.bucket_kms_key_arn != "" ? var.bucket_kms_key_arn : local.default_kms_key
   use_internal_queue  = local.use_external_bucket || var.use_internal_queue
+  elasticache_node_type       = coalesce(var.elasticache_node_type, local.deployment_size[var.size].cache)
+  database_instance_class     = coalesce(var.database_instance_class, local.deployment_size[var.size].db)
+  kubernetes_instance_types   = coalesce(var.kubernetes_instance_types, [local.deployment_size[var.size].node_instance])
+  kubernetes_min_nodes_per_az = coalesce(var.kubernetes_min_nodes_per_az, local.deployment_size[var.size].min_nodes_per_az)
+  kubernetes_max_nodes_per_az = coalesce(var.kubernetes_max_nodes_per_az, local.deployment_size[var.size].max_nodes_per_az)
 }
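The `coalesce()` pattern above makes explicitly set input variables win over the size-based defaults, since `coalesce` returns its first non-null argument. A minimal sketch of that lookup order (values hypothetical):

```hcl
locals {
  # Variable left unset (null): the size table's value is used.
  from_size_table = coalesce(null, "db.r6g.large") # "db.r6g.large"

  # Variable set explicitly: it takes precedence over the size table.
  from_override = coalesce("db.r6g.4xlarge", "db.r6g.large") # "db.r6g.4xlarge"
}
```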

module "file_storage" {
@@ -84,7 +89,7 @@ module "database" {
   database_name   = var.database_name
   master_username = var.database_master_username
 
-  instance_class      = try(local.deployment_size[var.size].db, var.database_instance_class)
+  instance_class      = local.database_instance_class
   engine_version      = var.database_engine_version
   snapshot_identifier = var.database_snapshot_identifier
   sort_buffer_size    = var.database_sort_buffer_size
@@ -136,11 +141,13 @@ module "app_eks" {
   namespace   = var.namespace
   kms_key_arn = local.default_kms_key
 
-  instance_types   = try([local.deployment_size[var.size].node_instance], var.kubernetes_instance_types)
-  desired_capacity = try(local.deployment_size[var.size].node_count, var.kubernetes_node_count)
-  map_accounts     = var.kubernetes_map_accounts
-  map_roles        = var.kubernetes_map_roles
-  map_users        = var.kubernetes_map_users
+  instance_types = local.kubernetes_instance_types
+  min_nodes      = local.kubernetes_min_nodes_per_az
+  max_nodes      = local.kubernetes_max_nodes_per_az
+
+  map_accounts = var.kubernetes_map_accounts
+  map_roles    = var.kubernetes_map_roles
+  map_users    = var.kubernetes_map_users
 
   bucket_kms_key_arns = compact([
     local.default_kms_key,
@@ -240,7 +247,7 @@ module "redis" {
   vpc_id                  = local.network_id
   redis_subnet_group_name = local.network_elasticache_subnet_group_name
   vpc_subnets_cidr_blocks = local.network_elasticache_subnet_cidrs
-  node_type               = try(local.deployment_size[var.size].cache, var.elasticache_node_type)
+  node_type               = local.elasticache_node_type
   kms_key_arn             = local.database_kms_key_arn
 }

@@ -371,4 +378,3 @@ module "wandb" {
}
}
}

34 changes: 34 additions & 0 deletions modules/app_eks/cluster_autoscaler/ClusterAutoscaler.json
@@ -0,0 +1,34 @@
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeScalingActivities",
"ec2:DescribeImages",
"ec2:DescribeInstanceTypes",
"ec2:DescribeLaunchTemplateVersions",
"ec2:GetInstanceTypesFromInstanceRequirements",
"eks:DescribeNodegroup"
],
"Resource": ["*"]
},
{
"Effect": "Allow",
"Action": [
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup"
],
"Resource": ["*"],
"Condition": {
"StringEquals": {
"aws:ResourceTag/k8s.io/cluster-autoscaler/enabled": "true",
"aws:ResourceTag/k8s.io/cluster-autoscaler/${namespace}": "owned"
}
}
}
]
}
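The `Condition` on the second statement restricts `SetDesiredCapacity` and `TerminateInstanceInAutoScalingGroup` to Auto Scaling groups carrying the cluster-autoscaler ownership tags. As a sketch, only ASGs tagged along these lines are writable (EKS managed node groups apply the cluster-scoped ownership tag automatically):

```hcl
# Illustrative tags an ASG must carry to match the policy's Condition;
# <namespace> stands for the value passed to templatefile().
tags = {
  "k8s.io/cluster-autoscaler/enabled"     = "true"
  "k8s.io/cluster-autoscaler/<namespace>" = "owned"
}
```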
49 changes: 49 additions & 0 deletions modules/app_eks/cluster_autoscaler/cluster_autoscaler.tf
@@ -0,0 +1,49 @@
data "aws_region" "current" {}

resource "helm_release" "cluster-autoscaler" {
chart = "cluster-autoscaler"
name = "cluster-autoscaler"
repository = "https://kubernetes.github.io/autoscaler"
namespace = "cluster-autoscaler"
create_namespace = true

set {
name = "fullnameOverride"
value = "cluster-autoscaler"
}

set {
name = "autoDiscovery.clusterName"
value = var.namespace
}

set {
name = "awsRegion"
value = data.aws_region.current.name
}

set {
name = "rbac.serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn"
value = aws_iam_role.default.arn
}

set {
name = "extraArgs.balance-similar-node-groups"
value = "true"
}

set {
name = "extraArgs.balancing-ignore-label"
value = "eks.amazonaws.com/nodegroup"
}

set {
name = "extraArgs.balancing-ignore-label"
value = "eks.amazonaws.com/sourceLaunchTemplateId"
}

set {
name = "extraArgs.balancing-ignore-label"
value = "topology.ebs.csi.aws.com/zone"
}
}
32 changes: 32 additions & 0 deletions modules/app_eks/cluster_autoscaler/iam.tf
@@ -0,0 +1,32 @@
data "aws_iam_policy_document" "default" {
statement {
actions = ["sts:AssumeRoleWithWebIdentity"]
effect = "Allow"

condition {
test = "StringLike"
variable = "${replace(var.oidc_provider.url, "https://", "")}:sub"
values = ["system:serviceaccount:cluster-autoscaler:*"]
}

principals {
identifiers = [var.oidc_provider.arn]
type = "Federated"
}
}
}

resource "aws_iam_role" "default" {
assume_role_policy = data.aws_iam_policy_document.default.json
name = "${var.namespace}-cluster-autoscaler"
}

resource "aws_iam_policy" "default" {
policy = templatefile("${path.module}/ClusterAutoscaler.json", { namespace = var.namespace })
name = "${var.namespace}-cluster-autoscaler"
}

resource "aws_iam_role_policy_attachment" "default" {
role = aws_iam_role.default.name
policy_arn = aws_iam_policy.default.arn
}
10 changes: 10 additions & 0 deletions modules/app_eks/cluster_autoscaler/variables.tf
@@ -0,0 +1,10 @@
variable "namespace" {
type = string
}

variable "oidc_provider" {
type = object({
arn = string
url = string
})
}
52 changes: 36 additions & 16 deletions modules/app_eks/main.tf
@@ -13,6 +13,11 @@ locals {
   create_launch_template = (local.encrypt_ebs_volume || local.system_reserved != "")
 }
 
+data "aws_subnet" "private" {
+  count = length(var.network_private_subnets)
+  id    = var.network_private_subnets[count.index]
+}
+
 
 module "eks" {
   source = "terraform-aws-modules/eks/aws"
@@ -41,25 +46,31 @@ module "eks" {
     }
   ] : null
 
   # node_security_group_enable_recommended_rules = false
   worker_additional_security_group_ids = [aws_security_group.primary_workers.id]
+  node_groups_defaults = {
+    create_launch_template               = local.create_launch_template,
+    disk_encrypted                       = local.encrypt_ebs_volume,
+    disk_kms_key_id                      = var.kms_key_arn,
+    disk_type                            = "gp3"
+    enable_monitoring                    = true
+    force_update_version                 = local.encrypt_ebs_volume,
+    iam_role_arn                         = aws_iam_role.node.arn,
+    instance_types                       = var.instance_types,
+    kubelet_extra_args                   = local.system_reserved != "" ? "--system-reserved=${local.system_reserved}" : "",
+    metadata_http_put_response_hop_limit = 2
+    metadata_http_tokens                 = "required",
+    version                              = var.cluster_version,
+  }
+
   node_groups = {
-    primary = {
-      create_launch_template               = local.create_launch_template,
-      desired_capacity                     = var.desired_capacity,
-      disk_encrypted                       = local.encrypt_ebs_volume,
-      disk_kms_key_id                      = var.kms_key_arn,
-      disk_type                            = "gp3"
-      enable_monitoring                    = true
-      force_update_version                 = local.encrypt_ebs_volume,
-      iam_role_arn                         = aws_iam_role.node.arn,
-      instance_types                       = var.instance_types,
-      kubelet_extra_args                   = local.system_reserved != "" ? "--system-reserved=${local.system_reserved}" : "",
-      max_capacity                         = 5,
-      metadata_http_put_response_hop_limit = 2
-      metadata_http_tokens                 = "required",
-      min_capacity                         = var.desired_capacity,
-      version                              = var.cluster_version,
+    for subnet in data.aws_subnet.private : regex(".*[[:digit:]]([[:alpha:]])", subnet.availability_zone)[0] => {
+      subnets = [subnet.id]
+      scaling_config = {
+        desired_size = var.min_nodes
+        max_size     = var.max_nodes
+        min_size     = var.min_nodes
+      }
     }
   }
 
@@ -169,3 +180,12 @@ module "external_dns" {
 
   depends_on = [module.eks]
 }
+
+module "cluster_autoscaler" {
+  source = "./cluster_autoscaler"
+
+  namespace     = var.namespace
+  oidc_provider = aws_iam_openid_connect_provider.eks
+
+  depends_on = [module.eks]
+}
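The `for` expression in `node_groups` above keys each group by its AZ letter, extracted from the availability-zone name. A quick sanity check of that regex in `terraform console` (AZ name hypothetical):

```hcl
# regex() returns the capture groups as a list when the pattern has groups,
# so [0] is the trailing letter of the availability zone.
> regex(".*[[:digit:]]([[:alpha:]])", "us-west-2a")[0]
"a"
```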
2 changes: 1 addition & 1 deletion modules/app_eks/outputs.tf
@@ -1,7 +1,7 @@
 output "autoscaling_group_names" {
   value = { for name, value in module.eks.node_groups : name => lookup(lookup(lookup(value, "resources")[0], "autoscaling_groups")[0], "name") }
 }
-output "cluster_id" {
+output "cluster_name" {
   value       = module.eks.cluster_id
   description = "ID of the created EKS cluster"
 }
8 changes: 7 additions & 1 deletion modules/app_eks/variables.tf
@@ -116,7 +116,13 @@ variable "service_port" {
   default = 32543
 }
 
-variable "desired_capacity" {
+variable "min_nodes" {
+  description = "Desired number of worker nodes."
+  type        = number
+  default     = 2
+}
+
+variable "max_nodes" {
   description = "Desired number of worker nodes."
   type        = number
   default     = 2