Horizontal Autoscaling with Nomad APM and AWS ASG plugins

Ferhat Vurucu
Mar 12, 2021

The Nomad Autoscaler is an autoscaling daemon for Nomad. It is built around plugins, which makes it easy to extend with new metrics sources, scaling targets and scaling algorithms.

Building

The Nomad Autoscaler can be easily run as a Nomad job with the APM, Target and Strategy plugins.

job "autoscaler" {
  datacenters = ["eu-central-1a"]
  group "autoscaler" {
    count = 1
    network {
      port "http" {
        to = 8080
      }
    }
    task "autoscaler" {
      driver = "docker"
      config {
        image   = "hashicorp/nomad-autoscaler:0.3.0"
        command = "nomad-autoscaler"
        args = [
          "agent",
          "-config",
          "${NOMAD_TASK_DIR}/config.hcl",
          "-http-bind-address",
          "0.0.0.0",
          "-policy-dir",
          "${NOMAD_TASK_DIR}/policies/"
        ]
        ports = ["http"]
      }
      service {
        name = "autoscaler"
        port = "http"
      }
    }
  }
}

Templating

The APM, Target and Strategy plugins are declared in the autoscaler's configuration file, which must be added as a template inside the autoscaler task of the Nomad job file.

template {
  data = <<EOF
nomad {
  address     = "https://{{ env "attr.unique.network.ip-address" }}:4646"
  region      = "eu-central-1"
  skip_verify = true
}
apm "nomad-apm" {
  driver = "nomad-apm"
}
target "aws-asg" {
  driver = "aws-asg"
  config = {
    aws_region = "{{ $x := env "attr.platform.aws.placement.availability-zone" }}{{ $length := len $x | subtract 1 }}{{ slice $x 0 $length }}"
  }
}
strategy "target-value" {
  driver = "target-value"
}
EOF
  destination = "${NOMAD_TASK_DIR}/config.hcl"
}
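
The aws_region expression in the template derives the region by dropping the trailing zone letter from the client's availability zone. A minimal Python sketch of the same string manipulation follows; the function name region_from_az is hypothetical, and it assumes the zone name ends in a single-letter suffix.

```python
def region_from_az(availability_zone: str) -> str:
    """Drop the trailing zone letter, e.g. the "a" in "eu-central-1a".

    Mirrors the template's slice from index 0 to (length - 1).
    Assumption: the zone name is a region name plus one letter.
    """
    if not availability_zone:
        raise ValueError("availability zone must not be empty")
    return availability_zone[:-1]


if __name__ == "__main__":
    print(region_from_az("eu-central-1a"))  # eu-central-1
```

Zone names that do not follow this pattern, such as AWS Local Zones, would need different handling.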

Horizontal Cluster Autoscaling

Horizontal cluster autoscaling adds or removes Nomad clients through the AWS ASG target plugin, based on metrics such as the remaining free schedulable CPU or memory.

Nomad APM Plugin

The Nomad APM allows querying Nomad to understand the currently allocated resources as a percentage of the total available. Nomad client node metrics are queried with the percentage-allocated operation, applied to either cpu or memory (for example, percentage-allocated_cpu).
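
As a rough illustration of what "allocated as a percentage of the total available" means, the sketch below computes the allocated share of a resource across a set of clients. The function name and the input shape (one allocated/total pair per client) are assumptions for illustration, not the plugin's actual code.

```python
def percentage_allocated(nodes):
    """Return allocated resources as a percentage of total capacity.

    `nodes` is a list of (allocated, total) pairs, one per client, in any
    consistent unit such as MHz for CPU or MiB for memory.
    """
    total = sum(t for _, t in nodes)
    if total == 0:
        # No capacity registered; report nothing allocated.
        return 0.0
    allocated = sum(a for a, _ in nodes)
    return 100.0 * allocated / total


if __name__ == "__main__":
    # Three hypothetical clients with 4000 MHz of CPU each.
    clients = [(3000, 4000), (1000, 4000), (2000, 4000)]
    print(percentage_allocated(clients))  # 50.0
```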

Target Value Strategy Plugin

The target value strategy plugin will perform count calculations in order to keep the value resulting from the APM query at or around a specified target.
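
To make the calculation concrete, here is a simplified Python sketch of a target-value style computation. It assumes the new count is the current count multiplied by the ratio of the metric to the target, rounded up, and that small deviations inside a threshold are ignored. The function names and the 0.01 threshold default are assumptions for illustration; the plugin's own implementation may differ in detail.

```python
import math


def target_value_count(current_count, metric, target, threshold=0.01):
    """Return the count suggested by a target-value style calculation."""
    # Ratio between the observed metric and the desired target.
    factor = metric / target if target != 0 else metric
    # Ignore deviations too small to justify a scaling action.
    if abs(factor - 1.0) <= threshold:
        return current_count
    # Round up so the result errs on the side of more capacity.
    return math.ceil(current_count * factor)


def clamp(count, minimum, maximum):
    """Keep the suggested count inside the policy's min/max bounds."""
    return max(minimum, min(maximum, count))


if __name__ == "__main__":
    print(target_value_count(4, 87.5, 70))               # 5
    print(target_value_count(10, 35, 70))                # 5
    print(clamp(target_value_count(8, 140, 70), 3, 10))  # 10
```

With a target of 70, a metric of 87.5 on 4 instances suggests 5, while a metric of 35 on 10 instances suggests 5. The last call shows a suggestion of 16 being capped by a policy maximum of 10.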

AWS AutoScaling Group Plugin

The AWS ASG target plugin scales the Nomad cluster's clients by manipulating AWS AutoScaling Groups.

The horizontal cluster scaling policy is added as another template in the Nomad job file.

template {
  data = <<EOF
scaling "cluster_policy" {
  enabled = true
  min     = 3
  max     = 10
  policy {
    cooldown            = "5m"
    evaluation_interval = "3m"
    check "cpu_allocated_percentage" {
      source       = "nomad-apm"
      query        = "percentage-allocated_cpu"
      query_window = "5m"
      strategy "target-value" {
        target = 70
      }
    }
    check "memory_allocated_percentage" {
      source       = "nomad-apm"
      query        = "percentage-allocated_memory"
      query_window = "5m"
      strategy "target-value" {
        target = 70
      }
    }
    target "aws-asg" {
      dry-run             = "false"
      aws_asg_name        = "nomad-client"
      node_class          = "autoscaler"
      node_drain_deadline = "5m"
      node_purge          = "true"
    }
  }
}
EOF
  destination = "${NOMAD_TASK_DIR}/policies/hashistack.hcl"
}

Telemetry

The telemetry stanza configures Nomad’s publication of metrics and telemetry to third-party systems. The Nomad APM plugin relies on the runtime metrics of nodes and allocations, so publishing both must be enabled in the client configs.

telemetry {
  publish_allocation_metrics = true
  publish_node_metrics       = true
}

Node Class

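The node class is an arbitrary string used to logically group client nodes by a user-defined class. It must be added to the client configs, and its value must match the node_class set in the target block of the cluster scaling policy.

client {
  enabled    = true
  node_class = "autoscaler"
}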

Nomad UI

The Nomad UI provides a Cluster Topology view for analyzing CPU and memory reservations across the cluster.

Nomad UI

The AWS ASG plugin updates the group's Desired Capacity based on the percentage of CPU and memory allocated on the clients. In our configuration, Desired Capacity is raised when either value is higher than the target of 70.

AWS ASG

The AWS ASG Activity History shows the latest status, description and cause of each scaling activity.

Activity History

When scaling in, the Nomad Autoscaler first drains the selected node to make sure its applications are migrated to other clients properly, and detaches the instance from the AWS ASG. The instance is then terminated.

Activity History

Horizontal Application Autoscaling

This is achieved by modifying the number of allocations in a task group based on the value of a relevant metric, such as CPU and memory utilization or the number of open connections. It can be enabled by configuring autoscaling policies on individual Nomad jobs using the scaling block.

Nomad APM Plugin

The Nomad APM also allows querying Nomad to understand the current resource usage of a task group. Task group metrics are queried with the syntax <operation>_<metric>, where the operation is avg, min, max or sum and the metric is based on cpu or memory. The policy below uses the cpu-allocated and memory-allocated variants.

scaling {
  enabled = true
  min     = 1
  max     = 2
  policy {
    cooldown            = "3m"
    evaluation_interval = "1m"
    check "avg_cpu" {
      source       = "nomad-apm"
      query        = "avg_cpu-allocated"
      query_window = "3m"
      strategy "target-value" {
        target = 70
      }
    }
    check "avg_memory" {
      source       = "nomad-apm"
      query        = "avg_memory-allocated"
      query_window = "3m"
      strategy "target-value" {
        target = 70
      }
    }
  }
}

Nomad UI

The Nomad UI provides a Scaling Timeline and a list of Recent Scaling Events for analyzing job workloads.

Scaling Timeline

Testing

The stress tool can be used to impose load on tasks.

nomad alloc exec <ALLOCID> stress --vm 1 --vm-bytes 250M

Resource Utilization

When the average CPU or memory value is higher than the target of 70, the Nomad Autoscaler adds a new allocation to the task group. Once you stop the stress test, it re-evaluates and scales back in, based on the evaluation interval and query window.

Scaling Events

Check Calculations

The checks in a policy are executed at the same time during an evaluation, and their results can conflict with each other. In that case, the autoscaler iterates over the results and chooses the safest one, which is the result that retains the most capacity of the resource:

  • ScaleOut and ScaleIn => ScaleOut
  • ScaleOut and ScaleNone => ScaleOut
  • ScaleIn and ScaleNone => ScaleNone
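
The precedence rules above can be sketched in Python as follows. The Direction enum and pick_safest function are hypothetical names, and breaking ties between results of the same direction by the higher count is an assumption that follows from "retaining the most capacity".

```python
from enum import Enum


class Direction(Enum):
    # Ordered so that a higher value retains more capacity.
    SCALE_IN = -1
    SCALE_NONE = 0
    SCALE_OUT = 1


def pick_safest(results):
    """Pick the check result that retains the most capacity.

    `results` is a list of (Direction, count) pairs, one per check.
    Direction takes precedence (out > none > in); the higher count
    wins among results with the same direction.
    """
    if not results:
        raise ValueError("at least one check result is required")
    return max(results, key=lambda r: (r[0].value, r[1]))


if __name__ == "__main__":
    cpu = (Direction.SCALE_OUT, 6)
    memory = (Direction.SCALE_IN, 3)
    direction, count = pick_safest([cpu, memory])
    print(direction.name, count)  # SCALE_OUT 6
```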

References

https://www.nomadproject.io/docs/autoscaling

https://github.com/hashicorp/nomad-autoscaler
