Optimization Guides

What do you want to do with Akamas?

Optimize application costs and resource efficiency

Kubernetes microservices

Offline optimizations

Live optimizations

Optimize cost of a Kubernetes deployment subject to Horizontal Pod Autoscaler

In this guide, you optimize the cost (or resource footprint) of a Kubernetes deployment where the number of replicas is controlled by the HPA. The study tunes both pod resource settings (CPU and memory requests and limits) and HPA options (target CPU utilization) at the same time, while also taking into account your application performance and reliability requirements (SLOs). This optimization happens in production, leveraging Akamas live optimization capabilities.

Prerequisites

an Akamas instance
a Kubernetes cluster, with a deployment to be optimized
a Horizontal Pod Autoscaler working on the desired deployment
a supported telemetry data source configured to collect metrics from the target Kubernetes cluster (see here for the full list)
a way to apply configuration changes recommended by Akamas to the target deployment and HPA. In this guide, Akamas interacts directly with the Kubernetes APIs via kubectl. You need a service account with permissions to update your deployment and HPA (see below for other integration options).

Optimization setup

In this guide, we assume the following setup:

the Kubernetes deployment to be optimized is called frontend (in the hipster-shop namespace)
in the deployment, there is a container named server, where the app runs
the HPA is called frontend-hpa
both Dynatrace and Prometheus are used as observability tools

Let's set up the Akamas optimization for this use case.

System

For this optimization, you need the following components to model the frontend tech stack:

The Kubernetes Workload, Container and Pod components, which contain metrics like CPU used for the different objects and parameters to be tuned like CPU limits at the container level (from the Kubernetes optimization pack)
An HPA component, which contains HPA parameters like the target CPU utilization
A Web Application component, which contains service-level metrics like throughput and response time of the microservice (from the Web Application optimization pack)

Let's start by creating the system, which represents the Kubernetes deployment to be optimized. To create it, write a system.yaml manifest like this:

name: frontend-2
description: The frontend Kubernetes deployment

Then run:

akamas create system system.yaml

Now create the three Kubernetes components. Create a workload.yaml manifest like the following:

name: workload_frontend
description: The frontend Kubernetes workload
componentType: Kubernetes Workload
properties:
  prometheus:
    namespace: hipster-shop
    deployment: frontend

Then create a container.yaml manifest like the following:

name: server
description: The server Kubernetes container
componentType: Kubernetes Container
properties:
  prometheus:
    namespace: hipster-shop
    pod: frontend.*
    container: server

And a pod.yaml manifest like the following:

name: pod_frontend
description: The frontend Kubernetes pod
componentType: Kubernetes Pod
properties:
  prometheus:
    namespace: hipster-shop
    pod: frontend.*

Now create the entities by running:

akamas create component workload.yaml frontend-2
akamas create component container.yaml frontend-2
akamas create component pod.yaml frontend-2

Now create an application.yaml manifest like the following:

name: web_application
description: The web application of frontend deployment
componentType: Web Application
properties:
  dynatrace:
    id: SERVICE-80258F7AA97F2E4D
  prometheus:
    namespace: hipster-shop
    pod: frontend.*
    container: server

Notice that the component includes properties that specify how the telemetry providers (Dynatrace and Prometheus) will look up the metrics for this component.

These properties depend on the telemetry provider you are using. See the reference for the full list of supported providers and related configurations.

Then run:

akamas create component application.yaml frontend-2

Finally, create an hpa.yaml manifest like the following:

name: frontend_hpa
description: The HPA for the frontend
componentType: HPA

The HPA component does not provide any metrics, so it does not need any telemetry properties.

Note: the HPA component type and its parameters must be created before this step. That setup is not covered in this guide.

Then run:

akamas create component hpa.yaml frontend-2

Workflow

To optimize a Kubernetes microservice in production, you need to create a workflow that defines how the new configuration recommended by Akamas will be deployed in production.

Let's explore the high-level tasks required in this scenario and the options you have to adapt it to your environment:

1) Update the Kubernetes deployment and HPA configurations

The first step is to update the Kubernetes deployment and HPA with the new configuration. This can be done in several ways depending on your environment and processes:

A simple option is to let Akamas directly update the Kubernetes entities leveraging the Kubernetes APIs via kubectl commands.
Another option is to follow an Infrastructure-as-code approach, where the configuration change is managed via pull requests to a Git repository, leveraging your pipelines to deploy the change in production.

In this guide, we take the first option and use the kubectl patch and kubectl apply commands to configure the new deployment and the HPA.

These commands are executed from the toolbox, an Akamas utility that can be enabled in an Akamas installation on Kubernetes. Make sure that kubectl is configured correctly to connect to your Kubernetes cluster and can update your target deployment. See here for more details.

2) Wait for the new deployment to be rolled out in production

In a live optimization, Akamas needs to understand when the new deployment rollout is complete and whether it was completed successfully or not. This is key information for Akamas AI to observe and optimize your applications safely.

This task can be done in several ways depending on how you manage changes, as discussed in the previous task:

A simple option is to use the kubectl rollout command to wait for the deployment rollout completion. This is the approach used in this guide.
Another option is to follow an Infrastructure-as-code approach, where a change is managed via pull requests to a Git repository, leveraging your pipelines to deploy in production. In this situation, the deployment process is executed externally and is not controlled by Akamas. Hence, the workflow task will periodically poll the Kubernetes deployment to recognize when the new deployment has landed in production.

See here for an example of an Infrastructure-as-code automation approach.
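
The polling behavior described above can be sketched as follows. This is a minimal illustration in Python, not the Akamas implementation: the function name `wait_for_rollout` and its parameters are hypothetical, and the `is_rolled_out` callback stands in for whatever check fits your environment (for example, a wrapper around kubectl rollout status).

```python
import time
from typing import Callable


def wait_for_rollout(
    is_rolled_out: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the rollout check succeeds or the timeout expires.

    Returns True when the new deployment is detected, False on timeout.
    The sleep and clock arguments are injectable so the loop can be tested.
    """
    deadline = clock() + timeout_s
    while True:
        if is_rolled_out():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A workflow task built on this idea would treat a False result as a failed rollout, so that the experiment is not evaluated against a configuration that never reached production.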

3) Wait for the appropriate time to start the experiment

When dealing with the HPA, it is important that Akamas always observes the same timeframe.

If the configuration change takes too long (e.g., because it requires a manual step), the Akamas experiments will see a different workload pattern (e.g., night-time traffic instead of daytime traffic). This would make the analysis more complex, especially for humans.

Although Akamas handles different workload patterns, it's always better to run each experiment in the same time slot, so that each configuration is evaluated against a similar workload pattern.

In this example, we assume that we want to evaluate a new configuration every hour, hence we insert a workflow step that waits for the end of the current hour.

Typically, the right time slot depends on the configuration process of your application.

4) Observe how the application behaves with the new configuration

In a live optimization, Akamas simply needs to wait for a given observation interval, while the application works in production with the new configuration. Telemetry metrics will be collected during this observation period and will be analyzed by Akamas AI to recommend the next configuration.

Since we decided to evaluate a configuration every hour, we use a 55-minute observation interval, leaving 5 minutes for the configuration process.
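
The timing arithmetic behind this schedule can be sketched in a few lines. This is an illustrative Python snippet, assuming minute-level granularity as in the wait step of the workflow manifest in this guide; the function name is hypothetical.

```python
OBSERVATION_SECONDS = 55 * 60  # 55-minute observation interval = 3300 seconds


def wait_before_observing(minute: int, threshold: int = 55) -> int:
    """Seconds to sleep before the observation starts.

    Mirrors the wait step in the workflow of this guide: before the
    threshold minute, sleep until the current hour ends; from the
    threshold minute on, start right away.
    """
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    return 60 * (60 - minute) if minute < threshold else 0
```

For example, at minute 8 the step waits 60 * 52 = 3120 seconds, while at minute 57 it does not wait at all.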

Let's now create a workflow.yaml manifest like the following:

name: frontend-11-delayedApproval-hpa-1hour-system2
tasks:
  - name: configure frontend
    operator: FileConfigurator
    arguments:
      source:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
        path: /work/examples/hipstershop-hpa/hipstershop-2/ak-frontend.sh.templ
      target:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
        path: /work/ak-frontend-2.sh

  - name: apply frontend
    operator: Executor
    arguments:
      timeout: 5m
      host:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
      command: sh /work/ak-frontend-2.sh hipster-shop frontend

  - name: verify frontend
    operator: Executor
    arguments:
      timeout: 5m
      host:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
      command: kubectl rollout status --timeout=5m deployment/frontend -n hipster-shop;

  - name: configure hpa
    operator: FileConfigurator
    arguments:
      source:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
        path: /work/examples/hipstershop-hpa/hipstershop-2/frontend-hpa-v2.yaml.templ
      target:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
        path: /work/frontend-hpa-v2-2.yaml

  - name: apply hpa
    operator: Executor
    arguments:
      timeout: 5m
      host:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
      command: kubectl apply -f /work/frontend-hpa-v2-2.yaml -n hipster-shop

  - name: check if we are in time or wait for start of next hour
    operator: Executor
    arguments:
      host:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
      command: m=$(date +%M | sed 's/^0//'); if [ "$m" -lt 55 ]; then sleep $((60*(60 - m))); fi

  - name: observe 55 minutes
    operator: Sleep
    arguments:
      seconds: 3300

Then run:

akamas create workflow workflow.yaml

Telemetry

To collect metrics of your target Kubernetes deployment, you create a telemetry instance based on your observability setup.

Create a dynatrace.yaml manifest like the following:

provider: Dynatrace
config:
  url: <YOUR_DYNATRACE_URL>
  token: <YOUR_DYNATRACE_TOKEN>
  pushEvents: false

Then run:

akamas create telemetry-instance dynatrace.yaml frontend-2

Create a prometheus.yaml manifest like the following:

provider: Prometheus
config:
  address: prom-kube-prometheus-stack-prometheus.monitoring
  port: 9090
  duration: 60
  logLevel: DETAILED
metrics:
  - metric: cost
    datasourceMetric: 'sum(kube_pod_container_resource_requests{resource="cpu" %FILTERS%})*29 + sum(kube_pod_container_resource_requests{resource="memory" %FILTERS%})/1024/1024/1024*3.2'

Then run:

akamas create telemetry-instance prometheus.yaml frontend-2

Study

It's now time to create the Akamas study to achieve your optimization objectives.

Let's explore how the study is designed by going through the main concepts. The complete study manifest is available at the bottom.

Goal

Your overall objective is to reduce the cost (or resource footprint) of a Kubernetes deployment. To do that, you need to define the goal, which is a metric (or combination of metrics) representing the deployment cost to be minimized.

There are different approaches to measuring the cost of Kubernetes deployments:

A simple approach is to consider that Kubernetes allocates infrastructure resources based on pod resource requests (CPU and memory). Hence, the cost of a deployment can be derived from the deployment aggregate CPU and memory requests. In this guide, we use this approach and define the study goal as the sum of CPU and memory requests of the container to be optimized.
Alternatively, the cost of a Kubernetes deployment can also be collected from external data sources that provide actual cost metrics like OpenCost. In this case, the study goal can be defined by leveraging the cost metric. See here for more information on how to integrate cost metrics.

Notice that weighting factors can be used in the goal formula to specify the importance of CPU vs memory resources. For example, the cloud price of 1 CPU is about 9 times that of 1 GB of RAM. You can customize those weights based on your requirements so that Akamas knows how to truly reach the most cost-efficient configuration in your specific context.

In this guide, the weighted cost formula is defined directly in the Prometheus telemetry instance as the custom cost metric, and the study goal simply references that metric.
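
As a worked example of the weighting, the snippet below evaluates a cost of the form used in this guide, with a weight of 29 per CPU core and 3.2 per GiB of memory (a ratio of about 9). It is a sketch in Python: the function name and the sample replica count are hypothetical, and it assumes CPU requests are expressed in cores and memory requests in bytes.

```python
CPU_WEIGHT = 29.0     # cost weight per requested CPU core
MEMORY_WEIGHT = 3.2   # cost weight per requested GiB of memory
GIB = 1024 ** 3


def deployment_cost(cpu_request_cores: float, memory_request_bytes: float) -> float:
    """Weighted sum of the aggregate CPU and memory requests."""
    return cpu_request_cores * CPU_WEIGHT + (memory_request_bytes / GIB) * MEMORY_WEIGHT


# Hypothetical sample: 3 replicas, each requesting 200 millicores and 128 MiB.
replicas = 3
total_cpu = replicas * 0.2
total_memory = replicas * 128 * 1024 ** 2
cost = deployment_cost(total_cpu, total_memory)
print(round(cost, 2))  # 18.6
```

Because the requests are summed over all running pods, the cost changes both with the per-pod requests and with the number of replicas chosen by the HPA.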

Constraints

When optimizing for cost reduction (or resource footprint), it's key not to impact application response time or introduce risks of availability and reliability issues. To ensure this, you can define your performance and reliability requirements (SLOs) as metric constraints.

In this study:

to ensure application performance, constraints are specified on application response times and error rate
to ensure application reliability, constraints are specified on container peak CPU and memory utilization, and container out-of-memory kills

Parameters

To achieve cost-efficient and reliable microservices, Kubernetes container resources and HPA scaling options must be configured optimally and tuned jointly, as they are heavily interconnected.

To do that, the study includes the following parameters:

Kubernetes container: CPU and memory requests and limits
HPA target CPU utilization

The study also includes parameter constraints to ensure that recommended configurations are safe and comply with best practices. In particular:

CPU limits must be at most 2x CPU requests, to avoid excessive over-commitment of CPU limits in the cluster.

Notice that the parameters and constraints can change depending on your policies. For example, it is a best practice to set memory requests == limits to avoid pod eviction, hence we are only tuning the memory limit in the study and set the request to the same value in the deployment file.
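
The CPU constraints used in this guide (request no greater than limit, and limit at most 2x the request) can be checked with a few lines of Python. This is only a sketch to make the rule concrete: the function name is hypothetical, and the values are assumed to be in millicores.

```python
def cpu_settings_are_valid(cpu_request: int, cpu_limit: int, max_factor: float = 2.0) -> bool:
    """True when cpu_request <= cpu_limit <= max_factor * cpu_request."""
    return cpu_request <= cpu_limit <= cpu_request * max_factor


print(cpu_settings_are_valid(200, 400))  # True: limit is exactly 2x the request
print(cpu_settings_are_valid(200, 500))  # False: limit exceeds 2x the request
print(cpu_settings_are_valid(300, 200))  # False: limit is below the request
```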

Workload

Akamas live optimization considers the application's workload to recommend new configurations that are optimal for the goal (e.g. reduce cost) while meeting all metric constraints (e.g., latency and error rates).

For Kubernetes microservices, the workload is typically the throughput (requests/sec) of the microservice API endpoints. This is the approach used in this guide.

Approval mode

In this live optimization, the manual approval is set to false, meaning that as soon as a new configuration gets generated, the workflow will be executed without any human involvement.

You can set it to true so that Akamas will ask for user approval when a new configuration gets generated. Once you approve it, the workflow will be executed, and the new configuration will be deployed to production according to the integration strategy you have defined above.

You can now create a study.yaml manifest like the following:

name: ak-frontend - live - system 2
system: frontend-2
workflow: frontend-11-delayedApproval-hpa-1hour-system2

goal:
  name: Cost
  objective: minimize
  function:
    formula: web_application.cost
  constraints:
    absolute:
      - name: Application response time degradation
        formula: web_application.requests_response_time_p50:p90 <= 60
      - name: Application error rate degradation
        formula: web_application.requests_error_rate:p90 <= 0.02
      - name: Container CPU saturation
        formula: server.container_cpu_util_max:p90 < 0.8
      - name: Container memory saturation
        formula: server.container_memory_used:max / server.container_memory_limit < 0.7

windowing:
  type: trim
  trim: [1m,  1m]
  task: observe 55 minutes

parametersSelection:
  - name: server.cpu_request
    domain: [10, 500]
  - name: server.cpu_limit
    domain: [10, 500]
  - name: server.memory_limit
    domain: [16, 640]
  - name: frontend_hpa.metrics_resource_target_averageUtilization
    domain: [10, 90]

parameterConstraints:
  - name: CPU request less or equal to limits
    formula: server.cpu_request <= server.cpu_limit
  - name: CPU limit within a given factor of request
    formula: server.cpu_limit <= server.cpu_request * 2

workloadsSelection:
  - name: web_application.requests_throughput:max
  - name: web_application.requests_throughput

numberOfTrials: 1
steps:
  - name: baseline
    type: baseline
    numberOfTrials: 3
    values:
      server.cpu_request: 200
      server.cpu_limit: 400
      server.memory_limit: 128
      frontend_hpa.metrics_resource_target_averageUtilization: 60
    renderParameters: [frontend_hpa.metrics_resource_target_averageUtilization]

  - name: optimize
    type: optimize
    numberOfExperiments: 300

Then run:

akamas create study study.yaml

You can now follow the live optimization progress and explore the results using the Akamas UI.

Optimize cost of a Kubernetes microservice while preserving SLOs in production

In this example, you will use Akamas live optimization to minimize the cost of a Kubernetes deployment, while preserving application performance and reliability requirements.

Prerequisites

In this example, you need:

an Akamas instance
a Kubernetes cluster, with a deployment to be optimized
the kubectl command installed in the Akamas instance, configured to access the target Kubernetes cluster and with privileges to get and update the deployment configurations
a supported telemetry data source (e.g. Prometheus or Dynatrace) configured to collect metrics from the target Kubernetes cluster

Optimization setup

Optimization packs

This example leverages the following optimization packs:

Kubernetes
Web Application

System

The system represents the Kubernetes deployment to be optimized (let's call it "frontend"). You can create a system.yaml manifest like this:

name: frontend
description: Kubernetes frontend deployment

Create the new system resource:

akamas create system system.yaml

The system will then have two components:

A Kubernetes container component, which contains container-level metrics like CPU usage and parameters to be tuned like CPU limits
A Web Application component, which contains service-level metrics like throughput and response time

In this example, we assume the deployment to be optimized is called frontend, with a container named server, and is located within the boutique namespace. We also assume that Dynatrace is used as a telemetry provider.

Kubernetes component

Create a component-container.yaml manifest like the following:

name: container
description: Kubernetes container, part of the frontend deployment
componentType: Kubernetes Container
properties:
  dynatrace:
    type: CONTAINER_GROUP_INSTANCE
    kubernetes:
      namespace: boutique
      containerName: server
      basePodName: frontend-*

Then run:

akamas create component component-container.yaml frontend

Now create a component-webapp.yaml manifest like the following:

name: webapp
description: The service related to the frontend deployment
componentType: Web Application
properties:
  dynatrace:
    id: <TELEMETRY_DYNATRACE_WEBAPP_ID>

Then run:

akamas create component component-webapp.yaml frontend

Workflow

The workflow in this example is composed of four main steps:

Update the Kubernetes deployment manifest with the parameters (CPU and memory limits) recommended by Akamas
Apply the new parameters (kubectl apply)
Wait for the rollout to complete
Sleep for 30 minutes (observation interval)

Create a workflow.yaml manifest like the following:

name: frontend
tasks:
  - name: configure
    operator: FileConfigurator
    arguments:
      source:
        hostname: mymachine
        username: user
        key: /home/user/.ssh/key
        path: frontend.yaml.templ
      target:
        hostname: mymachine
        username: user
        key: /home/user/.ssh/key
        path: frontend.yaml

  - name: apply
    operator: Executor
    arguments:
      timeout: 5m
      host:
        hostname: mymachine
        username: user
        key: /home/user/.ssh/key
      command: kubectl apply -f frontend.yaml

  - name: verify
    operator: Executor
    arguments:
      timeout: 5m
      host:
        hostname: mymachine
        username: user
        key: /home/user/.ssh/key
      command: kubectl rollout status --timeout=5m deployment/frontend -n boutique;

  - name: observe
    operator: Sleep
    arguments:
      seconds: 1800

Then run:

akamas create workflow workflow.yaml

Telemetry

Create a telemetry.yaml manifest like the following:

provider: Dynatrace
config:
  url: <YOUR_DYNATRACE_URL>
  token: <YOUR_DYNATRACE_TOKEN>
  pushEvents: false

Then run:

akamas create telemetry-instance telemetry.yaml frontend

Study

In this live optimization:

the goal is to reduce the cost of the Kubernetes deployment. In this example, the cost is based on the amount of CPU and memory limits (assuming requests = limits).
the approval mode is set to manual, and a new recommendation is generated daily
to avoid impacting application performance, constraints are specified on desired response times and error rates
to avoid impacting application reliability, constraints are specified on peak resource usage and out-of-memory kills
the parameters to be tuned are the container CPU and memory limits (we assume requests=limits in the deployment file)
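
To make the cost goal concrete, the following Python sketch evaluates a formula that weights each CPU core 3 times as much as each GiB of memory, as in the goal of the study manifest for this example. The function name is hypothetical, and the units (CPU limit in millicores, memory limit in bytes) are an assumption inferred from the divisors in that formula.

```python
def container_cost(cpu_limit_millicores: float, memory_limit_bytes: float) -> float:
    """Cost as 3 x CPU cores plus memory in GiB."""
    cpu_cores = cpu_limit_millicores / 1000
    memory_gib = memory_limit_bytes / (1024 * 1024 * 1024)
    return cpu_cores * 3 + memory_gib


# Baseline of this example: 1000 millicores and 1536 MiB.
baseline = container_cost(1000, 1536 * 1024 * 1024)
print(baseline)  # 4.5
```

With the lowest values allowed by the parameter domains (300 millicores and 800 MiB), the same function returns roughly 1.68, which gives a sense of the maximum saving the study can reach.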

Create a study.yaml manifest like the following:

name: frontend
system: frontend
workflow: frontend
requireApproval: true

goal:
  objective: minimize
  function:
    formula: (((container.container_cpu_limit/1000) * 3) + (container.container_memory_limit/(1024*1024*1024)))
  constraints:
    absolute:
      - name: Response Time
        formula: webapp.requests_response_time <= 300
      - name: Error Rate
        formula: webapp.service_error_rate:max <= 0.05
      - name: Container CPU saturation
        formula: container.container_cpu_util:p95 < 0.8
      - name: Container memory saturation
        formula: container.container_memory_util:max < 0.7
      - name: Container out-of-memory kills
        formula: container.container_oom_kills_count == 0

parametersSelection:
  - name: container.cpu_limit
    domain: [300, 1000]
  - name: container.memory_limit
    domain: [800, 1536]

windowing:
  type: trim
  trim: [5m, 0m]
  task: observe

workloadsSelection:
  - name: webapp.requests_throughput

steps:
  - name: baseline
    type: baseline
    numberOfTrials: 48
    values:
      container.cpu_limit: 1000
      container.memory_limit: 1536

  - name: optimize
    type: optimize
    numberOfTrials: 48
    numberOfExperiments: 100
    numberOfInitExperiments: 0
    maxFailedExperiments: 50

Then run:

akamas create study study.yaml

You can now follow the live optimization progress and explore the results using the Akamas UI for Live optimizations.

Optimize cost of a Java microservice on Kubernetes while preserving SLOs in production

In this guide, you optimize the cost (or resource footprint) of a Java microservice running on Kubernetes. The study tunes both pod resource settings (CPU and memory requests and limits) and JVM options (max heap size, garbage collection algorithm, etc.) at the same time, while also taking into account your application performance and reliability requirements (SLOs). This optimization happens in production, leveraging Akamas live optimization capabilities.

Prerequisites

an Akamas instance
a Kubernetes cluster, with a Java-based deployment to be optimized
a supported telemetry data source configured to collect metrics from the target Kubernetes cluster (see the documentation for the full list)
a way to apply configuration changes recommended by Akamas to the target deployment. In this guide, Akamas interacts directly with the Kubernetes APIs via kubectl. You need a service account with permission to update your deployment (see below for other integration options)

Optimization setup

In this guide, we assume the following setup:

the Kubernetes deployment to be optimized is called adservice (in the boutique namespace)
in the deployment, there is a container named server, where the application JVM runs
Dynatrace is used as an observability tool

Let's set up the Akamas optimization for this use case.

System

For this optimization, you need the following components to model the adservice tech stack:

A Kubernetes container component, which contains container-level metrics like CPU usage and parameters to be tuned like CPU limits (from the Kubernetes optimization pack)
A Java OpenJDK component, which contains JVM-level metrics like heap memory usage and parameters to be tuned like the garbage collector algorithm (from the Java OpenJDK optimization pack)
A Web Application component, which contains service-level metrics like throughput and response time of the microservice (from the Web Application optimization pack)

Let's start by creating the system, which represents the Kubernetes deployment to be optimized. To create it, write a system.yaml manifest like this:

Then run:

Now create a component-container.yaml manifest like the following:

Notice the component includes properties that specify how Dynatrace telemetry will look up this container in the Kubernetes cluster (the same will happen for the following components).

These properties are dependent upon the telemetry provider you are using.

Then run:

Next, create a component-jvm.yaml manifest like the following:

Then run:

Now create a component-webapp.yaml manifest like the following:

Then run:

Workflow

To optimize a Kubernetes microservice in production, you need to create a workflow that defines how to deploy in production the new configuration recommended by Akamas.

Let's explore the high-level tasks required in this scenario and the options you have to adapt it to your environment:

1) Update the Kubernetes deployment configuration

The first step is to update the Kubernetes deployment with the new configuration. This can be done in several ways depending on your environment and processes:

A simple option is to let Akamas directly update the deployment leveraging the Kubernetes APIs via kubectl commands
Another option is to follow an Infrastructure-as-code approach, where the configuration change is managed via pull requests to a Git repository, leveraging your pipelines to deploy the change in production

2) Wait for the new deployment to be rolled out in production

This task can be done in several ways depending on how you manage changes, as discussed in the previous task:

A simple option is to use the kubectl rollout command to wait for the deployment rollout completion. This is the approach used in this guide
Another option is to follow an Infrastructure-as-code approach, where a change is managed via pull requests to a Git repository, leveraging your pipelines to deploy in production. In this situation, the deployment process is executed externally and is not controlled by Akamas. Hence, the workflow task will periodically poll the Kubernetes deployment to recognize when the new deployment has landed in production

3) Observe how the application behaves with the new configuration

A 30-minute observation interval is recommended for most situations.

Let's now create a workflow.yaml manifest like the following:

In the configure task, Akamas will apply the container CPU/memory limits and JVM options recommended by Akamas AI to the deployment file. To do that, copy your deployment manifest to a template file (here called adservice.yaml.templ), and substitute the current values with Akamas parameter placeholders as follows:

Whenever a configuration recommended by Akamas is applied, the configure task creates the actual adservice.yaml deployment file, with the parameter placeholders replaced by the values recommended by Akamas AI, and then the new deployment is applied via kubectl apply.
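
The substitution performed by the configure task can be pictured with the Python sketch below. It is not the FileConfigurator implementation: it assumes placeholders of the form `${component.parameter}`, and the component and parameter names (`container.cpu_limit`, `container.memory_limit`), the sample values, and the `render_template` helper are all hypothetical.

```python
import re

# Matches tokens such as ${container.cpu_limit} (assumed placeholder form).
PLACEHOLDER = re.compile(r"\$\{([A-Za-z_]\w*\.\w+)\}")


def render_template(template: str, values: dict) -> str:
    """Replace every placeholder with its recommended value."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in values:
            raise KeyError(f"no value recommended for {key}")
        return str(values[key])

    return PLACEHOLDER.sub(substitute, template)


template = "cpu: ${container.cpu_limit}\nmemory: ${container.memory_limit}\n"
rendered = render_template(
    template,
    {"container.cpu_limit": "500m", "container.memory_limit": "1024Mi"},
)
print(rendered)
```

Failing loudly on an unknown placeholder is a deliberate choice in this sketch: a template that silently keeps an unresolved token would produce an invalid deployment file.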

To create the workflow, run:

Telemetry

Create a telemetry instance based on your observability setup to collect your target Kubernetes deployment metrics.

Create a telemetry.yaml manifest like the following:

Then run:

Study

It's time to create the Akamas study to achieve your optimization objectives.

Let's explore how the study is designed by going through the main concepts. The complete study manifest is available at the bottom.

Goal

There are different approaches to measuring the cost of Kubernetes deployments:

A simple approach is to consider that Kubernetes allocates infrastructure resources based on pod resource requests (CPU and memory). Hence, the cost of a deployment can be derived from the deployment aggregate CPU and memory requests. In this guide, we use this approach and define the study goal as the sum of CPU and memory requests of the container to be optimized
Alternatively, the cost of a Kubernetes deployment can also be collected from external data sources that provide actual cost metrics like OpenCost. In this case, the study goal can be defined by leveraging the cost metric

Constraints

In this study:

to ensure application performance, constraints are specified on application response times and error rate
to ensure application reliability, constraints are specified on:
- container peak CPU and memory utilization, and container out-of-memory kills
- JVM garbage collection time %, to prevent out-of-memory in the JVM heap memory

Parameters

To achieve cost-efficient and reliable Java-based microservices, Kubernetes container resources and JVM runtime options must be configured optimally and tuned jointly, as they are heavily interconnected.

To do that, the study includes the following parameters:

Kubernetes container: CPU and memory requests and limits
JVM: heap size and garbage collection (GC) algorithms

The study also includes parameter constraints to ensure that recommended configurations are safe and comply with best practices. In particular:

Kubernetes container memory limit must be higher than JVM heap size, plus a buffer to account for JVM off-heap memory usage
CPU limits must be at most 2x CPU requests, to avoid excessive over-commitment of CPU limits in the cluster
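
The memory constraint above can be illustrated with a short Python check. This is a sketch under stated assumptions: the function name is hypothetical, both values are in MiB, and the 20% off-heap buffer is an arbitrary illustrative figure, not a value taken from the study.

```python
def memory_limit_fits_heap(
    memory_limit_mib: float, max_heap_mib: float, buffer_ratio: float = 0.2
) -> bool:
    """True when the container limit covers the heap plus the off-heap buffer."""
    return memory_limit_mib >= max_heap_mib * (1 + buffer_ratio)


print(memory_limit_fits_heap(1536, 1024))  # True: a 1024 MiB heap plus 20% needs about 1229 MiB
print(memory_limit_fits_heap(1024, 1024))  # False: no room left for off-heap memory
```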

Notice that the parameters and constraints can change depending on your policies. For example, it is a best practice to set memory requests == limits to avoid pod eviction. In this case, you only include memory requests in the study and set limits to the same value in the deployment file.

Workload

For Kubernetes microservices, the workload is typically the throughput (requests/sec) of the microservice API endpoints. This is the approach used in this guide.

Approval mode and recommendation frequency

In this live optimization, the manual approval is set to required, meaning that Akamas will ask for user approval when a new configuration gets generated. Once you approve it, the workflow will be executed, and the new configuration will be deployed to production according to the integration strategy you have defined above.

You can set it to false to enable fully autonomous optimization: in this case, as soon as a new configuration gets generated, the workflow will be executed without any human involvement.

The recommendation frequency can be chosen by leveraging the numberOfTrials parameter. As the workflow duration is set to 30 minutes, in order to have a new configuration generated daily, set the number of trials to 48.
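
The relationship between workflow duration and recommendation frequency is simple arithmetic, sketched here in Python. The function name is hypothetical, and the calculation assumes each trial takes exactly one workflow run.

```python
def trials_per_recommendation(recommendation_period_minutes: int, workflow_minutes: int) -> int:
    """Number of trials that fit in one recommendation period."""
    if workflow_minutes <= 0:
        raise ValueError("workflow_minutes must be positive")
    return recommendation_period_minutes // workflow_minutes


print(trials_per_recommendation(24 * 60, 30))  # 48
print(trials_per_recommendation(24 * 60, 60))  # 24
```

The second call shows how the value would change with a one-hour workflow: a daily recommendation would then need 24 trials.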

You can now create a study.yaml manifest like the following:

Then run:

You can now follow the live optimization progress and explore the results using the Akamas UI.

Artifact templates

To quickly set up this optimization, download the Akamas template manifests and update the values file to match your needs. Then, create your optimization using the Akamas scaffolding.

Application runtime

Offline optimizations

Optimizing a sample Java OpenJDK application

In this example study we’ll tune the parameters of PageRank, one of the benchmarks available in the Renaissance suite, with the goal of minimizing its memory usage. Application monitoring is provided by Prometheus, leveraging a JMX exporter.

Environment setup

The test environment includes the following instances:

Akamas: instance running Akamas
PageRank: instance running the PageRank benchmark and the Prometheus monitoring service

Telemetry Infrastructure setup

To gather metrics about PageRank we will use Prometheus and a JMX exporter. Here’s the scrape job to add to the Prometheus configuration to extract the metrics from the exporter:

- job_name: jmx-exporter
  static_configs:
    - targets: ['pagerank.akamas.io:5556']
      labels:
        instance: jvm

Application and Test tool

To run and monitor the benchmark we’ll require on the PageRank instance:

The Renaissance jar
The JMX exporter agent, plus a configuration file to expose the required classes

Here’s the snippet of code to configure the instance as required for this guide:

mkdir renaissance; cd renaissance
wget -O renaissance.jar https://github.com/renaissance-benchmarks/renaissance/releases/download/v0.10.0/renaissance-gpl-0.10.0.jar
wget -O jmx_exporter.jar https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.14.0/jmx_prometheus_javaagent-0.14.0.jar
echo -e '---\nwhitelistObjectNames: ["java.lang:*"]' > conf.yaml

Optimization setup

In this section, we will guide you through the steps required to set up the optimization on Akamas.

If you have not installed the Java OpenJDK optimization pack yet, take a look at the optimization pack page Java OpenJDK to proceed with the installation.

System

System pagerank

Here’s the definition of the system we will use to group our components and telemetry instances for this example:

name: pagerank
description: A system to tune the pagerank benchmark

To create the system run the following command:

akamas create system pagerank.yaml

Component jvm

We’ll use a component of type Java OpenJDK 11 to represent the JVM underlying the PageRank benchmark. To identify the JMX-related metrics in Prometheus the configuration requires the prometheus property for the telemetry service, detailed later in this guide.

Here’s the definition of the component:

name: jvm
componentType: openjdk-11
properties:
  prometheus:
    instance: jvm
    job: jmx-exporter

To create the component in the system run the following command:

akamas create component jvm.yaml pagerank

Workflow

The workflow used for this study consists of two main stages:

generate the configuration file containing the tested Java parameters
run the execution using previously written parameters

Here’s the definition of the workflow:

name: run-pagerank
tasks:
  - name: Configure parameters
    operator: FileConfigurator
    arguments:
      source:
        hostname: pagerank.akamas.io
        username: ubuntu
        path: /home/ubuntu/renaissance/java_opts.template
        key: key
      target:
        hostname: pagerank.akamas.io
        username: ubuntu
        path: /home/ubuntu/renaissance/java_opts
        key: key

  - name: Run benchmark
    operator: Executor
    arguments:
      command: "cd renaissance; java -javaagent:./jmx_exporter.jar=5556:conf.yaml $(cat java_opts) -jar renaissance.jar -r 2 page-rank"
      host:
        hostname: pagerank.akamas.io
        username: ubuntu
        key: key

The configuration template java_opts.template is defined as follows:

 ${jvm.jvm_gcType} ${jvm.jvm_maxHeapSize} ${jvm.jvm_newSize} ${jvm.jvm_survivorRatio} ${jvm.jvm_maxTenuringThreshold}
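
To make the templating step more concrete, here is a minimal sketch of placeholder substitution. It is not the FileConfigurator implementation: the render function is hypothetical, and the rendered flag strings are assumptions about what each parameter might expand to.

```python
import re

TEMPLATE = "${jvm.jvm_gcType} ${jvm.jvm_maxHeapSize} ${jvm.jvm_newSize}"


def render(template, values):
    """Replace each ${component.parameter} token with its rendered value."""
    def substitute(match):
        # A missing parameter raises KeyError instead of leaving a raw token.
        return values[match.group(1)]
    return re.sub(r"\$\{([^}]+)\}", substitute, template)


# Hypothetical rendered values for one configuration under test.
rendered_flags = {
    "jvm.jvm_gcType": "-XX:+UseG1GC",
    "jvm.jvm_maxHeapSize": "-Xmx2000m",
    "jvm.jvm_newSize": "-XX:NewSize=350m",
}
print(render(TEMPLATE, rendered_flags))
```

The resulting string is what the Run benchmark task would read through $(cat java_opts) under these assumptions.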

To create the workflow run the following command:

akamas create workflow workflow.yaml

Telemetry

The following is the definition of the telemetry instance that fetches metrics from the Prometheus service:

provider: Prometheus
config:
  address: pagerank.akamas.io
  port: 9090

To create the telemetry instance in the system run the following command:

akamas create telemetry-instance prometheus.yaml pagerank

This telemetry instance will be able to bind the fetched metrics to the related jvm component thanks to the prometheus attribute we previously added in its definition.

Study

The goal of this study is to find a JVM configuration that minimizes the peak memory used by the benchmark.

The optimized parameters are the maximum heap size, the garbage collector used and several other parameters managing the new and old heap areas. We also specify a constraint stating that the GC regions can’t exceed the total heap available, to avoid experimenting with parameter configurations that can’t start in the first place.
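
The effect of the constraint can be sketched as a filter on candidate configurations. The function below is illustrative only; it uses the domains from the study definition and assumes that both values are expressed in the same unit (megabytes here).

```python
def is_valid(max_heap_mb, new_size_mb):
    """Check a candidate against the study domains and the heap constraint."""
    in_domain = 1250 <= max_heap_mb <= 2000 and 350 <= new_size_mb <= 2000
    # The new generation must fit inside the total heap.
    return in_domain and max_heap_mb > new_size_mb


print(is_valid(2000, 350))   # inside the domains, heap larger than new size
print(is_valid(1250, 1500))  # inside the domains, but new size exceeds the heap
```

Configurations like the second one are never tried, since a JVM started with them could fail before the benchmark begins.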

Here’s the definition of the study:

name: Optimize PageRank
description: Tweaking the JVM parameters to optimize the page-rank benchmark.
system: pagerank
workflow: run-pagerank

goal:
  objective: minimize
  function:
    formula: memory_used
    variables:
      memory_used:
        metric: jvm.jvm_memory_used

parametersSelection:
  - name: jvm.jvm_gcType
  - name: jvm.jvm_maxHeapSize
    domain: [1250, 2000]
  - name: jvm.jvm_newSize
    domain: [350, 2000]
  - name: jvm.jvm_survivorRatio
  - name: jvm.jvm_maxTenuringThreshold

parameterConstraints:
  - name: Max heap must always be greater than new size
    formula: jvm.jvm_maxHeapSize > jvm.jvm_newSize

steps:
  - name: baseline
    type: baseline
    values:
      jvm.jvm_gcType: G1
      jvm.jvm_maxHeapSize: 2000

  - name: optimize
    type: optimize
    numberOfExperiments: 30

To create and run the study execute the following commands:

akamas create study study.yaml
akamas start study 'Optimize PageRank'

Optimizing cost of a Node.js application with performance tests

COMING SOON! Please reach out to us at support@akamas.io if interested.

Optimizing cost of a Golang application with performance tests

COMING SOON! Please reach out to us at support@akamas.io if interested.

Optimizing cost of a .NET application with performance tests

COMING SOON! Please reach out to us at support@akamas.io if interested.

Applications running on cloud instances

Optimizing a sample application running on AWS

In this example, you will go through the optimization of a Spark application running on AWS instances. We’ll be using a PageRank implementation included in Renaissance, an industry-standard Java benchmarking suite, tuning both Java and AWS parameters to improve the performance of our application.

Environment setup

For this example, you’re expected to use two dedicated machines:

an Akamas instance
a Linux-based AWS EC2 instance

The Akamas instance must be able to provision and manipulate EC2 instances. To enable this, set the required AWS policies, integrate with an orchestration tool (such as Ansible), and provide an inventory linked to your AWS EC2 environment.

The Linux-based instance will run the application benchmark, so it requires the latest OpenJDK 11 release:

sudo apt install openjdk-11-jre

Telemetry Infrastructure setup

For this study you’re going to require the following telemetry providers:

CSV Provider to parse the results of the benchmark
Prometheus provider to monitor the instance
AWS Telemetry provider to extract instance price

Application and Test tool

The renaissance suite provides the benchmark we’re going to optimize.

Since the application consists of a jar file only, the setup is rather straightforward; just download the binary in the ~/renaissance/ folder:

mkdir ~/renaissance
cd ~/renaissance
wget -O renaissance.jar https://github.com/renaissance-benchmarks/renaissance/releases/download/v0.10.0/renaissance-gpl-0.10.0.jar

In the same folder, upload the template file launch_benchmark.sh.templ, containing the script that executes the benchmark using the provided parameters and parses the results:

#!/bin/bash
java -XX:MaxRAMPercentage=60 ${jvm.*} -jar renaissance.jar -r 50 --csv renaissance.csv page-rank

total_time=$(awk -F"," '{total_time+=$2}END{print total_time}' ./renaissance.csv)
first_line=$(head -n 1 renaissance.csv)
end_time=$(tail -n 1 renaissance.csv | cut -d',' -f3)
start_time=$(sed '2q;d' renaissance.csv | cut -d',' -f4)
echo $first_line,"TS,COMPONENT" > renaissance-parsed.csv
ts=$(date -d @$(($start_time/1000)) "+%Y-%m-%d %H:%M:%S")

echo -e "page-rank,$total_time,$end_time,$start_time,$ts,pagerank" >> renaissance-parsed.csv

You may find further info about the suite and its benchmarks in the official doc.
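
For readers less familiar with awk, the aggregation performed by the script above (summing the second CSV column over all repetitions) can be sketched as follows. The sample data and its header names are made up for illustration and are not real benchmark output.

```python
import csv
import io

# Made-up sample with the same column layout the script relies on:
# the second column holds the duration of each repetition.
SAMPLE = (
    "benchmark,duration,uptime,start\n"
    "page-rank,100,1000,1600000000000\n"
    "page-rank,150,1200,1600000000000\n"
)


def total_duration(csv_text, column=1):
    """Sum one numeric column over all data rows, skipping the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return sum(int(row[column]) for row in rows[1:] if row)


print(total_duration(SAMPLE))
```

The script then writes this total, together with a timestamp and the component name, as the single data row of renaissance-parsed.csv.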

Optimization setup

In this section, we will guide you through the steps required to set up the optimization on Akamas.

Optimization packs

This example requires the installation of the following optimization packs:

System

Our system could be named renaissance after its application, so you’ll have a system.yaml file like this:

name: renaissance
description: The system running the Renaissance PageRank benchmark

Then create the new system resource:

akamas create system system.yaml

The renaissance system will then have three components:

A benchmark component
A Java component
An EC2 component, i.e. the underlying instance

Java component

Create a component-jvm.yaml file like the following:

name: jvm
description: The JVM running the benchmark
componentType: java-openjdk-11
properties:
    prometheus:
      job: jmx
      instance: jmx_instance

Then type:

akamas create component component-jvm.yaml renaissance

Benchmark component

Since there is no optimization pack associated with this component, you have to create some extra resources.

A metrics.yaml file for a new metric tracking execution time:

metrics:
  - name: elapsed
    unit: nanoseconds
    description: The duration of the benchmark execution

A component-type benchmark.yaml:

name: benchmark
description: A component type for the Renaissance Java benchmarking suite
metrics:
  - name: elapsed
parameters: []

The component pagerank.yaml:

name: pagerank
description: The pagerank application included in Renaissance benchmarks
componentType: benchmark

Create your new resources, by typing in your terminal the following commands:

akamas create metrics metrics.yaml
akamas create component-type benchmark.yaml
akamas create component pagerank.yaml renaissance

EC2 component

Create a component-ec2.yaml file like the following:

name: instance
description: The ec2 instance the benchmark runs on
componentType: ec2
properties:
  hostname: renaissance.akamas.io
  sshPort: 22
  instance: ec2_instance
  username:  ubuntu
  key: # SSH KEY
  ec2:
    region: us-east-2 # This is just a reference

Then create its resource by typing in your terminal:

akamas create component component-ec2.yaml renaissance

Workflow

The workflow in this example is composed of three main steps:

Update the instance type
Run the application benchmark
Stop the instance

To manage the instance, we are going to integrate a very simple Ansible playbook in our workflow: the FileConfigurator operator will replace the parameters in the template file in order to generate the code run by the Executor operator, as explained in the Ansible page.

In detail:

Update the instance size
1. Generate the playbook file from the template
2. Update the instance using the playbook
3. Wait for the instance to be available
Run the application benchmark
1. Configure the benchmark Java launch script
2. Execute the launch script
3. Parse PageRank output to make it consumable by the CSV telemetry instance
Stop the instance
1. Configure the playbook to stop an instance with a specific instance id
2. Run the playbook to stop the instance

The following is the template of the Ansible playbook:

# Change instance type, requires AWS CLI

- name: Resize the instance
  hosts: localhost
  gather_facts: no
  connection: local
  tasks:
  - name: save instance info
    ec2_instance_info:
      filters:
        "tag:Name": <your-instance-name>
    register: ec2
  - name: Stop the instance
    ec2:
      region: <your-aws-region>
      state: stopped
      instance_ids:
        - "{{ ec2.instances[0].instance_id }}"
      instance_type: "{{ ec2.instances[0].instance_type }}"
      wait: True
  - name: Change the instances ec2 type
    shell: >
       aws ec2 modify-instance-attribute --instance-id "{{ ec2.instances[0].instance_id }}"
       --instance-type "${ec2.aws_ec2_instance_type}.${ec2.aws_ec2_instance_size}"
    delegate_to: localhost
  - name: restart the instance
    ec2:
      region: <your-aws-region>
      state: running
      instance_ids:
        - "{{ ec2.instances[0].instance_id }}"
      wait: True
    register: ec2
  - name: wait for SSH to come up
    wait_for:
      host: "{{ item.public_dns_name }}"
      port: 22
      delay: 60
      timeout: 320
      state: started
    with_items: "{{ ec2.instances }}"

The following is the workflow configuration file:

name: Pagerank AWS optimization
tasks:

  # Creating the EC2 instance
  - name: Configure provisioning
    operator: FileConfigurator
    arguments:
      sourcePath: /home/ubuntu/ansible/resize.yaml.templ
      targetPath: /home/ubuntu/ansible/resize.yaml
      host:
        hostname: bastion.akamas.io
        username: ubuntu
        key: # SSH KEY

  - name: Execute Provisioning
    operator: Executor
    arguments:
      command: ansible-playbook /home/akamas/ansible/resize.yaml
      host:
        hostname: bastion.akamas.io
        username: akamas
        key: # SSH KEY

  # Waiting for the instance to come up and set up its DNS
  - name: Pause
    operator: Sleep
    arguments:
      seconds: 120

  # Running the benchmark
  - name: Configure Benchmark
    operator: FileConfigurator
    arguments:
        source:
            hostname: renaissance.akamas.io
            username: ubuntu
            path: /home/ubuntu/renaissance/launch_benchmark.sh.templ
            key: # SSH KEY
        target:
            hostname: renaissance.akamas.io
            username: ubuntu
            path: /home/ubuntu/renaissance/launch_benchmark.sh
            key: # SSH KEY

  - name: Launch Benchmark
    operator: Executor
    arguments:
      command: bash /home/ubuntu/renaissance/launch_benchmark.sh
      host:
        hostname: renaissance.akamas.io
        username: ubuntu
        key: # SSH KEY

Create the workflow resource by typing in your terminal:

Telemetry

If you have not installed the Prometheus telemetry provider or the CSV telemetry provider yet, take a look at the telemetry provider pages Prometheus provider and CSV Provider to proceed with the installation.

Prometheus

Prometheus allows us to gather jvm execution metrics through the jmx exporter: download the java agent required to gather metrics from here, then update the two following files:

The prometheus.yml file, located in your Prometheus folder:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: prometheus
    static_configs:
    - targets: ['localhost:9090']

  - job_name: jmx
    static_configs:
    - targets: ["localhost:9110"]
    relabel_configs:
    - source_labels: ["__address__"]
      regex: "(.*):.*"
      target_label: instance
      replacement: jmx_instance

The config.yml file you have to create in the ~/renaissance folder:

startDelaySeconds: 0
username:
password:
ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false
# using the property below we tell the exporter to export only the relevant Java metrics
whitelistObjectNames:
  - "java.lang:*"
  - "jvm:*"

Now you can create a prometheus-instance.yaml file:

provider: Prometheus
config:
  address: renaissance.akamas.io
  port: 9090

Then you can install the telemetry instance:

akamas create telemetry-instance prometheus-instance.yaml renaissance

You may find further info on exporting Java metrics to Prometheus here.

CSV - Telemetry instance

Create a telemetry-csv.yaml file to read the benchmark output:

provider: CSV
config:
  protocol: scp
  address: renaissance.akamas.io
  username: ubuntu
  authType: key
  auth: # SSH KEY
  remoteFilePattern: /home/ubuntu/renaissance/renaissance-parsed.csv
  csvFormat: horizontal
  componentColumn: COMPONENT
  timestampColumn: TS
  timestampFormat: yyyy-MM-dd HH:mm:ss

metrics:
  - metric: elapsed
    datasourceMetric: nanos

Then create the resource by typing in your terminal:

akamas create telemetry-instance telemetry-csv.yaml renaissance

Study

Here we provide a reference study for AWS. As anticipated, the goal of this study is to optimize a sample Java application: the PageRank benchmark included in the Renaissance benchmark suite by Oracle.

Our goal is rather simple: minimizing the product between the benchmark execution time and the instance price, that is, finding the most cost-effective instance for our application.
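
A small sketch shows why the product of time and price is a useful score. The configuration names, execution times, and prices below are placeholder numbers invented for the example, not real measurements or AWS prices.

```python
def cost_score(elapsed, price):
    """Score to minimize: execution time multiplied by the instance price."""
    return elapsed * price


# Hypothetical candidates: (execution time, price per unit of time).
candidates = {
    "config-a": (120.0, 0.34),  # slower, cheaper instance
    "config-b": (90.0, 0.68),   # faster, more expensive instance
}

best = min(candidates, key=lambda name: cost_score(*candidates[name]))
print(best)
```

In this made-up case the slower configuration wins, because the faster instance costs twice as much but is far from twice as fast.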

Create a study.yaml file with the following content:

name: aws
description: Tweaking aws and the JVM to optimize the page-rank application.
system: renaissance

goal:
  objective: minimize
  function:
    formula: benchmark.elapsed * aws.aws_ec2_price

workflow: workflow-aws

parametersSelection:
  - name: aws.aws_ec2_instance_type
    categories: [c5,c5d,c5a,m5,m5d,m5a,r5,r5d,r5a]
  - name: aws.aws_ec2_instance_size
    categories: [large,xlarge,2xlarge,4xlarge]
  - name: jvm.jvm_gcType
  - name: jvm.jvm_newSize
  - name: jvm.jvm_maxHeapSize
  - name: jvm.jvm_minHeapSize
  - name: jvm.jvm_survivorRatio
  - name: jvm.jvm_maxTenuringThreshold

steps:
  - name: baseline
    type: baseline
    numberOfTrials: 2
    values:
     aws.aws_ec2_instance_type: c5
     aws.aws_ec2_instance_size: 2xlarge
     jvm.jvm_gcType: G1
  - name: optimize
    type: optimize
    numberOfExperiments: 60

Then create the corresponding Akamas resource and start the study:

akamas create study study.yaml
akamas start study aws

Spark applications

Optimizing a Spark application

In this example study we’ll tune the parameters of SparkPi, one of the example applications provided by most of the Apache Spark distributions, to minimize its execution time. Application monitoring is provided by the Spark History Server APIs.

Environment setup

The test environment includes the following instances:

Akamas: instance running Akamas
Spark cluster: composed of instances with 16 vCPUs and 64 GB of memory, where the Spark binaries are installed under /usr/lib/spark. In particular, the roles are:
- 1x master instance: the Spark node running the resource manager and Spark History Server (host: sparkmaster.akamas.io)
- 2x worker instances: the other instances in the cluster

Telemetry Infrastructure setup

To gather metrics about the application we will leverage the Spark History Server. If it is not already running, start it on the master instance with the following command:

/usr/lib/spark/sbin/start-history-server.sh

Application and Test tools

To make sure the tested application is available on your cluster and runs correctly, execute the following commands:

file /usr/lib/spark/examples/jars/spark-examples.jar
spark-submit \
  --master yarn --deploy-mode client \
  --class 'org.apache.spark.examples.SparkPi' \
  /usr/lib/spark/examples/jars/spark-examples.jar 100

Optimization setup

In this section, we will guide you through the steps required to set up on Akamas the optimization of the Spark application execution.

System

System spark

Here’s the definition of the system we will use to group our components and telemetry instances for this example:

name: spark
description: A system to tune the Spark Pi example application

To create the system run the following command:

akamas create system system.yaml

Component sparkPi

We’ll use a component of type Spark Application 2.3.0 to represent the application running on the Apache Spark framework 2.3.

In the snippet shown below, we specify:

the field properties required by Akamas to connect via SSH to the cluster master instance
the parameters required by spark-submit to execute the application
the sparkApplication flag required by the telemetry instance to associate the metrics from the History Server to this component

name: sparkPi
description: The Spark Application used to calculate KPIs for ContentWise Analytics
componentType: Spark Application 2.3.0

properties:
  hostname: sparkmaster.akamas.io
  username: hadoop
  key: ssh_key

  master: yarn
  deployMode: client
  className: org.apache.spark.examples.SparkPi
  file: /usr/lib/spark/examples/jars/spark-examples.jar
  args: [ 1000 ]

  sparkApplication: 'true'

To create the component in the system run the following command:

akamas create component sparkPi.yaml spark

Workflow

The workflow used for this study contains only a single stage, where the operator submits the application along with the Spark parameters under test.

Here’s the definition of the workflow:

name: Run SparkPi
tasks:
- name: run application
  operator: SSHSparkSubmit
  arguments:
    component: sparkPi
    retries: 0

To create the workflow run the following command:

akamas create workflow workflow.yaml

Telemetry

If you have not installed the Spark History Server telemetry provider yet, take a look at the telemetry provider page Spark History Server Provider to proceed with the installation.

Here’s the definition of the telemetry instance, specifying the History Server endpoint:

provider: SparkHistoryServer
config:
  address: sparkmaster.akamas.io
  port: 18080

  importLevel: job

To create the telemetry instance in the system run the following command:

akamas create telemetry-instance telemetry.yaml spark

This telemetry instance will be able to bind the fetched metrics to the related sparkPi component thanks to the sparkApplication attribute we previously added in its definition.

Study

The goal of this study is to find a Spark configuration that minimizes the execution time for the example application.

To achieve this goal, we’ll tune the number of executor processes available to run the application job, and the memory and CPUs allocated to both the driver and the executors. The domains are configured so that a single driver or executor process does not exceed the size of the underlying instance. The constraints ensure that the application as a whole does not require more resources than the cluster provides, taking into account that some resources must be reserved for other services such as the cluster manager.

Note that this study uses two constraints on the total number of resources to be used by the spark application. This example refers to a cluster of three nodes with 16 cores and 64 GB of memory each, and at least one core per instance should be reserved for the system.
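
The two caps can be sketched as a feasibility check. The constants mirror the 15*3 and 60*3 limits used in the study definition; the function name is hypothetical, and the memory values are assumed to be in the same unit as the cap.

```python
NODES = 3
USABLE_CORES_PER_NODE = 15   # 16 cores minus one reserved for the system
USABLE_MEMORY_PER_NODE = 60  # cap used in the study, assumed to be in GB


def fits_cluster(driver_cores, executor_cores, num_executors,
                 driver_memory, executor_memory):
    """Check that the driver plus all executors fit within the cluster caps."""
    total_cores = driver_cores + executor_cores * num_executors
    total_memory = driver_memory + executor_memory * num_executors
    return (total_cores <= USABLE_CORES_PER_NODE * NODES
            and total_memory <= USABLE_MEMORY_PER_NODE * NODES)


print(fits_cluster(1, 4, 11, 4, 16))  # 45 cores and 180 memory units
print(fits_cluster(2, 4, 11, 4, 16))  # 46 cores exceed the CPU cap
```

Configurations that fail this check are excluded before they are submitted to the cluster.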

Here’s the definition of the study:

name: Speedup SparkPi execution
system: spark
workflow: Run SparkPi

goal:
  objective: minimize
  function:
    formula: sparkPi.spark_application_duration

parametersSelection:
- name: sparkPi.driverCores
  domain: [1, 10]
- name: sparkPi.driverMemory
  domain: [32, 2048]
- name: sparkPi.executorCores
  domain: [1, 15]
- name: sparkPi.executorMemory
  domain: [32, 2048]
- name: sparkPi.numExecutors
  domain: [1, 45]

parameterConstraints:
- name: cap_total_allocated_cpus
  formula: (sparkPi.driverCores + sparkPi.executorCores*sparkPi.numExecutors) <= 15*3

- name: cap_total_allocated_memory
  formula: (sparkPi.driverMemory + sparkPi.executorMemory*sparkPi.numExecutors) <= 60*3

steps:
- name: baseline
  type: baseline

- name: tune
  type: optimize
  numberOfExperiments: 200
  maxFailedExperiments: 200

To create and run the study execute the following commands:

akamas create study study.yaml
akamas start study 'Speedup SparkPi execution'

Optimize application performance and reliability

Kubernetes microservices

Offline optimizations

Live optimizations

Optimizing cost of a Kubernetes microservice while preserving SLOs in production

In this example, you will use Akamas live optimization to minimize the cost of a Kubernetes deployment, while preserving application performance and reliability requirements.

Prerequisites

In this example, you need:

an Akamas instance
a Kubernetes cluster, with a deployment to be optimized
the kubectl command installed in the Akamas instance, configured to access the target Kubernetes cluster and with privileges to get and update the deployment configurations
a supported telemetry data source (e.g. Prometheus or Dynatrace) configured to collect metrics from the target Kubernetes cluster

Optimization setup

Optimization packs

This example leverages the following optimization packs:

System

The system represents the Kubernetes deployment to be optimized (let's call it "frontend"). You can create a system.yaml manifest like this:

name: frontend
description: Kubernetes frontend deployment

Create the new system resource:

akamas create system system.yaml

The system will then have two components:

A Kubernetes container component, which contains container-level metrics like CPU usage and parameters to be tuned like CPU limits
A Web Application component, which contains service-level metrics like throughput and response time

Kubernetes component

Create a component-container.yaml manifest like the following:

name: container
description: Kubernetes container, part of the frontend deployment
componentType: Kubernetes Container
properties:
  dynatrace:
    type: CONTAINER_GROUP_INSTANCE
    kubernetes:
      namespace: boutique
      containerName: server
      basePodName: frontend-*

Then run:

akamas create component component-container.yaml frontend

Now create a component-webapp.yaml manifest like the following:

name: webapp
description: The service related to the frontend deployment
componentType: Web Application
properties:
  dynatrace:
    id: <TELEMETRY_DYNATRACE_WEBAPP_ID>

Then run:

akamas create component component-webapp.yaml frontend

Workflow

The workflow in this example is composed of four main steps:

Update the Kubernetes deployment manifest with the parameters (CPU and memory limits) recommended by Akamas
Apply the new parameters (kubectl apply)
Wait for the rollout to complete
Sleep for 30 minutes (observation interval)

Create a workflow.yaml manifest like the following:

name: frontend
tasks:
  - name: configure
    operator: FileConfigurator
    arguments:
      source:
        hostname: mymachine
        username: user
        key: /home/user/.ssh/key
        path: frontend.yaml.templ
      target:
        hostname: mymachine
        username: user
        key: /home/user/.ssh/key
        path: frontend.yaml

  - name: apply
    operator: Executor
    arguments:
      timeout: 5m
      host:
        hostname: mymachine
        username: user
        key: /home/user/.ssh/key
      command: kubectl apply -f frontend.yaml

  - name: verify
    operator: Executor
    arguments:
      timeout: 5m
      host:
        hostname: mymachine
        username: user
        key: /home/user/.ssh/key
      command: kubectl rollout status --timeout=5m deployment/frontend -n boutique;

  - name: observe
    operator: Sleep
    arguments:
      seconds: 1800

Then run:

akamas create workflow workflow.yaml

Telemetry

Create a telemetry.yaml manifest like the following:

provider: Dynatrace
config:
  url: <YOUR_DYNATRACE_URL>
  token: <YOUR_DYNATRACE_TOKEN>
  pushEvents: false

Then run:

akamas create telemetry-instance telemetry.yaml frontend

Study

In this live optimization:

the goal is to reduce the cost of the Kubernetes deployment. In this example, the cost is based on the amount of CPU and memory limits (assuming requests = limits).
the approval mode is set to manual, and a new recommendation is generated daily
to avoid impacting application performance, constraints are specified on desired response times and error rates
to avoid impacting application reliability, constraints are specified on peak resource usage and out-of-memory kills
the parameters to be tuned are the container CPU and memory limits (we assume requests=limits in the deployment file)
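
As an illustration of the cost function used as the goal of this study, the sketch below weighs one CPU core as three cost units and one GiB of memory as one cost unit. It assumes the CPU limit is expressed in millicores and the memory limit in bytes, which is inferred from the divisors in the formula; the function name is hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def deployment_cost(cpu_limit_millicores, memory_limit_bytes):
    """Cost score: 3 units per CPU core plus 1 unit per GiB of memory."""
    return (cpu_limit_millicores / 1000) * 3 + memory_limit_bytes / GIB


# Upper and lower bounds of the tuned domains, assuming memory values in MiB.
baseline = deployment_cost(1000, 1536 * MIB)
smallest = deployment_cost(300, 800 * MIB)
print(baseline, smallest)
```

Under these assumptions the baseline scores 4.5, and any configuration with lower limits scores less, provided that it still satisfies the constraints.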

Create a study.yaml manifest like the following:

name: frontend
system: frontend
workflow: frontend
requireApproval: true

goal:
  objective: minimize
  function:
    formula: (((container.container_cpu_limit/1000) * 3) + (container.container_memory_limit/(1024*1024*1024)))
  constraints:
    absolute:
      - name: Response Time
        formula: webapp.requests_response_time <= 300
      - name: Error Rate
        formula: webapp.service_error_rate:max <= 0.05
      - name: Container CPU saturation
        formula: container.container_cpu_util:p95 < 0.8
      - name: Container memory saturation
        formula: container.container_memory_util:max < 0.7
      - name: Container out-of-memory kills
        formula: container.container_oom_kills_count == 0

parametersSelection:
  - name: container.cpu_limit
    domain: [300, 1000]
  - name: container.memory_limit
    domain: [800, 1536]

windowing:
  type: trim
  trim: [5m, 0m]
  task: observe

workloadsSelection:
  - name: webapp.requests_throughput

steps:
  - name: baseline
    type: baseline
    numberOfTrials: 48
    values:
      container.cpu_limit: 1000
      container.memory_limit: 1536

  - name: optimize
    type: optimize
    numberOfTrials: 48
    numberOfExperiments: 100
    numberOfInitExperiments: 0
    maxFailedExperiments: 50

Then run:

akamas create study study.yaml

You can now follow the live optimization progress and explore the results using the Akamas UI for Live optimizations.

Optimizing cost of a Java microservice on Kubernetes while preserving SLOs in production

In this example, you will use Akamas live optimization to minimize the cost of a Kubernetes deployment, while preserving application performance and reliability requirements.

Prerequisites

In this example, you need:

an Akamas instance
a Kubernetes cluster, with a deployment to be optimized
the kubectl command installed in the Akamas instance, configured to access the target Kubernetes cluster and with privileges to get and update the deployment configurations
a supported telemetry data source (e.g. Prometheus or Dynatrace) configured to collect metrics from the target Kubernetes cluster

Optimization setup

Optimization packs

This example leverages the following optimization packs:

System

The system represents the Kubernetes deployment to be optimized (let's call it "frontend"). You can create a system.yaml manifest like this:

name: frontend
description: Kubernetes frontend deployment

Create the new system resource:

akamas create system system.yaml

The system will then have two components:

A Kubernetes container component, which contains container-level metrics like CPU usage and parameters to be tuned like CPU limits
A Web Application component, which contains service-level metrics like throughput and response time

Kubernetes component

Create a component-container.yaml manifest like the following:

name: container
description: Kubernetes container, part of the frontend deployment
componentType: Kubernetes Container
properties:
  dynatrace:
    type: CONTAINER_GROUP_INSTANCE
    kubernetes:
      namespace: boutique
      containerName: server
      basePodName: frontend-*

Then run:

akamas create component component-container.yaml frontend

Now create a component-webapp.yaml manifest like the following:

name: webapp
description: The service related to the frontend deployment
componentType: Web Application
properties:
  dynatrace:
    id: <TELEMETRY_DYNATRACE_WEBAPP_ID>

Then run:

akamas create component component-webapp.yaml frontend

Workflow

The workflow in this example is composed of four main steps:

Update the Kubernetes deployment manifest with the parameters (CPU and memory limits) recommended by Akamas
Apply the new parameters (kubectl apply)
Wait for the rollout to complete
Sleep for 30 minutes (observation interval)

Create a workflow.yaml manifest like the following:

name: frontend
tasks:
  - name: configure
    operator: FileConfigurator
    arguments:
      source:
        hostname: mymachine
        username: user
        key: /home/user/.ssh/key
        path: frontend.yaml.templ
      target:
        hostname: mymachine
        username: user
        key: /home/user/.ssh/key
        path: frontend.yaml

  - name: apply
    operator: Executor
    arguments:
      timeout: 5m
      host:
        hostname: mymachine
        username: user
        key: /home/user/.ssh/key
      command: kubectl apply -f frontend.yaml

  - name: verify
    operator: Executor
    arguments:
      timeout: 5m
      host:
        hostname: mymachine
        username: user
        key: /home/user/.ssh/key
      command: kubectl rollout status --timeout=5m deployment/frontend -n boutique;

  - name: observe
    operator: Sleep
    arguments:
      seconds: 1800

Then run:

akamas create workflow workflow.yaml

Telemetry

Create a telemetry.yaml manifest like the following:

provider: Dynatrace
config:
  url: <YOUR_DYNATRACE_URL>
  token: <YOUR_DYNATRACE_TOKEN>
  pushEvents: false

Then run:

akamas create telemetry-instance telemetry.yaml frontend

Study

In this live optimization:

the goal is to reduce the cost of the Kubernetes deployment. In this example, the cost is based on the amount of CPU and memory limits (assuming requests = limits).
the approval mode is set to manual, and a new recommendation is generated daily
to avoid impacting application performance, constraints are specified on desired response times and error rates
to avoid impacting application reliability, constraints are specified on peak resource usage and out-of-memory kills
the parameters to be tuned are the container CPU and memory limits (we assume requests=limits in the deployment file)
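
To make the goal concrete, the sketch below evaluates the cost formula used in this study for two configurations. It assumes that the CPU limit is expressed in millicores and the memory limit in bytes, as the divisions in the formula suggest; the function name deployment_cost is hypothetical and not part of Akamas.

```shell
#!/bin/sh
# Hypothetical helper: evaluates the study cost formula
#   (cpu_limit / 1000) * 3 + memory_limit / (1024^3)
# Assumes cpu_limit in millicores and memory_limit in bytes.
deployment_cost() {
  awk -v cpu="$1" -v mem="$2" \
    'BEGIN { printf "%.2f\n", (cpu / 1000) * 3 + mem / (1024 * 1024 * 1024) }'
}

# Baseline values: 1000 millicores and 1536 MiB (converted to bytes)
deployment_cost 1000 $((1536 * 1024 * 1024))
# Lower bound of both domains: 300 millicores and 800 MiB
deployment_cost 300 $((800 * 1024 * 1024))
```

With these weights, one CPU core counts three times as much as one GiB of memory, so reducing the CPU limit has the larger effect on the goal.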

Create a study.yaml manifest like the following:

name: frontend
system: frontend
workflow: frontend
requireApproval: true

goal:
  objective: minimize
  function:
    formula: (((container.container_cpu_limit/1000) * 3) + (container.container_memory_limit/(1024*1024*1024)))
  constraints:
    absolute:
      - name: Response Time
        formula: webapp.requests_response_time <= 300
      - name: Error Rate
        formula: webapp.service_error_rate:max <= 0.05
      - name: Container CPU saturation
        formula: container.container_cpu_util:p95 < 0.8
      - name: Container memory saturation
        formula: container.container_memory_util:max < 0.7
      - name: Container out-of-memory kills
        formula: container.container_oom_kills_count == 0

parametersSelection:
  - name: container.cpu_limit
    domain: [300, 1000]
  - name: container.memory_limit
    domain: [800, 1536]

windowing:
  type: trim
  trim: [5m, 0m]
  task: observe

workloadsSelection:
  - name: webapp.requests_throughput

steps:
  - name: baseline
    type: baseline
    numberOfTrials: 48
    values:
      container.cpu_limit: 1000
      container.memory_limit: 1536

  - name: optimize
    type: optimize
    numberOfTrials: 48
    numberOfExperiments: 100
    numberOfInitExperiments: 0
    maxFailedExperiments: 50

Then run:

akamas create study study.yaml

You can now follow the live optimization progress and explore the results using the Akamas UI for Live optimizations.

Optimize cost of a Kubernetes deployment subject to Horizontal Pod Autoscaler

Prerequisites

an Akamas instance
a Kubernetes cluster, with a deployment to be optimized
a Horizontal Pod Autoscaler working on the desired deployment
a supported telemetry data source configured to collect metrics from the target Kubernetes cluster (see here for the full list)
a way to apply configuration changes recommended by Akamas to the target deployment and HPA. In this guide, Akamas interacts directly with the Kubernetes APIs via kubectl. You need a service account with permissions to update your deployment (see below for other integration options).

Optimization setup

In this guide, we assume the following setup:

the Kubernetes deployment to be optimized is called frontend (in the hipster-shop namespace)
in the deployment, there is a container named server, where the app runs
the HPA is called frontend-hpa
both Dynatrace and Prometheus are used as observability tools

Let's set up the Akamas optimization for this use case.

System

For this optimization, you need the following components to model the frontend tech stack:

The Kubernetes Workload, Container and Pod components, containing metrics like CPU used for the different objects and parameters to be tuned like CPU limits at the container levels (from the Kubernetes optimization pack)
An HPA component, which contains HPA parameters like the target CPU utilization
A Web Application component, which contains service-level metrics like throughput and response time of the microservice (from the Web Application optimization pack)

Let's start by creating the system, which represents the Kubernetes deployment to be optimized. To create it, write a system.yaml manifest like this:

name: frontend-2
description: The frontend Kubernetes deployment

Then run:

akamas create system system.yaml

Now create the three Kubernetes components. Create a workload.yaml manifest like the following:

name: workload_frontend
description: The frontend Kubernetes workload
componentType: Kubernetes Workload
properties:
  prometheus:
    namespace: hipster-shop
    deployment: frontend

Then create a container.yaml manifest like the following:

name: server
description: The server Kubernetes container
componentType: Kubernetes Container
properties:
  prometheus:
    namespace: hipster-shop
    pod: frontend.*
    container: server

And a pod.yaml manifest like the following:

name: pod_frontend
description: The frontend Kubernetes pod
componentType: Kubernetes Pod
properties:
  prometheus:
    namespace: hipster-shop
    pod: frontend.*

Now create the entities by running:

akamas create component workload.yaml frontend-2
akamas create component container.yaml frontend-2
akamas create component pod.yaml frontend-2

Now create an application.yaml manifest like the following:

name: web_application
description: The web application of frontend deployment
componentType: Web Application
properties:
  dynatrace:
    id: SERVICE-80258F7AA97F2E4D
  prometheus:
    namespace: hipster-shop
    pod: frontend.*
    container: server

Notice that the component includes properties that specify how the Dynatrace and Prometheus telemetry providers look up this application and its container in the Kubernetes cluster.

These properties are dependent upon the telemetry provider you are using. See the reference for the full list of supported providers and relative configurations.

Then run:

akamas create component application.yaml frontend-2

Finally, create an hpa.yaml manifest like the following:

name: frontend_hpa
description: The HPA for the frontend
componentType: HPA

The HPA component does not provide any metrics, so you do not need to specify any telemetry properties for it.

Note: this guide assumes that the HPA component type and its parameters have already been created in your Akamas instance.

Then run:

akamas create component hpa.yaml frontend-2

Workflow

To optimize a Kubernetes microservice in production, you need to create a workflow that defines how the new configuration recommended by Akamas will be deployed in production.

Let's explore the high-level tasks required in this scenario and the options you have to adapt it to your environment:

1) Update the Kubernetes deployment and HPA configurations

The first step is to update the Kubernetes deployment and HPA with the new configuration. This can be done in several ways depending on your environment and processes:

A simple option is to let Akamas directly update the Kubernetes entities leveraging the Kubernetes APIs via kubectl commands.
Another option is to follow an Infrastructure-as-code approach, where the configuration change is managed via pull requests to a Git repository, leveraging your pipelines to deploy the change in production.

In this guide, we take the first option and use the kubectl patch and kubectl apply commands to configure the new deployment and the HPA.
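
As a rough sketch of this option, the script below builds a strategic merge patch for the container resources and shows the kubectl patch call that would apply it. The helper name build_resources_patch, the DRY_RUN switch, and the sample values are assumptions made for illustration; in the workflow, the actual values are filled in when Akamas renders the script template.

```shell
#!/bin/sh
# Hypothetical helper: prints a strategic merge patch that sets CPU and
# memory resources for one container of a deployment.
# Arguments: container name, CPU request, CPU limit, memory limit.
build_resources_patch() {
  printf '{"spec":{"template":{"spec":{"containers":[{"name":"%s","resources":{"requests":{"cpu":"%s"},"limits":{"cpu":"%s","memory":"%s"}}}]}}}}\n' \
    "$1" "$2" "$3" "$4"
}

NAMESPACE="${1:-hipster-shop}"
DEPLOYMENT="${2:-frontend}"
PATCH="$(build_resources_patch server 200m 400m 128Mi)"

# DRY_RUN defaults to 1, so the sketch only prints the command.
if [ "${DRY_RUN:-1}" = "1" ]; then
  echo "kubectl patch deployment $DEPLOYMENT -n $NAMESPACE --patch '$PATCH'"
else
  kubectl patch deployment "$DEPLOYMENT" -n "$NAMESPACE" --patch "$PATCH"
fi
```

Patching only the resources of the named container leaves the rest of the deployment specification untouched, which keeps the change small and easy to roll back.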

2) Wait for the new deployment to be rolled out in production

This task can be done in several ways depending on how you manage changes, as discussed in the previous task:

A simple option is to use the kubectl rollout command to wait for the deployment rollout completion. This is the approach used in this guide.
Another option is to follow an Infrastructure-as-code approach, where a change is managed via pull requests to a Git repository, leveraging your pipelines to deploy in production. In this situation, the deployment process is executed externally and is not controlled by Akamas. Hence, the workflow task will periodically poll the Kubernetes deployment to recognize when the new deployment has landed in production.
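
When the rollout is handled by an external pipeline, the polling task could look like the following sketch. The function name wait_for_value and its arguments are hypothetical; the idea is to repeatedly run a command that reads the current value from the cluster and to stop when it matches the value recommended by Akamas.

```shell
#!/bin/sh
# Hypothetical helper: polls a command until its output equals the expected
# value, or gives up after a maximum number of attempts.
# Arguments: expected value, max attempts, seconds between attempts, command...
wait_for_value() {
  expected="$1"; attempts="$2"; interval="$3"
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    current="$("$@" 2>/dev/null)"
    if [ "$current" = "$expected" ]; then
      echo "configuration detected after $i attempt(s)"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out waiting for $expected" >&2
  return 1
}

# Example of a command that reads the CPU request of the first container
# from the live deployment (not executed here):
#   wait_for_value 200m 60 30 \
#     kubectl get deployment frontend -n hipster-shop \
#     -o jsonpath='{.spec.template.spec.containers[0].resources.requests.cpu}'
```

Bounding the number of attempts lets the workflow task fail cleanly if the pipeline never deploys the change, instead of blocking the study indefinitely.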

See here for an example of an Infrastructure-as-code automation approach.

3) Wait for the appropriate time to start the experiment

When dealing with the HPA, it is important that Akamas always observes the same timeframe.

Although Akamas handles different workload patterns, it is better to run each experiment in the same time slot, so that each configuration is evaluated against a similar workload pattern.

In this example, we assume that a new configuration is evaluated every hour. Hence, the workflow includes a task that waits for the start of the next hour before the observation begins.

The appropriate schedule typically depends on the configuration process of your application.
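
The waiting logic can be sketched as a small shell function that computes how many seconds are left until the next full hour. The function name seconds_to_next_hour is hypothetical; the leading zero is stripped from the minute and second values because shell arithmetic would otherwise read values such as 08 as invalid octal numbers.

```shell
#!/bin/sh
# Hypothetical helper: seconds remaining until the start of the next hour.
# Arguments: minute (00-59) and second (00-59), as printed by date +%M and +%S.
seconds_to_next_hour() {
  minute="${1#0}"   # strip one leading zero: "08" -> "8", "00" -> "0"
  second="${2#0}"
  echo $((3600 - minute * 60 - second))
}

# Example: at 10:08:30 the next hour starts in 51 minutes and 30 seconds.
seconds_to_next_hour 08 30

# Usage in a wait step (not executed here):
#   sleep "$(seconds_to_next_hour "$(date +%M)" "$(date +%S)")"
```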

4) Observe how the application behaves with the new configuration

Since we decided to evaluate a configuration every hour, we use a 55-minute observation interval, leaving 5 minutes for the configuration process.

Let's now create a workflow.yaml manifest like the following:

name: frontend-11-delayedApproval-hpa-1hour-system2
tasks:
  - name: configure frontend
    operator: FileConfigurator
    arguments:
      source:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
        path: /work/examples/hipstershop-hpa/hipstershop-2/ak-frontend.sh.templ
      target:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
        path: /work/ak-frontend-2.sh

  - name: apply frontend
    operator: Executor
    arguments:
      timeout: 5m
      host:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
      command: sh /work/ak-frontend-2.sh hipster-shop frontend

  - name: verify frontend
    operator: Executor
    arguments:
      timeout: 5m
      host:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
      command: kubectl rollout status --timeout=5m deployment/frontend -n hipster-shop

  - name: configure hpa
    operator: FileConfigurator
    arguments:
      source:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
        path: /work/examples/hipstershop-hpa/hipstershop-2/frontend-hpa-v2.yaml.templ
      target:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
        path: /work/frontend-hpa-v2-2.yaml

  - name: apply hpa
    operator: Executor
    arguments:
      timeout: 5m
      host:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
      command: kubectl apply -f /work/frontend-hpa-v2-2.yaml -n hipster-shop

  - name: check if we are in time or wait for start of next hour
    operator: Executor
    arguments:
      host:
        hostname: toolbox
        username: akamas
        key: /home/stefano/tmp_ak_key
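      # %-M (GNU date) prints the minute without a leading zero, so that 08 and 09 are not read as invalid octal numbers
      command: if [ $(date +%-M) -lt 55 ]; then sleep $((60*(60 - $(date +%-M)))); else sleep 0; fi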

  - name: observe 55 minutes
    operator: Sleep
    arguments:
      seconds: 3300

Then run:

akamas create workflow workflow.yaml

Telemetry

To collect metrics of your target Kubernetes deployment, you create a telemetry instance based on your observability setup.

Create a dynatrace.yaml manifest like the following:

provider: Dynatrace
config:
  url: <YOUR_DYNATRACE_URL>
  token: <YOUR_DYNATRACE_TOKEN>
  pushEvents: false

Then run:

akamas create telemetry-instance dynatrace.yaml frontend-2

Create a prometheus.yaml manifest like the following:

provider: Prometheus
config:
  address: prom-kube-prometheus-stack-prometheus.monitoring
  port: 9090
  duration: 60
  logLevel: DETAILED
metrics:
  - metric: cost
    datasourceMetric: 'sum(kube_pod_container_resource_requests{resource="cpu" %FILTERS%})*29 + sum(kube_pod_container_resource_requests{resource="memory" %FILTERS%})/1024/1024/1024*3.2'

Then run:

akamas create telemetry-instance prometheus.yaml frontend-2
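
To see what the custom cost metric evaluates to, the sketch below reproduces the arithmetic of the query outside Prometheus. The replica count and per-pod requests are illustrative values and the function name query_cost is hypothetical; kube_pod_container_resource_requests reports CPU in cores and memory in bytes.

```shell
#!/bin/sh
# Hypothetical helper mirroring the arithmetic of the cost query:
#   sum(cpu requests in cores) * 29 + sum(memory requests in GiB) * 3.2
# Arguments: replicas, CPU request per pod (cores), memory request per pod (bytes).
query_cost() {
  awk -v n="$1" -v cpu="$2" -v mem="$3" \
    'BEGIN { printf "%.2f\n", n * cpu * 29 + n * mem / 1024 / 1024 / 1024 * 3.2 }'
}

# Three replicas, each requesting 0.2 cores and 128 MiB of memory
query_cost 3 0.2 $((128 * 1024 * 1024))
```

Because the query sums requests over all matching pods, the metric grows with the number of replicas, so it captures the effect of HPA scaling decisions as well as the per-pod resource settings.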

Study

It's now time to create the Akamas study to achieve your optimization objectives.

Let's explore how the study is designed by going through the main concepts. The complete study manifest is available at the bottom.

Goal

There are different approaches to measuring the cost of Kubernetes deployments:

A simple approach is to consider that Kubernetes allocates infrastructure resources based on pod resource requests (CPU and memory). Hence, the cost of a deployment can be derived from the deployment aggregate CPU and memory requests. In this guide, we use this approach and define the study goal as the sum of CPU and memory requests of the container to be optimized.
Alternatively, the cost of a Kubernetes deployment can also be collected from external data sources that provide actual cost metrics like OpenCost. In this case, the study goal can be defined by leveraging the cost metric. See here for more information on how to integrate cost metrics.

Note that in this guide the cost metric is defined directly in the Prometheus telemetry instance, and the study goal simply references it.

Constraints

In this study:

to ensure application performance, constraints are specified on application response times and error rate
to ensure application reliability, constraints are specified on container peak CPU and memory utilization, and container out-of-memory kills

Parameters

To achieve cost-efficient and reliable microservices, Kubernetes container resources and HPA scaling options must be configured optimally and tuned jointly, as they are heavily interconnected.

To do that, the study includes the following parameters:

Kubernetes container: CPU and memory requests and limits
HPA target CPU utilization
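
One way to see this interplay is through the replica calculation documented for the Kubernetes HPA, desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), where CPU utilization is measured relative to the container CPU request. The sketch below evaluates it for two target values; the function name and sample numbers are illustrative, and the real controller also applies a tolerance and stabilization rules that are ignored here.

```shell
#!/bin/sh
# Illustrative helper: HPA replica formula without tolerance or stabilization.
# Arguments: current replicas, current CPU utilization (%), target utilization (%).
desired_replicas() {
  # Integer ceiling of (replicas * current / target)
  echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# 4 replicas running at 90% of their CPU request
desired_replicas 4 90 60   # target utilization 60%
desired_replicas 4 90 80   # target utilization 80%
```

Since utilization is relative to the CPU request, changing the request shifts the measured utilization for the same load, which is why the request and the HPA target are best tuned together.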

The study also includes parameter constraints to ensure that recommended configurations are safe and comply with best practices. In particular:

CPU limits must be at most 2x CPU requests, to avoid excessive over-commitment of CPU limits in the cluster.
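
As an illustration, the following sketch checks candidate configurations against the two CPU constraints used in the study manifest: the request must not exceed the limit, and the limit must be at most twice the request. The function name cpu_config_is_valid is hypothetical and the values are in millicores.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when cpu_request <= cpu_limit <= 2 * cpu_request.
# Arguments: CPU request and CPU limit, both in millicores.
cpu_config_is_valid() {
  [ "$1" -le "$2" ] && [ "$2" -le $(( $1 * 2 )) ]
}

for config in "200 400" "200 500" "300 250"; do
  # Intentional word splitting: each config holds "request limit"
  set -- $config
  if cpu_config_is_valid "$1" "$2"; then
    echo "request=$1 limit=$2: valid"
  else
    echo "request=$1 limit=$2: rejected"
  fi
done
```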

Workload

For Kubernetes microservices, the workload is typically the throughput (requests/sec) of the microservice API endpoints. This is the approach used in this guide.

Approval mode

In this live optimization, manual approval is disabled: as soon as a new configuration is generated, the workflow is executed without any human involvement.

You can now create a study.yaml manifest like the following:

name: ak-frontend - live - system 2
system: frontend-2
workflow: frontend-11-delayedApproval-hpa-1hour-system2

goal:
  name: Cost
  objective: minimize
  function:
    formula: web_application.cost
  constraints:
    absolute:
      - name: Application response time degradation
        formula: web_application.requests_response_time_p50:p90 <= 60
      - name: Application error rate degradation
        formula: web_application.requests_error_rate:p90 <= 0.02
      - name: Container CPU saturation
        formula: server.container_cpu_util_max:p90 < 0.8
      - name: Container memory saturation
        formula: server.container_memory_used:max / server.container_memory_limit < 0.7

windowing:
  type: trim
  trim: [1m, 1m]
  task: observe 55 minutes

parametersSelection:
  - name: server.cpu_request
    domain: [10, 500]
  - name: server.cpu_limit
    domain: [10, 500]
  - name: server.memory_limit
    domain: [16, 640]
  - name: frontend_hpa.metrics_resource_target_averageUtilization
    domain: [10, 90]

parameterConstraints:
  - name: CPU request less or equal to limits
    formula: server.cpu_request <= server.cpu_limit
  - name: CPU limit within a given factor of request
    formula: server.cpu_limit <= server.cpu_request * 2

workloadsSelection:
  - name: web_application.requests_throughput:max
  - name: web_application.requests_throughput

numberOfTrials: 1
steps:
  - name: baseline
    type: baseline
    numberOfTrials: 3
    values:
      server.cpu_request: 200
      server.cpu_limit: 400
      server.memory_limit: 128
      frontend_hpa.metrics_resource_target_averageUtilization: 60
    renderParameters: [frontend_hpa.metrics_resource_target_averageUtilization]

  - name: optimize
    type: optimize
    numberOfExperiments: 300

Then run:

akamas create study study.yaml

You can now follow the live optimization progress and explore the results using the Akamas UI.