1 of 18

Preparing optimization studies

Preparing an optimization study requires several steps, as illustrated by the following figure:

and described in the following sections:

Notice that while these steps apply to both offline optimization studies and live optimization studies, some of these steps are different depending on which optimization is being prepared.

Modeling systems

The very first preparatory step is to model the system representing an application or a service that needs to be optimized (also known as the optimization target).

Modeling a system translates into identifying the components representing the key technology elements to be included in the optimization. Each component is associated with a set of tunable parameters, i.e. configurable properties that impact the performance, efficiency, or reliability of the system, and with a set of metrics, i.e. measurable properties that are used to evaluate the performance, efficiency, or reliability of the system. Typically, key system components are identified by considering which elements and their parameters need to be tuned.

The following figure shows a system corresponding to a Java-based application, where the Java Virtual Machine (JVM) and Kubernetes containers have been identified as key components.

As shown in this figure, a supported component is the "web application", representing the end user perspective of the modeled system (e.g. response time). As expected, this component type only provides measured metrics and no tunable parameters.

Akamas provides several out-of-the-box component types to support system and component modeling. Moreover, it is also possible to define new component types to model other components (see Modeling components).

The System template section of the reference guide describes the template required to define a system, while the commands for creating a system are listed on the Resource Management command page.

Best Practices

Properly modeling the application or service to be optimized by identifying the components and their parameters to tune is the first important step in the optimization process. Some best practices are described here below.

Modeling only relevant components

When defining the system and its components, it is convenient to focus only on those components that are either providing tunable parameters or key metrics (or KPIs).

Key metrics are those used to:

define the optimization goal and constraints, either as metrics that are expected to be improved by the optimization or as metrics representing constraints. For example, a typical goal is to optimize the application throughput. In this case, a Web Application component should include service metrics such as transaction throughput or transaction response time.
support the analysis of the optimization results, as metrics that are useful to measure the impact of parameter tuning on the performance, efficiency, or reliability of the system. For example, a Linux OS component could be used to assess the impact of the optimization on the system-level metrics such as CPU utilization.

Please note that the metrics used to define the optimization goal and constraints are mandatory as they are used by the Akamas AI engine to validate and score each tested configuration against the goal. Other metrics that are not related to the optimization goal and constraints can be considered optional from a pure optimization implementation perspective.

When defining the optimization study, it is always possible to select which parameters and metrics to consider, thus which components are modeled in the system. Therefore, a system could be modeled by all components that at some point are going to be optimized, even if not used in the current optimization study. However, the recommended approach is to model the system only with components whose parameters (and relevant metrics) are to be tuned by the current study.

Reusing systems whenever possible

Whenever possible, it is recommended to model systems and their components by considering how these could be reused for multiple optimization studies in different contexts.

For example, it might be useful to create a simple system containing only one component (e.g. the JVM) for a first optimization study. A new system might then be created to include other components (e.g. the application server) for more advanced optimization studies.

Please, also notice that systems (and other Akamas artifacts) can be shared with different teams thanks to the definition of Akamas workspace.

Modeling systems with horizontal scalability

A typical optimization target is a cluster, i.e. a system made of multiple instances that provide horizontal scalability (e.g. a Kubernetes deployment with several replicas). In this scenario, all the instances are supposed to be identical both from a code and configuration perspective. In this scenario, the recommended approach is to create only one component that represents a generic instance of the cluster. This way, all the instances will be tuned in exactly the same way.

In this scenario, the associated automation workflow needs to be configured to ensure that each configuration is applied to the whole cluster, by propagating the parameter configuration to all of the cluster instances, not just to a single instance represented by the modeled component whose metrics are collected and used to evaluate the overall cluster behavior under that configuration.

Notice that in order for this approach to work correctly, it is also important to verify that the cluster is correctly monitored by the telemetry providers. Depending on the telemetry technology in use, the clustered system may be presented as either a single entity, with aggregated metrics (e.g. a Kubernetes deployment with the total CPU usage of all the replica pods), or as multiple entities, each corresponding to the different instances in the cluster:

in case aggregated metrics are provided by the telemetry provider for the cluster, these metrics can be simply assigned to the component modeling the whole cluster;
in case only instance-level metrics are made available by the telemetry provider, telemetry instances need to be configured in Akamas so as to aggregate the metrics of the cluster instances (e.g. averaging CPU utilization, summing memory usage, etc.), depending on how each specific metric is expected to be used in the goal and constraints or in the study results.

Modeling components

After identifying the components that are required to model a system, the following step is to model each identified key component.

Akamas provides the corresponding for their specific technology (and possibly version) and describing all the tunable parameters and metrics of interest. The full list of Akamas optimization packs is available on the o page of the Akamas reference guide.

The section of the reference guide describes the template required to define a system component, while the commands for creating a system component are listed on the page.

While the optimization process does not necessarily require component types and optimization packs to be defined, it is recommended to leverage this construct to facilitate modularization and reuse.

This is possible as the Akamas optimization pack model is extensible: custom optimization packs can be easily created without any programming to allow Akamas optimization capabilities to be applied to virtually any technology.

Creating custom optimization packs

To create a custom optimization pack, the following fixed directory structure and several YAML manifests need to be created.

Optimization pack directory structure

my_dir
|_ optimizationPack.yaml
|_ component-types
|  |_ componentType1.yaml
|
|_ metrics
|  |_ metricsGroup1.yaml
|
|_ parameters
|  |_ parametersGroup1.yaml
|
|_ telemetry-providers
|_ provider1.yaml

Optimization pack manifest

The optimizationPack.yaml file is the manifest of the optimization pack to be created, which should always be named optimizationPack and have the following structure:

name: Java_8_Optimization_Pack
description: An optimization pack for the Java Hotspot JVM version 8
weight: 1
version: 1.0.0
tags:
- java
- jvm

where:

Component types

The component-types directory should contain the manifests of the component types to be included in the optimization pack. No particular naming constraint is enforced on those manifests.

See Component Types template for details on the structure of those manifests.

Metrics

The metrics directory should contain the manifests of the groups of metrics to be included in the optimization pack. No particular naming constraint is enforced on those manifests.

See Metric template for details on the structure of those manifests.

Parameters

The parameters directory should contain the manifests of the groups of parameters to be included in the optimization pack. No particular naming constraint is enforced on those manifests.

See Parameter template for details on the structure of those manifests.

Telemetry providers

The telemetry-providers directory should contain the manifests of the groups of parameters to be included in the optimization pack. No particular naming is enforced on those manifests.

See Telemetry Provider template for details on the structure of those manifests.

Building optimization pack descriptor

The following command need to be executed in order to produce the final JSON descriptor:

akamas build optimization-pack PATH_TO_THE_DIRECTORY

After this, the optimization pack can be installed (and then used) as described on the Managing optimization packs page.

Managing optimization packs

Whether out-of-the-box or custom, before being used optimization packs need to be installed on an Akamas installation before being used.

Since optimization packs are global resources that are shared across all the workspaces on the same Akamas installation, an account with administrative privileges is required to manage them.

Optimization packs that are not yet installed are displayed as grayed out in the Akamas UI (this is the case for the AWS and Docker packs in the following figure).

The content of the store can be also inspected from the store container on the Akamas server:

docker exec -it store bash
root@ff16b64e84db:/store# cd /store_data/
root@ff16b64e84db:/store_data# ls -l
total 120
-rw-r--r--. 1 root  root   3136 Jul 21 07:11 Docker_1-1-0.json
-rw-r--r--. 1 root  root  19130 Jul 21 07:11 Java-OpenJDK_1-2-6.json
-rw-r--r--. 1 root  root  70391 Jul 21 07:11 Linux_1-2-1.json
-rw-r--r--. 1 root  root   4661 Jul 21 07:11 MongoDB_1-0-0.json
-rw-r--r--. 1 root  root   6432 Jul 21 07:11 MySQL_1-2-0.json
-rw-r--r--. 1 39073 11537  5633 Sep  7 11:31 web-application_1-1-0.json

which also provides the list of the associated JSON file (the optimization pack descriptor).

Downloading descriptors

An Akamas installation comes with the latest optimization packs already loaded in the store and is able to check the central repository for updates.

The following command describes how to download the file descriptor https://akamas.s3.us-east-2.amazonaws.com/optimization-packs/Linux/1.3.0/Linux_1-3-0.json related to the version 1.3.0 of the Linux optimization pack:

curl -O https://akamas.s3.us-east-2.amazonaws.com/optimization-packs/Linux/1.3.0/Linux_1-3-0.json akamas install optimization-pack Linux_1-3-0.json

Installing

There are two ways of installing an optimization pack:

online installation - this is the general case when the optimization pack is already in the store
offline installation - this may apply to custom optimization packs available as a JSON file (refer to the Creating custom optimization pack page)

Only in the first case, an optimization pack can be installed from the UI. See here below the command line commands to get an optimization pack installed.

Online installation

Execute the following command by specifying the name of the optimization pack that is already available in the store:

akamas install optimization-pack OPTIMIZATION_PACK_NAME

Offline installation

Execute the following command to install an optimization pack by specifying the name of the optimization pack and the full path to the JSON descriptor file:

akamas install optimization-pack PATH_TO_JSON_DESCRIPTOR

Forcing installation

When installing an optimization pack, the following checks are executed to identify potential clashes with already existing resources:

name of the optimization pack
metrics
parameters
component types
telemetry providers

In case one of those checks is positive (i.e. a clash exists), the installation failed and a message notifies that a "force" option needs to be used to get the optimization pack installed anyway

akamas install -f optimization-pack OPTIMIZATION_PACK_NAME

Please be aware that when forcing the installation of an optimization pack, Akamas replaces (or merges) all the conflicting resources, except that if there is at least one custom resource, the installation is stopped. In this case, the custom resource needs to be manually removed first in order to proceed.

Uninstalling

The following command uninstalls an optimization pack

akamas uninstall --force OPTIMIZATION_PACK_NAME

Notice that this also deletes all the components built using that optimization pack.

Updating

In case a new optimization pack needs to be installed from a descriptor, the procedure is the following:

uninstall the optimization pack
remove the old version of the optimization pack descriptor file from the store container;
install the new optimization pack with the new JSON descriptor

Creating telemetry instances

After modeling the system and its components, the following step (see the following figure) is to ensure that all the metrics that are required to define goals and constraints and analyze the behavior of the target system can be collected from one of the available data sources available in the environment, that in Akamas are called .

Akamas provides a number of out-of-the-box telemetry providers, including industry-standard monitoring platforms (e.g. Prometheus or Dynatrace), performance testing tools (e.g. LoadRunner or JMeter), or simple CSV files. The section lists all the out-of-the-box telemetry providers and how to get them integrated by Akamas, while the section describes the mapping of the specific data source metrics to Akamas metrics).

Since several instances of a data source type might be available, the specific data source instance needs to be specified, that is a corresponding needs to be defined for the modeled system and its components.

Telemetry Providers are shared across all the workspace in the same Akamas installation and require an account with administrative privileges to manage them. Any number of telemetry instances (even of the same type) can be specified. For example, the following figure shows two Prometheus telemetry instances associated with the Adservice system.

Best Practices

The following sections provide guidelines on how to create telemetry instances.

Verify metrics provided by the telemetry provider

A seemingly obvious, yet fundamental, best practice when choosing a telemetry provider is to check whether the required metrics:

are supported by the original data source or can be added (e.g. as it is in the case of Prometheus)
are available and can be effectively gathered in the specific implementation
are supported by the telemetry provider itself or whether it needs to be extended (this is the case for a Prometheus telemetry provider ) as in the case of custom metrics such as those made available by the application itself

Creating automation workflows

After modeling the system and its components and ensuring that appropriate telemetry instances are defined, the following step (see the following figure) is to define a workflow.

A workflow automates all the tasks to be executed in sequence (see the following figure) during the optimization study, in particular those leveraging integrations with external entities, such as telemetry providers or configuration management tools. Akamas provides a number of general-purpose and specialized workflow operators (see Workflow Operator page).

The Workflow template section of the reference guide describes the template required to define a workflow, while the commands for creating a workflow are listed on the Resource Management command page.

Since a workflow is an Akamas resource defined at the workspace level and that can be used by multiple studies, it might be the case that a convenient workflow is already available or can be used to create a new workflow for the specific target system and integrations, by adding/removing some workflow tasks, changing the task sequence or the values assigned to task parameters.

Notice that since the structure of workflows defined for a live optimization study and for an offline optimization study are very different, these cases are described by a specific page:

creating workflows for offline optimization studies
creating workflows for live optimization studies

Creating workflows for offline studies

A workflow for an offline optimization study automates all the actions required to interface the configuration management and load testing tools (see the following figure) at each experiment or trial. Notice that metrics collection is an implicit action that does not need to be coded as part of the workflow.

More in detail, a typical workflow includes the following types of tasks:

Preparing the application, by executing all cleaning or reset actions that are required to prepare the load testing phase and ensuring that each experiment is executed under exactly the same conditions - for example, this may involve cleaning caches, uploading test data, etc
Applying the configuration, by preparing and then applying the parameter configuration under test to the target environment - this may require interfacing configuration management tools or pushing configuration to a repository, restarting the entire application or some of its components to ensure that some parameters are effectively applied, and then checking that after restarting the application is up & running before the workflow execution continues, and checking whether the configuration has been correctly applied
Applying the workload, by launching a load test to assess the behavior of the system under the applied configuration and synthetic workload defined in the load testing scenarios - of course, a preliminary step is to design a load testing scenario and synthetic workload that ensures that optimized configurations resulting from the offline optimization can be applied to the target system under the real or expected workload

The Optimization examples section provides some examples of how to define workload for a specific technology. In a complex application, a workflow may include multiple actions of the same type, each operating on separate components of the target system. The Knowledge base guide provides some real-world examples of how to create workflows and optimization studies.

Failing workflows

A workflow interrupts in case any of its steps does. A failing workflow causes the experiment or trial to fail. This should be considered as a different situation than a specific configuration not matching optimization constraints or causing the system under test to fail to run. For example, if the amount of max memory configured was too low, the application may fail to start.

When an experiment fails, the Akamas AI engine takes this information into account and thus learns that that parameter configuration was bad. This way, the AI engine automatically tries to avoid the regions of the parameter space which can lead to low scores or failures.

This explains why it is important to build robust workflows that ensure experiments only fail in case bad configurations are tested. See the specific entry Building robust workflow in the best practices section below.

Best Practices

Creating effective workflows is essential to ensure that Akamas can automatically identify the optimal configuration in a reliable and efficient way. Some best practices on how to build robust workflows are described here below.

Some additional best practices related to the design and implementation of load testing are described in the Performing load testing to support optimization activities page.

Reusing workload as much as possible

Since Akamas workflows are first-class entities that can be used by multiple studies, it might be useful to avoid creating (and maintaining) multiple workflows and instead define workflows that can be easily reused, by factoring all differences into specific action parameters.

Of course, this general guideline should be balanced with respect to other requirements, such as avoiding potential conflicts due to different teams modifying the same workload for different uses and potentially impacting optimization results.

Building robust workflows

Akamas takes into account the exit code of each of the workflow tasks, and the whole workflow fails if a task exits with an error. Therefore, the best practice is to make use of exit codes in each task, to ensure that task failures can only happen in case of bad parameter configuration.

For example, it is important to always check that the application has correctly started and is up and running (after a new configuration has been applied). This can be done by:

including a workflow task that tests the application is up and running after the tasks where the configuration is applied;
making sure that this task exits with an error in case the application has not correctly started (typically after a timeout).

Another example is when the underlying environment incurs issues during the optimization (e.g. a database might be mistakenly shut down by another team). As much as possible, all these environmental transient issues should be carefully avoided. Akamas also provides the ability to execute multiple task retries (default is twice, configurable) to compensate for these transient issues, provided they only last for a short time (the retry time and delay are also configurable).

Building workflows that ensure reproducible experiments

As for any other performance evaluation activity, Akamas experiments should be designed to be reproducible: if the same experiment (hence, the same parameter configuration) is executed multiple times (i.e. in multiple trials), the same performance results should be found for each trial.

Therefore, it is fundamental that workflows include all the necessary tasks to realize reproducible experiments. Particular care needs to be taken to correctly manage the system state across the experiments and trials. System state can include:

Application caches
Operating system cache and buffers (e.g. Linux filesystem page cache)
Database tables that fill up during the optimization process

All experiments should always start with a clean and well-known state. If the state is not properly managed, it may happen that the performance of the system is observed to change (whether higher or lower) not because of the effect of the applied parameters, but due to other effects (e.g. warming of caches).

Best practices to consistently manage system state across experiments include:

Restoring the system state at the beginning of each experiment - this may involve restarting the application, clearing caches, restoring DB tables, etc;
Allowing for a sufficient warm-up period in the performance tests, so to ensure application performance has reached stability. See also the recommended best practices about properly managing warm-up periods in the following section about creating an optimization study.

Another common cause that can impact the reproducibility of experiments is an unstable infrastructure or environment. Therefore, it is important to ensure that the underlying infrastructure is stable and that no other workload that might impact the optimization results is running on it. For example, beware of scheduled system jobs (e.g. backups), automatic software updates or anti-virus systems that might not explicitly be considered as part of the environment but that may unexpectedly alter its performance behavior.

Taking into account workflow duration

When designing workflows, it is important to take into account the potential duration of their tasks. Indeed, the task duration impacts the duration of the overall optimization and might impact the ability to execute a sufficient number of experiments within the overall time interval or specific time windows allowed for the optimization study.

Typically, the longest task in a workflow is the one related to applying workload (e.g. launching a load test or a batch job): such tasks can last for dozens of minutes if not hours. However, a workflow may also include other ancillary tasks that may provide nontrivial contributions to the task durations (e.g. checking the status to ensure that the application is up & running).

Making workflows fail fast

As general guidance, it is better to fail fast by performing quick checks executed as early as possible. For example, it is better to do a status check before launching a load test instead of possibly waiting for it to complete (maybe after 1h) just to discover that the application did not even start.

Performing load testing to support optimization activities

This page provides a short compendium of general performance engineering best practices to be applied in any load testing exercise. The focus is on how to ensure that realistic performance tests are designed and implemented to be successfully leveraged for optimization initiatives.

The goal of ensuring realistic performance tests boils down to two aspects:

sound test environments;
realistic workloads.

Test environments

A test o the pre-production environment (Test Env from now on) needs to represent as closely as possible the production environment (ProdEnv from now on).

The most representative test environment would be a perfect replica of the production environment from both infrastructure (hardware) and architecture perspectives. The following criteria and guidelines can help design a TestEnv that is suitable for performance testing supporting optimization initiatives.

Hardware specifications

The hardware specifications of the physical or virtual servers running in TestEnv and ProdEnv must be identical. This is because any differences in the available resources (e.g. amount of RAM) or specification (e.g. CPU vendor and/or type) may affect both services performance and system configuration.

This general guideline can only be relaxed for servers/clusters running container(s) or container orchestration platforms (e.g. Kubernetes or OpenShift). Indeed, it is possible to safely execute most of the related optimization cases if the TestEnv guarantees enough spare/residual capacity (number of cores or amount of RAM) to allocate all the needed resources.

While for monolithic architectures this may translate into significant HW requirements, with microservices this might not be the case, for two main reasons:

microservices are typically smaller than monoliths and designed for horizontal scalability: this means that optimizing the configuration of the single instance (pod/container resources and runtime settings) becomes easier as they typically have smaller HW requirements;
approaches like Infrastructure-as-code (IaaC), typically used with cloud-native applications, allow for easily setting up cluster infrastructure (on-prem or on the cloud) that can mimic production environments.

Downscaled/downsized architecture

Test Envs are typically downscaled/downsized with respect to Prod Envs. If this is the case, then optimizations can be safely executed provided it is possible to generate a "production-like" workload on each of the nodes/elements of the architecture.

This can be usually achieved if all the architectural layers have the same scale ratio between the two environments and the generated workload is scaled accordingly. For example, if the ProdEnvs has 4 nodes at the front-end layer, 4 at the backend layer, and 2 at the database layer, then a TestEnv can have 2 nodes, 2 nodes, and 1 node respectively.

Load balancing among nodes

From a performance testing perspective, the existence of a load balancing among multiple nodes can be ignored, if the load balancing relies on an external component that ensures a uniform distribution of the load across all nodes.

On the contrary, if an application-level balancing is in place, it might be required to include at least two nodes in the testing scenario so as to take into account the impact of such a mechanism on the performance of the cluster.

External/downstream services

The TestEnv should also replicate the application ecosystem, including dependencies from external or downstream services.

External or downstream services should emulate the production behavior from both functional (e.g. response size and error rate) and performance (e.g. throughput and response times) perspectives. In case of constraints or limitations on the ability to leverage external/downstream services for testing purposes, the production behavior needs to be simulated via stubs/mock services.

In the case of microservices applications, it is also required to replicate dependencies within an application. Several approaches can be taken for this purpose, such as:

replicating interacting microservices;
mocking these microservices and simulating realistic response times using simulation tools such as https://github.com/spectolabs/hoverfly;
disregarding dependencies with nonrelevant services (e.g. a post-processing service running on a mainframe whose messages are simply left published in a queue without being dequeued).

Test cases

The most representative performance test script would provide 100% coverage of all the possible test cases. Of course, this is very unlikely to be the case in performance testing. The following criteria and guidelines can be considered to establish the required test coverage.

Statistical relevance

The test cases included in the test script must cover at least 80% of the production workload.

Business relevance

The test cases included in the test script must cover all the business-critical functionalities that are known (or expected) to represent a significant load in the production environment

Technical relevance

The test cases included in the test script must cover all the functionalities that at the code level involve:

Large objects/data structure allocation and management
Long living objects/data structure allocation and management
Intensive CPU, data, or network utilization
"one of-a-kind" implementations, such as connections to a data source, ad-hoc objects allocation/management, etc.

Test user paths and behavior

The virtual user paths and behavior coded in the test script must be representative of the workload generated by production users. The most representative test script would account for the production users in terms of a mix of the different user paths, associated think times, and session length perspectives.

When single-user paths cannot be easily identified, the best practice is to consider each of them the most comprehensive user journey. In general, a worst-case approach is recommended.

The task of reproducing realistic workloads is easier for microservice architectures. On the contrary, for monolithic architectures, this task could become hard as it may not be easy to observe all of the workloads, due to custom frameworks, etc. With microservices, the workload can be completely decomposed in terms of APIs/endpoints and APM tools can provide full observability of production workload traffic and performance characteristics for each single API. This guarantees that the replicated workload can reproduce the production traffic as closely as possible.

Test data

Both test script data, that is datasets used in the test script, and test environment data, that is datasets in any involved databases/datastores, have to be characterized both in terms of size and variance so as to reproduce the production performances.

Test script data

The test script data has to be characterized in order to guarantee production-like performances (e.g. cache behavior). In case this characterization is difficult, the best practice is to adopt a worst-case approach.

Test environment data

The test data must be sized and have an adequate variance to guarantee production like performances in the interaction with databases/datastores (e.g. query response times).

Test scenarios

Most performance test tools provide the ability to easily define and modify the test scenarios on top of already defined test cases/scripts, test case-mix, and test data. This is especially useful in the Akamas context where it might be required to execute a specific test scenario, based on the specific optimization goal defined. The most common (and useful, in the Akamas context) test scenarios are described here below.

Load tests

A load test aims at measuring system performance against a specified workload level, typically the one experienced or expected in production. Usually, the workload level is defined in terms of virtual user concurrency or request throughput.

In the load test, after an initial ramp-up, the target load level is maintained constant for a steady state until the end of the test.

When validating a load test, the following two key factors have to be considered:

The steady-state concurrency/throughput level: a good practice is to apply a worst-case approach by emulating at least 110% of the production throughput;
The steady-state duration: in general defining the length for steady-state is a complex task because it is strictly dependent on the technologies under test and also because phenomena such as bootstraps, warm-ups, and caching can affect the performance and behavior of the system only before or after a certain amount of time; as a general guide to validate the steady-state duration, it is useful to:
1. execute a long-run test by keeping the defined steady-state for at least 2h to 3h;
2. analyze test results by looking for any variation in the performance and behavior of the system over time;
3. In case no variation is observed, shorten the defined same steady-state to at least 30+min.

Stress tests

A Stress test is all about pushing the system under test to its limit.

Stress tests are useful to identify the maximum throughput that an application can cope with while working within its SLOs. Identifying the breaking point of an application is also useful to highlight the bottleneck(s) of the application.

A stress test also makes it possible to understand how the system reacts to excessive load, thus validating the architectural expectations. For example, it can be useful to discover that the application crashes when reaching the limit, instead of simply enqueuing requests and slowing down processing them.

Endurance tests

An endurance test aims at validating the system's performance over an extended period of time.

Validating tests vs production

The first validation is provided by utilization metrics (e.g. CPU, RAM, I/O), which should closely display in the test environments the same behavior of production environments. If the delta is significant, some refinements of the test case and environment might be required to close the gap and gain confidence in the test results.

Creating workflows for live optimizations

A workflow for a live optimization study automates all the actions required to interface the configuration management. Notice that metrics collection is an implicit action that does not need to be coded as part of the workflow.

More in detail, a typical workflow includes the following types of tasks:

Applying the configuration, by preparing and then applying the parameter configuration that has been recommended and/or approved to the target environment - this may require interfacing configuration management tools or pushing configuration to a repository

Depending on the complexity of the system, the workflow might be composed by multiple actions of the same type, each operating on separate components of the target system.

As expected, with respect to workflows for offline optimization studies, there are no actions to apply synthetic workloads as part of a load-testing scenario.

Creating optimization studies

The final preparatory step before running a study is to actually create the study, which also requires several substeps.

Most of the substeps are common for both a live optimization study and an offline optimization study, even if they might need to be conceived differently in these two different contexts:

defining the optimization goal & constraints
defining the optimization parameters & metrics
defining the optimization steps

Other optional and mandatory steps are specific for offline optimization studies (see below) and live optimization studies (see below).

The Study template section of the reference guide describes the template for creating a study, while the commands for creating a study are on the Resource Management command page. For offline optimization studies only, the Akamas UI displays the "Create a study" button that provides a visual step-by-step procedure for creating a new optimization study (see the following figure).

Offline optimization studies

For offline optimization studies, there are some additional (optional) steps:

defining windowing policies (optional - typically after defining the goal & constraints)
defining KPIs (optional - typically after defining the goal & constraints)

Notice that Akamas also allows existing offline optimization studies to be duplicated either from the Akamas UI (see the following figure) or from the command line (refer to the Resource management commands page).

Live optimization studies

For live optimization studies, there are some additional steps - including a mandatory one:

defining workloads (mandatory - typically after defining the goal & constraints)
setting safety policies (optional - typically when defining the optimization steps)

Defining optimization goal & constraints

The first fundamental step in creating a study is to define the study goal & constraints. While this step might be perceived as somewhat straightforward (e.g. constraints could be simply translated from SLOs already in place), defining the optimization goal really requires carefully balancing complexity and effectiveness, also as part of the general (iterative) optimization process. Please also read the Best Practices section here below.

In general, any performance engineering, tuning, and optimization activity involves complex tradeoffs among different - and potentially conflicting - goals and system performance metrics, such as:

Maximizing the business volume an application can support, while not making the single transaction slower or increasing errors above a desired threshold
Minimizing the duration of a batch processing task, while not increasing the cloud costs by more than 20% or using more than 8 CPUs

Akamas support all these (and other) scenarios by means of the optimization goal, that is the single metric or the formula combining multiple metrics that have to be either minimized or maximized, and one or more constraints among metrics of the system.

In general, constraints can be defined as either absolute constraints (e.g. app.response_time < 200 ms) or as relative constraints with respect to a baseline (e.g. app_response_time < +20% of the baseline), that is the current configuration in place, typically corresponding to the very first experiment in an offline optimization study which. Therefore, relative constraints are only applicable to offline optimization studies, while absolute constraints are applicable to both absolute and relative constraints.

Please notice that when defining constraints for an optimization study, it is required to also include those constraints listed in the Constraints section of the respective Optimization Packs which express internal constraints among parameters. For example, in case OpenJDK 11 components are to be tuned, the reference section is Constraints.

The Goal & Constraint page of the Study template in the reference guide describes the corresponding structures. For offline optimization studies only, the Akamas UI allows the optimization goal and constraints to be defined as part of the visual procedure activated by the "Create a study" button (see the following figure).

Please notice that any experiment that does not respect the constraints is marked by Akamas as failed, even if correctly executed. The reason for this failure can be inspected in the experiment status. Similarly to workflow failures (see below), the Akamas AI engine automatically takes any failure due to constraint violations into account when searching the optimization space to identify the parameter configurations that might improve the goal metrics while matching constraints.

Best Practices

There are no general guidelines and best practices on how to best define goals & constraints, as this is where experience, knowledge, and processes meet.

Please refer to the section Optimization examples for a number of examples related to a variety of technologies and the Knowledge Base guide for real-world examples.

Defining windowing policies

For both offline and live optimization studies, it is possible to define how to identify the time windows that Akamas needs to consider for assessing the result of an experiment. Defining a windowing policy helps achieve reliable optimizations by excluding metrics data points that should not influence the score of an experiment.

The following two windowing policies are available:

Trim windowing: discards the initial and final part of an experiment - e.g. to exclude warm-up and tear-down phases - trim windowing policy is the default (with entire interval selection whether no trimming is specified)
Stability windowing: discard those parts that do not correspond to the most stable window - this leverages the Akamas features of automatically identifying the most stable window based on the user-specified specified criteria

The Windowing policy page of the Study template section in the reference guide describes the corresponding structures. For offline optimization studies only, the Akamas UI allows the windowing policies to be defined as part of the visual procedure activated by the "Create a study" button (see the following figures).

Best Practices

The following sections provide general best practices on how to define suitable windowing policy.

Define windowing based on the optimization goal

In order to make the optimization process fully automated and unattended, Akamas automatically analyzes the time series of the collected metrics of each experiment and calculates the experiment score (all the system metrics will also be aggregated).

Based on the optimization goal, it is important to instruct Akamas on how to perform this experiment analysis, in particular, by also leveraging Akamas windowing policies.

For example, when optimizing an online or transactional application, there are two common scenarios:

Increase system performance (i.e. minimize response time) or reduce system costs (i.e. decrease resource footprint or cloud costs) while processing a given and fixed transaction volume (i.e. a load test);
Increase the maximum throughput a system can support (i.e., system capacity) while processing an increasing amount of load (e.g. a stress test).

In the first scenario, a load test scenario is typically used: the injected load (e.g. virtual users) ramps up for a period, followed by a steady state, with a final ramp-down period. From a performance engineering standpoint, since the goal is to assess the system performance during the steady state, the warm-up and tear-down periods can be discarded. This analysis can be automated by applying a windowing policy of type "trim" upon creating the optimization study, which makes Akamas automatically compute the experiment score by discarding a configurable warm-up and tear-down period.

In the second scenario, a stress test is typically used: the injected load follows a ramp with increasing levels of users, designed to stress the system up to its limit. In this case, a performance engineer is most likely interested in the maximum throughput the system can sustain before breaking down (possibly while matching a response time constraint). This analysis can be automated by applying a windowing policy of type "stability", which makes Akamas automatically compute the experiment score in the time window where the throughput was maximized but stable for a configurable period of time.

When optimizing a batch application, windowing is typically not required. In such scenarios, a typical goal is to minimize batch duration or aggregate resource utilization. Hence, there is no need to define any windowing policy: by default, the whole experiment timeframe is considered.

Finding an effective stability window

Setting up an effective stability window requires some knowledge of the test scenario and the variability of the environment.

As a general guideline it is recommended to run a baseline study with a stability window set to a low value, such as a value close to 0 or half of the expected mean of the metric, and then to inspect the results of the baseline to identify which window has been identified and update the standard deviation threshold accordingly. When using a continuous ramp the test has no plateaus, so the standard deviation threshold should be a bit higher to account for the increment of the traffic in the windowing period. On the contrary, when running a staircase test with many plateaus, the standard deviation can be smaller to identify a period of time with the same amount of users.

Applying the standard deviation filter to very stable metrics, such as the number of users, simplifies the definition of the standard deviation threshold but might hide some instability of the environment when subject to constant traffic. On the other hand, applying the threshold to a more direct measure of the performance, such as the throughput, makes it easier to identify the stability period of the application but might require more baseline experiments to identify the proper threshold value. The logs of the scoring phase provide useful insights into the maximum standard deviation found and the number of candidate windows that have been identified given a threshold value, which can be used to refine the threshold in a few baseline experiments.

Defining KPIs

While the optimization goal drives the Akamas AI toward optimal configurations, there might be other sub-optimal configurations of interest in case they do not simply match the optimization constraints but might also improve on some Key Performance Indicators (KPIs).

For example:

for a Kubernetes microservice Java-based application, a typical optimization goal is to reduce the overall (infrastructure or cloud) cost by tuning both Kubernetes and JVM parameters while keeping SLOs in terms of application response time and error rate under control
among different configurations that provide similar cost reduction in addition to matching all SLOs, a configuration that would also significantly cause the application response time might be worth considering with respect to an optimal configuration that does not improve on this KPI

Akamas automatically considers any metric referred to in the defined optimization goal and constraints for an offline optimization study as a KPI. Moreover, any other metrics of the system component can be specified as a KPI for an offline optimization study.

The KPIs page of the Study template section in the reference guide describes how to define the corresponding structure. Specifying the KPIs can be done while first defining the study or from the Akamas UI, at either study creation time or afterward (see the following figures).

Once KPIs are defined, Akamas will represent the results of the optimization in the Insights section of the Akamas UI. Moreover, the corresponding suboptimal configuration associated with a specific KPI is highlighted in the Akamas UI by a textual badge "Best <KPI name>".

Please notice that KPIs can also be re-defined after an offline optimization study has been completed as their definition does not affect the optimization process, only the evaluation of its results. See the section Analyzing offline optimization studies and the Optimization Insights page.

Defining parameters & metrics

After defining the goal and its constraints, the following substep in creating an optimization study is specifying the optimization parameters and metrics. In particular, selecting the parameters that are going to be tuned to optimize the system is a critical decision that requires carefully balancing complexity and effectiveness. As for goals & constraints, also this step may require adopting an iterative approach. See also the Best Practices section here below.

The Parameter selection and Metric selection pages of the Study template section in the reference guide describe how to define the corresponding structure. For offline optimization studies only, the Akamas UI allows the parameters and metrics to be defined as part of the visual procedure activated by the "Create a study" button (see the following figure).

As illustrated by the previous and following figures, during this step is also possible to edit the range of values associated with each optimization parameter with respect to the default domain provided by either the original or custom optimization pack in use for the respective technology.

Please also refer to the Guidelines for choosing optimization parameters for a number of selected technologies. Some examples provided in the Knowledge Base guide may also provide useful guidance.

Parameter rendering

By default, all parameters specified in the parameters selection of a study are applied ("rendered"). Akamas allows specifying which configuration parameters should be applied in the optimization steps. More precisely:

parameter rendering is available at the step level for baseline, preset, and optimize steps
parameter rendering is not available for bootstrap steps (bootstrapped experiments are not executed)

This feature can be useful to deal with the different strategies through which applications and systems accept configuration parameters.

Please refer to the Parameter rendering page to see how to configure parameter rendering.

Best Practices

The following sections provide some best practices on how to best approach the step of defining optimization parameters. .

Configure parameters domains based on environment specs

Since the parameter domain defines the range of values that the Akamas AI engine can assign to the parameter, when defining the system parameters to be optimized, it is important to review the parameter domains and adjust them based on the system characteristics of the target system, environment and best practices in place.

Akamas optimization packs already provide parameter domains that are correct for most situations. For example, the OpenJDK 11 JVM gcType is a categorical parameter that already includes all the possible garbage collectors that can be set for this JVM version.

For other parameters, there are no sensible default domains as they depend on the environment. For example, the OpenJDK 11 maxHeapSize JVM parameter dictates how much memory the JVM can use. This obviously depends on the environment in which the JVM runs. For example, the upper bound might be 90% of the memory of the virtual machine or container in which the JVM runs.

Defining good parameter domains is important to ensure the parameter configurations suggested by the Akamas AI engine will be as good as possible. Notice that if the domain is not defined correctly, this may cause experiment failures (e.g. the JVM could not start if the maxHeapSize is higher than the container size). As discussed as part of the best practices for defining robust workflows, the Akamas AI engine has been designed to learn configurations that may lead to failures and to automatically discover any hidden constraints found in the environment.

Configure parameter constraints based on Optimization Pack best practices

Depending on the specific technology under optimization, the configuration parameters may have relationships among themselves. For example, in a JVM the newSize parameter defines the size of a region of the JVM heap, and hence its value should be always less than the maxHeapSize parameter.

Akamas AI engine supports the definition of constraints among parameters as this is a frequent need when optimizing real-life applications.

It is important to define the parameter constraints when creating a new study. The optimization pack documentation provides guidelines on what are the most important parameter constraints for the specific technology.

When optimizing a new or custom technology, it may happen that some experiments fail due to unknown parameter constraints being violated. For example, the application may fail to start and only by analyzing the application error logs, the reason for the failure can be understood. For a Java application, the JVM error message (e.g. "new size cannot be larger than max heap size") could provide useful hints. This would reveal that some constraints need to be added to the parameter constraints in the study.

While the Akamas AI engine has been designed to learn from failures, including those due to relationships among parameters that were not explicitly set as constraints, setting parameter constraints may help avoid unnecessary failures and thus speed up the optimization process.

Defining workloads

For a live optimization study, it is required to specify which component metrics represent the different workloads observed on the target system. A workload could be represented by either a metric directly measuring that workload, such as the application throughput, or a proxy metric, such as the percentage of reads and writes in your database.

The page of the section in the reference guide describes how to define the corresponding structure.

Akamas features automatic detection of workload contexts, corresponding to different patterns for the same workload. For example, workload context could correspond to the peak or idle load, or to the weekend or weekday traffic. This allows Akamas to recommend safe configurations based on the observed behavior of the system under similar workload conditions.

Akamas provides several parameters governing how the Akamas optimizer operates and leverages the workload information while a live optimization study is being executed. The most important parameter is the online mode (see ) as it related to whether the human user is part of the approval loop when the Akamas AI recommends a configuration to be applied.

Moreover, Akamas also provides customizable safety policies that drive the Akamas optimizer in evaluating candidate configurations with respect to defined goal constraints.

Online mode

Live optimizations can operate in one of the following online modes:

recommendation (or manual) mode (the default mode): Akamas does not immediately apply a configuration identified by Akamas AI: a new configuration is first recommended to the user, who needs to approve it, possibly after modifying it, before it gets applied - this is also referred to as human-in-the loop scenario;
fully autonomous (or automatic) mode: new configurations are immediately applied by Akamas as soon as they are generated by the Akamas AI, without being first recommended to (and approved by) the user.

It is worth noticing that under a recommendation mode, there might be a significant delay between the time a configuration is identified by Akamas and the time the recommended changes get applied. Therefore, the Akamas AI leverages the workload information differently when looking for a new configuration, depending on the defined online mode:

in the recommendation mode, Akamas takes into account all the defined workloads and looks for the configuration that best satisfies the goal constraints for all the observed workloads and provides the best improvements for all of them
in the fully autonomous mode, Akamas works on a single workload at each iteration (based on a customizable workload strategy - see below) and looks for an optimized configuration for that specific workload to be immediately applied in the next iteration, even if it might not be the best for the different workloads

The online mode can be specified at the study level and can also be overridden at the step level (only for steps of type "optimize" - see section ). The page of the section in the reference guide describes how to define the corresponding structure. This can be done either from the Akamas command line (see page ) or from the Akamas AI (see the following figure).

Notice that the online mode can be changed at any time, that is while the optimization study is running, to become immediately effective. For example, a live optimization could initially operate in recommendation mode and then be changed to fully autonomous mode afterward.

Defining optimization steps

A final step in defining an optimization study is to specify specifies the sequence of steps executed while running the study.

The following four types of steps are available:

Baseline: performs an experiment and sets it as a baseline for all the other ones
Bootstrap: imports experiments from other studies
Preset: performs an experiment with a specific configuration
Optimize: performs experiments and generates optimized configurations

Please notice that at least one baseline step is always required in any optimization study. This applies not only to offline optimization studies, but also to live optimization studies as it is being used to suggest changes to parameter values starting from the default values.

The Steps page in the Study template section in the reference guide describes how to define the corresponding structures for each of the different types of steps allowed by Akamas. For offline optimization studies only, the Akamas UI allows the optimization steps to be defined as part of the visual procedure activated by the "Create a study" button (see the following figure).

In addition to the best practices here below, please refer to the section Optimization examples for a number of examples related to a variety of technologies and the Knowledge Base guide for real-world examples.

Best Practices

The following sections provide some best practices on how to best approach the step of defining the baseline step.

Ensure the baseline configuration is correct

In an optimization study, the baseline is an important experiment as it represents the system performance with the current configuration, and serves as a reference to assess the relative improvements the optimization achieved.

Therefore, it is important to make sure the baseline configuration of the study correctly reflects the current configuration - be it the vendor default or the result of a manual tuning exercise.

Evaluate which parameters to include in the baseline configuration

When defining the study baseline configuration it is important to evaluate which parameters to include. Indeed, several technologies have default values assigned to most of their configuration parameters. However, the runtime behavior can be different depending on whether the parameter is set to the default value or it is not set at all.

Therefore, it is recommended to review the current configuration (e.g. the one in place in production) and identify which parameters and values have been set (e.g. JVM maxHeapSize = 2GB, gcType = Parallel, etc.), and then to only set those parameters with their corresponding values, without adding any other parameters. This ensures that the specified baseline is consistent with the real production setup.

Setting safety policies

While Akamas leverages similar AI methods for both live optimizations and optimization studies, the way these methods are applied is radically different. Indeed, for optimization studies running in pre-production environments, the approach is to explore the configuration space by also accepting potential failed experiments, to identify regions that do not correspond to viable configurations. Of course, this approach cannot be accepted for live optimization running in production environments. For this purpose, Akamas live optimization uses observations of configuration changes combined with the automatic detection of workload contexts and provides several customizable safety policies when recommending configurations to be approved, revisited, and applied.

Akamas provides a few customizable optimizer options (refer to the options described on the Optimize step page of the reference guide) that should be configured so as to make configurations recommended in live optimization and applied to production environments as safe as possible.

Exploration factor

Akamas provides an optimizer option known as the exploration factor that only allows gradual changes to the parameters. This gradual optimization allows Akamas to observe how these changes impact the system behavior before applying the following gradual changes.

By properly configuring the optimizer, Akamas can gradually explore regions of the configuration space and slowly approach any potentially risky regions, thus avoiding recommending any configurations that may negatively impact the system. Gradual optimization takes into account the maximum recommended change for each parameter. This is defined as a percentage (default is 5%) with respect to the baseline value. For example, in the case of a container whose CPU limit is 1000 millicores, the corresponding maximum allowed change is 50 millicores. It is important to notice that this does not represent an absolute cap, as Akamas also takes into account any good configurations observed. For example, in the event of a traffic peak, Akamas would recommend a good configuration that was observed working fine for a similar workload in the past, even if the change is higher than 5% of the current configuration value.

Notice that this feature would not work for categorical parameters (e.g. JVM GC Type) as their values do not change incrementally. Therefore, when it comes to these parameters, Akamas by default takes a conservative approach of only recommending configurations with categorical parameters taking already observed before values. This still allows some never observed values to be recommended as users are allowed to modify values also for categorical parameters when operating in human-in-the-loop mode. Once Akamas has observed that that specific configuration is working fine, the corresponding value can then be recommended. For example, a user might modify the recommended configuration for GC Type from Serial to Parallel. Once Parallel has been observed as working fine, Akamas would consider it for future recommendations of GC Type, while other values (e.g. G1) would not be considered until verified as safe recommendations.

The exploration factor can be customized for each live optimization individually and changed while live optimizations are running.

Safety factor

Akamas provides an optimizer option known as the safety factor designed to prevent Akamas from selecting configurations (even if slowly approaching them) that may impact the ability to match defined SLOs. For example, when optimizing container CPU limits, lower and lower CPU limits might be recommended, up to the point that the limit becomes too low that the application performance degrades.

Akamas takes into account the magnitude of constraint breaches: a severe breach is considered more negative than a minor breach. For example, in the case of an SLO of 200 ms on response time, a configuration causing a 1 sec response time is assigned a very different penalty than a configuration causing a 210 ms response time. Moreover, Akamas leverages the smart constraint evaluation feature that takes into account if a configuration is causing constraints to approach their corresponding thresholds. For example, in the case of an SLO of 200 ms on response time, a configuration changing response time from 170 ms to 190 ms is considered more problematic than one causing a change from 100 ms to 120 ms. The first one is considered by Akamas as corresponding to a gray area that should not be explored.

The safety factor can be customized for each live optimization individually and changed while live optimizations are running.

Outlier detection

It is also worth mentioning that Akamas also features an outlier detection capability to compensate for production environments typically being noisy and much less stable than staging environments, thus displaying highly fluctuating performance metrics. As a consequence, constraints may fail from time to time, even for perfectly good configurations. This may be due to a variety of causes, such as shared infrastructure on the cloud, slowness of external systems, etc.

Performing load testing to support optimization activities

The goal of ensuring realistic performance tests boils down to two aspects:

sound test environments;
realistic workloads.

Test environments

A test o the pre-production environment (Test Env from now on) needs to represent as closely as possible the production environment (ProdEnv from now on).

Hardware specifications

While for monolithic architectures this may translate into significant HW requirements, with microservices this might not be the case, for two main reasons:

microservices are typically smaller than monoliths and designed for horizontal scalability: this means that optimizing the configuration of the single instance (pod/container resources and runtime settings) becomes easier as they typically have smaller HW requirements;
approaches like Infrastructure-as-code (IaaC), typically used with cloud-native applications, allow for easily setting up cluster infrastructure (on-prem or on the cloud) that can mimic production environments.

Downscaled/downsized architecture

Load balancing among nodes

External/downstream services

The TestEnv should also replicate the application ecosystem, including dependencies from external or downstream services.

In the case of microservices applications, it is also required to replicate dependencies within an application. Several approaches can be taken for this purpose, such as:

replicating interacting microservices;
mocking these microservices and simulating realistic response times using simulation tools such as https://github.com/spectolabs/hoverfly;
disregarding dependencies with nonrelevant services (e.g. a post-processing service running on a mainframe whose messages are simply left published in a queue without being dequeued).

Test cases

Statistical relevance

The test cases included in the test script must cover at least 80% of the production workload.

Business relevance

The test cases included in the test script must cover all the business-critical functionalities that are known (or expected) to represent a significant load in the production environment

Technical relevance

The test cases included in the test script must cover all the functionalities that at the code level involve:

Large objects/data structure allocation and management
Long living objects/data structure allocation and management
Intensive CPU, data, or network utilization
"one of-a-kind" implementations, such as connections to a data source, ad-hoc objects allocation/management, etc.

Test user paths and behavior

When single-user paths cannot be easily identified, the best practice is to consider each of them the most comprehensive user journey. In general, a worst-case approach is recommended.

Test data

Test script data

Test environment data

The test data must be sized and have an adequate variance to guarantee production like performances in the interaction with databases/datastores (e.g. query response times).

Test scenarios

Load tests

In the load test, after an initial ramp-up, the target load level is maintained constant for a steady state until the end of the test.

When validating a load test, the following two key factors have to be considered:

The steady-state concurrency/throughput level: a good practice is to apply a worst-case approach by emulating at least 110% of the production throughput;
The steady-state duration: in general defining the length for steady-state is a complex task because it is strictly dependent on the technologies under test and also because phenomena such as bootstraps, warm-ups, and caching can affect the performance and behavior of the system only before or after a certain amount of time; as a general guide to validate the steady-state duration, it is useful to:
1. execute a long-run test by keeping the defined steady-state for at least 2h to 3h;
2. analyze test results by looking for any variation in the performance and behavior of the system over time;
3. In case no variation is observed, shorten the defined same steady-state to at least 30+min.

Stress tests

A Stress test is all about pushing the system under test to its limit.

Endurance tests

An endurance test aims at validating the system's performance over an extended period of time.