1 of 8

Creating optimization studies

The final preparatory step before running a study is to actually create the study, which also requires several substeps.

Most of the substeps are common for both a live optimization study and an offline optimization study, even if they might need to be conceived differently in these two different contexts:

defining the optimization goal & constraints
defining the optimization parameters & metrics
defining the optimization steps

Other optional and mandatory steps are specific for offline optimization studies (see below) and live optimization studies (see below).

The Study template section of the reference guide describes the template for creating a study, while the commands for creating a study are on the Resource Management command page. For offline optimization studies only, the Akamas UI displays the "Create a study" button that provides a visual step-by-step procedure for creating a new optimization study (see the following figure).

Offline optimization studies

For offline optimization studies, there are some additional (optional) steps:

defining windowing policies (optional - typically after defining the goal & constraints)
defining KPIs (optional - typically after defining the goal & constraints)

Notice that Akamas also allows existing offline optimization studies to be duplicated either from the Akamas UI (see the following figure) or from the command line (refer to the Resource management commands page).

Live optimization studies

For live optimization studies, there are some additional steps - including a mandatory one:

defining workloads (mandatory - typically after defining the goal & constraints)
setting safety policies (optional - typically when defining the optimization steps)

Defining optimization goal & constraints

The first fundamental step in creating a study is to define the study . While this step might be perceived as somewhat straightforward (e.g. constraints could be simply translated from SLOs already in place), defining the optimization goal really requires carefully balancing complexity and effectiveness, also as part of the general (iterative) optimization process. Please also read the section here below.

In general, any performance engineering, tuning, and optimization activity involves complex tradeoffs among different - and potentially conflicting - goals and system performance metrics, such as:

Maximizing the business volume an application can support, while not making the single transaction slower or increasing errors above a desired threshold
Minimizing the duration of a batch processing task, while not increasing the cloud costs by more than 20% or using more than 8 CPUs

Akamas support all these (and other) scenarios by means of the optimization goal, that is the single metric or the formula combining multiple metrics that have to be either minimized or maximized, and one or more constraints among metrics of the system.

In general, constraints can be defined as either absolute constraints (e.g. app.response_time < 200 ms) or as relative constraints with respect to a baseline (e.g. app_response_time < +20% of the baseline), that is the current configuration in place, typically corresponding to the very first experiment in an offline optimization study which. Therefore, relative constraints are only applicable to offline optimization studies, while absolute constraints are applicable to both absolute and relative constraints.

Please notice that when defining constraints for an optimization study, it is required to also include those constraints listed in the Constraints section of the respective Optimization Packs which express internal constraints among parameters. For example, in case OpenJDK 11 components are to be tuned, the reference section is .

The page of the in the reference guide describes the corresponding structures. For offline optimization studies only, the Akamas UI allows the optimization goal and constraints to be defined as part of the visual procedure activated by the "Create a study" button (see the following figure).

Please notice that any experiment that does not respect the constraints is marked by Akamas as failed, even if correctly executed. The reason for this failure can be inspected in the experiment status. Similarly to workflow failures (see below), the Akamas AI engine automatically takes any failure due to constraint violations into account when searching the optimization space to identify the parameter configurations that might improve the goal metrics while matching constraints.

Best Practices

There are no general guidelines and best practices on how to best define goals & constraints, as this is where experience, knowledge, and processes meet.

Defining windowing policies

For both offline and live optimization studies, it is possible to define how to identify the time windows that Akamas needs to consider for assessing the result of an experiment. Defining a windowing policy helps achieve reliable optimizations by excluding metrics data points that should not influence the score of an experiment.

The following two windowing policies are available:

Trim windowing: discards the initial and final part of an experiment - e.g. to exclude warm-up and tear-down phases - trim windowing policy is the default (with entire interval selection whether no trimming is specified)
Stability windowing: discard those parts that do not correspond to the most stable window - this leverages the Akamas features of automatically identifying the most stable window based on the user-specified specified criteria

The Windowing policy page of the Study template section in the reference guide describes the corresponding structures. For offline optimization studies only, the Akamas UI allows the windowing policies to be defined as part of the visual procedure activated by the "Create a study" button (see the following figures).

Best Practices

The following sections provide general best practices on how to define suitable windowing policy.

Define windowing based on the optimization goal

In order to make the optimization process fully automated and unattended, Akamas automatically analyzes the time series of the collected metrics of each experiment and calculates the experiment score (all the system metrics will also be aggregated).

Based on the optimization goal, it is important to instruct Akamas on how to perform this experiment analysis, in particular, by also leveraging Akamas windowing policies.

For example, when optimizing an online or transactional application, there are two common scenarios:

Increase system performance (i.e. minimize response time) or reduce system costs (i.e. decrease resource footprint or cloud costs) while processing a given and fixed transaction volume (i.e. a load test);
Increase the maximum throughput a system can support (i.e., system capacity) while processing an increasing amount of load (e.g. a stress test).

In the first scenario, a load test scenario is typically used: the injected load (e.g. virtual users) ramps up for a period, followed by a steady state, with a final ramp-down period. From a performance engineering standpoint, since the goal is to assess the system performance during the steady state, the warm-up and tear-down periods can be discarded. This analysis can be automated by applying a windowing policy of type "trim" upon creating the optimization study, which makes Akamas automatically compute the experiment score by discarding a configurable warm-up and tear-down period.

In the second scenario, a stress test is typically used: the injected load follows a ramp with increasing levels of users, designed to stress the system up to its limit. In this case, a performance engineer is most likely interested in the maximum throughput the system can sustain before breaking down (possibly while matching a response time constraint). This analysis can be automated by applying a windowing policy of type "stability", which makes Akamas automatically compute the experiment score in the time window where the throughput was maximized but stable for a configurable period of time.

When optimizing a batch application, windowing is typically not required. In such scenarios, a typical goal is to minimize batch duration or aggregate resource utilization. Hence, there is no need to define any windowing policy: by default, the whole experiment timeframe is considered.

Finding an effective stability window

Setting up an effective stability window requires some knowledge of the test scenario and the variability of the environment.

As a general guideline it is recommended to run a baseline study with a stability window set to a low value, such as a value close to 0 or half of the expected mean of the metric, and then to inspect the results of the baseline to identify which window has been identified and update the standard deviation threshold accordingly. When using a continuous ramp the test has no plateaus, so the standard deviation threshold should be a bit higher to account for the increment of the traffic in the windowing period. On the contrary, when running a staircase test with many plateaus, the standard deviation can be smaller to identify a period of time with the same amount of users.

Applying the standard deviation filter to very stable metrics, such as the number of users, simplifies the definition of the standard deviation threshold but might hide some instability of the environment when subject to constant traffic. On the other hand, applying the threshold to a more direct measure of the performance, such as the throughput, makes it easier to identify the stability period of the application but might require more baseline experiments to identify the proper threshold value. The logs of the scoring phase provide useful insights into the maximum standard deviation found and the number of candidate windows that have been identified given a threshold value, which can be used to refine the threshold in a few baseline experiments.

Defining KPIs

While the optimization goal drives the Akamas AI toward optimal configurations, there might be other sub-optimal configurations of interest in case they do not simply match the optimization constraints but might also improve on some Key Performance Indicators (KPIs).

For example:

for a Kubernetes microservice Java-based application, a typical optimization goal is to reduce the overall (infrastructure or cloud) cost by tuning both Kubernetes and JVM parameters while keeping SLOs in terms of application response time and error rate under control
among different configurations that provide similar cost reduction in addition to matching all SLOs, a configuration that would also significantly cause the application response time might be worth considering with respect to an optimal configuration that does not improve on this KPI

Akamas automatically considers any metric referred to in the defined optimization goal and constraints for an offline optimization study as a KPI. Moreover, any other metrics of the system component can be specified as a KPI for an offline optimization study.

The page of the section in the reference guide describes how to define the corresponding structure. Specifying the KPIs can be done while first defining the study or from the Akamas UI, at either study creation time or afterward (see the following figures).

Once KPIs are defined, Akamas will represent the results of the optimization in the Insights section of the Akamas UI. Moreover, the corresponding suboptimal configuration associated with a specific KPI is highlighted in the Akamas UI by a textual badge "Best <KPI name>".

Defining parameters & metrics

After defining the goal and its constraints, the following substep in creating an optimization study is specifying the optimization parameters and metrics. In particular, selecting the parameters that are going to be tuned to optimize the system is a critical decision that requires carefully balancing complexity and effectiveness. As for goals & constraints, also this step may require adopting an iterative approach. See also the Best Practices section here below.

The Parameter selection and Metric selection pages of the Study template section in the reference guide describe how to define the corresponding structure. For offline optimization studies only, the Akamas UI allows the parameters and metrics to be defined as part of the visual procedure activated by the "Create a study" button (see the following figure).

As illustrated by the previous and following figures, during this step is also possible to edit the range of values associated with each optimization parameter with respect to the default domain provided by either the original or custom optimization pack in use for the respective technology.

Please also refer to the Guidelines for choosing optimization parameters for a number of selected technologies. Some examples provided in the Knowledge Base guide may also provide useful guidance.

Parameter rendering

By default, all parameters specified in the parameters selection of a study are applied ("rendered"). Akamas allows specifying which configuration parameters should be applied in the optimization steps. More precisely:

parameter rendering is available at the step level for baseline, preset, and optimize steps
parameter rendering is not available for bootstrap steps (bootstrapped experiments are not executed)

This feature can be useful to deal with the different strategies through which applications and systems accept configuration parameters.

Please refer to the Parameter rendering page to see how to configure parameter rendering.

Best Practices

The following sections provide some best practices on how to best approach the step of defining optimization parameters. .

Configure parameters domains based on environment specs

Since the parameter domain defines the range of values that the Akamas AI engine can assign to the parameter, when defining the system parameters to be optimized, it is important to review the parameter domains and adjust them based on the system characteristics of the target system, environment and best practices in place.

Akamas optimization packs already provide parameter domains that are correct for most situations. For example, the OpenJDK 11 JVM gcType is a categorical parameter that already includes all the possible garbage collectors that can be set for this JVM version.

For other parameters, there are no sensible default domains as they depend on the environment. For example, the OpenJDK 11 maxHeapSize JVM parameter dictates how much memory the JVM can use. This obviously depends on the environment in which the JVM runs. For example, the upper bound might be 90% of the memory of the virtual machine or container in which the JVM runs.

Defining good parameter domains is important to ensure the parameter configurations suggested by the Akamas AI engine will be as good as possible. Notice that if the domain is not defined correctly, this may cause experiment failures (e.g. the JVM could not start if the maxHeapSize is higher than the container size). As discussed as part of the best practices for defining robust workflows, the Akamas AI engine has been designed to learn configurations that may lead to failures and to automatically discover any hidden constraints found in the environment.

Configure parameter constraints based on Optimization Pack best practices

Depending on the specific technology under optimization, the configuration parameters may have relationships among themselves. For example, in a JVM the newSize parameter defines the size of a region of the JVM heap, and hence its value should be always less than the maxHeapSize parameter.

Akamas AI engine supports the definition of constraints among parameters as this is a frequent need when optimizing real-life applications.

It is important to define the parameter constraints when creating a new study. The optimization pack documentation provides guidelines on what are the most important parameter constraints for the specific technology.

When optimizing a new or custom technology, it may happen that some experiments fail due to unknown parameter constraints being violated. For example, the application may fail to start and only by analyzing the application error logs, the reason for the failure can be understood. For a Java application, the JVM error message (e.g. "new size cannot be larger than max heap size") could provide useful hints. This would reveal that some constraints need to be added to the parameter constraints in the study.

While the Akamas AI engine has been designed to learn from failures, including those due to relationships among parameters that were not explicitly set as constraints, setting parameter constraints may help avoid unnecessary failures and thus speed up the optimization process.

Defining workloads

For a live optimization study, it is required to specify which component metrics represent the different workloads observed on the target system. A workload could be represented by either a metric directly measuring that workload, such as the application throughput, or a proxy metric, such as the percentage of reads and writes in your database.

The page of the section in the reference guide describes how to define the corresponding structure.

Akamas features automatic detection of workload contexts, corresponding to different patterns for the same workload. For example, workload context could correspond to the peak or idle load, or to the weekend or weekday traffic. This allows Akamas to recommend safe configurations based on the observed behavior of the system under similar workload conditions.

Akamas provides several parameters governing how the Akamas optimizer operates and leverages the workload information while a live optimization study is being executed. The most important parameter is the online mode (see ) as it related to whether the human user is part of the approval loop when the Akamas AI recommends a configuration to be applied.

Moreover, Akamas also provides customizable safety policies that drive the Akamas optimizer in evaluating candidate configurations with respect to defined goal constraints.

Online mode

Live optimizations can operate in one of the following online modes:

recommendation (or manual) mode (the default mode): Akamas does not immediately apply a configuration identified by Akamas AI: a new configuration is first recommended to the user, who needs to approve it, possibly after modifying it, before it gets applied - this is also referred to as human-in-the loop scenario;
fully autonomous (or automatic) mode: new configurations are immediately applied by Akamas as soon as they are generated by the Akamas AI, without being first recommended to (and approved by) the user.

It is worth noticing that under a recommendation mode, there might be a significant delay between the time a configuration is identified by Akamas and the time the recommended changes get applied. Therefore, the Akamas AI leverages the workload information differently when looking for a new configuration, depending on the defined online mode:

in the recommendation mode, Akamas takes into account all the defined workloads and looks for the configuration that best satisfies the goal constraints for all the observed workloads and provides the best improvements for all of them
in the fully autonomous mode, Akamas works on a single workload at each iteration (based on a customizable workload strategy - see below) and looks for an optimized configuration for that specific workload to be immediately applied in the next iteration, even if it might not be the best for the different workloads

The online mode can be specified at the study level and can also be overridden at the step level (only for steps of type "optimize" - see section ). The page of the section in the reference guide describes how to define the corresponding structure. This can be done either from the Akamas command line (see page ) or from the Akamas AI (see the following figure).

Notice that the online mode can be changed at any time, that is while the optimization study is running, to become immediately effective. For example, a live optimization could initially operate in recommendation mode and then be changed to fully autonomous mode afterward.

Defining optimization steps

A final step in defining an optimization study is to specify specifies the sequence of steps executed while running the study.

The following four types of steps are available:

Baseline: performs an experiment and sets it as a baseline for all the other ones
Bootstrap: imports experiments from other studies
Preset: performs an experiment with a specific configuration
Optimize: performs experiments and generates optimized configurations

Please notice that at least one baseline step is always required in any optimization study. This applies not only to offline optimization studies, but also to live optimization studies as it is being used to suggest changes to parameter values starting from the default values.

Best Practices

The following sections provide some best practices on how to best approach the step of defining the baseline step.

Ensure the baseline configuration is correct

In an optimization study, the baseline is an important experiment as it represents the system performance with the current configuration, and serves as a reference to assess the relative improvements the optimization achieved.

Therefore, it is important to make sure the baseline configuration of the study correctly reflects the current configuration - be it the vendor default or the result of a manual tuning exercise.

Evaluate which parameters to include in the baseline configuration

When defining the study baseline configuration it is important to evaluate which parameters to include. Indeed, several technologies have default values assigned to most of their configuration parameters. However, the runtime behavior can be different depending on whether the parameter is set to the default value or it is not set at all.

Therefore, it is recommended to review the current configuration (e.g. the one in place in production) and identify which parameters and values have been set (e.g. JVM maxHeapSize = 2GB, gcType = Parallel, etc.), and then to only set those parameters with their corresponding values, without adding any other parameters. This ensures that the specified baseline is consistent with the real production setup.

Setting safety policies

While Akamas leverages similar AI methods for both live optimizations and optimization studies, the way these methods are applied is radically different. Indeed, for optimization studies running in pre-production environments, the approach is to explore the configuration space by also accepting potential failed experiments, to identify regions that do not correspond to viable configurations. Of course, this approach cannot be accepted for live optimization running in production environments. For this purpose, Akamas live optimization uses observations of configuration changes combined with the automatic detection of workload contexts and provides several customizable safety policies when recommending configurations to be approved, revisited, and applied.

Akamas provides a few customizable optimizer options (refer to the options described on the page of the reference guide) that should be configured so as to make configurations recommended in live optimization and applied to production environments as safe as possible.

Exploration factor

Akamas provides an optimizer option known as the exploration factor that only allows gradual changes to the parameters. This gradual optimization allows Akamas to observe how these changes impact the system behavior before applying the following gradual changes.

By properly configuring the optimizer, Akamas can gradually explore regions of the configuration space and slowly approach any potentially risky regions, thus avoiding recommending any configurations that may negatively impact the system. Gradual optimization takes into account the maximum recommended change for each parameter. This is defined as a percentage (default is 5%) with respect to the baseline value. For example, in the case of a container whose CPU limit is 1000 millicores, the corresponding maximum allowed change is 50 millicores. It is important to notice that this does not represent an absolute cap, as Akamas also takes into account any good configurations observed. For example, in the event of a traffic peak, Akamas would recommend a good configuration that was observed working fine for a similar workload in the past, even if the change is higher than 5% of the current configuration value.

Notice that this feature would not work for categorical parameters (e.g. JVM GC Type) as their values do not change incrementally. Therefore, when it comes to these parameters, Akamas by default takes a conservative approach of only recommending configurations with categorical parameters taking already observed before values. This still allows some never observed values to be recommended as users are allowed to modify values also for categorical parameters when operating in human-in-the-loop mode. Once Akamas has observed that that specific configuration is working fine, the corresponding value can then be recommended. For example, a user might modify the recommended configuration for GC Type from Serial to Parallel. Once Parallel has been observed as working fine, Akamas would consider it for future recommendations of GC Type, while other values (e.g. G1) would not be considered until verified as safe recommendations.

The exploration factor can be customized for each live optimization individually and changed while live optimizations are running.

Safety factor

Akamas provides an optimizer option known as the safety factor designed to prevent Akamas from selecting configurations (even if slowly approaching them) that may impact the ability to match defined SLOs. For example, when optimizing container CPU limits, lower and lower CPU limits might be recommended, up to the point that the limit becomes too low that the application performance degrades.

Akamas takes into account the magnitude of constraint breaches: a severe breach is considered more negative than a minor breach. For example, in the case of an SLO of 200 ms on response time, a configuration causing a 1 sec response time is assigned a very different penalty than a configuration causing a 210 ms response time. Moreover, Akamas leverages the smart constraint evaluation feature that takes into account if a configuration is causing constraints to approach their corresponding thresholds. For example, in the case of an SLO of 200 ms on response time, a configuration changing response time from 170 ms to 190 ms is considered more problematic than one causing a change from 100 ms to 120 ms. The first one is considered by Akamas as corresponding to a gray area that should not be explored.

The safety factor is also used when starting the study in order to validate the behavior of the baseline to identify the safety of exploring configurations close to the baseline. If the baseline presents some constraint violations, then even exploring configurations close to the baseline might cause a risk. If Akamas identifies that, in the baseline configuration, more than (safety_factor*number_of_trials) manifest constraint violations then the optimization is stopped.

If your baseline has some trials failing constraint validation we suggest you analyze them before proceeding with the optimization

The safety factor is set by default to 0.5 and can be customized for each live optimization individually and changed while live optimizations are running.

Outlier detection

It is also worth mentioning that Akamas also features an outlier detection capability to compensate for production environments typically being noisy and much less stable than staging environments, thus displaying highly fluctuating performance metrics. As a consequence, constraints may fail from time to time, even for perfectly good configurations. This may be due to a variety of causes, such as shared infrastructure on the cloud, slowness of external systems, etc.