The following provides some best practices that can be adopted before launching optimization studies, in particular for offline optimization studies.
It is recommended to execute a dry-run of the study to verify that the workflow works as expected and in particular that the telemetry and configuration management steps are correctly executed.
Verify that the workflow actually works
It is important to verify that all the steps of the workflow complete successfully and produce the expected results.
Verify that parameters are applied and effective
When approaching the optimization of new applications or technologies, it is important to make sure all the parameters that are being set are actually applied and used by the system.
Depending on the specific technology at hand, the following issues can be found:
parameters were set but they are not applied - for example parameters were set in the wrong configuration file or the path is not correct;
some automatic (corrective) mechanisms are in place that override the values applied for the parameters.
Therefore, it is important to always verify the actual values of the parameters once the system is up & running with a new configuration, and make sure they match the values applied by Akamas. This is typically done by leveraging:
monitoring tools, when the parameters are available as metrics or properties of the system;
native administration tools, which are typically available for introspection or troubleshooting activities (e.g. jcmd for the JVM).
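As an illustration of this check for the JVM case, the following minimal sketch parses flag text in the form printed by `jcmd <pid> VM.flags` and compares it with the values that were supposed to be applied. The helper names (`parse_jvm_flags`, `find_mismatches`) and the sample text are hypothetical and are not part of Akamas; in practice the text would come from running jcmd against the target process.

```python
import re

def parse_jvm_flags(vm_flags_output: str) -> dict:
    """Parse -XX flags from text such as the output of `jcmd <pid> VM.flags`."""
    flags = {}
    for token in vm_flags_output.split():
        # Valued flags look like -XX:MaxHeapSize=4294967296
        match = re.fullmatch(r"-XX:([A-Za-z0-9_]+)=(.+)", token)
        if match:
            flags[match.group(1)] = match.group(2)
            continue
        # Boolean flags look like -XX:+UseG1GC or -XX:-UseCompressedOops
        match = re.fullmatch(r"-XX:([+-])([A-Za-z0-9_]+)", token)
        if match:
            flags[match.group(2)] = "true" if match.group(1) == "+" else "false"
    return flags

def find_mismatches(expected: dict, actual: dict) -> dict:
    """Return {name: (expected, actual)} for parameters that differ or are missing."""
    return {
        name: (value, actual.get(name))
        for name, value in expected.items()
        if actual.get(name) != value
    }

# Illustrative sample text, not captured from a real system.
sample = "1234:\n-XX:CICompilerCount=4 -XX:MaxHeapSize=4294967296 -XX:+UseG1GC"
# Values that were supposed to be applied (hypothetical).
expected = {"MaxHeapSize": "2147483648", "UseG1GC": "true"}

mismatches = find_mismatches(expected, parse_jvm_flags(sample))
print(mismatches)
```

An empty result means every expected parameter was found with the expected value; any entry points to a parameter that was not applied or was overridden.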
Verify that load testing works
It is important to verify that the integration with load testing tools actually executes the intended load test scenarios.
Verify that telemetry collects all the relevant metrics
It is important to make sure that the integration with telemetry providers works correctly and that all the relevant metrics of the system are correctly collected.
Data-gathering from the telemetry data sources is launched at the end of the workflow tasks. The status of the telemetry process can be inspected in the Progress tab, where it is also possible to inspect the telemetry logs in case of failures.
Please notice that the telemetry process fails if the key metrics of the study cannot be gathered. This includes metrics defined in the goal function or constraints.
Before running the optimization study, it is important to make sure the system and the environment where the optimization is running provide stable and reproducible performance.
Make sure the system performance is stable
In order to ensure a successful optimization, it is important to make sure that the target system displays stable and predictable performance and does not suffer from random variations.
To make sure this is the case, it is recommended to create a study that only runs a single baseline experiment. To assess the performance of the system, Akamas trials can be used to execute the same experiment (hence, the same configuration) multiple times (e.g. three times). Once the experiment is completed, the resulting performance metrics can be analyzed to assess stability. The analysis can be done either by leveraging aggregate metrics in the Analysis tab or, at a deeper level, on the actual time series by accessing the Metrics tab of the Akamas UI.
Ideally, no significant performance variation should be observed in the different trials, for the key system performance metrics. Otherwise, it is strongly recommended to identify the root cause before proceeding with the actual optimization activity.
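One possible way to quantify "no significant performance variation" is the coefficient of variation (standard deviation divided by the mean) of a key metric across the trials. The sketch below assumes the per-trial values have been read from the Akamas UI or exported by other means; the 5% threshold and the function names are illustrative assumptions, not Akamas defaults.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean."""
    avg = mean(values)
    if avg == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return stdev(values) / abs(avg)

def is_stable(trial_values, max_cv=0.05):
    """Treat the metric as stable if its variation across trials is within max_cv.

    max_cv=0.05 (5%) is an assumed threshold; choose one that fits the system.
    """
    return coefficient_of_variation(trial_values) <= max_cv

# Hypothetical response times (ms) from three trials of the baseline experiment.
stable_trials = [102.0, 98.0, 100.0]
noisy_trials = [80.0, 100.0, 120.0]

print(is_stable(stable_trials))
print(is_stable(noisy_trials))
```

With only three trials the estimate is rough, so a borderline value is better treated as a prompt to run more trials than as a verdict.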
If you are running a live optimization, any constraint violation in the baseline will halt the study. In order to recommend safe configurations, the optimization process requires that the baseline does not violate constraints for the entire observation period.
Before launching the optimization it might be a good idea to take note of (or backup) the original configuration. This is very important in the case of Linux OS parameters optimization.
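For Linux OS parameters, one simple form of backup is to save the output of `sysctl -a` before the study and compare it with a later snapshot. The sketch below only covers the comparison step, working on text in the `key = value` form printed by sysctl; the function names and sample values are illustrative assumptions.

```python
def parse_sysctl(text: str) -> dict:
    """Parse lines of the form 'key = value' into a dictionary."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that do not contain '='
            params[key.strip()] = value.strip()
    return params

def changed_params(backup: dict, current: dict) -> dict:
    """Return {name: (backup_value, current_value)} for parameters that changed."""
    return {
        name: (value, current.get(name))
        for name, value in backup.items()
        if current.get(name) != value
    }

# Illustrative snapshots; real ones would be saved from `sysctl -a`.
before_study = "vm.swappiness = 60\nnet.core.somaxconn = 4096"
after_study = "vm.swappiness = 10\nnet.core.somaxconn = 4096"

to_restore = changed_params(parse_sysctl(before_study), parse_sysctl(after_study))
print(to_restore)
```

The first element of each reported pair is the original value to restore.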
The following best practices should be considered before applying a configuration identified by an offline optimization study from a test or pre-production environment to a production environment.
Most of these best practices are general and refer to any configuration change and application rollout, not only to Akamas-related scenarios.
Any configuration identified by Akamas in a test or pre-production environment, by executing a number of experiments and trials in a limited timeframe, should first be validated for its ability to consistently deliver the expected performance over time before being promoted to production.
An endurance test typically lasts for several hours and can either mimic the specific load profile of production environments (e.g. morning peaks or low load phases during the night) or a simple constant high load (flat load). A specific Akamas study can be implemented for this purpose.
When applying a new configuration to a production environment, it is important to reduce the risk of severely impacting the supported services and to allow time to backtrack if required.
With a gradual rollout approach, a new configuration is applied to only a subset of the target system, so that the system can be observed for a period of time without impacting the entire system.
Several strategies are possible, including:
Canary deployment, where a small percentage of the traffic is served by the instance with the new configuration;
Shadow traffic, where traffic is mirrored and redirected to the instance with the new configuration, and responses are not impacting the user.
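To make the canary idea concrete, the sketch below shows one common way to send a fixed percentage of traffic to the instance with the new configuration: hash a stable request key into a bucket and compare it with the canary share. In real deployments this decision is normally made by a load balancer or service mesh rather than by application code; the function name and the choice of key are assumptions for illustration only.

```python
import hashlib

def routes_to_canary(request_key: str, canary_percent: float) -> bool:
    """Deterministically decide whether a request goes to the canary instance.

    request_key is any stable identifier (e.g. a user or session id), so the
    same caller is always routed to the same side.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < canary_percent * 100

# Roughly 5% of a synthetic population of users lands on the canary.
canary_users = sum(routes_to_canary(f"user-{i}", 5) for i in range(10_000))
print(canary_users)
```

Hashing a stable key instead of drawing a random number per request keeps each user on one configuration, which makes it easier to attribute any regression that is observed.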
In the case of an application sharing entire layers or single components (e.g. microservices) with other applications, it is important to assess in advance the potential impact on other applications before applying a configuration identified by only considering SLOs related to a single application.
The following general considerations may help in assessing the impact on the infrastructure:
if the new configuration is more efficient (i.e. it is less demanding in terms of resources) or it does not require changes to resource requirements (e.g. does not change K8s request limits), then the configuration can be expected to be beneficial as the resources will be freed and become available for additional applications;
If the new configuration is less efficient (i.e. it requires more resources), then appropriate checks of whether the additional capacity is available in the infrastructure (e.g. in the K8s cluster or namespace) should be done, as when allocating new applications.
As far as the other applications are concerned:
Just reducing the operational cost of a service does not have any impact on other applications that are calling or using the service;
While tuning a service for performance may put the caller system in back-pressure fatigue, this is not the typical behavior of enterprise systems, where the most susceptible systems are on the backend side:
Tuning most external services will not increase the throughput much, which is typically business-driven, thus the risk to overwhelm the backends is low;
Tuning the backends allows the caller systems to handle faster connections, thus reducing the memory footprint and increasing the resilience of the entire system;
Especially in the case of highly distributed systems, such as microservices, the number of in-flight packages for a given period of time is something to be minimized;
A latency reduction for a microservice implies fewer in-flight packages throughout the system, leading to better performance, faster failures, and fewer pending transactions to be rolled back in case of incidents.
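The relationship between latency and in-flight work can be illustrated with Little's law, which states that the average number of items in a system equals the arrival rate multiplied by the average time each item spends in the system. The sketch below applies it to a hypothetical microservice; the throughput and latency figures are made up for illustration.

```python
def in_flight_requests(throughput_per_s: float, latency_s: float) -> float:
    """Little's law: average in-flight requests = throughput x average latency."""
    return throughput_per_s * latency_s

# Hypothetical service handling 500 requests/s, which is business-driven
# and therefore unchanged by the tuning.
before_tuning = in_flight_requests(500, 0.200)  # 200 ms average latency
after_tuning = in_flight_requests(500, 0.120)   # 120 ms average latency

print(before_tuning, after_tuning)
```

At constant throughput, the in-flight count scales linearly with latency, so a 40% latency reduction means 40% fewer requests pending at any moment.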
Even for live optimization studies, it is a good practice to analyze how the optimization is being executed with respect to the defined goal & constraints, and workloads.
This analysis may provide useful insights about the system being optimized (e.g. understanding of the system dynamics) and about the optimization study itself (e.g. how to adjust optimizer options or change constraints). Since this is more challenging for an environment that is being optimized live, a common practice is to adopt a recommendation mode before possibly switching to a fully autonomous mode.
The Akamas UI displays the results of a live optimization study in the following areas:
the Metrics section (see the following figures) displays the behavior of the metrics as configurations are recommended and applied (possibly after being reviewed and approved by users); this area supports the analysis of how the optimizer is driven by the configured safety and exploration factors.
The All Configurations section provides the list of all the recommended configurations, possibly as modified by the user, as well as the detail of each applied configuration (see the following figures).
in the case of a recommendation mode, the Pending configuration section (see the following figure) shows the configuration that is being recommended to allow users to review it (see the EDIT toggle) and approve it:
While the main result of an optimization study is to identify the optimal configuration with respect to the defined goal & constraints, any suboptimal configuration that is improving on one of the defined KPIs can be also very valuable.
These configurations are displayed in a dedicated section of the Akamas UI and also displayed in other areas of the Akamas UI as textual badges "Best <KPI name>" referred to as (insights) tags.
The following figures show the Insights section displayed on the study page and the Insights pages that can be drilled down to.
The following figure shows the insights tags in the Analysis tab:
Please notice that "Best", "Best Memory Limit" and any other KPI-related tags are displayed in the Akamas UI while the study progresses and thus may be reassigned as new experiments get executed and their configurations are scored and provide their results for the defined study KPIs. See
After starting a study, any finished experiment is labeled by one or more insights tags "Best <KPI name>" in case the corresponding configuration provides the best result so far for those KPIs. Notice that for experiments involving multiple trials, tags are only assigned after all their trials have finished.
Of course, after the very first experiment (i.e. a baseline) finishes, all tags are assigned to the corresponding configuration. This is displayed by the following figure for a study with two KPIs: CPU (with formula renaissance.cpu_used and direction minimize) and MEM (with formula renaissance.mem_used and direction minimize):
When the following experiments finish, tags are reevaluated with respect to the computed goal score and the results achieved for each single KPI. In this study, experiment #2 provided a better result for both the CPU KPI and the study goal, so it got both the tags Best CPU and Best renaissance.response_time (which is defined as the goal of the study). Notice that the blue star is displayed by Akamas (except for the baseline) to highlight the fact that this tag was automatically generated by Akamas and not assigned by a user.
Afterward, experiment #3 was tagged as the best configuration, while experiment #4 got the tag Best CPU, as it improved on experiment #2. Therefore, two configurations displayed the blue star.
A number of experiments later, experiment #7 provided better memory usage than the baseline, so it got the tag Best MEM assigned. At this point, three configurations have the blue star, thus making evident that there are tradeoffs when trying to optimize with respect to the goal and the KPIs.
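The tag reassignment described in this walkthrough can be summarized as a "best so far" rule applied per KPI as experiments finish. The following sketch reproduces the sequence above under stated assumptions: all metric values are invented, the goal renaissance.response_time is assumed to be minimized, and the goal is treated as one more KPI (constraints are ignored). It illustrates the logic only and is not how Akamas implements it.

```python
def is_better(value, best, direction):
    """Compare two results according to the KPI direction."""
    return value < best if direction == "minimize" else value > best

def assign_best_tags(experiments, kpis):
    """Return {tag: experiment_id} after processing experiments in finish order.

    experiments: list of (experiment_id, {metric_name: value})
    kpis: {tag_name: (metric_name, direction)}
    """
    holders = {}  # tag -> (experiment_id, value) of the current best
    for exp_id, results in experiments:
        for tag, (metric, direction) in kpis.items():
            value = results[metric]
            current = holders.get(tag)
            # The first finished experiment (the baseline) takes every tag.
            if current is None or is_better(value, current[1], direction):
                holders[tag] = (exp_id, value)
    return {tag: exp_id for tag, (exp_id, _) in holders.items()}

kpis = {
    "Best CPU": ("renaissance.cpu_used", "minimize"),
    "Best MEM": ("renaissance.mem_used", "minimize"),
    # Study goal; the minimize direction is an assumption.
    "Best renaissance.response_time": ("renaissance.response_time", "minimize"),
}

def result(cpu, mem, rt):
    return {"renaissance.cpu_used": cpu, "renaissance.mem_used": mem,
            "renaissance.response_time": rt}

# Invented values chosen to mirror the walkthrough (experiment 1 is the baseline).
experiments = [
    (1, result(2.0, 4.0, 300)), (2, result(1.8, 4.2, 280)),
    (3, result(1.9, 4.1, 250)), (4, result(1.6, 4.3, 270)),
    (5, result(1.7, 4.5, 260)), (6, result(1.9, 4.4, 290)),
    (7, result(2.1, 3.5, 310)),
]

tags = assign_best_tags(experiments, kpis)
print(tags)
```

Running the same function on a prefix of the list (e.g. `experiments[:2]`) shows the intermediate assignments, which is why tags seen while a study is running can later move to other experiments.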
Once all the preparatory steps for creating a study are done, running a study is straightforward: An optimization study can be started from either the Akamas UI (see the following figures) or the command line (refer to the Resource management commands page).
Before actually running an optimization study, it is highly recommended to read the following sections:
Once started, managing studies is different for offline optimization studies (see here below) and live optimization studies (see here below).
Notice that once an offline optimization study has started, it can only be stopped or left to finish, and it cannot be restarted. However, it is possible to reuse in a new study the experiments executed in another study, whether or not that study finished successfully - this is called bootstrapping and is illustrated by the following figure (also refer to the Bootstrap Step page on the reference page).
This can be useful for multiple reasons, including the case of an error (e.g. a misconfigured workflow) that requires "restarting" the study.
For live optimization studies, it is possible to stop a study and restart it. However, please notice that this is an irreversible action that deletes all the executed experiments, so restarting a live study basically means starting it from scratch.
Since an offline optimization study lasts for at most the number of configured experiments and typically runs in a test or pre-production environment, results can safely be analyzed after the study has completely finished.
However, it is a good practice to analyze partial results while the study is still running as this may provide useful insights about both the system being optimized (e.g. understanding of the system dynamics and sub-optimal configurations that could be immediately applied) and about the optimization study itself (e.g. how to re-design a workflow or change constraints), early-on.
The Akamas UI displays the results of an offline optimization study in different visual areas:
the Best Configuration section provides the optimal configuration identified by Akamas, as a list of recommended values for the optimization parameters compared to the baseline and ranked according to their relevance;
the Progress tab (see the following figures) displays the progression of the study with respect to the study steps, the status of each experiment (and trial), its associated score, and the parameter values of the corresponding configurations; this area is mostly used for study monitoring (e.g. identifying failing workflows) and troubleshooting purposes;
the Analysis tab (see the following figures) displays how the baseline and experiments score with respect to the optimization goal, and the values of metrics and parameters for the corresponding configurations; this area supports the analysis of the different configurations;
the Metrics tab (see the following figure) displays the behavior of the metrics for all executed experiments (and trials); this area supports both study validation activities and deeper analysis of the system behavior;
the Insights section (see the following figure) displays any suboptimal configurations that have been identified for the study KPIs, and also allows making comparisons among them and the best configuration - the page describes in further detail the Insight section and the insights tags displayed in other areas of the Akamas UI.