Even for live optimization studies, it is good practice to analyze how the optimization is being executed with respect to the defined goal, constraints, and workloads.
This analysis may provide useful insights about the system being optimized (e.g. understanding of the system dynamics) and about the optimization study itself (e.g. how to adjust optimizer options or change constraints). Since this is more challenging for an environment that is being optimized live, a common practice is to adopt a recommendation mode before possibly switching to a fully autonomous mode.
The Akamas UI displays the results of a live optimization study in the following areas:
The Metrics section (see the following figures) displays the behavior of the metrics as configurations are recommended and applied (possibly after being reviewed and approved by users); this area supports the analysis of how the optimizer is driven by the configured safety and exploration factors.
The All Configurations section provides the list of all the recommended configurations, possibly as modified by the user, as well as the details of each applied configuration (see the following figures).
In the case of a recommendation mode, the Pending configuration section (see the following figure) shows the configuration that is being recommended to allow users to review it (see the EDIT toggle) and approve it:
A critical aspect of evaluating the performance of an application is making sure that the data we use is accurate. It's quite common for IT systems to experience transient periods of instability; these occur in many situations, such as filling up caches, runtime compilation activities, horizontal scaling, and more.
A common practice in performance engineering is to exclude the initial and final parts of a performance test from the analysis, so that only the time when the system is in full operation is considered. Akamas can automatically identify a subset of the whole data to evaluate scores and constraints.
Looking at the example below, from the Online Boutique application, we see that the response time has an initial spike to about 7ms and then stabilizes below 1ms; also the CPU utilization shows a similar pattern.
This is quite common, for example, in Java-based systems, where activities such as heap resizing and just-in-time compilation take place in the first minutes of operation. In this case, Akamas considered only the gray area when evaluating the experiment, effectively avoiding the impact of the initial spike.
This behavior can be configured in the study by specifying a section called windowing. Two windowing policies allow you to properly configure Akamas in different scenarios.
The simplest policy is called trim and allows users to specify how much time should be excluded from the evaluation at the start and the end of the experiment. It is also possible to apply the trim policy to a specific task of the workflow, which is useful when, for example, the time required to deploy the application might change. You can read more on this policy in the reference documentation section.
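To make the idea of trimming concrete, here is a minimal Python sketch. It is not Akamas code: the function name `trim_window`, the sample data, and the trim durations are all hypothetical and only illustrate how excluding the start and end of an experiment changes the value used for scoring.

```python
from statistics import mean


def trim_window(samples, trim_start_s, trim_end_s):
    """Keep only the samples inside the trimmed time window.

    samples: list of (timestamp_seconds, value) tuples sorted by timestamp.
    """
    if not samples:
        return []
    start = samples[0][0] + trim_start_s
    end = samples[-1][0] - trim_end_s
    return [(t, v) for t, v in samples if start <= t <= end]


# Hypothetical response-time samples (seconds, milliseconds):
# a warm-up spike followed by a steady state.
samples = [(0, 7.0), (60, 3.0), (120, 0.9), (180, 0.8), (240, 0.9), (300, 0.8)]

# Drop the first 120 seconds and the last 60 seconds of the experiment.
window = trim_window(samples, trim_start_s=120, trim_end_s=60)
steady_mean = mean(v for _, v in window)
full_mean = mean(v for _, v in samples)
```

With this data, the mean over the trimmed window is much lower than the mean over the whole experiment, because the warm-up spike no longer contributes to the evaluation.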
In other contexts, discarding the initial warmup period is not enough. For these scenarios, Akamas supports a more advanced policy, called stability. This policy is also particularly useful for stress tests, where the objective is to make the system sustain as much load as possible before becoming unstable, as it allows users to express constraints on the stability of the system. You can read more on this policy in the reference documentation section.
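The general idea behind a stability-based window can be sketched as a search for a period in which a metric varies very little. The Python example below is an illustration under our own assumptions (a fixed window width and a standard deviation threshold); the actual options and semantics of the Akamas stability policy are described in the reference documentation and may differ.

```python
from statistics import mean, pstdev


def find_stable_window(values, width, max_stddev):
    """Return (start_index, window_mean) for the first window of `width`
    samples whose population standard deviation is <= max_stddev.
    Return None if no window is stable enough."""
    for start in range(len(values) - width + 1):
        window = values[start:start + width]
        if pstdev(window) <= max_stddev:
            return start, mean(window)
    return None


# Hypothetical CPU utilization samples: an initial spike, then a plateau.
cpu_util = [0.95, 0.70, 0.42, 0.40, 0.41, 0.40, 0.43]

result = find_stable_window(cpu_util, width=3, max_stddev=0.02)
no_result = find_stable_window(cpu_util, width=3, max_stddev=0.0001)
```

When no window satisfies the threshold, the function returns `None`; in a stress test this would correspond to a system that never reached a stable state.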
The windowing section in the study definition is optional and the default policy considers all the available data to evaluate the performance of the experiment.
One of the key elements that define an optimization study is the parameter set. We have already seen in the study section how to define the set of optimized parameters; here we dig deeper into this topic.
Akamas supports four types of parameters:
Integer parameters are those that can only assume an integer value (e.g. the number of cores on a VM instance).
Real parameters can assume real values (e.g. 0.2) and are mostly used when dealing with percentages.
Categorical parameters map those elements that do not have a strict ordering such as GC types (e.g. Parallel, G1, Serial) or booleans.
Ordinal parameters are similar to categorical ones, as they also support a set of literal values, but these values are ordered. An example is VM instance size (e.g. small, medium, large, xlarge).
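The four parameter types can be pictured with a small Python model. This is only an illustrative sketch: the class names, fields, and example parameter names are hypothetical and do not reflect how Akamas or its Optimization Packs define parameters.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class IntegerParam:
    name: str
    low: int
    high: int

    def contains(self, value) -> bool:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(value, int) and not isinstance(value, bool)
        return is_int and self.low <= value <= self.high


@dataclass(frozen=True)
class RealParam:
    name: str
    low: float
    high: float

    def contains(self, value) -> bool:
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        return is_number and self.low <= value <= self.high


@dataclass(frozen=True)
class CategoricalParam:
    name: str
    categories: Sequence[str]  # no ordering is implied

    def contains(self, value) -> bool:
        return value in self.categories


@dataclass(frozen=True)
class OrdinalParam:
    name: str
    levels: Sequence[str]  # ordered from lowest to highest

    def contains(self, value) -> bool:
        return value in self.levels

    def rank(self, value) -> int:
        # Position in the ordering; this is what categorical values lack.
        return self.levels.index(value)


# Hypothetical parameters, one per type.
cores = IntegerParam("vm_cores", 1, 16)
ratio = RealParam("cache_ratio", 0.0, 1.0)
gc_type = CategoricalParam("gc_type", ("Parallel", "G1", "Serial"))
size = OrdinalParam("instance_size", ("small", "medium", "large", "xlarge"))
```

The only difference between the last two classes is `rank`: an optimizer can reason about "larger" or "smaller" instance sizes, while GC types can only be compared for equality.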
You can read more on parameters and how they are managed in the reference documentation section.
Most of the time you do not need to define parameters yourself, as this information is already defined in the Optimization Packs.
When creating new optimization studies you should first select a set of parameters to include in the optimization process. The set might depend on many factors such as:
The potential impact of a parameter on the defined goal (e.g. if my goal is to reduce the cost of running an application it might be a good idea to include parameters related to resource usage).
The layers selected for the optimization. Optimizing multiple layers at the same time might bring more benefits as the configurations of both layers are aligned.
Akamas' ability to change those parameters (e.g. if my deployment process does not support the definition of some parameters because, as an example, they are managed by an external group, I should avoid adding them).
Besides defining the set of parameters users can also select the domain for the optimization and add a set of constraints.
Optimization packs already include information on the possible values for a parameter, but in some situations it is necessary to narrow this domain. As an example, the parameter that defines the amount of CPU that a container can use (cpu_limit) might vary a lot depending on the underlying cluster and the application. If the cluster that hosts the application only contains nodes with up to 10 CPUs, it might be worth limiting the domain of this parameter for the optimization study to that value, to avoid failures when deploying the application and to speed up the optimization process. If you forget to set this domain restriction, Akamas will learn it by itself, but it will first need to try deploying a container with a higher CPU limit to find out that this is not possible.
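Narrowing a domain amounts to intersecting the default range with a study-specific one. The Python sketch below illustrates this; the helper `restrict_domain` and the default range used for cpu_limit are hypothetical and are not taken from any Optimization Pack.

```python
def restrict_domain(default_domain, study_domain):
    """Intersect a parameter's default domain with a study-specific one.

    Both domains are (low, high) tuples with inclusive bounds.
    """
    low = max(default_domain[0], study_domain[0])
    high = min(default_domain[1], study_domain[1])
    if low > high:
        raise ValueError("the study domain does not overlap the default domain")
    return (low, high)


# Hypothetical default domain for cpu_limit, expressed in CPU cores.
default_cpu_limit = (0.1, 64.0)

# The cluster only has nodes with up to 10 CPUs, so cap the domain there.
study_cpu_limit = restrict_domain(default_cpu_limit, (0.1, 10.0))
```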
In many situations, parameters depend on each other. As an example, suppose you want to optimize at the same time the size of a container and the Java runtime that executes the application inside it. Both layers have parameters that affect how much memory can be used: for the container layer this parameter is called memory_limit, and for the JVM it is called jvm_heap_size. Configurations that have a jvm_heap_size value higher than the memory_limit might lead to out-of-memory errors.
You can define this relationship by specifying a constraint as in the example below:
These constraints instruct Akamas to avoid generating configurations that bring the jvm_heap_size parameter too close to the memory_limit, leaving a gap of 50 MB.
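The effect of such a parameter constraint can be sketched in Python as a filter over candidate configurations. This is not the Akamas constraint syntax: the function name and the candidate values are hypothetical, and both parameters are assumed to be expressed in megabytes.

```python
def satisfies_memory_constraint(config, gap_mb=50):
    """True when the JVM heap leaves at least `gap_mb` of headroom
    below the container memory limit (both values in megabytes)."""
    return config["jvm_heap_size"] + gap_mb <= config["memory_limit"]


# Hypothetical candidate configurations.
candidates = [
    {"jvm_heap_size": 512, "memory_limit": 1024},   # plenty of headroom
    {"jvm_heap_size": 1000, "memory_limit": 1024},  # gap smaller than 50
    {"jvm_heap_size": 2048, "memory_limit": 1024},  # heap larger than limit
]

valid = [c for c in candidates if satisfies_memory_constraint(c)]
```

Only the first candidate survives the filter; the other two would risk out-of-memory errors and are never proposed.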
Constraints usually depend on the set of parameters chosen for the optimization. You can find more information about common constraints for the supported technologies in the documentation of the related optimization pack or the optimization guides.
In cases where a testing environment is not available or it is hard to build representative load tests, Akamas can directly optimize production environments by running a Live Optimization study. Production environments differ from test environments in many ways. Here are the main aspects that affect how Akamas can optimize the system in such a scenario and that define live optimization studies:
Safety, in terms of application stability and performance, is critical in production environments where SLOs might be in place.
The approval process is usually different between production and lower-level environments. In many cases, a configuration change in a production environment must be manually approved by the SRE or Application team and follow a custom deployment scenario.
The workload on the application in a production environment is usually not controlled: it might change with the time of day, or because of special events or external factors.
These are the main factors that make live optimization studies differ from offline optimizations.
The following figure represents the iterative process associated with live optimizations:
The following 5 phases can be identified for each iteration:
Collect KPIs: Akamas collects the metrics of the system required to observe its behavior under the current parameter configuration by leveraging the associated telemetry provider - here Akamas is also observing and categorizing the different workload contexts that are used to recommend configurations that are appropriate for each specific workload context
Score vs Goal: Akamas scores the applied parameter configuration under the specific workload context against the defined goal and constraints
Recommend Conf: Akamas provides a recommendation for parameter configuration based on the observed behavior and leveraging the Akamas AI
Human Approval: the recommendation is inspected, possibly revisited, and approved by users before being applied to the system. This step is optional and can be automated.
Apply Conf: Akamas applies the recommended configuration by leveraging the defined workflow.
Overall the core process is very similar to the one of offline optimization studies. The main difference is the (optional) presence of a manual configuration review and approval step.
Even if the process is similar, the way recommended configurations are generated is quite different as it's subject to some safety policies such as:
The exploration factor defines the maximum magnitude of the change of a parameter from one configuration to the next (e.g. reducing the CPU limit of a container by at most 10%). As changes are smaller in magnitude, their effect on the system is also smaller; this leads to safer optimizations, as the optimizer can better track changes in the core metrics. As a side effect, it might take more time for a live optimization to fully optimize a configuration when compared to an offline study.
The safety factor defines how tight the constraints defined in the study are. As the configuration changes, some metrics might approach a limit imposed by constraints. As an example, if we set a response time threshold of 300ms, Akamas will keep track of how the response time changes due to the configuration changes and react to keep the constraint fulfilled. The safety factor influences how quickly Akamas reacts to approaching constraints.
You can read more on safety policies in the related documentation section.
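The role of the exploration factor can be illustrated with a short Python sketch that limits how far a proposed value may move from the current one. The function `limit_step` and the numbers used are hypothetical; how Akamas actually applies the exploration factor is described in the safety policies documentation.

```python
def limit_step(current, proposed, exploration_factor):
    """Clamp `proposed` so that it differs from `current` by at most
    `exploration_factor` (a fraction, e.g. 0.10 for 10%)."""
    max_delta = abs(current) * exploration_factor
    low, high = current - max_delta, current + max_delta
    return min(max(proposed, low), high)


# Current CPU limit of 2000 millicores, exploration factor of 10%.
# A proposal to halve the limit is clamped to a 10% reduction.
next_cpu_limit = limit_step(2000.0, 1000.0, exploration_factor=0.10)

# A proposal already within 10% of the current value is left untouched.
small_change = limit_step(2000.0, 2100.0, exploration_factor=0.10)
```

Reaching a value far from the starting point therefore takes several iterations, which is why a live optimization may converge more slowly than an offline study.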
A key aspect of live optimization studies is the fact that the incoming workload of the application is not generated by a test script but by real users. This means that, after deploying a new configuration, the incoming workload might differ from the one used to evaluate the previous configuration. Nevertheless, the Akamas AI algorithm is capable of taking into account the differences in the incoming workload and fairly evaluating different configurations even if they are applied in different scenarios. As an example, the traffic of web applications exposed to the general public is usually different between workdays and weekends, or between working hours and nights.
To instruct Akamas to take into account changes that are not controlled by the deployment process you just need to specify the workloadsSelection parameter in the optimization study.
The workload selection should contain a list of metrics that are independent of the configuration and represent external factors that affect the performance of the configuration in terms of goals or constraints. Most of the time the application throughput is a good metric to use as a workload metric.
When one or more workload metrics are specified Akamas will take into account the differences in the workload and build clusters of similar workloads to identify repetitive working conditions for the application. It will then use this information to contextualize the evaluation of each configuration and provide a recommended configuration that fulfills the defined constraints on all the workload conditions seen by the optimization process.
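The idea of evaluating configurations within comparable workload conditions can be sketched in Python. In this illustration, fixed throughput thresholds stand in for the clustering that Akamas performs; the thresholds, field names, and observations are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def workload_context(throughput_rps, boundaries=(100, 500)):
    """Map a throughput value to a coarse workload context label.
    Fixed thresholds are a stand-in for real workload clustering."""
    if throughput_rps < boundaries[0]:
        return "low"
    if throughput_rps < boundaries[1]:
        return "medium"
    return "high"


def response_time_by_context(observations):
    """Average response time per (configuration, workload context) pair."""
    grouped = defaultdict(list)
    for obs in observations:
        key = (obs["config"], workload_context(obs["throughput"]))
        grouped[key].append(obs["response_time_ms"])
    return {key: mean(values) for key, values in grouped.items()}


# Hypothetical observations of two configurations under varying traffic.
observations = [
    {"config": "A", "throughput": 50, "response_time_ms": 10.0},
    {"config": "A", "throughput": 800, "response_time_ms": 40.0},
    {"config": "B", "throughput": 60, "response_time_ms": 8.0},
    {"config": "B", "throughput": 900, "response_time_ms": 30.0},
    {"config": "B", "throughput": 70, "response_time_ms": 12.0},
]

summary = response_time_by_context(observations)
```

Comparing configuration A under high traffic with configuration B under low traffic would be misleading; grouping by context means each configuration is only compared with others observed under similar load.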
You can read more on this parameter on the reference workload selection page.
Live optimizations are separated from offline optimization studies and are available in the second entry on the left menu.
Live optimizations usually run for a longer period than offline optimizations, and their effect on the goal and the constraints is more gradual. For this reason, Akamas offers a specific UI that allows users to evaluate the progress of live optimizations and compare the different configurations applied by looking at the evolution of core metrics.
Offline optimization studies are optimization studies where the workload is simulated by leveraging a load-testing tool.
Offline optimization studies are typically used to optimize systems in pre-production environments, with respect to planned and what-if scenarios that cannot be directly run in production. Scenarios include new application releases, planned technology changes (e.g. new JVM or DB), cloud migration or new provider, expected workload growth, and resilience under failure scenarios (from chaos engineering).
The following figure represents the iterative process associated with offline optimizations:
The following 5 phases can be identified for each iteration (also known as experiment):
Apply configuration: Akamas applies the parameter configuration (one or more parameters) to the target system by leveraging a set of workflow operators
Apply workload: Akamas triggers a workload on the target system by also leveraging a set of workflow operators
Collect KPIs: Akamas collects the metrics related to the target system - only those metrics that are specified by each telemetry instance defined in the system
Score vs goal: Akamas scores the applied parameter configuration against the defined goal and constraints - the score is the value of the goal function
Recommend Conf: Akamas AI engine identifies the configuration for the next iteration until a termination condition for the study is met (e.g. number of experiments).
Thanks to its patented AI (reinforcement learning) algorithms, Akamas can find the optimal configuration without having to explore all the possible configurations.
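The structure of the offline iteration can be sketched as a loop in Python. Note that this sketch uses plain random sampling as a stand-in for the Akamas AI engine, which it does not attempt to reproduce; the toy system, the candidate values, and all function names are hypothetical.

```python
import random


def run_offline_study(apply_and_measure, sample_config, score, num_experiments, seed=0):
    """Run a fixed number of experiments and keep the best-scoring one.

    Random sampling is used here only as a placeholder for a real optimizer.
    """
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(num_experiments):
        config = sample_config(rng)          # recommend a configuration
        metrics = apply_and_measure(config)  # apply it, run the workload, collect KPIs
        current = score(metrics)             # score against goal and constraints
        if current < best_score:
            best_config, best_score = config, current
    return best_config, best_score


def fake_system(config):
    """Toy system: more CPU costs more but lowers the response time."""
    cpu = config["cpu_limit"]
    return {"cost": cpu * 10.0, "response_time_ms": 40.0 / cpu}


def sample(rng):
    return {"cpu_limit": rng.choice([1, 2, 4, 8])}


def goal(metrics):
    """Minimize cost; configurations above 20 ms are treated as infeasible."""
    if metrics["response_time_ms"] > 20.0:
        return float("inf")
    return metrics["cost"]


best_config, best_score = run_offline_study(fake_system, sample, goal, num_experiments=30)
```

The termination condition here is simply the number of experiments, which matches the example given in the last phase above.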
For each experiment, Akamas allows multiple trials to be executed. A trial is a repetition of the same experiment to reduce the impact of noise on the result of an experiment.
Environments can be noisy for several reasons such as:
External conditions (e.g. background jobs, "noisy neighbors" in the cloud)
Measurement errors (e.g. monitoring tools are not always 100% accurate)
This approach is consistent with scientific and engineering practices, where the strategy to minimize the impact of noise is to repeat the same experiment multiple times.
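As a simple illustration of why repeating an experiment helps, the Python sketch below summarizes the scores of several trials by their mean and standard deviation. Averaging is our assumption for the sake of the example; it is not a statement about how Akamas aggregates trials.

```python
from statistics import mean, stdev


def aggregate_trials(trial_scores):
    """Summarize repeated measurements of the same configuration."""
    spread = stdev(trial_scores) if len(trial_scores) > 1 else 0.0
    return {"mean": mean(trial_scores), "stdev": spread}


# Hypothetical scores from three trials of the same experiment.
trials = [102.0, 98.0, 100.0]
summary = aggregate_trials(trials)
```

The mean is less sensitive to a single noisy run than any individual trial, and the standard deviation gives a rough indication of how noisy the environment is.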
An offline optimization study can include multiple steps.
Typically there are at least two steps:
Baseline step: a single experiment that is run by applying the already deployed configuration before the Akamas optimization is applied - the results of this experiment are used as a reference (baseline) for assessing the optimization, and as such it is a mandatory step for each study
Optimize step: a defined number of experiments used to identify the optimal configuration by leveraging Akamas AI.
Other steps are:
Bootstrap step: imported experiments from other optimization studies
Preset step: a single experiment with a defined configuration
The steps to be executed can be specified when defining an offline optimization study.
An offline optimization study is an Akamas resource that can be managed via CLI using the resource management commands.
The Akamas UI shows offline optimization studies in a specific top-level menu.
The details and results of an offline optimization study are displayed when drilling down (there are multiple tabs and sections).
Now that Akamas knows about your application, how to configure it, and how to monitor it, the final step is to define your optimization study.
The study defines the objective of the optimization activity. It contains information about what we want to achieve (e.g. reduce costs, improve latency), the parameters that can be optimized, and any SLOs that should not be breached by the optimized configuration.
Studies are divided into two main categories:
Offline Studies are, generally, executed in test environments where the workload of the application is generated using a load-testing tool. You can read more here.
Live Studies are, usually, executed in production environments. You can read more here.
The setup of both studies is similar as both are constituted by the following core elements:
Name: A unique identifier that can be used to identify different studies.
System: The name of the system that we want to optimize.
Workflow: The name of the workflow that will be used to configure the application.
Goal: The objective of the optimization (e.g. minimize cost, maximize throughput, reduce latency).
Parameter Selection: A list of parameters that will be tuned in the optimization (e.g. container memory and CPU limits, EC2 instance family).
Steps: The flow of the optimization study (e.g. assessing the baseline performance, optimizing the system, restoring the configuration).
The system and the workflow, already introduced in the previous sections, are referenced in the study definition to provide Akamas with information on how to apply the parameters (through the workflow) and retrieve the metrics (through the telemetry instances in the system) that are used to calculate the goal.
The goal defines the objective of our optimization. Specifying a goal is as simple as defining the metric we want to optimize and the direction of the optimization such as maximizing throughput or minimizing cost. If you want to optimize more complex scenarios or lack a single metric that represents your objective you can also specify a formula and define a goal such as minimizing memory and CPU utilization.
Metrics are identified within a study with the notation component.metric_name, where component is the name of a component of the system linked to the study and metric_name is the name of a metric. As an example, the CPU utilization of a container might be identified by MyContainer.cpu_util.
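The notation can be pictured as a lookup into measurements grouped by component. The Python sketch below is illustrative only: the helper `resolve_metric`, the mem_util metric, and the measured values are hypothetical, and the sum is just one example of a formula-based goal.

```python
def resolve_metric(reference, measurements):
    """Look up a 'component.metric_name' reference in a nested dictionary."""
    component, metric_name = reference.split(".", 1)
    return measurements[component][metric_name]


# Hypothetical measurements collected for one experiment.
measurements = {"MyContainer": {"cpu_util": 0.42, "mem_util": 0.60}}

cpu = resolve_metric("MyContainer.cpu_util", measurements)

# A goal expressed as a formula over several metrics, to be minimized.
goal_value = (
    resolve_metric("MyContainer.cpu_util", measurements)
    + resolve_metric("MyContainer.mem_util", measurements)
)
```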
Another important, although optional, element of the goal is the definition of constraints on other metrics of the system. In many cases, optimizing a system involves finding a tradeoff between multiple aspects, and goal constraints can be used to map SLOs and inform Akamas about other aspects of the system that we want to safeguard during the optimization (e.g. reducing the amount of CPU assigned to a container might reduce the cost of running the system but increase its response time). Constraints can be used to specify, for example, an upper limit on the response time or the memory utilization of the system. You can find more information on how to specify constraints in the reference documentation section.
The parameter selection contains the list of parameters that are subject to the optimization process. These might include several components and layers, as in the following example.
Similarly to metrics, parameters are identified with the notation component.parameter_name.
Optionally, you can also specify a range of values that can be assigned to the parameter. This is very useful when you want to evaluate a specific optimization area or want to add some context to the optimization (e.g. avoid setting a memory greater than 8GB because it's not available on the system).
The parameter selection can include any component and parameter of the system. During the optimization process, Akamas will provide values for those parameters and apply them to the system using the workflow provided in the study definition.
If the goal describes where we are heading, steps describe the road to get there. Usually, when optimizing an application, we want to assess its performance before the tuning activity to evaluate the benefits; this initial assessment is called the Baseline. Then, we want to run the optimization process for a defined number of iterations; this is called an Optimization step. Many other use cases can be achieved by providing additional steps to the study. Some of these include:
Re-using knowledge gathered by other optimization studies
Applying the baseline configuration to the test environment after the optimization has ended
Evaluating a specific configuration suggested by the user
You can find more information on the steps in the reference documentation section.
Besides the goal, parameter selection, and steps, the study can be enriched with other optional elements that can be used to better tailor it to your specific needs. These include, for example, automated windowing and parameter constraints. You can find more information on these optional elements in the specific subsections or read the entire study definition in the reference documentation section.
Recalling our application example introduced in this section, our optimization objective is to reduce the costs of running the Ad service while reaching our SLO on the response time.
As shown in the image below, you can use the study creation wizard in the UI to specify all the required information.
If you prefer to define it via YAML you can use the following file.
Save it to a file named, as an example, study.yaml and then issue the command
This study's definition contains three main parts.
The goal
In this section, we instruct Akamas that we want to minimize the cost of the Adservice and we have added a constraint to the optimization. In particular, we added a constraint requiring the value of the metric requests_response_time of the Api component to be lower than 20ms. This is an absolute constraint, as it's defined on the actual value of the metric, and can easily map an SLO. You can also express constraints like "do not make the response time increase more than 10%" by using relative constraints. You can find more info on the supported constraint types in the reference documentation section.
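The difference between the two kinds of constraints can be illustrated with a short Python sketch. The function names and the baseline value are hypothetical, and this is not the Akamas constraint syntax: an absolute constraint compares a metric with a fixed threshold, while a relative constraint compares it with the value measured for the baseline.

```python
def absolute_ok(value, upper_limit):
    """Absolute constraint: the metric must stay below a fixed threshold."""
    return value < upper_limit


def relative_ok(value, baseline_value, max_increase):
    """Relative constraint: the metric must not grow by more than
    `max_increase` (a fraction, e.g. 0.10 for 10%) over the baseline."""
    return value <= baseline_value * (1.0 + max_increase)


# Absolute: the response time must be lower than 20 ms.
meets_slo = absolute_ok(18.0, upper_limit=20.0)

# Relative: hypothetical baseline response time of 15 ms, at most 10% worse.
within_10_percent = relative_ok(16.0, baseline_value=15.0, max_increase=0.10)
too_slow = relative_ok(17.0, baseline_value=15.0, max_increase=0.10)
```

A relative constraint is convenient when no explicit SLO exists and the only requirement is not to degrade the current behavior too much.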
The parameters selection
In this section, we defined which parameters Akamas can change to achieve its goal. We decided to include parameters both from the JVM and the container layers to let Akamas tune all of them accordingly. We also specified a custom domain, for a couple of parameters, to allow Akamas to explore only values within those ranges. Note that this is an optional step as Akamas already knows about the range of possible values of many parameters. You can find more info on available parameters and guidelines to choose them in different use cases in the optimization guides section.
The steps
This final section instructs Akamas to first assess the performance and costs of the current configuration, which we will refer to as the baseline, then run 30 experiments by changing the parameters to optimize the goal.
You can now start your optimization study and wait for Akamas to find the best configuration!