When a testing environment is not available, or building representative load tests is impractical, Akamas can directly optimize production environments by running a Live Optimization study. Production environments differ from test environments in many ways; the main aspects that affect how Akamas can optimize the system in such a scenario, and that define live optimization studies, are:
Safety, in terms of application stability and performance, is critical in production environments, where SLOs may be in place.
The approval process usually differs between production and lower-level environments. In many cases, a configuration change in a production environment must be manually approved by the SRE or application team and follow a custom deployment scenario.
The workload on the application in a production environment is usually not controlled: it might change with the time of day, due to special events, or because of other external factors.
These are the main factors that make live optimization studies differ from offline optimizations.
The following figure represents the iterative process associated with live optimizations:
The following 5 phases can be identified for each iteration:
Recommend Conf: Akamas provides a recommendation for the parameter configuration, leveraging the Akamas AI and based on the observed behavior of the system.
Human Approval: the recommendation is inspected, possibly revisited, and approved by users before being applied to the system. This step is optional and can be automated.
Overall, the core process is very similar to that of offline optimization studies. The main difference is the (optional) presence of a manual configuration review and approval step.
Although the process is similar, the way recommended configurations are generated is quite different, as it is subject to safety policies such as:
The exploration factor defines the maximum magnitude of the change of a parameter from one configuration to the next (e.g. reducing the CPU limit of a container by at most 10%). Since changes are smaller in magnitude, their effect on the system is also smaller; this makes the optimization safer, as it can better track changes in the core metrics. As a side effect, a live optimization might take more time to fully optimize a configuration than an offline study.
The safety factor defines how tight the constraints defined in the study are. As the configuration changes, some metrics might approach a limit imposed by a constraint. For example, if we set a response time threshold of 300ms, Akamas keeps track of how the response time changes as a result of configuration changes and reacts to keep the constraint fulfilled. The safety factor influences how quickly Akamas reacts to approaching constraints.
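Both policies can be tuned in the study definition. Here is a minimal sketch of the relevant fragment; the field names below are assumptions based on common Akamas study syntax, so check the reference documentation for the exact schema:

```yaml
# Fragment of a live optimization study (field names are assumptions)
optimizerOptions:
  explorationFactor: 0.05   # change each parameter by at most 5% per step
  safetyFactor: 0.5         # how aggressively to back off as metrics approach constraint limits
```

Lower values of the exploration factor produce smaller, safer steps at the cost of slower convergence.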
A key aspect of live optimization studies is that the incoming workload of the application is not generated by a test script but by real users. This means that, after a new configuration is deployed, the incoming workload might differ from the one used to evaluate the previous configuration. For example, the traffic of web applications exposed to the general public usually differs between workdays and weekends, or between working hours and nights. Nevertheless, the Akamas AI algorithm can take these differences in the incoming workload into account and fairly evaluate different configurations even when they are applied under different conditions.
To instruct Akamas to take into account workload changes that are not controlled by the deployment process, you just need to specify the `workloadsSelection` parameter in the optimization study.
The workload selection should contain a list of metrics that are independent of the configuration and represent external factors that affect the performance of the configuration in terms of goals or constraints. Most of the time the application throughput is a good metric to use as a workload metric.
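A minimal sketch of such a selection, assuming the application throughput is exposed as a `transactions_throughput` metric on a hypothetical `webapp` component (both names are illustrative; check the reference documentation for the exact syntax):

```yaml
# Fragment of a live optimization study (metric and component names are illustrative)
workloadsSelection:
  - name: throughput
    metric: webapp.transactions_throughput
```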
When one or more workload metrics are specified Akamas will take into account the differences in the workload and build clusters of similar workloads to identify repetitive working conditions for the application. It will then use this information to contextualize the evaluation of each configuration and provide a recommended configuration that fulfills the defined constraints on all the workload conditions seen by the optimization process.
Live optimizations are separated from offline optimization studies and are available in the second entry on the left menu.
Live optimizations usually run for a longer period than offline optimizations, and their effect on the goal and the constraints is more gradual. For this reason, Akamas offers a specific UI that allows users to evaluate the progress of live optimizations and compare the different configurations applied by looking at the evolution of core metrics.
A critical aspect of evaluating the performance of an application is making sure that the data we use is accurate. It is quite common for IT systems to experience transient periods of instability; these might occur in many situations, such as caches filling up, runtime compilation activities, horizontal scaling, and more.
A common practice, in performance engineering, is to exclude from the analysis the initial and final part of a performance test to consider only the time when the system is in full operation. Akamas can automatically identify a subset of the whole data to evaluate scores and constraints.
Looking at the example below, from the Online Boutique application, we see that the response time has an initial spike to about 7ms and then stabilizes below 1ms; also the CPU utilization shows a similar pattern.
This is quite common, for example, in Java-based systems, as activities like heap resizing and just-in-time compilation take place in the first minutes of operation. In this case, Akamas considered only the gray area in the evaluation of the experiment, effectively avoiding the impact of the initial spike.
This behavior can be configured in the study by specifying a section called windowing. Two windowing policies allow you to properly configure Akamas in different scenarios.
The simplest policy is called trim and allows users to specify how much time should be excluded from the evaluation from the start and the end of the experiment. It is also possible to apply the trim policy to a specific task of the workflow. This policy can be easily used when, for example, the time required to deploy the application might change. You can read more on this policy in the reference documentation section.
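The trim policy can be sketched as the following windowing section; the exact field names and duration format are assumptions, so refer to the reference documentation:

```yaml
# Fragment of a study definition (field names and duration format are assumptions)
windowing:
  type: trim
  trim: [2m, 30s]          # discard the first 2 minutes and the last 30 seconds
  task: run_load_test      # optional: trim relative to a specific workflow task
```

Anchoring the trim to a workflow task is useful when, as mentioned above, the time required to deploy the application varies between experiments.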
In other contexts, discarding the initial warmup period is not enough. For these scenarios, Akamas supports a more advanced policy, called stability. This policy is particularly useful for stress tests, where the objective is to make the system sustain as much load as possible before becoming unstable, as it allows users to express constraints on the stability of the system. You can read more on this policy in the reference documentation section.
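As a rough illustration, a stability policy selects the evaluation window by looking for a period where a chosen metric stays stable. The sketch below is only indicative: the field names, metric reference, and thresholds are all assumptions, and the reference documentation should be consulted for the real schema:

```yaml
# Indicative fragment only (all field names and values are assumptions)
windowing:
  type: stability
  stability:
    metric: webapp.transactions_throughput   # metric whose stability defines the window
    width: 5m                                # minimum duration of the stable window
    maxStdDev: 5                             # maximum allowed deviation within the window
```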
The windowing section in the study definition is optional and the default policy considers all the available data to evaluate the performance of the experiment.
One of the key elements that define an optimization study is the set of parameters. We have already seen in the study section how to define the set of optimized parameters; here we dig deeper into this topic.
Akamas supports four types of parameters:
Integer parameters are those that can only assume an integer value (e.g. the number of cores on a VM instance).
Real parameters can assume real values (e.g. 0.2) and are mostly used when dealing with percentages.
Categorical parameters represent elements that do not have a strict ordering, such as GC types (e.g. Parallel, G1, Serial) or booleans.
Ordinal parameters are similar to categorical ones, as they also take a set of literal values, but those values are ordered. An example is the VM instance size (e.g. small, medium, large, xlarge).
You can read more on parameters and how they are managed in the reference documentation section.
Most of the time you do not need to define parameters yourself, as this information is already included in the Optimization Packs.
When creating new optimization studies you should first select a set of parameters to include in the optimization process. The set might depend on many factors such as:
The potential impact of a parameter on the defined goal (e.g. if my goal is to reduce the cost of running an application it might be a good idea to include parameters related to resource usage).
The layers selected for the optimization. Optimizing multiple layers at the same time might bring more benefits as the configurations of both layers are aligned.
Akamas' ability to change those parameters (e.g. if the deployment process does not support setting some parameters because, for example, they are managed by an external group, you should avoid adding them).
Besides defining the set of parameters users can also select the domain for the optimization and add a set of constraints.
Optimization packs already include information on the possible values for a parameter, but in some situations it is necessary to narrow them down. As an example, the parameter that defines the amount of CPU that a container can use (`cpu_limit`) might vary a lot depending on the underlying cluster and the application. If the cluster that hosts the application only contains nodes with up to 10 CPUs, it is worth limiting the domain of this parameter in the optimization study to that value, to avoid failures when deploying the application and to speed up the optimization process. If you forget to set this domain restriction, Akamas will learn it by itself, but it will need to try to deploy a container with a higher CPU limit to find out that that's not possible.
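Such a restriction can be sketched in the parameters selection of the study; the component name and units below are assumptions (here CPU is expressed in millicores), so check the optimization pack documentation for the actual parameter definition:

```yaml
# Fragment of a study definition (component name and units are assumptions)
parametersSelection:
  - name: container.cpu_limit
    domain: [100, 10000]   # millicores; capped at 10 CPUs, the largest node in the cluster
```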
In many situations, parameters have dependencies on each other. As an example, suppose you want to optimize at the same time the size of a container and the Java runtime that executes the application inside it. Both layers have parameters that affect how much memory can be used: for the container layer this parameter is called `memory_limit`, and for the JVM it is called `jvm_heap_size`. Configurations with a `jvm_heap_size` value higher than the `memory_limit` might lead to out-of-memory errors.
You can define this relationship by specifying a constraint as in the example below:
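A minimal sketch of such a constraint, assuming the two layers are modeled as `container` and `jvm` components and that memory is expressed in megabytes (the section name, component names, and formula syntax are assumptions; check the reference documentation for the exact schema):

```yaml
# Fragment of a study definition (names and formula syntax are assumptions)
parameterConstraints:
  - name: heap_within_container
    formula: jvm.jvm_heap_size + 50 <= container.memory_limit
```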
These constraints instruct Akamas to avoid generating configurations that bring the `jvm_heap_size` parameter close to the `memory_limit`, leaving a gap of 50MB.
Constraints usually depend on the set of parameters chosen for the optimization. You can find more information about common constraints for the supported technologies in the documentation of the related optimization pack or the optimization guides.