The third step in optimizing a new application is to create a workflow to instruct Akamas on the actions required to apply a configuration to the target application.
A workflow defines the actions that must be executed to evaluate the performance of a given configuration. These actions usually depend on the application architecture, technology stack, and deployment practices, which might vary between environments and organizations (e.g. deploying a microservice application in a staging environment on Kubernetes and performing a load test might be very different from applying an update to a monolith running in production).
Akamas provides several general-purpose and specialized workflow operators that allow users to perform common actions, such as running a command on a Linux instance via SSH, as well as to integrate enterprise tools such as LoadRunner to run performance tests or Spark to launch Big Data analyses. More information and usage examples are on the Workflow Operators reference page.
If you are using GitOps practices and deployment pipelines, you are probably already familiar with most of the elements used in Akamas workflows. Workflows can also trigger existing pipelines and re-use all the automation already in place.
Workflows are not tightly coupled to a study and can be re-used across studies and systems so you can change the optimization scope and target without the need to re-create a specific workflow.
The structure of the workflow heavily depends on deployment practices and the kind of optimization. In our example, we are dealing with a microservice application deployed in a test environment which is tested by injecting some load using Locust, a popular open-source performance testing tool.
The workflow that we will create to allow Akamas to evaluate the configurations comprises the following actions:
Create a deployment file from a template
Apply the file via the kubectl command
Wait for the deployment to be ready
Start the load test via the Locust APIs
Even if the integrations of this workflow are specific to the technology used by our test application (e.g. using the kubectl CLI to deploy the application), the general structure of the workflow could fit most applications subject to offline optimization in a test environment.
You can find more workflow examples for different use cases in the Optimization Guides section and references to technology-specific operators (e.g. LoadRunner, Spark) on the Workflow Operators reference page.
Here is the YAML definition of the workflow described above.
In this workflow, we used two operators: the FileConfigurator operator, which creates a configuration file from a template by inserting the configuration values decided by Akamas, and the Executor operator, which runs a command via SSH on a remote instance (named mgmserver in this case).
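To make the role of the FileConfigurator operator more concrete, the following Python sketch mimics the idea of filling a template with the values chosen for a configuration. It is not the Akamas implementation: the `${component.parameter}` placeholder syntax, the `render_template` helper, and the sample values are assumptions made for illustration only.

```python
import re

# Hypothetical placeholder syntax used by this sketch: ${component.parameter}
PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_]+)\.([A-Za-z0-9_]+)\}")


def render_template(template: str, configuration: dict) -> str:
    """Replace every ${component.parameter} token with its configured value."""

    def substitute(match: re.Match) -> str:
        component, parameter = match.group(1), match.group(2)
        try:
            return str(configuration[component][parameter])
        except KeyError:
            raise KeyError(f"no value for {component}.{parameter}") from None

    return PLACEHOLDER.sub(substitute, template)


# Illustrative template fragment and configuration (values are made up).
template = (
    "resources:\n"
    "  limits:\n"
    "    cpu: ${adservice.cpu_limit}m\n"
    "    memory: ${adservice.memory_limit}Mi\n"
)
configuration = {"adservice": {"cpu_limit": 500, "memory_limit": 1024}}

rendered = render_template(template, configuration)
print(rendered)
```

In a workflow of this kind, the rendered file would then be handed to the next task, which applies it to the cluster.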
Save it to a file named, as an example, workflow.yaml and then issue the creation command:
Here is what the workflow looks like in the UI:
In cases where a testing environment is not available or it is hard to build representative load tests, Akamas can directly optimize production environments by running a Live Optimization study. Production environments differ from test environments in many ways. The main aspects that affect how Akamas can optimize the system in such a scenario, and that define live optimization studies, are:
Safety, in terms of application stability and performance, is critical in production environments where SLOs might be in place.
The approval process is usually different between production and lower-level environments. In many cases, a configuration change in a production environment must be manually approved by the SRE or Application team and follow a custom deployment scenario.
The workload on the application in a production environment is usually not controlled: it might change with the time of day, or due to special events or external factors.
These are the main factors that make live optimization studies differ from offline optimizations.
The following figure represents the iterative process associated with live optimizations:
The following 5 phases can be identified for each iteration:
Collect KPIs: Akamas collects the metrics of the system required to observe its behavior under the current parameter configuration by leveraging the associated telemetry provider - here Akamas is also observing and categorizing the different workload contexts that are used to recommend configurations that are appropriate for each specific workload context
Score vs Goal: Akamas scores the applied parameter configuration under the specific workload context against the defined goal and constraints
Recommend Conf: Akamas provides a recommendation for parameter configuration based on the observed behavior and leveraging the Akamas AI
Human Approval: the recommendation is inspected, possibly revisited, and approved by users before being applied to the system. This step is optional and can be automated.
Apply Conf: Akamas applies the recommended configuration by leveraging the defined workflow.
Overall the core process is very similar to the one of offline optimization studies. The main difference is the (optional) presence of a manual configuration review and approval step.
Even if the process is similar, the way recommended configurations are generated is quite different as it's subject to some safety policies such as:
The exploration factor defines the maximum magnitude of the change of a parameter from one configuration to the next (e.g. reducing the CPU limit of a container by at most 10%). As changes are smaller in magnitude, their effect on the system is also smaller. This leads to safer optimizations, as the optimization can better track changes in the core metrics. As a side effect, it might take more time for a live optimization to fully optimize a configuration when compared to an offline study.
The safety factor defines how tight the constraints defined in the study are. As the configuration changes, some metrics might approach a limit imposed by constraints. As an example, if we set a response time threshold of 300 ms, Akamas will keep track of how the response time changes due to the configuration changes and react to keep the constraint fulfilled. The safety factor influences how quickly Akamas reacts to approaching constraints.
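The effect of the exploration factor can be illustrated with a short Python sketch. It only shows the idea of bounding the relative change between two consecutive configurations; the `limit_change` helper and its arguments are hypothetical and do not describe how Akamas computes recommendations internally.

```python
def limit_change(current: float, proposed: float, exploration_factor: float) -> float:
    """Clamp `proposed` so it differs from `current` by at most the given fraction."""
    max_delta = abs(current) * exploration_factor
    lower, upper = current - max_delta, current + max_delta
    return max(lower, min(upper, proposed))


# A CPU limit of 1000 millicores with a 10% exploration factor: a proposal of
# 600 millicores is pulled back to the closest value allowed in a single step.
next_cpu_limit = limit_change(current=1000.0, proposed=600.0, exploration_factor=0.10)
print(next_cpu_limit)
```

With such a bound, reaching a distant value takes several iterations, which is why a live optimization may need more time than an offline study.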
You can read more on safety policies in the related documentation section.
A key aspect of live optimization studies is the fact that the incoming workload of the application is not generated by a test script but by real users. This means that, after deploying a new configuration, the incoming workload might differ from the one used to evaluate the previous configuration. Nevertheless, the Akamas AI algorithm is capable of taking into account the differences in the incoming workload and fairly evaluating different configurations even if applied in different scenarios. As an example, the traffic of web applications exposed to the general public is usually different between workdays and weekends or working hours and nights.
To instruct Akamas to take into account changes that are not controlled by the deployment process, you just need to specify the workloadsSelection parameter in the optimization study.
The workload selection should contain a list of metrics that are independent of the configuration and represent external factors that affect the performance of the configuration in terms of goals or constraints. Most of the time the application throughput is a good metric to use as a workload metric.
When one or more workload metrics are specified Akamas will take into account the differences in the workload and build clusters of similar workloads to identify repetitive working conditions for the application. It will then use this information to contextualize the evaluation of each configuration and provide a recommended configuration that fulfills the defined constraints on all the workload conditions seen by the optimization process.
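The idea of contextualizing evaluations by workload can be sketched in a few lines of Python. For simplicity, this sketch replaces clustering with fixed throughput thresholds, which is not the technique described above; the observations, the threshold, and the `workload_context` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def workload_context(throughput: float, boundaries: list) -> int:
    """Return the index of the workload bucket that contains `throughput`."""
    return sum(throughput >= boundary for boundary in boundaries)


# Hypothetical observations: (configuration id, throughput in req/s, response time in ms).
observations = [
    ("baseline", 120, 14.0),
    ("baseline", 480, 19.0),
    ("conf-1", 110, 12.0),
    ("conf-1", 510, 18.0),
]
boundaries = [300]  # below 300 req/s -> context 0 (low), otherwise context 1 (high)

response_times = defaultdict(list)
for conf, throughput, response_time in observations:
    key = (conf, workload_context(throughput, boundaries))
    response_times[key].append(response_time)

# Configurations are compared only within the same workload context.
summary = {key: mean(values) for key, values in response_times.items()}
for (conf, context), value in sorted(summary.items()):
    print(f"{conf} context={context} mean_response_time={value:.1f} ms")
```

Comparing configurations only within the same context avoids penalizing a configuration just because it happened to be evaluated under heavier traffic.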
You can read more on this parameter on the reference workload selection page.
Live optimizations are separated from offline optimization studies and are available in the second entry on the left menu.
Live optimizations usually run for a longer period than offline optimizations, and their effect on the goal and the constraints is more gradual. For this reason, Akamas offers a specific UI that allows users to evaluate the progress of live optimizations and compare the different configurations applied by looking at the evolution of core metrics.
Now that Akamas knows about your application, how to configure it, and how to monitor it, the final step is to define your optimization study.
The study defines the objective of the optimization activity. It contains information about what we want to achieve (e.g. reduce costs, improve latency), the parameters that can be optimized, and any SLOs that should not be breached by the optimized configuration.
Studies are divided into two main categories:
Offline Studies are, generally, executed in test environments where the workload of the application is generated using a load-testing tool. You can read more here.
Live Studies are, usually, executed in production environments. You can read more here.
The setup of both studies is similar as both are constituted by the following core elements:
Name: A unique identifier that can be used to identify different studies.
System: The name of the system that we want to optimize.
Workflow: The name of the workflow that will be used to configure the application.
Goal: The objective of the optimization (e.g. minimize cost, maximize throughput, reduce latency).
Parameter Selection: A list of parameters that will be tuned in the optimization (e.g. container memory and CPU limits, EC2 instance family).
Steps: The flow of the optimization study (e.g. assessing the baseline performance, optimizing the system, restoring the configuration).
The system and the workflow, already introduced in the previous sections, are referenced in the study definition to provide Akamas with information on how to apply the parameters (through the workflow) and retrieve the metrics (through the telemetry instances in the system) that are used to calculate the goal.
The goal defines the objective of our optimization. Specifying a goal is as simple as defining the metric we want to optimize and the direction of the optimization such as maximizing throughput or minimizing cost. If you want to optimize more complex scenarios or lack a single metric that represents your objective you can also specify a formula and define a goal such as minimizing memory and CPU utilization.
Metrics are identified within a study with the notation component.metric_name, where component is the name of a component of the system linked to the study and metric_name is the name of a metric. As an example, the CPU utilization of a container might be identified by MyContainer.cpu_util.
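As a small illustration of this notation, the Python sketch below splits a reference into its component and metric parts and looks the value up in a set of measurements. The `resolve_metric` helper and the nested dictionary layout are assumptions made for the example, not part of Akamas.

```python
def resolve_metric(reference: str, measurements: dict) -> float:
    """Look up a `component.metric_name` reference in nested measurements."""
    component, separator, metric = reference.partition(".")
    if not (component and separator and metric):
        raise ValueError(f"expected component.metric_name, got {reference!r}")
    return measurements[component][metric]


# Hypothetical measurements collected for one component.
measurements = {"MyContainer": {"cpu_util": 0.42}}
cpu_util = resolve_metric("MyContainer.cpu_util", measurements)
print(cpu_util)
```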
Another important, although optional, element of the goal is the definition of constraints on other metrics of the system. In many cases, optimizing a system involves finding a tradeoff between multiple aspects, and goal constraints can be used to map SLOs and inform Akamas about other aspects of our system that we want to safeguard during the optimization (e.g. reducing the amount of CPU assigned to a container might reduce the cost of running the system but increase its response time). Constraints can be used to specify, as an example, an upper limit to the response time or the memory utilization of the system. You can find more information on how to specify constraints in the reference documentation section.
The parameter selection contains the list of parameters that are subject to the optimization process. These might include several components and layers, as in the following example.
Similarly to metrics, parameters are identified with the notation component.parameter_name.
Optionally, you can also specify a range of values that can be assigned to the parameter. This is very useful when you want to evaluate a specific optimization area or want to add some context to the optimization (e.g. avoid setting a memory greater than 8GB because it's not available on the system).
The parameter selection can include any component and parameter of the system. During the optimization process, Akamas will provide values for those parameters and apply them to the system using the workflow provided in the study definition.
If the goal describes where we are heading, steps describe the road to get there. Usually, when optimizing an application, we want to assess its performance before the tuning activity to evaluate the benefits; this initial assessment is called the Baseline. Then, we want to run the optimization process for a definite number of iterations; this is called an Optimization step. Many other use cases can be achieved by providing additional steps to the study. Some of these include:
Re-using knowledge gathered by other optimization studies
Applying the baseline configuration to the test environment after the optimization has ended
Evaluating a specific configuration suggested by the user
You can find more information on the steps in the reference documentation section.
Besides the goal, parameter selection, and steps, the study can be enriched with other, optional, elements that can be used to better tailor it to your specific needs. These include, as an example, automated windowing and parameter constraints. You can find more information on these optional elements in the specific subsections or read the entire study definition in the reference documentation section.
Recalling our application example introduced in this section, our optimization objective is to reduce the costs of running the Ad service while reaching our SLO on the response time.
As shown in the image below, you can use the study creation wizard in the UI to specify all the required information.
If you prefer to define it via YAML you can use the following file.
Save it to a file named, as an example, study.yaml and then issue the command.
This study's definition contains three main parts.
The goal
In this section, we instruct Akamas that we want to minimize the cost of the Adservice, and we add a constraint to the optimization. In particular, we added a constraint requiring the metric requests_response_time of the Api component to be lower than 20 ms. This is an absolute constraint, as it is defined on the actual value of the metric and can easily map an SLO. You can also express constraints like "do not make the response time increase more than 10%" by using relative constraints. You can find more info on the supported constraint types in the reference documentation section.
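The difference between absolute and relative constraints can be sketched in Python as follows. This is only an illustration of the two checks: the helper names and the sample response times are hypothetical, and the actual constraint syntax is the one described in the reference documentation.

```python
def absolute_ok(value: float, threshold: float) -> bool:
    """Absolute constraint: the metric value itself must stay below a threshold."""
    return value < threshold


def relative_ok(value: float, baseline: float, max_increase: float) -> bool:
    """Relative constraint: the metric may grow by at most `max_increase` over the baseline."""
    return value <= baseline * (1.0 + max_increase)


# Hypothetical response times (ms) for the baseline and a candidate configuration.
baseline_ms = 15.0
candidate_ms = 18.0

meets_slo = absolute_ok(candidate_ms, threshold=20.0)
within_ten_percent = relative_ok(candidate_ms, baseline_ms, max_increase=0.10)
print(meets_slo, within_ten_percent)
```

Here the candidate satisfies the 20 ms absolute constraint but would violate a "no more than 10% worse than the baseline" relative constraint, which shows that the two kinds of constraints protect different things.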
The parameters selection
In this section, we defined which parameters Akamas can change to achieve its goal. We decided to include parameters both from the JVM and the container layers to let Akamas tune all of them accordingly. We also specified a custom domain, for a couple of parameters, to allow Akamas to explore only values within those ranges. Note that this is an optional step as Akamas already knows about the range of possible values of many parameters. You can find more info on available parameters and guidelines to choose them in different use cases in the optimization guides section.
The steps
This final section instructs Akamas to first assess the performance and costs of the current configuration, which we will refer to as the baseline, then run 30 experiments by changing the parameters to optimize the goal.
You can now start your optimization study and wait for Akamas to find the best configuration!
Offline optimization studies are optimization studies where the workload is simulated by leveraging a load-testing tool.
Offline optimization studies are typically used to optimize systems in pre-production environments, with respect to planned and what-if scenarios that cannot be directly run in production. Scenarios include new application releases, planned technology changes (e.g. new JVM or DB), cloud migration or new provider, expected workload growth, and resilience under failure scenarios (from chaos engineering).
The following figure represents the iterative process associated with offline optimizations:
The following 5 phases can be identified for each iteration (also known as experiment):
Apply configuration: Akamas applies the parameter configuration (one or more parameters) to the target system by leveraging a set of workflow operators
Apply workload: Akamas triggers a workload on the target system by also leveraging a set of workflow operators
Collect KPIs: Akamas collects the metrics related to the target system - only those metrics that are specified by each telemetry instance defined in the system
Score vs goal: Akamas scores the applied parameter configuration against the defined goal and constraints - the score is the value of the goal function
Recommend Conf: Akamas AI engine identifies the configuration for the next iteration until a termination condition for the study is met (e.g. number of experiments).
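The five phases above can be sketched as a simple loop in Python. This is a toy illustration under stated assumptions: random search stands in for the Akamas AI engine, `run_experiment` replaces the real workflow and telemetry with a made-up cost and response time model, and all names and numbers are hypothetical.

```python
import random


def run_experiment(cpu_limit: int) -> dict:
    """Stand-in for applying a configuration, applying the workload, and collecting KPIs."""
    # Toy model: less CPU is cheaper but slower.
    return {"cost": cpu_limit * 0.01, "response_time_ms": 20000.0 / cpu_limit}


def score(metrics: dict, max_response_time_ms: float = 20.0):
    """Goal: minimize cost. Configurations violating the constraint get no score."""
    if metrics["response_time_ms"] >= max_response_time_ms:
        return None
    return metrics["cost"]


rng = random.Random(42)
best_conf, best_score = None, None
for _ in range(30):  # termination condition: a fixed number of experiments
    cpu_limit = rng.randrange(500, 4001, 100)  # random stand-in for the recommendation
    result = score(run_experiment(cpu_limit))
    if result is not None and (best_score is None or result < best_score):
        best_conf, best_score = cpu_limit, result

print(best_conf, best_score)
```

A real optimizer uses the scores of past experiments to choose the next configuration instead of sampling blindly, which is what allows it to converge without exploring every possible configuration.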
Thanks to its patented AI (reinforcement learning) algorithms, Akamas can find the optimal configuration without having to explore all the possible configurations.
For each experiment, Akamas allows multiple trials to be executed. A trial is a repetition of the same experiment to reduce the impact of noise on the result of an experiment.
Environments can be noisy for several reasons such as:
External conditions (e.g. background jobs, "noisy neighbors" in the cloud)
Measurement errors (e.g. monitoring tools not always 100% accurate)
This approach is consistent with scientific and engineering practices, where the strategy to minimize the impact of noise is to repeat the same experiment multiple times.
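As a minimal illustration of why repeating an experiment helps, the Python sketch below aggregates the results of three trials. Using the mean as the aggregate is an assumption of this example, and the measured values are made up.

```python
from statistics import mean, stdev

# Hypothetical response times (ms) measured in three trials of the same experiment.
trial_response_times_ms = [0.92, 1.10, 0.97]

experiment_value = mean(trial_response_times_ms)  # aggregate used to judge the experiment
spread = stdev(trial_response_times_ms)           # how noisy the environment was

print(f"mean={experiment_value:.3f} ms, stdev={spread:.3f} ms")
```

A large spread compared to the differences between configurations suggests that more trials per experiment are needed.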
An offline optimization study can include multiple steps.
Typically there are at least two steps:
Baseline step: a single experiment that is run by applying the already deployed configuration before the Akamas optimization is applied - the results of this experiment are used as a reference (baseline) for assessing the optimization and as such is a mandatory step for each study
Optimize step: a defined number of experiments used to identify the optimal configuration by leveraging Akamas AI.
Other steps are:
Bootstrap step: imported experiments from other optimization studies
Preset step: a single experiment with a defined configuration
The steps to be executed can be specified when defining an offline optimization study.
An offline optimization study is an Akamas resource that can be managed via CLI using the resource management commands.
The Akamas UI shows offline optimization studies in a specific top-level menu.
The details and results of an offline optimization study are displayed when drilling down (there are multiple tabs and sections).
After modeling the system and its components, the following step is to set up the telemetry. Telemetry is essential to provide Akamas with enough data to evaluate a configuration both in terms of goal (e.g. reducing the cost) and constraints (e.g. meeting SLOs).
Akamas can gather metrics from many data sources, from industry-standard observability platforms (e.g. Prometheus or Dynatrace) to simple CSV files. This is done via telemetry providers that contain all the logic and information required to correctly extract the metrics and map them to the components of your system. You can take a look at available telemetry providers in the documentation reference.
To instruct Akamas about the location of the data sources and how to access them, you can create a telemetry instance for your system. A telemetry instance comprises the following properties:
Name: An optional unique name within the system to quickly identify it.
Provider: The name of the telemetry provider that will be used to gather metrics.
Config: Additional configuration options that depend on the provider (e.g. a URL to reach the observability tool or the location of a CSV file to import). Refer to each provider reference for more information.
A system can include multiple telemetry instances from different providers (e.g. in case you need to extract some information from Dynatrace and others from a CSV file).
Telemetry instances alone do not provide information on which metrics should be extracted from the data source and to which component they map. As briefly introduced in the system section this is the job of the component properties.
Each telemetry provider supports a unique set of properties that depends on the specific data source which allows Akamas to map each component to one or more entities in the observability tool and extract the right metrics for that particular technology.
As we introduced at the beginning of this section, we chose to use Dynatrace to monitor our application. To instruct Akamas to gather metrics from this data source you just need to create the following file.
In this file, we specified the URL and the token required to authenticate to our Dynatrace instance.
Save it to a file named, as an example, instance.yaml and then issue the command.
As described in the section above, telemetry instances are coupled to a specific system. For this reason, we had to provide the name of the system, Online Boutique, as an argument to the create command.
Here is how the telemetry instance looks in the UI.
Akamas needs to be informed that the component named Adservice used in the system maps to a specific entity in Dynatrace that represents the container running in the Kubernetes cluster.
Recalling the definition of the Adservice component in the system, we see that it contains a set of properties starting with the dynatrace keyword. These properties are used by the Dynatrace telemetry provider to map the component to the correct entity and import metrics such as CPU usage and throttling that can be used to gather information about the performance of such components.
For a complete definition of the properties available for the Dynatrace provider, as well as other providers, you can take a look at the telemetry reference documentation section.
If Dynatrace is not your observability platform of choice, take a look at the telemetry provider section where you can find many other telemetry providers for different observability tools and common integration strategies like CSV files.
Creating a system is the first step in optimizing your application.
A system is a representation of your application. It might be a complete representation of different layers, a single microservice, a batch job, or any IT system that you want to optimize.
A system can be used to fully model an application and then run multiple optimization initiatives or contain just the elements that are used for a specific optimization study.
The system is identified by a name, which in our example is "Online Boutique", and can be extended with a description to make it easily recognizable.
The core elements of a system are the components. A component represents the fundamental element of an IT system, often composed of various layers or entities. It serves as a black-box definition of an entity involved in optimization, eliminating the need for intricate details in modeling.
A component comprises the following properties:
Name: A distinct identifier within the context of the system.
Description: A clarification of the component's purpose or function.
Component type: An identification of the underlying technology or technology stack of the component.
Properties: A set of additional properties that hold information about the component's configuration or telemetry (e.g. the IP used to reach an API or the username to connect to a server via SSH).
Akamas allows users to model their IT systems without the need to focus on technological aspects by providing several out-of-the-box component types to support system and component modeling.
Component types are platform entities (i.e. shared among all users) that contain key information about specific technologies, such as the parameters that can be tuned and the key metrics.
Akamas includes off-the-shelf component types for the most popular technologies such as Containers, Linux Hosts, AWS EC2 instances, Web Applications, Spark, and runtimes such as JVM, Node, and Go.
Component types are shipped within Optimization Packs and can be easily installed and updated as support for new technologies is released.
Recalling our example of the Online Boutique application, we decided, for the moment, to model just the elements that are included in the optimization initiative. We have also decided not to model the entire Kubernetes cluster as we are not interested in optimizing and monitoring it at this stage.
We have mapped the JVM and the Pod to the respective component types and mapped the Kubernetes service to the Web Application component type. You can read more about these component types in their documentation reference.
To model our system we used the component types coming from these optimization packs:
The following picture shows our choice of components starting from the architectural diagram.
To create this system in Akamas you can use the following YAML file.
Create the file system.yaml and run the following command.
Now you can start adding components. The following three YAML files represent the three components of our Online Boutique system.
Create the files and run the following command for each file.
Note that, since components are bound to a specific system, we also need to provide as an argument to the creation command the name of the system, Online Boutique, that we created a few moments ago.
A critical aspect, when evaluating the performance of an application, is to make sure that the data we use is accurate. It's quite common for IT systems to experience transient periods of instability; these might occur in many situations such as filling up caches, runtime compilation activities, horizontal scaling, and much more.
A common practice, in performance engineering, is to exclude from the analysis the initial and final part of a performance test to consider only the time when the system is in full operation. Akamas can automatically identify a subset of the whole data to evaluate scores and constraints.
Looking at the example below, from the Online Boutique application, we see that the response time has an initial spike to about 7ms and then stabilizes below 1ms; also the CPU utilization shows a similar pattern.
This is quite common, as an example, for Java-based systems since, in the first minutes of operation, activities like heap resizing and just-in-time compilation take place. In this case, Akamas considered only the gray area in the evaluation of the experiment, effectively avoiding the impact of the initial spike.
This behavior can be configured in the study by specifying a section called windowing. Two windowing policies allow you to properly configure Akamas in different scenarios.
The simplest policy is called trim and allows users to specify how much time should be excluded from the evaluation from the start and the end of the experiment. It is also possible to apply the trim policy to a specific task of the workflow. This policy can be easily used when, for example, the time required to deploy the application might change. You can read more on this policy in the reference documentation section.
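The effect of a trim policy can be illustrated with the following Python sketch, which discards samples at the beginning and at the end of an experiment. The `trim_window` helper, the sample format, and the durations are assumptions made for illustration; the actual policy is configured in the windowing section of the study.

```python
def trim_window(samples: list, trim_start: float, trim_end: float) -> list:
    """Keep samples at least `trim_start` seconds after the first one
    and at least `trim_end` seconds before the last one."""
    if not samples:
        return []
    start = samples[0][0] + trim_start
    end = samples[-1][0] - trim_end
    return [(t, value) for t, value in samples if start <= t <= end]


# Hypothetical (seconds since start, response time in ms) samples with a warm-up spike.
samples = [(0, 7.0), (60, 3.0), (120, 0.9), (180, 0.8), (240, 0.9), (300, 2.5)]

window = trim_window(samples, trim_start=120, trim_end=60)
print(window)
```

Only the samples inside the returned window would contribute to the score and the constraints of the experiment.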
In other contexts, discarding the initial warmup period is not enough. For these scenarios, Akamas supports a more advanced policy, called stability. This policy is also particularly useful for stress tests where our objective is to make the system sustain as much load as possible before becoming unstable as it allows users to express constraints on the stability of the system. You can read more on this policy in the reference documentation section.
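One plausible way to picture a stability-based window is to search for the first stretch of samples whose variability stays below a threshold. The Python sketch below does this with the population standard deviation; it is an assumption made for illustration and does not describe the actual algorithm or options of the stability policy.

```python
from statistics import pstdev


def first_stable_window(values: list, width: int, max_stdev: float):
    """Return (start, end) indexes of the first window of `width` samples
    whose standard deviation does not exceed `max_stdev`, or None."""
    for start in range(len(values) - width + 1):
        window = values[start:start + width]
        if pstdev(window) <= max_stdev:
            return start, start + width
    return None


# Hypothetical response times (ms) sampled at regular intervals.
response_times_ms = [7.0, 3.0, 0.9, 0.8, 0.9, 1.0]

stable = first_stable_window(response_times_ms, width=3, max_stdev=0.1)
print(stable)
```

If no window satisfies the stability criterion, the function returns None, which in a stress test would indicate that the system never reached a stable state under that load.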
The windowing section in the study definition is optional and the default policy considers all the available data to evaluate the performance of the experiment.
Even for live optimization studies, it is a good practice to analyze how the optimization is being executed with respect to the defined goal & constraints, and workloads.
This analysis may provide useful insights about the system being optimized (e.g. understanding of the system dynamics) and about the optimization study itself (e.g. how to adjust optimizer options or change constraints). Since this is more challenging for an environment that is being optimized live, a common practice is to adopt a recommendation mode before possibly switching to a fully autonomous mode.
The Akamas UI displays the results of a live optimization study in the following areas:
The Metrics section (see the following figures) displays the behavior of the metrics as configurations are recommended and applied (possibly after being reviewed and approved by users); this area supports the analysis of how the optimizer is driven by the configured safety and exploration factors.
The All Configurations section provides the list of all the recommended configurations, possibly as modified by the user, as well as the details of each applied configuration (see the following figures).
In the case of a recommendation mode, the Pending configuration section (see the following figure) shows the configuration that is being recommended to allow users to review it (see the EDIT toggle) and approve it:
One of the key elements that define an optimization study is the parameter set. We have already seen how to define the set of optimized parameters; here we dig deeper into this topic.
Akamas supports four types of parameters:
Integer parameters are those that can only assume an integer value (e.g. the number of cores on a VM instance).
Real parameters can assume real values (e.g. 0.2) and are mostly used when dealing with percentages.
Categorical parameters map those elements that do not have a strict ordering such as GC types (e.g. Parallel, G1, Serial) or booleans.
Ordinal parameters are similar to categorical ones, as they also support a set of literal values, but their values are ordered. An example is the VM instance size (e.g. small, medium, large, xlarge).
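The four parameter types can be modeled with a couple of small Python classes, as in the sketch below. The class names, fields, and domains are hypothetical and only illustrate how the types differ; they are not the way parameters are defined in Optimization Packs.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class NumericDomain:
    """Integer or real parameter with lower and upper bounds."""
    lower: float
    upper: float
    integer: bool = False

    def contains(self, value) -> bool:
        if self.integer and (isinstance(value, bool) or not isinstance(value, int)):
            return False
        return self.lower <= value <= self.upper


@dataclass(frozen=True)
class ChoiceDomain:
    """Categorical (unordered) or ordinal (ordered) parameter with literal values."""
    values: Sequence[str]
    ordered: bool = False

    def contains(self, value) -> bool:
        return value in self.values

    def rank(self, value: str) -> int:
        if not self.ordered:
            raise TypeError("categorical values have no ordering")
        return list(self.values).index(value)


cores = NumericDomain(1, 16, integer=True)                  # integer
ratio = NumericDomain(0.0, 1.0)                             # real
gc_type = ChoiceDomain(("Parallel", "G1", "Serial"))        # categorical
size = ChoiceDomain(("small", "medium", "large", "xlarge"), ordered=True)  # ordinal

print(cores.contains(4), ratio.contains(0.2), gc_type.contains("G1"))
print(size.rank("medium") < size.rank("xlarge"))
```

The ordering is the only difference between the last two domains: an optimizer can reason about "one size larger" for an ordinal parameter, but not for a categorical one.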
You can read more on parameters and how they are managed in the related documentation.
Most of the time you should not bother with defining parameters, as this information is already defined in the Optimization Packs.
When creating new optimization studies you should first select a set of parameters to include in the optimization process. The set might depend on many factors such as:
The potential impact of a parameter on the defined goal (e.g. if my goal is to reduce the cost of running an application it might be a good idea to include parameters related to resource usage).
The layers selected for the optimization. Optimizing multiple layers at the same time might bring more benefits as the configurations of both layers are aligned.
Akamas' ability to change those parameters (e.g. if my deployment process does not support the definition of some parameters because, as an example, they are managed by an external group, I should avoid adding them).
Besides defining the set of parameters users can also select the domain for the optimization and add a set of constraints.
Optimization packs already include information on the possible values for a parameter, but in some situations it is necessary to shrink this domain. As an example, the parameter that defines the amount of CPU that a container can use (cpu_limit) might vary a lot depending on the underlying cluster and the application. If the cluster that hosts the application only contains nodes with up to 10 CPUs, it might be worth limiting the domain of this parameter for the optimization study to that value, to avoid failures when deploying the application and to speed up the optimization process. If you forget to set this domain restriction, Akamas will learn it by itself, but it will need to try to deploy a container with a higher CPU limit to find out that this is not possible.
In many situations, parameters have dependencies between each other. As an example, suppose you want to optimize at the same time the size of a container and the Java runtime that executes the application inside of it. Both layers have some parameters that affect how much memory can be used: for the container layer this parameter is called memory_limit, and for the JVM it is called jvm_heap_size. Configurations that have a jvm_heap_size value higher than the memory_limit might lead to out-of-memory errors.
You can define this relationship by specifying a constraint as in the example below:
These constraints instruct Akamas to avoid generating configurations that bring the jvm_heap_size parameter too close to the memory_limit, leaving a gap of 50 MB.
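The relationship between the two parameters can be expressed as a simple check, shown here as a Python sketch. It assumes that both values are expressed in megabytes and uses hypothetical candidate configurations; it illustrates the effect of the constraint, not the syntax used in the study definition.

```python
def heap_fits_container(jvm_heap_size: int, memory_limit: int, gap: int = 50) -> bool:
    """True when the heap leaves at least `gap` (same unit) of headroom below the limit."""
    return jvm_heap_size + gap <= memory_limit


# Hypothetical candidate configurations (values assumed to be in MB).
candidates = [
    {"jvm_heap_size": 512, "memory_limit": 1024},
    {"jvm_heap_size": 1000, "memory_limit": 1024},  # only 24 MB of headroom
    {"jvm_heap_size": 2048, "memory_limit": 1024},  # heap larger than the container
]

valid = [
    c for c in candidates
    if heap_fits_container(c["jvm_heap_size"], c["memory_limit"])
]
print(valid)
```

Filtering out such configurations up front avoids spending experiments on deployments that are likely to fail with out-of-memory errors.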
This section describes the main steps to optimize an application.
To optimize a new application on Akamas, you have to follow the four steps shown in the following picture and described in the next sections by means of a simple example.
As depicted in the picture above, to optimize a new application you should:
Create a system that models the key parts of your application (e.g. containers, runtimes, APIs) that will be involved in the optimization initiative.
Set up the integration with a monitoring tool via telemetry providers so that Akamas can gather metrics about the performance of your application.
Create a workflow that allows Akamas to configure your application (e.g. write a configuration file, relaunch a process).
Define the optimization study according to your goal and SLOs so that Akamas knows what you want to achieve.
These steps relate to how Akamas integrates with your environment and apply to both offline and live optimization studies.
In the following sections, we will use a simple yet representative web application to illustrate how to onboard a new application on Akamas. The application is called Online Boutique. It is a microservices application composed of 11 microservices that allow users to browse items, add them to the cart, and purchase them in an online store.
Suppose that we are about to deploy a major upgrade to one of the microservices, the Ad Service, that handles the advertisement logic, and we want to reduce the costs of running this service while meeting our SLO on the response time given an increasing number of users.
As shown in the diagram below, our service is built in Java, deployed as a pod in a Kubernetes cluster, and exposes an API using a service. The whole platform is monitored with Dynatrace.
You can now proceed to the first step, creating the system to model this application.
Constraints usually depend on the set of parameters chosen for the optimization. You can find more information about common constraints for the supported technologies in the related documentation.
If your technology stack or optimization need does not fit this example, take a look at the Optimization Guides section, where you can find many optimization scenarios for different use cases.