Using Akamas

This section describes how to use Akamas.

This guide introduces the optimization process and methodology with Akamas and then provides a step-by-step description of how to prepare, run and analyze Akamas optimization studies:

General optimization process
Preparing optimization studies
Running optimization studies

and also provides some technology-specific guidelines and examples on:

Guidelines for choosing optimization parameters
Guidelines for defining optimization studies

General optimization process and methodology

Akamas has been designed and implemented to effectively support organizations in implementing their own approach to optimization, in particular, thanks to its Infrastructure as Code (IaC) design, modular and reusable constructs, and delegation-of-duty features to support multiple teams.

While an optimization process can also be a one-shot exercise aiming at optimizing a specific critical application to remediate performance issues or to address a cost reduction initiative, in general, optimization is conceived as a continuous and iterative process. This process can be seen as composed of multiple optimization campaigns (each typically involving a single application) that are executed in parallel (see the following figure).

In Akamas, an optimization campaign is structured into one or more optimization studies, each representing an optimization initiative aimed at optimizing a target system with respect to defined goals and constraints.

In any given timeframe, for a specific application, there could be multiple studies being executed either in parallel or in sequence (see the following figure):

multiple live optimizations running for the critical microservices of the application; typically, a live optimization focuses on an application microservice supporting a specific business function with respect to specific optimization goals and constraints, as the optimization could be aimed, for some microservices, at improving performance while trading off cost, and for others at keeping performance within the SLOs while reducing infrastructure or cloud cost;
multiple offline optimization studies, which may correspond to the different layers of the target system that are being optimized in several stages (typically starting with the backend layer, then the middleware, and finally the front-end layer), or to several application releases with a different resource footprint (e.g. higher memory usage), or that involve technology changes in the application stack (e.g. moving from Oracle to MongoDB) or migration to a different cloud provider (or cloud managed service), or that are required to sustain a higher workload (e.g. due to a marketing campaign) or to ensure application resilience under failure scenarios (identified by chaos engineering).

The following figure illustrates the variety of scenarios in a real optimization process:

For example (with reference to the previous figure):

the optimization campaign for the microservices-based application App-1 runs an offline optimization study for the App-1-1 microservice in Q1 and the App-1-2 microservice in Q2, before running live optimizations for both these microservices in parallel starting from Q3; notice that in Q4, possibly to anticipate a workload growth and assess the required infrastructure, an offline optimization for App-1-2 (possibly the most resource-demanding microservice) is also executed;
the optimization campaign for the standalone application App-2 runs several offline optimizations in sequence: in Q1 and Q2, first separately on the frontend and backend layers of App-2 (respectively App-2-FE and App-2-BE) and then in Q3 for the entire application; in Q4, in addition to the quarterly optimization for App-2 with respect to the goal Goal-2-1 that was used in the previous optimizations, another offline optimization is also executed with respect to a different goal Goal-2-2, which could either be a refinement of the previous goal (e.g. with tighter SLOs) or reflect a completely different goal (e.g. a cost-reduction goal as opposed to a performance improvement goal);
the optimization campaign for the microservices-based application App-3 first runs a live optimization starting at some point in Q2 (for example as the application is first released) for the most critical microservice App-3-1 and then, in Q3, also for another microservice App-3-2, possibly as a refinement of the modeling of App-3 based on the observed optimization results.

Preparing optimization studies

Preparing an optimization study requires several steps, as illustrated by the following figure:

and described in the following sections:

modeling systems
modeling components
creating telemetry instances
creating automation workflow
creating optimization study

Notice that while these steps apply to both offline optimization studies and live optimization studies, some of these steps are different depending on which optimization is being prepared.

Modeling systems

The very first preparatory step is to model the system representing an application or a service that needs to be optimized (also known as the optimization target).

Modeling a system translates into identifying the components representing the key technology elements to be included in the optimization. Each component is associated with a set of tunable parameters, i.e. configurable properties that impact the performance, efficiency, or reliability of the system, and with a set of metrics, i.e. measurable properties that are used to evaluate the performance, efficiency, or reliability of the system. Typically, key system components are identified by considering which elements and their parameters need to be tuned.

The following figure shows a system corresponding to a Java-based application, where the Java Virtual Machine (JVM) and Kubernetes containers have been identified as key components.

As shown in this figure, a supported component is the "web application", representing the end user perspective of the modeled system (e.g. response time). As expected, this component type only provides measured metrics and no tunable parameters.

Akamas provides several out-of-the-box component types to support system and component modeling. Moreover, it is also possible to define new component types to model other components (see Modeling components).

The System template section of the reference guide describes the template required to define a system, while the commands for creating a system are listed on the Resource Management command page.

Best Practices

Properly modeling the application or service to be optimized by identifying the components and their parameters to tune is the first important step in the optimization process. Some best practices are described here below.

Modeling only relevant components

When defining the system and its components, it is convenient to focus only on those components that are either providing tunable parameters or key metrics (or KPIs).

Key metrics are those used to:

define the optimization goal and constraints, either as metrics that are expected to be improved by the optimization or as metrics representing constraints. For example, a typical goal is to optimize the application throughput. In this case, a Web Application component should include service metrics such as transaction throughput or transaction response time.
support the analysis of the optimization results, as metrics that are useful to measure the impact of parameter tuning on the performance, efficiency, or reliability of the system. For example, a Linux OS component could be used to assess the impact of the optimization on the system-level metrics such as CPU utilization.

Please note that the metrics used to define the optimization goal and constraints are mandatory as they are used by the Akamas AI engine to validate and score each tested configuration against the goal. Other metrics that are not related to the optimization goal and constraints can be considered optional from a pure optimization implementation perspective.

When defining the optimization study, it is always possible to select which parameters and metrics to consider, and thus which of the components modeled in the system are actually involved. Therefore, a system could be modeled with all the components that are going to be optimized at some point, even if they are not used in the current optimization study. However, the recommended approach is to model the system only with components whose parameters (and relevant metrics) are to be tuned by the current study.

Reusing systems whenever possible

Whenever possible, it is recommended to model systems and their components by considering how these could be reused for multiple optimization studies in different contexts.

For example, it might be useful to create a simple system containing only one component (e.g. the JVM) for a first optimization study. A new system might then be created to include other components (e.g. the application server) for more advanced optimization studies.

Please also note that systems (and other Akamas artifacts) can be shared among different teams by means of Akamas workspaces.

Modeling systems with horizontal scalability

A typical optimization target is a cluster, i.e. a system made of multiple instances that provide horizontal scalability (e.g. a Kubernetes deployment with several replicas). In this scenario, all the instances are supposed to be identical from both a code and a configuration perspective, so the recommended approach is to create only one component that represents a generic instance of the cluster. This way, all the instances will be tuned in exactly the same way.

In this scenario, the associated automation workflow needs to be configured to ensure that each configuration is applied to the whole cluster, by propagating the parameter configuration to all of the cluster instances, not just to a single instance represented by the modeled component whose metrics are collected and used to evaluate the overall cluster behavior under that configuration.

Notice that in order for this approach to work correctly, it is also important to verify that the cluster is correctly monitored by the telemetry providers. Depending on the telemetry technology in use, the clustered system may be presented as either a single entity, with aggregated metrics (e.g. a Kubernetes deployment with the total CPU usage of all the replica pods), or as multiple entities, each corresponding to the different instances in the cluster:

in case aggregated metrics are provided by the telemetry provider for the cluster, these metrics can be simply assigned to the component modeling the whole cluster;
in case only instance-level metrics are made available by the telemetry provider, telemetry instances need to be configured in Akamas so as to aggregate the metrics of the cluster instances (e.g. averaging CPU utilization, summing memory usage, etc.), depending on how each specific metric is expected to be used in the goal and constraints or in the study results.
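
The second case can be illustrated with a minimal sketch of the aggregation logic, written here as a standalone shell script rather than as an Akamas telemetry configuration. The pod names, the sample values and the CSV layout are hypothetical; the sketch only shows why a utilization metric is averaged while an absolute quantity such as memory usage is summed.

```shell
#!/usr/bin/env bash
# Hypothetical per-instance samples: instance,cpu_util_percent,memory_mb
samples='pod-a,40,512
pod-b,60,768
pod-c,50,640'

# CPU utilization is a ratio, so the cluster-level value is the average.
cluster_cpu=$(printf '%s\n' "$samples" |
  awk -F, '{ sum += $2 } END { printf "%.1f", sum / NR }')

# Memory usage is an absolute quantity, so the cluster-level value is the sum.
cluster_mem=$(printf '%s\n' "$samples" |
  awk -F, '{ sum += $3 } END { print sum }')

echo "cluster cpu_util=${cluster_cpu}% memory=${cluster_mem}MB"
```

Which aggregation is appropriate depends on how the metric is used in the goal, in the constraints or in the analysis of the study results.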

Modeling components

After identifying the components that are required to model a system, the following step is to model each identified key component.

Akamas provides optimization packs that correspond to specific technologies (and possibly versions) and describe all the tunable parameters and metrics of interest. The full list of Akamas optimization packs is available in the Akamas reference guide.

The reference guide describes the template required to define a system component and lists the commands for creating a system component.

While the optimization process does not necessarily require component types and optimization packs to be defined, it is recommended to leverage this construct to facilitate modularization and reuse.

This is possible as the Akamas optimization pack model is extensible: custom optimization packs can be easily created without any programming to allow Akamas optimization capabilities to be applied to virtually any technology.

Creating custom optimization packs

To create a custom optimization pack, the following fixed directory structure and several YAML manifests need to be created.

Optimization pack directory structure

Optimization pack manifest

The optimizationPack.yaml file is the manifest of the optimization pack to be created, which should always be named optimizationPack and have the following structure:

where:

Component types

The component-types directory should contain the manifests of the component types to be included in the optimization pack. No particular naming constraint is enforced on those manifests.

Metrics

The metrics directory should contain the manifests of the groups of metrics to be included in the optimization pack. No particular naming constraint is enforced on those manifests.

Parameters

The parameters directory should contain the manifests of the groups of parameters to be included in the optimization pack. No particular naming constraint is enforced on those manifests.

Telemetry providers

The telemetry-providers directory should contain the manifests of the telemetry providers to be included in the optimization pack. No particular naming constraint is enforced on those manifests.
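
Based on the descriptions above, the layout of a custom optimization pack could be scaffolded as in the following minimal sketch. The pack name my-pack, the use of a temporary directory and the placement of the manifest at the root of the pack are assumptions made for illustration; only the manifest file name and the four directory names come from the text above. The manifests themselves are left empty.

```shell
#!/usr/bin/env bash
# Hypothetical pack name and location, chosen only for this example.
pack_root="$(mktemp -d)/my-pack"

# Directories named in the sections above.
mkdir -p "$pack_root/component-types" \
         "$pack_root/metrics" \
         "$pack_root/parameters" \
         "$pack_root/telemetry-providers"

# Empty placeholder for the pack manifest (assumed to live at the pack root).
touch "$pack_root/optimizationPack.yaml"

# List the resulting layout.
(cd "$pack_root" && find . -mindepth 1 | sort)
```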

Building optimization pack descriptor

The following command needs to be executed in order to produce the final JSON descriptor:

Managing optimization packs

Whether out-of-the-box or custom, optimization packs need to be installed on an Akamas installation before being used.

Since optimization packs are global resources that are shared across all the workspaces on the same Akamas installation, an account with administrative privileges is required to manage them.

Optimization packs that are not yet installed are displayed as grayed out in the Akamas UI (this is the case for the AWS and Docker packs in the following figure).

An Akamas installation comes with the latest optimization packs already loaded in the store and is able to check the central repository for updates.

Installing

There are two ways of installing an optimization pack:

online installation - this is the general case when the optimization pack is already in the store
offline installation - this may apply to custom optimization packs available as a JSON file (refer to the Creating custom optimization pack page)

Only in the first case can an optimization pack be installed from the UI. The commands for installing an optimization pack from the command line are shown below.

Online installation

Execute the following command by specifying the name of the optimization pack that is already available in the store:

akamas install optimization-pack OPTIMIZATION_PACK_NAME

Offline installation

As an example, the following commands download the descriptor file https://akamas.s3.us-east-2.amazonaws.com/optimization-packs/Linux/1.3.0/Linux_1-3-0.json for version 1.3.0 of the Linux optimization pack and then install it:

curl -O https://akamas.s3.us-east-2.amazonaws.com/optimization-packs/Linux/1.3.0/Linux_1-3-0.json
akamas install optimization-pack Linux_1-3-0.json

In general, execute the following command to install an optimization pack by specifying the full path to its JSON descriptor file:

akamas install optimization-pack PATH_TO_JSON_DESCRIPTOR

Forcing installation

When installing an optimization pack, the following checks are executed to identify potential clashes with already existing resources:

name of the optimization pack
metrics
parameters
component types
telemetry providers

In case one of those checks is positive (i.e. a clash exists), the installation fails and a message notifies that the "force" option needs to be used to get the optimization pack installed anyway:

akamas install -f optimization-pack OPTIMIZATION_PACK_NAME

Please be aware that when forcing the installation of an optimization pack, Akamas replaces (or merges) all the conflicting resources. However, if there is at least one custom resource among them, the installation is stopped. In this case, the custom resource needs to be manually removed before proceeding.

Uninstalling

The following command uninstalls an optimization pack:

akamas uninstall --force OPTIMIZATION_PACK_NAME

Notice that this also deletes all the components built using that optimization pack.

Updating

In case a new version of an optimization pack needs to be installed from a descriptor, the procedure is the following:

uninstall the optimization pack;
remove the old version of the optimization pack descriptor file from the store container;
install the new optimization pack with the new JSON descriptor.

Creating telemetry instances

After modeling the system and its components, the next step (see the following figure) is to ensure that all the metrics that are required to define goals and constraints and to analyze the behavior of the target system can be collected from one of the data sources available in the environment, which in Akamas are called telemetry providers.

Akamas provides a number of out-of-the-box telemetry providers, including industry-standard monitoring platforms (e.g. Prometheus or Dynatrace), performance testing tools (e.g. LoadRunner or JMeter), or simple CSV files. The section Integrating Telemetry Providers lists all the out-of-the-box telemetry providers and describes how to integrate them with Akamas, while the section Telemetry metric mapping describes the mapping of the specific data source metrics to Akamas metrics.

Since several instances of a data source type might be available, the specific data source instance needs to be specified: that is, a corresponding telemetry instance needs to be defined for the modeled system and its components.

The Telemetry instance template section of the reference guide describes the template required to define a telemetry instance, while the commands for creating a telemetry instance are listed on the Resource Management command page.

Telemetry providers are shared across all the workspaces in the same Akamas installation and require an account with administrative privileges to manage them. Any number of telemetry instances (even of the same type) can be specified. For example, the following figure shows two Prometheus telemetry instances associated with the Adservice system.

Best Practices

The following sections provide guidelines on how to create telemetry instances.

Verify metrics provided by the telemetry provider

A seemingly obvious, yet fundamental, best practice when choosing a telemetry provider is to check whether the required metrics:

are supported by the original data source or can be added to it (as is the case with Prometheus);
are available and can be effectively gathered in the specific implementation;
are supported by the telemetry provider itself, or whether the provider needs to be extended (as is possible for a Prometheus telemetry provider), for example to collect custom metrics such as those made available by the application itself.

Akamas makes it possible to validate whether a telemetry setup works correctly by first executing dry runs. This is discussed in the context of the recommended practices to run optimization studies (section Running optimization studies).

Creating automation workflows

After modeling the system and its components and ensuring that appropriate telemetry instances are defined, the following step (see the following figure) is to define a workflow.

A workflow automates all the tasks to be executed in sequence (see the following figure) during the optimization study, in particular those leveraging integrations with external entities, such as telemetry providers or configuration management tools. Akamas provides a number of general-purpose and specialized workflow operators (see Workflow Operator page).

The Workflow template section of the reference guide describes the template required to define a workflow, while the commands for creating a workflow are listed on the Resource Management command page.

Since a workflow is an Akamas resource defined at the workspace level that can be used by multiple studies, a suitable workflow may already be available, or an existing one can be used as the basis for a new workflow for the specific target system and integrations, by adding or removing workflow tasks, changing the task sequence, or changing the values assigned to task parameters.

Notice that since the structure of a workflow defined for a live optimization study is very different from that of a workflow defined for an offline optimization study, each case is described on a specific page:

creating workflows for offline optimization studies
creating workflows for live optimization studies

Creating workflows for offline studies

A workflow for an offline optimization study automates all the actions required to interface with the configuration management and load testing tools (see the following figure) at each experiment or trial. Notice that metrics collection is an implicit action that does not need to be coded as part of the workflow.

More in detail, a typical workflow includes the following types of tasks:

Preparing the application, by executing all cleaning or reset actions that are required to prepare the load testing phase and ensuring that each experiment is executed under exactly the same conditions - for example, this may involve cleaning caches, uploading test data, etc
Applying the configuration, by preparing and then applying the parameter configuration under test to the target environment - this may require interfacing with configuration management tools or pushing the configuration to a repository, restarting the entire application or some of its components to ensure that the parameters are effectively applied, checking that the application is up and running after the restart before the workflow execution continues, and checking that the configuration has been correctly applied
Applying the workload, by launching a load test to assess the behavior of the system under the applied configuration and synthetic workload defined in the load testing scenarios - of course, a preliminary step is to design a load testing scenario and synthetic workload that ensures that optimized configurations resulting from the offline optimization can be applied to the target system under the real or expected workload

Failing workflows

A workflow is interrupted if any of its steps fails, and a failing workflow causes the experiment or trial to fail. This situation should be distinguished from a specific configuration not matching the optimization constraints or causing the system under test to fail to run. For example, if the configured maximum memory is too low, the application may fail to start.

When an experiment fails, the Akamas AI engine takes this information into account and thus learns that that parameter configuration was bad. This way, the AI engine automatically tries to avoid the regions of the parameter space which can lead to low scores or failures.

Best Practices

Creating effective workflows is essential to ensure that Akamas can automatically identify the optimal configuration in a reliable and efficient way. Some best practices on how to build robust workflows are described here below.

Reusing workflows as much as possible

Since Akamas workflows are first-class entities that can be used by multiple studies, it might be useful to avoid creating (and maintaining) multiple workflows and instead define workflows that can be easily reused, by factoring all differences into specific action parameters.

Of course, this general guideline should be balanced against other requirements, such as avoiding potential conflicts due to different teams modifying the same workflow for different uses and potentially impacting optimization results.

Building robust workflows

Akamas takes into account the exit code of each of the workflow tasks, and the whole workflow fails if a task exits with an error. Therefore, the best practice is to make use of exit codes in each task, to ensure that task failures can only happen in case of bad parameter configuration.

For example, it is important to always check that the application has correctly started and is up and running (after a new configuration has been applied). This can be done by:

including a workflow task that tests the application is up and running after the tasks where the configuration is applied;
making sure that this task exits with an error in case the application has not correctly started (typically after a timeout).
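
The two points above can be sketched as a small shell helper that a workflow task might run. This is only an illustration under stated assumptions: the function name wait_until_up, the flag file standing in for a real health check, and the timeout and polling values are all hypothetical and not part of Akamas.

```shell
#!/usr/bin/env bash
# wait_until_up TIMEOUT_SECONDS INTERVAL_SECONDS CHECK_CMD [ARGS...]
# Polls CHECK_CMD until it succeeds; returns 1 once TIMEOUT_SECONDS have elapsed.
wait_until_up() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      echo "application not up after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical check: a real task would probe the application instead.
ready_flag="$(mktemp -d)/ready"
app_is_up() { [ -f "$ready_flag" ]; }

if ! wait_until_up 2 1 app_is_up; then
  echo "check failed: the task would exit with an error here"
fi

touch "$ready_flag"   # simulate the application becoming ready
if wait_until_up 2 1 app_is_up; then
  echo "check passed: the workflow can continue"
fi
```

In an actual task, the script would end with a non-zero exit code when the check fails, so that the failure is visible to the workflow.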

Another example is when the underlying environment incurs issues during the optimization (e.g. a database might be mistakenly shut down by another team). As much as possible, all these environmental transient issues should be carefully avoided. Akamas also provides the ability to execute multiple task retries (default is twice, configurable) to compensate for these transient issues, provided they only last for a short time (the retry time and delay are also configurable).

Building workflows that ensure reproducible experiments

As for any other performance evaluation activity, Akamas experiments should be designed to be reproducible: if the same experiment (hence, the same parameter configuration) is executed multiple times (i.e. in multiple trials), the same performance results should be found for each trial.
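
One simple way to check reproducibility is to compare the results of the trials of the same experiment. The following sketch computes the spread of a metric across trials relative to its mean; the function name relative_spread and the throughput values are hypothetical, and the acceptable spread is a choice left to the reader rather than an Akamas setting.

```shell
#!/usr/bin/env bash
# relative_spread VALUE...
# Prints (max - min) / mean as a percentage with one decimal.
relative_spread() {
  printf '%s\n' "$@" | awk '
    NR == 1 { min = $1; max = $1 }
    { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
    END { printf "%.1f", (max - min) / (sum / NR) * 100 }'
}

# Hypothetical throughput (requests/s) measured in three trials of one experiment.
spread=$(relative_spread 980 1000 1020)
echo "spread across trials: ${spread}%"
```

A large spread for the same configuration suggests that the system state or the environment is not being managed consistently across trials.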

Therefore, it is fundamental that workflows include all the necessary tasks to realize reproducible experiments. Particular care needs to be taken to correctly manage the system state across the experiments and trials. System state can include:

Application caches
Operating system cache and buffers (e.g. Linux filesystem page cache)
Database tables that fill up during the optimization process

All experiments should always start with a clean and well-known state. If the state is not properly managed, it may happen that the performance of the system is observed to change (whether higher or lower) not because of the effect of the applied parameters, but due to other effects (e.g. warming of caches).

Best practices to consistently manage system state across experiments include:

Restoring the system state at the beginning of each experiment - this may involve restarting the application, clearing caches, restoring DB tables, etc;
Allowing for a sufficient warm-up period in the performance tests, so as to ensure that application performance has reached stability. See also the recommended best practices about properly managing warm-up periods in the following section about creating an optimization study.

Another common cause that can impact the reproducibility of experiments is an unstable infrastructure or environment. Therefore, it is important to ensure that the underlying infrastructure is stable and that no other workload that might impact the optimization results is running on it. For example, beware of scheduled system jobs (e.g. backups), automatic software updates or anti-virus systems that might not explicitly be considered as part of the environment but that may unexpectedly alter its performance behavior.

Taking into account workflow duration

When designing workflows, it is important to take into account the potential duration of their tasks. Indeed, the task duration impacts the duration of the overall optimization and might impact the ability to execute a sufficient number of experiments within the overall time interval or specific time windows allowed for the optimization study.

Typically, the longest task in a workflow is the one related to applying the workload (e.g. launching a load test or a batch job): such tasks can last for dozens of minutes if not hours. However, a workflow may also include other ancillary tasks that may contribute significantly to the overall workflow duration (e.g. checking the status to ensure that the application is up and running).

Making workflows fail fast

As general guidance, it is better to fail fast by performing quick checks executed as early as possible. For example, it is better to do a status check before launching a load test instead of possibly waiting for it to complete (maybe after 1h) just to discover that the application did not even start.
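
The ordering described above can be sketched as follows. The function run_experiment and the stand-in functions for the status check and the load test are hypothetical; in a real workflow these would be separate tasks, with the status check placed before the load test.

```shell
#!/usr/bin/env bash
# run_experiment STATUS_CHECK LOAD_TEST
# Runs the cheap status check first and only then the expensive load test.
run_experiment() {
  local status_check=$1 load_test=$2
  if ! "$status_check"; then
    echo "status check failed: skipping load test" >&2
    return 1
  fi
  "$load_test"
}

# Hypothetical stand-ins for the real tasks.
app_down()       { return 1; }
app_up()         { return 0; }
long_load_test() { echo "load test executed"; }

run_experiment app_down long_load_test || echo "failed fast"
run_experiment app_up long_load_test
```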

Performing load testing to support optimization activities

This page provides a short compendium of general performance engineering best practices to be applied in any load testing exercise. The focus is on how to ensure that realistic performance tests are designed and implemented to be successfully leveraged for optimization initiatives.

The goal of ensuring realistic performance tests boils down to two aspects:

sound test environments;
realistic workloads.

Test environments

A test or pre-production environment (TestEnv from now on) needs to represent as closely as possible the production environment (ProdEnv from now on).

The most representative test environment would be a perfect replica of the production environment from both infrastructure (hardware) and architecture perspectives. The following criteria and guidelines can help design a TestEnv that is suitable for performance testing supporting optimization initiatives.

Hardware specifications

The hardware specifications of the physical or virtual servers running in TestEnv and ProdEnv must be identical. This is because any difference in the available resources (e.g. amount of RAM) or specifications (e.g. CPU vendor and/or type) may affect both service performance and system configuration.

This general guideline can only be relaxed for servers/clusters running container(s) or container orchestration platforms (e.g. Kubernetes or OpenShift). Indeed, it is possible to safely execute most of the related optimization cases if the TestEnv guarantees enough spare/residual capacity (number of cores or amount of RAM) to allocate all the needed resources.

While for monolithic architectures this may translate into significant HW requirements, with microservices this might not be the case, for two main reasons:

microservices are typically smaller than monoliths and designed for horizontal scalability: this means that optimizing the configuration of the single instance (pod/container resources and runtime settings) becomes easier as they typically have smaller HW requirements;
approaches like Infrastructure as Code (IaC), typically used with cloud-native applications, allow for easily setting up cluster infrastructure (on-prem or on the cloud) that can mimic production environments.

Downscaled/downsized architecture

TestEnvs are typically downscaled/downsized with respect to ProdEnvs. If this is the case, then optimizations can be safely executed provided it is possible to generate a "production-like" workload on each of the nodes/elements of the architecture.

This can usually be achieved if all the architectural layers have the same scale ratio between the two environments and the generated workload is scaled accordingly. For example, if the ProdEnv has 4 nodes at the front-end layer, 4 at the backend layer, and 2 at the database layer, then a TestEnv can have 2 nodes, 2 nodes, and 1 node respectively.
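
The arithmetic behind this example can be sketched as follows. The node counts come from the example above, while the production load of 1000 requests per second is a hypothetical figure used only to show that the workload is scaled by the same ratio as the nodes.

```shell
#!/usr/bin/env bash
# Node counts per layer in ProdEnv, taken from the example above.
prod_frontend=4
prod_backend=4
prod_database=2

# Scale ratio TestEnv/ProdEnv expressed as numerator/denominator (1/2 here).
ratio_num=1
ratio_den=2

# Hypothetical production peak load, in requests per second.
prod_load_rps=1000

# Applies the scale ratio using integer arithmetic (results are truncated).
scale() { echo $(( $1 * ratio_num / ratio_den )); }

echo "TestEnv nodes: frontend=$(scale "$prod_frontend") backend=$(scale "$prod_backend") database=$(scale "$prod_database")"
echo "TestEnv target load: $(scale "$prod_load_rps") requests/s"
```

Since the division is truncated, a ratio that does not divide every layer evenly would break the assumption of a uniform scale ratio across layers.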

Load balancing among nodes

From a performance testing perspective, load balancing among multiple nodes can be ignored if it relies on an external component that ensures a uniform distribution of the load across all nodes.

On the contrary, if an application-level balancing is in place, it might be required to include at least two nodes in the testing scenario so as to take into account the impact of such a mechanism on the performance of the cluster.

External/downstream services

The TestEnv should also replicate the application ecosystem, including dependencies on external or downstream services.

External or downstream services should emulate the production behavior from both functional (e.g. response size and error rate) and performance (e.g. throughput and response times) perspectives. In case of constraints or limitations on the ability to leverage external/downstream services for testing purposes, the production behavior needs to be simulated via stubs/mock services.

In the case of microservices applications, it is also required to replicate dependencies within an application. Several approaches can be taken for this purpose, such as:

replicating interacting microservices;
disregarding dependencies with nonrelevant services (e.g. a post-processing service running on a mainframe whose messages are simply left published in a queue without being dequeued).

Test cases

The most representative performance test script would provide 100% coverage of all the possible test cases. Of course, this is very unlikely to be the case in performance testing. The following criteria and guidelines can be considered to establish the required test coverage.

Statistical relevance

The test cases included in the test script must cover at least 80% of the production workload.
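As a rough illustration of the statistical relevance criterion, the following Python sketch picks the most frequent transactions until the selected set covers the target share of production traffic. The transaction names and request counts are hypothetical, and select_test_cases is a helper introduced only for this example.

```python
def select_test_cases(request_counts, target=0.80):
    """Pick the most frequent transactions until `target` of the traffic is covered."""
    total = sum(request_counts.values())
    selected, covered = [], 0
    ranked = sorted(request_counts.items(), key=lambda kv: kv[1], reverse=True)
    for name, count in ranked:
        if covered / total >= target:
            break
        selected.append(name)
        covered += count
    return selected, covered / total

# Hypothetical production request counts per transaction
production = {
    "search": 500,
    "view_item": 300,
    "add_to_cart": 120,
    "checkout": 50,
    "profile": 30,
}
cases, coverage = select_test_cases(production)
print(cases, coverage)
```

A purely frequency-based selection like this one only addresses statistical relevance: low-volume but business-critical or technically peculiar transactions still have to be added according to the other criteria in this section.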

Business relevance

The test cases included in the test script must cover all the business-critical functionalities that are known (or expected) to represent a significant load in the production environment.

Technical relevance

The test cases included in the test script must cover all the functionalities that at the code level involve:

Large objects/data structure allocation and management
Long living objects/data structure allocation and management
Intensive CPU, data, or network utilization
"one-of-a-kind" implementations, such as connections to a data source, ad-hoc objects allocation/management, etc.

Test user paths and behavior

The virtual user paths and behavior coded in the test script must be representative of the workload generated by production users. The most representative test script would account for the production users in terms of a mix of the different user paths, associated think times, and session length perspectives.

When single-user paths cannot be easily identified, the best practice is to use the most comprehensive user journey for each of them. In general, a worst-case approach is recommended.

The task of reproducing realistic workloads is easier for microservice architectures. On the contrary, for monolithic architectures, this task could become hard as it may not be easy to observe all of the workloads, due to custom frameworks, etc. With microservices, the workload can be completely decomposed in terms of APIs/endpoints and APM tools can provide full observability of production workload traffic and performance characteristics for each single API. This guarantees that the replicated workload can reproduce the production traffic as closely as possible.

Test data

Both test script data, that is datasets used in the test script, and test environment data, that is datasets in any involved databases/datastores, have to be characterized both in terms of size and variance so as to reproduce the production performances.

Test script data

The test script data has to be characterized in order to guarantee production-like performances (e.g. cache behavior). In case this characterization is difficult, the best practice is to adopt a worst-case approach.

Test environment data

The test data must be sized and have an adequate variance to guarantee production-like performances in the interaction with databases/datastores (e.g. query response times).

Test scenarios

Most performance test tools provide the ability to easily define and modify the test scenarios on top of already defined test cases/scripts, test case-mix, and test data. This is especially useful in the Akamas context where it might be required to execute a specific test scenario, based on the specific optimization goal defined. The most common (and useful, in the Akamas context) test scenarios are described here below.

Load tests

A load test aims at measuring system performance against a specified workload level, typically the one experienced or expected in production. Usually, the workload level is defined in terms of virtual user concurrency or request throughput.

In the load test, after an initial ramp-up, the target load level is maintained constant for a steady state until the end of the test.

When validating a load test, the following two key factors have to be considered:

The steady-state concurrency/throughput level: a good practice is to apply a worst-case approach by emulating at least 110% of the production throughput;
The steady-state duration: in general defining the length for steady-state is a complex task because it is strictly dependent on the technologies under test and also because phenomena such as bootstraps, warm-ups, and caching can affect the performance and behavior of the system only before or after a certain amount of time; as a general guide to validate the steady-state duration, it is useful to:
1. execute a long-run test by keeping the defined steady-state for at least 2h to 3h;
2. analyze test results by looking for any variation in the performance and behavior of the system over time;
3. if no variation is observed, shorten the defined steady state, keeping it at least 30 minutes long.
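The analysis in step 2 can be approximated with a simple check that splits the long-run measurements into consecutive windows and compares each window with the first one. This is a minimal sketch under stated assumptions: the function name, the use of the mean as the comparison statistic, and the 5% tolerance are all choices made for the example and are not prescribed by Akamas.

```python
from statistics import mean

def is_stable_over_time(samples, window, tolerance=0.05):
    """Return True if every window's mean stays within `tolerance`
    (as a fraction) of the first window's mean."""
    windows = [
        samples[i:i + window]
        for i in range(0, len(samples) - window + 1, window)
    ]
    reference = mean(windows[0])
    return all(abs(mean(w) - reference) <= tolerance * reference for w in windows)

# Hypothetical response times (ms) sampled at regular intervals during steady state
stable_run = [100, 102, 98, 101, 99, 100, 101, 99]
drifting_run = [100, 100, 100, 100, 130, 130, 130, 130]

print(is_stable_over_time(stable_run, window=4))
print(is_stable_over_time(drifting_run, window=4))
```

If the long run passes a check of this kind, the steady state can be shortened as described in step 3; a drift such as the one in the second series suggests that warm-up, caching, or leak effects only show up later in the test.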

Stress tests

A stress test is all about pushing the system under test to its limit.

Stress tests are useful to identify the maximum throughput that an application can cope with while working within its SLOs. Identifying the breaking point of an application is also useful to highlight the bottleneck(s) of the application.

A stress test also makes it possible to understand how the system reacts to excessive load, thus validating the architectural expectations. For example, it can be useful to discover that the application crashes when reaching the limit, instead of simply enqueuing requests and slowing down processing them.

Endurance tests

An endurance test aims at validating the system's performance over an extended period of time.

Validating tests vs production

The first validation is provided by utilization metrics (e.g. CPU, RAM, I/O), which should display in the test environment a behavior that closely matches the one observed in production. If the delta is significant, some refinements of the test cases and environment might be required to close the gap and gain confidence in the test results.

Creating workflows for live optimizations

A workflow for a live optimization study automates all the actions required to interface with the configuration management. Notice that metrics collection is an implicit action that does not need to be coded as part of the workflow.

More in detail, a typical workflow includes the following types of tasks:

Applying the configuration, by preparing and then applying the parameter configuration that has been recommended and/or approved to the target environment - this may require interfacing configuration management tools or pushing configuration to a repository

Depending on the complexity of the system, the workflow might be composed of multiple actions of the same type, each operating on separate components of the target system.

Creating optimization studies

The final preparatory step before running a study is to actually create the study, which also requires several substeps.

Most of the substeps are common to both a live optimization study and an offline optimization study, even if they might need to be conceived differently in these two contexts:

Offline optimization studies

For offline optimization studies, there are some additional (optional) steps:

Live optimization studies

For live optimization studies, there are some additional steps - including a mandatory one:

Defining optimization goal & constraints

The first fundamental step in creating a study is to define the study goal & constraints. While this step might be perceived as somewhat straightforward (e.g. constraints could be simply translated from SLOs already in place), defining the optimization goal really requires carefully balancing complexity and effectiveness, also as part of the general (iterative) optimization process. Please also read the Best Practices section here below.

In general, any performance engineering, tuning, and optimization activity involves complex tradeoffs among different - and potentially conflicting - goals and system performance metrics, such as:

Maximizing the business volume an application can support, while not making the single transaction slower or increasing errors above a desired threshold
Minimizing the duration of a batch processing task, while not increasing the cloud costs by more than 20% or using more than 8 CPUs

Akamas supports all these (and other) scenarios by means of the optimization goal, that is, the single metric or the formula combining multiple metrics that has to be either minimized or maximized, and one or more constraints among metrics of the system.

In general, constraints can be defined either as absolute constraints (e.g. app.response_time < 200 ms) or as relative constraints with respect to a baseline (e.g. app.response_time < +20% of the baseline). The baseline is the current configuration in place, typically corresponding to the very first experiment in an offline optimization study. Therefore, relative constraints are only applicable to offline optimization studies, while absolute constraints are applicable to both offline and live optimization studies.
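The difference between the two kinds of constraints can be illustrated with a short Python sketch. It is not Akamas syntax (the actual structures are described in the reference guide): the function names and the sample response times are assumptions made for the example.

```python
def check_absolute(value, threshold):
    """Absolute constraint: the metric must stay below a fixed threshold."""
    return value < threshold

def check_relative(value, baseline, max_increase):
    """Relative constraint: the metric must stay below the baseline value
    increased by `max_increase` (0.20 means +20%)."""
    return value < baseline * (1 + max_increase)

# Hypothetical response times in milliseconds
baseline_rt = 150
experiment_rt = 185

print(check_absolute(experiment_rt, threshold=200))
print(check_relative(experiment_rt, baseline=baseline_rt, max_increase=0.20))
```

With these numbers the same experiment satisfies the absolute constraint (185 ms is below 200 ms) but violates the relative one (185 ms is above 150 ms + 20%), which shows why a relative constraint needs a measured baseline to be evaluated.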

Please notice that when defining constraints for an optimization study, it is required to also include those constraints listed in the Constraints section of the respective Optimization Packs which express internal constraints among parameters. For example, in case OpenJDK 11 components are to be tuned, the reference section is Constraints.

The Goal & Constraint page of the Study template in the reference guide describes the corresponding structures. For offline optimization studies only, the Akamas UI allows the optimization goal and constraints to be defined as part of the visual procedure activated by the "Create a study" button (see the following figure).

Please notice that any experiment that does not respect the constraints is marked by Akamas as failed, even if correctly executed. The reason for this failure can be inspected in the experiment status. Similarly to workflow failures (see below), the Akamas AI engine automatically takes any failure due to constraint violations into account when searching the optimization space to identify the parameter configurations that might improve the goal metrics while matching constraints.

Best Practices

There are no general guidelines and best practices on how to best define goals & constraints, as this is where experience, knowledge, and processes meet.

Please refer to the section Optimization examples for a number of examples related to a variety of technologies and the Knowledge Base guide for real-world examples.

Defining windowing policies

For both offline and live optimization studies, it is possible to define how to identify the time windows that Akamas needs to consider for assessing the result of an experiment. Defining a windowing policy helps achieve reliable optimizations by excluding metrics data points that should not influence the score of an experiment.

The following two windowing policies are available:

Trim windowing: discards the initial and final part of an experiment - e.g. to exclude warm-up and tear-down phases - trim windowing is the default policy (with the entire interval selected if no trimming is specified)
Stability windowing: discards those parts that do not correspond to the most stable window - this leverages the Akamas feature of automatically identifying the most stable window based on user-specified criteria

The Windowing policy page of the Study template section in the reference guide describes the corresponding structures. For offline optimization studies only, the Akamas UI allows the windowing policies to be defined as part of the visual procedure activated by the "Create a study" button (see the following figures).

Best Practices

The following sections provide general best practices on how to define a suitable windowing policy.

Define windowing based on the optimization goal

In order to make the optimization process fully automated and unattended, Akamas automatically analyzes the time series of the collected metrics of each experiment and calculates the experiment score (all the system metrics will also be aggregated).

Based on the optimization goal, it is important to instruct Akamas on how to perform this experiment analysis, in particular, by also leveraging Akamas windowing policies.

For example, when optimizing an online or transactional application, there are two common scenarios:

Increase system performance (i.e. minimize response time) or reduce system costs (i.e. decrease resource footprint or cloud costs) while processing a given and fixed transaction volume (i.e. a load test);
Increase the maximum throughput a system can support (i.e., system capacity) while processing an increasing amount of load (e.g. a stress test).

In the first scenario, a load test scenario is typically used: the injected load (e.g. virtual users) ramps up for a period, followed by a steady state, with a final ramp-down period. From a performance engineering standpoint, since the goal is to assess the system performance during the steady state, the warm-up and tear-down periods can be discarded. This analysis can be automated by applying a windowing policy of type "trim" upon creating the optimization study, which makes Akamas automatically compute the experiment score by discarding a configurable warm-up and tear-down period.
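The effect of a trim window can be sketched as follows. This is a simplified illustration, not the Akamas implementation: the function name, the (timestamp, value) sample format, and the use of the mean as the aggregation are assumptions made for the example.

```python
from statistics import mean

def trim_score(samples, warmup, teardown):
    """Average a metric after discarding the first `warmup` seconds
    and the last `teardown` seconds of the experiment.

    `samples` is a list of (timestamp_seconds, value) pairs sorted by time.
    """
    start = samples[0][0] + warmup
    end = samples[-1][0] - teardown
    kept = [value for timestamp, value in samples if start <= timestamp <= end]
    return mean(kept)

# Hypothetical response times (ms), one sample per minute:
# two minutes of ramp-up, a steady state, and one minute of ramp-down.
samples = [(0, 500), (60, 300), (120, 200), (180, 210),
           (240, 190), (300, 200), (360, 400)]

print(trim_score(samples, warmup=120, teardown=60))
```

Without trimming, the ramp-up and ramp-down samples would inflate the average and make configurations look worse than they are during the steady state.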

In the second scenario, a stress test is typically used: the injected load follows a ramp with increasing levels of users, designed to stress the system up to its limit. In this case, a performance engineer is most likely interested in the maximum throughput the system can sustain before breaking down (possibly while matching a response time constraint). This analysis can be automated by applying a windowing policy of type "stability", which makes Akamas automatically compute the experiment score in the time window where the throughput was maximized but stable for a configurable period of time.
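The idea behind a stability window can be sketched as a sliding-window search: among all the windows whose standard deviation stays within a threshold, select the one with the highest average throughput. This is only an illustration under stated assumptions (function name, fixed window width, population standard deviation) and does not describe how Akamas implements the policy.

```python
from statistics import mean, pstdev

def most_stable_window(throughput, width, max_stdev):
    """Return (start_index, mean) of the window with the highest mean throughput
    among the windows whose standard deviation is within `max_stdev`.
    Returns None when no window is stable enough."""
    best = None
    for start in range(len(throughput) - width + 1):
        window = throughput[start:start + width]
        if pstdev(window) <= max_stdev:
            candidate = (start, mean(window))
            if best is None or candidate[1] > best[1]:
                best = candidate
    return best

# Hypothetical throughput (requests/s) during a stress test:
# the load ramps up, plateaus around 400, then the system breaks down.
throughput = [100, 200, 300, 400, 400, 402, 398, 250, 120]

print(most_stable_window(throughput, width=3, max_stdev=5))
```

A threshold that is too strict returns no window at all, which mirrors the need to tune the standard deviation threshold on a few baseline experiments.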

When optimizing a batch application, windowing is typically not required. In such scenarios, a typical goal is to minimize batch duration or aggregate resource utilization. Hence, there is no need to define any windowing policy: by default, the whole experiment timeframe is considered.

Finding an effective stability window

Setting up an effective stability window requires some knowledge of the test scenario and the variability of the environment.

As a general guideline it is recommended to run a baseline study with a stability window set to a low value, such as a value close to 0 or half of the expected mean of the metric, and then to inspect the results of the baseline to identify which window has been identified and update the standard deviation threshold accordingly. When using a continuous ramp the test has no plateaus, so the standard deviation threshold should be a bit higher to account for the increment of the traffic in the windowing period. On the contrary, when running a staircase test with many plateaus, the standard deviation can be smaller to identify a period of time with the same amount of users.

Applying the standard deviation filter to very stable metrics, such as the number of users, simplifies the definition of the standard deviation threshold but might hide some instability of the environment when subject to constant traffic. On the other hand, applying the threshold to a more direct measure of the performance, such as the throughput, makes it easier to identify the stability period of the application but might require more baseline experiments to identify the proper threshold value. The logs of the scoring phase provide useful insights into the maximum standard deviation found and the number of candidate windows that have been identified given a threshold value, which can be used to refine the threshold in a few baseline experiments.

Defining KPIs

While the optimization goal drives the Akamas AI toward optimal configurations, there might be other sub-optimal configurations of interest when they not only match the optimization constraints but also improve on some Key Performance Indicators (KPIs).

For example:

for a Kubernetes microservice Java-based application, a typical optimization goal is to reduce the overall (infrastructure or cloud) cost by tuning both Kubernetes and JVM parameters while keeping SLOs in terms of application response time and error rate under control
among different configurations that provide similar cost reduction in addition to matching all SLOs, a configuration that would also significantly improve the application response time might be worth considering with respect to an optimal configuration that does not improve on this KPI

Akamas automatically considers any metric referred to in the defined optimization goal and constraints for an offline optimization study as a KPI. Moreover, any other metrics of the system component can be specified as a KPI for an offline optimization study.

The KPIs page of the Study template section in the reference guide describes how to define the corresponding structure. Specifying the KPIs can be done while first defining the study or from the Akamas UI, at either study creation time or afterward (see the following figures).

Once KPIs are defined, Akamas will represent the results of the optimization in the Insights section of the Akamas UI. Moreover, the corresponding suboptimal configuration associated with a specific KPI is highlighted in the Akamas UI by a textual badge "Best <KPI name>".
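The selection of a "Best <KPI name>" configuration can be pictured with the following Python sketch, which picks, for each KPI, the best experiment among those that satisfy the constraints. The data layout, the field names, and the function best_per_kpi are hypothetical and only serve the example; they do not reflect the Akamas data model.

```python
def best_per_kpi(experiments, kpis):
    """For each KPI, return the id of the best experiment among those
    that satisfy the constraints.

    `kpis` maps a metric name to "minimize" or "maximize".
    """
    valid = [e for e in experiments if e["constraints_ok"]]
    best = {}
    for metric, direction in kpis.items():
        pick = min if direction == "minimize" else max
        best[metric] = pick(valid, key=lambda e: e["metrics"][metric])["id"]
    return best

# Hypothetical experiment results: cost in $/month, response time in ms
experiments = [
    {"id": 1, "constraints_ok": True,  "metrics": {"cost": 100, "response_time": 180}},
    {"id": 2, "constraints_ok": True,  "metrics": {"cost": 70,  "response_time": 190}},
    {"id": 3, "constraints_ok": True,  "metrics": {"cost": 72,  "response_time": 120}},
    {"id": 4, "constraints_ok": False, "metrics": {"cost": 60,  "response_time": 250}},
]

print(best_per_kpi(experiments, {"cost": "minimize", "response_time": "minimize"}))
```

In this made-up data set, experiment 2 is the best on cost, while experiment 3 costs almost the same and has a much lower response time, which is the kind of trade-off the KPI view is meant to surface.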

Please notice that KPIs can also be re-defined after an offline optimization study has been completed as their definition does not affect the optimization process, only the evaluation of its results. See the section Analyzing offline optimization studies and the Optimization Insights page.

Defining parameters & metrics

After defining the goal and its constraints, the following substep in creating an optimization study is specifying the optimization parameters and metrics. In particular, selecting the parameters that are going to be tuned to optimize the system is a critical decision that requires carefully balancing complexity and effectiveness. As for goals & constraints, this step may also require adopting an iterative approach. See also the Best Practices section here below.

The corresponding pages of the Study template section in the reference guide describe how to define the corresponding structures. For offline optimization studies only, the Akamas UI allows the parameters and metrics to be defined as part of the visual procedure activated by the "Create a study" button (see the following figure).

As illustrated by the previous and following figures, during this step it is also possible to edit the range of values associated with each optimization parameter with respect to the default domain provided by either the original or custom optimization pack in use for the respective technology.

Parameter rendering

By default, all parameters specified in the parameters selection of a study are applied ("rendered"). Akamas allows specifying which configuration parameters should be applied in the optimization steps. More precisely:

parameter rendering is available at the step level for baseline, preset, and optimize steps
parameter rendering is not available for bootstrap steps (bootstrapped experiments are not executed)

This feature can be useful to deal with the different strategies through which applications and systems accept configuration parameters.

Best Practices

The following sections provide some best practices on how to best approach the step of defining optimization parameters.

Configure parameters domains based on environment specs

Since the parameter domain defines the range of values that the Akamas AI engine can assign to the parameter, when defining the system parameters to be optimized, it is important to review the parameter domains and adjust them based on the system characteristics of the target system, environment and best practices in place.

Akamas optimization packs already provide parameter domains that are correct for most situations. For example, the OpenJDK 11 JVM gcType is a categorical parameter that already includes all the possible garbage collectors that can be set for this JVM version.

For other parameters, there are no sensible default domains as they depend on the environment. For example, the OpenJDK 11 maxHeapSize JVM parameter dictates how much memory the JVM can use. This obviously depends on the environment in which the JVM runs. For example, the upper bound might be 90% of the memory of the virtual machine or container in which the JVM runs.
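The following sketch shows how an environment-specific domain for maxHeapSize could be derived from the memory available to the JVM, using the 90% upper bound mentioned above. The function name, the 256 MB lower bound, and the 4096 MB container size are assumptions made for the example.

```python
def max_heap_domain(available_memory_mb, lower_mb=256, upper_fraction=0.90):
    """Return a (lower, upper) domain in MB for the JVM maximum heap size,
    capping the upper bound at a fraction of the available memory."""
    upper = int(available_memory_mb * upper_fraction)
    if upper < lower_mb:
        raise ValueError("available memory is too small for the chosen lower bound")
    return lower_mb, upper

# Hypothetical container with a 4096 MB memory limit
print(max_heap_domain(4096))
```

The resulting range would then replace the default domain of the parameter when defining the study.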

Configure parameter constraints based on Optimization Pack best practices

Depending on the specific technology under optimization, the configuration parameters may have relationships among themselves. For example, in a JVM the newSize parameter defines the size of a region of the JVM heap, and hence its value should always be less than the maxHeapSize parameter.

Akamas AI engine supports the definition of constraints among parameters as this is a frequent need when optimizing real-life applications.

It is important to define the parameter constraints when creating a new study. The optimization pack documentation provides guidelines on what are the most important parameter constraints for the specific technology.

When optimizing a new or custom technology, it may happen that some experiments fail due to unknown parameter constraints being violated. For example, the application may fail to start and only by analyzing the application error logs, the reason for the failure can be understood. For a Java application, the JVM error message (e.g. "new size cannot be larger than max heap size") could provide useful hints. This would reveal that some constraints need to be added to the parameter constraints in the study.

While the Akamas AI engine has been designed to learn from failures, including those due to relationships among parameters that were not explicitly set as constraints, setting parameter constraints may help avoid unnecessary failures and thus speed up the optimization process.
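To make the idea of a parameter constraint concrete, the following Python sketch checks a candidate configuration against a set of constraints before it is applied. It is not the Akamas constraint syntax, which is documented in the reference guide and in each optimization pack; the function name and the sample values (in MB) are assumptions made for the example.

```python
def violated_constraints(config, constraints):
    """Return the names of the constraints that the configuration violates."""
    return [name for name, predicate in constraints.items() if not predicate(config)]

# The JVM relationship described above, expressed as a predicate
constraints = {
    "newSize < maxHeapSize": lambda c: c["newSize"] < c["maxHeapSize"],
}

valid_config = {"newSize": 512, "maxHeapSize": 2048}
invalid_config = {"newSize": 2048, "maxHeapSize": 1024}

print(violated_constraints(valid_config, constraints))
print(violated_constraints(invalid_config, constraints))
```

Rejecting the second configuration up front avoids spending an experiment on a JVM that would fail to start.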

Defining workloads

For a live optimization study, it is required to specify which component metrics represent the different workloads observed on the target system. A workload could be represented by either a metric directly measuring that workload, such as the application throughput, or a proxy metric, such as the percentage of reads and writes in your database.

The corresponding page of the Study template section in the reference guide describes how to define the corresponding structure.

Akamas features automatic detection of workload contexts, corresponding to different patterns for the same workload. For example, workload context could correspond to the peak or idle load, or to the weekend or weekday traffic. This allows Akamas to recommend safe configurations based on the observed behavior of the system under similar workload conditions.

Akamas provides several parameters governing how the Akamas optimizer operates and leverages the workload information while a live optimization study is being executed. The most important parameter is the online mode (see below), as it determines whether the human user is part of the approval loop when the Akamas AI recommends a configuration to be applied.

Moreover, Akamas also provides customizable safety policies that drive the Akamas optimizer in evaluating candidate configurations with respect to defined goal constraints.

Online mode

Live optimizations can operate in one of the following online modes:

recommendation (or manual) mode (the default mode): Akamas does not immediately apply a configuration identified by the Akamas AI: a new configuration is first recommended to the user, who needs to approve it, possibly after modifying it, before it gets applied - this is also referred to as the human-in-the-loop scenario;
fully autonomous (or automatic) mode: new configurations are immediately applied by Akamas as soon as they are generated by the Akamas AI, without being first recommended to (and approved by) the user.

It is worth noticing that in recommendation mode, there might be a significant delay between the time a configuration is identified by Akamas and the time the recommended changes get applied. Therefore, the Akamas AI leverages the workload information differently when looking for a new configuration, depending on the defined online mode:

in the recommendation mode, Akamas takes into account all the defined workloads and looks for the configuration that best satisfies the goal constraints for all the observed workloads and provides the best improvements for all of them
in the fully autonomous mode, Akamas works on a single workload at each iteration (based on a customizable workload strategy - see below) and looks for an optimized configuration for that specific workload to be immediately applied in the next iteration, even if it might not be the best for the different workloads

The online mode can be specified at the study level and can also be overridden at the step level (only for steps of type "optimize"). The reference guide describes how to define the corresponding structure. The online mode can be set either from the Akamas command line or from the Akamas UI (see the following figure).

Notice that the online mode can be changed at any time, that is while the optimization study is running, to become immediately effective. For example, a live optimization could initially operate in recommendation mode and then be changed to fully autonomous mode afterward.

Defining optimization steps

A final step in defining an optimization study is to specify the sequence of steps executed while running the study.

The following four types of steps are available:

Baseline: performs an experiment and sets it as a baseline for all the other ones
Bootstrap: imports experiments from other studies
Preset: performs an experiment with a specific configuration
Optimize: performs experiments and generates optimized configurations

Please notice that at least one baseline step is always required in any optimization study. This applies not only to offline optimization studies, but also to live optimization studies as it is being used to suggest changes to parameter values starting from the default values.

Best Practices

The following sections provide some best practices on how to best approach the step of defining the baseline step.

Ensure the baseline configuration is correct

In an optimization study, the baseline is an important experiment as it represents the system performance with the current configuration, and serves as a reference to assess the relative improvements the optimization achieved.

Therefore, it is important to make sure the baseline configuration of the study correctly reflects the current configuration - be it the vendor default or the result of a manual tuning exercise.

Evaluate which parameters to include in the baseline configuration

When defining the study baseline configuration it is important to evaluate which parameters to include. Indeed, several technologies have default values assigned to most of their configuration parameters. However, the runtime behavior can be different depending on whether the parameter is set to the default value or it is not set at all.

Therefore, it is recommended to review the current configuration (e.g. the one in place in production) and identify which parameters and values have been set (e.g. JVM maxHeapSize = 2GB, gcType = Parallel, etc.), and then to only set those parameters with their corresponding values, without adding any other parameters. This ensures that the specified baseline is consistent with the real production setup.

Setting safety policies

While Akamas leverages similar AI methods for both live optimizations and optimization studies, the way these methods are applied is radically different. Indeed, for optimization studies running in pre-production environments, the approach is to explore the configuration space by also accepting potential failed experiments, to identify regions that do not correspond to viable configurations. Of course, this approach cannot be accepted for live optimization running in production environments. For this purpose, Akamas live optimization uses observations of configuration changes combined with the automatic detection of workload contexts and provides several customizable safety policies when recommending configurations to be approved, revisited, and applied.

Akamas provides a few customizable optimizer options (refer to the options described in the reference guide) that should be configured so as to make configurations recommended in live optimization and applied to production environments as safe as possible.

Exploration factor

Akamas provides an optimizer option known as the exploration factor that only allows gradual changes to the parameters. This gradual optimization allows Akamas to observe how these changes impact the system behavior before applying the following gradual changes.

By properly configuring the optimizer, Akamas can gradually explore regions of the configuration space and slowly approach any potentially risky regions, thus avoiding recommending any configurations that may negatively impact the system. Gradual optimization takes into account the maximum recommended change for each parameter. This is defined as a percentage (default is 5%) with respect to the baseline value. For example, in the case of a container whose CPU limit is 1000 millicores, the corresponding maximum allowed change is 50 millicores. It is important to notice that this does not represent an absolute cap, as Akamas also takes into account any good configurations observed. For example, in the event of a traffic peak, Akamas would recommend a good configuration that was observed working fine for a similar workload in the past, even if the change is higher than 5% of the current configuration value.
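The arithmetic of the gradual change can be sketched as follows. This is a simplified model and not the Akamas optimizer: the function names are hypothetical, and the sketch deliberately ignores the exception described above, where a previously observed good configuration may be recommended even if it exceeds the limit.

```python
def max_change(baseline_value, exploration_factor=0.05):
    """Maximum recommended change, as a fraction of the baseline value."""
    return baseline_value * exploration_factor

def limit_step(current, proposed, baseline_value, exploration_factor=0.05):
    """Clamp a proposed value so that it moves away from the current value
    by no more than the maximum recommended change."""
    step = max_change(baseline_value, exploration_factor)
    return max(current - step, min(current + step, proposed))

# Container CPU limit example from the text: baseline of 1000 millicores
print(max_change(1000))
print(limit_step(current=1000, proposed=800, baseline_value=1000))
print(limit_step(current=1000, proposed=1020, baseline_value=1000))
```

With the default 5% factor, a proposal to drop straight from 1000 to 800 millicores is limited to a 50-millicore step, while a change that is already within the limit is left untouched.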

Notice that this feature would not work for categorical parameters (e.g. JVM GC Type) as their values do not change incrementally. Therefore, when it comes to these parameters, Akamas by default takes a conservative approach of only recommending configurations whose categorical parameters take values that have already been observed. This still allows some never observed values to be recommended, as users are allowed to modify values also for categorical parameters when operating in human-in-the-loop mode. Once Akamas has observed that the specific configuration is working fine, the corresponding value can then be recommended. For example, a user might modify the recommended configuration for GC Type from Serial to Parallel. Once Parallel has been observed as working fine, Akamas would consider it for future recommendations of GC Type, while other values (e.g. G1) would not be considered until verified as safe recommendations.

The exploration factor can be customized for each live optimization individually and changed while live optimizations are running.

Safety factor

Akamas provides an optimizer option known as the safety factor designed to prevent Akamas from selecting configurations (even if slowly approaching them) that may impact the ability to match defined SLOs. For example, when optimizing container CPU limits, lower and lower CPU limits might be recommended, up to the point where the limit becomes so low that the application performance degrades.

Akamas takes into account the magnitude of constraint breaches: a severe breach is considered more negative than a minor breach. For example, in the case of an SLO of 200 ms on response time, a configuration causing a 1 sec response time is assigned a very different penalty than a configuration causing a 210 ms response time. Moreover, Akamas leverages the smart constraint evaluation feature that takes into account if a configuration is causing constraints to approach their corresponding thresholds. For example, in the case of an SLO of 200 ms on response time, a configuration changing response time from 170 ms to 190 ms is considered more problematic than one causing a change from 100 ms to 120 ms. The first one is considered by Akamas as corresponding to a gray area that should not be explored.

The safety factor is also used when starting the study to validate the behavior of the baseline and thus assess how safe it is to explore configurations close to the baseline. If the baseline presents some constraint violations, then even exploring configurations close to the baseline might be risky. If Akamas identifies that more than (safety_factor*number_of_trials) trials of the baseline configuration manifest constraint violations, then the optimization is stopped.
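The baseline check above comes down to a simple comparison, sketched here. The function name and the boolean-list input are hypothetical; only the threshold formula (safety_factor*number_of_trials) comes from the text.

```python
def baseline_is_safe(trial_violations, safety_factor=0.5):
    """Return False when the baseline should stop the optimization.

    trial_violations: one boolean per baseline trial, True if that trial
    violated at least one constraint.
    """
    threshold = safety_factor * len(trial_violations)
    # The optimization is stopped when MORE than threshold trials fail.
    return sum(trial_violations) <= threshold


# Four baseline trials with the default safety factor: threshold is 2.
two_failures = baseline_is_safe([True, True, False, False])
three_failures = baseline_is_safe([True, True, True, False])
```

With four trials and a safety factor of 0.5, two failing trials are still tolerated while three stop the study.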

If your baseline has some trials failing constraint validation, we suggest you analyze them before proceeding with the optimization.

The safety factor is set by default to 0.5 and can be customized for each live optimization individually and changed while live optimizations are running.

Outlier detection

It is worth mentioning that Akamas also features an outlier detection capability to compensate for production environments typically being noisy and much less stable than staging environments, and thus displaying highly fluctuating performance metrics. As a consequence, constraints may fail from time to time, even for perfectly good configurations. This may be due to a variety of causes, such as shared infrastructure on the cloud, slowness of external systems, etc.

Running optimization studies

Once all the preparatory steps for creating a study are done, running a study is straightforward: An optimization study can be started from either the Akamas UI (see the following figures) or the command line (refer to the Resource management commands page).

Before actually running an optimization study, it is highly recommended to read the following sections:

Before running optimization studies
Analyzing results of offline optimization studies or Analyzing results of live optimization studies
Before applying optimization results

Once started, studies are managed differently depending on whether they are offline optimization studies or live optimization studies, as described in the following.

Offline optimization studies

Notice that once an offline optimization study has started, it can only be stopped or left to finish, and it cannot be restarted. However, it is possible to reuse in a study the experiments executed in another finished study (whether it finished successfully or not) - this is called bootstrapping and is illustrated by the following figure (also refer to the Bootstrap Step page on the reference page).

This can be useful for multiple reasons, including the case of an error (e.g. a misconfigured workflow) that requires "restarting" the study.

Live optimization studies

For live optimization studies, it is possible to stop a study and restart it. However, please notice that restarting is an irreversible action that deletes all the executed experiments, so restarting a live study basically means starting it from scratch.

Before running optimization studies

The following provides some best practices that can be adopted before launching optimization studies, in particular for offline optimization studies.

Dry-running the optimization study

It is recommended to execute a dry-run of the study to verify that the workflow works as expected and in particular that the telemetry and configuration management steps are correctly executed.

Verify that workflow actually works

It is important to verify that all the steps of the workflow complete successfully and produce the expected results.

Verify that parameters are applied and effective

When approaching the optimization of new applications or technologies, it is important to make sure all the parameters that are being set are actually applied and used by the system.

Depending on the specific technology at hand, the following issues can be found:

parameters were set but they are not applied - for example parameters were set in the wrong configuration file or the path is not correct;
some automatic (corrective) mechanisms are in place that override the values applied for the parameters.

Therefore, it is important to always verify the actual values of the parameters once the system is up & running with a new configuration, and make sure they match the values applied by Akamas. This is typically done by leveraging:

monitoring tools, when the parameters are available as metrics or properties of the system;
native administration tools, which are typically available for introspection or troubleshooting activities (e.g. jcmd for the JVM).
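As an example of this kind of check for the JVM, the sketch below parses flag text in the style printed by jcmd <pid> VM.flags and compares it with the values that were supposed to be applied. The sample output string and the expected values are made up for illustration, and the mapping from Akamas parameter names to JVM flag names is not shown and would need to be provided.

```python
def parse_vm_flags(text):
    """Parse '-XX:Name=value' and '-XX:+Name' / '-XX:-Name' tokens into a dict."""
    flags = {}
    for token in text.split():
        if not token.startswith("-XX:"):
            continue  # skip the PID line and any non-flag token
        body = token[len("-XX:"):]
        if body.startswith("+"):
            flags[body[1:]] = True
        elif body.startswith("-"):
            flags[body[1:]] = False
        elif "=" in body:
            name, value = body.split("=", 1)
            flags[name] = value
    return flags


def find_mismatches(expected, actual):
    """Return {flag: (expected, actual)} for every flag that differs."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


# Hypothetical flag text captured from the running JVM.
sample = "12345:\n-XX:InitialHeapSize=268435456 -XX:MaxHeapSize=2147483648 -XX:+UseParallelGC"
actual = parse_vm_flags(sample)

# Values the study was supposed to apply (assumed for the example).
expected = {"MaxHeapSize": "1073741824", "UseParallelGC": True}
mismatches = find_mismatches(expected, actual)
```

Here the heap size reported by the JVM differs from the one that was set, which is the kind of discrepancy worth investigating before launching the study.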

Verify that load testing works

It is important to verify that the integration with load testing tools actually executes the intended load test scenarios.

Verify that telemetry collects all the relevant metrics

It is important to make sure that the integration with telemetry providers works correctly and that all the relevant metrics of the system are correctly collected.

Data-gathering from the telemetry data sources is launched at the end of the workflow tasks. The status of the telemetry process can be inspected in the Progress tab, where it is also possible to inspect the telemetry logs in case of failures.

Please notice that the telemetry process fails if the key metrics of the study cannot be gathered. This includes metrics defined in the goal function or constraints.

Baselining the system

Before running the optimization study, it is important to make sure the system and the environment where the optimization is running provide stable and reproducible performance.

Make sure the system performance is stable

In order to ensure a successful optimization, it is important to make sure that the target system displays stable and predictable performance and does not suffer from random variations.

To make sure this is the case, it is recommended to create a study that only runs a single baseline experiment. In order to assess the performance of the system, Akamas trials can be used to execute the same experiments (hence, the same configuration) multiple times (e.g. three times). Once the experiment is completed, the resulting performance metrics can be analyzed to assess the stability. The analysis can either be done by leveraging aggregate metrics in the Analysis tab, or to a deeper level on the actual time series by accessing the Metrics tab from the Akamas UI.

Ideally, no significant performance variation should be observed in the different trials, for the key system performance metrics. Otherwise, it is strongly recommended to identify the root cause before proceeding with the actual optimization activity.
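One simple way to quantify the variation across baseline trials is the coefficient of variation (standard deviation divided by mean) of a key metric. The sketch below assumes the per-trial values have already been exported from Akamas; the 5% threshold and the sample numbers are arbitrary assumptions, not an Akamas default.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation relative to the mean."""
    return pstdev(values) / mean(values)


def is_stable(values, max_cv=0.05):
    """True when the trials vary by at most max_cv around their mean."""
    return coefficient_of_variation(values) <= max_cv


# Hypothetical response times (ms) from three trials of the baseline.
steady_trials = [201.0, 198.0, 204.0]
noisy_trials = [150.0, 260.0, 200.0]

steady_ok = is_stable(steady_trials)
noisy_ok = is_stable(noisy_trials)
```

A result like the second one would suggest looking for the root cause of the variation before starting the actual optimization.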

If you are running a live optimization, any constraint violation in the baseline will halt the study. In order to recommend safe configurations, the optimization process requires that the baseline does not violate constraints for the entire observation period.

Backing up the original configuration

Before launching the optimization, it is a good idea to take note of (or back up) the original configuration. This is very important in the case of Linux OS parameters optimization.

Analyzing results of offline optimization studies

Since an offline optimization study lasts for at most the number of configured experiments and typically runs in a test or pre-production environment, results can safely be analyzed after the study has completely finished.

However, it is a good practice to analyze partial results while the study is still running as this may provide useful insights about both the system being optimized (e.g. understanding of the system dynamics and sub-optimal configurations that could be immediately applied) and about the optimization study itself (e.g. how to re-design a workflow or change constraints), early-on.

The Akamas UI displays the results of an offline optimization study in different visual areas:

the Best Configuration section provides the optimal configuration identified by Akamas, as a list of recommended values for the optimization parameters compared to the baseline and ranked according to their relevance;

the Progress tab (see the following figures) displays the progression of the study with respect to the study steps, the status of each experiment (and trial), its associated score, and the parameter values of the corresponding configurations; this area is mostly used for study monitoring (e.g. identifying failing workflows) and troubleshooting purposes;

the Analysis tab (see the following figures) displays how the baseline and experiments score with respect to the optimization goal, and the values of metrics and parameters for the corresponding configurations; this area supports the analysis of the different configurations;

the Metrics tab (see the following figure) displays the behavior of the metrics for all executed experiments (and trials); this area supports both study validation activities and deeper analysis of the system behavior.

Optimization Insights

While the main result of an optimization study is to identify the optimal configuration with respect to the defined goal & constraints, any suboptimal configuration that is improving on one of the defined KPIs can be also very valuable.

These configurations are displayed in a dedicated section of the Akamas UI and also displayed in other areas of the Akamas UI as textual badges "Best <KPI name>" referred to as (insights) tags.

Insights section

The following figures show the Insights section displayed on the study page and the Insights pages that can be drilled down to.

The following figure shows the insights tags in the Analysis tab:

Please notice that "Best", "Best Memory Limit" and any other KPI-related tags are displayed in the Akamas UI while the study progresses and thus may be reassigned as new experiments get executed and their configurations are scored against the defined study KPIs.

Insights tags

After starting a study, any finished experiment is labeled by one or more insights tags "Best <KPI name>" in case the corresponding configuration provides the best result so far for those KPIs. Notice that for experiments involving multiple trials, tags are only assigned after all their trials have finished.

Of course, after the very first experiment (i.e. the baseline) finishes, all tags are assigned to the corresponding configuration. This is displayed by the following figure for a study with two KPIs: CPU, with formula renaissance.cpu_used and direction minimize, and MEM, with formula renaissance.mem_used and direction minimize:

When the following experiments finish, tags are reevaluated with respect to the computed goal score and the achieved results for each single KPI. In this study, experiment #2 provided a better result for both the CPU and the study goal, so it got both the tags Best CPU and Best renaissance.response_time (which is defined as the goal of the study). Notice that the blue star is displayed by Akamas (except for the baseline) to highlight the fact that this tag was automatically generated by Akamas and not assigned by a user.

Afterward, experiment #3 was tagged as the best configuration, while experiment #4 got the tag Best CPU as it improved on experiment #2. Therefore two configurations displayed the blue star.

A number of experiments later, experiment #7 provided better memory usage than the baseline, so it got the tag Best MEM assigned. At this point, three configurations have the blue star, thus making evident that there are tradeoffs when trying to optimize with respect to the goal and the KPIs.
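The way tags move from one experiment to another can be reproduced with a short sketch. The metric values below are invented so that the sequence matches the walkthrough above, the goal is treated like any other minimized KPI (ignoring constraints), and the function is only an illustration, not the Akamas scoring logic.

```python
def assign_tags(experiments, kpis):
    """Return {kpi_name: experiment_id} for the holder of each 'Best <KPI>' tag.

    experiments: list of (experiment_id, {kpi_name: value}) in completion order.
    kpis: {kpi_name: "minimize" or "maximize"}.
    """
    best = {}
    for exp_id, results in experiments:
        for kpi, direction in kpis.items():
            value = results[kpi]
            if kpi not in best:
                best[kpi] = (exp_id, value)  # first experiment gets every tag
                continue
            current = best[kpi][1]
            improved = value < current if direction == "minimize" else value > current
            if improved:
                best[kpi] = (exp_id, value)
    return {kpi: holder[0] for kpi, holder in best.items()}


kpis = {"CPU": "minimize", "MEM": "minimize", "response_time": "minimize"}

# Hypothetical results; experiment 1 is the baseline.
experiments = [
    (1, {"CPU": 2.0, "MEM": 1000, "response_time": 300}),
    (2, {"CPU": 1.8, "MEM": 1100, "response_time": 280}),
    (3, {"CPU": 1.9, "MEM": 1050, "response_time": 250}),
    (4, {"CPU": 1.5, "MEM": 1200, "response_time": 290}),
    (7, {"CPU": 1.7, "MEM": 900, "response_time": 270}),
]

after_baseline = assign_tags(experiments[:1], kpis)
final_tags = assign_tags(experiments, kpis)
```

At the end, three different experiments hold a tag, which is the tradeoff situation described in the text.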

Analyzing results of live optimization studies

Even for live optimization studies, it is a good practice to analyze how the optimization is being executed with respect to the defined goal & constraints, and workloads.

This analysis may provide useful insights about the system being optimized (e.g. understanding of the system dynamics) and about the optimization study itself (e.g. how to adjust optimizer options or change constraints). Since this is more challenging for an environment that is being optimized live, a common practice is to adopt a recommendation mode before possibly switching to a fully autonomous mode.

The Akamas UI displays the results of a live optimization study in the following areas:

the Metrics section (see the following figures) displays the behavior of the metrics as configurations are recommended and applied (possibly after being reviewed and approved by users); this area supports the analysis of how the optimizer is driven by the configured safety and exploration factors;

the All Configurations section (see the following figures) provides the list of all the recommended configurations, possibly as modified by the user, as well as the detail of each applied configuration;

in the case of a recommendation mode, the Pending configuration section (see the following figure) shows the configuration that is being recommended to allow users to review it (see the EDIT toggle) and approve it:

Before applying optimization results

The following best practices should be considered before applying a configuration identified by an offline optimization study from a test or pre-production environment to a production environment.

Most of these best practices are general and refer to any configuration change and application rollout, not only to Akamas-related scenarios.

Validating the study results

Any configuration identified by Akamas in a test or pre-production environment, by executing a number of experiments and trials in a limited timeframe, should first be validated for its ability to consistently deliver the expected performance over time before being promoted to production.

Running endurance tests

An endurance test typically lasts for several hours and can either mimic the specific load profile of production environments (e.g. morning peaks or low load phases during the night) or a simple constant high load (flat load). A specific Akamas study can be implemented for this purpose.

Applying results of optimization studies

When applying a new configuration to a production environment it is important to reduce the risk of severely impacting the supported services and allowing time to backtrack if required.

Adopt gradual rollouts

With a gradual rollout approach, a new configuration is applied to only a subset of the target system to allow the system to be observed for a period of time and avoid impacting the entire system.

Several strategies are possible, including:

Canary deployment, where a small percentage of the traffic is served by the instance with the new configuration;
Shadow traffic, where traffic is mirrored and redirected to the instance with the new configuration, and responses are not impacting the user.

Assess the impact on the infrastructure and other applications

In the case of an application sharing entire layers or single components (e.g. microservices) with other applications, it is important to assess in advance the potential impact on other applications before applying a configuration identified by only considering SLOs related to a single application.

The following general considerations may help in assessing the impact on the infrastructure:

if the new configuration is more efficient (i.e. it is less demanding in terms of resources) or it does not require changes to resource requirements (e.g. it does not change K8s request limits), then the configuration can be expected to be beneficial as the resources will be freed and become available for additional applications;
if the new configuration is less efficient (i.e. it requires more resources), then it should be checked whether the additional capacity is available in the infrastructure (e.g. in the K8s cluster or namespace), as when allocating new applications.

As far as the other applications are concerned:

Just reducing the operational cost of a service does not have any impact on other applications that are calling or using the service;
While tuning a service for performance may put the caller system in back-pressure fatigue, this is not the typical behavior of enterprise systems, where the most susceptible systems are on the backend side:
- Tuning most external services will not increase the throughput much, which is typically business-driven, thus the risk to overwhelm the backends is low;
- Tuning the backends allows the caller systems to handle faster connections, thus reducing the memory footprint and increasing the resilience of the entire system;
Especially in the case of highly distributed systems, such as microservices, the number of in-flight packages for a given period of time is something to be minimized;
A latency reduction for a microservice implies fewer in-flight packages throughout the system, leading to better performance, faster failures, and fewer pending transactions to be rolled back in case of incidents.

Guidelines for choosing optimization parameters

In this section, some guidelines on how to choose optimization parameters are provided for the following specific technologies:

Kubernetes
JVM (OpenJDK)
JVM (OpenJ9)
Oracle Database
PostgreSQL

These guidelines also provide an example of how to approach the selection of parameters (and how to define the associated domains and constraints) in an optimization study.

Guidelines for Kubernetes

When starting a new Kubernetes optimization, the following is a list of recommended parameters. After having selected the parameters, always make sure to add/review the corresponding parameter domains and constraints based on your environment. Please refer to the optimization pack reference page for more information.

Suggested Parameters for Kubernetes Containers

cpu_request
cpu_limit
memory_request
memory_limit

Guidelines for JVM layer (OpenJDK)

When starting a new JVM optimization, the following is a list of recommended parameters. After having selected the parameters, always make sure to add/review the corresponding parameter domains and constraints based on your environment. Please refer to the optimization pack reference page for more information.

Suggested Parameters for OpenJDK 11

jvm_maxHeapSize
jvm_minHeapSize
jvm_gcType
jvm_newSize
jvm_survivorRatio
jvm_maxTenuringThreshold
jvm_parallelGCThreads
jvm_concurrentGCThreads
jvm_maxInlineSize
jvm_inlineSmallCode
jvm_useTransparentHugePages
jvm_alwaysPreTouch

Suggested Parameters for OpenJDK 8

When starting a new JVM optimization, the following is a list of recommended parameters to include in your study:

jvm_gcType
jvm_maxHeapSize
jvm_newSize
jvm_survivorRatio
jvm_maxTenuringThreshold
jvm_parallelGCThreads
jvm_concurrentGCThreads

Guidelines for JVM (OpenJ9)

Suggested Parameters for OpenJ9

j9vm_minFreeHeap
j9vm_maxFreeHeap
j9vm_minHeapSize
j9vm_maxHeapSize
j9vm_gcCompact
j9vm_gcThreads
j9vm_gcPolicy
j9vm_codeCacheTotal
j9vm_compilationThreads
j9vm_aggressiveOpts

The following describes how to approach tuning JVM in the following areas:

JVM Heap
JVM Garbage Collection
JVM compilation
JVM aggressive optimization

Tuning JVM Heap

The most relevant JVM parameters are the ones defining the boundaries of the allocated heap (j9vm_minHeapSize, j9vm_maxHeapSize). The upper bound to configure for this domain strongly depends on the memory in megabytes available on the host instance or on how much we are willing to allocate, while the lower bound depends on the minimum requirements to run the application.

The free heap parameters (j9vm_minFreeHeap, j9vm_maxFreeHeap) define the boundaries for the free space target ratio, which impacts the trigger thresholds of the garbage collector. The suggested starting ranges are from 0.1 to 0.6 for the minimum free ratio, and from 0.3 to 0.9 for the maximum.

The following represents a sample snippet of the section parametersSelection in the study definition:

parametersSelection:
  - name: jvm.j9vm_maxHeapSize
    domain: [<LOWER_BOUND>, <UPPER_BOUND>]
  - name: jvm.j9vm_minHeapSize
    domain: [<LOWER_BOUND>, <UPPER_BOUND>]

  - name: jvm.j9vm_minFreeHeap
    domain: [0.1, 0.6]
  - name: jvm.j9vm_maxFreeHeap
    domain: [0.3, 0.9]

It is also recommended to define the following constraints:

min heap size lower than or equal to the max heap size:

jvm.j9vm_minHeapSize <= jvm.j9vm_maxHeapSize

upper bound to be at least 5 percentage points higher than the lower bound

jvm.j9vm_minFreeHeap + 0.05 < jvm.j9vm_maxFreeHeap
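To see what these two constraints admit or reject, the following sketch evaluates them against candidate configurations expressed as plain dictionaries. The function and the sample values are hypothetical; in a study the constraints are evaluated by Akamas itself.

```python
def satisfies_heap_constraints(config):
    """Check the two constraints suggested above on a candidate configuration."""
    heap_ok = config["jvm.j9vm_minHeapSize"] <= config["jvm.j9vm_maxHeapSize"]
    # The max free ratio must exceed the min free ratio by more than 0.05.
    free_ok = config["jvm.j9vm_minFreeHeap"] + 0.05 < config["jvm.j9vm_maxFreeHeap"]
    return heap_ok and free_ok


# Hypothetical candidates (heap sizes in megabytes).
valid = {
    "jvm.j9vm_minHeapSize": 512,
    "jvm.j9vm_maxHeapSize": 2048,
    "jvm.j9vm_minFreeHeap": 0.3,
    "jvm.j9vm_maxFreeHeap": 0.6,
}
ratios_too_close = dict(valid, **{"jvm.j9vm_minFreeHeap": 0.58})
heap_inverted = dict(valid, **{"jvm.j9vm_minHeapSize": 4096})

valid_ok = satisfies_heap_constraints(valid)
close_ok = satisfies_heap_constraints(ratios_too_close)
inverted_ok = satisfies_heap_constraints(heap_inverted)
```

Note that the suggested domains for the two free heap ratios overlap (0.3 to 0.6), which is why the second constraint is needed to rule out combinations such as 0.58 and 0.6.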

Tuning JVM Garbage Collection

The following JVM parameters define the behavior of the garbage collector:

j9vm_gcPolicy
j9vm_gcThreads
j9vm_gcCompact

The garbage collection policy (j9vm_gcPolicy) defines the collection strategy used by the JVM. This parameter is key for the performance of the application: the default garbage collector (gencon) is the best solution for most scenarios, but some specific kinds of applications may benefit from one of the alternative options.

The number of GC threads (j9vm_gcThreads) defines the level of parallelism available to the collector. This value can range from 1 to the maximum number of CPUs that are available or we are willing to allocate.

The GC compaction (j9vm_gcCompact) selects if garbage collections perform full compactions always, never, or based on internal policies.

The following represents a sample snippet of the section parametersSelection in the study definition:

parametersSelection:
  - name: jvm.j9vm_gcPolicy
    categories: [balanced, gencon, metronome, optavgpause, optthruput]

  - name: jvm.j9vm_gcThreads
    domain: [1, <MAX_CPUS>]

  - name: jvm.j9vm_gcCompact

Tuning JVM compilation

The following JVM parameters define the behaviors of the compilation:

j9vm_compilationThreads
j9vm_codeCacheTotal

The compilation threads parameter (j9vm_compilationThreads) defines the number of threads available for the JIT compiler. Its range depends on the available CPUs.

The code cache parameter (j9vm_codeCacheTotal) defines the maximum size limit for the JIT code cache. Higher values may benefit complex server-type applications, at the expense of the memory footprint, so should be taken into account in the overall memory requirements.

The following represents a sample snippet of the section parametersSelection in the study definition:

parametersSelection:
  - name: jvm.j9vm_compilationThreads
    domain: [1, <MAX_CPUS>]

  - name: jvm.j9vm_codeCacheTotal
    domain: [2, <UPPER_BOUND>]

Tuning JVM aggressive optimizations

The following JVM parameter defines the behavior of aggressive optimization:

j9vm_aggressiveOpts

Aggressive optimizations (j9vm_aggressiveOpts) enable some experimental features that usually lead to performance gains.

The following represents a sample snippet of the section parametersSelection in the study definition:

parametersSelection:
  - name: jvm.j9vm_aggressiveOpts

Guidelines for Oracle Database

This page provides a list of best practices when optimizing an Oracle database with Akamas.

Memory Allocation Sub-spaces

This section provides some guidelines on the most relevant memory-related parameters and how to configure them to perform a high-level optimization of a generic Oracle Database instance.

Oracle DBAs can choose, depending on their needs or expertise, the desired level of granularity when configuring the memory allocated to the database areas and components, and let the Oracle instance automatically manage the lower layers. In the same way, Akamas can tune a target instance with different levels of granularity.

In particular, we can configure an Akamas study so that it simply tunes the overall memory of the instance, leaving Oracle to automatically manage how to allocate it between shared memory (SGA) and program memory (PGA); alternatively, we can tune the target values of both of these areas and let Oracle take care of their components, or go even deeper and have total control of the sizing of every single component.

Notice: running the queries in this guide requires a user with the ALTER SYSTEM, SELECT ON V_$PARAMETER, and SELECT ON V_$OSSTAT privileges.

Also notice that to define the domain of some of the parameters you need to know the physical memory of the instance. You can find the value in MiB running the query select round(value/1024/1024)||'M' "physical_memory" from v$osstat where stat_name='PHYSICAL_MEMORY_BYTES'. Otherwise, if you have access to the underlying machine, you can run the bash command free -m

Tuning the Total Memory

This is the simplest of the memory-optimization set of parameters, where the study configures only the overall memory available for the instance and lets Oracle’s Automatic Memory Management (AMM) dynamically assign the space to the SGA and PGA. This is useful for simple studies where you want to minimize the overall used memory, usually coupled with constraints to make sure the performances of the overall system remain within acceptable values.

memory_target: this parameter specifies the total memory used by the Oracle instance. When AMM is enabled, you can find the default value with the query select display_value "memory_target" from v$parameter where name='memory_target'. Otherwise, you can get an estimate by summing the configured SGA size, found by running select display_value "sga_target" from v$parameter where name LIKE 'sga_target', and the size of the PGA, found with select ceil(value/1024/1024)||'M' "physical_memory" from v$pgastat where name='maximum PGA allocated'. The explored domain strongly depends on your application and hardware, but an acceptable range goes from 152M (the minimum configurable value) to the physical size of your instance. Over time, Akamas will automatically learn to avoid configurations without enough memory.
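When AMM is not enabled, the estimate described above is just the sum of two sizes returned as strings. A minimal sketch of that arithmetic follows; it assumes both values carry a K, M, G or T suffix, and the sample values are hypothetical.

```python
_UNIT_TO_MIB = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def to_mib(display_value):
    """Convert a size string such as '1536M' or '4G' to MiB."""
    text = display_value.strip().upper()
    if not text or text[-1] not in _UNIT_TO_MIB:
        raise ValueError(f"expected a size with a K/M/G/T suffix: {display_value!r}")
    return float(text[:-1]) * _UNIT_TO_MIB[text[-1]]


def estimate_memory_target_mib(sga_target, max_pga_allocated):
    """Estimate the total instance memory as SGA size plus peak PGA size."""
    return round(to_mib(sga_target) + to_mib(max_pga_allocated))


# Hypothetical query results.
estimate = estimate_memory_target_mib("4G", "512M")
```

The resulting figure is only a starting point for choosing the baseline value of memory_target.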

To configure the Automatic Memory Management you also need to make sure that the parameters sga_target and pga_aggregate_limit are set to 0 by configuring them among the default values of a study, or manually running the configuration queries.

The following snippet shows the parameter selection to tune the total memory of the instance. The domain is configured to go from the minimum value to the maximum physical memory (7609M in our example).

parametersSelection:
- name: ora.memory_target
  domain: [152, 7609]

Tuning the Shared and Program Memory Global Areas

With the following set of parameters, Akamas tunes the individual sizes of the SGA and PGA, letting Oracle’s Automatic Shared Memory Management (ASMM) dynamically size their underlying SGA components. You can leverage these parameters for studies where, like the previous scenario, you want to find the configuration with the lowest memory allocation that still performs within your SLOs. Another possible scenario is to find the balance in the allocation of the memory available that best fits your optimization goals.

sga_target: this parameter specifies the target SGA size. When ASMM is configured, you can find the default value with the query select display_value "sga_target" from v$parameter where name='sga_target'. The explored domain strongly depends on your application and hardware, but an acceptable range goes from 64M (the minimum configurable value) to the physical size of your instance minus a reasonable size for the PGA (usually up to 80% of physical memory).
pga_aggregate_target: this parameter specifies the target PGA size. You can find the default value with the query select display_value "pga_aggregate_target" from v$parameter where name='pga_aggregate_target'. The explored domain strongly depends on your application and hardware, but an acceptable range goes from 10M (the minimum configurable value) to the physical size of your instance minus a reasonable size for the SGA.

To tune the SGA and PGA, you must also set memory_target to 0 to disable AMM, either by configuring it among the default values of a study or by manually running the configuration queries. ASMM will dynamically tune all the SGA components whose size is not specified, so set all the component parameters (db_cache_size, log_buffer, java_pool_size, large_pool_size, shared_pool_size, and streams_pool_size) to 0 unless you have specific requirements.

The following snippet shows the parameter selection to tune both SGA and PGA sizes. Each parameter is configured to go from the minimum value to 90% of the maximum physical memory (6848M in our example), allowing Akamas to explore all the possible ways to partition the space between the two areas and find the best configuration for our use case:

parametersSelection:
- name: ora.sga_target
  domain: [64, 6848]
- name: ora.pga_aggregate_target
  domain: [10, 6848]

The following code snippet forces Akamas to explore configuration spaces where the total memory, expressed in MiB, does not exceed the total memory available (7609M in our example). This allows speeding up the optimization avoiding configurations that won’t work correctly.

parameterConstraints:
- name: Limit total memory
  formula: ora.sga_target + ora.pga_aggregate_target <= 7609
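The numbers used in this example can be derived from the physical memory with simple integer arithmetic, as in the sketch below. The helper names are hypothetical, and the 7609 MiB figure is the example value used in the text.

```python
PHYSICAL_MEMORY_MIB = 7609  # example instance from the text


def upper_bound(physical_mib, percent=90):
    """Upper bound of a domain as a percentage of physical memory (in MiB)."""
    return physical_mib * percent // 100


def within_total_memory(sga_mib, pga_mib, physical_mib=PHYSICAL_MEMORY_MIB):
    """Mirror of the 'Limit total memory' constraint."""
    return sga_mib + pga_mib <= physical_mib


bound = upper_bound(PHYSICAL_MEMORY_MIB)
sga_domain = [64, bound]
pga_domain = [10, bound]

# Both parameters at their upper bound would not fit in physical memory,
# which is why the constraint is needed on top of the domains.
both_at_max = within_total_memory(bound, bound)
balanced = within_total_memory(4096, 2048)
```

Keeping both domains wide and delegating the exclusion of impossible combinations to the constraint lets the optimizer explore any split between the two areas.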

Tuning the Shared Memory

With the following set of parameters, Akamas tunes the space allocated to one or more of the components that make up the System Global Area, along with the size of the Program Global Area. This scenario is useful for studies where you want to find the memory distribution that best fits your optimization goals.

pga_aggregate_target: this parameter specifies the size of the PGA. You can find the default value with the query select display_value "pga_aggregate_target" from v$parameter where name='pga_aggregate_target'. The explored domain strongly depends on your application and hardware, but an acceptable range goes from 10M (the minimum configurable value) to the physical size of your instance.
db_cache_size: this parameter specifies the size of the default buffer pool. You can find the default value with the query select * from v$sgainfo where name='Buffer Cache Size'.
log_buffer: this parameter specifies the size of the log buffer. You can find the default value with the query select * from v$sgainfo where name='Redo Buffers'.
java_pool_size: this parameter specifies the size of the java pool. You can find the default value with the query select * from v$sgainfo where name='Java Pool Size'.
large_pool_size: this parameter specifies the size of the large pool. You can find the default value with the query select * from v$sgainfo where name='Large Pool Size'.
streams_pool_size: this parameter specifies the size of the streams pool. You can find the default value with the query select * from v$sgainfo where name='Streams Pool Size'.
shared_pool_size: this parameter specifies the size of the shared pool. You can find the default value with the query select * from v$sgainfo where name='Shared Pool Size'.

The explored domains of the SGA components strongly depend on your application and hardware; one approach is to scale the baseline value both up and down by a reasonable factor to define the domain boundaries (e.g. from 20% to 500% of the baseline).
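The scaling approach can be sketched as follows. The 20% and 500% factors come from the example above, while the function name, the floor of 1 MiB, and the baseline values are assumptions made for illustration.

```python
def scaled_domain(baseline_mib, low_pct=20, high_pct=500):
    """Build a [lower, upper] domain by scaling the baseline value (in MiB)."""
    lower = max(1, baseline_mib * low_pct // 100)  # assumed floor of 1 MiB
    upper = baseline_mib * high_pct // 100
    return [lower, upper]


# Hypothetical baseline sizes read from v$sgainfo, converted to MiB.
shared_pool_domain = scaled_domain(100)
log_buffer_domain = scaled_domain(3)
```

The factors are a starting point: they can be widened or narrowed per component depending on how much of the memory space you want the study to explore.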

To tune all the components set both the memory_target and sga_target parameters to 0 by configuring them among the default values of a study, or manually running the configuration queries.

Notice: if your system leverages non-standard block-size buffers you should consider tuning also the db_Nk_cache_size parameters.

The following snippet shows the parameter selection to tune the size of the PGA and the SGA components. The PGA parameter is configured to go from the minimum value to 90% of the maximum physical memory (6848M in our example), while the domains for the SGA components are configured by scaling their default value by approximately a factor of 10. Along with the constraint defined below, these domains give Akamas great flexibility while exploring how to distribute the available memory space:

parametersSelection:
- name: ora.pga_aggregate_target
  domain: [10, 6848]
- name: ora.db_cache_size
  domain: [128, 6848]
- name: ora.log_buffer
  domain: [1, 128]
- name: ora.java_pool_size
  domain: [4, 240]
- name: ora.large_pool_size
  domain: [12, 1024]
- name: ora.shared_pool_size
  domain: [12, 1024]

The following code snippet forces Akamas to explore configuration spaces where the total memory, expressed in MiB, does not exceed the total memory available (7609M in our example).

parameterConstraints:
- name: Limit total memory
  formula: ora.db_cache_size + ora.log_buffer + ora.java_pool_size + ora.large_pool_size + ora.shared_pool_size + ora.pga_aggregate_target <= 7609

You should also add to the equation any db_Nk_cache_size tuned in the study.
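
To make the effect of the constraint concrete, here is a small Python sketch that evaluates the same inequality on a candidate configuration. The function name and the sample values are hypothetical; Akamas evaluates the constraint itself, so this is only a way to sanity-check a baseline or a manually chosen configuration:

```python
# Parameters that take part in the memory constraint (values in MiB).
MEMORY_PARAMETERS = (
    "ora.db_cache_size",
    "ora.log_buffer",
    "ora.java_pool_size",
    "ora.large_pool_size",
    "ora.shared_pool_size",
    "ora.pga_aggregate_target",
)


def satisfies_memory_limit(config, limit_mib=7609):
    """Return True if the sum of the memory parameters stays within the limit."""
    total = sum(config[name] for name in MEMORY_PARAMETERS)
    return total <= limit_mib


# Hypothetical candidate configuration, for illustration only.
candidate = {
    "ora.db_cache_size": 4096,
    "ora.log_buffer": 64,
    "ora.java_pool_size": 128,
    "ora.large_pool_size": 256,
    "ora.shared_pool_size": 512,
    "ora.pga_aggregate_target": 2048,
}
print(satisfies_memory_limit(candidate))
```

Any db_Nk_cache_size parameter tuned in the study would need to be appended to MEMORY_PARAMETERS, mirroring the formula.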

Guidelines for PostgreSQL

Suggested Parameters

When running a PostgreSQL optimization, consider starting from these recommendations:

Parameter

Recommendation

Guidelines for defining optimization studies

This section provides some guidelines on how to define optimization studies by means of a few examples related to single-technology/layer systems, in particular on how to define workflows and telemetry providers.

More complex real-world examples are provided by the guide.

Optimizing Linux

When optimizing Linux systems, typically the goal is to allow cost savings or improve performance and the quality of service, such as sustaining higher levels of traffic or enabling transactions with lower latency.

Please refer to the Linux optimization pack for the list of component types, parameters, metrics, and constraints.

Workflows

Applying parameters

Akamas provides the LinuxConfigurator operator as the preferred way to apply Linux parameters to a system to be optimized. The operator connects via SSH to your Linux components and employs different strategies to apply Linux parameters. Note that this operator allows you to exclude some block/network devices from being configured.

A typical workflow

You can organize a typical workflow to optimize Linux in three parts:

Configure Linux
- Use the LinuxConfigurator operator to apply configuration parameters to the operating system; no restart is required
Test the performance of the system
- Use available operators to execute a performance test against the system
Perform some cleanup
- Use available operators to perform any cleanup needed to guarantee that subsequent executions of the workflow run without problems

Here’s an example of a typical workflow for a Linux system:

Telemetry Providers

Optimizing Java OpenJDK

When optimizing Java applications based on OpenJDK, typically the goal is to tune the JVM from both the point of view of cost savings and quality of service.

Please refer to the Java OpenJDK optimization pack for the list of component types, parameters, metrics, and constraints.

Workflows

Applying parameters

Akamas offers many operators that you can use to apply the parameters for the tuned JVM. In particular, it is suggested to use the FileConfigurator operator to create a configuration file or inject the arguments directly in the command string using a template.
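
The templating idea can be sketched in a few lines of Python. This is not the operator's implementation: the placeholder syntax shown is the one of Python's string.Template, and the placeholder names, the JVM options chosen, and the app.jar path are all illustrative assumptions:

```python
from string import Template

# Hypothetical command template; placeholder names are illustrative only.
COMMAND_TEMPLATE = Template(
    "java -Xmx${max_heap_mb}m -XX:+Use${gc_type}GC -jar app.jar"
)


def render_command(template, parameters):
    """Replace every placeholder; raises KeyError if a value is missing."""
    return template.substitute(parameters)


command = render_command(COMMAND_TEMPLATE, {"max_heap_mb": 2048, "gc_type": "G1"})
print(command)
```

Using substitute rather than safe_substitute makes a missing parameter fail loudly, which is usually preferable to launching the JVM with a literal placeholder in its arguments.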

The following is an example of a templatized execution string:

A typical workflow

A typical workflow to optimize a Java application can be structured in two parts:

Configure the Java arguments
1. Generate a configuration file or a command string containing the selected JVM parameters using a FileConfigurator.
Run the Java application
1. Use available operators to execute a performance test against the application.

Here’s an example of a typical workflow where Akamas executes the script containing the command string generated by the file configurator:

Telemetry Providers

Here’s a configuration example for a telemetry provider instance that uses Prometheus to extract all the JMX metrics defined in this optimization pack:

where the configuration of the monitored component provides the additional references as in the following snippet:

Examples

Optimizing OpenJ9

When optimizing Java applications based on OpenJ9, typically the goal is to tune the JVM from both the point of view of cost savings and quality of service.

Please refer to the OpenJ9 optimization pack for the list of component types, parameters, metrics, and constraints.

Workflows

Applying parameters

Akamas offers many operators that you can use to apply the parameters for the tuned JVM. In particular, it is suggested to use the FileConfigurator operator to create a configuration file or inject the arguments directly in the command string using a template.

The following is an example of a templatized execution string:

A typical workflow

A typical workflow to optimize a Java application can be structured in two parts:

Configure the Java arguments
1. Generate a configuration file or a command string containing the selected JVM parameters using a FileConfigurator.
Run the Java application
1. Use available operators to execute a performance test against the application.

Here’s an example of a typical workflow where Akamas executes the script containing the command string generated by the file configurator:

Telemetry Providers

Here’s a configuration example for a telemetry provider instance that uses Prometheus to extract all the JMX metrics defined in this optimization pack:

where the configuration of the monitored component provides the additional references as in the following snippet:

Examples

Optimizing Web Applications

This page provides some guidance on optimizing web applications. Please refer to the Web Application optimization pack for the list of component types, parameters, metrics, and constraints.

Telemetry configuration

No specialized telemetry solution to gather Web Application metrics is included. The following providers however can integrate with the provided metrics:

CSV File Provider: this provider can be configured to ingest data points generated by any monitoring application able to export the data in CSV format.
Integrations leveraging NeoLoad Web, LoadRunner Professional, or LoadRunner Enterprise as a load generator can use the corresponding ad-hoc provider, which comes out of the box and uses the metrics defined in this optimization pack.

Workflows

Applying parameters

The provided component type does not define any parameter. The workflow will optimize parameters defined in other component types representing the underlying technological stack.

A typical workflow

A typical workflow to optimize a web application is structured in three parts:

Configure and restart the application
1. Use the FileConfigurator operator to interpolate the tuned parameters in the configuration files of the underlying stack.
2. Restart the application using an Executor operator.
3. Wait for the application to come up using the Sleep or Executor operator.
Run the test
1. Use any of the available operators to trigger the execution of the performance test against the application.
Perform the cleanup
1. Use any of the available operators to restore the application to the original state.

Here's an example workflow to perform a test on a Java web application using NeoLoad as a load generator:

name: "webapp workflow"
tasks:
  - name: Set Java parameters
    operator: FileConfigurator
    arguments:
      source:
        hostname: myapp.mycompany.com
        username: ubuntu
        key: # ...
        path: /home/ubuntu/conf_template
      target:
        hostname: myapp.mycompany.com
        username: ubuntu
        key: # ...
        path: /home/ubuntu/conf

  - name: Restart application
    operator: Executor
    arguments:
      command: "/home/ubuntu/myapp_down.sh; /home/ubuntu/myapp_sh -opts '/home/ubuntu/conf'"
      host:
        hostname: myapp.mycompany.com
        username: ubuntu
        key: # ...

  - name: Run NeoLoadWeb load test
    operator: NeoLoadWeb
    arguments:
      accountToken: NLW_TOKEN
      projectFile:
        # NeoLoad projectfile location ...
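
The workflow above does not include the optional wait step. One way to implement it is a small readiness script launched by an Executor task before the load test; the following Python sketch polls a TCP port until the application accepts connections. The function name, the defaults, and the choice of a plain TCP check (rather than an application-specific health endpoint) are assumptions made for illustration:

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=1.0):
    """Return True once a TCP connection succeeds, False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, a script could call wait_for_port("myapp.mycompany.com", 8080) and exit with a non-zero status when it returns False, so that the experiment fails early instead of testing an application that never started; the port number here is a placeholder.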

Examples

See this page for an example of a study leveraging the Web Application pack.

Optimizing Kubernetes

When optimizing Kubernetes applications, typically the goal is to find the configuration that assigns resources to containerized applications so as to minimize waste and ensure the quality of service.

Please refer to the Kubernetes optimization pack for the list of component types, parameters, metrics, and constraints.

Workflows

Applying parameters

Akamas offers different operators to configure Kubernetes entities. In particular, you can use the FileConfigurator operator to update the definition file of a resource and apply it with the Executor operator.

The following example is the definition of a deployment, where the replicas and resources are templatized in order to work with the FileConfigurator:

A typical workflow

A typical workflow to optimize a Kubernetes application is usually structured as follows:

Wait for the application to be ready: run a custom script to wait until the rollout is complete.
Run the test: execute the benchmark.
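
The custom script mentioned in the first step could be as simple as a wrapper around kubectl rollout status. The following Python sketch assumes kubectl is installed and configured on the host running the script; the function names, the default namespace, and the timeout are illustrative:

```python
import subprocess


def rollout_status_command(deployment, namespace="default", timeout_s=300):
    """Build the kubectl command that blocks until the rollout completes."""
    return [
        "kubectl", "rollout", "status", f"deployment/{deployment}",
        "--namespace", namespace,
        f"--timeout={timeout_s}s",
    ]


def wait_for_rollout(deployment, namespace="default", timeout_s=300):
    """Return True if kubectl reports a completed rollout before the timeout."""
    result = subprocess.run(
        rollout_status_command(deployment, namespace, timeout_s), check=False
    )
    return result.returncode == 0
```

Returning a boolean lets the calling script decide how to fail the experiment, for instance by exiting with a non-zero status when the rollout does not complete in time.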

Here’s an example of a typical workflow for a system:

Telemetry Providers

Here’s a configuration example for a telemetry provider instance that uses Prometheus to extract all the Kubernetes metrics defined in this optimization pack:

where the configuration of the monitored component provides the additional filters as in the following snippet:

Please keep in mind that some resources, such as pods belonging to deployments, require wildcards in order to match the auto-generated names.
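
To see why a wildcard is needed, consider how pods created by a deployment are named. The Python sketch below uses glob-style patterns purely as an illustration; the pod names are made up, and the actual filter syntax to use depends on the telemetry provider:

```python
from fnmatch import fnmatchcase


def matching_pods(pod_names, pattern):
    """Return the pod names matching a glob-style pattern (case-sensitive)."""
    return [name for name in pod_names if fnmatchcase(name, pattern)]


# Hypothetical pod names: deployments append generated suffixes to the base name.
pods = [
    "frontend-7d9c5b6f4-x2x8k",
    "frontend-7d9c5b6f4-p9q1z",
    "backend-5c6d7e8f9-abcde",
]
print(matching_pods(pods, "frontend-*"))
```

An exact match on "frontend" would select nothing here, since every pod name carries an auto-generated suffix.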

Examples

Optimizing Spark

When optimizing applications running on the Apache Spark framework, the goal is to find the configurations that best optimize the allocated resources or the execution time.

Please refer to the Spark optimization pack for the list of component types, parameters, metrics, and constraints.

Workflows

Applying parameters

Akamas offers several operators that you can use to apply the parameters for the tuned Spark application. In particular, we suggest using the Spark SSH Submit operator, which connects to a target instance to submit the application using the configuration parameters to test.

A typical workflow

You can organize a typical workflow to optimize a Spark application in three parts:

Set up the test environment
1. prepare any required input data
2. apply the Spark configuration parameters, if you are going for a file-based solution
Execute the Spark application
Perform cleanup

Here’s an example of a typical workflow where Akamas executes the Spark application using the Spark SSH Submit operator:

name: Spark workflow
tasks:
   - name: cwspark
     arguments:
        master: yarn
        deployMode: cluster
        file: /home/hadoop/scripts/pi.py
        args: [ 100 ]

Telemetry Providers

Akamas can access Spark History Server statistics using the Spark History Server Provider. This provider maps the metrics in this optimization pack to the statistics provided by the History Server endpoint.

Here’s a configuration example for a telemetry provider instance:

provider: SparkHistoryServer
config:
  address: sparkmaster.akamas.io
  port: 18080
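
Before running a study, it can be useful to check that the address and port point to a reachable History Server. The Spark History Server exposes a REST API under /api/v1; the following Python sketch builds the applications endpoint from the same two settings and lists the application IDs it returns. The helper names and the use of plain HTTP are assumptions:

```python
import json
from urllib.request import urlopen


def applications_url(config, scheme="http"):
    """Build the History Server REST endpoint that lists applications."""
    return f"{scheme}://{config['address']}:{config['port']}/api/v1/applications"


def list_application_ids(config, timeout_s=10):
    """Query the History Server and return the IDs of the known applications."""
    with urlopen(applications_url(config), timeout=timeout_s) as response:
        return [app["id"] for app in json.load(response)]


provider_config = {"address": "sparkmaster.akamas.io", "port": 18080}
print(applications_url(provider_config))
```

Calling list_application_ids(provider_config) from a host that can reach the server should return a non-empty list once at least one application has completed or is running.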

Examples

See this page for an example of a study leveraging the Spark pack.

Optimizing Oracle Database

When optimizing an Oracle Database instance, typically the goal is to maximize the throughput of an Oracle-backed application or to minimize its resource consumption, thus reducing costs.

Please refer to the Oracle Database optimization pack for the list of component types, parameters, metrics, and constraints.

Workflows

Applying parameters

One common way to configure Oracle parameters is through the execution of ALTER SYSTEM statements on the database instance: to automate the execution of this task, Akamas provides the Oracle Configurator operator. For finer control, Akamas provides the FileConfigurator operator, which allows building custom statements in a script file that can be executed by the Executor operator.
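
As an illustration of what such statements look like, the following Python sketch turns a set of tuned values into ALTER SYSTEM statements. The function name, the ora. prefix handling, and the choice of SCOPE=SPFILE (which takes effect only after an instance restart) are assumptions made for this example, and values are assumed to be already validated:

```python
def alter_system_statements(parameters, scope="SPFILE"):
    """Build one ALTER SYSTEM statement per parameter (values assumed validated)."""
    statements = []
    for name, value in parameters.items():
        # Drop a component prefix such as "ora." if present.
        oracle_name = name.split(".", 1)[-1]
        statements.append(f"ALTER SYSTEM SET {oracle_name}={value} SCOPE={scope};")
    return statements


# Hypothetical tuned values, for illustration only.
tuned = {"ora.db_cache_size": "512M", "ora.pga_aggregate_target": "1024M"}
for statement in alter_system_statements(tuned):
    print(statement)
```

The generated lines can be written to a script file and run with the SQL client of your choice.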

Oracle Configurator

The Oracle Configurator operator allows the workflow to configure an on-premise instance with minimal configuration. The following snippet is an example of a configuration task, where all the connection arguments are already defined in the referenced component:

File Configurator and Executor

Most cloud providers offer web APIs as the only way to configure database services. In this case, the Executor operator can submit an API request through a custom executable using a configuration file generated by a FileConfigurator. The following is an example workflow where a FileConfigurator task generates a configuration file (oraconf), followed by an Executor task that parses and submits the configuration to the API endpoint through a custom script (api_update_db_conf.sh):

A typical workflow

The optimization of an Oracle database usually includes the following tasks in the workflow, as implemented in the example below:

Apply the Oracle configuration suggested by Akamas and restart the instance if needed (Update parameters task).
Perform any additional warm-up task that may be required to bring the database to its operating regime (Execute warmup task).
Execute the workload targeting the database or the front-end in front of it (Execute performance test task).
Restore the original state of the database in order to guarantee the consistency of further tests, removing any dirty data added by the workload and possibly flushing the database caches (Cleanup task).

The following is the complete YAML configuration file of the workflow described above:

Telemetry Providers

The following example shows how to configure a telemetry instance for a Prometheus provider in order to query the data points extracted from the exporter described above:

Examples

Optimizing MongoDB

When optimizing a MongoDB instance, typically the goal is one of the following:

Throughput optimization - increasing the capacity of a MongoDB deployment to serve clients
Cost optimization - decreasing the size of a MongoDB deployment while guaranteeing the same service level

To reach such goals, it is recommended to tune the parameters that manage the cache, which is one of the elements that impact performance the most, in particular those parameters that control the lifecycle and the size of MongoDB’s cache.

Even though it is possible to evaluate performance improvements of MongoDB by looking at the business application that uses it as its database (through the end-to-end throughput or response time) or by using a performance test like YCSB, the optimization pack also provides internal MongoDB metrics that can shed light on how MongoDB is performing, in particular in terms of throughput, for example:

The number of documents inserted in the database per second
The number of active connections
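
The first metric above is a rate, which monitoring tools typically derive from a cumulative counter sampled at two points in time. A minimal Python sketch of that calculation follows; the function name, the sample values, and the way a counter reset is handled are illustrative assumptions, not the way the optimization pack computes the metric:

```python
def per_second_rate(previous_count, current_count, elapsed_seconds):
    """Convert two readings of a cumulative counter into a per-second rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = current_count - previous_count
    if delta < 0:
        # The counter restarted (e.g. the server was restarted): count from zero.
        delta = current_count
    return delta / elapsed_seconds


# Hypothetical readings of the inserted-documents counter taken 60 seconds apart.
print(per_second_rate(120_000, 150_000, 60))
```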

Please refer to the MongoDB optimization pack for the list of component types, parameters, metrics, and constraints.

Workflows

Applying parameters

Akamas offers many operators that you can use to apply freshly tuned configuration parameters to your MongoDB deployment. In particular, we suggest using the FileConfigurator operator to create a configuration script file and the Executor operator to execute it and thus apply the parameters.

FileConfigurator and Executor operator

Here’s an example of the aforementioned template file:

You can leverage the FileConfigurator by creating a template file on a remote host that contains some scripts to configure MongoDB with placeholders that will be replaced with the values of parameters tuned by Akamas. Once the FileConfigurator has replaced all the tokens, you can use the Executor operator to actually execute the script to configure MongoDB.

A typical workflow

A typical workflow to optimize a MongoDB deployment can be structured in four parts:

Configure MongoDB
Test the performance of the application
Prepare test results (optional)
Cleanup

Finally, when running performance experiments on a database, it is common practice to execute some cleanup tasks at the end of the test to restore the database's initial condition and avoid impacting subsequent tests.

Here’s an example of a typical workflow for a MongoDB deployment, which uses the YCSB benchmark to run performance tests:

Telemetry providers

Here’s an example of a telemetry provider instance that uses Prometheus to extract all the MongoDB metrics defined in this optimization pack:

Examples

Optimizing MySQL Database

When optimizing a MySQL instance, typically the goal is one of the following:

Throughput optimization: increasing the capacity of a MySQL deployment to serve clients
Cost optimization: decreasing the size of a MySQL deployment while guaranteeing the same service level

Please refer to the MySQL optimization pack for the list of component types, parameters, metrics, and constraints.

Workflows

Applying parameters

Usually, MySQL parameters are configured by writing them in the MySQL configuration file, typically called my.cnf, and located under /etc/mysql/ on most Linux systems.

To keep the original configuration file intact, it is best practice to use additional configuration files, located in /etc/mysql/conf.d, to override the default parameters. These files are automatically read by MySQL.

FileConfigurator and Executor operator

You can leverage the FileConfigurator by creating a template file on a remote host that contains some scripts to configure MySQL with placeholders that will be replaced with the values of parameters tuned by Akamas. Once the FileConfigurator has replaced all the placeholders, the Executor operator can be used to actually execute the script to configure and restart the database.
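
The following Python sketch illustrates the placeholder-replacement idea for a MySQL override file. It is not the FileConfigurator's implementation: the placeholder syntax is the one of Python's string.Template, and the file name, the two parameters, and their values are illustrative assumptions:

```python
from pathlib import Path
from string import Template

# Hypothetical override template; the two parameters are just examples.
OVERRIDE_TEMPLATE = Template(
    "[mysqld]\n"
    "innodb_buffer_pool_size = ${innodb_buffer_pool_size}\n"
    "max_connections = ${max_connections}\n"
)


def render_override(parameters):
    """Fill the template; raises KeyError if a placeholder has no value."""
    return OVERRIDE_TEMPLATE.substitute(parameters)


def write_override(conf_dir, parameters, filename="akamas.cnf"):
    """Write the rendered override file into conf_dir and return its path."""
    target = Path(conf_dir) / filename
    target.write_text(render_override(parameters))
    return target


print(render_override({"innodb_buffer_pool_size": 1073741824, "max_connections": 200}))
```

Writing the result to a separate file under the conf.d directory leaves my.cnf untouched, so restoring the original configuration only requires deleting the override file and restarting MySQL.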

A typical workflow

A typical workflow to optimize a MySQL deployment can be structured in four parts:

Configure MySQL
1. Use the FileConfigurator operator to specify an input and an output template file. The input template file is used to specify how to interpolate MySQL parameters into a configuration file, and the output file is used to contain the result of the interpolation.
Restart MySQL
1. Use the Executor operator to restart MySQL, allowing it to load the new configuration file produced in the previous step.
2. Optionally, use the Executor operator to verify that the application is up and running and has finished any initialization logic.
Test the performance of the application
1. Use any of the available operators to perform a performance test against the application.
Prepare test results
1. Use any of the available operators to organize test results so that they can be imported into Akamas using the supported telemetry providers (see also the section below).

Finally, when running performance experiments on databases, it is common practice to do some cleanup tasks at the end of the test to restore the database's initial condition and avoid impacting subsequent tests.

Here’s an example of a typical workflow for MySQL, which uses the OLTP Resourcestresser benchmark to run performance tests:

Telemetry providers

Here’s an example of a telemetry provider instance that uses Prometheus to extract all the MySQL metrics defined in this optimization pack:

Examples

Optimizing PostgreSQL

When optimizing a PostgreSQL instance, typically the goal is one of the following:

Throughput optimization: increasing the number of transactions
Cost optimization: minimizing resource consumption according to a typical workload, thus cutting costs

Please refer to the PostgreSQL optimization pack for the list of component types, parameters, metrics, and constraints.

Workflow

Applying parameters

Akamas offers many operators that you can use to apply the parameters for the tuned PostgreSQL instances. In particular, we suggest using the FileConfigurator operator for parameter templating and configuration, and the Executor operator for restoring DB data and launching scripts.

A typical optimization process involves the following steps: