When a testing environment is not available, or when it is hard to build representative load tests, Akamas can directly optimize production environments by running a live optimization study. Production environments differ from test environments in many ways. The main aspects that affect how Akamas can optimize the system in this scenario, and that define live optimization studies, are the following:
Safety, in terms of application stability and performance, is critical in production environments where SLOs might be in place.
The approval process is usually different between production and lower-level environments. In many cases, a configuration change in a production environment must be manually approved by the SRE or Application team and follow a custom deployment scenario.
The workload on the application in a production environment is usually not controlled: it might change with the time of day, because of special events, or due to other external factors.
These are the main factors that make live optimization studies differ from offline optimizations.
The following figure represents the iterative process associated with live optimizations:
The following 5 phases can be identified for each iteration:
Collect KPIs: Akamas collects the system metrics required to observe its behavior under the current parameter configuration by leveraging the associated telemetry provider. In this phase Akamas also observes and categorizes the different workload contexts, which are then used to recommend configurations that are appropriate for each specific workload context.
Score vs Goal: Akamas scores the applied parameter configuration under the specific workload context against the defined goal and constraints.
Recommend Conf: Akamas provides a recommendation for parameter configuration based on the observed behavior and leveraging the Akamas AI.
Human Approval: the recommendation is inspected, possibly revisited, and approved by users before being applied to the system. This step is optional and can be automated.
Apply Conf: Akamas applies the recommended configuration by leveraging the defined workflow.
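The five phases above can be sketched as a simple loop. This is only an illustration of the control flow, not the Akamas implementation: the `LiveIteration` class and all of its callables (`collect_kpis`, `score`, `recommend`, `approve`, `apply_conf`) are hypothetical names introduced here, and the stub functions stand in for the telemetry provider, the Akamas AI, and the workflow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Config = Dict[str, float]
Metrics = Dict[str, float]


@dataclass
class LiveIteration:
    """One iteration of a live optimization (all callables are hypothetical stand-ins)."""
    collect_kpis: Callable[[], Metrics]           # phase 1: Collect KPIs
    score: Callable[[Config, Metrics], float]     # phase 2: Score vs Goal
    recommend: Callable[[Config, float], Config]  # phase 3: Recommend Conf
    apply_conf: Callable[[Config], None]          # phase 5: Apply Conf
    # phase 4: Human Approval (optional); returns the reviewed config, or None to reject
    approve: Optional[Callable[[Config], Optional[Config]]] = None

    def run_once(self, current: Config) -> Config:
        metrics = self.collect_kpis()
        score = self.score(current, metrics)
        candidate = self.recommend(current, score)
        if self.approve is not None:
            reviewed = self.approve(candidate)
            if reviewed is None:
                return current  # rejected: the current configuration stays in place
            candidate = reviewed
        self.apply_conf(candidate)
        return candidate


# Stubbed usage: every value below is made up for illustration.
applied = []
loop = LiveIteration(
    collect_kpis=lambda: {"response_time_ms": 250.0},
    score=lambda cfg, m: m["response_time_ms"],
    recommend=lambda cfg, s: {**cfg, "cpu_limit": cfg["cpu_limit"] * 0.95},
    apply_conf=applied.append,
    approve=lambda cfg: cfg,  # approve the recommendation unchanged
)
new_cfg = loop.run_once({"cpu_limit": 2.0})

rejecting = LiveIteration(
    collect_kpis=loop.collect_kpis,
    score=loop.score,
    recommend=loop.recommend,
    apply_conf=applied.append,
    approve=lambda cfg: None,  # reviewer rejects the recommendation
)
kept_cfg = rejecting.run_once({"cpu_limit": 2.0})
```

Leaving `approve` unset corresponds to the fully automated case, where the recommendation is applied without a manual review.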
Overall, the core process is very similar to that of offline optimization studies. The main difference is the (optional) presence of a manual configuration review and approval step.
Even if the process is similar, the way recommended configurations are generated is quite different as it's subject to some safety policies such as:
The exploration factor defines the maximum magnitude of the change of a parameter from one configuration to the next (e.g. reducing the CPU limit of a container by at most 10%). Because changes are smaller in magnitude, their effect on the system is also smaller. This leads to safer optimizations, as the optimization can better track changes in the core metrics. As a side effect, a live optimization might take more time to fully optimize a configuration than an offline study.
The safety factor defines how tight the constraints defined in the study are. As the configuration changes, some metrics might approach a limit imposed by the constraints. As an example, if we set a response time threshold of 300 ms, Akamas will keep track of how the response time changes due to the configuration changes and react to keep the constraint fulfilled. The safety factor influences how quickly Akamas reacts to approaching constraints.
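To make the two policies more concrete, here is a minimal sketch of one possible interpretation. It is not how Akamas computes its recommendations: `clamp_change` and `effective_threshold` are hypothetical helpers, and modeling the safety factor as a relative margin on an upper-bound constraint is an assumption made only for illustration.

```python
def clamp_change(current: float, proposed: float, exploration_factor: float) -> float:
    """Limit the step from `current` to `proposed` to a fraction of `current`.

    Assumption: the exploration factor is read as a relative bound,
    e.g. 0.10 means "change this parameter by at most 10% per step".
    """
    max_step = abs(current) * exploration_factor
    step = max(-max_step, min(max_step, proposed - current))
    return current + step


def effective_threshold(threshold: float, safety_factor: float) -> float:
    """Tighten an upper-bound constraint by a relative safety margin.

    Assumption: a larger safety factor makes the optimizer treat the
    constraint as violated earlier, before the real limit is reached.
    """
    return threshold * (1.0 - safety_factor)


# A proposal to halve a 2.0-core CPU limit is cut back to a 10% reduction.
next_cpu_limit = clamp_change(current=2.0, proposed=1.0, exploration_factor=0.10)

# A 300 ms response-time constraint with a 20% margin is handled as 240 ms.
guarded_limit_ms = effective_threshold(threshold=300.0, safety_factor=0.20)
```

With these assumed definitions, the CPU limit moves from 2.0 to about 1.8 cores in one step, and the optimizer would start reacting once the response time approaches 240 ms rather than 300 ms.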
You can read more on safety policies in the related documentation section.
A key aspect of live optimization studies is the fact that the incoming workload of the application is not generated by a test script but by real users. This means that, after deploying a new configuration, the incoming workload might be different from the one used to evaluate the previous configuration. As an example, the traffic of web applications exposed to the general public is usually different between workdays and weekends, or between working hours and nights. Nevertheless, the Akamas AI algorithm is capable of taking into account the differences in the incoming workload and fairly evaluating different configurations even if they are applied in different scenarios.
To instruct Akamas to take into account changes that are not controlled by the deployment process, you just need to specify the workloadsSelection parameter in the optimization study.
The workload selection should contain a list of metrics that are independent of the configuration and represent external factors that affect the performance of the configuration in terms of goals or constraints. Most of the time the application throughput is a good metric to use as a workload metric.
When one or more workload metrics are specified, Akamas takes into account the differences in the workload and builds clusters of similar workloads to identify recurring working conditions for the application. It then uses this information to contextualize the evaluation of each configuration and to provide a recommended configuration that fulfills the defined constraints under all the workload conditions seen by the optimization process.
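The idea of evaluating configurations per workload context can be illustrated with a small sketch. It does not reproduce the Akamas clustering algorithm: fixed-width throughput buckets are used here as a simple stand-in for clustering, and the function names, the bucket size, and the sample observations are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (configuration id, throughput in req/s, response time in ms); values are made up
Observation = Tuple[str, float, float]


def workload_context(throughput: float, bucket_size: float = 100.0) -> int:
    """Map a throughput value to a coarse bucket (a stand-in for real clustering)."""
    return int(throughput // bucket_size)


def mean_response_time_by_context(
    observations: List[Observation],
) -> Dict[Tuple[str, int], float]:
    """Average the response time per (configuration, workload context) pair."""
    samples = defaultdict(list)
    for config_id, throughput, response_time in observations:
        samples[(config_id, workload_context(throughput))].append(response_time)
    return {key: sum(values) / len(values) for key, values in samples.items()}


def satisfies_constraint_everywhere(
    config_id: str, observations: List[Observation], max_response_time_ms: float
) -> bool:
    """True only if the constraint holds in every context seen for this configuration."""
    means = mean_response_time_by_context(observations)
    relevant = [rt for (cid, _), rt in means.items() if cid == config_id]
    return bool(relevant) and all(rt <= max_response_time_ms for rt in relevant)


observations = [
    ("baseline", 120.0, 180.0),   # low traffic
    ("baseline", 480.0, 290.0),   # peak traffic
    ("candidate", 130.0, 150.0),  # better than baseline at low traffic...
    ("candidate", 470.0, 320.0),  # ...but above 300 ms at peak traffic
]
baseline_ok = satisfies_constraint_everywhere("baseline", observations, 300.0)
candidate_ok = satisfies_constraint_everywhere("candidate", observations, 300.0)
```

In this example the candidate looks better than the baseline under low traffic, but it breaks the 300 ms constraint under peak traffic, so it would not qualify as a configuration that is valid for all observed workload conditions. Comparing configurations only within the same context is what keeps the evaluation fair when the traffic changes between deployments.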
You can read more on this parameter on the reference workload selection page.
Live optimizations are separated from offline optimization studies and are available in the second entry on the left menu.
Live optimizations usually run for a longer period than offline optimizations, and their effect on the goal and the constraints is more gradual. For this reason, Akamas offers a specific UI that allows users to evaluate the progress of live optimizations and compare the different configurations applied by looking at the evolution of core metrics.
Even for live optimization studies, it is a good practice to analyze how the optimization is being executed with respect to the defined goal, constraints, and workloads.
This analysis may provide useful insights about the system being optimized (e.g. understanding of the system dynamics) and about the optimization study itself (e.g. how to adjust optimizer options or change constraints). Since this is more challenging for an environment that is being optimized live, a common practice is to adopt a recommendation mode before possibly switching to a fully autonomous mode.
The Akamas UI displays the results of a live optimization study in the following areas:
The Metrics section (see the following figures) displays the behavior of the metrics as configurations are recommended and applied (possibly after being reviewed and approved by users); this area supports the analysis of how the optimizer is driven by the configured safety and exploration factors.
The All Configurations section provides the list of all the recommended configurations, possibly as modified by the user, as well as the details of each applied configuration (see the following figures).
In the case of a recommendation mode, the Pending configuration section (see the following figure) shows the configuration that is being recommended, so that users can review it (see the EDIT toggle) and approve it: