This page provides a short compendium of general performance engineering best practices to be applied in any load testing exercise. The focus is on how to ensure that realistic performance tests are designed and implemented to be successfully leveraged for optimization initiatives.
The goal of ensuring realistic performance tests boils down to two aspects:
sound test environments;
realistic workloads.
A test o the pre-production environment (Test Env from now on) needs to represent as closely as possible the production environment (ProdEnv from now on).
The most representative test environment would be a perfect replica of the production environment from both infrastructure (hardware) and architecture perspectives. The following criteria and guidelines can help design a TestEnv that is suitable for performance testing supporting optimization initiatives.
Hardware specifications
The hardware specifications of the physical or virtual servers running in TestEnv and ProdEnv must be identical. This is because any differences in the available resources (e.g. amount of RAM) or specification (e.g. CPU vendor and/or type) may affect both services performance and system configuration.
This general guideline can only be relaxed for servers/clusters running container(s) or container orchestration platforms (e.g. Kubernetes or OpenShift). Indeed, it is possible to safely execute most of the related optimization cases if the TestEnv guarantees enough spare/residual capacity (number of cores or amount of RAM) to allocate all the needed resources.
While for monolithic architectures this may translate into significant HW requirements, with microservices this might not be the case, for two main reasons:
microservices are typically smaller than monoliths and designed for horizontal scalability: this means that optimizing the configuration of the single instance (pod/container resources and runtime settings) becomes easier as they typically have smaller HW requirements;
approaches like Infrastructure-as-code (IaaC), typically used with cloud-native applications, allow for easily setting up cluster infrastructure (on-prem or on the cloud) that can mimic production environments.
Downscaled/downsized architecture
Test Envs are typically downscaled/downsized with respect to Prod Envs. If this is the case, then optimizations can be safely executed provided it is possible to generate a "production-like" workload on each of the nodes/elements of the architecture.
This can be usually achieved if all the architectural layers have the same scale ratio between the two environments and the generated workload is scaled accordingly. For example, if the ProdEnvs has 4 nodes at the front-end layer, 4 at the backend layer, and 2 at the database layer, then a TestEnv can have 2 nodes, 2 nodes, and 1 node respectively.
Load balancing among nodes
From a performance testing perspective, the existence of a load balancing among multiple nodes can be ignored, if the load balancing relies on an external component that ensures a uniform distribution of the load across all nodes.
On the contrary, if an application-level balancing is in place, it might be required to include at least two nodes in the testing scenario so as to take into account the impact of such a mechanism on the performance of the cluster.
External/downstream services
The TestEnv should also replicate the application ecosystem, including dependencies from external or downstream services.
External or downstream services should emulate the production behavior from both functional (e.g. response size and error rate) and performance (e.g. throughput and response times) perspectives. In case of constraints or limitations on the ability to leverage external/downstream services for testing purposes, the production behavior needs to be simulated via stubs/mock services.
In the case of microservices applications, it is also required to replicate dependencies within an application. Several approaches can be taken for this purpose, such as:
replicating interacting microservices;
mocking these microservices and simulating realistic response times using simulation tools such as https://github.com/spectolabs/hoverfly;
disregarding dependencies with nonrelevant services (e.g. a post-processing service running on a mainframe whose messages are simply left published in a queue without being dequeued).
The most representative performance test script would provide 100% coverage of all the possible test cases. Of course, this is very unlikely to be the case in performance testing. The following criteria and guidelines can be considered to establish the required test coverage.
Statistical relevance
The test cases included in the test script must cover at least 80% of the production workload.
Business relevance
The test cases included in the test script must cover all the business-critical functionalities that are known (or expected) to represent a significant load in the production environment
Technical relevance
The test cases included in the test script must cover all the functionalities that at the code level involve:
Large objects/data structure allocation and management
Long living objects/data structure allocation and management
Intensive CPU, data, or network utilization
"one of-a-kind" implementations, such as connections to a data source, ad-hoc objects allocation/management, etc.
The virtual user paths and behavior coded in the test script must be representative of the workload generated by production users. The most representative test script would account for the production users in terms of a mix of the different user paths, associated think times, and session length perspectives.
When single-user paths cannot be easily identified, the best practice is to consider each of them the most comprehensive user journey. In general, a worst-case approach is recommended.
The task of reproducing realistic workloads is easier for microservice architectures. On the contrary, for monolithic architectures, this task could become hard as it may not be easy to observe all of the workloads, due to custom frameworks, etc. With microservices, the workload can be completely decomposed in terms of APIs/endpoints and APM tools can provide full observability of production workload traffic and performance characteristics for each single API. This guarantees that the replicated workload can reproduce the production traffic as closely as possible.
Both test script data, that is datasets used in the test script, and test environment data, that is datasets in any involved databases/datastores, have to be characterized both in terms of size and variance so as to reproduce the production performances.
Test script data
The test script data has to be characterized in order to guarantee production-like performances (e.g. cache behavior). In case this characterization is difficult, the best practice is to adopt a worst-case approach.
Test environment data
The test data must be sized and have an adequate variance to guarantee production like performances in the interaction with databases/datastores (e.g. query response times).
Most performance test tools provide the ability to easily define and modify the test scenarios on top of already defined test cases/scripts, test case-mix, and test data. This is especially useful in the Akamas context where it might be required to execute a specific test scenario, based on the specific optimization goal defined. The most common (and useful, in the Akamas context) test scenarios are described here below.
Load tests
A load test aims at measuring system performance against a specified workload level, typically the one experienced or expected in production. Usually, the workload level is defined in terms of virtual user concurrency or request throughput.
In the load test, after an initial ramp-up, the target load level is maintained constant for a steady state until the end of the test.
When validating a load test, the following two key factors have to be considered:
The steady-state concurrency/throughput level: a good practice is to apply a worst-case approach by emulating at least 110% of the production throughput;
The steady-state duration: in general defining the length for steady-state is a complex task because it is strictly dependent on the technologies under test and also because phenomena such as bootstraps, warm-ups, and caching can affect the performance and behavior of the system only before or after a certain amount of time; as a general guide to validate the steady-state duration, it is useful to:
execute a long-run test by keeping the defined steady-state for at least 2h to 3h;
analyze test results by looking for any variation in the performance and behavior of the system over time;
In case no variation is observed, shorten the defined same steady-state to at least 30+min.
Stress tests
A Stress test is all about pushing the system under test to its limit.
Stress tests are useful to identify the maximum throughput that an application can cope with while working within its SLOs. Identifying the breaking point of an application is also useful to highlight the bottleneck(s) of the application.
A stress test also makes it possible to understand how the system reacts to excessive load, thus validating the architectural expectations. For example, it can be useful to discover that the application crashes when reaching the limit, instead of simply enqueuing requests and slowing down processing them.
Endurance tests
An endurance test aims at validating the system's performance over an extended period of time.
The first validation is provided by utilization metrics (e.g. CPU, RAM, I/O), which should closely display in the test environments the same behavior of production environments. If the delta is significant, some refinements of the test case and environment might be required to close the gap and gain confidence in the test results.
After modeling the system and its components and ensuring that appropriate telemetry instances are defined, the following step (see the following figure) is to define a workflow.
A workflow automates all the tasks to be executed in sequence (see the following figure) during the optimization study, in particular those leveraging integrations with external entities, such as telemetry providers or configuration management tools. Akamas provides a number of general-purpose and specialized workflow operators (see Workflow Operator page).
The Workflow template section of the reference guide describes the template required to define a workflow, while the commands for creating a workflow are listed on the Resource Management command page.
Since a workflow is an Akamas resource defined at the workspace level and that can be used by multiple studies, it might be the case that a convenient workflow is already available or can be used to create a new workflow for the specific target system and integrations, by adding/removing some workflow tasks, changing the task sequence or the values assigned to task parameters.
Notice that since the structure of workflows defined for a live optimization study and for an offline optimization study are very different, these cases are described by a specific page:
A workflow for an offline optimization study automates all the actions required to interface the configuration management and load testing tools (see the following figure) at each experiment or trial. Notice that metrics collection is an implicit action that does not need to be coded as part of the workflow.
More in detail, a typical workflow includes the following types of tasks:
Preparing the application, by executing all cleaning or reset actions that are required to prepare the load testing phase and ensuring that each experiment is executed under exactly the same conditions - for example, this may involve cleaning caches, uploading test data, etc
Applying the configuration, by preparing and then applying the parameter configuration under test to the target environment - this may require interfacing configuration management tools or pushing configuration to a repository, restarting the entire application or some of its components to ensure that some parameters are effectively applied, and then checking that after restarting the application is up & running before the workflow execution continues, and checking whether the configuration has been correctly applied
Applying the workload, by launching a load test to assess the behavior of the system under the applied configuration and synthetic workload defined in the load testing scenarios - of course, a preliminary step is to design a load testing scenario and synthetic workload that ensures that optimized configurations resulting from the offline optimization can be applied to the target system under the real or expected workload
The Optimization examples section provides some examples of how to define workload for a specific technology. In a complex application, a workflow may include multiple actions of the same type, each operating on separate components of the target system. The Knowledge base guide provides some real-world examples of how to create workflows and optimization studies.
A workflow interrupts in case any of its steps does. A failing workflow causes the experiment or trial to fail. This should be considered as a different situation than a specific configuration not matching optimization constraints or causing the system under test to fail to run. For example, if the amount of max memory configured was too low, the application may fail to start.
When an experiment fails, the Akamas AI engine takes this information into account and thus learns that that parameter configuration was bad. This way, the AI engine automatically tries to avoid the regions of the parameter space which can lead to low scores or failures.
This explains why it is important to build robust workflows that ensure experiments only fail in case bad configurations are tested. See the specific entry Building robust workflow in the best practices section below.
Creating effective workflows is essential to ensure that Akamas can automatically identify the optimal configuration in a reliable and efficient way. Some best practices on how to build robust workflows are described here below.
Some additional best practices related to the design and implementation of load testing are described in the Performing load testing to support optimization activities page.
Since Akamas workflows are first-class entities that can be used by multiple studies, it might be useful to avoid creating (and maintaining) multiple workflows and instead define workflows that can be easily reused, by factoring all differences into specific action parameters.
Of course, this general guideline should be balanced with respect to other requirements, such as avoiding potential conflicts due to different teams modifying the same workload for different uses and potentially impacting optimization results.
Akamas takes into account the exit code of each of the workflow tasks, and the whole workflow fails if a task exits with an error. Therefore, the best practice is to make use of exit codes in each task, to ensure that task failures can only happen in case of bad parameter configuration.
For example, it is important to always check that the application has correctly started and is up and running (after a new configuration has been applied). This can be done by:
including a workflow task that tests the application is up and running after the tasks where the configuration is applied;
making sure that this task exits with an error in case the application has not correctly started (typically after a timeout).
Another example is when the underlying environment incurs issues during the optimization (e.g. a database might be mistakenly shut down by another team). As much as possible, all these environmental transient issues should be carefully avoided. Akamas also provides the ability to execute multiple task retries (default is twice, configurable) to compensate for these transient issues, provided they only last for a short time (the retry time and delay are also configurable).
Building workflows that ensure reproducible experiments
As for any other performance evaluation activity, Akamas experiments should be designed to be reproducible: if the same experiment (hence, the same parameter configuration) is executed multiple times (i.e. in multiple trials), the same performance results should be found for each trial.
Therefore, it is fundamental that workflows include all the necessary tasks to realize reproducible experiments. Particular care needs to be taken to correctly manage the system state across the experiments and trials. System state can include:
Application caches
Operating system cache and buffers (e.g. Linux filesystem page cache)
Database tables that fill up during the optimization process
All experiments should always start with a clean and well-known state. If the state is not properly managed, it may happen that the performance of the system is observed to change (whether higher or lower) not because of the effect of the applied parameters, but due to other effects (e.g. warming of caches).
Best practices to consistently manage system state across experiments include:
Restoring the system state at the beginning of each experiment - this may involve restarting the application, clearing caches, restoring DB tables, etc;
Allowing for a sufficient warm-up period in the performance tests, so to ensure application performance has reached stability. See also the recommended best practices about properly managing warm-up periods in the following section about creating an optimization study.
Another common cause that can impact the reproducibility of experiments is an unstable infrastructure or environment. Therefore, it is important to ensure that the underlying infrastructure is stable and that no other workload that might impact the optimization results is running on it. For example, beware of scheduled system jobs (e.g. backups), automatic software updates or anti-virus systems that might not explicitly be considered as part of the environment but that may unexpectedly alter its performance behavior.
Taking into account workflow duration
When designing workflows, it is important to take into account the potential duration of their tasks. Indeed, the task duration impacts the duration of the overall optimization and might impact the ability to execute a sufficient number of experiments within the overall time interval or specific time windows allowed for the optimization study.
Typically, the longest task in a workflow is the one related to applying workload (e.g. launching a load test or a batch job): such tasks can last for dozens of minutes if not hours. However, a workflow may also include other ancillary tasks that may provide nontrivial contributions to the task durations (e.g. checking the status to ensure that the application is up & running).
Making workflows fail fast
As general guidance, it is better to fail fast by performing quick checks executed as early as possible. For example, it is better to do a status check before launching a load test instead of possibly waiting for it to complete (maybe after 1h) just to discover that the application did not even start.
A workflow for a automates all the actions required to interface the configuration management. Notice that metrics collection is an implicit action that does not need to be coded as part of the workflow.
More in detail, a typical workflow includes the following types of tasks:
Applying the configuration, by preparing and then applying the parameter configuration that has been recommended and/or approved to the target environment - this may require interfacing configuration management tools or pushing configuration to a repository
Depending on the complexity of the system, the workflow might be composed by multiple actions of the same type, each operating on separate components of the target system.
As expected, with respect to , there are no actions to apply synthetic workloads as part of a load-testing scenario.