# Optimizing a Spark application

In this example study we’ll tune the parameters of *SparkPi*, one of the example applications shipped with most Apache Spark distributions, to minimize its execution time. Application metrics are collected through the Spark History Server APIs.

## Environment setup <a href="#environment-setup" id="environment-setup"></a>

The test environment includes the following instances:

* **Akamas**: instance running Akamas
* **Spark cluster**: composed of instances with 16 vCPUs and 64 GB of memory, where the Spark binaries are installed under `/usr/lib/spark`. In particular, the roles are:
  * **1x master instance**: the Spark node running the resource manager and Spark History Server (host: `sparkmaster.akamas.io`)
  * **2x worker instances**: the other instances in the cluster

### Telemetry Infrastructure setup <a href="#telemetry-infrastructure-setup" id="telemetry-infrastructure-setup"></a>

To gather metrics about the application we will leverage the Spark History Server. If it is not already running, start it on the master instance with the following command:

```bash
/usr/lib/spark/sbin/start-history-server.sh
```
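Before moving on, it can help to confirm that the History Server answers on its REST API. The following is a minimal sketch, assuming the server listens on port `18080` (the port used later in the telemetry instance) and that `curl` is installed; the helper names `history_server_url` and `history_server_is_up` are hypothetical and not part of Spark or Akamas.

```bash
# Build the URL of the History Server endpoint that lists applications.
# The host is required; the port falls back to 18080 when omitted (assumption).
history_server_url() {
  local host="$1"
  local port="${2:-18080}"
  printf 'http://%s:%s/api/v1/applications\n' "$host" "$port"
}

# Return 0 if the endpoint answers with a successful HTTP status
# within 5 seconds, non-zero otherwise.
history_server_is_up() {
  curl --silent --fail --max-time 5 --output /dev/null \
    "$(history_server_url "$@")"
}
```

For example, `history_server_is_up sparkmaster.akamas.io && echo reachable` can be run from the Akamas instance to check that the endpoint is reachable from where the telemetry provider will query it.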

### Application and Test tools <a href="#application-and-test-tools" id="application-and-test-tools"></a>

To make sure the tested application is available on your cluster and runs correctly, execute the following commands:

```bash
file /usr/lib/spark/examples/jars/spark-examples.jar
spark-submit \
  --master yarn --deploy-mode client \
  --class 'org.apache.spark.examples.SparkPi' \
  /usr/lib/spark/examples/jars/spark-examples.jar 100
```

## Optimization setup <a href="#optimization-setup" id="optimization-setup"></a>

In this section, we will guide you through the steps required to set up the optimization of the Spark application in Akamas.

### System <a href="#system" id="system"></a>

#### System *spark* <a href="#system-spark" id="system-spark"></a>

Here’s the definition of the system we will use to group our components and telemetry instances for this example:

```yaml
name: spark
description: A system to tune the Spark Pi example application
```

To create the system, run the following command:

```bash
akamas create system system.yaml
```

#### Component *sparkPi* <a href="#component-sparkpi" id="component-sparkpi"></a>

We’ll use a component of type [Spark Application 2.3.0](https://docs.akamas.io/akamas-docs/3.6/reference/optimization-packs/spark-pack/spark-application-2.3.0) to represent the application running on the Apache Spark framework 2.3.

In the snippet shown below, we specify:

* the properties Akamas needs to connect via SSH to the cluster master instance
* the parameters `spark-submit` needs to execute the application
* the `sparkApplication` property, which the telemetry instance uses to associate the metrics from the History Server with this component

```yaml
name: sparkPi
description: The SparkPi example application
componentType: Spark Application 2.3.0

properties:
  hostname: sparkmaster.akamas.io
  username: hadoop
  key: ssh_key

  master: yarn
  deployMode: client
  className: org.apache.spark.examples.SparkPi
  file: /usr/lib/spark/examples/jars/spark-examples.jar
  args: [ 1000 ]

  sparkApplication: 'true'
```

To create the component in the system, run the following command:

```bash
akamas create component sparkPi.yaml spark
```

### Workflow <a href="#workflow" id="workflow"></a>

The workflow used for this study contains a single task, in which the operator submits the application along with the Spark parameters under test.

Here’s the definition of the workflow:

```yaml
name: Run SparkPi
tasks:
- name: run application
  operator: SSHSparkSubmit
  arguments:
    component: sparkPi
    retries: 0
```

To create the workflow, run the following command:

```bash
akamas create workflow workflow.yaml
```
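The exact command line that the `SSHSparkSubmit` operator builds is not shown here. As a rough illustration only, the sketch below prints a comparable manual `spark-submit` invocation for one candidate configuration, assuming the standard `spark-submit` resource flags and the paths used in this example. The function name `sparkpi_submit_command` is hypothetical.

```bash
# Print a spark-submit command line for SparkPi with the five tuned values.
# Arguments: driverCores driverMemory executorCores executorMemory numExecutors
# Memory values are passed through verbatim, so they must carry their own
# unit suffix (for example 1024m); this sketch does not assume a unit.
sparkpi_submit_command() {
  local driver_cores="$1" driver_memory="$2"
  local executor_cores="$3" executor_memory="$4" num_executors="$5"
  # Each argument after the format string is printed followed by a space.
  printf '%s ' \
    spark-submit \
    --master yarn --deploy-mode client \
    --class org.apache.spark.examples.SparkPi \
    --driver-cores "$driver_cores" \
    --driver-memory "$driver_memory" \
    --executor-cores "$executor_cores" \
    --executor-memory "$executor_memory" \
    --num-executors "$num_executors" \
    /usr/lib/spark/examples/jars/spark-examples.jar
  # Application argument, matching the args of the sparkPi component.
  printf '%s\n' 1000
}
```

Running `sparkpi_submit_command 2 1024m 4 2048m 5` prints the command without executing it, which makes it easy to inspect or paste into a shell on the master instance when reproducing a single experiment by hand.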

### Telemetry <a href="#telemetry" id="telemetry"></a>

If you have not installed the Spark History Server telemetry provider yet, take a look at the telemetry provider page [Spark History Server Provider](https://docs.akamas.io/akamas-docs/3.6/reference/telemetry-metric-mapping/spark-history-server-metrics-mapping) to proceed with the installation.

Here’s the definition of the telemetry instance, specifying the History Server endpoint:

```yaml
provider: SparkHistoryServer
config:
  address: sparkmaster.akamas.io
  port: 18080

  importLevel: job
```

To create the telemetry instance in the system, run the following command:

```bash
akamas create telemetry-instance telemetry.yaml spark
```

This telemetry instance binds the fetched metrics to the *sparkPi* component thanks to the `sparkApplication` property we previously added to the component definition.

### Study <a href="#study" id="study"></a>

The goal of this study is to find a Spark configuration that minimizes the execution time for the example application.

To achieve this goal we’ll tune the number of executor processes available to run the application job, together with the memory and CPUs allocated to both the driver and the executors.\
The domains are configured so that a single driver or executor process does not exceed the size of the underlying instance. The constraints ensure that the application as a whole does not require more resources than the cluster provides, taking into account that some resources must be reserved for other services such as the cluster manager.

Note that this study uses two constraints that cap the total CPU and memory the Spark application can request. The example refers to a cluster of three nodes with 16 cores and 64 GB of memory each, where at least one core per instance should be reserved for the system, which leaves 15 usable cores per node.
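The arithmetic behind the two caps can be sketched as a small shell check. It mirrors the constraint formulas in the study definition: the driver's allocation plus the per-executor allocation multiplied by the number of executors must stay within `15*3` for CPUs and `60*3` for memory. The function name `fits_cluster` is hypothetical, and the memory cap is taken verbatim from the study; it is only meaningful if it is expressed in the same unit as the memory parameters, which this sketch does not verify.

```bash
# Return 0 when a candidate configuration fits both caps, 1 otherwise.
# Arguments: driverCores driverMemory executorCores executorMemory numExecutors
# CPU cap:    3 nodes x (16 - 1) cores = 45
# Memory cap: 3 nodes x 60 = 180, in the same unit as the memory arguments
fits_cluster() {
  local driver_cores="$1" driver_memory="$2"
  local executor_cores="$3" executor_memory="$4" num_executors="$5"
  local cpu_cap=$((15 * 3))
  local memory_cap=$((60 * 3))
  # Total demand = driver allocation + per-executor allocation x executors.
  local total_cpus=$((driver_cores + executor_cores * num_executors))
  local total_memory=$((driver_memory + executor_memory * num_executors))
  [ "$total_cpus" -le "$cpu_cap" ] && [ "$total_memory" -le "$memory_cap" ]
}
```

For instance, `fits_cluster 1 20 4 16 10` succeeds (41 CPUs and 180 memory units), while `fits_cluster 5 20 4 16 11` fails because it would need 49 CPUs.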

Here’s the definition of the study:

```yaml
name: Speedup SparkPi execution
system: spark
workflow: Run SparkPi

goal:
  objective: minimize
  function:
    formula: sparkPi.spark_application_duration

parametersSelection:
- name: sparkPi.driverCores
  domain: [1, 10]
- name: sparkPi.driverMemory
  domain: [32, 2048]
- name: sparkPi.executorCores
  domain: [1, 15]
- name: sparkPi.executorMemory
  domain: [32, 2048]
- name: sparkPi.numExecutors
  domain: [1, 45]

parameterConstraints:
- name: cap_total_allocated_cpus
  formula: (sparkPi.driverCores + sparkPi.executorCores*sparkPi.numExecutors) <= 15*3

- name: cap_total_allocated_memory
  formula: (sparkPi.driverMemory + sparkPi.executorMemory*sparkPi.numExecutors) <= 60*3

steps:
- name: baseline
  type: baseline

- name: tune
  type: optimize
  numberOfExperiments: 200
  maxFailedExperiments: 200
```

To create and run the study, execute the following commands:

```bash
akamas create study study.yaml
akamas start study 'Speedup SparkPi execution'
```
