Optimizing Spark

When optimizing applications running on the Apache Spark framework, the goal is to find the configuration that makes the best use of the allocated resources or minimizes the execution time.

Please refer to the Spark optimization pack for the list of component types, parameters, metrics, and constraints.

Workflows

Applying parameters

Akamas offers several operators that you can use to apply the configuration parameters to the Spark application being tuned. In particular, we suggest using the Spark SSH Submit operator, which connects to a target instance and submits the application with the configuration parameters to test.

Other solutions include file-based approaches, where the configuration parameters are first written to a configuration file (typically spark-defaults.conf or a properties file passed to spark-submit) and the application is then launched through a generic executor, as sketched below.
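
The following is a minimal sketch of such a file-based workflow, assuming the Akamas FileConfigurator operator renders a spark-defaults template containing the parameter placeholders and a generic Executor operator then runs spark-submit against the generated file; hostnames, paths, and argument names are illustrative and should be checked against the respective operator references.

name: Spark workflow (file-based)
tasks:
  # Render spark-defaults.conf from a template containing the Akamas parameter placeholders
  - name: configure spark defaults
    operator: FileConfigurator
    arguments:
      source:
        hostname: sparkmaster.akamas.io
        username: hadoop
        key: /home/hadoop/.ssh/id_rsa
        path: /home/hadoop/templates/spark-defaults.conf.templ
      target:
        hostname: sparkmaster.akamas.io
        username: hadoop
        key: /home/hadoop/.ssh/id_rsa
        path: /home/hadoop/conf/spark-defaults.conf

  # Submit the application, pointing spark-submit at the generated configuration file
  - name: run spark application
    operator: Executor
    arguments:
      command: spark-submit --master yarn --deploy-mode cluster --properties-file /home/hadoop/conf/spark-defaults.conf /home/hadoop/scripts/pi.py 100
      host:
        hostname: sparkmaster.akamas.io
        username: hadoop
        key: /home/hadoop/.ssh/id_rsa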

A typical workflow

You can organize a typical workflow to optimize a Spark application in three parts:

  1. Set up the test environment:

    1. Prepare any required input data.

    2. Apply the Spark configuration parameters, if you are going for a file-based solution.

  2. Execute the Spark application.

  3. Perform any final cleanup.

Here’s an example of a typical workflow where Akamas executes the Spark application using the Spark SSH Submit operator:

name: Spark workflow
tasks:
  - name: cwspark
    # Operator name for the Spark SSH Submit operator (see the operator reference)
    operator: SSHSparkSubmit
    arguments:
      master: yarn
      deployMode: cluster
      file: /home/hadoop/scripts/pi.py
      args: [ 100 ]
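
The same workflow can be extended to cover the setup and cleanup parts described above, for example with generic Executor tasks that prepare the input data before the run and tidy up the working directories afterwards. The following is a minimal sketch; the script paths and the Executor arguments are illustrative and should be adapted to your environment.

name: Spark workflow
tasks:
  # 1. Set up the test environment: prepare the input data
  - name: prepare input data
    operator: Executor
    arguments:
      command: /home/hadoop/scripts/prepare_input.sh
      host:
        hostname: sparkmaster.akamas.io
        username: hadoop
        key: /home/hadoop/.ssh/id_rsa

  # 2. Execute the Spark application with the configuration under test
  - name: cwspark
    operator: SSHSparkSubmit
    arguments:
      master: yarn
      deployMode: cluster
      file: /home/hadoop/scripts/pi.py
      args: [ 100 ]

  # 3. Perform the final cleanup
  - name: cleanup
    operator: Executor
    arguments:
      command: /home/hadoop/scripts/cleanup.sh
      host:
        hostname: sparkmaster.akamas.io
        username: hadoop
        key: /home/hadoop/.ssh/id_rsa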

Telemetry Providers

Akamas can access Spark History Server statistics using the Spark History Server Provider. This provider maps the metrics in this optimization pack to the statistics provided by the History Server endpoint.

Here’s a configuration example for a telemetry provider instance:

provider: SparkHistoryServer
config:
  address: sparkmaster.akamas.io
  port: 18080
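
Once the telemetry instance is attached to the system, the metrics imported from the History Server can drive the optimization goal described at the beginning of this page. The snippet below is a minimal sketch of a study goal that minimizes the application duration; the component name (spark) and the metric name (duration) are assumptions to be verified against the Spark optimization pack reference.

goal:
  objective: minimize
  function:
    # Assumed component and metric names; check the Spark optimization pack for the actual ones
    formula: spark.duration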

Examples

See this page for an example of a study leveraging the Spark pack.
