diff --git a/website/source/guides/spark/configuration.html.md b/website/source/guides/spark/configuration.html.md
new file mode 100644
index 000000000..e35d7623d
--- /dev/null
+++ b/website/source/guides/spark/configuration.html.md
@@ -0,0 +1,144 @@
+---
+layout: "guides"
+page_title: "Apache Spark Integration - Configuration Properties"
+sidebar_current: "guides-spark-configuration"
+description: |-
+  Comprehensive list of Spark configuration properties.
+---
+
+# Spark Configuration Properties
+
+Spark [configuration properties](https://spark.apache.org/docs/latest/configuration.html#available-properties)
+are generally applicable to the Nomad integration. The properties listed below
+are specific to running Spark on Nomad. Configuration properties can be set by
+adding `--conf [property]=[value]` to the `spark-submit` command.
+
+- `spark.nomad.cluster.expectImmediateScheduling` `(bool: false)` - Specifies
+that `spark-submit` should fail if Nomad is not able to schedule the job
+immediately.
+
+- `spark.nomad.cluster.monitorUntil` `(string: "submitted")` - Specifies how
+long `spark-submit` should monitor a Spark application in cluster mode. When
+set to `submitted`, `spark-submit` will return as soon as the application has
+been submitted to the Nomad cluster. When set to `scheduled`, `spark-submit`
+will return as soon as the Nomad job has been scheduled. When set to
+`complete`, `spark-submit` will tail the output from the driver process and
+return when the job has completed.
+
+- `spark.nomad.datacenters` `(string: dynamic)` - Specifies a comma-separated
+list of Nomad datacenters to use. This property defaults to the datacenter of
+the first Nomad server contacted.
+
+- `spark.nomad.docker.email` `(string: nil)` - Specifies the email address to
+use when downloading the Docker image specified by
+[spark.nomad.dockerImage](#spark-nomad-dockerimage). See the
+[Docker driver authentication](https://www.nomadproject.io/docs/drivers/docker.html#authentication)
+docs for more information.
+
+- `spark.nomad.docker.password` `(string: nil)` - Specifies the password to use
+when downloading the Docker image specified by
+[spark.nomad.dockerImage](#spark-nomad-dockerimage). See the
+[Docker driver authentication](https://www.nomadproject.io/docs/drivers/docker.html#authentication)
+docs for more information.
+
+- `spark.nomad.docker.serverAddress` `(string: nil)` - Specifies the server
+address (domain/IP without the protocol) to use when downloading the Docker
+image specified by [spark.nomad.dockerImage](#spark-nomad-dockerimage). Docker
+Hub is used by default. See the
+[Docker driver authentication](https://www.nomadproject.io/docs/drivers/docker.html#authentication)
+docs for more information.
+
+- `spark.nomad.docker.username` `(string: nil)` - Specifies the username to use
+when downloading the Docker image specified by
+[spark.nomad.dockerImage](#spark-nomad-dockerimage). See the
+[Docker driver authentication](https://www.nomadproject.io/docs/drivers/docker.html#authentication)
+docs for more information.
+
+- `spark.nomad.dockerImage` `(string: nil)` - Specifies the URL of the
+[Docker image](https://www.nomadproject.io/docs/drivers/docker.html#image) to
+use to run Spark with Nomad's `docker` driver. When not specified, Nomad's
+`exec` driver will be used instead.
+
+- `spark.nomad.driver.cpu` `(string: "1000")` - Specifies the CPU in MHz that
+should be reserved for driver tasks.
+
+- `spark.nomad.driver.logMaxFileSize` `(string: "1m")` - Specifies the maximum
+size that Nomad should allow for each driver task log file.
+
+- `spark.nomad.driver.logMaxFiles` `(string: "5")` - Specifies the number of log
+files that Nomad should keep for driver tasks.
+
+- `spark.nomad.driver.networkMBits` `(string: "1")` - Specifies the network
+bandwidth, in Mbit/s, that Nomad should allocate to driver tasks.
+
+- `spark.nomad.driver.retryAttempts` `(string: "5")` - Specifies the number of
+times that Nomad should retry driver task groups upon failure.
+
+- `spark.nomad.driver.retryDelay` `(string: "15s")` - Specifies the length of
+time that Nomad should wait before retrying driver task groups upon failure.
+
+- `spark.nomad.driver.retryInterval` `(string: "1d")` - Specifies Nomad's retry
+interval for driver task groups.
+
+- `spark.nomad.executor.cpu` `(string: "1000")` - Specifies the CPU in MHz that
+should be reserved for executor tasks.
+
+- `spark.nomad.executor.logMaxFileSize` `(string: "1m")` - Specifies the maximum
+size that Nomad should allow for each executor task log file.
+
+- `spark.nomad.executor.logMaxFiles` `(string: "5")` - Specifies the number of
+log files that Nomad should keep for executor tasks.
+
+- `spark.nomad.executor.networkMBits` `(string: "1")` - Specifies the network
+bandwidth, in Mbit/s, that Nomad should allocate to executor tasks.
+
+- `spark.nomad.executor.retryAttempts` `(string: "5")` - Specifies the number of
+times that Nomad should retry executor task groups upon failure.
+
+- `spark.nomad.executor.retryDelay` `(string: "15s")` - Specifies the length of
+time that Nomad should wait before retrying executor task groups upon failure.
+
+- `spark.nomad.executor.retryInterval` `(string: "1d")` - Specifies Nomad's
+retry interval for executor task groups.
+
+- `spark.nomad.job` `(string: nil)` - Specifies the Nomad job name.
+
+- `spark.nomad.job.template` `(string: nil)` - Specifies the path to a JSON file
+containing a Nomad job to use as a template. This can also be set with
+`spark-submit`'s `--nomad-template` parameter.
+
+- `spark.nomad.priority` `(string: nil)` - Specifies the priority for the
+Nomad job.
+
+- `spark.nomad.region` `(string: dynamic)` - Specifies the Nomad region to use.
+This property defaults to the region of the first Nomad server contacted.
+
+- `spark.nomad.shuffle.cpu` `(string: "1000")` - Specifies the CPU in MHz that
+should be reserved for shuffle service tasks.
+
+- `spark.nomad.shuffle.logMaxFileSize` `(string: "1m")` - Specifies the maximum
+size that Nomad should allow for each shuffle service task log file.
+
+- `spark.nomad.shuffle.logMaxFiles` `(string: "5")` - Specifies the number of
+log files that Nomad should keep for shuffle service tasks.
+
+- `spark.nomad.shuffle.memory` `(string: "256m")` - Specifies the memory that
+Nomad should allocate for the shuffle service tasks.
+
+- `spark.nomad.shuffle.networkMBits` `(string: "1")` - Specifies the network
+bandwidth, in Mbit/s, that Nomad should allocate to shuffle service tasks.
+
+- `spark.nomad.sparkDistribution` `(string: nil)` - Specifies the location of
+the Spark distribution archive file to use.
+
+- `spark.nomad.tls.caCert` `(string: nil)` - Specifies the path to a `.pem` file
+containing the certificate authority that should be used to validate the Nomad
+server's TLS certificate.
+
+- `spark.nomad.tls.cert` `(string: nil)` - Specifies the path to a `.pem` file
+containing the TLS certificate to present to the Nomad server.
+
+- `spark.nomad.tls.key` `(string: nil)` - Specifies the path to a `.pem` file
+containing the private key corresponding to the certificate in
+[spark.nomad.tls.cert](#spark-nomad-tls-cert).
+
diff --git a/website/source/guides/spark/customizing.html.md b/website/source/guides/spark/customizing.html.md
new file mode 100644
index 000000000..6c8f7240e
--- /dev/null
+++ b/website/source/guides/spark/customizing.html.md
@@ -0,0 +1,118 @@
+---
+layout: "guides"
+page_title: "Apache Spark Integration - Customizing Applications"
+sidebar_current: "guides-spark-customizing"
+description: |-
+  Learn how to customize the Nomad job that is created to run a Spark
+  application.
+---
+
+# Customizing Applications
+
+By default, the Spark integration will start with a blank Nomad job and add
+configuration to it as necessary. In `cluster` mode, groups and tasks are added
+for the driver and the executors (the driver task group is not relevant for
+`client` mode). A task will also be added for the
+[shuffle service](/guides/spark/dynamic.html) if it has been enabled. All tasks
+have the `spark.nomad.role` meta value defined. For example:
+
+```hcl
+job "structure" {
+  meta {
+    "spark.nomad.role" = "application"
+  }
+
+  # A driver group is only added in cluster mode
+  group "driver" {
+    task "driver" {
+      meta {
+        "spark.nomad.role" = "driver"
+      }
+    }
+  }
+
+  group "executors" {
+    count = 2
+    task "executor" {
+      meta {
+        "spark.nomad.role" = "executor"
+      }
+    }
+
+    # Shuffle service tasks are only added when enabled (as it must be when
+    # using dynamic allocation)
+    task "shuffle-service" {
+      meta {
+        "spark.nomad.role" = "shuffle"
+      }
+    }
+  }
+}
+```
+
+You can customize the Nomad job that Spark creates by [setting configuration
+properties](/guides/spark/configuration.html) or by using a job template. The
+order of precedence for settings is as follows:
+
+1. Explicitly set configuration properties.
+2. Settings in the job template, if provided.
+3. Default values of the configuration properties.
+
+## Customization Using a Nomad Job Template
+
+Rather than having Spark create a Nomad job from scratch to run your
+application, you can set the `spark.nomad.job.template` configuration property
+to the path of a file containing a template job specification. There are two
+important considerations:
+
+ * The template must use the JSON format. You can convert an HCL jobspec to
+ JSON by passing it to `nomad run -output` (see the example below).
+
+ * `spark.nomad.job.template` should be set to a path on the submitting
+ machine, not to a URL (even in cluster mode). The template does not need to
+ be accessible to the driver or executors.
+
+Using a job template, you can override Spark's default resource utilization,
+add additional metadata or constraints, set environment variables, or add
+sidecar tasks. The template does not need to be a complete Nomad job
+specification, since Spark will add everything necessary to run the
+application. For example, your template might set `job` metadata but not
+contain any task groups, making it an incomplete Nomad job specification but
+still a valid template to use with Spark.
+
+To customize the driver task group, include a task group in your template that
+has a task that contains a `spark.nomad.role` meta value set to `driver`.
+
+To customize the executor task group, include a task group in your template that
+has a task that contains a `spark.nomad.role` meta value set to `executor` or
+`shuffle`.
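+
+Once a template has been written in HCL, it can be converted to JSON and passed
+to `spark-submit`. The sketch below assumes illustrative file names
+(`template.nomad` and `template.json`) and reuses the distribution and example
+JAR URLs from elsewhere in this guide:
+
+```shell
+# Convert the HCL template to the JSON format required by the integration
+$ nomad run -output template.nomad > template.json
+
+# Point spark-submit at the JSON template
+$ spark-submit \
+    --class org.apache.spark.examples.JavaSparkPi \
+    --master nomad \
+    --deploy-mode cluster \
+    --conf spark.nomad.job.template=template.json \
+    --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz \
+    https://s3.amazonaws.com/rcgenova-nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar 100
+```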
+
+The following template adds a `meta` value at the job level and an environment
+variable to the executor task group:
+
+```hcl
+job "template" {
+
+  meta {
+    "foo" = "bar"
+  }
+
+  group "executor-group-name" {
+
+    task "executor-task-name" {
+      meta {
+        "spark.nomad.role" = "executor"
+      }
+
+      env {
+        BAZ = "something"
+      }
+    }
+  }
+}
+```
+
+## Next Steps
+
+Learn how to [allocate resources](/guides/spark/resource.html) for your Spark
+applications.
diff --git a/website/source/guides/spark/dynamic.html.md b/website/source/guides/spark/dynamic.html.md
new file mode 100644
index 000000000..4f1c7d043
--- /dev/null
+++ b/website/source/guides/spark/dynamic.html.md
@@ -0,0 +1,28 @@
+---
+layout: "guides"
+page_title: "Apache Spark Integration - Dynamic Executors"
+sidebar_current: "guides-spark-dynamic"
+description: |-
+  Learn how to dynamically scale Spark executors based on the queue of pending
+  tasks.
+---
+
+# Dynamically Allocate Spark Executors
+
+By default, the Spark application will use a fixed number of executors. Setting
+`spark.dynamicAllocation.enabled` to `true` enables Spark to add and remove
+executors during execution, depending on the number of Spark tasks scheduled to
+run. As described in [Dynamic Resource Allocation](http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation),
+dynamic allocation requires that `spark.shuffle.service.enabled` be set to `true`.
+
+On Nomad, this adds an additional shuffle service task to the executor
+task group. This results in a one-to-one mapping of executors to shuffle
+services.
+
+When the executor exits, the shuffle service continues running so that it can
+serve any results produced by the executor. Due to the nature of resource
+allocation in Nomad, the resources allocated to the executor tasks are not
+freed until the shuffle service (and the application) has finished.
+
+## Next Steps
+
+Learn how to [integrate Spark with HDFS](/guides/spark/hdfs.html).
diff --git a/website/source/guides/spark/hdfs.html.md b/website/source/guides/spark/hdfs.html.md
new file mode 100644
index 000000000..e6f71e64f
--- /dev/null
+++ b/website/source/guides/spark/hdfs.html.md
@@ -0,0 +1,126 @@
+---
+layout: "guides"
+page_title: "Apache Spark Integration - Using HDFS"
+sidebar_current: "guides-spark-hdfs"
+description: |-
+  Learn how to deploy HDFS on Nomad and integrate it with Spark.
+---
+
+# Using HDFS
+
+[HDFS](https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system)
+is a distributed, replicated, and scalable file system written for the Hadoop
+framework. Spark was designed to read from and write to HDFS, since it is
+common for Spark applications to perform data-intensive processing over large
+datasets. HDFS can be deployed as its own Nomad job.
+
+## Running HDFS on Nomad
+
+A sample HDFS job file can be found [here](https://github.com/hashicorp/nomad/blob/f-terraform-config/terraform/examples/spark/hdfs.nomad).
+It has two task groups, one for the HDFS NameNode and one for the
+DataNodes.
Both task groups use a [Docker image](https://github.com/hashicorp/nomad/tree/f-terraform-config/terraform/examples/spark/docker/hdfs) that has Hadoop installed: + +```hcl + group "NameNode" { + + constraint { + operator = "distinct_hosts" + value = "true" + } + + task "NameNode" { + + driver = "docker" + + config { + image = "rcgenova/hadoop-2.7.3" + command = "bash" + args = [ "-c", "hdfs namenode -format && exec hdfs namenode + -D fs.defaultFS=hdfs://${NOMAD_ADDR_ipc}/ -D dfs.permissions.enabled=false" ] + network_mode = "host" + port_map { + ipc = 8020 + ui = 50070 + } + } + + resources { + memory = 500 + network { + port "ipc" { + static = "8020" + } + port "ui" { + static = "50070" + } + } + } + + service { + name = "hdfs" + port = "ipc" + } + } + } +``` + +The NameNode task registers itself in Consul as `hdfs`. This enables the +DataNodes to generically reference the NameNode: + +```hcl + group "DataNode" { + + count = 3 + + constraint { + operator = "distinct_hosts" + value = "true" + } + + task "DataNode" { + + driver = "docker" + + config { + network_mode = "host" + image = "rcgenova/hadoop-2.7.3" + args = [ "hdfs", "datanode" + , "-D", "fs.defaultFS=hdfs://hdfs.service.consul/" + , "-D", "dfs.permissions.enabled=false" + ] + port_map { + data = 50010 + ipc = 50020 + ui = 50075 + } + } + + resources { + memory = 500 + network { + port "data" { + static = "50010" + } + port "ipc" { + static = "50020" + } + port "ui" { + static = "50075" + } + } + } + + } + } +``` + +The HDFS job can be deployed using the `nomad run` command: + +```shell +$ nomad run hdfs.nomad +``` + +## Next Steps + +Learn how to [monitor the output](/guides/spark/monitoring.html) of your +Spark applications. diff --git a/website/source/guides/spark/monitoring.html.md b/website/source/guides/spark/monitoring.html.md new file mode 100644 index 000000000..778e5f083 --- /dev/null +++ b/website/source/guides/spark/monitoring.html.md @@ -0,0 +1,166 @@ +--- +layout: "guides" +page_title: "Apache Spark Integration - Monitoring Output" +sidebar_current: "guides-spark-monitoring" +description: |- + Learn how to monitor Spark application output. +--- + +# Monitoring Spark Application Output + +By default, `spark-submit` in `cluster` mode will submit your application + to the Nomad cluster and return immediately. You can use the + [spark.nomad.cluster.monitorUntil](/guides/spark/configuration.html#spark-nomad-cluster-monitoruntil) configuration property to have + `spark-submit` monitor the job continuously. Note that, with this flag set, + killing `spark-submit` will *not* stop the spark application, since it will be + running independently in the Nomad cluster. + +## Spark UI + +In cluster mode, if `spark.ui.enabled` is set to `true` (as by default), the +Spark web UI will be dynamically allocated a port. The Web UI will be exposed by + Nomad as a service, and the UI’s `URL` will appear in the Spark driver’s log. By +default, the Spark web UI will terminate when the application finishes. This can +be problematic when debugging an application. You can delay termination by +setting `spark.ui.stopDelay` (e.g. `5m` for 5 minutes). Note that this will +cause the driver process to continue to run. You can force termination + immediately on the “Jobs” page of the web UI. + +## Spark History Server + +It is possible to reconstruct the web UI of a completed application using +Spark’s [history server](https://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact). 
+The history server requires the event log to have been written to an accessible
+location (e.g., [HDFS](/guides/spark/hdfs.html)).
+
+Sample history server job file:
+
+```hcl
+job "spark-history-server" {
+  datacenters = ["dc1"]
+  type = "service"
+
+  group "server" {
+    count = 1
+
+    task "history-server" {
+      driver = "docker"
+
+      config {
+        image = "barnardb/spark"
+        command = "/spark/spark-2.1.0-bin-nomad/bin/spark-class"
+        args = [ "org.apache.spark.deploy.history.HistoryServer" ]
+        port_map {
+          ui = 18080
+        }
+        network_mode = "host"
+      }
+
+      env {
+        "SPARK_HISTORY_OPTS" = "-Dspark.history.fs.logDirectory=hdfs://hdfs.service.consul/spark-events/"
+        "SPARK_PUBLIC_DNS"   = "spark-history.service.consul"
+      }
+
+      resources {
+        cpu    = 500
+        memory = 500
+        network {
+          mbits = 250
+          port "ui" {
+            static = 18080
+          }
+        }
+      }
+
+      service {
+        name = "spark-history"
+        tags = ["spark", "ui"]
+        port = "ui"
+      }
+    }
+
+  }
+}
+```
+
+The job file above can also be found [here](https://github.com/hashicorp/nomad/blob/f-terraform-config/terraform/examples/spark/spark-history-server-hdfs.nomad).
+
+To run the history server, first [deploy HDFS](/guides/spark/hdfs.html) and then
+create a directory in HDFS to store events:
+
+```shell
+$ hdfs dfs -mkdir /spark-events
+```
+
+You can then deploy the history server with:
+
+```shell
+$ nomad run spark-history-server-hdfs.nomad
+```
+
+You can get the private IP for the history server with a Consul DNS lookup:
+
+```shell
+$ dig spark-history.service.consul
+```
+
+Find the public IP that corresponds to the private IP returned by the `dig`
+command above. You can access the history server at http://PUBLIC_IP:18080.
+
+Use the `spark.eventLog.enabled` and `spark.eventLog.dir` configuration
+properties in `spark-submit` to log events for a given application:
+
+```shell
+$ spark-submit \
+    --class org.apache.spark.examples.JavaSparkPi \
+    --master nomad \
+    --deploy-mode cluster \
+    --conf spark.executor.instances=4 \
+    --conf spark.nomad.cluster.monitorUntil=complete \
+    --conf spark.eventLog.enabled=true \
+    --conf spark.eventLog.dir=hdfs://hdfs.service.consul/spark-events \
+    --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz \
+    https://s3.amazonaws.com/rcgenova-nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar 100
+```
+
+## Logs
+
+Nomad clients collect the `stderr` and `stdout` of running tasks. The CLI or the
+HTTP API can be used to inspect logs, as documented in
+[Accessing Logs](https://www.nomadproject.io/docs/operating-a-job/accessing-logs.html).
+In cluster mode, the `stderr` and `stdout` of the `driver` application can be
+accessed in the same way. The [Log Shipper Pattern](https://www.nomadproject.io/docs/operating-a-job/accessing-logs.html#log-shipper-pattern)
+uses sidecar tasks to forward logs to a central location. This can be done
+using a job template as follows:
+
+```hcl
+job "template" {
+  group "driver" {
+
+    task "driver" {
+      meta {
+        "spark.nomad.role" = "driver"
+      }
+    }
+
+    task "log-forwarding-sidecar" {
+      # sidecar task definition here
+    }
+  }
+
+  group "executor" {
+
+    task "executor" {
+      meta {
+        "spark.nomad.role" = "executor"
+      }
+    }
+
+    task "log-forwarding-sidecar" {
+      # sidecar task definition here
+    }
+  }
+}
+```
+
+## Next Steps
+
+Review the Nomad/Spark [configuration properties](/guides/spark/configuration.html).
diff --git a/website/source/guides/spark/pre.html.md b/website/source/guides/spark/pre.html.md
new file mode 100644
index 000000000..f99ec9ed5
--- /dev/null
+++ b/website/source/guides/spark/pre.html.md
@@ -0,0 +1,115 @@
+---
+layout: "guides"
+page_title: "Apache Spark Integration Prerequisites"
+sidebar_current: "guides-spark-pre"
+description: |-
+  Understand what the prerequisites and dependencies are for the Nomad/Spark
+  integration.
+---
+
+# Prerequisites
+
+There are three basic prerequisites to using the Nomad/Spark integration:
+
+- A Nomad cluster with sufficient [resources](/guides/spark/resource.html). See
+the [Getting Started](/intro/getting-started/install.html) guide and the
+[Terraform configuration](https://github.com/hashicorp/nomad/tree/f-terraform-config/terraform).
+
+- Access to a [Spark distribution](https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz)
+built with Nomad support. This is required for the machine that will submit
+applications as well as the Nomad tasks that will run the Spark executors.
+
+- A Java runtime environment (JRE) for the submitting machine and the executors.
+
+The sections below explain these requirements further.
+
+## Configure the Submitting Machine
+
+To run Spark applications on Nomad, the submitting machine must have access to
+the cluster and have the Nomad-enabled Spark distribution installed. The code
+snippets below walk through installing Java and Spark on Ubuntu.
+
+Install Java:
+
+```shell
+$ sudo add-apt-repository -y ppa:openjdk-r/ppa
+$ sudo apt-get update
+$ sudo apt-get install -y openjdk-8-jdk
+$ JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
+```
+
+Install Spark:
+
+```shell
+$ wget -O - https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz \
+  | sudo tar xz -C /usr/local
+$ export PATH=$PATH:/usr/local/spark-2.1.0-bin-nomad-preview-6/bin
+```
+
+Export `NOMAD_ADDR` to point Spark to your Nomad cluster:
+
+```shell
+$ export NOMAD_ADDR=http://NOMAD_SERVER_IP:4646
+```
+
+Nomad's [Terraform configuration](https://github.com/hashicorp/nomad/tree/f-terraform-config/terraform)
+can also be used to automatically provision a Spark-enabled Nomad environment in
+AWS. The [Spark example](https://github.com/hashicorp/nomad/tree/f-terraform-config/terraform/examples/spark)
+provides a quickstart experience that can be used in conjunction with this
+guide.
+
+## Executor Access to the Spark Distribution
+
+When running on Nomad, Spark creates Nomad tasks to run executors for use by the
+application's driver program. The executor tasks need access to a JRE, a Spark
+distribution built with Nomad support, and (in cluster mode) the Spark
+application itself. By default, Nomad will only place Spark executors on client
+nodes that have the Java runtime installed (version 7 or higher).
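+
+Nomad's Java fingerprinting exposes the detected runtime as node attributes
+such as `driver.java.version`, so this default placement behaves like a
+constraint of roughly the following shape (shown as an illustration of the
+idea, not the exact constraint the integration emits):
+
+```hcl
+# Illustrative sketch: only place tasks on clients whose fingerprint
+# reports a Java runtime at version 1.7.0 or newer.
+constraint {
+  attribute = "${attr.driver.java.version}"
+  operator  = "version"
+  value     = ">= 1.7.0"
+}
+```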
+ +In this example, the Spark distribution and the Spark application JAR file are +being pulled from Amazon S3: + +```shell +$ spark-submit \ + --class org.apache.spark.examples.JavaSparkPi \ + --master nomad \ + --deploy-mode cluster \ + --conf spark.executor.instances=4 \ + --conf spark.nomad.sparkDistribution=https://s3.amazonaws.com/rcgenova-nomad-spark/spark-2.1.0-bin-nomad-preview-6.tgz \ + https://s3.amazonaws.com/rcgenova-nomad-spark/spark-examples_2.11-2.1.0-SNAPSHOT.jar 100 +``` + +### Using a Docker Image + +An alternative to installing the JRE on every client node is to set the +[spark.nomad.dockerImage](/guides/spark/configuration.html#spark-nomad-dockerimage) + configuration property to the URL of a Docker image that has the Java runtime +installed. If set, Nomad will use the `docker` driver to run Spark executors in +a container created from the image. The +[spark.nomad.dockerAuth](/guides/spark/configuration.html#spark-nomad-dockerauth) + configuration property can be set to a JSON object to provide Docker repository + authentication configuration. + +When using a Docker image, both the Spark distribution and the application +itself can be included (in which case local URLs can be used for `spark-submit`). + +Here, we include [spark.nomad.dockerImage](/guides/spark/configuration.html#spark-nomad-dockerimage) +and use local paths for +[spark.nomad.sparkDistribution](/guides/spark/configuration.html#spark-nomad-sparkdistribution) +and the application JAR file: + +```shell +$ spark-submit \ + --class org.apache.spark.examples.JavaSparkPi \ + --master nomad \ + --deploy-mode cluster \ + --conf spark.nomad.dockerImage=rcgenova/spark \ + --conf spark.executor.instances=4 \ + --conf spark.nomad.sparkDistribution=/spark-2.1.0-bin-nomad-preview-6.tgz \ + /spark-examples_2.11-2.1.0-SNAPSHOT.jar 100 +``` + +## Next Steps + +Learn how to [submit applications](/guides/spark/submit.html). diff --git a/website/source/guides/spark/resource.html.md b/website/source/guides/spark/resource.html.md new file mode 100644 index 000000000..7a5e959ce --- /dev/null +++ b/website/source/guides/spark/resource.html.md @@ -0,0 +1,82 @@ +--- +layout: "guides" +page_title: "Apache Spark Integration - Resource Allocation" +sidebar_current: "guides-spark-resource" +description: |- + Learn how to configure resource allocation for your Spark applications. +--- + +# Resource Allocation + +Resource allocation can be configured using a job template or through +configuration properties. Here is a sample template in HCL syntax (this would +need to be converted to JSON): + +```hcl +job "template" { + group "group-name" { + + task "executor" { + meta { + "spark.nomad.role" = "executor" + } + + resources { + cpu = 2000 + memory = 2048 + network { + mbits = 100 + } + } + } + } +} +``` +Resource-related configuration properties are covered below. + +## Memory + +The standard Spark memory properties will be propagated to Nomad to control +task resource allocation: `spark.driver.memory` (set by `--driver-memory`) and +`spark.executor.memory` (set by `--executor-memory`). You can additionally specify + [spark.nomad.shuffle.memory](/guides/spark/configuration.html#spark-nomad-shuffle-memory) + to control how much memory Nomad allocates to shuffle service tasks. + +## CPU + +Spark sizes its thread pools and allocates tasks based on the number of CPU +cores available. Nomad manages CPU allocation in terms of processing speed +rather than number of cores. 
When running Spark on Nomad, you can control how
+much CPU share Nomad will allocate to tasks using the
+[spark.nomad.driver.cpu](/guides/spark/configuration.html#spark-nomad-driver-cpu)
+(set by `--driver-cpu`),
+[spark.nomad.executor.cpu](/guides/spark/configuration.html#spark-nomad-executor-cpu)
+(set by `--executor-cpu`) and
+[spark.nomad.shuffle.cpu](/guides/spark/configuration.html#spark-nomad-shuffle-cpu)
+properties. When running on Nomad, executors will be configured to use one core
+by default, meaning they will only pull a single 1-core task at a time. You can
+set the `spark.executor.cores` property (set by `--executor-cores`) to allow
+more tasks to be executed concurrently on a single executor.
+
+## Network
+
+Nomad does not restrict the network bandwidth of running tasks, but it does
+allocate a non-zero number of Mbit/s to each task and uses this when bin packing
+task groups onto Nomad clients. Spark defaults to requesting the minimum of 1
+Mbit/s per task, but you can change this with the
+[spark.nomad.driver.networkMBits](/guides/spark/configuration.html#spark-nomad-driver-networkmbits),
+[spark.nomad.executor.networkMBits](/guides/spark/configuration.html#spark-nomad-executor-networkmbits), and
+[spark.nomad.shuffle.networkMBits](/guides/spark/configuration.html#spark-nomad-shuffle-networkmbits)
+properties.
+
+## Log Rotation
+
+Nomad performs log rotation on the `stdout` and `stderr` of its tasks. You can
+configure the number and size of log files it will keep for driver and
+executor task groups using
+[spark.nomad.driver.logMaxFiles](/guides/spark/configuration.html#spark-nomad-driver-logmaxfiles)
+and [spark.nomad.executor.logMaxFiles](/guides/spark/configuration.html#spark-nomad-executor-logmaxfiles).
+
+## Next Steps
+
+Learn how to [dynamically allocate Spark executors](/guides/spark/dynamic.html).
diff --git a/website/source/guides/spark/spark.html.md b/website/source/guides/spark/spark.html.md
new file mode 100644
index 000000000..84f8cb661
--- /dev/null
+++ b/website/source/guides/spark/spark.html.md
@@ -0,0 +1,24 @@
+---
+layout: "guides"
+page_title: "Running Apache Spark on Nomad"
+sidebar_current: "guides-spark-spark"
+description: |-
+  Learn how to run Apache Spark on a Nomad cluster.
+---
+
+# Running Apache Spark on Nomad
+
+Nomad is well-suited for analytical workloads, given its [performance
+characteristics](https://www.hashicorp.com/c1m/) and first-class support for
+[batch scheduling](https://www.nomadproject.io/docs/runtime/schedulers.html).
+Apache Spark is a popular data processing engine/framework that has been
+architected to use third-party schedulers. The Nomad ecosystem includes a
+[fork of Apache Spark](https://github.com/hashicorp/nomad-spark) that natively
+integrates Nomad as a cluster manager and scheduler for Spark. When running on
+Nomad, the Spark executors that run Spark tasks for your application, and
+optionally the application driver itself, run as Nomad tasks in a Nomad job.
+
+## Next Steps
+
+The links in the sidebar contain detailed information about specific aspects of
+the integration, starting with the [prerequisites](/guides/spark/pre.html).
diff --git a/website/source/guides/spark/submit.html.md b/website/source/guides/spark/submit.html.md
new file mode 100644
index 000000000..4cb24d931
--- /dev/null
+++ b/website/source/guides/spark/submit.html.md
@@ -0,0 +1,81 @@
+---
+layout: "guides"
+page_title: "Apache Spark Integration - Submitting Applications"
+sidebar_current: "guides-spark-submit"
+description: |-
+  Learn how to submit Spark jobs that run on a Nomad cluster.
+---
+
+# Submitting Applications
+
+The [`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html)
+script located in Spark's `bin` directory is used to launch applications on a
+cluster. Spark applications can be submitted to Nomad in either `client` mode
+or `cluster` mode.
+
+## Client Mode
+
+In client mode (the default deployment mode), the Spark application is either
+started directly by the user or run directly by `spark-submit`, so the
+application driver runs on a machine that is not necessarily in the Nomad
+cluster. The driver's `SparkContext` creates a Nomad job to run Spark executors.
+The executors connect to the driver and run Spark tasks on behalf of the
+application. When the driver's `SparkContext` is stopped, the executors are
+shut down.
+
+Note that the machine running the driver or `spark-submit` needs to be reachable
+from the Nomad clients so that the executors can connect to it.
+
+In client mode, application resources need to be present on the
+submitting machine, so jars (both the primary jar and those added with the
+`--jars` option) can't be specified using `http:` or `https:` URLs. You can either
+use files on the submitting machine (either as raw paths or `file:` URLs), or use
+`local:` URLs to indicate that the files are independently available on both the
+submitting machine and all of the Nomad clients where the executors might run.
+
+In this mode, the `spark-submit` invocation doesn't return until the application
+has finished running, and killing the `spark-submit` process kills the
+application.
+
+For example, to submit an application in client mode:
+
+```shell
+$ spark-submit --class org.apache.spark.examples.SparkPi \
+    --master nomad \
+    --conf spark.nomad.sparkDistribution=http://example.com/spark.tgz \
+    lib/spark-examples*.jar \
+    10
+```
+
+## Cluster Mode
+
+In cluster mode, the `spark-submit` process creates a Nomad job to run the Spark
+application driver itself. The driver's `SparkContext` then adds Spark executors
+to the Nomad job. The executors connect to the driver and run Spark tasks on
+behalf of the application. When the driver's `SparkContext` is stopped, the
+executors are shut down.
+
+In cluster mode, application resources need to be hosted somewhere accessible
+to the Nomad cluster, so jars (both the primary jar and those added with the
+`--jars` option) can't be specified using raw paths or `file:` URLs. You can either
+use `http:` or `https:` URLs, or use `local:` URLs to indicate that the files are
+independently available on all of the Nomad clients where the driver and executors
+might run.
+
+Note that in cluster mode, the Nomad master URL needs to be routable from both
+the submitting machine and the Nomad client node that runs the driver. If the
+Nomad cluster is integrated with Consul, you may want to use a DNS name for the
+Nomad service served by Consul.
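+
+If the Nomad servers are registered in Consul under the default `nomad` service
+name and Consul DNS is resolvable from both the submitting machine and the
+Nomad clients, the `NOMAD_ADDR` exported on the submitting machine (see the
+[prerequisites](/guides/spark/pre.html)) can point at that DNS name rather than
+a fixed IP. The service name and port below are defaults and may differ in your
+environment:
+
+```shell
+$ export NOMAD_ADDR=http://nomad.service.consul:4646
+```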
+ +For example, to submit an application in cluster mode: + +```shell +$ spark-submit --class org.apache.spark.examples.SparkPi \ + --master nomad \ + --deploy-mode cluster \ + --conf spark.nomad.sparkDistribution=http://example.com/spark.tgz \ + http://example.com/spark-examples.jar \ + 10 +``` + +## Next Steps + +Learn how to [customize applications](/guides/spark/customizing.html). diff --git a/website/source/guides/spark/template.html.md b/website/source/guides/spark/template.html.md new file mode 100644 index 000000000..1a89d6e0f --- /dev/null +++ b/website/source/guides/spark/template.html.md @@ -0,0 +1,22 @@ +--- +layout: "guides" +page_title: "Apache Spark Integration - Title" +sidebar_current: "guides-spark-name" +description: |- + Learn how to . +--- + +# Title + + + +## Section 1 + +## Section 2 + +## Section 3 + + +## Next Steps + +[Next step](/guides/spark/name.html) diff --git a/website/source/layouts/guides.erb b/website/source/layouts/guides.erb index 9435578ae..097a55785 100644 --- a/website/source/layouts/guides.erb +++ b/website/source/layouts/guides.erb @@ -22,6 +22,37 @@ > Outage Recovery + + > + Apache Spark Integration + + + <% end %>