Using Spark Commands

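Preface

This article briefly introduces Spark's commands. Working through the commands also gives a picture of Spark's services and features along the way. cuiyaonan2000@163.com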

spark-submit

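This is the command and entry point for submitting jobs to a Spark cluster. It supports Spark on YARN and Spark Standalone, among other cluster managers.

Once the servers are set up, the official distribution ships some examples that we can use for testing and learning, for example:

./spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 --queue default ../examples/jars/spark-examples*.jar 10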

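Command format

./spark-submit -h prints the full usage and configuration options for the command.

Official description

The output below lists every option and notes which options apply only to particular modes.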

Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]
Usage: spark-submit run-example [options] example-class [example args]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).
  --archives ARCHIVES         Comma-separated list of archives to be extracted into the
                              working directory of each executor.

  --conf, -c PROP=VALUE       Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Cluster deploy mode only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.

 Spark standalone, Mesos or K8s with cluster deploy mode only:
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone, Mesos and Kubernetes only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone, YARN and Kubernetes only:
  --executor-cores NUM        Number of cores used by each executor. (Default: 1 in
                              YARN and K8S modes, or all available cores on the worker
                              in standalone mode).

 Spark on YARN and Kubernetes only:
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --principal PRINCIPAL       Principal to be used to login to KDC.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above.

 Spark on YARN only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").

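Parameter reference

Each entry below gives the parameter name, its value, a description, and an example.

--class

The entry point of the application, that is, the class containing the main method (comparable to the startup class of a packaged Spring Boot project).

Example: --class org.apache.spark.examples.SparkPi

--master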

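The value identifies which kind of cluster the job is submitted to. Accepted values and the cluster type each one targets:

spark://host:port   Standalone mode
yarn                Spark on YARN mode (connects to YARN according to the HADOOP_CONF_DIR setting)
local               Local mode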
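The mapping above can be sketched as a small shell function. This is only an illustration: `describe_master` is a hypothetical helper name (not a Spark command), the port 7077 is an example value, and only the three values listed above are handled.

```shell
#!/bin/sh
# Map a --master value to the cluster type it targets.
# describe_master is a hypothetical helper, not part of Spark.
describe_master() {
  case "$1" in
    spark://*)        echo "standalone" ;;  # spark://host:port
    yarn)             echo "yarn" ;;        # resolved through HADOOP_CONF_DIR
    local|local\[*\]) echo "local" ;;       # local, local[4], local[*]
    *)                echo "unknown" ;;     # anything not covered in this article
  esac
}

describe_master "spark://host:7077"
describe_master "yarn"
describe_master "local[4]"
```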

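--deploy-mode

Accepted values: cluster, client (default).

cluster: the driver is deployed onto a worker node (Standalone) or a NodeManager node (YARN).
client: the driver runs locally as an external client and communicates with the worker or NodeManager nodes throughout the job.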
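To make the difference concrete, the following sketch prints (without running) the SparkPi submission from earlier in this article once per deploy mode. `build_submit` is a hypothetical helper introduced only for this illustration; the only thing that changes between the two lines is the --deploy-mode value.

```shell
#!/bin/sh
# Print the example submission for a given deploy mode; nothing is executed.
# build_submit is a hypothetical helper, not part of Spark.
build_submit() {
  mode="$1"
  echo "./spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode $mode ../examples/jars/spark-examples*.jar 10"
}

build_submit client    # driver stays on the submitting machine
build_submit cluster   # driver is started inside the cluster
```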

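--name

Assigns a name to the application.

--num-executors

Sets the total number of executor processes the Spark job uses. When the driver requests resources from the YARN cluster manager, YARN tries to start that many executor processes on the cluster's worker nodes.

Example: --num-executors 100

Applies on YARN (the help output above also lists Kubernetes).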

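--executor-memory

Sets the memory of each executor process. Executor memory size often directly determines job performance and is closely tied to the common JVM OOM errors. 4G to 8G per executor is usually a reasonable setting. Check the maximum memory limit of your resource queue: num-executors multiplied by executor-memory is the total memory your job requests, and it should not exceed that limit.

Example: --executor-memory 4g

--executor-cores

Sets the number of CPU cores per executor process, which determines how many task threads each executor can run in parallel. Because each CPU core runs only one task thread at a time, more cores per executor means the executor finishes its assigned tasks faster. 2 to 4 cores per executor is usually a reasonable setting.

Example: --executor-cores 4

Applies on YARN or Standalone (the help output above also lists Kubernetes).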
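The sizing arithmetic described above can be sketched as follows. The executor values come from the examples in this article; `queue_max_memory_gb` is a hypothetical queue limit chosen for illustration, and per-executor memory overhead is ignored.

```shell
#!/bin/sh
# Estimate the total resources a job requests from its executor settings.
num_executors=100        # --num-executors 100
executor_memory_gb=4     # --executor-memory 4g
executor_cores=4         # --executor-cores 4
queue_max_memory_gb=500  # hypothetical queue limit, not a Spark default

total_memory_gb=$((num_executors * executor_memory_gb))
total_cores=$((num_executors * executor_cores))

echo "total executor memory: ${total_memory_gb}G"
echo "total executor cores: ${total_cores}"

# Compare the request against the queue limit before submitting.
if [ "$total_memory_gb" -gt "$queue_max_memory_gb" ]; then
  fits="no"
else
  fits="yes"
fi
echo "fits in queue: $fits"
```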

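--driver-memory

The memory needed by the driver program. The help output above gives the default as 1024M (the 512M figure often quoted applies to older Spark releases).

Example: --driver-memory 1G

--conf spark.default.parallelism

--conf sets an arbitrary Spark configuration property in key=value format. If the value contains spaces, wrap the whole pair in quotes: "key=value".

spark.default.parallelism sets the default number of tasks per stage. This parameter is extremely important, and leaving it unset can directly hurt job performance. Recommendation: err on the side of more default tasks. Not setting it is a mistake: by default Spark derives the task count from the number of underlying HDFS blocks, one task per block. That count is usually on the low side (a few dozen tasks, say), and with too few tasks the executor settings above have little effect. No matter how many executors you have or how much memory and CPU they hold, if there are only 1 or 10 tasks, 90% of the executors may have nothing to run and their resources are wasted. The Spark documentation recommends 2 to 3 tasks per CPU core, that is, 2 to 3 times num-executors * executor-cores. For example, with 300 executor cores in total, about 1000 tasks is a reasonable setting that makes full use of the cluster.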
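As a rough sketch of that rule of thumb, the snippet below derives a 2x to 3x range from the executor settings. The values 75 and 4 are assumptions chosen only to reproduce the 300-core example in the text.

```shell
#!/bin/sh
# Derive a spark.default.parallelism range from the executor settings.
num_executors=75   # assumed value, chosen to give 300 cores in total
executor_cores=4   # assumed value

total_cores=$((num_executors * executor_cores))
parallelism_low=$((total_cores * 2))
parallelism_high=$((total_cores * 3))

echo "total cores: $total_cores"
echo "suggested range: $parallelism_low to $parallelism_high"
echo "--conf spark.default.parallelism=$parallelism_high"
```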

Example: --conf spark.default.parallelism=100

--conf spark.storage.memoryFraction

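Set through --conf like any other Spark configuration property (key=value format).

Sets the fraction of executor memory that persisted RDD data may occupy. The default is 0.6, so by default 60% of executor memory can hold persisted RDD data. Depending on the persistence level you choose, data that does not fit in memory is either not persisted or written to disk. Recommendation: if the job persists many RDDs, raise this value somewhat so the persisted data fits in memory, otherwise it spills to disk and performance drops. If the job is shuffle-heavy and persists little, lower the value instead. Also, if the job runs slowly because of frequent GC (GC time is visible in the Spark web UI), the tasks lack memory for user code, and lowering this value is likewise recommended.

In short, this parameter allocates memory according to the type of computation.

Note that since Spark 1.6 this is a legacy setting: unified memory management (spark.memory.fraction and spark.memory.storageFraction) replaced it, and it only takes effect when spark.memory.useLegacyMode is enabled, an option that Spark 3.0 removed.

--conf spark.shuffle.memoryFraction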

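Set through --conf like any other Spark configuration property (key=value format).

Sets the fraction of executor memory a task may use for aggregation during a shuffle, after it has pulled the output of the previous stage's tasks. The default is 0.2, so by default only 20% of executor memory is available for this. If aggregation needs more than that 20%, the excess is spilled to disk files, which sharply reduces performance. Recommendation: if the job persists few RDDs and shuffles heavily, lower the persistence fraction and raise the shuffle fraction so that large shuffles do not run out of memory and spill to disk. Also, if the job runs slowly because of frequent GC, the tasks lack memory for user code, and lowering this value is likewise recommended.

Like spark.storage.memoryFraction, this is a legacy setting that only takes effect under spark.memory.useLegacyMode in Spark 1.6 through 2.x.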
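To see how the two fractions divide an executor's memory under the legacy model described here, consider this simplified sketch. It assumes a 4G executor and the default fractions, and it ignores the additional safety margins the legacy memory manager applied, so the figures are approximate.

```shell
#!/bin/sh
# Split executor memory by the legacy storage and shuffle fractions.
executor_memory_mb=4096  # assumed: --executor-memory 4g
storage_pct=60           # spark.storage.memoryFraction=0.6 (default)
shuffle_pct=20           # spark.shuffle.memoryFraction=0.2 (default)

# Integer arithmetic, so results are rounded down to whole megabytes.
storage_mb=$((executor_memory_mb * storage_pct / 100))
shuffle_mb=$((executor_memory_mb * shuffle_pct / 100))
user_mb=$((executor_memory_mb - storage_mb - shuffle_mb))

echo "persisted RDD data: ${storage_mb}M"
echo "shuffle aggregation: ${shuffle_mb}M"
echo "user code and everything else: ${user_mb}M"
```

Raising one fraction shrinks what is left for the others, which is why the advice above trades the persistence fraction against the shuffle fraction.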

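This parameter serves much the same purpose as the previous one: both tune memory for a particular type of computation.

--queue

Value: QUEUE_NAME. The YARN queue the application is submitted to. The default is the default queue. Spark on YARN only.

Example: --queue default

--jars

The paths of the jars that the application (the jar we submit) depends on. They can be hdfs:// paths such as hdfs://myjars or file:// paths such as file://myjars.

Note that with file:// every node in the cluster must be able to access the path.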
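Because --jars takes a comma-separated list, several dependency paths have to be joined before they are passed. A minimal sketch, where `join_jars` is a hypothetical helper and the jar file names are placeholders:

```shell
#!/bin/sh
# Join dependency jar paths into the comma-separated value --jars expects.
# join_jars is a hypothetical helper; a.jar and b.jar are placeholder names.
join_jars() {
  result=""
  for jar in "$@"; do
    if [ -z "$result" ]; then
      result="$jar"
    else
      result="$result,$jar"
    fi
  done
  echo "$result"
}

jars=$(join_jars "hdfs://myjars/a.jar" "hdfs://myjars/b.jar")
echo "--jars $jars"
```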

spark-sql

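How bin/spark-sql differs from bin/spark-submit

spark-submit is used to submit jobs. spark-sql starts the Spark SQL command-line client, which executes SQL statements directly (JDBC access is provided separately by the Thrift JDBC/ODBC server). It acts as a client and cannot be used to submit jobs, where submitting a job means uploading a jar to the Spark cluster.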
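The contrast can be illustrated by printing the two kinds of invocation side by side. Nothing is executed in this sketch; the SQL statement is an arbitrary example, and -e is the spark-sql option for running a quoted query string.

```shell
#!/bin/sh
# Print one spark-sql call and one spark-submit call for comparison.
sql_cmd='./spark-sql --master yarn -e "SHOW DATABASES"'
submit_cmd='./spark-submit --class org.apache.spark.examples.SparkPi --master yarn ../examples/jars/spark-examples*.jar 10'

echo "SQL client: $sql_cmd"
echo "Job submission: $submit_cmd"
```

The first takes a SQL statement as input, while the second takes an application jar and its main class.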
