Q&A
Q. What are the factors involved in selecting a specific deployment mode for Spark?
A. The choice of deployment mode for Spark depends primarily on the environment you are running in and on the availability of external scheduling frameworks such as YARN or Mesos. For instance, if you are using Spark with Hadoop and you have an existing YARN infrastructure, Spark on YARN is a logical deployment choice. However, if you are running Spark independently of Hadoop (for instance, sourcing data from S3 or a local filesystem), Spark Standalone may be a better deployment method.
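The rule of thumb in this answer can be written as a small helper. This is only an illustrative sketch: the function name pick_master and the host names are hypothetical, and the ports shown (5050 for Mesos, 7077 for Spark Standalone) are the usual defaults, which your cluster may not use. The YARN value follows the yarn-cluster spelling used in this Q&A; Spark 2.0 and later prefer --master yarn together with --deploy-mode.

```python
def pick_master(has_yarn: bool, has_mesos: bool) -> str:
    """Return a candidate --master value for spark-submit.

    Hypothetical helper: the host names below are placeholders,
    not values taken from any real cluster.
    """
    if has_yarn:
        # Existing Hadoop/YARN infrastructure: let YARN schedule Spark.
        return "yarn-cluster"
    if has_mesos:
        # Mesos master URL; 5050 is the usual default port.
        return "mesos://mesos-master-host:5050"
    # No external scheduler available: use Spark Standalone.
    return "spark://spark-master-host:7077"


if __name__ == "__main__":
    print(pick_master(has_yarn=True, has_mesos=False))
    print(pick_master(has_yarn=False, has_mesos=False))
```

The order of the checks reflects the answer's reasoning: an existing scheduler is preferred when one is available, and Standalone is the fallback.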
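Q. What is the difference between the yarn-client and the yarn-cluster options of the --master argument to spark-submit?
A. Both the yarn-client and yarn-cluster options execute the program in the Hadoop cluster using YARN as the scheduler. The difference is where the driver runs. The yarn-client option uses the client host as the driver for the program and is designed for testing as well as interactive shell usage. The yarn-cluster option runs the driver inside the cluster, within the YARN ApplicationMaster, which makes it better suited to production and other noninteractive applications.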
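To make the two options concrete, the sketch below assembles the corresponding spark-submit command lines. The helper name spark_submit_args and the application file my_app.py are hypothetical. The legacy flag selects the yarn-client and yarn-cluster master values discussed here; with legacy=False it emits the --master yarn plus --deploy-mode form that Spark 2.0 and later prefer. In client mode the driver runs on the submitting host, and in cluster mode it runs inside the YARN cluster.

```python
from typing import List


def spark_submit_args(app: str, deploy_mode: str, legacy: bool = True) -> List[str]:
    """Build a spark-submit argument list for Spark on YARN.

    Hypothetical helper for illustration only.
    deploy_mode: "client" (driver on the submitting host) or
                 "cluster" (driver inside the YARN cluster).
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    if legacy:
        # Older style: the deploy mode is folded into the master value.
        args = ["spark-submit", "--master", "yarn-" + deploy_mode]
    else:
        # Newer style: master and deploy mode are separate arguments.
        args = ["spark-submit", "--master", "yarn", "--deploy-mode", deploy_mode]
    return args + [app]


if __name__ == "__main__":
    # my_app.py is a placeholder application name.
    print(" ".join(spark_submit_args("my_app.py", "client")))
    print(" ".join(spark_submit_args("my_app.py", "cluster", legacy=False)))
```

Returning a list rather than a single string keeps the arguments safe to pass directly to subprocess.run without shell quoting.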