Running Spark Jobs from Eclipse in YARN Mode

Some time ago, just as with MapReduce jobs, I needed to configure the submission of Spark jobs from Eclipse. In this post I will describe in detail what I did to be able to submit Spark jobs through YARN from my IDE.

There are three options for resource management when you are considering Spark: Standalone, YARN and Mesos (out of the scope of this post). The Standalone resource manager is the implementation that ships with the Spark project itself. Despite being simple to install, it lacks basic security mechanisms and dynamic resource sharing between different user groups. YARN, on the other hand, satisfies most user requirements and provides additional flexibility, allowing other data processing frameworks to run on the same infrastructure. In my case these were the determining factors for carrying out the tests with the YARN resource manager.

Setup

  • Hadoop 2.6.0 (from CDH repo)
  • Java 1.8
  • Scientific Linux 6
  • Spark 1.6 (assembled for Hadoop 2.6)
  • Client machine only has Eclipse Luna installed

Project Configuration

Assuming that you have a fully functional, job-submission-ready Hadoop cluster with the YARN resource manager (check this post if you still need to set up the infrastructure), the first thing you need to know is that running Spark jobs on the cluster does not require any modifications to the infrastructure itself. YARN is clever enough to understand what kind of job it is dealing with, as long as the job runnable provides implementations of the expected interfaces (YarnContainer, for example). All the configuration must be done on the client side, so that the resource manager gets enough information to identify the job and figure out how to execute it.

First of all, the required libraries need to be imported into the project classpath. Unfortunately, I was not able to pinpoint the exact dependencies required for job execution and was forced to use the pre-assembled jars provided by the Spark project (check the Spark downloads page and select the package type corresponding to the current version of Spark pre-built for Hadoop 2.6). In Gradle, my dependencies file looks like this:

dependencies.build

    compile files("libs/spark-1.6.0-yarn-shuffle.jar")
    compile files("libs/spark-assembly-1.6.0-hadoop2.6.0.jar")

If you do not want to bother with build systems, you can add the jars directly to the project's build path in Eclipse.

Now you need to pass the Hadoop configuration along with the Spark job: first of all so that Spark knows the service endpoints, and secondly to set up the classpath (since YARN does not provide a default classpath for remote jobs).

          System.setProperty("SPARK_YARN_MODE", "true");
       
          SparkConf sparkConfiguration = new SparkConf();
          sparkConfiguration.setMaster("yarn-client");
          sparkConfiguration.setAppName("test-spark-job");
          sparkConfiguration.setJars(new String[] { "<path the application jar file>" });
       
          sparkConfiguration.set("spark.hadoop.fs.defaultFS", "hdfs://<your host>");
          sparkConfiguration.set("spark.hadoop.dfs.nameservices", "<your host>:8020");
          sparkConfiguration.set("spark.hadoop.yarn.resourcemanager.hostname", "<your host>");
          sparkConfiguration.set("spark.hadoop.yarn.resourcemanager.address", "<your host>:8050");
       
          sparkConfiguration.set("spark.hadoop.yarn.application.classpath",
                      "$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,"
                              + "$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,"
                              + "$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,"
                              + "$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*");
       
          SparkContext sparkContext = new SparkContext(sparkConfiguration);
          JavaSparkContext javaSparkContext = new JavaSparkContext(sparkContext);
A few considerations about the code above. First of all, you should export your project as a Runnable JAR in Eclipse (right-click on the project name -> Export... -> Java -> Runnable JAR file -> select the main class and the output directory -> Finish). Copy the resulting path and jar name and use them to replace the JAR location placeholder in setJars(). Secondly, your HDFS + YARN environment should be up and running before the job is executed (check "Running Hadoop Jobs from Eclipse: Server Side" for more details). The code above is enough to test whether your configuration is correct and allows you to submit jobs, so just throw it into the main method, export the project as a Runnable JAR and run it. You will see immediately if there are problems with your installation.
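
As a minimal smoke test, the sketch below shows what such a main class might look like: the configuration from above plus a trivial parallelize/count action that forces an actual job submission to the cluster. The class name (SparkYarnTest) and the host/jar placeholders are illustrative assumptions, not part of my actual setup.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkContext;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SparkYarnTest {

        public static void main(String[] args) {
            System.setProperty("SPARK_YARN_MODE", "true");

            SparkConf sparkConfiguration = new SparkConf()
                    .setMaster("yarn-client")
                    .setAppName("test-spark-job")
                    .setJars(new String[] { "<path to the application jar file>" });

            sparkConfiguration.set("spark.hadoop.fs.defaultFS", "hdfs://<your host>");
            sparkConfiguration.set("spark.hadoop.yarn.resourcemanager.hostname", "<your host>");
            sparkConfiguration.set("spark.hadoop.yarn.resourcemanager.address", "<your host>:8050");
            sparkConfiguration.set("spark.hadoop.yarn.application.classpath",
                    "$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,"
                            + "$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,"
                            + "$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,"
                            + "$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*");

            JavaSparkContext javaSparkContext =
                    new JavaSparkContext(new SparkContext(sparkConfiguration));

            // a trivial distributed computation: if this returns, YARN allocated the
            // containers and the executors were able to load the application classes
            long count = javaSparkContext.parallelize(Arrays.asList(1, 2, 3, 4, 5)).count();
            System.out.println("Counted " + count + " elements via YARN");

            javaSparkContext.stop();
        }
    }

Exporting this class as the Runnable JAR, pointing setJars() at the exported file and running it from Eclipse should produce a "test-spark-job" application in the YARN ResourceManager UI.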