Tuesday, July 14, 2015

BigInsights 4 - Creating An Eclipse and Spark Development Environment

I've just started developing my big data applications with Spark, and the thing I wanted most was a development environment for Eclipse (I'm not an IntelliJ user) that allowed me to build and run Spark/Scala applications locally. Fortunately, all of the components needed to make this happen exist, and being a long-time Java/Maven/Eclipse user I found getting them working together relatively straightforward. The approach is already described in a number of articles (Google for Spark and Eclipse), but I wanted to make it work with the jar versions that ship with BigInsights v4.0.0.1. Here's what I did.

Dependencies

The major dependencies I'm working with are as follows:

Windows 7
IBM JDK 7 (SR3 build 2.7)
Scala 2.10.4
Spark 1.2.1
Eclipse Juno SR2
Maven 3.0.5

Summary Of The Steps Involved



1 - Install the prerequisites listed above (IBM JDK, Eclipse and Maven)
2 - Install the Scala Eclipse plugin from the update site http://download.scala-ide.org/sdk/helium/e38/scala210/stable/site (update sites for other Eclipse versions are listed at http://scala-ide.org/download/prev-stable.html)
3 - Create a Maven project for Spark development
4 - Set up Eclipse so that it can import the Maven project (http://scala-ide.org/docs/tutorials/m2eclipse/)
4a - Install winutils.exe so that Hadoop will work on Windows 7 (Windows users only)
5 - Import the project into Eclipse as a Maven project

I describe steps 3, 4a and 5 in more detail below.

Creating A Maven Project For Spark Development (Step 3 Detail)

In some directory create a new project directory with the following structure:


SomeDirectory/
    MySparkProject/
        src/
            main/
                scala/
        pom.xml


There are probably Maven archetypes for creating these directories, but I didn't check, as for my purposes I only needed this very simple structure. The pom.xml file is obviously the interesting thing here. I started with the following contents:




<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.ibm.ets</groupId>
    <artifactId>MyProject</artifactId>
    <version>0.1-SNAPSHOT</version>

    <name>MyProject</name>

    <repositories>
        <repository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </pluginRepository>
    </pluginRepositories>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.4</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.2.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>1.2.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>1.2.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>1.2.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.10</artifactId>
            <version>1.2.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-graphx_2.10</artifactId>
            <version>1.2.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.0</version>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.scala-tools</groupId>
                    <artifactId>maven-scala-plugin</artifactId>
                    <version>2.15.2</version>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.1</version>
                    <configuration>
                        <source>1.7</source>
                        <target>1.7</target>
                    </configuration>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>



I've selected Scala, Spark and Hadoop versions that match with those shipped in BigInsights 4.


You can now build Scala/Spark (or Java/Spark) programs here using your favorite text editor (e.g. Notepad or vi) and compile them with Maven to produce a jar for use with spark-submit. However, I wanted to do my development in Eclipse.
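For example, a minimal program dropped into src/main/scala is enough to verify that the toolchain compiles Scala against the Spark jars. This is just a sketch; the object name and input file are placeholders of my own, not part of the tutorial code:

import org.apache.spark.{SparkConf, SparkContext}

// A minimal sanity check: count the lines of a local text file.
// Runs with an in-process master, so no cluster is needed.
object BuildCheck {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Build Check").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("pom.xml") // any local text file will do
    println("Line count: " + lines.count())
    sc.stop()
  }
}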

Installing winutils So This All Works on Windows (Step 4a Detail)

This is, of course, for Windows users only.

I initially got the following error when running a Spark program from Eclipse:
 
Could not locate executable null\bin\winutils.exe
 
There are lots of posts about this and the solution is easy; see, for example, http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path.

I put winutils.exe in a bin subdirectory beneath a convenient directory and then set the HADOOP_HOME environment variable as follows:

HADOOP_HOME=C:\simon\svn\bigdata2\Spark\winutil 

Then remember to restart Eclipse! 
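If you prefer not to rely on an environment variable, the same setting can be made from code. This is a sketch rather than something from the original write-up: it must run before the SparkContext is created, and the path should be whatever directory contains bin\winutils.exe on your machine.

// Equivalent to setting the HADOOP_HOME environment variable; must run
// before the SparkContext is created. The path is an example and should
// point at the directory that contains bin\winutils.exe.
System.setProperty("hadoop.home.dir", "C:\\simon\\svn\\bigdata2\\Spark\\winutil")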
 

Importing The Project Into Eclipse (Step 5 Detail)

You will have installed the m2eclipse plugin at step 4 (see the summary of steps above), so you can now import your Spark Maven project via File > Import... > Maven > Existing Maven Projects.
To run Spark programs from Eclipse you need to configure Spark to run locally rather than on a cluster. As a convenience, I pass a parameter into my code to indicate whether I'm running in Eclipse. I modified the sample Spark code as follows to process the parameter.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD

object SparkTutorial {
  def main(args: Array[String]) {
    var conf: SparkConf = null
    var taxi: RDD[String] = null

    // "local" as the first argument means we are running inside Eclipse,
    // so use an in-process master rather than the YARN cluster.
    if (args.length > 0 && args(0) == "local") {
      conf = new SparkConf().setAppName("Spark Tutorial").setMaster("local[*]")
    } else {
      conf = new SparkConf().setAppName("Spark Tutorial").setMaster("yarn-cluster")
    }

    val sc = new SparkContext(conf)

    // Likewise, read the input from a local file when running in Eclipse
    // and from HDFS when running on the cluster.
    if (args.length > 0 && args(0) == "local") {
      taxi = sc.textFile("nyctaxisub.csv", 2)
    } else {
      taxi = sc.textFile("/user/spark/simonstuff/sparkdata/nyctaxisub/*")
    }

    // Split each CSV line, key on the medallion column and count the trips.
    val taxiSplitColumns = taxi.map(line => line.split(','))
    val taxiMedCountsTuples = taxiSplitColumns.map(vals => (vals(6), 1))
    val taxiMedCountsOneLine = taxiMedCountsTuples.reduceByKey((x, y) => x + y)

    // Print the ten medallions with the most trips.
    for (pair <- taxiMedCountsOneLine.map(_.swap).top(10)) {
      println("Taxi Medallion %s had %s Trips".format(pair._2, pair._1))
    }
  }
}

Note that I check twice whether a single parameter value of "local" has been passed into the program. If so, setMaster("local[*]") instructs Spark to run locally, i.e. without a separate cluster installation, and the input data is read from a local path rather than from an HDFS path.
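The two identical checks could also be folded into a single flag computed once. Here is a sketch of that variant, using the same application name, master settings and paths as above:

// Compute the "running locally" flag once and reuse it for both decisions.
val isLocal = args.length > 0 && args(0) == "local"
val conf = new SparkConf()
  .setAppName("Spark Tutorial")
  .setMaster(if (isLocal) "local[*]" else "yarn-cluster")
val sc = new SparkContext(conf)
val taxi = if (isLocal) sc.textFile("nyctaxisub.csv", 2)
           else sc.textFile("/user/spark/simonstuff/sparkdata/nyctaxisub/*")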

I pass the parameter into the program via an Eclipse run configuration: in Run > Run Configurations..., select the program's launch configuration and add local under Program arguments on the Arguments tab.


Running On a Cluster


Having developed your Spark application in Eclipse, you probably want to run it on a cluster. I export the project from Eclipse as a Java jar file. You can also generate the jar by going back to your Maven project and building it there, for example by running mvn package.

Once you have the jar you can copy it to the head node of your cluster and run it using spark-submit, for example:

spark-submit --class "SparkTutorial" --master yarn-cluster /SomeDirectory/MyProject/target/MyProject.jar

  

2 comments:

Unknown said...

Had problems when installing the maven2eclipse plugin v1.5.2; Eclipse was complaining about missing jar files:
com.google.guava_12.0.0.v201212092141
com.google.guava_15.0.0.v201403281430

The solution to this is to make sure you have the Luna repository in "Available Software Sites".

You can add it in Help -> Install New Software... Then, in the "Work with" input, type http://download.eclipse.org/releases/luna/ and press Enter. You do not need to install anything; just searching for the software will automatically make the Luna repository available.

After that, you should be able to install m2e with http://download.eclipse.org/technology/m2e/releases/ .

Solution found on stackoverflow:
https://stackoverflow.com/questions/24479109/maven-for-eclipse-1-5-0-plugin-cannot-be-installed-under-kepler/24490176#24490176?newreg=4c262412f8734fa49b7e565d16470281

simon said...

I have just repeated this on Eclipse Mars (4.5). Here are the dependencies:

IBM JDK 8

Scala Eclipse Plugin
http://scala-ide.org/download/current.html
http://download.scala-ide.org/sdk/lithium/e44/scala211/stable/site

M2E Scala Plugin
http://alchim31.free.fr/m2e-scala/update-site/

I had to manually set the Scala compiler for the project to 2.10.x, as that's what I happen to be using (see project preferences / Scala Compiler).

I also had to set the target runtime to Java 7, as no class files were generated when it was set to Java 8 (see project preferences / Scala Compiler). Not sure if this is related to the selected Scala version.