Wednesday, June 1, 2016

BigInsights 4 - Running later versions of Spark

With BigInsights you get what you get in terms of the component versions that ship in the box. I'm currently on BigInsights 4.1, which ships with Spark 1.4.1, and I needed to quickly test something on the cluster with Spark 1.6.1. What I did is not a supported configuration, but it got me going quickly and relatively simply. YMMV.

First, I downloaded the Spark binaries from Apache, pre-built against the matching Hadoop version. I used the following link

http://spark.apache.org/downloads.html

I selected

"1.6.1" as the Spark release
"Pre-built for Hadoop 2.6 or later" as the package type

This gave me spark-1.6.1-bin-hadoop2.6.tgz. I checked the signatures and unpacked it in my home directory on the cluster. My home directory happens to be mounted on all of the nodes in the cluster.
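
If you want that as commands rather than clicks, roughly the following is what the download, verify and unpack steps amount to (the archive.apache.org URLs are my assumption; use whichever mirror the downloads page offers you):

# Fetch the pre-built tarball and its signature
wget https://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz
wget https://archive.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz.asc

# Verify the signature against the Spark release KEYS file
wget https://archive.apache.org/dist/spark/KEYS
gpg --import KEYS
gpg --verify spark-1.6.1-bin-hadoop2.6.tgz.asc spark-1.6.1-bin-hadoop2.6.tgz

# Unpack somewhere visible from every node (my home directory is mounted cluster-wide)
mkdir -p ~/spark
tar -xzf spark-1.6.1-bin-hadoop2.6.tgz -C ~/spark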

I then wrote a script called spark-submit.sh to use in place of the spark-submit command that ships with Spark 1.4.1 on BigInsights 4.1. It has the following contents:

#!/bin/bash
# Point at the 1.6.1 install unpacked in my home directory
export SPARK_HOME=/home/me/spark/spark-1.6.1-bin-hadoop2.6
export PATH=$SPARK_HOME/bin:$PATH

# Options read in YARN client mode
# - HADOOP_CONF_DIR, to point Spark towards Hadoop configuration files
export HADOOP_CONF_DIR=/etc/hadoop/conf
# - SPARK_EXECUTOR_INSTANCES, number of executors to start (default: 2)
export SPARK_EXECUTOR_INSTANCES=1

# Delegate to the 1.6.1 spark-submit, passing all arguments through
$SPARK_HOME/bin/spark-submit --master yarn-client "$@"


Note that here we are just setting the environment to point at the 1.6.1 install and, for testing purposes, using a single executor. The script then calls through to the 1.6.1 spark-submit program, which in turn uses the BigInsights YARN installation to fire up containers for the Spark job you are submitting.
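
As a quick smoke test of the wrapper, you can submit the examples jar that ships inside the 1.6.1 distribution. Something along these lines should work (the jar name below is what the pre-built Hadoop 2.6 package contains; adjust it if yours differs):

chmod +x spark-submit.sh
./spark-submit.sh \
  --class org.apache.spark.examples.SparkPi \
  /home/me/spark/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar \
  10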

Note that the Spark distribution directory contains a conf directory. In my case

/home/me/spark/spark-1.6.1-bin-hadoop2.6/conf

Two files need to be created there. When I first tried to run some jobs I was getting the error

Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster

To solve this I did the following. As far as I can tell, the IOP cluster configuration references an iop.version property in the classpath entries it hands to YARN containers, and nothing defines that property when you bypass the stock Spark install, so it has to be supplied explicitly as a system property.

cd /home/me/spark/spark-1.6.1-bin-hadoop2.6/conf
cp spark-defaults.conf.template spark-defaults.conf
vi spark-defaults.conf

and add the following lines:

spark.executor.extraJavaOptions  -Diop.version=4.1.0.0
spark.driver.extraJavaOptions    -Diop.version=4.1.0.0
spark.yarn.am.extraJavaOptions   -Diop.version=4.1.0.0
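
If you would rather not edit the file in vi, appending the same three lines with a heredoc works too:

cat >> spark-defaults.conf <<'EOF'
spark.executor.extraJavaOptions  -Diop.version=4.1.0.0
spark.driver.extraJavaOptions    -Diop.version=4.1.0.0
spark.yarn.am.extraJavaOptions   -Diop.version=4.1.0.0
EOF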

Then, in the same directory:

touch java-opts

and add the single line:

-Diop.version=4.1.0.0
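
Equivalently, in one line, with a quick check that both files now carry the property:

echo "-Diop.version=4.1.0.0" > java-opts
grep -H iop.version spark-defaults.conf java-opts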


After this my example program ran successfully.

I can now choose which version of Spark I want to run with, without having changed the Ambari-based configuration of BigInsights. Carrying an unmanaged Spark install like this is not ideal if you want a consistent cluster, but it is fine for running a few quick tests.