First I downloaded the Spark binaries from Apache, built against the right version of Hadoop. On the download page I selected
"1.6.1" as the Spark release
"Pre-built for Hadoop 2.6 or later" as the package type
This gave me spark-1.6.1-bin-hadoop2.6.tgz. I checked the signatures and unpacked it in my home directory on the cluster. My home directory happens to be mounted on all of the nodes in the cluster.
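The verify-and-unpack step can be sketched as follows. This is a self-contained illustration, not the exact commands from the post: the post checks signatures, while this sketch verifies a checksum file so it can run anywhere, and it fabricates a tiny stand-in archive first because the real tarball is not present here (only the file names mirror the post).

```shell
set -e
cd "$(mktemp -d)"

# Scaffolding: fabricate a tiny stand-in for the real download plus its checksum file.
mkdir spark-1.6.1-bin-hadoop2.6
echo "Spark 1.6.1 built for Hadoop 2.6.0" > spark-1.6.1-bin-hadoop2.6/RELEASE
tar -czf spark-1.6.1-bin-hadoop2.6.tgz spark-1.6.1-bin-hadoop2.6
rm -r spark-1.6.1-bin-hadoop2.6
sha512sum spark-1.6.1-bin-hadoop2.6.tgz > spark-1.6.1-bin-hadoop2.6.tgz.sha512

# The actual pattern: verify the archive, then unpack into the (shared) home directory.
sha512sum -c spark-1.6.1-bin-hadoop2.6.tgz.sha512   # exits non-zero on a corrupt download
tar -xzf spark-1.6.1-bin-hadoop2.6.tgz
ls spark-1.6.1-bin-hadoop2.6                        # lists RELEASE
```

Because the home directory is mounted on every node, unpacking it once there makes the same Spark install visible to all the YARN containers.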
I then wrote a script called spark-submit.sh to use in place of the spark-submit command that ships with Spark 1.4.1 from BigInsights 4.1. It has the following contents:
#!/bin/sh
# Options read in YARN client mode:
# - HADOOP_CONF_DIR, to point Spark towards the Hadoop configuration files
# - SPARK_EXECUTOR_INSTANCES, number of executors to start (default: 2)
export SPARK_HOME="$HOME/spark-1.6.1-bin-hadoop2.6"
export HADOOP_CONF_DIR=/etc/hadoop/conf   # adjust to your cluster's Hadoop config location
export SPARK_EXECUTOR_INSTANCES=1         # one executor for testing
exec "$SPARK_HOME/bin/spark-submit" --master yarn-client "$@"
Note that here we are just setting the environment to point to the 1.6.1 install and, for testing purposes, using a single executor. The script then calls through to the 1.6.1 spark-submit program, which in turn uses the BigInsights YARN installation to fire up containers for the Spark job you are submitting.
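To sanity-check that a wrapper like this forwards every trailing argument through "$@", here is a runnable sketch. The stub spark-submit and the temp-directory paths are scaffolding invented for the demo; the real wrapper points SPARK_HOME at the actual unpacked 1.6.1 directory.

```shell
set -e
demo="$(mktemp -d)"

# Stub standing in for the real 1.6.1 install: it just echoes the arguments it receives.
mkdir -p "$demo/spark-1.6.1-bin-hadoop2.6/bin"
printf '#!/bin/sh\necho "spark-submit $*"\n' > "$demo/spark-1.6.1-bin-hadoop2.6/bin/spark-submit"
chmod +x "$demo/spark-1.6.1-bin-hadoop2.6/bin/spark-submit"

# A wrapper shaped like the one in the post; here SPARK_HOME comes from the environment.
cat > "$demo/spark-submit.sh" <<'EOF'
#!/bin/sh
exec "$SPARK_HOME/bin/spark-submit" --master yarn-client "$@"
EOF
chmod +x "$demo/spark-submit.sh"

# Every trailing argument reaches spark-submit unchanged, after --master yarn-client.
SPARK_HOME="$demo/spark-1.6.1-bin-hadoop2.6" \
  "$demo/spark-submit.sh" --class org.apache.spark.examples.SparkPi lib/spark-examples.jar
# prints: spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi lib/spark-examples.jar
```

With the real install in place of the stub, the same invocation would submit the SparkPi example to YARN in client mode.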
Note that the Spark distribution directory contains a conf directory; in my case, two files needed to be created there. When I first tried to run some jobs I got the error
Error: Could not find or load main class org.apache.spark.deploy.yarn.ApplicationMaster
To solve this I did the following in the conf directory:
cp spark-defaults.conf.template spark-defaults.conf
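Copying the template was enough in my case. If the ApplicationMaster error persists, one setting sometimes added to spark-defaults.conf for a hand-unpacked Spark 1.x install on YARN (an assumption on my part, not a step from this walkthrough) is spark.yarn.jar, pointing at the assembly jar shipped in the distribution's lib directory. The path below is hypothetical:

```
# Hypothetical path; substitute the real location of your unpacked 1.6.1 tarball.
spark.yarn.jar file:///home/myuser/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar
```

This makes the YARN containers fetch the 1.6.1 assembly explicitly rather than relying on whatever the cluster has on its classpath.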
After this my example program ran successfully.
I can now choose which version of Spark I want to run with, without having changed the Ambari-based configuration of BigInsights. This is not ideal if you want a consistent cluster, but it is fine for me running a few quick tests.