Wednesday, August 26, 2015

BigInsights 4 - Setting up to use SparkR

There is a great post here http://www.r-bloggers.com/installing-and-starting-sparkr-locally-on-windows-os-and-rstudio/ about getting set up with Spark, SparkR and RStudio. I'm not going to repeat the detail, but I will record a few notes about my own setup.

With BigInsights 4.1 having just been released (http://www-01.ibm.com/software/data/infosphere/hadoop/trials.html), we now get the IBM Open Platform including Spark 1.4.1 and hence SparkR. I have BigInsights installed on a cluster, but to get started on my Windows 7 machine I downloaded a binary build of Spark 1.4.1 separately.

I tried following the instructions in the linked blog but got a warning that SparkR had been compiled with R 3.1.3 while I was on an older version (3.0.3). So first of all I upgraded my R version to the latest (3.2.2).
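A quick way to check which version of R a session is running, before and after the upgrade:

# print the version of R the current session is running
R.version.string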

My install of RStudio magically detected the newly installed version of R and ran with it (presumably it picks the version number up from the registry).

The SparkR library now loads without any warnings. However, the sparkR.init() step doesn't complete. There are some posts about this issue out there. I found that the problem was that the scripts provided in the Spark download from Apache were not marked as executable, even on Windows. Doing an "ls" in Cygwin gave:

4 -rwxr-xr-x  1 slaws None 3121 Jul  8 23:59 spark-shell
4 -rw-r--r--  1 slaws None 1008 Jul  8 23:59 spark-shell.cmd
4 -rw-r--r--  1 slaws None 1868 Jul  8 23:59 spark-shell2.cmd
4 -rwxr-xr-x  1 slaws None 1794 Jul  8 23:59 spark-sql
4 -rwxr-xr-x  1 slaws None 1291 Jul  8 23:59 spark-submit
4 -rw-r--r--  1 slaws None 1083 Aug 26 10:57 spark-submit.cmd
4 -rw-r--r--  1 slaws None 1374 Jul  8 23:59 spark-submit2.cmd

A quick "chmod 755 *" got me a step closer. Now when running the sparkR.init() I get not error response but it still fails to step. Further investigation showed that depending on how the system2() call is configured, that sparkR.init() uses under the covers, the the call out to spark-commit worked or didn't work. This seemed very strange so somewhat at random I switched to R version 3.1.3 which, from above, you will see is the version that the SparkR library is built against. Low and behold it worked. Yay!

Here is the script I'm playing with, which is 99% copied from the blog linked at the top of this post.

# primarily copied from
# http://www.r-bloggers.com/installing-and-starting-sparkr-locally-on-windows-os-and-rstudio/

# add SparkR lib dir to library paths
Sys.setenv(SPARK_HOME = "C:/simonbu/big-data/Spark/runtime/spark-1.4.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
.libPaths()
#.Platform$OS.type
#getAnywhere(launchBackend)

# load SparkR
library(SparkR)

# initialise the SparkR context using Spark locally
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# create a spark data frame with a sample data set
sparkDF <- createDataFrame(sqlContext, faithful)
head(sparkDF)

# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))

# Convert local data frame to a SparkR DataFrame
convertedDF <- createDataFrame(sqlContext, localDF)

printSchema(sparkDF)
# printSchema(localDF) - doesn't work: printSchema() expects a SparkR DataFrame, not a local data.frame
printSchema(convertedDF)

summary(convertedDF)

# Register this DataFrame as a table.
registerTempTable(convertedDF, "people")

# SQL statements can be run by using the sql methods provided by sqlContext
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")

# Call collect to get a local data.frame
teenagersLocalDF <- collect(teenagers)

# Print the teenagers in our dataset
print(teenagersLocalDF)

sparkR.stop()
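If everything is wired up correctly, the final print() returns a two-row local data.frame containing John and Sarah, the two names whose ages fall in the 13 to 19 range.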

