Wednesday, August 26, 2015

BigInsights 4 - Setting up to use SparkR

There is a great post here http://www.r-bloggers.com/installing-and-starting-sparkr-locally-on-windows-os-and-rstudio/ about getting set up with Spark, SparkR and RStudio. I'm not going to repeat it all but just record a few details about my setup.

With BigInsights 4.1 just having been released (http://www-01.ibm.com/software/data/infosphere/hadoop/trials.html) we now get the IBM Open Platform including Spark 1.4.1, and hence SparkR. I have BigInsights installed on a cluster, but to get started on my Windows 7 machine I downloaded a binary version of Spark 1.4.1 separately.

I tried following the instructions in the linked blog but I got a warning that SparkR had been compiled with R 3.1.3 while I was on an older version (3.0.3). So first of all I upgraded my R version to the latest (3.2.2).

My install of RStudio magically detected the newly installed version of R and ran with that (presumably it picks the version number up from the registry).

The SparkR library now loads without any warnings. However, the sparkR.init() step doesn't complete. There are some posts about this issue out there. I found that the problem was that the scripts as provided in the Spark download from Apache were not set as executable, even on Windows. Doing an "ls" in Cygwin gave.

4 -rwxr-xr-x  1 slaws None 3121 Jul  8 23:59 spark-shell
4 -rw-r--r--  1 slaws None 1008 Jul  8 23:59 spark-shell.cmd
4 -rw-r--r--  1 slaws None 1868 Jul  8 23:59 spark-shell2.cmd
4 -rwxr-xr-x  1 slaws None 1794 Jul  8 23:59 spark-sql
4 -rwxr-xr-x  1 slaws None 1291 Jul  8 23:59 spark-submit
4 -rw-r--r--  1 slaws None 1083 Aug 26 10:57 spark-submit.cmd
4 -rw-r--r--  1 slaws None 1374 Jul  8 23:59 spark-submit2.cmd

A quick "chmod 755 *" got me a step closer. Now when running the sparkR.init() I get not error response but it still fails to step. Further investigation showed that depending on how the system2() call is configured, that sparkR.init() uses under the covers, the the call out to spark-commit worked or didn't work. This seemed very strange so somewhat at random I switched to R version 3.1.3 which, from above, you will see is the version that the SparkR library is built against. Low and behold it worked. Yay!

Here is the script I'm playing with, which is 99% copied from the blog linked at the top of this post.

# primarily copied from
# http://www.r-bloggers.com/installing-and-starting-sparkr-locally-on-windows-os-and-rstudio/

# add SparkR lib dir to library paths
Sys.setenv(SPARK_HOME = "C:/simonbu/big-data/Spark/runtime/spark-1.4.1-bin-hadoop2.6")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"),"R","lib"), .libPaths()))
.libPaths()
#.Platform$OS.type
#getAnywhere(launchBackend)

# load SparkR
library(SparkR)

# initialise the SparkR context using Spark locally
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# create a spark data frame with a sample data set
sparkDF <- createDataFrame(sqlContext, faithful)
head(sparkDF)

# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))

# Convert local data frame to a SparkR DataFrame
convertedDF <- createDataFrame(sqlContext, localDF)

printSchema(sparkDF)
# printSchema(localDF)  - doesn't work
printSchema(convertedDF)

summary(convertedDF)

# Register this DataFrame as a table.
registerTempTable(convertedDF, "people")

# SQL statements can be run by using the sql methods provided by sqlContext
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")

# Call collect to get a local data.frame
teenagersLocalDF <- collect(teenagers)

# Print the teenagers in our dataset
print(teenagersLocalDF)

sparkR.stop()


Tuesday, August 25, 2015

Running Titan with Solr - Part 1

I wanted to understand the integration between Titan and Solr. Titan can use Solr (or Elastic Search) to index vertex and edge properties but it's not clear from reading the internet how this works and whether you have any sensible access to the created index outside of the Titan API.

Installing Titan 

I started with the recent Titan 0.9.0 M2 release from here http://s3.thinkaurelius.com/downloads/titan/titan-0.9.0-M2-hadoop1.zip

I unzipped that and tried to fire up gremlin with

bin/gremlin.bat

It failed, complaining that the classpath was too long, so I made a few changes to the script. I commented out the part that loops round collecting all the individual jar paths.

::set CP=
::for %%i in (%LIBDIR%\*.jar) do call :concatsep %%i

Further down I directly set the classpath to be just the jars from the lib directory using a wildcard, i.e. I was going for the shortest classpath possible.

::set CLASSPATH=%CP%;%OLD_CLASSPATH%
set CLASSPATH=./lib/*

This seemed to do the trick but then I fell over Java version problems. I upgraded to Java 8. The Titan jars wouldn't run with the IBM JDK for some reason so I ended up with the Oracle JDK and set my environment accordingly.

set path=C:\simon\apps\jdk-8-51-sun\bin;c:\Windows\system32
set JAVA_HOME=C:\simon\apps\jdk-8-51-sun
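A quick check that the shell is now picking up the Oracle JDK:

java -version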

Installing Solr

With the gremlin shell now running I set about getting solr going. I downloaded 5.2.1 from here http://archive.apache.org/dist/lucene/solr/

After unzipping I had a poke about. As a first-time user it's hard to work out how to get it going. I wasn't sure whether I needed a standalone setup or a cloud setup for my simple testing. The thing that was really confusing me was the question of what schema was required.

I tried the cloud example

bin\solr start -e cloud

and answered the questions. This brought up Solr at http://localhost:8983/solr. But then I wanted to see if I could add some data so I tried the post example.

bin/post -c gettingstarted docs/

But that didn't work as it complained that it didn't understand the fields that were being pushed in. I tried creating a "core" in the admin UI but couldn't work out how to make that hang together. Eventually I found.

bin\solr start -e schemaless

And life was good! I was able to run the post example and see the data in the index.
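To convince myself the documents really had been indexed, you can also query the core that the schemaless example creates (called gettingstarted) directly with a standard select query; adjust the core name if yours differs.

curl "http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json&rows=1"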

Having got Solr going I set about creating a core to store the Titan index. I went with.

name: titan
instance: /tmp/solr-titan/titan
data: /tmp/solr-titan/data

I copied the contents of titan/conf/solr to /tmp/solr-titan/titan-core/conf.

I had to comment out some stuff to do with geo in the schema.xml due to a class-not-found problem, and then I successfully created the titan core.
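For reference, the same core can be created from the command line through Solr's CoreAdmin API rather than the admin UI. A sketch using the paths above and the standard CoreAdmin parameters; I created mine through the UI, so treat this as an untested equivalent.

curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=titan&instanceDir=/tmp/solr-titan/titan&dataDir=/tmp/solr-titan/data"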

I started out with a different name for the core but had to come back and rename it to "titan". See below.

Connecting Titan to Solr

Having got Titan and Solr working I now needed to start Titan with a suitable connection to Solr. From the gremlin shell I did.

graph = TitanFactory.open('conf/titan-berkeleyje.properties')

This command creates a Titan database runtime within the gremlin process based on the configuration in the properties file. If you look inside, the file just tells Titan where to find Solr. Here is a subset of the file.

storage.directory=../db/berkeley
index.search.backend=solr
index.search.solr.mode=http
index.search.solr.http-urls=http://localhost:8983/solr

I just used the settings as supplied. There are a couple of gotchas here though.

Firstly, the Solr core name isn't defined in the properties file and Titan assumes it will be called "titan". I called it something else at first and had to go back and rename it.
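If you're not sure what your core actually ended up being called, the CoreAdmin STATUS call lists the cores Solr knows about:

curl "http://localhost:8983/solr/admin/cores?action=STATUS"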

Secondly, when you run this TitanFactory.open command it creates the database based on the storage.directory property and caches all of these properties in the database, so when you restart Titan it's ready to go. The downside is that I had tried a couple of configurations before settling on this one, and the first one involved Elasticsearch. I was subsequently confused that when I tried to run against Solr I was getting the following error.

15/08/17 14:45:36 INFO util.ReflectiveConfigOptionLoader: Loaded and initialized config classes: 12 OK out of 12 attempts in PT0.196S
15/08/17 14:45:36 WARN configuration.GraphDatabaseConfiguration: Local setting index.search.solr.mode=http (Type: GLOBAL_OFFLINE) is overridden by globally managed value (cloud).  Use the ManagementSystem interface instead of the local configuration to control this setting.
15/08/17 14:45:36 WARN configuration.GraphDatabaseConfiguration: Local setting index.search.backend=solr (Type: GLOBAL_OFFLINE) is overridden by globally managed value (elasticsearch).  Use the ManagementSystem interface instead of the local configuration to control this setting.
15/08/17 14:45:36 INFO configuration.GraphDatabaseConfiguration: Generated unique-instance-id=0914d88f7404-R9E67YR1
15/08/17 14:45:36 INFO diskstorage.Backend: Configuring index [search]
15/08/17 14:45:37 INFO elasticsearch.plugins: [Blink] loaded [], sites []
15/08/17 14:45:39 INFO es.ElasticSearchIndex: Configured remote host: 127.0.0.1:9300
Could not instantiate implementation: com.thinkaurelius.titan.diskstorage.es.ElasticSearchIndex
Display stack trace? [yN]

The answer is simply to delete the data, in my case the db/berkeley directory, and start again.

This got Titan and Solr up and running and I was ready to create a graph and look at what index was generated. I'll create a separate post about that.