Monday, July 6, 2015

BigInsights 4 - First Giraph Example

After running some of the example that come with Giraph, including SimpleShortestPathsComputation, I wanted to try something for myself with a different data set. I chose to use the Google 2002 programming contest data set you can find here. One interesting thing about this SNAP data is that it is an edge data set to it's a file containing pairs of vertex numbers representing graph edges, for example

1 2
2 3
1 3

This format is different from the format used in the Giraph examples which is typically vertex oriented, for example, the SimpleShortestPathsComputation (as described on the Quick Start page) uses the following data set:

[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]
 
Here the first number pair on each line is the vertex ID and a vertex value of the vertex that the line describes. The following number pairs on each line describe edges to other vertexes and associated edge values.

So the biggest challenge for me was working out how to read in an edge format rather than the vertex format used in the examples.

I decided the re-use the SimpleInDegreeCountComputation but re-purpose it to read the edge value data. This is what I ended up with.

/**
 * Simple function to return the out degree for each vertex.
 */
//@Algorithm(
//    name = "Indegree Count"
//)
public class GoogleInDegreeCount extends BasicComputation<
  IntWritable, IntWritable, NullWritable, IntWritable> {

  @Override
  public void compute(
      Vertex<IntWritable, IntWritable, NullWritable> vertex,
      Iterable<IntWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      Iterable<Edge<IntWritable, NullWritable>> edges = vertex.getEdges();
      for (Edge<IntWritable, NullWritable> edge : edges) {
        sendMessage(edge.getTargetVertexId(), new IntWritable(1));
      }
    } else {
      int sum = 0;
      for (IntWritable message : messages) {
        sum++;
      }
      IntWritable vertexValue = vertex.getValue();
      vertexValue.set(sum);
      vertex.setValue(vertexValue);
      vertex.voteToHalt();
    }
  }
}
 
 The first thing to note is the change of the type parameters for the base class

BasicComputation<IntWritable, IntWritable, NullWritable, IntWritable>
 
If you look at the documentation for this class you see that the  type parameters are:

org.apache.giraph.graph.BasicComputation<I,V,E,M>

I - Vertex id
V - Vertex data
E - Edge data
M - Message type
 So for edge data (which is two integers per line remember) I'm saying that
  
         I  - the vertex ID is an integrer
         V - the vertex data is an integer as it's a degree count
         E - the edge data is null as we are not storing any edge data
         M - the message type is an integer as we are propagating degree counts (which are integers)

Also of interest is the signature of the compute method

public void compute(
      Vertex<IntWritable, IntWritable, NullWritable> vertex,
      Iterable<IntWritable> messages)
 
Again if we look up in the documentation what this is supposed to look like we see the following:

public abstract void compute(Vertex<I,V,E> vertex, Iterable<M1> messages)

The types follow the pattern set by the BasicCompute signature above.

The rest of the code remains the same but with suitable changes to take account of the changes in the type parameters.

To run it we have to change the way that we launch the Giraph application. Here's the script:

export GG=/home/slaws/graph/giraph-google
export GH=/home/slaws/graph/giraph-1.1.0-hadoop2-for-hadoop-2.5.1
export GL=${GH}/lib

# declare jars that are required by the driver program
export HADOOP_CLASSPATH=${GH}/giraph-core-1.1.0-hadoop2.jar:${GG}/target/giraph-google-1.0.jar:${GL}/*

# declare jars that are required by each computation node
export LIBJARS=${GH}/giraph-core-1.1.0-hadoop2.jar,${GG}/target/giraph-google-1.0.jar,${GL}/netty-3.6.2.Final.jar,${GL}/netty-all-4.0.14.Final.jar,${GL}/typetools-0.2.1.jar,${GL}/metrics-core-2.2.0.jar,${GL}/metrics-core-3.0.0.jar,${GL}/json-20090211.jar,${GL}/fastutil-6.5.4.jar,${GL}/base64-2.3.8.jar

# remove output file first
hadoop fs -rmdir /tmp/graph/giraph-google/outcounts

# submit a job via YARN
hadoop jar ${GG}/target/giraph-google-1.0.jar org.apache.giraph.GiraphRunner -libjars ${LIBJARS} com.ibm.ets.giraph.GoogleInDegreeCount -eif org.apache.giraph.io.formats.IntNullTextEdgeInputFormat -eip /tmp/graph/giraph-google/web-Google.txt -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat -op /tmp/graph/giraph-google/outcounts -w 1 -ca giraph.SplitMasterWorker=false


Note now that in the "hadoop jar" step at the end we use an edge input format instead of a vertex input format:

-eif org.apache.giraph.io.formats.IntNullTextEdgeInputFormat

The IntNull part here means that each end of an edge is represented by a single integer with no edge value.

The edge data is specific using an edge input path

 -eip /tmp/graph/giraph-google/web-Google.txt

We still create a vertex output format which lists each vertex alongside its degree count. 

1 comment:

Anonymous said...

Thank you so much Simon, I feel fortunate enough to land on this blog post, saved crucial hours, thanks very much!