Pregunta

Soy relativamente nuevo en Giraph y estoy intentando que mi bucle de edición, compilación e implementación de Giraph funcione para nuestro código.Puedo ejecutar varios ejemplos inspirados en http://blog.cloudera.com/blog/2014/02/how-to-write-and-run-giraph-jobs-on-hadoop/ , pero estoy atascado con una ClassNotFoundException cuando ejecuto mi versión modificada del ejemplo SimpleShortestPathsVertex Giraph.Probé varias combinaciones de -libjars y HADOOP_CLASSPATH, pero no tengo ideas y realmente agradecería su ayuda.Los detalles siguen.

Versiones

  • Hadoop:Hadoop 2.0.0-cdh4.4.0
  • Girafa:ejemplos-giraph-1.0.0-para-hadoop-2.0.0-alpha-jar-con-dependencias.jar

El PageRankBenchmark funciona correctamente

$ hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.0.0-for-hadoop-2.0.0-alpha-jar-with-dependencies.jar \
org.apache.giraph.benchmark.PageRankBenchmark \
-Dgiraph.zkList=<myhost>:2181 \
-e 1 -s 3 -v -V 50 -w 1

...
14/08/01 11:42:44 INFO mapred.JobClient: Job complete: job_201407291058_0015
...
(full output is below)

GiraphRunner SimpleShortestPathsVertex también se ejecuta correctamente

$ hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.0.0-for-hadoop-2.0.0-alpha-jar-with-dependencies.jar \
org.apache.giraph.GiraphRunner \
-Dgiraph.zkList=<myhost>:2181 \
org.apache.giraph.examples.SimpleShortestPathsVertex \
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
-vip ginput/tiny_graph.txt \
-of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
-op goutput/shortestpathsC2 \
-ca SimpleShortestPathsVertex.source=2 \
-w 1

...
14/08/01 11:47:46 INFO mapred.JobClient: Job complete: job_201407291058_0017
...
(full output is below)

Prima:los resultados son correctos:

$ hadoop fs -cat goutput/shortestpathsC2/p*
0   1.0
2   2.0
1   0.0
3   1.0
4   5.0

Pero mi versión modificada de SimpleShortestPathsVertex obtiene ClassNotFoundException

El frasco que contiene el vértice modificado (KdlSimpleShortestPathsVertex, sin paquete) está bien:

$ jar -tf ~/kdl_hadoop_play.jar
META-INF/MANIFEST.MF
KdlSimpleShortestPathsVertex.class
META-INF/

Pero mi carrera vomita:

$ hadoop jar $GIRAPH_HOME/giraph-core/target/giraph-1.0.0-for-hadoop-2.0.0-alpha-jar-with-dependencies.jar \
org.apache.giraph.GiraphRunner \
-Dgiraph.zkList=<myhost>:2181 \
-libjars ~/kdl_hadoop_play.jar \
KdlSimpleShortestPathsVertex \
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
-vip /user/cornell/ginput/tiny_graph.txt \
-of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
-op /user/cornell/goutput/shortestpathsC2 \
-ca KdlSimpleShortestPathsVertex.source=2 \
-w 1

Exception in thread "main" java.lang.ClassNotFoundException: KdlSimpleShortestPathsVertex
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.giraph.utils.ConfigurationUtils.populateGiraphConfiguration(ConfigurationUtils.java:210)
at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:147)
at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:74)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at org.apache.giraph.GiraphRunner.main(GiraphRunner.java:124)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

Mi mejor suposición...

... después de mirar a su alrededor es que tal vez GiraphRunner no esté procesando los -libjars correctamente, como lo insinúa http://grepalex.com/2013/02/25/hadoop-libjars/ ("Asegúrese de que su código utilice GenericOptionsParser").Al navegar por la fuente de Giraph, no veo que se haya accedido a esa clase.Intenté configurar HADOOP_CLASSPATH en mi jar, pero eso no resolvió el problema.

¡Cualquier ayuda sería increíble!

Salida de PageRankBenchmark

14/08/01 11:42:27 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4)
14/08/01 11:42:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/01 11:42:28 WARN bsp.BspOutputFormat: checkOutputSpecs: ImmutableOutputCommiter will not check anything
14/08/01 11:42:29 INFO mapred.JobClient: Running job: job_201407291058_0015
14/08/01 11:42:30 INFO mapred.JobClient:  map 0% reduce 0%
14/08/01 11:42:40 INFO mapred.JobClient:  map 50% reduce 0%
14/08/01 11:42:41 INFO mapred.JobClient:  map 100% reduce 0%
14/08/01 11:42:44 INFO mapred.JobClient: Job complete: job_201407291058_0015
14/08/01 11:42:44 INFO mapred.JobClient: Counters: 39
14/08/01 11:42:44 INFO mapred.JobClient:   File System Counters
14/08/01 11:42:44 INFO mapred.JobClient:     FILE: Number of bytes read=0
14/08/01 11:42:44 INFO mapred.JobClient:     FILE: Number of bytes written=369846
14/08/01 11:42:44 INFO mapred.JobClient:     FILE: Number of read operations=0
14/08/01 11:42:44 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/08/01 11:42:44 INFO mapred.JobClient:     FILE: Number of write operations=0
14/08/01 11:42:44 INFO mapred.JobClient:     HDFS: Number of bytes read=88
14/08/01 11:42:44 INFO mapred.JobClient:     HDFS: Number of bytes written=0
14/08/01 11:42:44 INFO mapred.JobClient:     HDFS: Number of read operations=2
14/08/01 11:42:44 INFO mapred.JobClient:     HDFS: Number of large read operations=0
14/08/01 11:42:44 INFO mapred.JobClient:     HDFS: Number of write operations=1
14/08/01 11:42:44 INFO mapred.JobClient:   Job Counters 
14/08/01 11:42:44 INFO mapred.JobClient:     Launched map tasks=2
14/08/01 11:42:44 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=15772
14/08/01 11:42:44 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
14/08/01 11:42:44 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/08/01 11:42:44 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/08/01 11:42:44 INFO mapred.JobClient:   Map-Reduce Framework
14/08/01 11:42:44 INFO mapred.JobClient:     Map input records=2
14/08/01 11:42:44 INFO mapred.JobClient:     Map output records=0
14/08/01 11:42:44 INFO mapred.JobClient:     Input split bytes=88
14/08/01 11:42:44 INFO mapred.JobClient:     Spilled Records=0
14/08/01 11:42:44 INFO mapred.JobClient:     CPU time spent (ms)=2230
14/08/01 11:42:44 INFO mapred.JobClient:     Physical memory (bytes) snapshot=411357184
14/08/01 11:42:44 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2428895232
14/08/01 11:42:44 INFO mapred.JobClient:     Total committed heap usage (bytes)=806027264
14/08/01 11:42:44 INFO mapred.JobClient:   Giraph Stats
14/08/01 11:42:44 INFO mapred.JobClient:     Aggregate edges=50
14/08/01 11:42:44 INFO mapred.JobClient:     Aggregate finished vertices=50
14/08/01 11:42:44 INFO mapred.JobClient:     Aggregate vertices=50
14/08/01 11:42:44 INFO mapred.JobClient:     Current master task partition=0
14/08/01 11:42:44 INFO mapred.JobClient:     Current workers=1
14/08/01 11:42:44 INFO mapred.JobClient:     Last checkpointed superstep=0
14/08/01 11:42:44 INFO mapred.JobClient:     Sent messages=0
14/08/01 11:42:44 INFO mapred.JobClient:     Superstep=4
14/08/01 11:42:44 INFO mapred.JobClient:   Giraph Timers
14/08/01 11:42:44 INFO mapred.JobClient:     Input superstep (milliseconds)=238
14/08/01 11:42:44 INFO mapred.JobClient:     Setup (milliseconds)=2903
14/08/01 11:42:44 INFO mapred.JobClient:     Shutdown (milliseconds)=68
14/08/01 11:42:44 INFO mapred.JobClient:     Superstep 0 (milliseconds)=77
14/08/01 11:42:44 INFO mapred.JobClient:     Superstep 1 (milliseconds)=64
14/08/01 11:42:44 INFO mapred.JobClient:     Superstep 2 (milliseconds)=45
14/08/01 11:42:44 INFO mapred.JobClient:     Superstep 3 (milliseconds)=43
14/08/01 11:42:44 INFO mapred.JobClient:     Total (milliseconds)=3442

Salida SimpleShortestPathsVertex

14/08/01 11:47:37 INFO utils.ConfigurationUtils: No edge input format specified. Ensure your InputFormat does not require one.
14/08/01 11:47:37 INFO utils.ConfigurationUtils: Setting custom argument [SimpleShortestPathsVertex.source] to [2] in GiraphConfiguration
14/08/01 11:47:37 WARN job.GiraphConfigurationValidator: Output format vertex index type is not known
14/08/01 11:47:37 WARN job.GiraphConfigurationValidator: Output format vertex value type is not known
14/08/01 11:47:37 WARN job.GiraphConfigurationValidator: Output format edge value type is not known
14/08/01 11:47:37 INFO job.GiraphJob: run: Since checkpointing is disabled (default), do not allow any task retries (setting mapred.map.max.attempts = 0, old value = 4)
14/08/01 11:47:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/08/01 11:47:38 INFO mapred.JobClient: Running job: job_201407291058_0017
14/08/01 11:47:39 INFO mapred.JobClient:  map 0% reduce 0%
14/08/01 11:47:44 INFO mapred.JobClient:  map 50% reduce 0%
14/08/01 11:47:45 INFO mapred.JobClient:  map 100% reduce 0%
14/08/01 11:47:46 INFO mapred.JobClient: Job complete: job_201407291058_0017
14/08/01 11:47:46 INFO mapred.JobClient: Counters: 39
14/08/01 11:47:46 INFO mapred.JobClient:   File System Counters
14/08/01 11:47:46 INFO mapred.JobClient:     FILE: Number of bytes read=0
14/08/01 11:47:46 INFO mapred.JobClient:     FILE: Number of bytes written=367068
14/08/01 11:47:46 INFO mapred.JobClient:     FILE: Number of read operations=0
14/08/01 11:47:46 INFO mapred.JobClient:     FILE: Number of large read operations=0
14/08/01 11:47:46 INFO mapred.JobClient:     FILE: Number of write operations=0
14/08/01 11:47:46 INFO mapred.JobClient:     HDFS: Number of bytes read=200
14/08/01 11:47:46 INFO mapred.JobClient:     HDFS: Number of bytes written=30
14/08/01 11:47:46 INFO mapred.JobClient:     HDFS: Number of read operations=5
14/08/01 11:47:46 INFO mapred.JobClient:     HDFS: Number of large read operations=0
14/08/01 11:47:46 INFO mapred.JobClient:     HDFS: Number of write operations=2
14/08/01 11:47:46 INFO mapred.JobClient:   Job Counters 
14/08/01 11:47:46 INFO mapred.JobClient:     Launched map tasks=2
14/08/01 11:47:46 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=8538
14/08/01 11:47:46 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=0
14/08/01 11:47:46 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
14/08/01 11:47:46 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
14/08/01 11:47:46 INFO mapred.JobClient:   Map-Reduce Framework
14/08/01 11:47:46 INFO mapred.JobClient:     Map input records=2
14/08/01 11:47:46 INFO mapred.JobClient:     Map output records=0
14/08/01 11:47:46 INFO mapred.JobClient:     Input split bytes=88
14/08/01 11:47:46 INFO mapred.JobClient:     Spilled Records=0
14/08/01 11:47:46 INFO mapred.JobClient:     CPU time spent (ms)=1590
14/08/01 11:47:46 INFO mapred.JobClient:     Physical memory (bytes) snapshot=341344256
14/08/01 11:47:46 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2363527168
14/08/01 11:47:46 INFO mapred.JobClient:     Total committed heap usage (bytes)=504758272
14/08/01 11:47:46 INFO mapred.JobClient:   Giraph Stats
14/08/01 11:47:46 INFO mapred.JobClient:     Aggregate edges=12
14/08/01 11:47:46 INFO mapred.JobClient:     Aggregate finished vertices=5
14/08/01 11:47:46 INFO mapred.JobClient:     Aggregate vertices=5
14/08/01 11:47:46 INFO mapred.JobClient:     Current master task partition=0
14/08/01 11:47:46 INFO mapred.JobClient:     Current workers=1
14/08/01 11:47:46 INFO mapred.JobClient:     Last checkpointed superstep=0
14/08/01 11:47:46 INFO mapred.JobClient:     Sent messages=0
14/08/01 11:47:46 INFO mapred.JobClient:     Superstep=4
14/08/01 11:47:46 INFO mapred.JobClient:   Giraph Timers
14/08/01 11:47:46 INFO mapred.JobClient:     Input superstep (milliseconds)=181
14/08/01 11:47:46 INFO mapred.JobClient:     Setup (milliseconds)=313
14/08/01 11:47:46 INFO mapred.JobClient:     Shutdown (milliseconds)=128
14/08/01 11:47:46 INFO mapred.JobClient:     Superstep 0 (milliseconds)=57
14/08/01 11:47:46 INFO mapred.JobClient:     Superstep 1 (milliseconds)=54
14/08/01 11:47:46 INFO mapred.JobClient:     Superstep 2 (milliseconds)=36
14/08/01 11:47:46 INFO mapred.JobClient:     Superstep 3 (milliseconds)=35
14/08/01 11:47:46 INFO mapred.JobClient:     Total (milliseconds)=805
¿Fue útil?

Solución

OK, después de mirar los scripts de Hadoop junto con la fuente de Hadoop y Giraph, creo que lo descubrí. La gran pista vino de usando la opción de libjars con Hadoop junto con esta línea desde La salida:

Warn Mapred.JobClient: Use Genericoptionsparser para analizar el argumentos. Las aplicaciones deben implementar la herramienta para el mismo.

La causa parece ser que GIRAPHRUNNER utiliza su propio ConfigurationUtils.parseArgs () para obtener el org.apache.commons.cli.commandline en lugar de usar el Org.apache.hadoop.util.genericoptopionsparser.getcommandline (), que Honra la opción 'libjars'. Esto me llevó a caer en las herramientas de manejo genérico de Hadoop: Classpath y / o Hadoop_ClassPath. Esto es lo que funcionó:

  • Set Hadoop_classpath para incluir su tarro de aplicación y El frasco de núcleo de Gigraph, usando un colon delimitador.
  • pass -libjars usando esa misma capacidad de clase pero con una coma delimitador.

Por ejemplo, en mi máquina:

$ export GIRAPH_HOME=/share/apps/giraph
$ export HADOOP_CLASSPATH=/home/<me>/kdl_hadoop_play.jar:$GIRAPH_HOME/giraph-ex.jar:$HADOOP_CLASSPATH
$ export LIBJARS=/home/<me>/kdl_hadoop_play.jar,$GIRAPH_HOME/giraph-core.jar
$ hadoop fs -rm -R goutput/shortestpathsC2
$ hadoop jar $GIRAPH_HOME/giraph-ex.jar org.apache.giraph.GiraphRunner \
-Dgiraph.zkList=<myhost>:2181 \
-libjars ${LIBJARS} \
KdlSimpleShortestPathsVertex \
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
-vip /user/cornell/ginput/tiny_graph.txt \
-of org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
-op /user/cornell/goutput/shortestpathsC2 \
-ca SimpleShortestPathsVertex.source=2 \
-w 1
...
$ hadoop fs -cat goutput/shortestpathsC2/p*

que da la producción y los resultados esperados.

Más generalmente, sería útil si el equipo de Giraph cambió el código para usar el analizador (aparentemente) más estándar.

espero que ayude!

Otros consejos

No sé por qué esto no funciona, pero hay una forma rápida y sencilla de solucionarlo.Intenta poner tu código giraph-examples/src/main/java/org/apache/giraph/examples/ directorio (donde se encuentra SimpleShortestPath).Y luego construya el frasco de ejemplos de jirafas ejecutando mvn -DskipTests --projects giraph-examples --also-make package.Luego simplemente ejecute su programa como lo hizo con SimpleShortestPath reemplazando SimpleShortestPath por su nombre de archivo.Espero que eso ayude.

Licenciado bajo: CC-BY-SA con atribución
No afiliado a StackOverflow
scroll top