Can't read mahout output of PFPGrowth

https://stackoverflow.com/questions/10026319

29-05-2021

Question

I am successfully running the Parallel FP-Growth algorithm from Apache Mahout on top of Hadoop, but the generated output files are not readable as text, as you can see below:

SEQorg.apache.hadoop.io.TextDorg.apache.mahout.fpm.pfpgrowth.convertors.string.TopKStringPatterns��3G9��y'��e��1��2��1��t�5�1��t�4�1��1�4227��3�1��1�3476��t�1�1340��h�1�5795��N�1�2701��K�1�3610��@�1�2106�� ...

Running RecommenderJob and ItemSimilarityJob with the same input file generates correct output files.

Any ideas?

Solution

These output files are sequence files, not text files. They contain key/value pairs of type <Text, TopKStringPatterns>.
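The dump in the question already shows this: it starts with the magic bytes SEQ, followed by the key and value class names. As a minimal sketch (not a replacement for Hadoop's own reader), the header can be inspected on a local copy of a part file without Hadoop or Mahout on the classpath. It assumes the usual header layout (SEQ, one version byte, then each class name as a length-prefixed UTF-8 string) and class names shorter than 128 bytes. The class name SeqHeaderPeek is hypothetical, and the sketch does not decode the records themselves.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: reads only the leading fields of a SequenceFile header. */
public class SeqHeaderPeek {

    /** The header fields this sketch extracts. */
    public static final class Header {
        public final int version;
        public final String keyClass;
        public final String valueClass;

        Header(int version, String keyClass, String valueClass) {
            this.version = version;
            this.keyClass = keyClass;
            this.valueClass = valueClass;
        }
    }

    /** Parses the magic bytes, the version byte, and the two class names. */
    public static Header parse(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        byte[] magic = new byte[3];
        in.readFully(magic);
        if (magic[0] != 'S' || magic[1] != 'E' || magic[2] != 'Q') {
            throw new IOException("Not a SequenceFile: missing SEQ magic");
        }
        int version = in.readUnsignedByte();
        String keyClass = readShortString(in);
        String valueClass = readShortString(in);
        return new Header(version, keyClass, valueClass);
    }

    // Assumption: the length prefix fits in a single byte (0..127). Longer
    // names use a multi-byte length encoding that this sketch does not handle.
    private static String readShortString(DataInputStream in) throws IOException {
        int length = in.readByte();
        if (length < 0) {
            throw new IOException("Multi-byte length prefix not supported by this sketch");
        }
        byte[] buf = new byte[length];
        in.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SeqHeaderPeek <local-part-file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            Header h = parse(in);
            System.out.println("version:     " + h.version);
            System.out.println("key class:   " + h.keyClass);
            System.out.println("value class: " + h.valueClass);
        }
    }
}
```

This also explains the stray "D" between the two class names in the question's dump: it is the length byte (68) of the value class name, printed as an ASCII character.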

You can get Hadoop to read the sequence files and print a textual version of these objects using the fs shell's -text command, combined with -libjars to supply the Mahout jar that contains the value class:

hadoop fs -libjars /path/to/mahout/lib.jar -text /path/to/hdfs/output/part*

If you want these files to be written as text rather than as sequence files, you'll need to amend the driver that runs the job and change it to use TextOutputFormat instead of SequenceFileOutputFormat:

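// With the new-API org.apache.hadoop.mapreduce.Job the setter is setOutputFormatClass;
// the old-API JobConf equivalent is setOutputFormat.
// job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);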

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow