How to do document clustering with Mahout and Lucene?
12-09-2019
Question
I read that I can create Mahout vectors from a Lucene index, and that those vectors can then be fed to the Mahout clustering algorithms. http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
I want to apply the k-means clustering algorithm to the documents in a Lucene index, but it is not clear to me how to apply that algorithm (or hierarchical clustering) to extract meaningful clusters from those documents.
This page http://cwiki.apache.org/confluence/display/MAHOUT/k-Means says the algorithm takes two input directories: one for the data points and one for the initial clusters. Are my data points the documents? How do I "declare" that these are my documents (or their vectors) so that they simply get clustered?
Apologies in advance for my poor grammar.
Thanks
Solution
If you have the vectors, you can run KMeansDriver. Its help output is below, followed by an example invocation.
Usage:
[--input <input> --clusters <clusters> --output <output> --distance <distance>
--convergence <convergence> --max <max> --numReduce <numReduce> --k <k>
--vectorClass <vectorClass> --overwrite --help]
Options
--input (-i) input The Path for input Vectors. Must be a
SequenceFile of Writable, Vector
--clusters (-c) clusters The input centroids, as Vectors. Must be a
SequenceFile of Writable, Cluster/Canopy.
If k is also specified, then a random set
of vectors will be selected and written out
to this path first
--output (-o) output The Path to put the output in
--distance (-m) distance The Distance Measure to use. Default is
SquaredEuclidean
--convergence (-d) convergence The threshold below which the clusters are
considered to be converged. Default is 0.5
--max (-x) max The maximum number of iterations to
perform. Default is 20
--numReduce (-r) numReduce The number of reduce tasks
--k (-k) k The k in k-Means. If specified, then a
random selection of k Vectors will be
chosen as the Centroid and written to the
clusters output path.
--vectorClass (-v) vectorClass The Vector implementation class name.
Default is SparseVector.class
--overwrite (-w) If set, overwrite the output directory
--help (-h) Print out help
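For example, assuming the document vectors already sit on HDFS as a SequenceFile of Writable, Vector under a directory such as vectors/ (that path, the output paths, the jar name and the value of k are all placeholders, and the exact launch command differs between Mahout releases), an invocation with the options above might look roughly like this:

# Hypothetical sketch, not taken from the original answer; the jar name
# varies by Mahout release.
# --input must point at the SequenceFile of vectors created from the index;
# because --k is given, 20 random vectors are first written to --clusters
# as the initial centroids, then the iterations run.
$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/mahout-core-job.jar \
  org.apache.mahout.clustering.kmeans.KMeansDriver \
  --input vectors/ \
  --clusters initial-clusters/ \
  --output kmeans-output/ \
  --k 20 \
  --max 10 \
  --convergence 0.5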
Update: Copy the result directory from HDFS to the local FS. Then use the ClusterDumper utility to get the clusters and the list of documents in each cluster.
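For instance, a rough sketch of that step (the clusterdump flag names and the exact output directory names vary between Mahout releases, and the paths below are placeholders continuing the example above, so check clusterdump's own help or the Cluster Dumper wiki page linked in the tips):

# Hypothetical sketch; adjust flags and paths to your Mahout version.
# --seqFileDir: the clusters written by the k-means run
# --pointsDir:  the points directory mapping each document vector to its cluster
# --dictionary: the dictionary file produced when the vectors were created,
#               so the dumped centroids show terms instead of feature indices
$MAHOUT_HOME/bin/mahout clusterdump \
  --seqFileDir kmeans-output/clusters-10 \
  --pointsDir kmeans-output/points \
  --dictionary dictionary.txt \
  --output clusters-dump.txt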
Other tips
A very good approach is described here: Integrating Apache Mahout with Apache Lucene (a rough sketch of that vector-creation step follows below).
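As an illustration only, and assuming a Mahout release that ships the lucene.vector driver (field names and paths are placeholders; the authoritative options are on the "Creating Vectors from Text" page linked in the question, and the Lucene field you read from must have been indexed with term vectors):

# Hypothetical sketch of turning a Lucene index into Mahout vectors.
# --field is the indexed text field to vectorize, --idField identifies each
# document, --dictOut receives the term dictionary used later by clusterdump,
# and --output is the SequenceFile of vectors fed to KMeansDriver above.
$MAHOUT_HOME/bin/mahout lucene.vector \
  --dir /path/to/lucene/index \
  --field body \
  --idField id \
  --dictOut dictionary.txt \
  --output vectors/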
@maiky you can read more about reading the output and using the clusterdump tool on this page -> https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper