Mahout Lucene document clustering howto?

https://stackoverflow.com/questions/1846060

12-09-2019
Question

I have read that Mahout vectors can be created from a Lucene index and then used as input to the Mahout clustering algorithms: http://cwiki.apache.org/confluence/display/mahout/creating+vectors+from+text

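I would like to apply the K-Means clustering algorithm to the documents in my Lucene index, but it is not clear to me how to apply this algorithm (or hierarchical clustering) to extract meaningful clusters from these documents.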

The page http://cwiki.apache.org/confluence/display/mahout/k-means says that the algorithm accepts two input directories: one for the data points and one for the initial clusters. Are my data points the documents? How can I "declare" that these are my documents (or their vectors)?

Sorry in advance for my poor grammar.

Thank you.

Solution

If you already have the vectors, you can run KMeansDriver. Here is its help output:

Usage:
 [--input <input> --clusters <clusters> --output <output> --distance <distance>
--convergence <convergence> --max <max> --numReduce <numReduce> --k <k>
--vectorClass <vectorClass> --overwrite --help]
Options
  --input (-i) input                The Path for input Vectors. Must be a
                                    SequenceFile of Writable, Vector
  --clusters (-c) clusters          The input centroids, as Vectors.  Must be a
                                    SequenceFile of Writable, Cluster/Canopy.
                                    If k is also specified, then a random set
                                    of vectors will be selected and written out
                                    to this path first
  --output (-o) output              The Path to put the output in
  --distance (-m) distance          The Distance Measure to use.  Default is
                                    SquaredEuclidean
  --convergence (-d) convergence    The threshold below which the clusters are
                                    considered to be converged.  Default is 0.5
  --max (-x) max                    The maximum number of iterations to
                                    perform.  Default is 20
  --numReduce (-r) numReduce        The number of reduce tasks
  --k (-k) k                        The k in k-Means.  If specified, then a
                                    random selection of k Vectors will be
                                    chosen as the Centroid and written to the
                                    clusters output path.
  --vectorClass (-v) vectorClass    The Vector implementation class name.
                                    Default is SparseVector.class
  --overwrite (-w)                  If set, overwrite the output directory
  --help (-h)                       Print out help
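
To make the options above concrete, here is a minimal sketch in plain Python (not Mahout code) of what the data points are: each document becomes one term-count vector, and k-means groups those vectors. The sample documents, the `kmeans` function and its parameter names are assumptions made for illustration. The defaults only mirror the help text (squared Euclidean distance, convergence threshold 0.5, at most 20 iterations, k random vectors as initial centroids), and the convergence test here is a simplification.

```python
import random

def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vector, centroids):
    """Index of the centroid closest to the vector."""
    return min(range(len(centroids)),
               key=lambda c: squared_euclidean(vector, centroids[c]))

def kmeans(vectors, k, convergence=0.5, max_iterations=20, seed=0):
    """Illustrative k-means; the defaults echo the help text, not Mahout internals."""
    rng = random.Random(seed)
    # Like --k: pick k random input vectors as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(max_iterations):
        assignments = [nearest(v, centroids) for v in vectors]
        largest_move = 0.0
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if not members:
                continue  # keep an empty cluster's centroid where it is
            updated = [sum(column) / len(members) for column in zip(*members)]
            largest_move = max(largest_move,
                               squared_euclidean(centroids[c], updated))
            centroids[c] = updated
        if largest_move < convergence:
            break
    # Assign every vector to its closest final centroid.
    return centroids, [nearest(v, centroids) for v in vectors]

# Hypothetical documents standing in for the contents of a Lucene index.
docs = [
    "lucene index search",
    "lucene search query",
    "hadoop cluster job",
    "hadoop job reduce",
]
vocabulary = sorted({term for doc in docs for term in doc.split()})
# One term-count vector per document: these are the "data points".
vectors = [[doc.split().count(term) for term in vocabulary] for doc in docs]

centroids, assignments = kmeans(vectors, k=2)
```

In this toy run the two Lucene-related documents end up in one cluster and the two Hadoop-related documents in the other; which cluster gets which index depends on the random initial centroids.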

Update: Copy the results directory from HDFS to the local file system. Then use the ClusterDumper utility to get the clusters and the list of documents in each cluster.
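
The kind of listing you are after at this step, the documents belonging to each cluster, can be sketched as follows. This is plain Python for illustration only and does not reproduce ClusterDumper's actual output format. The document ids, the cluster assignments and the `documents_by_cluster` helper are all hypothetical.

```python
from collections import defaultdict

def documents_by_cluster(doc_ids, assignments):
    """Group document ids by the cluster each one was assigned to."""
    clusters = defaultdict(list)
    for doc_id, cluster_id in zip(doc_ids, assignments):
        clusters[cluster_id].append(doc_id)
    return dict(clusters)

# Hypothetical ids and cluster assignments, one entry per document.
doc_ids = ["doc-a", "doc-b", "doc-c", "doc-d"]
assignments = [0, 0, 1, 1]

listing = documents_by_cluster(doc_ids, assignments)
for cluster_id, members in sorted(listing.items()):
    print(cluster_id, members)
```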

Other tips

There is a pretty good howto here: Integrating Apache Mahout with Apache Lucene

@maiky You can read more about reading the output and using the ClusterDump utility on this page -> https://cwiki.apache.org/confluence/display/mahout/cluster+dumper

License: CC-BY-SA with attribution
