セット内のパターンを見つける

https://stackoverflow.com/questions/270541

06-07-2019
|

質問

一連の文字列の共通文字を決定するために使用できるアルゴリズムは何ですか？

例を簡単にするために、1行に2文字以上の文字があり、それが2つ以上のサンプルに表示される場合にのみ気にします。例えば：

0000abcde0000
0000abcd00000
000abc0000000
00abc000de000

知りたい：

00は1,2,3,4
で使用されました 000は1,2,3,4
で使用されました 0000は1,2,3で使用されていました
00000は2,3で使用されていました
abは1,2,3,4
で使用されました abcは1,2,3,4
で使用されました abcdは1,2で使用されていました
bcは1,2,3,4
で使用されました bcdは1,2で使用されていました
cdは1,2で使用されていました
deは1,4で使用されました

解決

これは宿題ではないと仮定しています。（もしそうなら、あなたはあなた自身の盗作の1つです！;-）

以下は、手っ取り早い解決策です。時間の複雑さは O（m ** 2 * n）です。ここで、 m は文字列の平均長、 n は配列のサイズです文字列。

Occurrence のインスタンスは、指定された文字列を含むインデックスのセットを保持します。 commonOccurrences ルーチンは文字列配列をスキャンし、null以外の文字列ごとに captureOccurrences を呼び出します。 captureOccurrences ルーチンは、指定された文字列の可能なサブストリングごとに、現在のインデックスを Occurrence に入れます。最後に、 commonOccurrences は、少なくとも2つのインデックスを持つ Occurrences のみを選択して結果セットを形成します。

サンプルデータには、質問で特定したものよりも多くの一般的な部分文字列があることに注意してください。たとえば、各入力文字列には＆quot; 00ab＆quot; が含まれます。コンテンツ（すべての数字、すべてのアルファベットなど）に基づいて興味深い文字列を選択するための追加のフィルターは、彼らが言うように、読者のための演習として残されています。 ;-）

高速でダーティーなJavaソース：

package com.stackoverflow.answers;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class CommonSubstringFinder {

    public static final int MINIMUM_SUBSTRING_LENGTH = 2;

    public static class Occurrence implements Comparable<Occurrence> {
        private final String value;
        private final Set<Integer> indices;
        public Occurrence(String value) {
            this.value = value == null ? "" : value;
            indices = new TreeSet<Integer>();
        }
        public String getValue() {
            return value;
        }
        public Set<Integer> getIndices() {
            return Collections.unmodifiableSet(indices);
        }
        public void occur(int index) {
            indices.add(index);
        }
        public String toString() {
            StringBuilder result = new StringBuilder();
            result.append('"').append(value).append('"');
            String separator = ": ";
            for (Integer i : indices) {
                result.append(separator).append(i);
                separator = ",";
            }
            return result.toString();
        }
        public int compareTo(Occurrence that) {
            return this.value.compareTo(that.value);
        }
    }

    public static Set<Occurrence> commonOccurrences(String[] strings) {
        Map<String,Occurrence> work = new HashMap<String,Occurrence>();
        if (strings != null) {
            int index = 0;
            for (String string : strings) {
                if (string != null) {
                    captureOccurrences(index, work, string);
                }
                ++index;
            }
        }
        Set<Occurrence> result = new TreeSet<Occurrence>();
        for (Occurrence occurrence : work.values()) {
            if (occurrence.indices.size() > 1) {
                result.add(occurrence);
            }
        }
        return result;
    }

    private static void captureOccurrences(int index, Map<String,Occurrence> work, String string) {
        final int maxLength = string.length();
        for (int i = 0; i < maxLength; ++i) {
            for (int j = i + MINIMUM_SUBSTRING_LENGTH; j < maxLength; ++j) {
                String partial = string.substring(i, j);
                Occurrence current = work.get(partial);
                if (current == null) {
                    current = new Occurrence(partial);
                    work.put(partial, current);
                }
                current.occur(index);
            }
        }
    }

    private static final String[] TEST_DATA = {
        "0000abcde0000",
        "0000abcd00000",
        "000abc0000000",
        "00abc000de000",
    };
    public static void main(String[] args) {
        Set<Occurrence> found = commonOccurrences(TEST_DATA);
        for (Occurrence occurrence : found) {
            System.out.println(occurrence);
        }
    }

}

サンプル出力：（実際には行ごとに1つのオカレンスしかなかったことに注意してください。blockquoteマークアップが行をマージするのを防ぐことはできないようです）

＆quot; 00＆quot ;: 0,1,2,3   ＆quot; 000＆quot ;: 0,1,2,3
  ＆quot; 0000＆quot ;: 0,1,2   ＆quot; 0000a＆quot ;: 0,1
  ＆quot; 0000ab＆quot ;: 0,1   ＆quot; 0000abc＆quot ;: 0,1
  ＆quot; 0000abcd＆quot ;: 0,1   ＆quot; 000a＆quot ;: 0,1,2
  ＆quot; 000ab＆quot ;: 0,1,2   ＆quot; 000abc＆quot ;: 0,1,2
  ＆quot; 000abcd＆quot ;: 0,1   ＆quot; 00a＆quot ;: 0,1,2,3
  ＆quot; 00ab＆quot ;: 0,1,2,3   ＆quot; 00abc＆quot ;: 0,1,2,3
  ＆quot; 00abc0＆quot ;: 2,3   ＆quot; 00abc00＆quot ;: 2,3
  ＆quot; 00abc000＆quot ;: 2,3   ＆quot; 00abcd＆quot ;: 0,1
  ＆quot; 0a＆quot ;: 0,1,2,3   ＆quot; 0ab＆quot ;: 0,1,2,3
  ＆quot; 0abc＆quot ;: 0,1,2,3   ＆quot; 0abc0＆quot ;: 2,3
  ＆quot; 0abc00＆quot ;: 2,3   ＆quot; 0abc000＆quot ;: 2,3
  ＆quot; 0abcd＆quot ;: 0,1   ＆quot; ab＆quot ;: 0,1,2,3   ＆quot; abc＆quot ;: 0,1,2,3   ＆quot; abc0＆quot ;: 2,3   ＆quot; abc00＆quot ;: 2,3
  ＆quot; abc000＆quot ;: 2,3   ＆quot; abcd＆quot ;: 0,1   ＆quot; bc＆quot ;: 0,1,2,3   ＆quot; bc0＆quot ;: 2,3   ＆quot; bc00＆quot ;: 2,3
  ＆quot; bc000＆quot ;: 2,3   ＆quot; bcd＆quot ;: 0,1   ＆quot; c0＆quot ;: 2,3   ＆quot; c00＆quot ;: 2,3   ＆quot; c000＆quot ;: 2,3   ＆quot; cd＆quot ;: 0,1
  ＆quot; de＆quot ;: 0,3   ＆quot; de0＆quot ;: 0,3   ＆quot; de00＆quot ;: 0,3
  ＆quot; e0＆quot ;: 0,3   ＆quot; e00＆quot ;: 0,3

他のヒント

これはおそらくNP困難な問題です。これは、複数の配列アライメントに似ています。基本的に、多次元 Smith-Waterman （=ローカルシーケンスアラインメント）をニーズに合わせて調整できます。。ただし、より効率的なアルゴリズムがあるかもしれません。

ツリーのパスが文字シーケンスであるツリーを構築します。各ノードに「セット」を含める文字列参照が通過時に追加されます（または単にカウントを保持します）。次に、単語内のN個の場所を追跡します。Nは、最も長いシーケンスです（たとえば、各文字で新しいハンドルを開始し、各ステップですべてのハンドルを下に移動し、Nステップ後に各ハンドルを中止します）。

これは、小さく、有限で、高密度のアルファベットでうまく機能します（DNAが最初に使用することを考えた場所でした）。

編集：気になるパターンが事前にわかっている場合は、事前にツリーを構築してから、実際にツリーにいるかどうかを確認するだけで、上記の動作を変更できます。拡張するよりも。

例

入力

abc
abd
abde
acc
bde

ツリー

a : 4
  b : 3
    c : 1
    d : 2
      e : 1
  c : 1
    c : 1
b : 4
  d : 3
    e : 2
  c : 1
c : 3
  c : 1
d : 3
  e : 2

「接尾辞ツリー」を検索します。ウェブ上で。または、「文字列、ツリー、およびシーケンスのアルゴリズム」を選択します。ダン・ガスフィールド。確認する本はありませんが、サフィックスツリーのウィキペディアページにはそのページ205には、「セット内の少なくともk個の文字列に共通する最も長い部分文字列を見つける」という問題の解決策が含まれています。

「値」を知っていますか事前に検索する必要がありますか？または、文字列を解析し、投稿したような統計情報を提供するコードが必要ですか？

前もって探しているものがわかっている場合、ボイヤー・ムーアアルゴリズムを使用すると、部分文字列が存在するかどうか（さらにはそれらを見つけることも）を知ることができます。

距離行列の分析を使用できます。斜めの動き（コストの変更なし）は完全に一致します。

接尾辞配列は、接尾辞ツリーよりも単純で効率的です。データ内の共通部分文字列の頻度-それらが十分に共通している場合、より洗練された接尾辞配列構築アルゴリズムが必要になります。（単純な方法は、ライブラリのソート関数を使用することです。）

ライセンス： CC-BY-SA と帰属

所属していません StackOverflow