Word boundary detection in text

https://stackoverflow.com/questions/3640743

30-09-2019

Question

I have a problem with word boundary identification. I have removed all the markup from a Wikipedia document, and now I want to get a list of entities (meaningful terms). I plan to take the bi-grams and tri-grams of the document and check whether they exist in a dictionary (WordNet). Is there a better way to achieve this?

Below is the sample text. I want to identify the entities (shown surrounded by double quotes):

Vulcans are a humanoid species in the fictional "Star Trek" universe who evolved on the planet Vulcan and are noted for their attempt to live by reason and logic with no interference from emotion They were the first extraterrestrial species officially to make first contact with Humans and later became one of the founding members of the "United Federation of Planets"
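
For illustration, the bi-gram/tri-gram plan described above could look roughly like the sketch below. It rests on assumptions: the `dictionary` set is a hypothetical stand-in for a real WordNet lookup, the class and method names are invented for this example, and the text is assumed to be split on whitespace with no quote marks or punctuation attached to the words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

class NgramLookup {

    // Return every bi-gram and tri-gram of the text that occurs in the dictionary.
    // The dictionary is a hypothetical stand-in for a WordNet lookup.
    public static List<String> candidates(String text, Set<String> dictionary) {
        String[] words = text.trim().split("\\s+");
        List<String> found = new ArrayList<>();
        for (int n = 2; n <= 3; n++) {                      // n-gram sizes to try
            for (int i = 0; i + n <= words.length; i++) {   // every window of n words
                String gram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                if (dictionary.contains(gram)) {
                    found.add(gram);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> dictionary =
            Set.of("Star Trek", "United Federation of Planets", "first contact");
        String text = "Vulcans are a humanoid species in the fictional Star Trek universe "
                    + "and were the first to make first contact with Humans";
        System.out.println(candidates(text, dictionary)); // prints [Star Trek, first contact]
    }
}
```

Note that a four-word entity such as "United Federation of Planets" can never be found with bi-grams and tri-grams alone, which is one limitation of a fixed n-gram size.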

Solution

I think what you are talking about is really still a subject of burgeoning research rather than a simple matter of applying well-established algorithms.

I can't give you a simple "do this" answer, but here are some pointers off the top of my head:

- I think using WordNet could work (not sure where bigrams/trigrams come into it, though), but you should see WordNet lookups as part of a hybrid system, not the be-all and end-all of spotting named entities;
- then, start by applying some simple, common-sense criteria (sequences of capitalised words; try to accommodate frequent lower-case function words like 'of' in these; sequences consisting of a "known title" plus capitalised word(s));
- look for sequences of words that, statistically, you would not expect to appear next to one another by chance, as candidates for entities;
- can you build in a dynamic web lookup? (Your system spots the capitalised sequence "IBM" and checks whether it finds, for example, a Wikipedia entry with the text pattern "IBM is ... [organisation | company | ...]".)
- see if anything here, and in the "information extraction" literature in general, gives you some ideas: http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html
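
As a rough illustration of the "sequences of capitalised words" pointer, here is a minimal sketch. Everything in it is an assumption made for the example: the class and method names, the tiny `FUNCTION_WORDS` list, and the decision to report only sequences of two or more words (so that sentence-initial words are not reported on their own).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

class CapitalisedSequences {

    // Hypothetical, deliberately tiny list of function words allowed inside an entity.
    private static final Set<String> FUNCTION_WORDS = Set.of("of", "the", "and");

    static boolean isCapitalised(String word) {
        return !word.isEmpty() && Character.isUpperCase(word.charAt(0));
    }

    // Collect runs of capitalised words; function words may appear inside a run,
    // but a run always starts and ends with a capitalised word.
    public static List<String> candidates(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            if (!isCapitalised(words[i])) {
                i++;
                continue;
            }
            int end = i;                          // last capitalised word of the run so far
            int j = i + 1;
            while (j < words.length) {
                if (isCapitalised(words[j])) {
                    end = j;                      // extend the run
                    j++;
                } else if (FUNCTION_WORDS.contains(words[j])) {
                    j++;                          // tentatively skip; kept only if a capitalised word follows
                } else {
                    break;
                }
            }
            if (end > i) {                        // keep multi-word runs only
                result.add(String.join(" ", Arrays.copyOfRange(words, i, end + 1)));
            }
            i = end + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "the fictional Star Trek universe and the founding members "
                    + "of the United Federation of Planets";
        System.out.println(candidates(text)); // prints [Star Trek, United Federation of Planets]
    }
}
```

A heuristic like this will miss single-word entities such as "Vulcan" and will be fooled by capitalised sentence starts, which is why it is better treated as one signal in a hybrid system than as a complete solution.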

The truth is that if you look at what literature there is out there, it doesn't seem like people are using terribly sophisticated, well-established algorithms. So I think there is a lot of room for looking at your data, exploring, and seeing what you can come up with. Good luck!
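
The pointer about word sequences that would not be expected to appear together by chance does not name a particular statistic. One common choice is pointwise mutual information (PMI), sketched below purely as an example; the class and method names are invented for illustration, and the probabilities are plain relative frequencies from the input words.

```java
import java.util.HashMap;
import java.util.Map;

class BigramPmi {

    // PMI(a, b) = log( P(a b) / (P(a) * P(b)) ), estimated from raw counts.
    // Words are assumed to contain no spaces (e.g. produced by splitting on whitespace).
    public static Map<String, Double> pmi(String[] words) {
        Map<String, Integer> unigrams = new HashMap<>();
        Map<String, Integer> bigrams = new HashMap<>();
        for (int i = 0; i < words.length; i++) {
            unigrams.merge(words[i], 1, Integer::sum);
            if (i + 1 < words.length) {
                bigrams.merge(words[i] + " " + words[i + 1], 1, Integer::sum);
            }
        }
        double wordTotal = words.length;          // number of unigram positions
        double bigramTotal = words.length - 1;    // number of bigram positions
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> entry : bigrams.entrySet()) {
            String[] pair = entry.getKey().split(" ");
            double pAB = entry.getValue() / bigramTotal;
            double pA = unigrams.get(pair[0]) / wordTotal;
            double pB = unigrams.get(pair[1]) / wordTotal;
            scores.put(entry.getKey(), Math.log(pAB / (pA * pB)));
        }
        return scores;
    }
}
```

Calling `BigramPmi.pmi(text.split("\\s+"))` returns a score per bigram; higher scores mean the two words occur together more often than their individual frequencies would suggest. PMI tends to overrate pairs that occur only once, so a minimum frequency threshold is usually applied alongside it.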

Other tips

If I understand correctly, you are looking to extract substrings delimited by double-quote characters ("). You could use capture groups in regular expressions:

    // Needs java.util.regex.Matcher and java.util.regex.Pattern
    String text = "Vulcans are a humanoid species in the fictional \"Star Trek\"" +
        " universe who evolved on the planet Vulcan and are noted for their " +
        "attempt to live by reason and logic with no interference from emotion" +
        " They were the first extraterrestrial species officially to make first" +
        " contact with Humans and later became one of the founding members of the" +
        " \"United Federation of Planets\"";
    String[] entities = new String[10];                 // An array to hold matched substrings (fixed size for this example)
    Pattern pattern = Pattern.compile("\"(.*?)\"");     // Lazily capture the text between two " characters
    Matcher matcher = pattern.matcher(text);            // The matcher that runs the regex over our text
    int count       = 0;                                // An index for the array of matches
    while (count < entities.length && matcher.find()) { // find() resumes after the previous match and returns false when none is left
        entities[count++] = matcher.group(1);           // Add the match to the array, without its " marks
    }

Or write a "parser" using String functions:

    int startFrom = text.indexOf('"');                                    // The index position of the first " character
    int nextQuote = text.indexOf('"', startFrom + 1);                     // The index position of the next " character
    int count = 0;                                                        // An index for the array of matches
    while (startFrom > -1 && nextQuote > -1 && count < entities.length) { // Loop while a complete pair of " characters remains
        entities[count++] = text.substring(startFrom + 1, nextQuote);     // Retrieve the substring between the quotes and add it to the array
        startFrom = text.indexOf('"', nextQuote + 1);                     // Find the next opening " character after nextQuote
        nextQuote = text.indexOf('"', startFrom + 1);                     // Find its closing " character
    }

In both cases the sample text is hard-coded for the purposes of the example, and the same variables (the String named text and the entities array) are assumed to be present.

If you want to test the contents of the entities array:

    int i = 0;
    while (i < count) {
        System.out.println(entities[i]);
        i++;
    }

I must warn you that there may be problems with edge cases (i.e. when a " character is at the very beginning or end of the string). These examples will not work as expected if the parity of the " characters is uneven (i.e. if there is an odd number of " characters in the text). You could use a simple parity check beforehand:

    static int countQuoteChars(String text) {
        int nextQuote = text.indexOf('"');              // Find the first " character
        int count = 0;                                  // A counter for " characters found
        while (nextQuote != -1) {                       // While there is another " character ahead
            count++;                                    // Increase the count by 1
            nextQuote = text.indexOf('"', nextQuote+1); // Find the next " character
        }
        return count;                                   // Return the result
    }

    static boolean quoteCharacterParity(int numQuotes) {
        return numQuotes % 2 == 0; // true if the number of " characters is even, false otherwise
    }

Note that if numQuotes happens to be 0, this method still returns true (0 modulo any number is 0, so (numQuotes % 2 == 0) is true). You probably would not want to go ahead with the parsing when there are no " characters at all, so you would want to check for that condition somewhere.
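
One possible way to fold the zero check into the parity check is sketched below. The class name and the `hasBalancedQuotes` method are hypothetical names introduced only for this example.

```java
class QuoteCheck {

    // Count the " characters in the text.
    public static int countQuoteChars(String text) {
        int count = 0;
        int nextQuote = text.indexOf('"');
        while (nextQuote != -1) {
            count++;
            nextQuote = text.indexOf('"', nextQuote + 1);
        }
        return count;
    }

    // True only if there is at least one " character and the total number is even.
    public static boolean hasBalancedQuotes(String text) {
        int numQuotes = countQuoteChars(text);
        return numQuotes > 0 && numQuotes % 2 == 0;
    }
}
```

You could then guard the extraction with `if (QuoteCheck.hasBalancedQuotes(text)) { ... }` and skip or report texts that fail the check.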

Hope this helps!

Someone else asked a similar question about

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow