Question

As a side hobby I'm doing some basic metadata gathering using text mining on the Project Gutenberg version of Herodotus, but I'm stuck at the point of transferring the tagged text strings into Excel. Essentially, what I'm trying to create is a master list of all People, Places and Groups/Organizations mentioned in Herodotus and how many times each is mentioned in the text. I then want to use this list to populate some data visualizations in Tableau and/or Power View; I have both.

I've already run the text through the Stanford NER, which did a good job of identifying nearly all Persons, Organizations and Locations. I then manually checked over the document in Notepad++ to fix the numerous errors the NER made when analyzing ancient Greek names and places. I also removed the footnotes because I only care about the original text. If you download the attached .txt you'll see that each proper noun is marked /PERSON, /LOCATION or /ORGANIZATION.

Where I'm stuck is getting the tagged text strings into Excel so I can use the data. A simple Ctrl+F reveals that Book 1 alone has roughly 880 /PERSON-tagged words. Essentially, I want to grab every string that precedes /PERSON, /LOCATION or /ORGANIZATION and copy it into Excel.

I looked into regular expressions for Notepad++ to see if I could select all text strings that end in /PERSON, but I cannot figure it out. I can get the regex to select every "/PERSON", but I don't understand regex well enough to make it select each "name/PERSON" or "place/LOCATION" string in its entirety, if that makes sense.

EDIT: I forgot to ask about using SQL or Python to help me solve this problem. From my work I'm familiar with using SQL queries on databases. So this is a stupid question but can you even use SQL to directly query a .txt file? If so then I could pretty easily write a SQL statement to extract the tagged text strings.

I'm less familiar with Python, but is it possible to extract the info I'm looking for via some Python scripting?

Finally, the question I should have asked originally: am I going about this all wrong? I think using Notepad++ to correct the Stanford NER tags was necessary, but maybe going straight from the tagged .txt to Excel is the wrong approach.

https://www.dropbox.com/s/k5m8yag6tpae05w/HerodotusB1NER.txt

2ND EDIT: I finally got around to playing with the regular expressions both of you provided, and they are working almost perfectly. However, I think they're trimming off part of the result set.

A perfect example is the character "Deïokes", who is being trimmed to just "okes/PERSON" after I run the regex search. I think the a-z part of the regex is ignoring special letters like the umlaut over the i in Deïokes.

How would I tweak the regex search to tolerate those sorts of special characters? If the regex cannot accommodate them, then I think it wouldn't be too manually intensive to go in and fix the special characters where they show up here and there.


Solution 2

I gave this another try and found a much easier way to simply copy the matches to Excel. I don't have Notepad++, but I do use PSPad occasionally when my IDE is not around. It offers pretty much the same features as Notepad++; some things it does better and others it doesn't. The regex search is pretty good, and the search dialogue has a button that says Copy.

Find dialogue

I copied your file and used my regex from the other answer without the capture groups. We don't need them as it will copy the complete match. Remember the \b is a word boundary and not a real character that will be copied.

Copied search results

And voilà, here we go: a list of names with their classification that should be easy enough to copy to Excel and split into columns there.
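If you would rather skip the copy and paste step, the same list can be written straight to a CSV file that Excel opens directly. The following is only a rough Python sketch, not something taken from the steps above: the to_csv helper, the column names and the one-line sample are made up for illustration, and \w+ is used for the name part on the assumption that accented letters are stored as single precomposed characters.

```python
import csv
import io
import re

# A run of word characters, a slash, then an upper-case tag such as PERSON.
TAGGED = re.compile(r"\b(\w+)/([A-Z]+)")

def to_csv(text, out):
    """Write one 'name,type' row per tagged token to the file-like object out."""
    writer = csv.writer(out)
    writer.writerow(["name", "type"])  # hypothetical column headers
    for name, tag in TAGGED.findall(text):
        writer.writerow([name, tag])

# In-memory buffer so the sketch runs without touching the disk.
buffer = io.StringIO()
to_csv("Herodotus/PERSON of Halicarnassos/LOCATION", buffer)
print(buffer.getvalue())
```

For the real file you would read HerodotusB1NER.txt and pass an output file opened with newline="" and encoding="utf-8-sig" instead of the buffer; the BOM written by that encoding usually helps Excel recognise the file as UTF-8.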

Other tips

Even if you manage to search/replace all those names with Notepad++, I don't see how you would copy them over to Excel other than one by one. Since SO is mainly about programming, I'll provide a code solution. This is Perl; if you don't know how it works or how to run it, do not despair. It's probably not your language of choice on Windows anyway, and you can build the same thing in any programming language.

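#!/usr/bin/perl
use strictures;   # strict plus fatal warnings (CPAN module)
use Data::Dump;   # provides dd() for pretty-printing

my $counts;       # hash ref: type => { word => count }

# Read the text after __DATA__ line by line
while (my $row = <DATA>) {
  # /g repeats the match for every word/TYPE pair on the line
  while ($row =~ m{\b(\w+)/([A-Z]+)}g) {
    $counts->{$2}->{$1}++;   # $2 is the type, $1 is the word
  }
}

dd $counts;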
__DATA__
This is the Showing forth of the Inquiry of Herodotus/PERSON of Halicarnassos/LOCATION,

Output for first paragraph:

{
  LOCATION => { Halicarnassos => 1 },
  ORGANIZATION => { Barbarians => 1, Hellenes => 1 },
  PERSON => { Herodotus => 1 },
}

Let's start with the __DATA__ section at the bottom. I've pasted your complete text file there, but omitted it here for practical reasons. Basically it just reads the file line by line in the first while loop. The second while loop applies a regular expression match to each line with the /g modifier, that lets the regex match multiple times. The pattern means:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                             more times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  /                        '/'
--------------------------------------------------------------------------------
  (                        group and capture to \2:
--------------------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \2

The two capture groups (..) end up in the variables $1 and $2. For every word that is found, we increment a counter in our data structure $counts. This is like a COUNT with GROUP BY in SQL. The first key ($2) is the type (PERSON, LOCATION...) and the second key ($1) is the actual word. The ++ operator increments the count by one.

When we are done, we print it using the Data::Dump module's function dd, which gives us a nice output of counts grouped by type.
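For comparison, here is roughly the same counting logic in Python, since the question asks about it. Treat it as a sketch under assumptions rather than a drop-in script: count_entities and the sample sentence are invented for illustration (the second half of the sample is not from Herodotus), and it relies on Python 3 treating \w as Unicode-aware for str patterns, which is what lets a name like Deïokes survive intact as long as the ï is a single precomposed character.

```python
import re
from collections import Counter, defaultdict

# Same pattern as the Perl version: word characters, a slash, an upper-case tag.
TAGGED = re.compile(r"\b(\w+)/([A-Z]+)")

def count_entities(text):
    """Return {tag: Counter({name: occurrences})} for every name/TAG token."""
    counts = defaultdict(Counter)
    for name, tag in TAGGED.findall(text):
        counts[tag][name] += 1
    return counts

# Illustrative sample; "\u00ef" is the precomposed letter ï in Deïokes.
sample = (
    "This is the Showing forth of the Inquiry of Herodotus/PERSON "
    "of Halicarnassos/LOCATION, and De\u00efokes/PERSON met Herodotus/PERSON."
)

counts = count_entities(sample)
for tag in sorted(counts):
    print(tag, dict(counts[tag]))
```

The nested defaultdict/Counter plays the role of the autovivified $counts hash: the outer key is the type and the inner key is the word.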


Thanks for bearing with me on that little technical excursion. If it was too technical, try the excellent JavaScript-based regex tool regex101.com, where I set it up for you. You should be able to copy/paste from there to Excel. I recommend a browser plugin that lets you copy table columns.

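Why not just extract the names themselves with a lookahead, using the pattern [a-zA-Z]+?(?=\/PERSON) so that the tag is checked but not included in the match? Drop the (?= ) wrapper, leaving [a-zA-Z]+?\/PERSON, if you want /PERSON to be part of the match too.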
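One caveat with the [a-zA-Z] class is the truncation reported in the question's second edit: it stops at any letter outside plain ASCII. The short Python sketch below contrasts it with \w, which is Unicode-aware for str patterns in Python 3. The sample text is invented, and other regex engines (Notepad++, PSPad) may treat \w differently, so take this as an illustration of the idea rather than a guaranteed fix for those editors.

```python
import re

# Invented sample; "\u00ef" is the precomposed letter ï in Deïokes.
text = "De\u00efokes/PERSON ruled; Herodotus/PERSON wrote of Halicarnassos/LOCATION."

# ASCII-only class: the match cannot cross the ï, so only the tail is found.
ascii_only = re.findall(r"[a-zA-Z]+?(?=/PERSON)", text)

# Unicode-aware \w: the whole name is matched, tag still excluded by the lookahead.
unicode_aware = re.findall(r"\w+(?=/PERSON)", text)

print(ascii_only)
print(unicode_aware)
```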

You could even go so far as to extract everything into groups using: ([a-zA-Z]+?)\/([A-Z]+). Then you could output the captured groups however you want. In any decent text editor such as SublimeText you could find [\s\S]*?([a-zA-Z]+?)\/([A-Z]+)[\s\S]*? and replace with { $2: $1 }, for example to make a nice array of JS objects.
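Outside an editor, the same two-group idea can build the objects directly instead of going through find and replace. This is a minimal Python sketch, assuming named groups (name and tag are labels chosen here, not part of the original pattern) and a made-up one-line sample; \w+ stands in for [a-zA-Z]+? so accented names are not cut short.

```python
import json
import re

# Named groups make the later lookups easier to read than $1 and $2.
TAGGED = re.compile(r"(?P<name>\w+)/(?P<tag>[A-Z]+)")

sample = "Herodotus/PERSON of Halicarnassos/LOCATION"

# One small {TAG: name} object per match, mirroring the { $2: $1 } replacement.
records = [{m["tag"]: m["name"]} for m in TAGGED.finditer(sample)]

# ensure_ascii=False keeps letters such as ï readable in the JSON output.
print(json.dumps(records, ensure_ascii=False))
```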

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow