Question

I have a text file, and each line is of the form:

TAB WORD TAB PoS TAB FREQ#

Word    PoS     Freq
the     Det     61847
of      Prep    29391
and     Conj    26817
a       Det     21626
in      Prep    18214
to      Inf     16284
it      Pron    10875
is      Verb    9982
to      Prep    9343
was     Verb    9236
I       Pron    8875
for     Prep    8412
that    Conj    7308
you     Pron    6954

Would one of you regex wizards kindly assist me in isolating the WORDS from the file? I'll do a find and replace in TextPad, hopefully, and that will be that. Multiple find-and-replace passes are fine. One thing: notice that searching for "verb" would also turn up the WORD "verb," not just the part of speech, so be careful. In the end I want one word per line.

Thanks so much!


Solution

I think Microsoft Excel can handle this more easily.

Just paste the whole text into Excel; since the fields are tab-separated, it will be laid out as a table. Then select the cells of the column that holds the words and copy them into Notepad.

I bet this is the easiest path.

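If Excel instead puts each whole line into a single column, extract the word in a separate column with:

=TRIM(LEFT(C1,maxchar))

Here C1 is the cell holding the line, and maxchar is a placeholder for the number of characters to take from the left, not a built-in Excel name.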

OTHER TIPS

You could just use awk to print only the first column, as in

awk '{print $1}' /path/to/filename

Skip the first line by using

awk 'NR!=1 {print $1}' /path/to/filename
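As a quick check of the awk approach, here is a minimal sketch that writes a few rows from the question to a temporary file and extracts the words. The temporary file and the variable names `sample` and `words` are assumptions made for illustration, as is the leading tab on every line (taken from the format the question describes).

```shell
# Build a small tab-separated sample; each line starts with a tab,
# as in the "TAB WORD TAB PoS TAB FREQ#" format from the question.
sample=$(mktemp)
printf '\tWord\tPoS\tFreq\n'   >  "$sample"
printf '\tthe\tDet\t61847\n'   >> "$sample"
printf '\tof\tPrep\t29391\n'   >> "$sample"
printf '\tto\tInf\t16284\n'    >> "$sample"

# NR > 1 skips the header row. With awk's default field splitting,
# leading whitespace is ignored, so $1 is the word either way.
words=$(awk 'NR > 1 {print $1}' "$sample")

printf '%s\n' "$words"   # prints the, of, to (one per line)
rm -f "$sample"
```

Because awk splits on runs of whitespace by default, the same command works whether or not the lines begin with a tab.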

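There's not really any need to use a regular expression for this. For example, you can use cut, which splits on tabs by default:

cut -f1 <inputfile

If every line really begins with a tab, as the format in the question shows, the first field is empty and the word is the second one, so use -f2 instead.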
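Which field number is right for cut depends on the file layout, because cut treats every single tab as a separator and counts an empty field in front of a leading tab. The sketch below tries both layouts on rows from the question; the temporary files and the names `lead`, `nolead`, `from_f2`, and `from_f1` are assumptions made for illustration.

```shell
# Layout A: each line begins with a tab, so the word is field 2.
lead=$(mktemp)
printf '\tthe\tDet\t61847\n\tof\tPrep\t29391\n' > "$lead"

# Layout B: no leading tab, so the word is field 1.
nolead=$(mktemp)
printf 'the\tDet\t61847\nof\tPrep\t29391\n' > "$nolead"

from_f2=$(cut -f2 "$lead")
from_f1=$(cut -f1 "$nolead")

printf '%s\n' "$from_f2"   # prints the, of (one per line)
printf '%s\n' "$from_f1"   # prints the, of (one per line)
rm -f "$lead" "$nolead"
```

If the file has a header row, piping the result through `tail -n +2` drops it.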

Something like \s*([a-zA-Z]+)\s+([a-zA-Z]+) would capture the word and PoS as groups. You can then use them in the replacement as $1 and $2 to output what you want.

If you only want the WORD part, use just $1 in the replacement, and append .* to the pattern so the rest of the line (including the frequency) is matched and dropped as well.
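For anyone without TextPad, roughly the same replacement can be sketched with sed. This is an illustration under assumptions, not the answerer's exact method: it uses POSIX character classes instead of \s and \d, writes the group reference as \1 where the answer writes $1, and assumes a headerless file whose lines start with a tab. The names `sample` and `words` are hypothetical.

```shell
# Rows taken from the question, each with a leading tab and no header.
sample=$(mktemp)
printf '\tthe\tDet\t61847\n'  >  "$sample"
printf '\tis\tVerb\t9982\n'   >> "$sample"
printf '\tthat\tConj\t7308\n' >> "$sample"

# Match the whole line (word, PoS, frequency) and keep only group 1.
# Anchoring on column position means "Verb" in the PoS column is
# never confused with a word spelled "verb".
words=$(sed -E 's/^[[:space:]]*([A-Za-z]+)[[:space:]]+([A-Za-z]+)[[:space:]]+[0-9]+[[:space:]]*$/\1/' "$sample")

printf '%s\n' "$words"   # prints the, is, that (one per line)
rm -f "$sample"
```

Lines that do not fit the pattern, such as a header without a numeric frequency, pass through unchanged, which makes them easy to spot in the output.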

Licensed under: CC-BY-SA with attribution