Question

I grabbed a text corpus from NLTK and now want to process it so that every line in the file ends with a punctuation mark. For example:

Her mother
had died too long ago for her to
remember her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Should become:

Her mother had died too long ago for her to remember her caresses;
and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.

I tried using sed to match lines that do not end with punctuation, but I can't figure out how to pull the next line up onto the current one. Would appreciate any help!

Solution

What if you use paste and sed like this?

paste -s joins all the lines of the file into a single line, and -d' ' puts a space between them.

$ paste -s -d' ' file
Her mother had died too long ago for her to remember her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.

sed then replaces the space after every . or ; with a newline. This relies on GNU sed for -r and for \n in the replacement.

$ paste -s -d' ' file | sed -r 's/(\.|\;) /\1\n/g'
Her mother had died too long ago for her to remember her caresses;
and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.
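
For comparison, the same two steps can be sketched in Python with only the standard library. This is a rough equivalent of the pipeline above, not a drop-in replacement: the function name rejoin_lines is made up for illustration, and unlike paste it also strips leading and trailing whitespace and skips blank lines.

```python
import re

def rejoin_lines(text: str) -> str:
    """Join all lines with spaces, then break after '.' or ';'."""
    # Rough equivalent of `paste -s -d' '`: collapse line breaks into single spaces.
    joined = " ".join(line.strip() for line in text.splitlines() if line.strip())
    # Rough equivalent of the sed step: turn the space after '.' or ';' into a newline.
    return re.sub(r"([.;]) ", r"\1\n", joined)

sample = """Her mother
had died too long ago for her to
remember her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection."""

print(rejoin_lines(sample))
```

As with the sed version, only . and ; are treated as line endings; add other characters to the character class if the corpus needs them.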

OTHER TIPS

In Python:

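import string  # for string.punctuation

with open("path/to/file") as f:
    output = ""
    for line in f:
        sanitized = line.strip()
        if not sanitized:
            continue  # skip blank lines; sanitized[-1] would raise IndexError
        output += sanitized
        # End the line after punctuation; otherwise join the next line with a space.
        output += "\n" if sanitized[-1] in string.punctuation else " "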

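After the with block terminates, output holds the rejoined text. You can then overwrite the file with output if you want the change to persist. Note that string.punctuation also contains characters such as commas and quotes, so a line that happens to end in one of those will end an output line as well.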
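
The overwrite step could look like the sketch below. It is one possible way to do it, under two assumptions that are not in the answer above: the helper name rejoin_file is invented, and TERMINATORS narrows the check to a handful of sentence-ending characters instead of all of string.punctuation.

```python
import tempfile
from pathlib import Path

TERMINATORS = ".;!?"  # assumption: characters allowed to end an output line

def rejoin_file(path: Path) -> None:
    """Rewrite `path` in place so that lines only break after a terminator."""
    pieces = []
    for line in path.read_text().splitlines():
        sanitized = line.strip()
        if not sanitized:
            continue  # ignore blank lines
        pieces.append(sanitized)
        # Newline after a terminator, otherwise a space before the next line.
        pieces.append("\n" if sanitized[-1] in TERMINATORS else " ")
    # Drop any trailing separator and end the file with a single newline.
    path.write_text("".join(pieces).rstrip() + "\n")

# Demo on a throwaway file so that no real data is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.txt"
    demo.write_text("Her mother\nhad died too long ago;\nand her place\nhad been supplied.\n")
    rejoin_file(demo)
    print(demo.read_text())
```

The whole file is read into memory before it is written back, which is fine for a single text but may not suit very large corpora.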

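With NLTK's sent_tokenize(), after replacing the newlines with spaces. Note that it splits on sentence boundaries only, so the clause after the semicolon stays in the same sentence: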

>>> from nltk import sent_tokenize
>>> text = """Her mother
... had died too long ago for her to
... remember her caresses; and her place had been supplied
... by an excellent woman as governess, who had fallen little short
... of a mother in affection."""
>>> sent_tokenize(text.replace("\n", " "))
['Her mother had died too long ago for her to remember her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.']
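
Since the tokenizer output above keeps the whole passage as one sentence, getting the break after the semicolon needs one more step. A minimal sketch, assuming the input is a list of strings such as the one returned by sent_tokenize(); the helper name split_at_semicolons is hypothetical, and the sample list is a shortened stand-in so the snippet runs without NLTK installed.

```python
import re

def split_at_semicolons(sentences):
    """Split each sentence after every ';' that is followed by whitespace."""
    lines = []
    for sentence in sentences:
        # The lookbehind keeps the ';' attached to the clause it ends.
        lines.extend(re.split(r"(?<=;)\s+", sentence))
    return lines

# Stand-in for the list returned by sent_tokenize(), shortened for readability.
sentences = ["Her mother had died too long ago; and her place had been supplied."]
for line in split_at_semicolons(sentences):
    print(line)
```
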
Licensed under: CC-BY-SA with attribution