Question

How do I add the tag NEG_ to all words that follow not, no, and never until the next punctuation mark in a string (used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.

Input:
It was never going to work, he thought. He did not play so well, so he had to practice some more.

Desired output:
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.

Any idea how to solve this?

Solution

Python's re engine lacks some of Perl's features, but you can pass a function (here a lambda expression) as the replacement argument of re.sub to build the replacement dynamically:

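import re

string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
# Outer pattern: a negation keyword, then words and whitespace, up to and including the next punctuation mark.
# Inner substitution: prefix NEG_ to every word preceded by whitespace, which skips the keyword itself.
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
       lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
       string,
       flags=re.IGNORECASE)
print(transformed)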

This prints:

It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !

Explanation

  • The first step is to select the parts of your string you're interested in. This is done with

    \b(?:not|never|no)\b[\w\s]+[^\w\s]
    

    Your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumeric characters and whitespace (\w is [0-9a-zA-Z_] in ASCII mode, and for Python 3 str patterns it also matches Unicode letters and digits; \s is any kind of whitespace), up to a character that is neither alphanumeric nor whitespace (acting as punctuation).

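    Note that the trailing punctuation mark is mandatory here, so a negation that runs to the end of the string without any punctuation is not matched. You can safely remove [^\w\s] to cover the end of the string as well.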

  • Now you're dealing with strings such as "never going to work,". Select the words preceded by whitespace with

    (\s+)(\w+)
    

    And replace them with what you want

    \1NEG_\2
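
The variant mentioned in the note above (dropping the mandatory punctuation) can be sketched as follows. This is a minimal sketch, not part of the original answer: the names NEGATION_SCOPE, SCOPED_WORD and mark_negation are hypothetical, and the only change to the patterns is that the trailing [^\w\s] is removed.

```python
import re

# Outer pattern WITHOUT the trailing [^\w\s]: the scope now ends at the
# next punctuation mark or at the end of the string (assumed variant).
NEGATION_SCOPE = re.compile(r'\b(?:not|never|no)\b[\w\s]+', re.IGNORECASE)

# Inner pattern: a word preceded by whitespace. The negation keyword sits
# at the very start of the match with no leading whitespace, so it is skipped.
SCOPED_WORD = re.compile(r'(\s+)(\w+)')


def mark_negation(text):
    """Prefix NEG_ to every word that follows a negation keyword."""
    return NEGATION_SCOPE.sub(
        lambda m: SCOPED_WORD.sub(r'\1NEG_\2', m.group(0)),
        text,
    )


print(mark_negation("He did not play so well"))
# He did not NEG_play NEG_so NEG_well
```

Compiling both patterns once is only a convenience here; the behavior matches the nested re.sub calls of the original snippet apart from the optional punctuation.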
    

Other tips

I would not do this with a regexp. Rather, I would:

  • Split the input on punctuation characters.
  • For each fragment:
      • Set the negation counter to 0.
      • Split the fragment into words.
      • For each word:
          • Prepend NEG_ to the word as many times as the negation counter says (or use the counter mod 2, or a single NEG_ if the counter is greater than 0).
          • If the original word is in {no, never, not}, increase the negation counter by one.
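
The steps above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the answerer's code: the function name mark_negation_by_tokens and the default punctuation set are hypothetical, re.split is used only so that punctuation and whitespace survive reassembly, and the "single NEG_ if the counter is greater than 0" option is the one implemented.

```python
import re

NEGATIONS = {"no", "not", "never"}


def mark_negation_by_tokens(text, punctuation=".,:;!?"):
    """Tag words that follow a negation word, up to the next punctuation mark."""
    # The capturing group makes re.split keep the punctuation marks.
    fragments = re.split("([" + re.escape(punctuation) + "])", text)
    result = []
    for fragment in fragments:
        negations_seen = 0  # reset at every punctuation boundary
        # Keep the whitespace runs too, so the text can be rebuilt exactly.
        tokens = re.split(r"(\s+)", fragment)
        for i, token in enumerate(tokens):
            if not token or token.isspace():
                continue
            if negations_seen > 0:
                tokens[i] = "NEG_" + token  # one prefix, however many negations
            if token.lower() in NEGATIONS:
                negations_seen += 1
        result.append("".join(tokens))
    return "".join(result)


text = ("It was never going to work, he thought. "
        "He did not play so well, so he had to practice some more.")
print(mark_negation_by_tokens(text))
```

To get the counting or mod-2 variants instead, only the line that builds the prefix would need to change, for example to "NEG_" * negations_seen.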

You will need to do this in several steps (at least in Python - .NET languages can use a regex engine that has more capabilities):

  • First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.

  • Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.

  • Join the string together again and insert the result in your original string in the place of the first regex's match.
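
A minimal sketch of these three steps, assuming Python's re module: the first regex is the one suggested above, while the names SCOPE, tag_scope and mark_negation_two_step, the use of re.IGNORECASE, and the choice of \w+ to find words are assumptions made for illustration. Passing a function to re.sub takes care of the third step, since each match is replaced in place by the function's return value.

```python
import re

# Step 1: a negation keyword followed by everything up to the next
# punctuation mark; group 1 is the scope to be tagged.
SCOPE = re.compile(r"\b(?:not?|never)\b([^.,:;!?]+)", re.IGNORECASE)


def tag_scope(match):
    scope = match.group(1)
    # Group 1 ends the match, so the keyword is whatever precedes it.
    keyword = match.group(0)[: len(match.group(0)) - len(scope)]
    # Step 2: prepend NEG_ to every word inside the scope.
    tagged = re.sub(r"\w+", lambda word: "NEG_" + word.group(0), scope)
    return keyword + tagged


def mark_negation_two_step(text):
    # Step 3: re.sub puts each tagged scope back where the match was found.
    return SCOPE.sub(tag_scope, text)


print(mark_negation_two_step("He did not play so well, so he had to practice."))
# He did not NEG_play NEG_so NEG_well, so he had to practice.
```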

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow