Question

How do I add the tag NEG_ to all words that follow not, no and never until the next punctuation mark in a string (used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.

Input:
It was never going to work, he thought. He did not play so well, so he had to practice some more.

Desired output:
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.

Any idea how to solve this?


Solution

Python's re module lacks some of Perl's regex features, but re.sub accepts a function (here a lambda) as the replacement, which lets you build the replacement dynamically:

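import re

string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"

# Outer pattern: a negation keyword, then words and spaces, up to one punctuation character.
# The lambda prefixes NEG_ to every word that follows whitespace inside each match.
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                     lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                     string,
                     flags=re.IGNORECASE)
print(transformed)

This prints: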

It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !

Explanation

  • The first step is to select the parts of your string you're interested in. This is done with

    \b(?:not|never|no)\b[\w\s]+[^\w\s]
    

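    Your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by word characters and whitespace (\w is [0-9a-zA-Z_] in ASCII mode and also covers Unicode letters and digits for str patterns in Python 3; \s is any kind of whitespace), up to a character that is neither a word character nor whitespace (acting as punctuation).

    Note that the trailing punctuation is mandatory here, so a negation that runs to the end of the string without punctuation is not matched. You can safely remove [^\w\s] to cover that case as well.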

  • Each match is now a fragment such as "never going to work,". Within it, select the words preceded by whitespace with

    (\s+)(\w+)
    

    And replace them with the prefixed form, keeping the original whitespace (\1) in front of the word (\2):

    \1NEG_\2
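
    The two steps can also be run separately to see what each one does. This is only a sketch: the sample sentence and the names outer, outer_open and tag are made up for illustration, and outer_open is the variant without the mandatory punctuation mentioned above.

```python
import re

# Hypothetical sample text; the last negation has no trailing punctuation.
text = "He did not play so well, so he practiced. It will never work"

# Step 1: negation keyword, then words/spaces, then one punctuation character.
outer = re.compile(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]', re.IGNORECASE)
# Variant without the trailing punctuation, so a span may run to end of string.
outer_open = re.compile(r'\b(?:not|never|no)\b[\w\s]+', re.IGNORECASE)

def tag(match):
    # Step 2: prefix every word that follows whitespace inside the matched span.
    return re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0))

spans = [m.group(0) for m in outer.finditer(text)]
strict = outer.sub(tag, text)
relaxed = outer_open.sub(tag, text)

print(spans)    # ['not play so well,']
print(strict)   # He did not NEG_play NEG_so NEG_well, so he practiced. It will never work
print(relaxed)  # He did not NEG_play NEG_so NEG_well, so he practiced. It will never NEG_work
```

    With the strict pattern the final "never work" is left alone because nothing closes the span; the relaxed pattern tags it.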
    

Other tips

I would not do this with a regular expression. Instead, I would:

  • Split the input on punctuation characters.
  • For each fragment:
      • Set a negation counter to 0.
      • Split the fragment into words.
      • For each word:
          • Prepend NEG_ to the word as many times as the negation counter says (or counter mod 2, or just once if the counter is greater than 0).
          • If the original word is one of {no, never, not}, increase the negation counter by one.
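
A minimal sketch of the counter approach, assuming the "tag once if the counter is greater than 0" variant and a small, fixed punctuation set. The names NEGATIONS and tag_negations are hypothetical, and re.split is used only to cut the text while keeping the separators, so that the original spacing and punctuation survive.

```python
import re

NEGATIONS = {"no", "not", "never"}  # compared case-insensitively

def tag_negations(text):
    # Split on punctuation; the capturing group keeps each mark as its own piece.
    pieces = re.split(r'([.,:;!?])', text)
    out = []
    for piece in pieces:
        negations = 0  # the counter is reset for every fragment
        tagged = []
        # Split on whitespace, again keeping the separators.
        for tok in re.split(r'(\s+)', piece):
            if tok and not tok.isspace():
                word = tok
                if negations > 0:
                    tok = "NEG_" + tok
                if word.lower() in NEGATIONS:
                    negations += 1
            tagged.append(tok)
        out.append("".join(tagged))
    return "".join(out)

sample = ("It was never going to work, he thought. "
          "He did not play so well, so he had to practice some more.")
print(tag_negations(sample))
```

On the sentence from the question this reproduces the desired output. To use one of the other counting variants, replace the `if negations > 0` branch with `"NEG_" * negations` or `"NEG_" * (negations % 2)`.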

You will need to do this in several steps (at least in Python; .NET languages can use a regex engine that has more capabilities):

  • First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.

  • Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.

  • Join the string together again and insert the result in your original string in the place of the first regex's match.
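
These three steps might look like the sketch below. It uses the regex suggested above; the re.IGNORECASE flag, the names NEGATED_SPAN and tag_span, and the choice of \w+ to find words are assumptions for illustration. Passing a function to re.sub takes care of the "join and insert back" step.

```python
import re

# Step 1: a negation keyword, then everything up to the next punctuation mark.
# IGNORECASE is an assumption so that "Not" at the start of a sentence also matches.
NEGATED_SPAN = re.compile(r'\b(?:not?|never)\b([^.,:;!?]+)', re.IGNORECASE)

def tag_span(match):
    # Step 2: prepend NEG_ to every word in group 1 (the text after the keyword).
    tagged = re.sub(r'\w+', r'NEG_\g<0>', match.group(1))
    # Step 3: put the untouched keyword back in front of the tagged words.
    keyword = match.string[match.start(0):match.start(1)]
    return keyword + tagged

text = ("It was never going to work, he thought. "
        "He did not play so well, so he had to practice some more.")
result = NEGATED_SPAN.sub(tag_span, text)
print(result)
```

Because \w+ stops at apostrophes, a word such as "don't" would be tagged in two parts; a different word pattern may suit real text better.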

License: CC-BY-SA with attribution
Not affiliated with StackOverflow