Python: Replace string with prefixStringSuffix keeping original case, but ignoring case when searching for match

StackOverflow https://stackoverflow.com/questions/818691

Question

So what I'm trying to do is replace a string "keyword" with "<b>keyword</b>" in a larger string.

Example:

myString = "HI there. You should higher that person for the job. Hi hi."

keyword = "hi"

The result I want is:

result = "<b>HI</b> there. You should higher that person for the job. <b>Hi</b> <b>hi</b>."

I will not know the keyword until the user types it, and I won't know the corpus (myString) until the query is run.

I found a solution that works most of the time but produces some false positives: it would return "<b>hi</b>gher", which is not what I want. Also note that I am trying to preserve the case of the original text, and the matching should be case-insensitive. So if the keyword is "hi", it should replace HI with <b>HI</b> and hi with <b>hi</b>.

The closest I have come is using a slightly modified version of this: http://code.activestate.com/recipes/576715/ but I still could not figure out how to do a second pass over the string to fix all of the false positives mentioned above.

Another option is NLTK's WordPunctTokenizer (which simplifies some things like punctuation), but I'm not sure how I would put the sentences back together, given that it has no reverse function and I want to keep the original punctuation of myString. Essentially, concatenating all the tokens does not return the original string. For example, I would not want to replace "7 - 7" with "7-7" when regrouping the tokens into the original text if the original text had "7 - 7".

Hope that was clear enough. It seems like a simple problem, but it has turned out to be a little more difficult than I thought.

Solution

This ok?

>>> import re
>>> myString = "HI there. You should higher that person for the job. Hi hi."
>>> keyword = "hi"
>>> search = re.compile(r'\b(%s)\b' % re.escape(keyword), re.I)
>>> search.sub('<b>\\1</b>', myString)
'<b>HI</b> there. You should higher that person for the job. <b>Hi</b> <b>hi</b>.'

The key to the whole thing is using word boundaries, groups and the re.I flag.
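To see why each of the three ingredients matters, here is a small sketch that drops them one at a time, using the string from the question. The variable names (full, no_boundary, no_flag, no_group) are made up for illustration and are not part of the answer above.

```python
import re

text = "HI there. You should higher that person for the job. Hi hi."
keyword = "hi"

# All three ingredients: \b anchors, a capturing group, and re.I.
full = re.sub(r'\b(%s)\b' % keyword, r'<b>\1</b>', text, flags=re.I)

# Without \b, the "hi" inside "higher" is wrapped too (the false positive).
no_boundary = re.sub(r'(%s)' % keyword, r'<b>\1</b>', text, flags=re.I)

# Without re.I, only the exact-case "hi" is found.
no_flag = re.sub(r'\b(%s)\b' % keyword, r'<b>\1</b>', text)

# Without the group reference, the keyword's case overwrites the original.
no_group = re.sub(r'\b%s\b' % keyword, '<b>%s</b>' % keyword, text, flags=re.I)

for label, value in [("full", full), ("no_boundary", no_boundary),
                     ("no_flag", no_flag), ("no_group", no_group)]:
    print(label, "->", value)
```

Only the first variant gives the result asked for in the question; the others either wrap "higher", miss "HI" and "Hi", or lowercase every match.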

OTHER TIPS

You should be able to do this very easily with re.sub using the word boundary assertion \b, which only matches at a word boundary:

import re

def SurroundWith(text, keyword, before, after):
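  # re.escape keeps any regex metacharacters in the keyword literal.
  regex = re.compile(r'\b%s\b' % re.escape(keyword), re.IGNORECASE)
  # \g<0> inserts the whole match; \0 would be read as a NUL character.
  return regex.sub(r'%s\g<0>%s' % (before, after), text)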

Then you get:

>>> SurroundWith('HI there. You should hire that person for the job. '
...              'Hi hi.', 'hi', '<b>', '</b>')
'<b>HI</b> there. You should hire that person for the job. <b>Hi</b> <b>hi</b>.'

If you have more complicated criteria for what constitutes a "word boundary," you'll have to do something like:

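def SurroundWith2(text, keyword, before, after):
  # Lookarounds check the neighboring characters without consuming them, so
  # matches at the start or end of the text and back-to-back matches are found.
  regex = re.compile(r'(?<![a-zA-Z0-9])(%s)(?![a-zA-Z0-9])' % re.escape(keyword),
                     re.IGNORECASE)
  return regex.sub(r'%s\1%s' % (before, after), text)

You can modify the [a-zA-Z0-9] character class in the two lookarounds to cover anything you consider a "word" character.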

I think the best solution would be a regular expression:

import re
def reg(keyword, myString) :
   regx = re.compile(r'\b(' + keyword + r')\b', re.IGNORECASE)
   return regx.sub(r'<b>\1</b>', myString)

Of course, you must first make your keyword "regular expression safe" (escape any regex special characters, for example with re.escape).
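The escaping step can be sketched with re.escape from the standard library. The function name highlight and its before/after defaults are hypothetical, chosen only for this example. The keyword "3.5" shows the problem: unescaped, the "." matches any character, so "315" would be wrapped as well.

```python
import re

def highlight(text, keyword, before='<b>', after='</b>'):
    """Wrap whole-word, case-insensitive matches of keyword, keeping their case.

    Hypothetical helper for illustration; not taken from the answers above.
    """
    # re.escape makes characters such as "." or "(" match literally.
    pattern = re.compile(r'\b(%s)\b' % re.escape(keyword), re.IGNORECASE)
    # A function replacement avoids backslash processing in before/after.
    return pattern.sub(lambda m: before + m.group(1) + after, text)

sample = "Python 3.5 is not build 315."

# Escaped keyword: only the literal "3.5" is wrapped.
escaped = highlight(sample, "3.5")

# Unescaped keyword, for comparison: "." matches the "1" in "315" as well.
unescaped = re.sub(r'\b(3.5)\b', r'<b>\1</b>', sample, flags=re.I)

print(escaped)
print(unescaped)
```

One caveat: \b only matches next to a word character, so a keyword that starts or ends with punctuation (say "c++") may need explicit lookarounds instead of \b.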

Here's one suggestion, from the nitpicking committee. :-)

myString = "HI there. You should higher that person for the job. Hi hi."

myString = myString.replace('higher', 'hire')
Licensed under: CC-BY-SA with attribution