Question

So I may have a string 'Bank of China', or 'Embassy of China', and 'International China'

I want to replace all country instances except when we have an 'of ' or 'of the '

Clearly this can be done by iterating through a list of countries, checking if the name contains a country, then checking if before the country 'of ' or 'of the ' exists.

If these do exist then we do not remove the country, else we do remove country. The examples will become:

'Bank of China', or 'Embassy of China', and 'International'

However, iteration can be slow, particularly when you have a large list of countries and a large list of texts for replacement.

Is there a faster and more conditionally based way of replacing the string? So that I can still use a simple pattern match using the Python re library?

My function is along these lines:

def removeCountry(name):
    for country in countries:
        if country in name:
            if 'of ' + country in name:
                return name
            if 'of the ' + country in name:
                return name
            else:
                name =  re.sub(country + '$', '', name).strip()
                return name
    return name

EDIT: I did find some info here. This does describe how to do an if, but I really want an "if not 'of ' and not 'of the ' then replace"...


Solution

You could compile a few sets of regular expressions, then pass your list of input through them. Something like:

import re

countries = ['foo', 'bar', 'baz']
takes = [re.compile(r'of\s+(the)?\s*%s$' % (c), re.I) for c in countries]
subs = [re.compile(r'%s$' % (c), re.I) for c in countries]

def remove_country(s):
    for regex in takes:
        if regex.search(s):
            return s
    for regex in subs:
        s = regex.sub('', s)
    return s

print(remove_country('the bank of foo'))
print(remove_country('the bank of the baz'))
print(remove_country('the nation bar'))

''' Output:
    the bank of foo
    the bank of the baz
    the nation
'''

It doesn't look like anything faster than linear time complexity is possible here. At least you can avoid recompiling the regular expressions a million times and improve the constant factor.
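One way to push the constant factor down further might be to fold every country into a single compiled alternation, so the per-country loop runs inside the regex engine rather than in Python. This is only a sketch under assumptions: the sample country list and the name trailing_country are made up for illustration, and it has not been benchmarked against the version above.

```python
import re

countries = ['China', 'Italy', 'New Zealand']  # sample data, not from the answer above

# Longest names first, so a longer name is tried before a shorter one that is its prefix.
alternation = '|'.join(
    re.escape(c) for c in sorted(countries, key=len, reverse=True)
)

# One compiled pattern: a trailing country name not preceded by "of " or "of the ".
# Python's re only allows fixed-width lookbehinds, hence two separate ones.
trailing_country = re.compile(
    r'(?<!of )(?<!of the )\b(?:%s)$' % alternation, re.I
)

def remove_country(s):
    # Drop the trailing country (if the pattern allows it) and any space left behind.
    return trailing_country.sub('', s).rstrip()

print(remove_country('Bank of China'))              # Bank of China
print(remove_country('International China'))        # International
print(remove_country('Museum of the New Zealand'))  # Museum of the New Zealand
```

The work is still linear in the size of the input, but each string is scanned once instead of once per country.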

Edit: I had a few typos, but the basic idea is sound and it works. I've added an example.

OTHER TIPS

I think you could use the approach in "Python: how to determine if a list of words exist in a string" to find any countries mentioned, then do further processing from there.

Something like

countries = [
    "Afghanistan",
    "Albania",
    "Algeria",
    "Andorra",
    "Angola",
    "Anguilla",
    "Antigua",
    "Arabia",
    "Argentina",
    "Armenia",
    "Aruba",
    "Australia",
    "Austria",
    "Azerbaijan",
    "Bahamas",
    "Bahrain",
    "China",
    "Russia"
    # etc
]

def find_words_from_set_in_string(set_):
    set_ = set(set_)
    def words_in_string(s):
        return set_.intersection(s.split())
    return words_in_string

get_countries = find_words_from_set_in_string(countries)

then

get_countries("The Embassy of China in Argentina is down the street from the Consulate of Russia")

returns

set(['Argentina', 'China', 'Russia'])

... which obviously needs more post-processing, but very quickly tells you exactly what you need to look for.

As pointed out in the linked article, you must be wary of words ending in punctuation. Note that s.split() with a multi-character argument splits on that whole string, not on each character in it, so this needs something like re.split(r"[\s,.!?;:'\"]+", s) instead. You may also want to look for adjectival forms, i.e. "Russian", "Chinese", etc.
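As a rough illustration of the punctuation point, one option is to extract word tokens with a regular expression before intersecting with the country set. This is a sketch, assuming single-word country names only (something like "New Zealand" would need a different approach); the sample set and sentence are made up.

```python
import re

countries = {'Argentina', 'China', 'Russia'}  # sample subset for illustration

def get_countries(s):
    # \w+ keeps runs of letters, digits and underscores, so punctuation
    # attached to a word ("China," or "Argentina.") is left out of the token.
    words = re.findall(r'\w+', s)
    return countries.intersection(words)

found = get_countries('He flew from China, via Russia; then on to Argentina.')
print(sorted(found))  # ['Argentina', 'China', 'Russia']
```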

Not tested:

def removeCountry(name):
    for country in countries:
        # Python's re needs fixed-width lookbehinds, so use one per prefix
        name = re.sub('(?<!of )(?<!of the )' + re.escape(country) + '$', '', name).strip()
    return name

With negative lookbehinds, re.sub only matches and replaces the country when it is not preceded by 'of ' or 'of the '.
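One caveat worth checking: Python's standard re module only accepts lookbehinds of fixed width, so an optional group inside (?<!...) is rejected at compile time. The sketch below checks that and shows the two-lookbehind form on a single hard-coded country; the helper name strip_trailing_china is hypothetical.

```python
import re

# An optional group inside a lookbehind makes its width variable,
# which the standard re module refuses to compile.
try:
    re.compile(r'(?<!of (the )?)China$')
    variable_width_accepted = True
except re.error:
    variable_width_accepted = False

# Two fixed-width lookbehinds in sequence express the same condition.
pattern = re.compile(r'(?<!of )(?<!of the )China$')

def strip_trailing_china(name):
    # Remove a trailing "China" unless "of " or "of the " comes right before it.
    return pattern.sub('', name).strip()

print(variable_width_accepted)                      # False
print(strip_trailing_china('International China'))  # International
print(strip_trailing_china('Bank of the China'))    # Bank of the China
```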

The re.sub function accepts a function as replacement text, which is called in order to get the text that should be substituted in the given match. So you could do this:

import re

def make_regex(countries):
    escaped = (re.escape(country) for country in countries)
    states = '|'.join(escaped)
    return re.compile(r'\s+(of(\sthe)?\s)?(?P<state>{})'.format(states))

def remove_name(match):
    name = match.group()
    if name.lstrip().startswith('of'):
        return name
    else:
        return name.replace(match.group('state'), '').strip()

regex = make_regex(['China', 'Italy', 'America'])
regex.sub(remove_name, 'Embassy of China, International Italy').strip()
# result: 'Embassy of China, International'

The result might contain some spurious spaces (in the above case a final strip() is needed). You can fix this by modifying the regex to:

\s*(of(\sthe)?\s)?(?P<state>({}))

This catches the spaces before "of" or before the country name and avoids the bad spacing in the output.

Note that this solution can handle a whole text, not just text of the form Something of Country and Something Country. For example:

In [38]: regex = make_regex(['China'])
    ...: text = '''This is more complex than just "Embassy of China" and "International China"'''

In [39]: regex.sub(remove_name, text)
Out[39]: 'This is more complex than just "Embassy of China" and "International"'

Another usage example:

In [33]: countries = [
    ...:     'China', 'India', 'Denmark', 'New York', 'Guatemala', 'Sudan',
    ...:     'France', 'Italy', 'Australia', 'New Zealand', 'Brazil', 
    ...:     'Canada', 'Japan', 'Vietnam', 'Middle-Earth', 'Russia',
    ...:     'Spain', 'Portugal', 'Argentina', 'San Marino'
    ...: ]

In [34]: template = 'Embassy of {0}, International {0}, Language of {0} is {0}, Government of {0}, {0} capital, Something {0} and something of the {0}.'

In [35]: text = 100 * '\n'.join(template.format(c) for c in countries)

In [36]: regex = make_regex(countries)
    ...: result = regex.sub(remove_name, text)

In [37]: result[:150]
Out[37]: 'Embassy of China, International, Language of China is, Government of China, capital, Something and something of the China.\nEmbassy of India, Internati'
Licensed under: CC-BY-SA with attribution