Question

I'm trying to extract the position (index) of a substring using regex. I need to use regex because the string won't be exactly the same. I want to get the position of the substring (either starting or ending position), so I can take the 1,000 characters following that substring.

For example, if I had "while foreign currencies are traded frequently, very little money is made by most." I want to find the position of "foreign currencies" so I can get all the words after.

f5 is the text.

I've tried:

p = re.compile("((^\s*|\.\s*)foreign\s*(currency|currencies))?")
for m in p.finditer(f5):
    print m.start(), m.group()

to get the location. This gives me (0,0) even though I've checked to make sure the regex picks up what I'm looking for in the text.

I've also tried:

location = re.search(r"((^\s*|\.\s*)foreign\s*(currency|currencies))?", f5)
print location

Output is <_sre.SRE_Match at 0x297d3328>

If I try

location.span() 

I get (0,0) again.

Basically, I want to convert <_sre.SRE_Match at 0x297d3328> into an integer that gives the location of the search term.

I've spent half a day searching for a solution. Thanks for any help.

Solution

In addition to previous solutions/comments, if you want all the words after, you can just do something like:

>>> location = re.search(r".*foreign\s*currenc(y|ies)(.*)", f5)
>>> location.group(2)
' are traded frequently, very little money is made by most.'

The .group(2) call returns whatever the second capturing group, (.*), matched, i.e. everything after the search term.
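If you also want the position as an integer, here is a rough sketch of how that could look. It assumes f5 holds the example sentence from the question; rest and rest_start are names made up for illustration. m.start(2) is the index where the second group begins. Keep in mind that the leading .* is greedy, so if the phrase occurs several times this anchors on the last occurrence. re.DOTALL is added on the assumption that the real text spans several lines.

```python
import re

# Example sentence from the question, standing in for the real text.
f5 = ("while foreign currencies are traded frequently, "
      "very little money is made by most.")

# re.DOTALL lets "." match newlines too (an assumption about the real text).
m = re.search(r".*foreign\s*currenc(y|ies)(.*)", f5, re.DOTALL)
if m:
    rest = m.group(2)         # text captured by the second group
    rest_start = m.start(2)   # integer index where that text begins in f5
    print(rest_start, repr(rest[:1000]))  # at most 1,000 characters
```

Slicing with [:1000] is safe even when fewer than 1,000 characters remain.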

OTHER TIPS

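Your pattern ends with a ?, which makes the whole outer group optional, so the regex is allowed to match the empty string. Searching starts at index 0, the empty match succeeds there, and that is why you get (0,0). The (^\s*|\.\s*) prefix is a second problem: it only lets the phrase match at the very start of the text or right after a period, and when it does match, Python counts that prefix as part of your match. If you only want the position of the phrase itself, simply remove both from your search string.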

Try:

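 p = re.compile(r'foreign\s+(currency|currencies)')
 m = p.search(f5)   # f5 is the text from the question
 m.start()          # index where the match begins; m.end() is where it ends

This also works with finditer:

 for m in p.finditer(f5):
     print(m.start(), m.group())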
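To tie this back to the original goal (the 1,000 characters after the phrase), here is a minimal sketch. It assumes f5 is the example sentence from the question; optional, pattern and following are just illustrative names. It first reproduces the (0, 0) result from the question's pattern, then uses m.end() on the simplified pattern to slice the text.

```python
import re

# Example sentence from the question, standing in for the real text.
f5 = ("while foreign currencies are traded frequently, "
      "very little money is made by most.")

# The question's pattern: the trailing "?" makes everything optional,
# so an empty match at index 0 succeeds.
optional = re.compile(r"((^\s*|\.\s*)foreign\s*(currency|currencies))?")
print(optional.search(f5).span())  # prints (0, 0)

# Simplified pattern without the optional wrapper or the prefix.
pattern = re.compile(r"foreign\s+currenc(?:y|ies)")
m = pattern.search(f5)
if m is not None:  # search() returns None when nothing matches
    start, end = m.start(), m.end()
    following = f5[end:end + 1000]  # up to 1,000 characters after the phrase
    print(start, end, repr(following))
```

Checking for None first matters, because calling .start() on a failed search raises an AttributeError.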

I don't have much experience in Python, so I can't directly answer your question. But if you want the substring starting with the match, why not just match the rest of the string, or remove everything before the match?

Example 1:

Match foreign currenc(y|ies) followed by every other character in the string. Use the s (dotall) modifier, which is re.DOTALL in Python, so that the dot matches newlines as well.

foreign\s+currenc(?:y|ies).*

Example 2:

Replace whatever this expression matches with an empty string. The .*? lazily consumes characters until the lookahead finds foreign currenc(y|ies) right ahead, so everything before the phrase is removed and the phrase itself is kept.

.*?(?=foreign\s+currenc(?:y|ies))

Note: I changed (currency|currencies) to currenc(?:y|ies) because it factors out the common prefix and skips an unneeded capture group, which is slightly more efficient.
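For what it's worth, the two examples might translate to Python roughly as follows. This is only a sketch: text is a made-up multi-line sample, and core, from_match and trimmed are hypothetical names. re.DOTALL plays the role of the s modifier, and count=1 limits re.sub to the first replacement so later occurrences of the phrase are left alone.

```python
import re

# Made-up multi-line sample text for illustration.
text = "Intro line.\nwhile foreign currency is traded,\nlittle money is made."

core = r"foreign\s+currenc(?:y|ies)"

# Example 1: match the phrase plus everything after it.
m = re.search(core + r".*", text, re.DOTALL)
from_match = m.group(0) if m else None

# Example 2: delete everything before the first occurrence of the phrase.
trimmed = re.sub(r".*?(?=" + core + r")", "", text, count=1, flags=re.DOTALL)

print(repr(from_match))
print(repr(trimmed))
```

Both approaches give the same string when the phrase is present. When it is absent, the first yields None and the second returns the text unchanged.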

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow