Python - Regex - findall duplicates

https://stackoverflow.com/questions/22292425

12-06-2023

Question

I'm trying to match e-mail addresses in HTML text using the following Python code:

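import re

# Raw string, so the backslashes reach the regex engine unchanged.
my_second_pat = r'((\w+)( *?))(@|[aA][tT]|\([aA][tT]\))(((( *?)(\w+)( *?))(\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)([eE][dD][uU]|[cC][oO][mM])'

# line, name and res come from the surrounding program.
matches = re.findall(my_second_pat, line)
for m in matches:
    s = "".join(m)
    email = "".join(s.split())
    res.append((name, 'e', email))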

When I run it with line = 'shoham@stanford.edu'

I get:

[('shoham', 'shoham', '', '@', 'stanford.', 'stanford.', 'stanford', '', 'stanford', '', '.', 'edu')]

What I expect:

[('shoham','@', 'stanford.', 'edu')]

On regexpal.com it is matched as a single string, so I guess my problem is with re.findall.

I'm new to both regex and Python. Any optimizations or modifications are welcome.

Solution

Try this:

(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)([^@\s.]{2,})

If you need to limit to .com and .edu:

(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)(com|edu)

Note that I have used the case-insensitive flag (?i) at the start of the regex, instead of using syntax like [Ee].
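
To see how the three captured groups can be turned back into an address, here is a minimal sketch using the .com/.edu pattern above. The helper name extract_emails and the decision to lowercase the result are assumptions made for illustration, not part of the answer.

```python
import re

# Pattern from the answer above, limited to .com and .edu.
# (?i) makes the whole pattern case-insensitive.
EMAIL_PAT = re.compile(
    r"(?i)([^@\s]{2,})(?:@|\s*at\s*)([^@\s.]{2,})(?:\.|\s*dot\s*)(com|edu)"
)


def extract_emails(line):
    """Hypothetical helper: rebuild user@domain.tld from the three groups."""
    # findall yields one (user, domain, tld) tuple per match, because the
    # separators sit in non-capturing groups.
    return [
        f"{user}@{domain}.{tld}".lower()  # lowercasing is an assumption
        for user, domain, tld in EMAIL_PAT.findall(line)
    ]


print(extract_emails("shoham@stanford.edu"))         # ['shoham@stanford.edu']
print(extract_emails("SHOHAM at Stanford dot EDU"))  # ['shoham@stanford.edu']
```

Because the separators are not captured, the obfuscated "at" and "dot" forms normalize to the same address without any extra cleanup.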

OTHER TIPS

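re.findall returns one entry per capture group when a pattern contains groups. Your groups are nested and some can match an empty string, so the same text appears several times alongside empty entries. Making the groups non-capturing with (?:...) avoids this.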
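
A small sketch of that behaviour, using a deliberately simplified pattern (not the one from the question) so the effect is easy to see:

```python
import re

line = "shoham@stanford.edu"

# Nested capturing groups: findall returns a tuple with one entry per group,
# so 'shoham' shows up twice (once inside the outer group, once on its own).
with_groups = re.findall(r"((\w+)@)(\w+)", line)
print(with_groups)     # [('shoham@', 'shoham', 'stanford')]

# Same pattern with non-capturing groups: findall returns the whole match.
without_groups = re.findall(r"(?:(?:\w+)@)(?:\w+)", line)
print(without_groups)  # ['shoham@stanford']
```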

Try this:

((?:(?:\w+)(?: *?))(?:@|[aA][tT]|\([aA][tT]\))(?:(?:(?:(?: *?)(?:\w+)(?: *?))(?:\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)(?:[eE][dD][uU]|[cC][oO][mM]))

See this link to debug your expression:

http://regex101.com/r/jW4mP1
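
Another option, sketched here as an assumption rather than something the answers propose, is to leave the question's pattern as it is and read the whole match from the match object. re.finditer yields match objects, and group(0) is always the full matched text no matter how many groups the pattern contains.

```python
import re

# The pattern from the question, unchanged apart from the raw-string prefix.
my_second_pat = r'((\w+)( *?))(@|[aA][tT]|\([aA][tT]\))(((( *?)(\w+)( *?))(\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)([eE][dD][uU]|[cC][oO][mM])'

line = "shoham@stanford.edu"

# group(0) is the entire match, so the nested groups no longer matter.
whole_matches = [m.group(0) for m in re.finditer(my_second_pat, line)]

# Drop any internal whitespace, as the question's code does with split/join.
emails = ["".join(s.split()) for s in whole_matches]
print(emails)  # ['shoham@stanford.edu']
```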

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow