I'm trying to match e-mail addresses in HTML text using the following Python code:
import re

my_second_pat = r'((\w+)( *?))(@|[aA][tT]|\([aA][tT]\))(((( *?)(\w+)( *?))(\.|[dD][oO][tT]|\([dD][oO][tT]\)))+)([eE][dD][uU]|[cC][oO][mM])'
matches = re.findall(my_second_pat, line)
for m in matches:
    # m is a tuple holding one string per capturing group
    s = "".join(m)
    email = "".join(s.split())
    res.append((name, 'e', email))
When I run it with line = 'shoham@stanford.edu'
I get:
[('shoham', 'shoham', '', '@', 'stanford.', 'stanford.', 'stanford', '', 'stanford', '', '.', 'edu')]
What I expect:
[('shoham','@', 'stanford.', 'edu')]
On regexpal.com the whole address is matched as a single string, so I guess I'm having trouble with re.findall.
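For reference, here is a minimal sketch of what seems to be going on. When a pattern contains more than one capturing group, re.findall returns a tuple with one entry per group rather than the whole match, so every nested pair of parentheses adds an entry. The sketch below assumes that marking the helper groups as non-capturing with (?:...) and using re.IGNORECASE in place of the [aA][tT]-style classes is acceptable; the name pat is only illustrative.

```python
import re

# Only four capturing groups remain; helper groups use (?:...) so they
# do not show up in the tuples returned by findall.
pat = re.compile(
    r'(\w+) *?'                             # user name
    r'(@|at|\(at\))'                        # "@", "at" or "(at)"
    r'((?: *?\w+ *?(?:\.|dot|\(dot\)))+)'   # one or more "label." parts
    r'(edu|com)',                           # top-level domain
    re.IGNORECASE,
)

line = 'shoham@stanford.edu'

# One tuple per match, one entry per capturing group.
print(pat.findall(line))    # [('shoham', '@', 'stanford.', 'edu')]

# finditer yields match objects; group(0) is the whole matched string.
print([m.group(0) for m in pat.finditer(line)])    # ['shoham@stanford.edu']
```

If only the full address is needed, group(0) from finditer avoids joining the tuple entries by hand.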
I'm new to both regex and Python. Any optimizations or modifications are welcome.