Use Grep to extract just email addresses from text file list of email addresses with names

StackOverflow https://stackoverflow.com/questions/21918404

  •  14-10-2022

Question

This is similar to some questions that are already out there, but I couldn't find one that answered mine specifically, so thank you for any assistance or insight.

I have a text file, opened in TextWrangler (a popular Mac text editor), with names and email addresses. Sample records:

Timmy Turner <tturner@example.com>
"jamminjeff@example.com" <jamminjeff@example.com>
Susan Alder <suesblues@example.com>,
sallyartist@example.com

So some email addresses have names preceding them, most are enclosed in <> brackets, some stand by themselves and are already correct, and some are followed by a comma. I want a global operation that automates getting this end result, either via Grep or something similar:

tturner@example.com
jamminjeff@example.com
suesblues@example.com
sallyartist@example.com

Thanks for any insight!

No correct solution

OTHER TIPS

sed might work better. You can use a regex to remove the patterns that you don't want:

sed -e "s|.*<||" -e "s|>.*||"  your_file.txt  > new_file.txt
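For anyone who wants to check what those two substitutions do before running them on a real file, here is a rough Python sketch of the same idea. The function name strip_brackets and the inline sample list are made up for illustration; only the two patterns come from the sed command above.

```python
import re

def strip_brackets(line: str) -> str:
    """Apply the same two substitutions as the sed command, to one line."""
    # s|.*<||  : greedy, so this drops everything up to the LAST "<"
    line = re.sub(r".*<", "", line, count=1)
    # s|>.*||  : drops the first remaining ">" and whatever follows it
    line = re.sub(r">.*", "", line, count=1)
    return line

# Sample records from the question (hypothetical in-memory stand-in for the file)
sample = [
    'Timmy Turner <tturner@example.com>',
    '"jamminjeff@example.com" <jamminjeff@example.com>',
    'Susan Alder <suesblues@example.com>,',
    'sallyartist@example.com',
]

cleaned = [strip_brackets(line) for line in sample]
print("\n".join(cleaned))
```

On the four sample records this leaves just the addresses. Note that the approach relies on the brackets: a bare address followed by a comma, with no <> around it, would keep its comma.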

TL;DR

Search:

^.*<?\b([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@((?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])\b>?.*$

Replace:

\1@\2

Explanation:

According to this article, the RFC 5322 specification gives an official definition for a valid email address.

Their string, simplified for use in TextWrangler, would be:

Search:

([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@((?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Replace:

\1@\2

By itself, it would match the addresses in these lines:

Timmy Turner <tturner@example.com>
"jamminjeff@example.com" <jamminjeff@example.com>
Susan Alder <suesblues@example.com>,
sallyartist@example.com

While that DOES match your example email strings, it doesn't give you the exact result you want, since it also matches the quoted "jamminjeff@example.com" at the start of the second line, which should be stripped out.
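To see the double match concretely, here is a small Python sketch that runs the simplified pattern over the sample records. It assumes Python's re module behaves closely enough to TextWrangler's grep for this pattern and that matching is case-insensitive (the pattern only lists a-z); the names LOCAL, DOMAIN, EMAIL and find_addresses are made up for illustration.

```python
import re

# The simplified search string from above, split into its two capture groups
# purely for readability.
LOCAL = (
    r'''([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*'''
    r'''|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]'''
    r'''|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")'''
)
DOMAIN = (
    r'''((?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?'''
    r'''|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'''
    r'''(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:'''
    r'''(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
)
EMAIL = re.compile(LOCAL + "@" + DOMAIN, re.IGNORECASE)

def find_addresses(line: str) -> list:
    """Return every substring of the line that the bare pattern matches."""
    return [m.group(0) for m in EMAIL.finditer(line)]

sample = [
    'Timmy Turner <tturner@example.com>',
    '"jamminjeff@example.com" <jamminjeff@example.com>',
    'Susan Alder <suesblues@example.com>,',
    'sallyartist@example.com',
]

for line in sample:
    print(find_addresses(line))
```

The second record yields two matches, one inside the quotes and one inside the brackets, which is why the bare pattern needs something around it to get one address per line.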

You can use some filtering before and after it, if you know a few things:

  1. Is it okay to discard everything before the email string?
  2. Is it okay to discard everything after the email string?
  3. Will any other text be found which butts up against the email string that needs to be removed as well?

If yes to 1 and 2, and no to 3, prepend ^.*<?\b to that string, and append \b>?.*$ to it.

This starts at the beginning of the line, searches for 0 or more characters, an optional opening bracket, and then a word boundary that starts the actual email address.

Then afterward, it looks for the word boundary after the last character of the email address, an optional closing bracket, and zero or more characters till the end of the line.

Replacing that with \1@\2 will clean up the entire line to only contain the email address.
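As a sanity check of the wrapped pattern, here is a Python sketch that builds it from the simplified string plus the prefix and suffix described above, then applies the \1@\2 replacement line by line. It assumes Python's re module is a fair stand-in for TextWrangler's grep and that matching is case-insensitive; the names LOCAL, DOMAIN, WRAPPED and clean_line are made up for illustration.

```python
import re

# The simplified search string, split into its two capture groups for readability.
LOCAL = (
    r'''([a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*'''
    r'''|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]'''
    r'''|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")'''
)
DOMAIN = (
    r'''((?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?'''
    r'''|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}'''
    r'''(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:'''
    r'''(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''
)

# Prefix and suffix exactly as described: ^.*<?\b ... \b>?.*$
WRAPPED = re.compile(r"^.*<?\b" + LOCAL + "@" + DOMAIN + r"\b>?.*$", re.IGNORECASE)

def clean_line(line: str) -> str:
    """Replace the whole line with group 1 + '@' + group 2, as in the Replace field."""
    return WRAPPED.sub(r"\1@\2", line)

sample = [
    'Timmy Turner <tturner@example.com>',
    '"jamminjeff@example.com" <jamminjeff@example.com>',
    'Susan Alder <suesblues@example.com>,',
    'sallyartist@example.com',
]

for line in sample:
    print(clean_line(line))
```

On the four sample records this produces the desired list. One thing the sketch surfaces: because the leading .* is greedy and \b also matches right after a dot, a hypothetical record such as Jo Doe <jo.doe@example.com> comes out as doe@example.com in Python's re. The sample data has no dots in the part before the @, so it is unaffected, but it is worth spot-checking real data for that case.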

Licensed under: CC-BY-SA with attribution