Question

I want to extract some information from an email using regex in C#.

Here is a short snippet from the email:

...with mapi id 14.02.0387.000; Thu, 6 Feb 2014 09:09:33 +0100
From: site <site@company.dk>
To: "nonexistingmail@doesnotexist127.dk" <nonexistingmail@doesnotexist127.dk>
Subject: can this bounce
Thread-Topic: can this bounce
Thread-Index: Ac8jEr8t3k2RouQ1RaGPCXGFcE5oNg==Date:...

I want to extract the "From" address between the <>, the "To" address between the <>, and the subject (in the example, the subject is "can this bounce").

I'm not very familiar with regex, so I would appreciate any help.

(By the way, if there is a simpler, neater solution, I'd be happy to hear it!)


Solution

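A solution using LINQ (msg holds the raw message text, with each header on its own line as in the question):

// Text between the first '<' and the next '>'; assumes the From header contains the first '<' in msg.
var fromAddress = new string(msg.SkipWhile(c => c != '<').Skip(1).TakeWhile(c => c != '>').ToArray());

// Search for "To:" rather than "To" so that a word such as "Topic" cannot match first.
var toAddress = new string(msg.Substring(msg.IndexOf("To:")).SkipWhile(c => c != '<').Skip(1).TakeWhile(c => c != '>').ToArray());

// Read to the end of the line; stopping at the first 'T' would truncate any subject that contains a 'T'.
var subject = new string(msg.Substring(msg.IndexOf("Subject:")).SkipWhile(c => c != ' ').Skip(1).TakeWhile(c => c != '\r' && c != '\n').ToArray());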
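The same idea can be wrapped in small helpers that also cope with a missing header. This is only a sketch: the class name HeaderParser and the methods AddressAfter and ValueAfter are made up for illustration, and it assumes each header sits on its own line.

```csharp
using System;
using System.Linq;

public static class HeaderParser
{
    // Returns the text between '<' and '>' that follows the first occurrence
    // of headerName, or an empty string when the header is missing.
    public static string AddressAfter(string msg, string headerName)
    {
        int start = msg.IndexOf(headerName, StringComparison.Ordinal);
        if (start < 0) return "";
        return new string(msg.Substring(start)
            .SkipWhile(c => c != '<').Skip(1)
            .TakeWhile(c => c != '>').ToArray());
    }

    // Returns the rest of the line after headerName (trimmed),
    // or an empty string when the header is missing.
    public static string ValueAfter(string msg, string headerName)
    {
        int start = msg.IndexOf(headerName, StringComparison.Ordinal);
        if (start < 0) return "";
        return new string(msg.Substring(start + headerName.Length)
            .TakeWhile(c => c != '\r' && c != '\n').ToArray()).Trim();
    }

    public static void Main()
    {
        string msg = "From: site <site@company.dk>\n"
                   + "To: \"nonexistingmail@doesnotexist127.dk\" <nonexistingmail@doesnotexist127.dk>\n"
                   + "Subject: can this bounce\n"
                   + "Thread-Topic: can this bounce";
        Console.WriteLine(AddressAfter(msg, "From:"));
        Console.WriteLine(AddressAfter(msg, "To:"));
        Console.WriteLine(ValueAfter(msg, "Subject:"));
    }
}
```

Passing the header name with its colon ("To:", "Subject:") keeps "Thread-Topic" from being picked up by accident.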

OTHER TIPS

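Full running example using regex:
I used a pattern with 3 groups. It expects the headers to follow one another directly, as in the source string below, where the line breaks have been removed:
@"[Ff]rom:[^<]*\<([^@]+@[^>]+)>[Tt]o:[^<]*\<([^@]+@[^>]+)>[Ss]ubject: ?(.*)Thread-Topic"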

string source = "...with mapi id 14.02.0387.000; Thu, 6 Feb 2014 09:09:33 +0100From: site <site@company.dk>To: \"nonexistingmail@doesnotexist127.dk\" <nonexistingmail@doesnotexist127.dk>Subject: can this bounceThread-Topic: can this bounceThread-Index: Ac8jEr8t3k2RouQ1RaGPCXGFcE5oNg==Date:...";
Regex pattern = new Regex("[Ff]rom:[^<]*\\<([^@]+@[^>]+)>[Tt]o:[^<]*\\<([^@]+@[^>]+)>[Ss]ubject: ?(.*)Thread-Topic");
MatchCollection mc = pattern.Matches(source);
string partFrom = ""; string partTo = ""; string subject = "";
if(mc.Count>0)
{
    partFrom = mc[0].Groups[1].Value;
    partTo = mc[0].Groups[2].Value;
    subject = mc[0].Groups[3].Value;
}
Console.WriteLine("From: " + partFrom + " To: " + partTo + " Subject: " + subject);

The expression checks that each address contains an at sign (@) and extracts all three parts with a single pattern.
If you only want to find the mail addresses, you can use this regex:

@"\<[^>@]+@[^>]+>"
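As a rough sketch of how that address-only pattern might be used, the helper below collects every bracketed address in the text. The names AddressFinder and FindAddresses are hypothetical, and stripping the angle brackets with Trim is an addition for convenience, since the pattern itself matches them.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AddressFinder
{
    // Returns every "<user@host>" token found in the text,
    // with the surrounding angle brackets removed.
    public static List<string> FindAddresses(string text)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(text, @"\<[^>@]+@[^>]+>"))
        {
            result.Add(m.Value.Trim('<', '>'));
        }
        return result;
    }

    public static void Main()
    {
        string sample = "From: site <site@company.dk>\n"
                      + "To: \"nonexistingmail@doesnotexist127.dk\" <nonexistingmail@doesnotexist127.dk>";
        foreach (string address in FindAddresses(sample))
        {
            Console.WriteLine(address);
        }
    }
}
```

The matches come back in document order, so with the sample headers the first entry is the From address and the second is the To address.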

\<(.*?)>

  • \< : matches a literal <. In .NET regex, < is not a metacharacter, so the backslash is optional but harmless.
  • (.*?) : match everything in a non-greedy way and capture it.
  • > : matches a literal >. It is not a metacharacter either, so it needs no escaping.

I tried this in RegexBuddy with the .NET flavour using your source text. The pattern below breaks it into named capture groups, so you can use match.Groups["FROM"].Value, etc.

Because every part of the pattern is optional, you can then iterate over the matches and check whether each one contains a value for a given capture group. I've used this approach before when matching documents which may be incomplete.

(?:From: .+<(?<FROM>.+)>)?(?:To: .+<(?<TO>.+)>)?(?:Subject: (?<SUBJECT>.+))?
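One way that iteration might look in C# is sketched below. The class NamedGroupExtractor and the method ExtractHeaders are illustrative names, and the sketch assumes each header is on its own line and that the first value found for a group is the one to keep.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupExtractor
{
    private static readonly Regex HeaderPattern = new Regex(
        @"(?:From: .+<(?<FROM>.+)>)?(?:To: .+<(?<TO>.+)>)?(?:Subject: (?<SUBJECT>.+))?");

    // Walks all matches (many of them empty, since every part is optional)
    // and keeps the first successful value of each named group.
    public static Dictionary<string, string> ExtractHeaders(string source)
    {
        var headers = new Dictionary<string, string>();
        foreach (Match m in HeaderPattern.Matches(source))
        {
            foreach (string name in new[] { "FROM", "TO", "SUBJECT" })
            {
                Group g = m.Groups[name];
                if (g.Success && !headers.ContainsKey(name))
                {
                    // Trim drops a trailing '\r' when the text uses CRLF line endings.
                    headers[name] = g.Value.Trim();
                }
            }
        }
        return headers;
    }

    public static void Main()
    {
        string source = "From: site <site@company.dk>\n"
                      + "To: \"nonexistingmail@doesnotexist127.dk\" <nonexistingmail@doesnotexist127.dk>\n"
                      + "Subject: can this bounce\n"
                      + "Thread-Topic: can this bounce";
        foreach (var pair in ExtractHeaders(source))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```

A header that is missing from the message simply has no entry in the dictionary, which suits documents that may be incomplete.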

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow