Question

I am working on a project where I need to distinguish emails sent by real humans from bulk mail, notifications and newsletters. Is there any definite way of doing that? Is there any information in the email headers that can help? I am working on top of Gmail IMAP, so the emails I have are already non-spam.

Any help in this regard is appreciated. Thanks!


Solution

There isn't a clear way to distinguish bulk mail from personalised mailings. Unlike with spam, most bulk mail is requested/expected, so the sender doesn't do odd things to get round spam filters, which means these emails often blend in fairly well.

However, there are some trends that you can look for. If you want to do it reliably, you will probably need to apply some scoring system, like spam-filters do.

You will also need to accept that you are bound to get a substantial proportion of false positives and false negatives.
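As a rough illustration of the scoring idea, here is a minimal sketch in Python. The signal names, weights and threshold are all hypothetical placeholders (none of them come from a real spam filter), and you would need to tune them against mail you have labelled yourself.

```python
# Hypothetical weights: higher means "more typical of bulk mail".
# These numbers are assumptions for illustration, not measured values.
WEIGHTS = {
    "list_unsubscribe_header": 3.0,
    "noreply_sender": 2.0,
    "unsubscribe_link": 2.0,
    "generic_greeting": 1.0,
}

# Assumed cut-off; raise it to reduce false positives, lower it to
# reduce false negatives.
THRESHOLD = 4.0


def bulk_score(signals):
    """Sum the weights of every signal that fired.

    `signals` maps a signal name to True/False. Unknown names score 0.
    """
    return sum(WEIGHTS.get(name, 0.0) for name, fired in signals.items() if fired)


def is_probably_bulk(signals, threshold=THRESHOLD):
    """Classify a message as bulk when its score reaches the threshold."""
    return bulk_score(signals) >= threshold
```

No single signal decides the outcome here; several weak hints have to add up, which is what keeps one misleading indicator from misclassifying a message.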

Some things that are common to bulk mail that appear less often in personalised correspondence:

  1. The "To" and "Cc" addresses do not include the actual recipient. The sender will sometimes send to a list address such as "mailList@mydomain.com" instead of the individual addresses "recipientA@recipientAdomain.com", "recipientB@recipientBdomain.com", etc. In these cases, it is also likely that only one address appears in "To" and nothing appears in "Cc".
  2. The "From" address is "noreply@", "newsletter@", "do-not-reply@", "mailinglist@" or, as a weaker signal, something like "support@" or "sales@" (but remember that these last two are also used by real people, so they could cause false positives)
  3. The presence of a "List-Unsubscribe:" header
  4. The message contains an unsubscribe link. Run pattern matching to find common phrases in the final few lines of the email. Look for links, or words such as "unsubscribe", "opt out", etc.
  5. Mailing lists tend to have rich content. Check for heavy use of CSS and lots of images, or for the entire message being contained within a <table></table> or <ul><li></li></ul> structure, i.e. the kind of markup that something like Dreamweaver would produce, rather than a mail client.
  6. Headers or bold content at the top of the message. If the first bit of a message resembles a newsletter, it's probably a newsletter.
  7. Lots of links or frequent linking to the same (or same few) websites. Newsletters will try to guide the user to the company's site(s), as much as they can. You may score this even more highly if the linked domain matches (or resembles) the sender domain.
  8. Heavy references to social media. If it's a newsletter containing several articles, each story may have its own "Tweet this" or "Like this" link. Personal emails are likely to contain (at most) one reference to Twitter, Facebook, etc. (in the signature)
  9. Notifications and other auto-generated messages will often follow the same basic format. If you have the capabilities, run some kind of diffing or other comparison against previous messages. A strong match would imply automation.
  10. There is no greeting, or a generic greeting. However, personal emails will often skip the "Dear Fred" bit too, so this isn't a good enough detection by itself; but things like "Dear User" or "Dear Customer" are almost certainly generic.
  11. The message does not end with a personal sign-off. Bulk mail is unlikely to end in "Regards, Ian" or "Yours Sincerely, John Doe"
  12. The sender has scored highly before. Keep a record. If a sender triggers a high score several times, they are almost certainly bulk mailing.
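
A few of the header and body checks from the list above (items 1, 2, 3, 4 and 10) can be sketched with Python's standard `email` package. This is only an illustration under assumptions: the function name `extract_signals`, the prefix list and the regular expressions are made up for this example, and the "final few lines" are arbitrarily taken to be the last ten. With IMAP you would normally get the raw message as bytes, in which case `email.message_from_bytes` would be used instead of `message_from_string`.

```python
import re
from email import message_from_string
from email.utils import getaddresses, parseaddr

# Assumed sender prefixes, based on the examples in item 2.
BULK_LOCAL_PARTS = ("noreply", "no-reply", "do-not-reply", "donotreply",
                    "newsletter", "mailinglist")
# Assumed phrases for item 4 and generic greetings for item 10.
UNSUBSCRIBE_RE = re.compile(r"unsubscribe|opt[\s-]?out", re.IGNORECASE)
GENERIC_GREETING_RE = re.compile(r"^\s*dear\s+(user|customer|member)\b",
                                 re.IGNORECASE)


def _plain_text(msg):
    """Return the first text/plain part of the message, or '' if none."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


def extract_signals(raw_message, my_address):
    """Map a raw RFC 822 message to a dict of True/False bulk-mail hints."""
    msg = message_from_string(raw_message)

    # Item 2: inspect the local part of the From address.
    _, sender = parseaddr(msg.get("From", ""))
    local_part = sender.split("@")[0].lower()

    # Item 1: is the mailbox owner actually listed in To or Cc?
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = [addr.lower() for _, addr in getaddresses(headers)]

    body = _plain_text(msg)
    tail = "\n".join(body.splitlines()[-10:])  # assumed "final few lines"

    return {
        "not_addressed_to_me": my_address.lower() not in recipients,
        "bulk_sender_name": local_part.startswith(BULK_LOCAL_PARTS),
        "list_unsubscribe_header": msg.get("List-Unsubscribe") is not None,  # item 3
        "unsubscribe_text": bool(UNSUBSCRIBE_RE.search(tail)),               # item 4
        "generic_greeting": bool(GENERIC_GREETING_RE.search(body)),          # item 10
    }
```

Each value is just a hint. On its own none of them is conclusive, so they are best fed into whatever weighting you settle on rather than used as a yes/no answer. The HTML-based checks (items 5 to 8) would need the text/html part and an HTML parser, which this sketch leaves out.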
Licensed under: CC-BY-SA with attribution