Question

I'm looking for good/working/simple to use php code for parsing raw email into parts.

I've written a couple of brute force solutions, but every time, one small change/header/space/something comes along and my whole parser fails and the project falls apart.

And before I get pointed at PEAR/PECL, I need actual code. My host has some screwy config or something, I can never seem to get the .so's to build right. If I do get the .so made, some difference in path/environment/php.ini doesn't always make it available (apache vs cron vs cli).

Oh, and one last thing, I'm parsing the raw email text, NOT POP3, and NOT IMAP. It's being piped into the php script via a .qmail email redirect.

I'm not expecting SOF to write it for me, I'm looking for some tips/starting points on doing it "right". This is one of those "wheel" problems that I know has already been solved.


Solution

What are you hoping to end up with? The body, the subject, the sender, an attachment? You should spend some time with RFC 2822 to understand the format of the mail, but here are the simplest rules for well-formed email:

HEADERS\n
\n
BODY

That is, the first blank line (double newline) is the separator between the HEADERS and the BODY. A HEADER looks like this:

HSTRING:HTEXT

HSTRING always starts at the beginning of a line and doesn't contain any white space or colons. HTEXT can contain a wide variety of text, including newlines as long as the newline char is followed by whitespace.

The "BODY" is really just any data that follows the first double newline. (There are different rules if you are transmitting mail via SMTP, but processing it over a pipe you don't have to worry about that).

So, in really simple, circa-1982 RFC822 terms, an email looks like this:

HEADER: HEADER TEXT
HEADER: MORE HEADER TEXT
  INCLUDING A LINE CONTINUATION
HEADER: LAST HEADER

THIS IS ANY
ARBITRARY DATA
(FOR THE MOST PART)
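
The rules above can be sketched in a few lines of plain PHP. This is only a minimal illustration under those simple rules, not a full RFC 2822 parser; the function name parseRawEmail() is made up for this example, and no MIME decoding is attempted.

```php
<?php
// Sketch only: split a raw RFC 822-style message into headers and body.
// parseRawEmail() is a hypothetical helper name, not part of any library.
function parseRawEmail(string $raw): array
{
    // Normalise line endings so "\r\n" and "\n" input behave the same
    $raw = str_replace(["\r\n", "\r"], "\n", $raw);

    // The first blank line separates the headers from the body
    $parts = explode("\n\n", $raw, 2);
    $headerBlock = $parts[0];
    $body = $parts[1] ?? '';

    // Unfold continuation lines: a newline followed by whitespace
    // continues the previous header
    $headerBlock = preg_replace('/\n[ \t]+/', ' ', $headerBlock);

    $headers = [];
    foreach (explode("\n", $headerBlock) as $line) {
        // HSTRING (no whitespace or colons), a colon, then HTEXT
        if (preg_match('/^([^\s:]+):\s*(.*)$/', $line, $m)) {
            // Headers such as Received may repeat, so keep a list per name
            $headers[strtolower($m[1])][] = $m[2];
        }
    }
    return ['headers' => $headers, 'body' => $body];
}

$mail = parseRawEmail("From: alice@example.com\nSubject: a long\n  folded subject\n\nHello\n");
echo $mail['headers']['subject'][0], "\n";   // a long folded subject
echo $mail['body'];                          // Hello
```

Header names are lower-cased because they are case-insensitive, and each name maps to a list so that repeated headers are not lost.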

Most modern email is more complex than that though. Headers can be encoded for charsets or RFC 2047 MIME words, or a ton of other stuff I'm not thinking of right now. The bodies are really hard to roll your own code for these days too, if you want them to be meaningful. Almost all email that's generated by an MUA will be MIME encoded. That might be quoted-printable text, it might be HTML, it might be a base64-encoded Excel spreadsheet.
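
For the two most common cases, PHP's built-in functions already do the decoding. The sketch below assumes the iconv extension is available (it usually is); decodeBody() is a made-up helper name, and the sample strings are invented for illustration.

```php
<?php
// Sketch only: decode an RFC 2047 "MIME word" header and a transfer-encoded body.
// decodeBody() is a hypothetical helper; iconv_mime_decode() needs the iconv extension.
function decodeBody(string $body, string $transferEncoding): string
{
    switch (strtolower(trim($transferEncoding))) {
        case 'base64':
            return (string) base64_decode($body);
        case 'quoted-printable':
            return quoted_printable_decode($body);
        default:
            // 7bit, 8bit and binary need no decoding
            return $body;
    }
}

// A Subject header carrying a base64 ("B") encoded MIME word
echo iconv_mime_decode('=?UTF-8?B?SGVsbG8gV29ybGQ=?=', 0, 'UTF-8'), "\n"; // Hello World

// Bodies are decoded according to their Content-Transfer-Encoding header
echo decodeBody('SGVsbG8gV29ybGQ=', 'base64'), "\n";                      // Hello World
echo decodeBody("caf=C3=A9 =\r\nau lait", 'quoted-printable'), "\n";      // café au lait
```

The decoded bytes are still in whatever charset the Content-Type header declares, so a charset conversion may be needed afterwards.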

I hope this helps provide a framework for understanding some of the very elemental buckets of email. If you provide more background on what you are trying to do with the data I (or someone else) might be able to provide better direction.

OTHER TIPS

Try the Plancake PHP Email parser: https://github.com/plancake/official-library-php-email-parser

I have used it for my projects. It works great, it is just one class and it is open source.

I cobbled this together, some code isn't mine but I don't know where it came from... I later adopted the more robust "MimeMailParser" but this works fine, I pipe my default email to it using cPanel and it works great.

#!/usr/bin/php -q
<?php
// Config
$dbuser = 'emlusr';
$dbpass = 'pass';
$dbname = 'email';
$dbhost = 'localhost';
$notify= 'services@.com'; // an email address required in case of errors
function mailRead($iKlimit = "") 
    { 
        // Purpose: 
        //   Reads piped mail from STDIN 
        // 
        // Arguments: 
        //   $iKlimit (integer, optional): specifies after how many kilobytes reading of mail should stop 
        //   Defaults to 1024k if no value is specified 
        //     A value of -1 will cause reading to continue until the entire message has been read 
        // 
        // Return value: 
        //   A string containing the entire email, headers, body and all. 

        // Variable preparation
            // Set default limit of 1024k if no limit has been specified 
            if ($iKlimit == "") { 
                $iKlimit = 1024; 
            } 

            // Error strings 
            $sErrorSTDINFail = "Error - failed to read mail from STDIN!"; 

        // Attempt to connect to STDIN 
        $fp = fopen("php://stdin", "r"); 

        // Failed to connect to STDIN? (shouldn't really happen) 
        if (!$fp) { 
            echo $sErrorSTDINFail; 
            exit(); 
        } 

        // Create empty string for storing message 
        $sEmail = ""; 

        // Read message up until limit (if any) 
        if ($iKlimit == -1) { 
            while (!feof($fp)) { 
                $sEmail .= fread($fp, 1024); 
            }                     
        } else { 
            $i_limit = 0; // number of 1 KB chunks read so far
            while (!feof($fp) && $i_limit < $iKlimit) { 
                $sEmail .= fread($fp, 1024); 
                $i_limit++; 
            }         
        } 

        // Close connection to STDIN 
        fclose($fp); 

        // Return message 
        return $sEmail; 
    }  
$email = mailRead();

// handle email
$lines = explode("\n", $email);

// empty vars
$from = "";
$to = "";
$subject = "";
$headers = "";
$message = "";
$splittingheaders = true;
for ($i=0; $i < count($lines); $i++) {
    if ($splittingheaders) {
        // this is a header
        $headers .= $lines[$i]."\n";

        // look out for special headers
        if (preg_match("/^Subject: (.*)/", $lines[$i], $matches)) {
            $subject = $matches[1];
        }
        if (preg_match("/^From: (.*)/", $lines[$i], $matches)) {
            $from = $matches[1];
        }
        if (preg_match("/^To: (.*)/", $lines[$i], $matches)) {
            $to = $matches[1];
        }
    } else {
        // not a header, but message
        $message .= $lines[$i]."\n";
    }

    if (trim($lines[$i])=="") {
        // empty line, header section has ended
        $splittingheaders = false;
    }
}

// The mysql_* functions were removed in PHP 7, so this uses mysqli instead
mysqli_report(MYSQLI_REPORT_OFF); // report failures through return values rather than exceptions
$conn = @new mysqli($dbhost, $dbuser, $dbpass, $dbname);
if ($conn->connect_errno) {
  mail($notify,'Email Logger Error',"There was an error connecting to the email logger database.\n\n".$conn->connect_error);
} else {
  // A prepared statement replaces the manual escaping of each field
  $stmt = $conn->prepare("INSERT INTO email_log (`to`,`from`,`subject`,`headers`,`message`,`source`) VALUES (?,?,?,?,?,?)");
  if ($stmt)
    $stmt->bind_param('ssssss', $to, $from, $subject, $headers, $message, $email);
  if (!$stmt || !$stmt->execute() || $stmt->affected_rows == 0)
    mail($notify,'Email Logger Error',"There was an error inserting into the email logger database.\n\n".$conn->error);
  $conn->close();
}
?>

There are also the Mailparse functions you could try: http://php.net/manual/en/book.mailparse.php. Note, however, that mailparse is a PECL extension and is not enabled in a default PHP installation.

There is a library for parsing a raw email message into a PHP array: http://flourishlib.com/api/fMailbox#parseMessage.

The static method parseMessage() can be used to parse a full MIME email message into the same format that fetchMessage() returns, minus the uid key.

$parsed_message = fMailbox::parseMessage(file_get_contents('/path/to/email'));

Here is an example of a parsed message:

array(
    'received' => '28 Apr 2010 22:00:38 -0400',
    'headers'  => array(
        'received' => array(
            0 => '(qmail 25838 invoked from network); 28 Apr 2010 22:00:38 -0400',
            1 => 'from example.com (HELO ?192.168.10.2?) (example) by example.com with (DHE-RSA-AES256-SHA encrypted) SMTP; 28 Apr 2010 22:00:38 -0400'
        ),
        'message-id' => '<4BD8E815.1050209@flourishlib.com>',
        'date' => 'Wed, 28 Apr 2010 21:59:49 -0400',
        'from' => array(
            'personal' => 'Will Bond',
            'mailbox'  => 'tests',
            'host'     => 'flourishlib.com'
        ),
        'user-agent'   => 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.9) Gecko/20100317 Thunderbird/3.0.4',
        'mime-version' => '1.0',
        'to' => array(
            0 => array(
                'mailbox' => 'tests',
                'host'    => 'flourishlib.com'
            )
        ),
        'subject' => 'This message is encrypted'
    ),
    'text'      => 'This message is encrypted',
    'decrypted' => TRUE,
    'uid'       => 15
);

https://github.com/zbateson/MailMimeParser works for me, and it doesn't need the mailparse extension.

<?php
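require 'vendor/autoload.php';                  // assumes the library was installed with Composer

// Parse the raw message piped in on STDIN. Check the README of your installed
// version, since the arguments to parse() have changed between releases.
$parser  = new ZBateson\MailMimeParser\MailMimeParser();
$message = $parser->parse(fopen('php://stdin', 'r'), true);

echo $message->getHeaderValue('from');          // user@example.com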
echo $message
    ->getHeader('from')
    ->getPersonName();                          // Person Name
echo $message->getHeaderValue('subject');       // The email's subject

echo $message->getTextContent();                // or getHtmlContent

The PEAR library Mail_mimeDecode is written in plain PHP, so no compiled extension is needed. You can read the code here: Mail_mimeDecode source

You're probably not going to have much fun writing your own MIME parser. The reason you are finding "overdeveloped mail handling packages" is because MIME is a really complex set of rules/formats/encodings. MIME parts can be recursive, which is part of the fun. I think your best bet is to write the best MIME handler you can, parse a message, throw away everything that's not text/plain or text/html, and then force the command in the incoming string to be prefixed with COMMAND: or something similar so that you can find it in the muck. If you start with rules like that you have a decent chance of handling new providers, but you should be ready to tweak if a new provider comes along (or heck, if your current provider chooses to change their messaging architecture).
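
As a rough illustration of that approach, the sketch below walks nested multipart bodies recursively, keeps only the text/plain parts, and then looks for a COMMAND: line. The helper names collectTextParts() and findCommand() are made up, line endings are assumed to be already normalised to "\n", and transfer encodings (base64, quoted-printable) are deliberately not decoded, so treat it as a starting point rather than a complete MIME handler.

```php
<?php
// Sketch only: collect text/plain parts from a (possibly nested) MIME message.
// collectTextParts() and findCommand() are hypothetical helpers.
function collectTextParts(string $entity, array &$found = []): array
{
    // Split the entity into its header block and body at the first blank line
    // (a part may also start directly with a blank line, meaning "no headers")
    [$headerBlock, $body] = array_pad(preg_split('/^\n|\n\n/', $entity, 2), 2, '');
    $headerBlock = preg_replace('/\n[ \t]+/', ' ', $headerBlock); // unfold

    $type = 'text/plain'; // default when no Content-Type header is present
    if (preg_match('/^Content-Type:\s*([^;\s]+)/mi', $headerBlock, $m)) {
        $type = strtolower($m[1]);
    }

    if (strpos($type, 'multipart/') === 0
        && preg_match('/^Content-Type:.*boundary="?([^";\n]+)"?/mi', $headerBlock, $b)) {
        $q = preg_quote($b[1], '/');
        // Everything after the closing "--boundary--" line is epilogue
        $body = preg_split('/^--' . $q . '--[ \t]*$/m', $body, 2)[0];
        // Split on "--boundary" delimiter lines; the first chunk is the preamble
        $chunks = preg_split('/^--' . $q . '[ \t]*\n/m', $body);
        array_shift($chunks);
        foreach ($chunks as $chunk) {
            collectTextParts($chunk, $found); // parts can themselves be multipart
        }
    } elseif ($type === 'text/plain') {
        $found[] = rtrim($body);
    }
    return $found;
}

// Look for a line such as "COMMAND: restart" in the collected text
function findCommand(array $texts): ?string
{
    foreach ($texts as $text) {
        if (preg_match('/^COMMAND:\s*(.+)$/m', $text, $m)) {
            return trim($m[1]);
        }
    }
    return null;
}

$raw = "Content-Type: multipart/mixed; boundary=\"outer\"\n\n"
     . "--outer\nContent-Type: text/html\n\n<p>ad</p>\n"
     . "--outer\nContent-Type: text/plain\n\nCOMMAND: restart\n"
     . "--outer--\n";
echo findCommand(collectTextParts($raw)), "\n"; // restart
```

Everything that is not text/plain is simply ignored, so an extra part added by a different carrier does not break the lookup.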

Parsing email in PHP isn't an impossible task. What I mean is, you don't need a team of engineers to do it; it is attainable as an individual. Really the hardest part I found was creating the FSM for parsing an IMAP BODYSTRUCTURE result. Nowhere on the Internet had I seen this so I wrote my own. My routine basically creates an array of nested arrays from the command output, and the depth one is at in the array roughly corresponds to the part number(s) needed to perform the lookups. So it handles the nested MIME structures quite gracefully.

The problem is that PHP's default imap_* functions don't provide much granularity...so I had to open a socket to the IMAP port and write the functions to send and retrieve the necessary information (IMAP FETCH 1 BODY.PEEK[1.2] for example), and that involves looking at the RFC documentation.

The encoding of the data (quoted-printable, base64, 7bit, 8bit, etc.), length of the message, content-type, etc. is all provided to you; for attachments, text, html, etc. You may have to figure out the nuances of your mail server as well since not all fields are always implemented 100%.

The gem is the FSM...if you have a background in Comp Sci it can be really, really fun to make this (the key is that brackets are not a regular grammar ;)); otherwise it will be a struggle and/or result in ugly code, using traditional methods. Also, you need some time!
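
This is not the routine described above, just a small stack-based scanner showing the general idea: nested parentheses become nested arrays, so the depth in the result mirrors the nesting of the MIME parts. The function name parseParenList() and the sample input are invented for illustration, and IMAP literals ({n} followed by raw bytes) are not handled.

```php
<?php
// Sketch only: turn an IMAP-style parenthesised list (as found in a BODYSTRUCTURE
// response) into nested PHP arrays. parseParenList() is a hypothetical helper.
function parseParenList(string $s): array
{
    $stack = [[]];                        // $stack[0] collects top-level items
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $c = $s[$i];
        if ($c === '(') {                 // open a new nesting level
            $stack[] = [];
            $i++;
        } elseif ($c === ')') {           // close the level, attach it to its parent
            if (count($stack) < 2) {
                throw new InvalidArgumentException("Unbalanced ')' at offset $i");
            }
            $done = array_pop($stack);
            $stack[count($stack) - 1][] = $done;
            $i++;
        } elseif ($c === '"') {           // quoted string with backslash escapes
            $i++;
            $str = '';
            while ($i < $len && $s[$i] !== '"') {
                if ($s[$i] === '\\' && $i + 1 < $len) {
                    $i++;                 // keep the escaped character literally
                }
                $str .= $s[$i++];
            }
            $i++;                         // skip the closing quote
            $stack[count($stack) - 1][] = $str;
        } elseif (ctype_space($c)) {
            $i++;
        } else {                          // atom: NIL, a number, or a bare word
            $n = strcspn($s, " \t\r\n()\"", $i);
            $atom = substr($s, $i, $n);
            $i += $n;
            $stack[count($stack) - 1][] = strtoupper($atom) === 'NIL'
                ? null
                : (ctype_digit($atom) ? (int) $atom : $atom);
        }
    }
    if (count($stack) !== 1) {
        throw new InvalidArgumentException("Unbalanced '('");
    }
    return $stack[0];
}

$tree = parseParenList('(("TEXT" "PLAIN" ("CHARSET" "UTF-8") NIL NIL "7BIT" 12 1) "MIXED")');
print_r($tree[0][0]);   // the first nested body part
```

An explicit stack is used instead of a regular expression because balanced brackets cannot be matched by a regular grammar.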

Hope this helps!

I'm not sure if this will be of help to you - hope so - but it will surely help others interested in finding out more about email. Marcus Bointon did one of the best presentations entitled "Mail() and life after Mail()" at the PHP London conference in March this year and the slides and MP3 are online. He speaks with some authority, having worked extensively with email and PHP at a deep level.

My perception is that you are in for a world of pain trying to write a truly generic parser.

EDIT - The files seem to have been removed from the PHP London site; I found the slides on Marcus' own site: Part 1, Part 2. I couldn't see the MP3 anywhere though.

yeah, i've been able to write a basic parser, based off that rfc and some other basic tutorials. but it's the multipart mime nested boundaries that keep messing me up.

i found out that MMS (not SMS) messages sent from my phone are just standard emails, so i have a system that reads the incoming email, checks the from (to only allow from my phone), and uses the body part to run different commands on my server. its sort of like a remote control by email.

because the system is designed to send pictures, it's got a bunch of differently encoded parts: a mms.smil.txt part, a text/plain part (which is useless, it just says 'this is a html message'), an application/smil part (which is the part that phones would pick up on), a text/html part with an advertisement for my carrier, then my message, but all wrapped in html, then finally a textfile attachment with my plain message (which is the part i use). (if i shove an image as an attachment in the message, it's put at attachment 1, base64 encoded, then my text portion is attached as attachment 2)

i had it working with the exact mail format from my carrier, but when i ran a message from someone else's phone through it, it failed in a whole bunch of miserable ways.

i have other projects i'd like to extend this phone->mail->parse->command system to, but i need to have a stable/solid/generic parser to get the different parts out of the mail to use it.

my end goal would be to have a function that i could feed the raw piped mail into, and get back a big array with associative sub-arrays of headers var:val pairs, and one for the body text as a whole string

the more i search on this, the more i find the same thing: giant overdeveloped mail handling packages that do everything under the sun that's related to mail, or useless (to me, in this project) tutorials.

i think i'm going to have to bite the bullet and just carefully write something myself.

I met the same problem so I wrote the following class: Email_Parser. It takes in a raw email and turns it into a nice object.

It requires PEAR Mail_mimeDecode but that should be easy to install via WHM or straight from command line.

Get it here : https://github.com/optimumweb/php-email-reader-parser

Simple PhpMimeParser: https://github.com/breakermind/PhpMimeParser. You can parse MIME messages from files or strings, and get the attached files, HTML and inline images.

$str = file_get_contents('mime-mixed-related-alternative.eml');

// MimeParser
$m = new PhpMimeParser($str);

// Emails
print_r($m->mTo);
print_r($m->mFrom);

// Message
echo $m->mSubject;
echo $m->mHtml;
echo $m->mText;

// Attachments and inline images
print_r($m->mFiles);
print_r($m->mInlineList);
Licensed under: CC-BY-SA with attribution