How to parse a mailbox file in Ruby?

Question 1

The good news is that the mbox format is dead simple, though its simplicity is why it was eventually replaced. Parsing a large mailbox file to extract a single message is not especially efficient.

If you can split apart the mailbox file into separate strings, you can pass these strings to the Mail library for parsing.

An example starting point:

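require 'mail'

# Parse one raw message string into a Mail::Message.
def parse_message(message)
  mail = Mail.new(message)

  # Do whatever else you need with the parsed message here.
  mail
end

message = nil

while (line = STDIN.gets)
  if line.start_with?('From ')
    # A "From " line ends the previous message and starts a new one.
    parse_message(message) if message
    message = ''
  elsif message
    # Undo the mbox quoting of body lines that begin with "From ".
    message << line.sub(/\A>From /, 'From ')
  end
end

# The last message has no separator after it, so handle it here.
parse_message(message) if message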

The key is that each message starts with "From ", and the trailing space matters: the From header inside a message is written as "From:" with a colon, so it does not match. Any line that starts with ">From" is to be treated as actually being "From". It's quirks like this that make this encoding method really inadequate, but if Maildir isn't an option, this is what you've got to do.
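To see those three cases side by side, here is a minimal sketch that splits an in-memory mbox string using Enumerable#slice_before. The helper name split_mbox and the sample addresses are made up for illustration, and it assumes the whole mailbox fits in memory, so it only suits small files.

```ruby
# Hypothetical helper: split mbox text into an array of raw message strings.
def split_mbox(text)
  # Start a new chunk at every line that begins with "From " (note the space).
  chunks = text.each_line.slice_before { |line| line.start_with?('From ') }

  # Ignore anything that appears before the first separator line.
  chunks = chunks.select { |chunk| chunk.first.start_with?('From ') }

  chunks.map do |chunk|
    # Drop the separator line itself and undo the ">From " quoting.
    chunk.drop(1).map { |line| line.sub(/\A>From /, 'From ') }.join
  end
end

sample = <<~MBOX
  From alice@example.com Mon Jan  1 00:00:00 2024
  From: alice@example.com
  Subject: first

  >From here on, this is body text.

  From bob@example.com Mon Jan  1 00:01:00 2024
  From: bob@example.com
  Subject: second

  Hello.
MBOX

puts split_mbox(sample).length # prints 2
```

The "From:" headers stay inside their messages because they lack the trailing space, and the quoted body line comes back as "From here on, ...".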

Question 2

You can use tmail to parse mailbox files. It has been replaced by mail, but I can't find a class in mail that substitutes for tmail's mailbox parsing, so you might want to stick with tmail.

EDIT: As @tadman pointed out, tmail should not be expected to work with Ruby 1.9. However, you can port this class (and put it on GitHub for everyone else to use :-) )

Question 3

The mbox format is about as simple as you can get. It's simply the concatenation of all the messages, each followed by a blank line. The first line of each message starts with the five characters "From " (the fifth being a space); when messages are added to the file, any line inside a message which starts with "From " has a ">" prefixed, so you can reliably use the fact that a line starts with "From " as an indicator that it is the start of a message.
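The writing side of that rule can be sketched as follows. This is only an illustration under assumptions: the helper name append_to_mbox is hypothetical, the date layout on the separator line follows the common asctime style, and only lines beginning with "From " are quoted (some variants also quote lines that already start with ">From ").

```ruby
# Hypothetical helper: return +mbox+ with +raw_message+ appended in mbox form.
def append_to_mbox(mbox, sender, raw_message, time = Time.now)
  # Quote every line that would otherwise look like a message separator.
  quoted = raw_message.gsub(/^From /, '>From ')
  quoted += "\n" unless quoted.end_with?("\n")

  # Separator line: "From ", the envelope sender, then an asctime-style date.
  stamp = time.getutc.strftime('%a %b %e %H:%M:%S %Y')
  separator = "From #{sender} #{stamp}\n"

  # A blank line closes each message.
  mbox + separator + quoted + "\n"
end

mbox = append_to_mbox('', 'alice@example.com', "Subject: hi\n\nFrom now on, quoted.\n")
puts mbox
```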

Of course, since this is an old format and it was never standardized, there are a number of variants. One variant uses the Content-Length header to determine the length of a message, and some implementations of this variant fail to insert the '>'. However, I think this is rare in practice.
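For the Content-Length variant, a reader cannot rely on the ">" quoting and has to count bytes instead. The following is a rough sketch under assumptions, not a reference implementation: the helper name read_messages_by_length is hypothetical, it expects LF line endings, and it simply skips any message that lacks the header, where a real parser would fall back to scanning for "From " lines.

```ruby
require 'stringio'

# Hypothetical helper: read messages from +io+, trusting Content-Length.
def read_messages_by_length(io)
  messages = []

  while (line = io.gets)
    # Skip the blank lines between messages until the next separator.
    next unless line.start_with?('From ')

    # Collect header lines up to the blank line that ends the header block.
    headers = []
    while (header = io.gets) && header != "\n"
      headers << header
    end
    header_text = headers.join

    length = header_text[/^Content-Length:\s*(\d+)/i, 1]
    next unless length # this sketch skips messages without the header

    # Content-Length counts bytes, so read exactly that many bytes of body.
    body = io.read(length.to_i).to_s
    messages << (header_text.b << "\n" << body)
  end

  messages
end

body = "From the start, unquoted.\n"
mbox = "From a@example.com Mon Jan  1 00:00:00 2024\n" \
       "Content-Length: #{body.bytesize}\n\n#{body}\n"
puts read_messages_by_length(StringIO.new(mbox)).length # prints 1
```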

A big problem with mbox format is that the file needs to be modified in place by mail agents; consequently, every implementation has some locking procedure. Of course, there is no standardization there, so you need to watch out for other processes modifying the mailbox while you are reading it. In practice, many mail systems solved this problem by using maildir format instead, in which a mailbox is actually a directory and every message is a single file.
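As one illustration of guarding a read, the sketch below holds a shared advisory lock with File#flock while the mailbox is read. This is an assumption about your environment rather than a general fix: flock only helps if the programs writing the mailbox also use flock, and some mail agents use dot-lock files or fcntl locks instead. The helper name read_mbox_locked is hypothetical.

```ruby
# Hypothetical helper: read a whole mbox file under a shared advisory lock.
def read_mbox_locked(path)
  File.open(path, 'rb') do |file|
    # Blocks while a cooperating writer holds an exclusive flock on the file.
    file.flock(File::LOCK_SH)
    begin
      file.read
    ensure
      file.flock(File::LOCK_UN)
    end
  end
end
```

A shared lock lets several readers proceed at once while still waiting for any cooperating writer to finish.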

Other things you might want to do include MIME decoding, but you should be able to find utilities which do that.
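For a rough idea of what the simplest part of MIME decoding involves, here is a sketch that undoes the two common transfer encodings with Ruby's built-in String#unpack1. It is deliberately incomplete: it assumes a single-part body, ignores multipart boundaries and character sets, and the helper name decode_body is hypothetical. A full mail library is the better choice for real messages.

```ruby
# Hypothetical helper: decode a single-part body according to the value of
# its Content-Transfer-Encoding header.
def decode_body(body, transfer_encoding)
  case transfer_encoding.to_s.strip.downcase
  when 'base64'           then body.unpack1('m') # base64
  when 'quoted-printable' then body.unpack1('M') # quoted-printable
  else body # 7bit, 8bit and binary need no decoding
  end
end

puts decode_body("aGVsbG8=\n", 'base64') # prints hello
```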