Question

I just started learning Parsec and ... this is a bit brain bending. I have a text email. I need to extract the From: header and the body text. Now, I started searching for tutorials and examples from which to learn. I found three, all dealing with parsing CSV files as if there is nothing else in the world to parse.

In theory, it seems very simple: skip lines till you hit a line starting with "From:" and get the text between "From: " and the newline. In practice, I've been fighting with this for a couple of days.

Return-Path: <j.doe@gmail.com>
X-Original-To: j.doe@somedomain.biz
Delivered-To: j.doe@somedomsin.biz
blah ... blah ...
Subject: Test subject
From: John Doe <j.doe@gmail.com>
To: j.doe@somedomain.biz
Content-Type: multipart/alternative; boundary=047d7b2e4e3cdc627304eb094bfe

--047d7b2e4e3cdc627304eb094bfe
Content-Type: text/plain; charset=UTF-8

Email body

--047d7b2e4e3cdc627304eb094bfe

I can define a line like

let line = do{many1 (noneOf "\n"); many1 newline}

What I don't understand is how to cycle through lines until I hit a line with a certain string at the beginning.

p = do
  manyTill line (string "From:")
  string "From: "
  b <- many anyChar
  newline
  many line
  eof
  return b

This does not work. Can someone show me how to do it, or point me to a simple tutorial (not a CSV parsing tutorial)?

How do I extract the body, which is the text between boundary tokens and starts after the first empty line? I suppose extracting the body is even more complex so any help is appreciated.

Thanks


Solution

Parsec doesn't backtrack by default, so many anyChar will just slurp the rest of your text and the newline after it can never match. Instead, consider something like

p = do
  manyTill line $ try (string "From: ") -- skip whole lines until one starts with "From: "
  b <- manyTill anyChar newline         -- the rest of the From: line
  many line
  eof
  return b

Note that manyTill tries its end parser before every line. On a line that merely begins like "From: ", string would consume the matching prefix and then fail, so we wrap it in try to make it give that input back.
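Put together as a compilable module, the header extraction might look like the sketch below. It assumes every line of the input ends in a newline; fromHeader and sample are names made up for this illustration, and line is changed slightly from the question so that it returns the line's text.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- One non-empty line plus the newline(s) that follow it, as in the question.
line :: Parser String
line = many1 (noneOf "\n") <* many1 newline

-- Skip whole lines until one starts with "From: ",
-- then return the rest of that line (without its newline).
fromHeader :: Parser String
fromHeader = do
  _ <- manyTill line (try (string "From: "))
  manyTill anyChar newline

-- Hypothetical sample input; every line ends with a newline.
sample :: String
sample = "Subject: Test subject\nFrom: John Doe <j.doe@gmail.com>\nTo: j.doe@somedomain.biz\n"
```

Running parse fromHeader "" sample in GHCi should give back the text after "From: " wrapped in Right.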

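Now this still fails because your email doesn't end in a newline: on the last line, line starts to succeed, then fails after consuming input, which makes the whole parser fail rather than backtrack. If you can't change the input, then wrap line in try as shown here. Note that the unterminated last line is then left unconsumed, so the final eof also has to go (or line has to accept end of input as a terminator):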

many (try line)
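If you want to keep the final eof, one option is to let the line parser accept end of input as a terminator instead of wrapping it in try. This is only a sketch of that idea; lineOrLast and fromThenRest are hypothetical names, not anything Parsec provides.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A non-empty line terminated by one or more newlines, or by end of input.
lineOrLast :: Parser String
lineOrLast = many1 (noneOf "\n") <* (skipMany1 newline <|> eof)

-- Same shape as the parser above, but the last line may lack a newline.
fromThenRest :: Parser String
fromThenRest = do
  _ <- manyTill lineOrLast (try (string "From: "))
  b <- manyTill anyChar newline
  _ <- many lineOrLast
  eof
  return b
```

Because newline fails without consuming anything at end of input, the eof alternative gets its turn and no try is needed here.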

To clarify: by default, Parsec only treats a failure as recoverable (so that <|> tries the next alternative, or many and manyTill move on) if the parser failed without consuming any input. If it consumes even one character and then fails, your whole parser dies. If you want backtracking behaviour so this doesn't happen, use try.
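A small experiment makes the difference visible. The two parsers below are made up for this illustration and differ only in the try around the first alternative.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Without try: on "From-Address: ..." the first branch consumes "From"
-- before it hits the mismatch, so the second branch is never attempted.
noTry :: Parser String
noTry = string "From: " <|> string "From-Address: "

-- With try: the first branch rewinds on failure, so the second one runs.
withTry :: Parser String
withTry = try (string "From: ") <|> string "From-Address: "
```

On the input "From-Address: x", parse noTry "" returns a Left with a parse error, while parse withTry "" succeeds with the matched header name.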

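For extracting the body, read the boundary token from the header and then take the text between its first two occurrences:

getBody = do
  -- Skip ahead to the boundary parameter of the Content-Type header
  manyTill anyChar (try $ string "boundary=")
  -- The token is the rest of that line
  boundary <- manyTill anyChar newline
  manyTill anyChar (try $ string boundary) -- Get to the boundary
  -- The result still includes the part's own headers and the "--"
  -- that prefixes the closing boundary line
  manyTill anyChar (try $ string boundary) -- Read the body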
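
To get only the text after the part's first empty line, as the question asks, the same idea can be extended a little. This is a rough sketch, not a MIME parser: it assumes LF line endings, an unquoted boundary parameter that is the last thing on its line, and a single text part. bodyText is a hypothetical name.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

bodyText :: Parser String
bodyText = do
  _ <- manyTill anyChar (try (string "boundary="))
  boundary <- manyTill anyChar newline
  -- In the sample email, delimiter lines are "--" followed by the token.
  let delim = "--" ++ boundary
  _ <- manyTill anyChar (try (string delim))       -- skip to the first delimiter
  _ <- manyTill anyChar (try (string "\n\n"))      -- skip the part's own headers
  manyTill anyChar (try (string ("\n" ++ delim)))  -- body up to the next delimiter line
```

The returned text keeps any blank line that sits between the body and the closing delimiter, so you may want to trim trailing whitespace afterwards.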
Licensed under: CC-BY-SA with attribution