Multiline Matching in Haskell Posix

https://stackoverflow.com/questions/1028764

06-07-2019
|

Question

I can't seem to find decent documentation on haskell's POSIX implementation. Specifically the module Text.Regex.Posix.

Can anyone point me in the right direction of using multiline matching on a string?

A snippet for the curious:

> extractToken body = body =~ "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>" :: String

I'm trying to extract the source of wikipedia pages, however this method clearly falls over when more than one line is involved.

Solution

You may need to import Text.Regex.Base.RegexLike for access to makeRegexOpts and friends.

extractToken body = match regex body where
    regex = makeRegexOpts (defaultCompOpt - compNewline) defaultExecOpt
              "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"

Well, since Text.Regex.Posix's defaultCompOpt = compExtended + compNewline, that works out equivalently as

extractToken body = match regex body where
    regex = makeRegexOpts compExtended defaultExecOpt
              "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"

To pull out just the first group, use one of the other instances of RegexLike. One possibility is

extractToken body = head groups where
    (preMatch, inMatch, postMatch, groups) =
        match regex body :: (String, String, String, [String])
    regex = makeRegexOpts compExtended defaultExecOpt
              "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"

OTHER TIPS

You may need to use the PCRE backend instead if you want to do anything more flexible, or with better performance, than Posix regexes.

pcre-light and regex-pcre are both fine.

I solved in this case by matching

((.*)|\n*)*

Although this may not always work depending on your expression. The above solution is probably the best way to go if you're able to.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow