Functional paragraphs

https://stackoverflow.com/questions/930675

06-09-2019
|

Question

Sorry I don't quite get FP yet, I want to split a sequence of lines into a sequence of sequences of lines, assuming an empty line as paragraph division, I could do it in python like this:

def get_paraghraps(lines):
    paragraphs = []
    paragraph = []
    for line in lines:
        if line == "": # I know it could also be  "if line:"
            paragraphs.append(paragraph)
            paragraph = []
        else:
            paragraph.append(line)
    return paragraphs

How would you go about doing it in Erlang or Haskell?

Solution

I'm only a beginning Haskell programmer (and the little Haskell I learnt was 5 years ago), but for a start, I'd write the natural translation of your function, with the accumulator ("the current paragraph") being passed around (I've added types, just for clarity):

type Line = String
type Para = [Line]

-- Takes a list of lines, and returns a list of paragraphs
paragraphs :: [Line] -> [Para]
paragraphs ls = paragraphs2 ls []

-- Helper function: takes a list of lines, and the "current paragraph"
paragraphs2 :: [Line] -> Para -> [Para]
paragraphs2 [] para = [para]
paragraphs2 ("":ls) para = para : (paragraphs2 ls [])
paragraphs2 (l:ls)  para = paragraphs2 ls (para++[l])

This works:

*Main> paragraphs ["Line 1", "Line 2", "", "Line 3", "Line 4"]
[["Line 1","Line 2"],["Line 3","Line 4"]]

So that's a solution. But then, Haskell experience suggests that there are almost always library functions for doing things like this :) One related function is called groupBy, and it almost works:

paragraphs3 :: [Line] -> [Para]
paragraphs3 ls = groupBy (\x y -> y /= "") ls

*Main> paragraphs3 ["Line 1", "Line 2", "", "Line 3", "Line 4"]
[["Line 1","Line 2"],["","Line 3","Line 4"]]

Oops. What we really need is a "splitBy", and it's not in the libraries, but we can filter out the bad ones ourselves:

paragraphs4 :: [Line] -> [Para]
paragraphs4 ls = map (filter (/= "")) (groupBy (\x y -> y /= "") ls)

or, if you want to be cool, you can get rid of the argument and do it the pointless way:

paragraphs5 = map (filter (/= "")) . groupBy (\x y -> y /= "")

I'm sure there is an even shorter way. :-)

Edit: ephemient points out that (not . null) is cleaner than (/= ""). So we can write

paragraphs = map (filter $ not . null) . groupBy (const $ not . null)

The repeated (not . null) is a strong indication that we really should abstract this out into a function, and this is what the Data.List.Split module does, as pointed out in the answer below.

OTHER TIPS

I'm also trying to learn Haskell. A solution for this question could be:

paragraphs :: [String] -> [[String]]
paragraphs [] = []
paragraphs lines = p : (paragraphs rest)
    where (p, rest) = span (/= "") (dropWhile (== "") lines)

where I'm using the functions from Data.List. The ones I'm using are already available from the Prelude, but you can find their documentation in the link.

The idea is to find the first paragraph using span (/= ""). This will return the paragraph, and the lines following. We then recurse on the smaller list of lines which I call rest.

Before splitting out the first paragraph, we drop any empty lines using dropWhile (== ""). This is important to eat the empty line(s) separating the paragraphs. My first attempt was this:

paragraphs :: [String] -> [[String]]
paragraphs [] = []
paragraphs lines = p : (paragraphs $ tail rest)
    where (p, rest) = span (/= "") lines

but this fails when we reach the final paragraph since rest is then the empty string:

*Main> paragraphs ["foo", "bar", "", "hehe", "", "bla", "bla"]
[["foo","bar"],["hehe"],["bla","bla"]*** Exception: Prelude.tail: empty list

Dropping empty lines solves this, and it also makes the code treat any number of empty lines as a paragraph separator, which is what I would expect as a user.

The cleanest solution would be to use something appropriate from the split package.

You'll need to install that first, but then Data.List.Split.splitWhen null should do the job perfectly.

Think recursively.

get_paragraphs []      paras para = paras ++ [para]
get_paragraphs ("":ls) paras para = get_paragraphs ls (paras ++ [para]) []
get_paragraphs (l:ls)  paras para = get_paragraphs ls paras (para ++ [l])

You want to group the lines, so groupBy from Data.List seems like a good candidate. It uses a custom function to determine which lines are "equal" so one can supply something that makes lines in the same paragraph "equal". For example:

import Data.List( groupBy )

inpara :: String -> String -> Bool
inpara _ "" = False
inpara _ _  = True

paragraphs :: [String] -> [[String]]
paragraphs = groupBy inpara

This has some limitations, since inpara can only compare two adjacent lines and more complex logic doesn't fit into the framework given by groupBy. A more elemental solution if is more flexible. Using basic recursion one can write:

paragraphs [] = []
paragraphs as = para : paragraphs (dropWhile null reminder)
  where (para, reminder) = span (not . null) as
                           -- splits list at the first empty line

span splits a list at the point the supplied function becomes false (the first empty line), dropWhile removes leading elements for which the supplied function is true (any leading empty lines).

Better late than never.

import Data.List.Split (splitOn)

paragraphs :: String -> [[String]]
paragraphs s = filter (not . null) $ map words $ splitOn "\n\n" s

paragraphs "a\nb\n\nc\nd"                == [["a", "b"], ["c", "d"]]
paragraphs "\n\na\nb\n\n\nc\nd\n\n\n"    == [["a", "b"], ["c", "d"]]
paragraphs "\n\na\nb\n\n \n  c\nd\n\n\n" == [["a", "b"], ["c", "d"]]

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow