Question

I needed a String tokenizer in Haskell, but apparently nothing suitable is defined in the Prelude or the other standard modules. There is splitOn in Data.Text, but that's a pain to use because you need to convert the String to Text first (and back again afterwards).
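
For reference, the Data.Text round trip mentioned above looks roughly like the sketch below. The helper name splitOnViaText is made up for illustration; pack, unpack and splitOn are the functions from Data.Text, and the text package is assumed to be available.

```haskell
import qualified Data.Text as T

-- Hypothetical helper: split a String on a delimiter String by
-- converting to Text, splitting there, and converting each piece back.
-- Note: T.splitOn raises an error if the delimiter is empty.
splitOnViaText :: String -> String -> [String]
splitOnViaText delim = map T.unpack . T.splitOn (T.pack delim) . T.pack
```

With this, splitOnViaText "," "a,b,,c" gives ["a","b","","c"], so it works, but the packing and unpacking is exactly the boilerplate I would like to avoid.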

The tokenizer is not too hard to write, so I wrote one (it doesn't handle multiple adjacent delimiters, but it worked well for what I needed it for). I feel something like this should already be in the standard modules somewhere.

This is my version:

tokenizer :: Char -> String -> [String]
tokenizer delim str = tokHelper delim str []

tokHelper :: Char -> String -> [String] -> [String]
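tokHelper d s acc
    -- no delimiter left: pre is the last token, so finish up
    | null pos  = reverse (pre:acc)
    -- drop the delimiter at the head of pos and keep going
    | otherwise = tokHelper d (tail pos) (pre:acc)
        where (pre, pos) = span (/=d) s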

I searched the internet for more solutions and found some discussions, like this blog post.

The last comment (by Mahee on June 10, 2011) is particularly interesting. Why not make a more generic version of the words function to handle this? I tried searching for such a function but found none.

Is there a simpler way to do this, or is 'tokenizing' a string not a very recurring problem? :)


Solution

The split library is what you need. Install it with cabal install split, and you get access to a lot of split/tokenizer-style functions from the Data.List.Split module.

Some examples from the library:

 > import Data.List.Split
 > splitOn "x" "axbxc"
 ["a","b","c"]
 > splitOn "x" "axbxcx"
 ["a","b","c",""]
 > endBy ";" "foo;bar;baz;"
 ["foo","bar","baz"]
 > splitWhen (<0) [1,3,-4,5,7,-9,0,2]
 [[1,3],[5,7],[0,2]]
 > splitOneOf ";.," "foo,bar;baz.glurk"
 ["foo","bar","baz","glurk"]
 > chunksOf 3 ['a'..'z']   -- called splitEvery in older versions of split
 ["abc","def","ghi","jkl","mno","pqr","stu","vwx","yz"]

The wordsBy function from the same library is the generic version of words that you were looking for. It takes a predicate instead of a fixed delimiter and drops empty tokens:

wordsBy (=='x') "dogxxxcatxbirdxx" == ["dog","cat","bird"]
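
If you would rather not add a dependency, a function with similar behaviour can be sketched using only the Prelude. This is not the library's actual implementation, just a minimal version modelled on how words is usually written; the name wordsBy' is made up to avoid clashing with Data.List.Split.

```haskell
-- Hypothetical base-only sketch of a generalised 'words'.
-- Elements satisfying the predicate act as delimiters; runs of
-- delimiters are collapsed, so no empty tokens are produced.
wordsBy' :: (a -> Bool) -> [a] -> [[a]]
wordsBy' isDelim xs =
  case dropWhile isDelim xs of          -- skip leading delimiters
    []   -> []                          -- nothing left: done
    rest -> let (token, rest') = break isDelim rest
            in token : wordsBy' isDelim rest'
```

For the example above, wordsBy' (=='x') "dogxxxcatxbirdxx" also evaluates to ["dog","cat","bird"].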

OTHER TIPS

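If you're parsing a Haskell-like language, you can use the lex function from the Prelude, which reads a single Haskell-style lexeme from the front of a string and returns it together with the remaining input: http://hackage.haskell.org/packages/archive/base/latest/doc/html/Prelude.html#v:lex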
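
Since lex only reads one token at a time, you have to call it repeatedly to tokenize a whole string. A minimal sketch, assuming the input consists of valid Haskell-style lexemes (the name lexAll is made up for illustration):

```haskell
-- Hypothetical helper: apply the Prelude's 'lex' repeatedly.
-- 'lex' returns [("", "")] once only whitespace (or nothing) is left,
-- and [] on a lexical error; in both cases we stop.
lexAll :: String -> [String]
lexAll s =
  case lex s of
    (tok, rest) : _ | not (null tok) -> tok : lexAll rest
    _                                -> []
```

For example, lexAll "let x = 42 + y" gives ["let","x","=","42","+","y"]. Note that this version silently stops at the first character sequence lex cannot read, so it is only suitable when that is acceptable.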

Licensed under: CC-BY-SA with attribution