Data.ByteString output not correct

https://stackoverflow.com/questions/22160562

19-10-2022

Question

I'm writing a program that takes a list of text files as arguments and writes an output file in which each row consists of the corresponding rows of the input files, joined by tabs.

Assume all characters are ASCII-encoded.

import qualified Data.ByteString.Lazy.Char8 as B  -- assumed: the post omitted this import
import GHC.IO.Handle
import System.IO
import System.Environment
import Data.List

main = do
    (out:files) <- getArgs
    hs <- mapM (`openFile` ReadMode) files
    txts <- mapM B.hGetContents hs
    let final = map (B.intercalate (B.singleton '\t')) . transpose 
                . map (B.lines . B.filter (/= '\t')) $ txts
    withFile out WriteMode $ \out -> 
        B.hPutStr out (B.unlines final)
    putStrLn "Completed successfully"

The problem is that it outputs:

file1row1
    file2row1
file1row2
    file2row2
file1row3
    file2row3

instead of:

file1row1    file2row1
file1row2    file2row2
file1row3    file2row3

The same logic works correctly when I define the functions manually in GHCi, and the same code works correctly with Data.Text.Lazy instead of lazy ByteStrings.

What's wrong with my approach?

Solution

There is a known bug in Data.ByteString.Lazy.UTF8 where newline conversion doesn't take place properly, even though the documentation says that it should. (See Data.ByteString.Lazy.Char8 newline conversion on Windows---is the documentation misleading?) This could be the cause of your problem.

OTHER TIPS

When I tested Data.ByteString.Lazy.UTF8.lines on a sample string, it did not remove the '\r':

ghci -XOverloadedStrings

> import Data.ByteString.Lazy.UTF8 as B

> B.lines "ab\n\rcd"
  ["ab","\rcd"]

> B.lines "ab\r\ncd"
  ["ab\r","cd"]

I am guessing this is your problem.
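
If stray carriage returns are indeed the cause, one possible workaround is to drop a trailing '\r' from every line before joining the rows. The following is only a minimal sketch under that assumption: it uses Data.ByteString.Lazy.Char8 (which the question's code appears to rely on), and the names stripCR and mergeRows are hypothetical helpers, not part of any library.

```haskell
module Main where

import qualified Data.ByteString.Lazy.Char8 as B
import Data.List (transpose)

-- Hypothetical helper: drop a single trailing '\r', if there is one.
stripCR :: B.ByteString -> B.ByteString
stripCR line
  | not (B.null line) && B.last line == '\r' = B.init line
  | otherwise = line

-- Same pipeline as in the question, with stripCR applied to each line.
mergeRows :: [B.ByteString] -> [B.ByteString]
mergeRows =
  map (B.intercalate (B.singleton '\t'))
    . transpose
    . map (map stripCR . B.lines . B.filter (/= '\t'))

main :: IO ()
main =
  -- Two in-memory "files" with Windows-style line endings.
  B.putStr . B.unlines . mergeRows $
    map B.pack ["file1row1\r\nfile1row2\r\n", "file2row1\r\nfile2row2\r\n"]
```

With these two sample inputs, mergeRows yields one tab-separated row per line and no leftover '\r'. In the original program, the same change amounts to inserting map stripCR after B.lines.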

(To verify, inspect the output with xxd or a hex editor and check whether the extra character is in fact a '\r'.)
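
As an alternative to a hex dump, the check can be done from Haskell itself. This is a rough sketch, assuming the input files are read as raw bytes with Data.ByteString.Lazy.Char8; countCR is a hypothetical name chosen for illustration.

```haskell
module Main where

import qualified Data.ByteString.Lazy.Char8 as B
import Data.Int (Int64)
import System.Environment (getArgs)

-- Hypothetical helper: count the carriage returns in a lazy ByteString.
countCR :: B.ByteString -> Int64
countCR = B.count '\r'

main :: IO ()
main = do
  paths <- getArgs
  mapM_ report paths
  where
    -- B.readFile reads the raw bytes, so any '\r' in the file is preserved.
    report path = do
      contents <- B.readFile path
      putStrLn (path ++ ": " ++ show (countCR contents) ++ " carriage return(s)")
```

A nonzero count for an input file would suggest it uses "\r\n" line endings, which is consistent with the explanation above.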

Licensed under: CC-BY-SA with attribution
