Question

I want to split a huge (12GB), lazy ByteString with a Regexp that matches, among other things, a NUL \x00 byte.

I know that it should be possible, given that I've been able to split a sample string with Python:

 >>> from re import split
 >>> split(b"\x00", b"a\x00b")
 [b'a', b'b']

I'm not sure that it could work, but I wanted to give it a try with Haskell, since it should be able to read the file lazily and work on it without allocating memory for the whole string. (It should be easier than working on it chunk-by-chunk, writing a parser or tweaking the original program to output something less broken).

Haskell regex matching on ByteStrings is easy enough:

("a\x01\&b" :: ByteString) =~ ("\x01" ::ByteString) :: (ByteString, ByteString, ByteString)
("a","\SOH","b")

But doing the same with a \x00 yields something weird:

("a\x00\&b" :: ByteString) =~ ("\x00" ::ByteString) :: (ByteString, ByteString, ByteString)
("","","a\NULb")

Please note that it's not failing to find a match (otherwise the first element of the tuple would be the original string). Instead, it's apparently matching on an invisible/implicit \x00.

Any hints?


Solution 2

From man 3 regex:

regcomp() is supplied with preg, a pointer to a pattern buffer storage area; regex, a pointer to the null-terminated string and cflags, flags used to determine the type of compilation.

So the regex "\x00", just like "\x00whatever", is evaluated as a null-terminated string and is therefore de facto equal to "", the empty string.

Matching the empty pattern against anything will always yield ("", "", your_original_string).
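The truncation can be reproduced without any regex library. The sketch below assumes the POSIX backend hands the pattern to regcomp() as a C string, as the man page excerpt implies; asSeenByC is a hypothetical helper name, not part of any library. It round-trips a ByteString through a NUL-terminated C string to show what the C side would read.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Hypothetical helper: what a C function expecting a NUL-terminated
-- string would see of the given bytes.
-- useAsCString copies the bytes and appends a terminating NUL;
-- packCString then reads back only up to the first NUL byte.
asSeenByC :: B.ByteString -> IO B.ByteString
asSeenByC pat = B.useAsCString pat B.packCString
```

In GHCi, asSeenByC (BC.pack "\x00") and asSeenByC (BC.pack "\x00whatever") both come back empty, which is the empty pattern described above.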

The best solution is probably to use Text.Regex.TDFA, which doesn't exhibit this behavior, as I mentioned in a previous comment.
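A minimal sketch of the same match from the question with the TDFA backend, assuming the regex-tdfa package is installed. splitAtNul is a hypothetical name used for illustration, and the pattern is built with BC.pack so that no language extension is needed.

```haskell
import qualified Data.ByteString.Char8 as BC
import Text.Regex.TDFA ((=~))

-- Hypothetical helper: split once around the first NUL byte.
-- The result is (text before the match, the match, text after the match).
splitAtNul :: BC.ByteString -> (BC.ByteString, BC.ByteString, BC.ByteString)
splitAtNul input = input =~ BC.pack "\x00"
```

Calling splitAtNul (BC.pack "a\x00\&b") in GHCi is the direct counterpart of the failing example in the question.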

Other suggestions

There is no need to use regexes here. Data.ByteString already provides the function split, which lets you split a ByteString on any byte value.
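A short sketch of that suggestion for the lazy case, using the split exported by Data.ByteString.Lazy. splitOnNul is a hypothetical name, and the approach only covers a single delimiter byte, not the rest of the original regex.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word8)

-- Hypothetical helper: split a lazy ByteString on every NUL byte.
splitOnNul :: BL.ByteString -> [BL.ByteString]
splitOnNul = BL.split nul
  where
    nul :: Word8
    nul = 0
```

Fed with the result of BL.readFile, this should produce the pieces on demand rather than loading the whole file first. If several byte values act as delimiters, BL.splitWith takes a predicate instead of a single byte.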

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow