Are there any known parser combinator library's in F# that can parse binary (not text) files?

https://stackoverflow.com/questions/7800586

22-10-2019
|

문제

I am familiar with some of the basics of fparsec but it seems to be geared towards text files or streams.

Are there any other F# library's that can efficiently parse binary files? Or can fparsec be easily modified to work efficiently with binary streams?

해결책

You may be interested in pickler combinators. These are a bit like parser combinators, but are more focused at simpler binary formats (picklers allow you to produce binary data and unpicklers parse them). There is a quite readable article about the idea (PDF) by Andrew Kennedy (the author of units of measure).

I don't have much experience with these myself, but I just realized it may be relevant for you. The idea is used in the F# compiler for generating some binary resources (like quotations stored in resources). Although, I'm not sure if the F# compiler implementation is any good (it is one of those things from early days of the F# compiler).

다른 팁

The problem with working with binary streams is not a parser problem per se, it's a lexing problem. The lexer is what turns the raw data in to elements that the parse can handle.

Most any parsing system has few problems letting you supply your own lexer, and if that's the case you could, ideally, readily write a compliant lexer that works on your binary stream.

The problem, however, is that most parsing and lexing systems today are themselves created from a higher level tool. And THAT tool most likely is not designed to work with binary streams. That is, it's not practical for you specify the tokens and grammar of the binary stream that can be used to create the subsequent parsers and lexer. Also, there is likely no support whatsoever for the higher level concepts of multi byte binary numbers (shorts, longs, floats, etc.) that you are likely to encounter in a binary stream, nor for the generated parser to possibly work well upon them if you actually need to work on their actual value, again because the systems are mostly designed for text based tokens, and the underlying runtime handles the details of converting that text it something the machine can use (such as sequences of ascii numerals in to actual binary integers).

All that said, you can probably actually use the parsing section of the tool, since parsers work more on abstract tokens that are fed them by the lexer. Once you create your grammar, at a symbolic level, you would need to redo the lexer to create the problem tokens from the binary stream to feed in to the parser.

This is actually good, because the parser tends to be far more complicated than the basic lexer, so the toolkit would handle much of the "hard part" for you. But you would still need to deal with creating your own lexer and interfacing it properly to the generated parser. Not an insurmountable task, and if the grammar is of any real complexity, likely worth your effort in the long run.

If it's all mostly simple, then you're likely just better off doing it your self by hand. Of the top of my head, it's hard to imagine a difficult binary grammar, since the major selling point of a binary format is that it's much closer to the machine, which is in contradiction to the text that most parsers are designed to work with. But I don't know your use case.

But consider the case of a disassembler. That's a simple lexer that may be able to under stand at a high level the different instruction types (such as those operands that have no arguments, those that take a single byte as an argument, or a word), and feed that to a parser can then be used to convert the instructions in to their mnemonics and operands in the normal assembler syntax, as well as handle the label references and such.

It's a contrived case, as a disassembler typically doesn't separate the lexing and parsing phases, it's usually not complicated enough to bother, but it's one way to look at the problem.

Addenda:

If you have enough information to convert the binary stream in to text to feed to the engine, then you you have enough information to instead of creating text, you could create the actual tokens that the parser would want to see from the lexer.

That said, what you could do is take your text format, use that as the basis for your parsing tool and grammar, and have it create the lexer and parser machines for you, and then, by hand, you can test your parser and its processing using "text tests".

But when you get around to reading the binary, rather than creating text to then be lexed and parsed, simply create the tokens that the lexer would create (these should be simple objects), and pump the parser directly. This will save you the lex step and save you some processing time.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow