Frage

I'm trying to write a program that will read a VERY LARGE binary file and try to find the occurrence of 2 different strings and then print the indexes that matches the patterns. For the example's sake let's assume the character sequences are [H,e,l,l,o] and [H,e,l,l,o, ,W,o,r,l,d].

I was able to code this for small binary files because I was reading each character as a byte and then saving it in an Arraylist. Then starting from the beginning of the Arraylist, I was comparing the byte arraylist(byte[] data) with the byte[] pattern.

I need to find a way to do the same but WITHOUT writing the entire binary file in memory for comparison. That means I should be able to compare while reading each character (I should not save the entire binary file in memory). Assume the binary file only contains characters.

Any suggestions on how this can be achieved ? Thank you all in advance.

War es hilfreich?

Lösung 2

Google "finite state machine".

Or, read the file one byte at a time, if the byte just doesn't match the first character of the search term, go on to the next byte. If it does match, now you're looking for the next character in the sequence. I.e., your state has gone from 0, to 1. If your state equals (or passes) the length of the search string, you found it!

Implementation/debugging left to the reader.

Andere Tipps

Seems like you are really looking for Aho-Corasick string matching algorithm.

The algorithm builds an automaton from the given dictionary you have, and then allows you to find matches using a single scan of your input string.

The wikipedia article links to this java implementation

There are specialised algorithms for this but let's try a simple one first.

You can start with making the comparison on the fly, always after reading the next byte. Once you do that, it's easy to spot that you don't need to keep any bytes that are from earlier than your longest pattern.

So you can just use a buffer that is as long as your longest pattern, put new bytes in at one end and drop them at the other.

As I said, there are algorithms more effective than this but it's a good start.

Use a FileInputStream wrapped in a BufferedInputStream and compare each byte. Keep a buffer the length of the sequence you're looking for so you backtrack if it doesn't match at some point. If the sequence you're looking for is too large, you could save the offset and re-open the file for reading.

Or if you just want something to copy and paste you could look at this SO question.

Lizenziert unter: CC-BY-SA mit Zuschreibung
Nicht verbunden mit StackOverflow
scroll top