Question

I'd like to match a regex pattern on a stream, but I am not sure what algorithm to use. I certainly don't want to load the entire file into memory.

I tried to figure out how to do this, but I have only very basic ideas. For example, I could concatenate the chunks into some sort of aggregate string until I find a match. After that, I cut the aggregate at the end of the match and continue. But if there is no match, I would end up loading the entire string into memory.

I could give a maximum size to the aggregate, but I might lose matches that way. Another issue is that if there is a longer match starting at the same position, I would miss it. So I guess this algorithm is incompatible with possessive quantifiers too.

Is there a better solution for this problem?

Solution

This is a hard problem. Regular expression engines come in two flavors: DFA and NFA.

  1. DFA (deterministic finite-state automata) run efficiently and in limited space, but they cannot do the sort of thing you presumably want, such as capturing and backreferencing substrings of the input. If you only need to know whether the input matches, not where, they're the way to go (a minimal sketch follows this list).

  2. NFA (nondeterministic finite automata) fundamentally rely on being able to backtrack, i.e. to try something else with input they have already consumed. The problem is not limited to possessive quantifiers; it affects most constructs: even (A|B) requires backtracking.
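
As a rough illustration of point 1, a hand-rolled DFA can answer "does the stream contain a match?" in constant memory, because it only ever keeps a single state number between chunks. The pattern and the transition table below are made up for the example; this is a sketch, not a general engine.

```typescript
// Hypothetical sketch: a hand-compiled DFA that reports whether the stream
// contains a substring matching "ab+c" (an 'a', one or more 'b's, then a 'c').
// It keeps only an integer state across chunks, never the input itself.
type State = 0 | 1 | 2 | 3; // 3 = accepting: a match has been seen

function step(state: State, ch: string): State {
  if (state === 3) return 3; // once matched, stay matched
  switch (state) {
    case 0: return ch === "a" ? 1 : 0;                                   // waiting for 'a'
    case 1: return ch === "b" ? 2 : ch === "a" ? 1 : 0;                  // saw 'a', need 'b'
    case 2: return ch === "c" ? 3 : ch === "b" ? 2 : ch === "a" ? 1 : 0; // saw "ab+"
  }
}

function containsMatch(chunks: Iterable<string>): boolean {
  let state: State = 0;
  for (const chunk of chunks) {
    for (const ch of chunk) state = step(state, ch);
    if (state === 3) return true; // can stop as soon as the answer is known
  }
  return false;
}

console.log(containsMatch(["xxa", "bbb", "cyy"])); // true  ("abbbc" is in the stream)
console.log(containsMatch(["abx", "xc"]));         // false
```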

If there is any way you can limit the possible length L of your matches, then applying a normal match to overlapping chunks of size 2 * L is almost certainly a better option than changing the way your regex engine works.
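
Here is a minimal sketch of that chunking idea, assuming you know a maximum match length maxLen (all names are made up for the example). Each window overlaps the next by maxLen characters, and only matches that start in the first half of a window are reported, so a match of length at most maxLen is never cut in two and never reported twice.

```typescript
// Sketch: run an ordinary regex over overlapping windows of size 2 * maxLen.
// Only matches no longer than maxLen characters are guaranteed to be found;
// longer ones can still be cut off at a window boundary.
function* matchInChunks(
  chunks: Iterable<string>,
  regex: RegExp,            // must carry the 'g' flag (required by matchAll)
  maxLen: number
): Generator<{ index: number; match: string }> {
  let buffer = "";
  let bufferStart = 0;      // absolute offset of buffer[0] within the stream

  // Report matches that start before `limit` (relative to the buffer);
  // later starts will be seen again in the next, shifted window.
  const scan = function* (limit: number) {
    for (const m of buffer.matchAll(regex)) {
      const start = m.index ?? 0;
      if (start >= limit) break;
      yield { index: bufferStart + start, match: m[0] };
    }
  };

  for (const chunk of chunks) {
    buffer += chunk;
    while (buffer.length >= 2 * maxLen) {
      yield* scan(maxLen);           // emit matches starting in the first half
      buffer = buffer.slice(maxLen); // keep the second half as overlap
      bufferStart += maxLen;
    }
  }
  yield* scan(buffer.length);        // end of stream: no more data is coming
}

// Usage sketch: URLs assumed to be at most ~200 characters long.
const urlPattern = /https?:\/\/\S+/g;
for (const hit of matchInChunks(["see https://exam", "ple.org/a today"], urlPattern, 200)) {
  console.log(hit.index, hit.match); // 4 'https://example.org/a'
}
```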

If you do have to change it, though, it would involve analyzing the regex or monitoring the progress of the engine to prove when previous input can no longer be part of any match, i.e. when the top-level backtracking has exhausted its alternatives. At that point you can discard characters and so limit the amount of memory required for the entire operation. But be aware that this would be a fascinating project, and almost certainly outside the budget if you're just trying to get some routine work task done.

OTHER TIPS

In general, it is not a good idea to apply a regex to such a huge file as a whole. In most real-world cases, files of that size have some kind of structure which can be used to split them into smaller strings in a streaming manner. Applying the regexes only to these "building blocks" usually results in a far more performant and memory-efficient solution.

For example, in a comment, you wrote

I have a massive JSON and I want to extract URLs from it.

Even without knowing the precise structure of that JSON file, I assume URLs can only occur in string literals. Hence, what you can do here is to

  • utilize a JSON streaming library like Oboe.js

  • implement a callback which is invoked for each string literal in the JSON that might contain a URL. In that callback, use a regex to detect the URLs within the string (a sketch follows below).

But this is just an example, you can probably apply this approach to many comparable use cases.
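
To make those two steps concrete, here is a rough sketch of how it might look with Oboe.js in Node. Take it as an illustration only: the "!..*" node selector, the import style, and the file name are assumptions on my part, so check the Oboe.js documentation for the exact pattern syntax.

```typescript
// Sketch only: selector syntax and typings are assumptions; consult the
// Oboe.js docs before relying on them.
import * as fs from "fs";
import oboe from "oboe"; // assumes @types/oboe and esModuleInterop

const urlRegex = /https?:\/\/[^\s"']+/g;
const urls: string[] = [];

oboe(fs.createReadStream("huge.json"))  // hypothetical input file, read as a stream
  .node("!..*", (value: unknown) => {   // intended to fire for every node in the document
    if (typeof value === "string") {    // URLs can only occur in string literals
      for (const m of value.match(urlRegex) ?? []) urls.push(m);
    }
  })
  .done(() => console.log(urls));       // the whole document has been parsed
```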

I'm jumping the gun by posting here already since my code isn't published as a package yet (it will be @iter-tools/regex), but hey I'm bored and I'm working on my OSS projects full time so a v0.1.0 will be up soon.

I've implemented what I understand to be the (roughly) optimal strategy -- non-backtracking matching with NFAs and captures. In some of the materials here it seems to be referred to as Thompson NFA, though I came to my solution independently starting from simple boolean NFAs. Anyone who is interested can take a look at the code.
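
To give a feel for what non-backtracking matching over a stream looks like (a toy illustration, not the API of the package above), you can hand-compile a pattern into an NFA and advance a set of possible states one character at a time; memory is then bounded by the number of states, not by the input.

```typescript
// Toy sketch: the pattern (a|b)*c hand-compiled into an NFA. There is no
// backtracking; each input character advances a *set* of possible states,
// so memory is bounded by the automaton size, not by the stream length.
type Nfa = {
  start: number;
  accept: number;
  edges: Map<number, Array<[string, number]>>; // state -> labeled transitions
};

// State 0: inside the (a|b)* loop; state 1: accepting, just consumed 'c'.
const nfa: Nfa = {
  start: 0,
  accept: 1,
  edges: new Map<number, Array<[string, number]>>([
    [0, [["a", 0], ["b", 0], ["c", 1]]],
  ]),
};

// Does the *whole* stream match the pattern?
function matchesStream(chunks: Iterable<string>, nfa: Nfa): boolean {
  let states = new Set([nfa.start]);
  for (const chunk of chunks) {
    for (const ch of chunk) {
      const next = new Set<number>();
      for (const s of states) {
        for (const [label, to] of nfa.edges.get(s) ?? []) {
          if (label === ch) next.add(to);
        }
      }
      states = next;
      if (states.size === 0) return false; // no state can extend this input
    }
  }
  return states.has(nfa.accept);
}

console.log(matchesStream(["ab", "bac"], nfa)); // true:  "abbac" matches (a|b)*c
console.log(matchesStream(["abca"], nfa));      // false: trailing 'a' after the 'c'
```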

Licensed under: CC-BY-SA with attribution