Question

I need to parse a sentence. I have an implemented Earley parser and a grammar for it, and everything works just fine when a sentence has no misspellings. The problem is that a lot of the sentences I have to deal with are highly noisy. Is there an algorithm that combines parsing with error correction? Possible errors are:

  • typos like 'cheker' instead of 'checker'
  • run-together words like 'spellchecker' instead of 'spell checker'
  • contractions like 'Ear par' instead of 'Earley parser'

If you know of an article that answers my question, I would appreciate a link to it.

Solution

I assume you are using a tagger (or lexer) stage that is applied before the Earley parser, i.e. an algorithm that splits the input string into tokens and looks each token up in a dictionary to determine its part-of-speech (POS) tag(s):

John     --> PN
loves    --> V
a        --> DT
woman    --> NN
named    --> JJ,VPP
Mary     --> PN
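
For illustration, a lexer stage of this kind can be sketched as a plain dictionary lookup in Python; the dictionary entries and tag names below are invented for the example and do not come from any particular tagger:

POS_DICT = {
    "john":  {"PN"},
    "loves": {"V"},
    "a":     {"DT"},
    "woman": {"NN"},
    "named": {"JJ", "VPP"},
    "mary":  {"PN"},
    "lobe":  {"NN"},
    "lobes": {"NN"},
}

def tag_exact(token):
    """Return the POS tags found by exact dictionary lookup (empty set if unknown)."""
    return POS_DICT.get(token.lower(), set())

print(tag_exact("loves"))  # {'V'}
print(tag_exact("lobes"))  # {'NN'} -- the misspelling of 'loves' is only seen as a noun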

It should be possible to build some kind of approximate string lookup (aka fuzzy string lookup) into that stage, so that when it is presented with a misspelled token such as 'lobes' instead of 'loves', it will identify not only the tags found by exact string matching ('lobes' as the plural of the noun 'lobe'), but also the tags of tokens that are similar in shape ('loves' as the third-person singular of the verb 'love').
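
A minimal sketch of this idea, reusing the toy dictionary from above, is to fall back to Levenshtein distance over the dictionary keys; the threshold of one edit and the linear scan over the dictionary are simplifications for the example:

def edit_distance(a, b):
    """Plain Levenshtein distance computed by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def tag_fuzzy(token, max_dist=1):
    """Union of tags from exact matches and from dictionary words within max_dist edits."""
    tags = set(tag_exact(token))
    for word, word_tags in POS_DICT.items():
        if edit_distance(token.lower(), word) <= max_dist:
            tags |= word_tags
    return tags

print(tag_fuzzy("lobes"))  # {'NN', 'V'} -- 'loves' is one edit away, so V is added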

This implies that you will generally get a larger number of candidate tags for each token, and therefore a larger number of possible parses. Whether or not this produces the desired result depends on how comprehensive the grammar is, and on how good the parser is at identifying the correct analysis when presented with many possible parse trees. A probabilistic parser may be better suited for this, as it assigns every candidate parse tree a probability (or confidence score), which can be used to select the most likely (or best) analysis.
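
The selection step itself can be as simple as maximising a score over the candidate analyses. The sketch below assumes a hypothetical probabilistic parser that returns each candidate as a tuple of (parse description, log probability, number of fuzzily matched tokens), plus an assumed per-correction penalty; none of these names come from a real parser API:

# Hypothetical candidates for "John lobes a woman named Mary".
candidates = [
    ("S(NP(John) VP(lobes/NN ...))",       -14.2, 0),  # 'lobes' read literally as a noun
    ("S(NP(John) VP(loves/V NP(...)))",     -9.7, 1),  # 'lobes' read as misspelled 'loves'
]

REPAIR_PENALTY = 2.0  # assumed cost per corrected token; would be tuned on held-out data

def score(candidate):
    _, log_prob, repairs = candidate
    return log_prob - REPAIR_PENALTY * repairs

best = max(candidates, key=score)
print(best[0])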

If this is the solution you'd like to try, there are several possible implementation strategies. Firstly, if the tokenization and tagging are performed as a simple dictionary lookup (i.e. in the style of a lexer), you can simply use a data structure for the dictionary that enables approximate string matching. General methods for approximate string comparison are described in Approximate string matching algorithms, while methods for approximate string lookup in larger dictionaries are discussed in Quickly compare a string against a Collection in Java.
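
As one sketch of the data-structure route, a BK-tree over the dictionary keys answers "all words within k edits of this token" without comparing the token against every entry. The implementation below reuses the edit_distance helper and toy dictionary from the earlier sketches and is meant only as an illustration of the idea:

class BKTree:
    """BK-tree keyed by edit distance; supports approximate membership queries."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})           # node = (word, {edge distance: child node})
        for word in it:
            self._add(word)

    def _add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return                       # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def query(self, word, max_dist):
        """Return all stored words within max_dist edits of `word`."""
        results, stack = [], [self.root]
        while stack:
            node_word, children = stack.pop()
            d = edit_distance(word, node_word)
            if d <= max_dist:
                results.append(node_word)
            # Triangle inequality: only subtrees with edge distance in
            # [d - max_dist, d + max_dist] can contain a match.
            for dist, child in children.items():
                if d - max_dist <= dist <= d + max_dist:
                    stack.append(child)
        return results

tree = BKTree(POS_DICT)        # iterating the dict yields its keys
print(tree.query("lobes", 1))  # ['lobes', 'lobe', 'loves'], order depends on tree shape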

If, however, you use an actual tagger, as opposed to a lexer, i.e. something that performs POS disambiguation in addition to mere dictionary lookup, the approximate string lookup will have to be built into that tagger. Somewhere in the tagger there must be a dictionary lookup function that is used to generate candidate tags before disambiguation is applied; that lookup is what you will have to replace with one that supports approximate matching.
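
Concretely, if you can get at the tagger's dictionary lookup function, one option is to wrap it so that exact lookup remains the fast path and approximate lookup is only used for unknown tokens. The tagger integration point shown in the comment is hypothetical; the wrapper itself reuses the pieces from the earlier sketches:

def make_fuzzy_lookup(exact_lookup, bk_tree, tag_dict, max_dist=1):
    """Wrap an exact-lookup function so unknown tokens fall back to approximate matching."""
    def lookup(token):
        tags = exact_lookup(token)
        if tags:                              # fast path: the token is in the dictionary
            return tags
        fuzzy_tags = set()
        for word in bk_tree.query(token.lower(), max_dist):
            fuzzy_tags |= tag_dict[word]
        return fuzzy_tags
    return lookup

# Hypothetical integration point: replace the tagger's lookup with the wrapped one, e.g.
# tagger.dictionary_lookup = make_fuzzy_lookup(tagger.dictionary_lookup, tree, POS_DICT)
lookup = make_fuzzy_lookup(tag_exact, tree, POS_DICT)
print(lookup("womn"))  # {'NN'} -- no exact entry, but 'woman' is one edit away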

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow