Question

Background:

Both the incoming documents and the training set have gone through Apache Tika, with Tesseract OCR for inline images. This works well, except when it doesn't: many docs are old scanned images, and what Tika extracts is gibberish.

Using Spark on Hadoop and either ML or MLlib (haven't settled, though I like ML better).

So far I'm getting the best results from a Naive Bayes pipeline that tokenizes, removes stop words, and applies CountVectorizer to build the features (no TF-IDF): a total bag-of-words approach. Next best is tokenizing, applying TF-IDF, and feeding that into LogisticRegressionWithLBFGS.
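Roughly, that first pipeline looks like this in spark.ml (a sketch only; the column names and parameters are illustrative, not my exact code):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer
from pyspark.ml.classification import NaiveBayes

# Illustrative column names; the real job reads text extracted by Tika.
tokenizer = Tokenizer(inputCol="text", outputCol="tokens")
remover = StopWordsRemover(inputCol="tokens", outputCol="filtered")
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features")
nb = NaiveBayes(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, remover, vectorizer, nb])
# model = pipeline.fit(training_df)  # training_df has "text" and "label" columns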

Anyway, it occurred to me that the model is trained on many docs that are junk: literally just strings of gibberish like "mmmmmmmm aaannnammmmrrr hdhhdhhhhhjjj..."

This isn't good, but at this scale it's just what happened. Sure, I could pick through 10,000 training docs and remove the bad examples by hand, but there has to be an easier way. Is there?

The title of this question reflects my brainstorm: there might be a way to discount, downweight, or outright ignore tokens that aren't recognized by a dictionary. Is there?

Open to any and all advice or approaches to get better precision out of this model.

Thanks


Solution

Usually, those nonsensical words are not problematic because they appear in only one or two documents, and people just filter out words with such low document frequency.

Are you not able to do this, or do your nonsensical words really appear that often?

Anyhow, your suggestion is what people use:

outright ignore tokens that aren't recognized by a dictionary. Is there?

Exactly. There is no magic way to know whether a word is English or not. What word processors do is use a dictionary, as you yourself suggested.

In Python, before stemming, you could filter tokens with pyenchant:

import enchant
d = enchant.Dict("en_US")
d.check('hello')  # True
d.check('mmmmmmmm')  # False

I bet this would be good enough.
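Since you are on Spark, you could push this check into the pipeline as a UDF that drops non-dictionary tokens before vectorizing. A minimal sketch, assuming pyenchant is installed on the executors (the Dict is created inside the function so it is not captured in the closure):

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def keep_english(tokens):
    # Create the dictionary on the worker; cache/reuse it if speed matters.
    import enchant
    d = enchant.Dict("en_US")
    # Skip empty tokens; d.check rejects empty strings.
    return [t for t in tokens if t and d.check(t)]

keep_english_udf = udf(keep_english, ArrayType(StringType()))
# df = df.withColumn("clean_tokens", keep_english_udf(df["tokens"]))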

But if there are a lot of false negatives, you could ask the dictionary for the most similar words and then apply a word-distance measure. Here I will use Levenshtein distance (LD below), for example as implemented by the python-Levenshtein package:

>>> from Levenshtein import distance as LD  # python-Levenshtein package
>>> r = d.suggest('helo')
>>> r
['hole', 'help', 'helot', 'hello', 'halo', 'hero', 'hell', 'held', 'helm', 'he lo', 'he-lo', 'heel', 'loathe', 'Helios', 'helicon']
>>> min([LD('helo', w) for w in r])
1

The distance will be much higher for the nonsensical words you quoted:

>>> d.suggest('mmmmmmmm')
['mammogram', "Mammon's"]
>>> min([LD('mmmmmmmm', w) for w in d.suggest('mmmmmmmm')])
5
>>> min([LD('aaannnammmmrrr', w) for w in d.suggest('aaannnammmmrrr')])
12
>>> min([LD('hdhhdhhhhhjjj', w) for w in d.suggest('hdhhdhhhhhjjj')])
10

To recap:

  1. filter out words that appear in fewer than 3 documents (see the CountVectorizer sketch after this list)
  2. if that's not enough, or it is not an option for your use case, use a dictionary
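For point 1, Spark ML's CountVectorizer already has a knob for this, its minDF parameter (a sketch; column names are illustrative):

from pyspark.ml.feature import CountVectorizer

# A minDF value >= 1 is a document count: drop terms that appear in fewer than 3 documents.
vectorizer = CountVectorizer(inputCol="filtered", outputCol="features", minDF=3.0)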

I think this last suggestion of mine, using the Levenshtein distance to the closest English words, is overkill.

If your documents involve a lot of slang or trendy terms, you could also filter based on the number of Google results instead of using a common dictionary. Your word hdhhdhhhhhjjj, for instance, gives me only two Google results, one of which is this Stack Exchange question. Data science is about being creative. :)

There are probably heuristics or statistics based on the number of consonants in a word, or on letter combinations, that you could use to probabilistically shrink your documents' vocabulary, but I wouldn't go there. It would be too much work and very brittle.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange