Question

The winner of a recent Wikipedia vandalism detection competition suggests that detection could be improved by "detecting random keyboard hits considering QWERTY keyboard layout".

Example: woijf qoeoifwjf oiiwjf oiwj pfowjfoiwjfo oiwjfoewoh

Is there any software that does this already (preferably free and open source)?

If not, is there an active FOSS project whose goal is to achieve this?

If not, how would you suggest implementing such software?


Solution

If the two letters of a bigram in the analyzed text are close together on a QWERTY keyboard but the bigram has a near-zero statistical frequency in English (like the pairs "fg" or "cd"), then there is a chance that random keyboard hits are involved. The more such pairs you find, the greater that chance becomes.

If you want to account for bashing with both hands, test letters that are separated by one other letter for QWERTY closeness, but test the two bigrams they span (or the whole trigram) for frequency. For example, in the text "flsjf" you would check F and S for QWERTY distance, but the bigrams FL and LS (or the trigram FLS) for frequency.
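A minimal sketch of this check, under stated assumptions: the keyboard is modeled as a plain grid (row stagger ignored), `COMMON_BIGRAMS` is a small hand-picked stand-in for a real English bigram frequency table, and the names `key_distance` and `suspicious_hits` are made up for illustration. With `gap=1` it tests adjacent letters; with `gap=2` it tests letters separated by one other letter, as in the two-handed case.

```python
# Approximate QWERTY letter grid; the physical stagger between rows is ignored.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}

# Assumption: a small illustrative set standing in for a real English
# bigram frequency table built from a corpus.
COMMON_BIGRAMS = {
    "th", "he", "in", "er", "an", "re", "on", "at", "en", "nd",
    "ti", "es", "or", "te", "of", "ed", "is", "it", "al", "ar",
    "st", "to", "nt", "ng", "se", "ha", "as", "ou", "io", "le",
}


def key_distance(a, b):
    """Chebyshev distance between two letter keys on the approximate grid."""
    (r1, c1), (r2, c2) = POS[a], POS[b]
    return max(abs(r1 - r2), abs(c1 - c2))


def suspicious_hits(word, gap=1, max_dist=1):
    """Count windows whose end letters are close on the keyboard
    while every bigram inside the window is uncommon."""
    letters = [c for c in word.lower() if c in POS]
    hits = 0
    for i in range(len(letters) - gap):
        window = letters[i:i + gap + 1]
        close = key_distance(window[0], window[-1]) <= max_dist
        rare = all(a + b not in COMMON_BIGRAMS
                   for a, b in zip(window, window[1:]))
        if close and rare:
            hits += 1
    return hits


print(suspicious_hits("fg"))                        # adjacent keys, rare bigram
print(suspicious_hits("flsjf", gap=2, max_dist=2))  # two-handed variant
```

A real detector would replace the hand-picked set with frequencies measured on a corpus and would normalize the hit count by text length before applying a threshold.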

OTHER TIPS

Most keyboard mashing tends to be on the home row in my experience. It would be reasonably simple to check to see if a high proportion of the characters used are asdfjkl;.
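As a rough sketch of that check (the function name `home_row_ratio` is hypothetical, and any cutoff for "high proportion" is left to you to tune):

```python
HOME_ROW = set("asdfjkl;")


def home_row_ratio(text):
    """Fraction of non-whitespace characters that sit on the home-row keys."""
    chars = [c for c in text.lower() if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c in HOME_ROW for c in chars) / len(chars)


print(home_row_ratio("asdf jkl; asdf"))
print(home_row_ratio("The quick brown fox"))
```

Note that ordinary English uses a, s, d, f, j, k and l as well, so the ratio is only meaningful when compared against a baseline measured on normal text.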

Consider the empirical distribution of two-letter sequences, i.e. "the probability of seeing letter a given that it follows letter b". These probabilities fill a 27x27 table (counting the space as a letter).

Now compare this table with historical data from a large body of English, French, or other texts. Use the Kullback-Leibler divergence for the comparison.
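One way this comparison might look, as a sketch only. For brevity it compares the joint distribution of letter pairs rather than the conditional table described above, and it uses add-one smoothing so that no probability is zero; both are simplifications I am assuming, and the function names are made up.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase + " "  # 27 symbols, space included


def bigram_distribution(text, alpha=1.0):
    """Smoothed joint distribution over all 27x27 symbol pairs."""
    # Map anything outside the alphabet to a space, then collapse runs of spaces.
    cleaned = "".join(c if c in ALPHABET else " " for c in text.lower())
    cleaned = " ".join(cleaned.split())
    counts = Counter(zip(cleaned, cleaned[1:]))
    total = sum(counts.values()) + alpha * len(ALPHABET) ** 2
    return {(a, b): (counts[(a, b)] + alpha) / total
            for a in ALPHABET for b in ALPHABET}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)


# Assumption: in practice the reference would be built from a large corpus.
reference = bigram_distribution("the quick brown fox jumps over the lazy dog")
sample = bigram_distribution("woijf qoeoifwjf oiiwjf oiwj pfowjfoiwjfo")
print(kl_divergence(sample, reference))
```

The divergence is zero when the two distributions match and grows as the sample drifts away from the reference, so a threshold calibrated on known-good text can flag outliers.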

An approach based on keyboard layout will provide a good indicator. With a QWERTY layout, you will find that around 52% of the letters in any given text come from the top row of letter keys, about 32% from the middle row, and about 14% from the bottom row. While this varies slightly from one language to another, there remains a very clear pattern that can be detected. Use the same methodology to discover the patterns for other keyboard layouts, then make sure you detect the layout used for any text entered before checking for gibberish. Even though the pattern is clear, it is best to use this method as one indicator only, because it works best with longer texts. Other indicators, such as non-alphanumeric characters mixed with alphanumeric ones, text length, and so on, can be weighted and combined with it to give a pretty good overall indication of gibberish entry.
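A small sketch of the row-share indicator. The expected shares are simply the figures quoted above (not independently verified here), and `row_shares` and `row_deviation` are hypothetical names.

```python
ROWS = {"top": "qwertyuiop", "middle": "asdfghjkl", "bottom": "zxcvbnm"}
ROW_OF = {ch: name for name, keys in ROWS.items() for ch in keys}

# Assumption: the shares quoted in the answer above, taken at face value.
EXPECTED = {"top": 0.52, "middle": 0.32, "bottom": 0.14}


def row_shares(text):
    """Share of letters falling on each QWERTY letter row."""
    rows = [ROW_OF[c] for c in text.lower() if c in ROW_OF]
    if not rows:
        return {name: 0.0 for name in ROWS}
    return {name: rows.count(name) / len(rows) for name in ROWS}


def row_deviation(text, expected=EXPECTED):
    """Sum of absolute differences between observed and expected shares."""
    shares = row_shares(text)
    return sum(abs(shares[name] - expected[name]) for name in expected)


print(row_shares("asdf jkl asdf"))
print(row_deviation("asdf jkl asdf"))
```

A large deviation on a long text is one signal among several; on short inputs the shares are too noisy to rely on.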

Fredley's answer can be extended to a grammar that would construct words from nearby letters.

For example asasasasasdf could be generated with a grammar that connects as, sa, sd and df.

Such a grammar, expanded to cover all letters on the keyboard (connecting letters that are next to each other), could, after parsing, give you a measure of how much of a text can be generated by this 'gibberish' grammar.

Caveat: of course, any text discussing such a grammar and listing examples of 'gibberish' text would score significantly higher than regular spell-checked text.
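A simplified sketch of the adjacency-grammar idea. Instead of a real parser, it marks every character that takes part in a pair of horizontally neighboring keys and reports the covered fraction. Restricting the grammar to same-row neighbors and the name `gibberish_coverage` are my assumptions.

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Every ordered pair of horizontally neighboring keys, e.g. "as" and "sa".
ADJACENT = set()
for row in ROWS:
    for a, b in zip(row, row[1:]):
        ADJACENT.add(a + b)
        ADJACENT.add(b + a)


def gibberish_coverage(text):
    """Fraction of characters that belong to at least one adjacent-key pair."""
    covered = total = 0
    for word in text.lower().split():
        marks = [False] * len(word)
        for i in range(len(word) - 1):
            if word[i:i + 2] in ADJACENT:
                marks[i] = marks[i + 1] = True
        covered += sum(marks)
        total += len(word)
    return covered / total if total else 0.0


print(gibberish_coverage("asasasasasdf"))
print(gibberish_coverage("a human wrote this"))
```

Real words such as "we" or "as" also match the grammar, so the score only becomes informative over longer stretches of text.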

Do note that the example approach would not catch vandalism in the form of 'h4x0r rulezzzzz!!!!!'.

Another approach here (which can be integrated with the above method) would be to statistically analyze a corpus of vandalized text and identify the words that are common in vandalized texts.

EDIT:
Since you are assuming QWERTY, I guess we could assume English, too?

What about KISS? Run the text through an English spell checker and, if it fails miserably, conclude that it is probably gibberish. (The question is: why would you want to distinguish quickly typed gibberish from random nonsense, or for that matter from very badly spelled text?)

Alternatively, if other keyboard layouts (Dvorak, anyone?) and languages are to be considered, then maybe run the text through all available language spell checkers and then proceed (this would give you language autodetection, too).

This would not be a very efficient method, but it could be used as a baseline test.
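A bare-bones version of that baseline might look like the sketch below. It assumes you supply a set of known words yourself (for example loaded from a word list file); the tiny `DICTIONARY` here and the name `misspelled_ratio` are purely illustrative.

```python
import re

# Assumption: a toy dictionary; a real one would be loaded from a word list.
DICTIONARY = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}


def misspelled_ratio(text, dictionary=DICTIONARY):
    """Fraction of words in the text that are not in the dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w not in dictionary for w in words) / len(words)


print(misspelled_ratio("The quick brown fox"))
print(misspelled_ratio("woijf qoeoifwjf oiiwjf oiwj"))
```

To cover several languages, the same function could be run once per dictionary and the lowest ratio taken as the best-matching language.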

Note:
In the long run, I imagine that vandals would adapt and start vandalizing with, for example, excerpts from other Wikipedia pages, which would be hard to detect automatically as vandalism. (Existing texts could be checksummed and a flag raised on duplicates, but if the text came from some other source, detection would be very hard.)
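The checksum idea could be sketched as follows. Hashing whole paragraphs after normalizing case and whitespace is my assumption, and the helper names are hypothetical; this only catches exact copies, not paraphrases or partial excerpts.

```python
import hashlib


def fingerprint(paragraph):
    """SHA-256 of a paragraph after lowercasing and collapsing whitespace."""
    normalized = " ".join(paragraph.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_index(existing_paragraphs):
    """Set of fingerprints for all known paragraphs."""
    return {fingerprint(p) for p in existing_paragraphs}


def is_duplicate(paragraph, index):
    """True if an identical (normalized) paragraph is already known."""
    return fingerprint(paragraph) in index


index = build_index(["Paris is the capital of France."])
print(is_duplicate("paris   is the capital of france.", index))
print(is_duplicate("Paris is a city in Texas.", index))
```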

Licensed under: CC-BY-SA with attribution