Question

Markov chains are an (almost standard) way to generate random gibberish that looks intelligent to the untrained eye. How would you go about identifying Markov-generated text from human-written text?

It would be awesome if the resources you point to are Python-friendly.

Solution

You could use a "brute force" approach, whereby you compare the generated language to data collected on n-grams of higher order than the Markov model that generated it.

I.e., if the language was generated with a 2nd-order Markov model, 3-grams and shorter will have the correct frequencies, but 4-grams probably won't.

You can get up to 5-gram frequencies from Google's public n-gram dataset. It's huge, though (24 GB compressed), and you need to get it by post on DVDs from the LDC.

EDIT: Added some implementation details

The n-grams have already been counted, so you just need to store the counts (or frequencies) in a way that's quick to search. A properly indexed database, or perhaps a Lucene index, should work.

Given a piece of text, scan across it and look up the frequency of each 5-gram in your database, and see where it ranks compared to other 5-grams that start with the same 4 words.
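For illustration, here is a minimal sketch of that lookup loop in Python. It assumes the counts have already been loaded into an SQLite table named ngrams with columns w1..w5 and count; that schema, and the choice of SQLite at all, are my own placeholders rather than anything prescribed by the dataset.

    import sqlite3

    def five_grams(words):
        # Yield successive 5-grams from a list of tokens.
        for i in range(len(words) - 4):
            yield words[i:i + 5]

    def rank_of_continuation(cur, gram):
        # Rank of the observed 5th word among all continuations sharing the same
        # 4-word prefix (rank 1 = most frequent continuation, None = unseen prefix).
        prefix, last = gram[:4], gram[4]
        cur.execute(
            "SELECT w5 FROM ngrams WHERE w1=? AND w2=? AND w3=? AND w4=? "
            "ORDER BY count DESC",
            prefix,
        )
        for rank, (w5,) in enumerate(cur.fetchall(), start=1):
            if w5 == last:
                return rank
        return None

    def average_rank(db_path, text):
        # Average rank of each 5-gram's final word; a text that always picks the
        # most frequent continuations will score suspiciously close to 1.
        con = sqlite3.connect(db_path)
        cur = con.cursor()
        words = text.lower().split()
        ranks = []
        for gram in five_grams(words):
            r = rank_of_continuation(cur, gram)
            if r is not None:
                ranks.append(r)
        con.close()
        return sum(ranks) / len(ranks) if ranks else float("nan")

What counts as a "suspicious" average rank is something you would calibrate on text you know to be human-written.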

Practically, a bigger obstacle might be the licensing terms of the dataset. Using it for a commercial app might be prohibited.

OTHER TIPS

One simple approach would be to have a large group of humans read the input text for you and see if the text makes sense. I'm only half-joking; this is a tricky problem.

I believe this to be a hard problem, because Markov-chain generated text is going to have a lot of the same properties as real human text in terms of word frequency and simple relationships between the ordering of words.

The differences between real text and text generated by a Markov chain are in higher-level rules of grammar and in semantic meaning, which are hard to encode programmatically. The other problem is that Markov chains are good enough at generating text that they sometimes come up with grammatically and semantically correct statements.

As an example, here's an aphorism from the kantmachine:

Today, he would feel convinced that the human will is free; to-morrow, considering the indissoluble chain of nature, he would look on freedom as a mere illusion and declare nature to be all-in-all.

While this string was written by a computer program, it's hard to say that a human would never say this.

I think that unless you can give us more specific details about the computer- and human-generated text that expose more obvious differences, it will be difficult to solve this with computer programming.

I suggest a generalization of Evan's answer: make a Markov model of your own and train it with a big chunk of the (very large) sample you're given, reserving the rest of the sample as "test data". Now see how well the model you've trained does on the test data, e.g. with a chi-square test that will detect situations in which the "fit is TOO good" (suggesting the test data was indeed generated by this model) as well as ones in which the fit is very bad (suggesting an error in model structure -- an over-trained model with the wrong structure does a notoriously bad job in such cases).

Of course there are still many issues of calibration, such as the structure of the model -- are you suspecting a simple model based on N-tuples of words and little more, or a more sophisticated one with grammar states and the like? Fortunately, you can calibrate things pretty well by using large corpora of known-to-be-natural text, as well as corpora you generate yourself with models of various structures.
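Here is a minimal sketch of that train-and-test idea, using a word-bigram model and a hand-rolled Pearson chi-square statistic. The 80/20 split and the file name in the usage comment are placeholders of my own, and a real version would want smoothing for unseen bigrams.

    from collections import Counter, defaultdict

    def bigrams(tokens):
        return list(zip(tokens, tokens[1:]))

    def train_bigram_model(tokens):
        # Conditional probabilities P(next | current) estimated from the training part.
        counts = defaultdict(Counter)
        for a, b in bigrams(tokens):
            counts[a][b] += 1
        return {a: {b: c / sum(nexts.values()) for b, c in nexts.items()}
                for a, nexts in counts.items()}

    def chi_square_fit(model, test_tokens):
        # Pearson chi-square of observed vs. expected bigram counts on held-out text.
        # A suspiciously small statistic suggests the test text was generated by this
        # very model; a huge one suggests the model structure is simply wrong.
        observed = Counter(bigrams(test_tokens))
        context_totals = Counter(a for a, _ in bigrams(test_tokens))
        chi2, dof = 0.0, 0
        for (a, b), obs in observed.items():
            p = model.get(a, {}).get(b)
            if p is None:        # unseen bigram: skipped here, smoothed in real life
                continue
            expected = p * context_totals[a]
            chi2 += (obs - expected) ** 2 / expected
            dof += 1
        return chi2, dof

    # Usage sketch: split the big sample 80/20 and test the held-out part.
    # tokens = open("sample.txt").read().split()
    # cut = int(0.8 * len(tokens))
    # model = train_bigram_model(tokens[:cut])
    # print(chi_square_fit(model, tokens[cut:]))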

A different approach is to use nltk to parse the sentences you're given. A small number of mis-parses is to be expected even in natural text (humans are imperfect, and so is the parser -- it may not know that word X can be used as a verb and only classify it as a noun, etc.), but most Markov models (unless they're modeling essentially the same grammar structure your parser happens to be using -- and you can use several parsers to try and counteract that!-) will cause VASTLY more mis-parses than even dyslexic humans. Again, calibrate that on natural vs. synthetic texts, and you'll see what I mean!-)
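As a rough illustration of the parse-failure idea, here is a sketch that counts the sentences a chart parser cannot parse at all. The tiny toy grammar is purely my own placeholder; in practice you would plug in a broad-coverage grammar or a statistical parser, and nltk's tokenizers need their 'punkt' models downloaded first.

    import nltk

    # Deliberately tiny toy grammar, just to show the mechanics; replace it with a
    # broad-coverage grammar (or swap in a statistical parser) for real use.
    TOY_GRAMMAR = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> Det N | Det N PP | 'i'
    VP -> V NP | VP PP
    PP -> P NP
    Det -> 'the' | 'a'
    N  -> 'dog' | 'cat' | 'park'
    V  -> 'saw' | 'walked'
    P  -> 'in' | 'with'
    """)

    def misparse_rate(text, grammar=TOY_GRAMMAR):
        # Fraction of sentences for which the parser finds no parse tree at all.
        parser = nltk.ChartParser(grammar)
        sentences = nltk.sent_tokenize(text)     # requires nltk.download('punkt')
        failures = 0
        for sent in sentences:
            tokens = nltk.word_tokenize(sent.lower())
            try:
                if next(iter(parser.parse(tokens)), None) is None:
                    failures += 1
            except ValueError:                   # words outside the grammar's lexicon
                failures += 1
        return failures / len(sentences) if sentences else 0.0

    # Calibrate on texts of known origin, then compare (file names are placeholders):
    # print(misparse_rate(open("human.txt").read()))
    # print(misparse_rate(open("markov.txt").read()))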

If you had several large Markov-generated texts, you could possibly determine that they were Markov-generated by comparing the word frequencies across the samples. Since Markov chains depend on constant word probabilities, the proportions of any given word should be roughly equal from sample to sample.
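A minimal sketch of that cross-sample comparison, assuming tokens are simply split on whitespace (a simplification of my own; real tokenization would be more careful):

    from collections import Counter

    def relative_freqs(text):
        # Relative frequency of each whitespace-separated word in a sample.
        words = text.lower().split()
        total = len(words)
        return Counter({w: c / total for w, c in Counter(words).items()})

    def max_proportion_gap(sample_a, sample_b, top_n=50):
        # Largest gap in relative frequency among the top-N words of either sample.
        # Samples drawn from one fixed Markov chain should keep this gap small.
        fa, fb = relative_freqs(sample_a), relative_freqs(sample_b)
        top = {w for w, _ in fa.most_common(top_n)} | {w for w, _ in fb.most_common(top_n)}
        return max((abs(fa[w] - fb[w]) for w in top), default=0.0)

    # Usage sketch (file names are placeholders):
    # gap = max_proportion_gap(open("sample1.txt").read(), open("sample2.txt").read())
    # print(gap)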

Crowdsourcing. Use Mechanical Turk and get a number of humans to vote on this. There are even some libraries to help you pull this off.

O'Reilly Radar also has a blog post with tips for using Mechanical Turk to get your work done.

You could write a program which generates Markovian transition probabilities from any sequence of symbols and then calculates the entropy rate of the Markov matrix (see http://en.wikipedia.org/wiki/Entropy_rate#Entropy_rates_for_Markov_chains). This is basically an estimate of how easily the text could be predicted using just the Markov chain (higher entropy means harder to predict). Therefore I would think that the lower the entropy of the Markov matrix is, the more likely it is that the sample of text is controlled by a Markov matrix. If you have questions on how to write this code, I happen to have a program in Python which does exactly this on my computer, so I can help you out.
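The answer doesn't include its program, but here is a minimal sketch of the computation it describes: estimate a first-order transition matrix from the token sequence and plug it into the entropy-rate formula, approximating the stationary distribution by the empirical state frequencies.

    from collections import Counter, defaultdict
    from math import log2

    def entropy_rate(tokens):
        # H = -sum_i mu_i * sum_j P_ij * log2(P_ij), with P estimated from observed
        # transitions and mu approximated by the empirical frequency of each state
        # (for a chain fitted to one long sequence these are essentially the same).
        transitions = defaultdict(Counter)
        for a, b in zip(tokens, tokens[1:]):
            transitions[a][b] += 1

        total = len(tokens) - 1
        rate = 0.0
        for nexts in transitions.values():
            row_total = sum(nexts.values())
            mu = row_total / total               # empirical weight of this state
            row_entropy = -sum((c / row_total) * log2(c / row_total)
                               for c in nexts.values())
            rate += mu * row_entropy
        return rate

    # Usage sketch: compare a suspect text's rate against known human-written texts
    # (the file name is a placeholder).
    # print(entropy_rate(open("suspect.txt").read().split()))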

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow