Question

Let's imagine I have two English-language texts written by the same person. Is it possible to apply some Markov chain algorithm to analyse each: create some kind of fingerprint based on statistical data, and compare fingerprints obtained from different texts? Let's say we have a library with 100 texts. Some person wrote text number 1 and also one of the others, and we need to guess which one by analyzing his/her writing style. Is there any known algorithm for doing this? Can Markov chains be applied here?


Solution

Absolutely, it is possible; indeed, the record of success in identifying an author from a text, or some portion of it, is impressive.


To aid your web-search, this discipline is often called Stylometry (and occasionally, Stylogenetics).

So the two most important questions are, I suppose: which classifiers are useful for this purpose, and what data is fed to the classifier?

What I still find surprising is how little data is required to achieve very accurate classification. Often the data is just a word-frequency list. (A directory of word frequency lists is available online here.)

For instance, one data set widely used in Machine Learning and available from a number of places on the Web comprises works from four authors: Shakespeare, Jane Austen, Jack London, and Milton. These works were divided into 872 pieces (corresponding roughly to chapters), in other words, about 220 substantial pieces of text for each of the four authors; each of these pieces becomes a single data point in the data set. Next, a word-frequency scan was performed on each piece; the 70 most common words were kept for the study, and the remainder of the frequency results were discarded. Here are the first 20 of that 70-word list:

['a', 'all', 'also', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'been',
  'but', 'by', 'can', 'do', 'down', 'even', 'every', 'for', 'from'] 

Each data point is then just the count of each of the 70 words in one of the 872 chapters; for instance (showing counts for the first 20 words only):

[78, 34, 21, 45, 76, 9, 23, 12, 43, 54, 110, 21, 45, 59, 87, 59, 34, 104, 93, 40]

Each of these data points is one instance of the author's literary fingerprint.
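A minimal sketch of how such a fingerprint could be computed; the 20-word sublist above serves as the vocabulary, and `sample` is a made-up text, not from the actual data set:

```python
from collections import Counter

# First 20 of the 70 most frequent function words (from the list above).
VOCAB = ['a', 'all', 'also', 'an', 'and', 'any', 'are', 'as', 'at', 'be',
         'been', 'but', 'by', 'can', 'do', 'down', 'even', 'every', 'for', 'from']

def fingerprint(text, vocab=VOCAB):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

# Hypothetical example text:
sample = "All that a man can do, even down by the sea, is be a man"
vector = fingerprint(sample)
print(vector)  # one count per vocabulary word, in vocabulary order
```

In the real data set, each such vector was computed over a whole chapter-sized piece of text rather than a sentence.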

The final item in each data point is an integer (1-4) representing one of the four authors to whom that text belongs.

Recently, I ran this data set through a simple unsupervised ML algorithm; the results were very good: almost complete separation of the four classes. You can see the results in my answer to a previous StackOverflow question, which concerned text classification using ML generally rather than author identification.
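As a toy illustration of the unsupervised approach (not the specific algorithm used above), here is a plain k-means sketch over made-up two-dimensional word-count vectors, two fabricated "authors" with different word preferences:

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute centroids as cluster means; repeat. Deterministic init
    (first k points) to keep the sketch reproducible."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Made-up counts of two function words for six text pieces:
# the first three pieces favor word A, the last three favor word B.
data = [[90, 5], [85, 8], [88, 6], [10, 70], [12, 75], [9, 72]]
labels = kmeans(data, 2)
print(labels)  # pieces by the same "author" land in the same cluster
```

With real fingerprints the vectors would be 70-dimensional and k would be the number of candidate authors, but the mechanics are the same.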

So what other algorithms are used? Apparently, most supervised Machine Learning algorithms can successfully resolve this kind of data. Among these, multi-layer perceptrons (MLPs, i.e., neural networks) are often used (Author Attribution Using Neural Networks is one frequently cited study).
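An MLP is overkill to sketch here, but the supervised setup itself can be shown with a much simpler baseline, a nearest-centroid classifier over fabricated two-word count vectors (the counts and author ids below are made up for illustration):

```python
import math

def train_centroids(X, y):
    """Average the fingerprints of each author's training texts."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Attribute a new text to the author with the closest centroid."""
    return min(centroids, key=lambda lab: math.dist(x, centroids[lab]))

X = [[90, 5], [85, 8], [10, 70], [12, 75]]  # made-up word-count pairs
y = [1, 1, 2, 2]                            # author ids, as in the data set
model = train_centroids(X, y)
print(predict(model, [88, 7]))   # → 1
print(predict(model, [11, 73]))  # → 2
```

A real classifier (MLP, SVM, etc.) learns a more flexible decision boundary, but it consumes exactly this kind of labeled fingerprint data.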

OTHER TIPS

You might start with a visit to the Apache Mahout web site. There is a large literature on classification and clustering. Essentially, you would run a clustering algorithm and then hope that 'which writer' is what determines the clusters.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow