Question

I'm doing some personal research into text analysis, and have come up with close to 70 metrics (pronoun usage frequency, reading levels, vowel frequency, use of bullet points, etc) to "score" a piece of text.

Ideally, separate pieces of text from the same author would have similar scores. The ultimate goal is to index a great deal of authors, and use scores to guess at who wrote a separate, anonymous piece of text.

I'd like the scores to be normalized from 0 to 100 and to represent a percentage of how "similar" two pieces of text are in writing style. Questions like "How to decide on weights?" and "How to calculate scores?" describe the math behind scoring metrics and how to normalize, but they assume every metric is weighted the same.

My question is this: how do I determine the proper weight to use when scoring each metric, to ensure that the cumulative score per-user most accurately describes the writing from that specific user?

Also, weights can be assigned per-user. If syllables per word most aptly describes who wrote a piece for Alice, while the frequency of two-letter words is the best for Bob, I'd like Alice's heaviest weight to be on syllables per word, and Bob's to be on frequency of two-letter words.


Solution

If you want to do it with weighted scores, have a look at http://en.wikipedia.org/wiki/Principal_component_analysis. You could plot the values of the first (largest) couple of principal components for different authors and see if you find a clustering. You can also plot the smallest few principal components and see if anything stands out; if it does, it is probably a glitch or a mistake, since those components tend to pick out exceptions to general rules.
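A minimal sketch of that with scikit-learn, assuming your roughly 70 metrics are already collected into a documents-by-features matrix X with an array of author labels y (both names are placeholders):

    # Project the stylometric features onto the first two principal components
    # and plot them per author; assumes X (n_docs x n_features) and y (labels).
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
    Z = PCA(n_components=2).fit_transform(X_scaled)

    for author in np.unique(y):
        mask = (y == author)
        plt.scatter(Z[mask, 0], Z[mask, 1], label=author)
    plt.xlabel("PC 1"); plt.ylabel("PC 2"); plt.legend(); plt.show()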

Another option is http://en.wikipedia.org/wiki/Linear_discriminant_analysis

I suppose you could build per-author weights if you built weights for the classification Alice vs not-Alice, and weights for the classification Bob vs not-Bob.
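One way to sketch that in scikit-learn: fit a linear author-vs-everyone-else classifier for each author and read its coefficients off as that author's per-metric weights (X, y and feature_names are assumed, as above):

    # Per-author weights from Alice-vs-not-Alice style binary classifiers.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def per_author_weights(X, y, feature_names):
        weights = {}
        for author in np.unique(y):
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X, (y == author).astype(int))   # this author vs everyone else
            # Larger |coefficient| means the metric matters more for this author.
            weights[author] = dict(zip(feature_names, clf.coef_[0]))
        return weights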

Another way of trying to identify authors is to build a http://en.wikipedia.org/wiki/Language_model for each author.
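As an illustration only (real stylometric language models are more sophisticated than this), here is a crude per-author character-trigram model with add-one smoothing; corpora is a hypothetical dict mapping each author to one big training string:

    # Score an anonymous text against each author's character-trigram model.
    from collections import Counter
    import math

    def trigram_counts(text):
        return Counter(text[i:i+3] for i in range(len(text) - 2))

    def log_prob(text, counts, vocab_size, total):
        # Sum of add-one-smoothed log-probabilities of the text's trigrams.
        return sum(math.log((counts.get(text[i:i+3], 0) + 1) / (total + vocab_size))
                   for i in range(len(text) - 2))

    def guess_author(anon_text, corpora):
        models = {a: trigram_counts(t) for a, t in corpora.items()}
        vocab_size = len(set().union(*models.values()))
        return max(corpora,
                   key=lambda a: log_prob(anon_text, models[a], vocab_size,
                                          sum(models[a].values())))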

It occurs to me that if you are prepared to claim that your different measures are independent, you can then combine them with http://en.wikipedia.org/wiki/Naive_Bayes_classifier. The log of the final Bayes factor will then be the sum of the logs of the individual Bayes factors, which gives you your sum of weighted scores.
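For example, a Gaussian naive Bayes model (one possible concrete choice, not necessarily the right distribution for your metrics) does exactly this per-metric log-score summation for you; X, y and X_new are assumed as before:

    # Naive Bayes: assumes the metrics are independent given the author, so the
    # log-posterior is a sum of per-metric log-scores.
    from sklearn.naive_bayes import GaussianNB

    nb = GaussianNB().fit(X, y)
    log_scores = nb.predict_log_proba(X_new)   # one row of per-author log-scores per document
    guesses = nb.predict(X_new)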

OTHER TIPS

It seems like you're trying to combine a bunch of disparate attributes of writing style into a single number, which would then somehow be used to determine similarity between users' writing styles. How is that going to work out? Bob is 100, Alice is 50, etc?

What you really want is to use (some subset of) the metrics to form a feature vector for each document. Then you can say that a certain document is represented by (60% pronoun usage, 10th grade "reading level", 40% vowels, ...), another by (40% pronouns, 12th grade "reading level", 50% vowels, ...), where each of those attributes is a real number and the position in the vector tells you which attribute you're talking about.

Then you can label each of those vectors by the true author, so that you have a collection of feature vectors labeled for each author. Then you can compute similarities in any number of ways.
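A sketch of what that labeled collection might look like; pronoun_fraction, reading_level and vowel_fraction are hypothetical stand-ins for your metrics, and corpus is a list of (text, author) pairs:

    # Build one feature vector per document plus a parallel array of author labels.
    import numpy as np

    def features(text):
        return np.array([
            pronoun_fraction(text),   # e.g. 0.60
            reading_level(text),      # e.g. 10.0
            vowel_fraction(text),     # e.g. 0.40
            # ... the rest of your ~70 metrics
        ])

    X = np.vstack([features(text) for text, _ in corpus])
    y = np.array([author for _, author in corpus])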


If you have a new document and you want to guess who wrote it, this is a standard supervised learning problem. One easy approach is k-nearest neighbors, in which you find the k nearest vectors to your test point under some distance metric and let their labels vote for which author you think this is. If you have no idea which features are going to be most useful, you can use the Mahalanobis distance, which (with a diagonal covariance estimate) amounts to the standard Euclidean distance after you scale each component of the vector to have unit variance ((((x - y) / all_data_stacked.std(axis=0))**2).sum() in numpy notation, which gives the squared distance).
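A minimal sketch of that vote, assuming X and y from above:

    # k-NN vote with every feature scaled to unit variance (the diagonal
    # Mahalanobis / scaled Euclidean distance described above).
    import numpy as np
    from collections import Counter

    def knn_author(x_new, X, y, k=5):
        scale = X.std(axis=0)                              # per-feature std dev
        d2 = (((X - x_new) / scale) ** 2).sum(axis=1)      # squared scaled distances
        nearest = np.argsort(d2)[:k]
        return Counter(y[nearest]).most_common(1)[0][0]    # majority label wins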

But there are many, many other approaches to doing classification, many of them based on finding separating surfaces in your feature space that separate one author from another. To do it with many authors, you can find these decision surfaces between all pairs of authors, apply each of those num_authors * (num_authors - 1) / 2 classifiers to the test point, and vote among those labels. Another way is to train one classifier for each author that does this author vs anyone else, and then take the one that's most confident.
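scikit-learn wraps both schemes around any binary classifier; a sketch (X, y, X_new assumed as before):

    # All-pairs voting vs. one-vs-rest "most confident" multi-class schemes.
    from sklearn.svm import LinearSVC
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

    pairwise = OneVsOneClassifier(LinearSVC()).fit(X, y)      # vote among all author pairs
    one_vs_rest = OneVsRestClassifier(LinearSVC()).fit(X, y)  # take the most confident author
    print(pairwise.predict(X_new), one_vs_rest.predict(X_new))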

One of the best out-of-the-box supervised classification algorithms for most problems is the support vector machine (SVM); LibSVM is a good implementation. There are many, many others, though.
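A sketch using scikit-learn's SVC, which is backed by LibSVM, with the usual zero-mean / unit-variance scaling applied first:

    # RBF-kernel SVM with feature scaling, evaluated by 5-fold cross-validation.
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
    print(cross_val_score(clf, X, y, cv=5).mean())   # rough accuracy estimate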


If you're not actually trying to classify test points, though, and you just want a measure of "how similar are Alice and Bob's writing styles?", there are other approaches to take. What you're trying to do in that case, in the framework that I'm dealing with here, is take two sets of vectors and ask "how similar are they"?

There are some simple measures people use for things like this, e.g. the minimum or mean distance between elements of the two sets, but those aren't necessarily very helpful.
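For what it's worth, those are one-liners over the feature vectors (X_alice and X_bob are the hypothetical per-author document matrices):

    # Minimum and mean pairwise distance between two authors' sets of documents.
    from scipy.spatial.distance import cdist

    D = cdist(X_alice, X_bob)      # all pairwise distances between the two sets
    print(D.min(), D.mean())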

One ad-hoc measure is: how easy is it to confuse Alice's writing for Bob's? To test this, train an Alice-vs-Bob classifier with cross-validation and see how often the classifier confuses test points for Alice's vs for Bob's. That is, use all but k of the documents for Alice or Bob to train a classifier between the two, then evaluate that classifier on those k. Repeat so that every document is classified. If the error rate is high, then their writing style is similar; if not, they're not similar. Using k = 1 is best here, if you can afford it.
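A sketch of that leave-one-out version, with SVC as an arbitrary choice of classifier and X_alice, X_bob as above:

    # How confusable are Alice and Bob? Train Alice-vs-Bob classifiers with
    # leave-one-out cross-validation; a high error rate suggests similar styles.
    import numpy as np
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.svm import SVC

    X_ab = np.vstack([X_alice, X_bob])
    y_ab = np.array(["alice"] * len(X_alice) + ["bob"] * len(X_bob))

    accuracy = cross_val_score(SVC(), X_ab, y_ab, cv=LeaveOneOut()).mean()
    print("confusion rate:", 1 - accuracy)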

We can also come at this from a more formal approach. A research project that I happen to be involved with involves treating those sets of feature vectors as samples from an unknown probability distribution representing the writing style of an individual author. So when Alice writes a document, its features are chosen according to a probability distribution that represents the way in which she writes; Bob's documents are chosen from Bob's distribution. You can then estimate the Rényi-α divergence between those two distributions, which is one way of measuring how "similar" they are. (If you choose α near 1, it approximates the important Kullback-Leibler (KL) divergence.) Here are some papers introducing the technique, giving all the mathematical details of the estimator, and a preprint describing how to combine this estimator with SVMs to beat the state of the art on computer vision problems. I have a C++ implementation here; let me know if you end up using it!
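The estimator and C++ code referred to above aren't reproduced here; purely as an illustration of the idea, here is a standard 1-nearest-neighbour estimate of the KL divergence (the α near 1 case) between two sample sets X ~ P and Y ~ Q:

    # Simple 1-NN KL divergence estimate between two sets of feature vectors.
    # Duplicate points (zero distances) will break the logs; this is only a sketch.
    import numpy as np
    from scipy.spatial import cKDTree

    def knn_kl_divergence(X, Y):
        n, d = X.shape
        m = len(Y)
        rho = cKDTree(X).query(X, k=2)[0][:, 1]   # distance to nearest *other* point in X
        nu = cKDTree(Y).query(X, k=1)[0]          # distance to nearest point in Y
        return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))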

Another similar method people use is called maximum mean discrepancy.
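A sketch of a (biased) RBF-kernel MMD estimate between the two authors' sets of vectors:

    # Maximum mean discrepancy with an RBF kernel: near 0 when the two sets look
    # like samples from the same distribution, larger when they differ.
    from sklearn.metrics.pairwise import rbf_kernel

    def mmd_rbf(X, Y, gamma=1.0):
        return (rbf_kernel(X, X, gamma).mean()
                + rbf_kernel(Y, Y, gamma).mean()
                - 2 * rbf_kernel(X, Y, gamma).mean())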

All of these techniques (except the classifier one), unfortunately, rely to some extent on your scaling the original features appropriately by hand. (This isn't true for e.g. SVMs for classification; they can figure out whether some features are more important than others for a given user, though you should probably scale them all to zero mean and unit variance as a first step.) That's the problem of feature selection, which is hard and unfortunately still requires a fair bit of tweaking. Approaches based on mutual information and the like (intimately related to divergence estimation) can be helpful there. As mcdowella suggested, PCA can also be a decent place to start.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow