How advanced are author-recognition methods?

https://softwareengineering.stackexchange.com/questions/203133

29-09-2020

Question

If a computer program analyses texts written by an author, how much can it tell today about that author, given texts long enough to be statistically significant?

Can a program even tell with "certainty" whether a man or a woman wrote a text, based solely on its contents and not on an investigation of IP addresses and the like?

I'm interested to know whether there are algorithms in use that, for instance, automatically determine whether an author is male or female, or infer similar characteristics of an author, from an analysis of the author's written text.

It could be useful to know, before you read a message, what a computer analysis says about the author, do you agree? If, for instance, I get a longer message from my wife saying that she has had an accident in Nigeria, and the program says that with 99% probability the message was written by a male author in his sixties of non-Caucasian origin or the like, or by somebody who is not my wife, then the program could help me investigate why that message differs in its characteristics.

There can also be other uses for instance just detecting outliers in a geographically or demographically bounded larger data set.

Scam detection is the obvious use I'm thinking of, but there could also be other uses. Are there already programs that analyse a written text to tell something about the author based on word choice, use of pronouns, unusual language usage, or the like?

Solution

Yes, there are, and no, they don't work very well.

Deducing information about the author from a text is a sub-discipline of natural language processing. Most NLP applications are about extracting information about the content of a text rather than its author, but the goals, methods and state of the art are actually rather similar (currently this favors things such as n-gram counts, maximum-entropy classifiers etc.). In the end, understanding a text and understanding its author are both small parts of the old computer science dream, artificial intelligence. Like most of the problems in AI, both have turned out to be much, much harder than expected, very much dependent on domains, circumstances and processing power, and progressing only slowly and arduously.
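To make "n-gram counts" concrete, here is a minimal sketch of one way such features could be used: each text is reduced to counts of overlapping character trigrams, and an unknown text is assigned to the known author whose profile is closest by cosine similarity. The function names, the choice of character trigrams and the nearest-profile rule are assumptions made for illustration. It is not a maximum-entropy classifier, and nothing here should be taken as how any production system works.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no profile to compare
    return dot / (norm_a * norm_b)


def most_similar_author(unknown_text, known_texts, n=3):
    """Return (author, score) for the known author whose n-gram profile is closest.

    known_texts is a hypothetical mapping of author name -> sample text.
    """
    unknown = char_ngrams(unknown_text, n)
    scores = {
        author: cosine_similarity(unknown, char_ngrams(sample, n))
        for author, sample in known_texts.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Note that this always returns some author, even when none of the candidates wrote the text, and the score is a similarity rather than a probability. That is one reason figures like "99% certain" should be treated with suspicion.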

That said, there are established methods for tasks such as "sentiment analysis" (deciding whether a text, e.g. a customer review, is positive or negative), summarization (extracting the key message from a passage of text) or question answering that work tolerably well under controlled conditions. Author detection is harder than any of these; you can sometimes detect a particular writer by characteristic phrases, constructions, topics or opinions, but often you can't, and the same indicators that work quite well for one author can be totally useless for others. That is even before considering that people can change their writing style deliberately, specifically to avoid being unmasked. In fact, if you had a reliable algorithm for detecting authors, this would be a very big help to someone trying to escape detection, since they would only have to keep paraphrasing until the algorithm no longer identifies them!
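The "keep paraphrasing until the algorithm no longer identifies you" argument can be sketched as a short loop. Everything here is hypothetical: `is_attributed` stands in for whatever detector the writer has access to, `rewrites` is a list of paraphrasing steps the writer is willing to apply, and the toy detector just looks for a single pet phrase. Real detectors and real paraphrasing are far messier.

```python
def evade(text, rewrites, is_attributed):
    """Apply rewrites one at a time until the detector stops attributing the text.

    Returns (final_text, evaded), where evaded is True if the detector
    no longer attributes the final text to the author.
    """
    current = text
    for rewrite in rewrites:
        if not is_attributed(current):
            break  # already undetected, no need to change more
        current = rewrite(current)
    return current, not is_attributed(current)


def toy_detector(text):
    """Hypothetical detector: flags one author's characteristic phrase."""
    return "needless to say" in text.lower()


# Hypothetical paraphrasing steps; a real writer would have many more.
toy_rewrites = [
    lambda t: t.replace("Needless to say", "Obviously"),
]
```

With these toy definitions, `evade("Needless to say, it failed.", toy_rewrites, toy_detector)` swaps out the pet phrase and the detector no longer fires. The point is only that access to the detector turns it into a feedback signal for the person it is meant to catch.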

This is a general problem with text processing that is meant to defeat human intentions: the results can be used by both sides, which often cancels out any progress that the scientists make. For instance, many teachers use online plagiarism detection services, but that only works because the teachers put more effort into detecting plagiarism than the students put into plagiarizing in the first place. If someone really wants to get away with submitting someone else's work, they just have to subscribe to the same services and try out which submissions will be detected and which won't.

So altogether, the field is huge, frustrating but fascinating, and nowhere near ready for reliable use for what you have in mind.

Licensed under: CC-BY-SA with attribution
