Question

You may have noticed that we now show an edit summary on Community Wiki posts:

community wiki
220 revisions, 48 users

I'd like to also show the user who "most owns" the final content displayed on the page, as a percentage of the remaining text:

community wiki
220 revisions, 48 users
kronoz 87%

Yes, there could be top (n) "owners", but for now I want the top 1.

Assume you have this data structure, a list of user/text pairs ordered chronologically by the time of the post:

User Id     Post-Text
-------     ---------
12          The quick brown fox jumps over the lazy dog.
27          The quick brown fox jumps, sometimes.
30          I always see the speedy brown fox jumping over the lazy dog.

Which of these users most "owns" the final text?

I'm looking for a reasonable algorithm -- it can be an approximation, it doesn't have to be perfect -- to determine the owner. Ideally expressed as a percentage score.

Note that we need to factor in edits, deletions, and insertions, so the final result feels reasonable and right. You can use any Stack Overflow post with a decent revision history (not just retagging, but frequent post-body changes) as a test corpus. Here's a good one, with 15 revisions from 14 different authors. Who is the "owner"?

https://stackoverflow.com/revisions/327973/list

Click "view source" to get the raw text of each revision.

I should warn you that a pure algorithmic solution might end up being a form of the Longest Common Substring Problem. But as I mentioned, approximations and estimates are fine too if they work well.

Solutions in any language are welcome, but I prefer solutions that:

  1. Are fairly easy to translate into C#.
  2. Are free of dependencies.
  3. Put simplicity before efficiency.

It is extraordinarily rare for a post on SO to have more than 25 revisions. But it should "feel" accurate, so if you eyeballed the edits you'd agree with the final decision. I encourage you to test your algorithm out on Stack Overflow posts with revision histories and see if you agree with the final output.


I have now deployed the following approximation, which you can see in action for every new saved revision on Community Wiki posts:

  • do a line-based diff of every revision where the body text changes
  • sum the inserted and deleted lines for each revision as its "editcount"
  • each user id gets the sum of the "editcount"s they contributed
  • the first revision's author gets 2× "editcount" as an initial score, as a primary authorship bonus
  • to determine the final ownership percentage: each user's edited-line total is divided by the total number of edited lines across all revisions

(There are also some guard clauses for common simple conditions like 1 revision, only 1 author, etcetera. The line-based diff makes it fairly speedy to recalc for all revisions; in a typical case of say 10 revisions it's ~50ms.)
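For illustration, the scheme above boils down to something like this minimal Python sketch (the deployed code is C#; difflib stands in for the line diff, and the function name is mine):

import collections
import difflib

def ownership_percentages(revisions):
    # revisions: chronological list of (user_id, body_text) pairs
    counts = collections.Counter()
    for i, (user, text) in enumerate(revisions):
        prev = revisions[i - 1][1] if i > 0 else ''
        # Line-based diff; count inserted and deleted lines as "editcount".
        diff = difflib.ndiff(prev.splitlines(), text.splitlines())
        editcount = sum(1 for line in diff if line.startswith(('+ ', '- ')))
        if i == 0:
            editcount *= 2  # primary authorship bonus for the first revision
        counts[user] += editcount
    total = sum(counts.values()) or 1
    return {user: 100.0 * c / total for user, c in counts.items()}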

This works fairly well in my testing. It does break down a little when you have small one- or two-line posts that several people edit, but I think that's unavoidable. I'm accepting Joel Neely's answer as closest in spirit to what I went with, and have upvoted everything else that seemed workable.

Was it helpful?

Solution

Saw your tweet earlier. From the display of the 327973 link, it appears you already have a single-step diff in place. Based on that, I'll focus on the multi-edit composition:

  1. A, the original poster, owns 100% of the post.

  2. When B, a second poster, makes edits such that e.g. 90% of the text is unchanged, the ownership is A:90%, B:10%.

  3. Now C, a third party, changes 50% of the text. (A:45%, B:5%, C:50%)

    In other words, when a poster makes edits such that x% is changed and y = (100-x)% is unchanged, then that poster now owns x% of the text and all previous ownership is multiplied by y%.

    To make it interesting, now suppose...

  4. A makes a 20% edit. Then A owns a "new" 20%, and the residual ownerships are now multiplied by 80%, leaving (A:36%, B:4%, C:40%). The "net" ownership is therefore (A:56%, B:4%, C:40%).
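In code, this composition rule might look like the following minimal Python sketch (names are illustrative):

def apply_edit(ownership, editor, changed_fraction):
    # All prior shares are scaled by the unchanged fraction...
    keep = 1.0 - changed_fraction
    new = {user: share * keep for user, share in ownership.items()}
    # ...and the editor takes ownership of the changed fraction.
    new[editor] = new.get(editor, 0.0) + changed_fraction
    return new

# Reproducing the walkthrough above:
o = {'A': 1.0}                 # 1. A owns 100%
o = apply_edit(o, 'B', 0.10)   # 2. A: 90%, B: 10%
o = apply_edit(o, 'C', 0.50)   # 3. A: 45%, B: 5%, C: 50%
o = apply_edit(o, 'A', 0.20)   # 4. A: 56%, B: 4%, C: 40%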

Applying this to your specimen (327973) with everything rounded to the nearest percent:

Version 0: The original post.

  • Paul Oyster: 100%

Version 1: Your current diff tool shows pure addition of text, so all those characters belong to the second poster.

  • Paul Oyster: 91%
  • onebyone: 9%

Version 2: The diff shows replacement of a word. The new word belongs to the third poster, and the remaining text belongs to the prior posters.

  • Paul Oyster: 90%
  • onebyone: 9%
  • Blogbeard: 1%

Version 3: Tag-only edit. Since your question was about the text, I'm ignoring the tags.

  • Paul Oyster: 90%
  • onebyone: 9%
  • Blogbeard: 1%

Version 4: Addition of text.

  • Paul Oyster: 45%
  • onebyone: 4%
  • Blogbeard: 1%
  • Mark Harrison: 50%

I hope that's enough to give the sense of this proposal. It does have a couple of limitations, but I'm sliding these in under your statement that an approximation is acceptable. ;-)

  1. It distributes the effect of a change bluntly across all prior owners. If A posts, B does a pure addition, and C edits half of what B added, this simplistic approach simply applies C's ownership across the entire post, without trying to work out whose prior contribution was changed the most.

  2. It accounts for additions or changes, but doesn't give any ownership credit for deletion, because the deleter adds 0% to the remaining text. You can either regard this as a bug or a feature. I chose door number 2.

Update: A bit more about issue #1 above. I believe that fully tracking ownership of the edited portions of a post would require one of two things (the margin of the web page is not big enough for a formal proof ;-):

  • Changing the way text is stored to reflect ownership of individual portions of the text (e.g. A owns words 1-47, B owns words 48-59, A owns words 60-94,...), applying the "how much remains" approach in my proposal to each portion, and updating the portion-ownership data.

  • Considering all versions from first to current (in effect, recomputing the portion-ownership data on the fly).

So this is a nice example of a trade-off among a quick-and-dirty approximation (at the cost of precision), a change to the entire database (at the cost of space), and every calculation having to look at the entire history (at the cost of time).

OTHER TIPS

I think the idea is fundamentally flawed.

If someone writes a brilliant analysis with awful spelling and unclear examples, and I copy-edit it extensively, have I created 60% of the work? Clearly not; the result is a derivative where most of the value comes from the initial poster. A useful measure is not possible based on character or word counts; it would require strong-AI-level semantic analysis.

Quite apart from that, seeking credit based on “ownership” of articles would likely be entirely unhelpful and anti-wiki. On Wikipedia, for example, people who act as though they own articles are one of the most damaging influences.

Here's how I see the ownership of the example post (note: I didn't include the tags in the text at all, so simple tag changes aren't counted in this example, but that could easily be added):

Slap forehead: Man, I used the revision numbers instead of the user numbers. Results redone below:

User 38193 owns 42% (922 / 2171) of the final post
User 2635 owns 28% (625 / 2171) of the final post
User 116 owns 24% (529 / 2171) of the final post
User 745 owns 3% (76 / 2171) of the final post
User 13005 owns 0% (11 / 2171) of the final post
User 18941 owns 0% (5 / 2171) of the final post
User 8562 owns 0% (3 / 2171) of the final post

53 ms

So according to my algorithm, user 38193 (@Paul Oyster) owns 42% of the post, whereas user 2635 (@Simucal) has 28% and user 116 (@Mark Harrison) has 24%; the rest is negligible.

From the revisions, we can see that Paul, who is the original author, still owns most of the question, and that Simucal and Mark come in at a solid no. 2 and 3. These correspond to revision no. 1 (the original post); no. 14, a large edit by Simucal, which shows the flaw in my algorithm quite nicely (see below); and no. 5, where Mark added the bash script.

So how did I arrive at this answer? Well, the algorithm has a flaw (I'll get back to that), but here's how it goes:

Basically each byte in the original post is assigned the user id of the user that wrote it. Then I use a diff algorithm that can handle copies out of order, which will then bring along the user id of bytes copied by a new author. Anything added by a new author is assigned the user id of the new author.

For instance, if the original author writes two sentences, these will be tagged with his user id. Then another author revises it and adds a third sentence between the original two. To the diff algorithm, this looks like the new author copied the first sentence, added new data, and copied the second sentence. The sentences will thus be correctly attributed to their authors.

Since the diff algorithm works on bytes, minor textual changes like adding missing punctuation or letters should have a negligible impact on ownership, and almost all the original text should still be attributed to the original author. However, in some cases it will use the "added data" operation even when just a single byte was added, due to internal optimizations. The algorithm and its implementation were originally created to handle file diffs and to produce the smallest possible patches between file versions, and it will sometimes optimize away minor steps, folding them into a sibling operation if doing so reduces the patch size.

The flaw in the algorithm comes from rollbacks. I note Jeff has written in a comment that rollbacks won't be considered, but if a user edits a post instead of rolling it back, and just pastes in the old content, in effect reversing a change by the previous author, then all that text is attributed to the person "rolling back" instead of the original author who came up with it.

The source code for this implementation can be found here, for Visual Studio 2008. Note that the solution doesn't do any screen-scraping; the post contents are hardcoded into the source in the TestData class, properly escaped for quotes, etc. To replace the text, you need to change that file or implement a way to read contents from outside the program.

Anyway, here's the algorithm in a bit more detail.


  1. Create an array of integers as long as the original revision (actually, I encoded the text as UTF-8 bytes and used that length). Fill this array with the user id of the original author: he now owns 100% of the revision; every character/byte is his.
  2. Compare the first revision with the next revision. This comparison produces a list of operations, which can be applied to the original revision to reproduce the next revision (more on this below).
  3. Pretend the original post is the array of user ids you have prepared (the one currently containing only values equal to the first author's id), and apply the operations to it. This produces a new list of ids, some of them the original author's, some the second author's.
  4. Keep the second revision and this new list of user ids, and repeat steps 2 and 3 against the next revision, and the next, until you reach the latest one.

At this point, you have a list of user id's which tells you which user added each character.

Each operation produced by the comparison is one of two kinds:

  • Insert new bytes
  • Copy some bytes

The comparison result is used like this: take the old contents (the first post), apply the operations to it, and you produce the next revision. It is basically a diff.

When I apply the operations to my list of user ids: when I copy, I simply copy; when I insert, I insert a run of the new author's id equal in length to the insert operation.

Let me give an example:

Original post:

This is the first post

Next post:

This is the next post, it is based on the first post.

The list of operations would thus be:

  1. Copy 12 characters 'This is the '
  2. Insert 4 characters 'next'
  3. Copy 5 characters ' post'
  4. Insert 17 characters ', it is based on ' (including the trailing space)
  5. Copy 14 characters 'the first post'
  6. Insert 1 character '.'

If I instead work with user ids, I start with this array:

0000000000000000000000
This is the first post

Now I apply the operations, and for every insert, I insert a 1 instead:

00000000000011110000011111111111111111000000000000001
This is the next post, it is based on the first post.

Now I simply count how many 0's and 1's I have:

  • 0's: 31
  • 1's: 22

User 0 owns 31/(31+22) of the post, and user 1 owns 22/(31+22) of the post.

Translated into percentages: user 0 owns 58%, user 1 owns 42%.
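Since my implementation is a C# diff tool (and its diff handles out-of-order copies, which Python's SequenceMatcher does not), here is only a rough Python sketch of the id-propagation step; the exact copy/insert splits may differ slightly from the hand-worked operations above:

import collections
import difflib

def owner_ids(revisions):
    # revisions: chronological list of (user_id, text) pairs
    ids = [revisions[0][0]] * len(revisions[0][1])
    prev_text = revisions[0][1]
    for user, text in revisions[1:]:
        new_ids = []
        matcher = difflib.SequenceMatcher(None, prev_text, text)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == 'equal':
                new_ids.extend(ids[i1:i2])           # copied text keeps its authors
            else:
                new_ids.extend([user] * (j2 - j1))   # inserted/replaced text goes to the editor
        ids, prev_text = new_ids, text
    return ids

revs = [(0, 'This is the first post'),
        (1, 'This is the next post, it is based on the first post.')]
print(collections.Counter(owner_ids(revs)))  # roughly {0: 31, 1: 22}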

Now, the problem with this algorithm is that of rollbacks, and adding back lost/deleted content.

For instance, if you have users A, B and C, and user A posts something that really ticks user B off, user B goes in and deletes everything, and adds just "THIS IS CRAP". When user C sees this, he edits the post, and adds back everything that A posted, possibly with fixes. User C now owns 100% of the post.

I don't know how to solve the above problem though.

I'll post the code that does this later tonight if it's interesting.


Applied to the 'quick brown fox' example, renumbering the users to 1-3, I get this:

User 3 owns 75% (45 / 60) of the final post
User 1 owns 25% (15 / 60) of the final post

Note that user 2, who only added the 'sometimes' part (which was subsequently removed), has dropped off the list.

The ids for the posts are as follows:

The quick brown fox jumps over the lazy dog.
11111111111111111111111111111111111111111111 (44 = 100%)

The quick brown fox jumps, sometimes.
1111111111111111111111111222222222222 (25 / 12 ~ 68% / 32%)

I always see the speedy brown fox jumping over the lazy dog.
333333333333333333333331111111111111113333333333333333333333 (45 / 15 = 75% / 25%)

Things the algorithm will cope with:

If I copy something when making my new post, the algorithm will properly attribute the copied elements, even if I copy parts that I myself added in the same revision. Example:

This is the first post, which is cool
1111111111111111111111111111111111111

This is the second post, which is the second post.
11111111111122222211111111111111112222222222111111
                               ^-- copied ------^

The only problem with this algorithm, and with this post in its entirety, is that I'm not entirely sure it produces what I would call an intuitively fair result. It might; I just hacked the program together, and far more testing with lots and lots of edge cases would be needed to determine whether it actually produces something people would accept.

Also, if I simply move stuff around in the initial post, only minor bits, like a few percent, will be mine. Since the algorithm is a diff algorithm at heart, sometimes the cost of outputting a copy operation for just one or a couple of bytes outweighs the cost of inserting them raw, so it will sometimes treat a short sequence of bytes as inserted even though they could have been copied from the original post.

If I am understanding your question correctly, it looks like you are trying to do what IBM did a while back with a Wikipedia research project: namely, seeing whose revisions to the text were the most accepted by other users and how the overall text changed over time. The project was called History Flow, and its "how it works" page gives a pretty nice overview of how their algorithm worked.

What about simply calculating the Levenshtein distance of each person's edit from the previous version? Then sum the distance scores for each user, and calculate each user's percentage of the sum of all users' distance scores.

Here's some C# code to calculate the distances.
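For reference, here is a short, dependency-free Python sketch of the same idea (the function names are mine):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def ownership(revisions):
    # revisions: chronological list of (user, text) pairs
    scores = {}
    for i, (user, text) in enumerate(revisions):
        prev_text = revisions[i - 1][1] if i > 0 else ''
        scores[user] = scores.get(user, 0) + levenshtein(prev_text, text)
    total = sum(scores.values()) or 1
    return {user: 100.0 * s / total for user, s in scores.items()}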

This will work if you want to implement/leverage a diff algorithm à la Python's difflib -- you will probably have to do some kind of diff in any case. This snippet calls the user with the most text churn the winner.

Excuse my hardcoding.

#!/usr/bin/env python

import collections
import difflib
import logging
import pprint
import urllib2
import re

class OwnageDeterminer(object):
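    # Additions count in full toward a user's churn score; deletions count half.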

    add_coefficient = 1
    remove_coefficient = .5

    def __init__(self, edits):
        self.edits = edits
        self.counts_by_username = {}

    def __call__(self):
        edits, counts_by_username = self.edits, self.counts_by_username
        for i, edit in enumerate(edits):
            username = edit['username']
            unique_counts = {'added': 0, 'removed': 0}
            existing_text = edits[i-1]['text'] if i > 0 else ''
            new_text = edits[i]['text']
            for char_diff in difflib.ndiff(existing_text, new_text):
                if char_diff.startswith('+'):
                    unique_counts['added'] += 1
                elif char_diff.startswith('-'):
                    unique_counts['removed'] += 1
            user_counts = counts_by_username.get(username, collections.defaultdict(int))
            user_counts['removed'] += self.remove_coefficient * unique_counts['removed']
            user_counts['added'] += self.add_coefficient * unique_counts['added']
            counts_by_username[username] = user_counts
        winner = None
        winning_score = 0
        score_by_username = {}
        for username, counts in counts_by_username.iteritems():
            score = counts['removed'] + counts['added']
            if score > winning_score:
                winner = username
                winning_score = score
            score_by_username[username] = score
        logging.debug('Scores: %s', pprint.pformat(score_by_username))
        return winner


if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    site = urllib2.urlopen('http://stackoverflow.com/revisions/327973/list')
    contents = site.read()
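    # Pair each revision's view-markup URI with the editing user's profile slug.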
    regex = re.compile(r'(/revisions/viewmarkup/\d+).*?/users/\d+/([\w-]+)',
                       re.MULTILINE|re.DOTALL)
    revisions = regex.findall(contents)
    print revisions
    edits = []
    for reluri, username in sorted(revisions, key=lambda t: t[0]):
        text = urllib2.urlopen('http://stackoverflow.com{0}'.format(reluri)).read()
        edit = {'username': username, 'text': text}
        edits.append(edit)
    od = OwnageDeterminer(edits)
    print od()

The output:

DEBUG:root:Scores: {'blorgbeard': 0.5,
 'dave-markle': 0.5,
 'dbr': 1172.0,
 'gatekiller': 69.5,
 'joseph-ferris': 0.0,
 'lkessler': 0.0,
 'mark-harrison': 592.0,
 'mdb': 3.0,
 'onebyone-livejournal-com': 0.0,
 'paul-oyster': 482.0,
 'rob-wells': 0.0,
 'simucal': 1070.5,
 'skiphoppy': 0.0,
 'thesoftwarejedi': 701.0}
dbr

Difflib docs on complexity:

Timing: The basic Ratcliff-Obershelp algorithm is cubic time in the worst case and quadratic time in the expected case. SequenceMatcher is quadratic time for the worst case and has expected-case behavior dependent in a complicated way on how many elements the sequences have in common; best case time is linear.

Another nice thing is that this winner calculation is linear, so you can cache the original results and do incremental updates on new edits, despite the large initialization load.

Nobody owns it. Awarding ownership violates the spirit of "community wiki" and may lead to counterproductive edit wars.

Off the top of my head I'd do something like this:

  • I think counting words as opposed to lines or characters makes sense
  • Tokenize the original revision into words, and attach the author to each
  • Step through revision history and as words are added, attach the author to them.
  • If words are deleted, just forget them.
  • If words are changed, you can either count them as a delete/insert, or have some sort of character-based threshold so that spelling corrections aren't attributed to a new author.
  • You end up with a list of words recording who originally wrote each one
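A minimal Python sketch of this word-attribution scheme, using difflib at word level (the 0.8 similarity threshold for treating a change as a spelling fix is an illustrative choice):

import collections
import difflib

def word_ownership(revisions):
    # revisions: chronological list of (author, text) pairs
    words, owners = [], []
    for author, text in revisions:
        new_words = text.split()
        new_owners = []
        matcher = difflib.SequenceMatcher(None, words, new_words)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == 'equal':
                new_owners.extend(owners[i1:i2])          # unchanged words keep their author
            elif op == 'replace' and (i2 - i1) == (j2 - j1):
                # Same-length word swaps: keep the old author for near-identical
                # words (spelling corrections), credit the editor otherwise.
                for k in range(j2 - j1):
                    sim = difflib.SequenceMatcher(
                        None, words[i1 + k], new_words[j1 + k]).ratio()
                    new_owners.append(owners[i1 + k] if sim > 0.8 else author)
            else:
                new_owners.extend([author] * (j2 - j1))   # added words go to the editor
        words, owners = new_words, new_owners
    tally = collections.Counter(owners)
    total = len(owners) or 1
    return {user: 100.0 * n / total for user, n in tally.items()}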

A question is whether removing characters is as valid as adding them. Using a diff may work well, but it doesn't always follow that the person who has added the most is the most valuable editor (some people can write in 10 characters what others take a page to write).

So I think you have two factors here:

  • Quantity of change
  • Quality of change

Hence I would be tempted to write something that simply maps words added/removed to a point score. Adding to this some kind of quality factor (which would have to be an external parameter), the combined value could then be used for sorting.

You could arguably generate some kind of primitive "quality factor" based on how long a contribution survived before being edited. So if something I wrote wasn't changed until the 12th edit, it couldn't have been too wrong (relative to something of lower quality that was changed as soon as it was added).

It's a wiki—why not let each editor choose the significance of his or her change? Provide a drop-down with something like...

Please qualify your edit:

  1. A few minor edits (spelling, grammar, etc)
  2. Many minor edits
  3. A few significant edits (changes to specific facts, figures, etc)
  4. Many significant edits
  5. A few major edits (changes to thesis, conclusion, etc)
  6. Many major edits

Then use the combined responses to estimate ownership.

How about this idea?

  • Original poster's name is displayed, until the post has been changed more than x% from the original text
  • This percentage is displayed as something like "x users, y edits, z% of original"
  • The "last edited" info currently shown could also include a diff amount; for example, it would make clear that the first edit to your question was a minor addition

The point is that after a certain amount of change it doesn't really matter who owns it. I agree with the users saying that trying to assign "owners" to wiki content is counter-productive.

Number of characters is tricky methinks.

How about:

Break latest revision into a set of sentences. 
//Sentence is any text fragment surrounded by punctuation

For each Sentence
    Find which user created that sentence. 
    Add 1 to the user who created the sentence 
    Add 1 to the number of sentences

For Each user 
    % ownership = Count for that user / Number of sentences. 

Finding which user created that sentence.
Matching the sentence to a revision is easy if you want an exact match, but I'd be happier with a partial match.

To do this partial match...

Strip common words from the sentence fragment, and search for that stripped fragment in a stripped version of each revision. The earliest revision that contains it identifies the owner.

Options: (instead of common word stripping)

  • For each word in the sentence, condense each word to the first and last characters. This'll account for spelling mistakes, etc.
  • Only use the first and last two words in the sentence fragment.
  • Matching 80% of the words would probably be good enough to attribute ownership. (This is a difficult thing for the computer to do - and it can't be cached)

Stripping common words

You already do this for search, so the libraries etc. are already there. Even if you use someone else's formula, you should maybe CommonWordStrip each revision before starting.
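A rough Python sketch of this sentence-matching idea (the stop-word list and sentence splitter are illustrative stand-ins for the real search infrastructure):

import re

# Illustrative stop-word list; a real implementation would reuse the search engine's.
STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'of', 'to', 'is', 'in', 'it', 'that'}

def strip_common(text):
    words = re.findall(r"[\w']+", text.lower())
    return ' '.join(w for w in words if w not in STOP_WORDS)

def sentence_ownership(revisions):
    # revisions: chronological list of (author, text) pairs
    sentences = re.split(r'(?<=[.!?])\s+', revisions[-1][1])
    stripped_revs = [(author, strip_common(text)) for author, text in revisions]
    counts = {}
    for sentence in sentences:
        fragment = strip_common(sentence)
        # The earliest revision containing the stripped fragment owns the sentence.
        for author, stripped in stripped_revs:
            if fragment and fragment in stripped:
                counts[author] = counts.get(author, 0) + 1
                break
    total = sum(counts.values()) or 1
    return {a: 100.0 * c / total for a, c in counts.items()}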

The key to having a good solution to this problem is to get extra information about what is going on with the editing. The extra information available in this medium is the voting and response rate to questions. So if someone makes an edit that leads to the question getting a lot of upvotes, comments, and responses, their edit was very valuable. Special cases probably exist for a brand new unanswered question that gets edited before it has been on the site for a very long time.

You should look at how much the post has changed using the Levenshtein distance algorithm and then weight their edit by the number of votes, comments, and answers the question receives after their edit.

Let n = total number of revisions
Let m = the revision number of a poster's edit
Let post[it] = the text of the post at revision 'it'
Let votes[it] = votes that revision 'it' received (also add a bonus for comments/answers)

value = 0
for (it = m; it < n; ++it) {
  value += (Levenshtein(post[m-1], post[m]) / average_length_post) * votes[it]
}

If you calculate this value for each edit, a user's ownership of a post is the total value of all their edits divided by the sum of all edit values for that post.

Not sure if it is possible, but you could count the number of characters added.

Example:

  • User 1 submits a question with 400 characters (User 1 has 400 of 400)
  • User 2 deletes 40 and adds 60 (User 1 has 360 and User 2 has 60)

If you revert you should also revert to the previous user/character count.

But... maybe it is just simpler and fairer to name the original poster...

On second thought: I edit a lot, but I never consider myself the "owner" of the text, because I just alter the presentation (formatting and grammar), not the contents.

The initial concept must be given a weight. Spelling corrections must be given a weight. Grammar/structure must be given a weight. Wording must be given a weight. Conciseness must be given a weight. Etc...

Weights are the only fair way to do it but also impossible to determine.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow