Question

The problem

I'm working on a creative writing application in Rails 4 and users have requested a feature that keeps them accountable to writing X words every day/week/month. What are the best ways to handle the problem of tracking words added over time?

My current solution

I store a limited history of total words for each user, allowing me to compare total words in all of their chapters today to the total words in all of their chapters yesterday, last week, or last month.
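Concretely, a minimal sketch of the kind of history I keep (the model and column names here are illustrative, not my actual schema):

# One row per user per snapshot date, holding their total word count.
class WordTotal < ActiveRecord::Base
  # columns: user_id, total_words, recorded_on (date)
end

today = WordTotal.find_by(user_id: user.id, recorded_on: Date.today)
last_week = WordTotal.find_by(user_id: user.id, recorded_on: 1.week.ago.to_date)
words_this_week = today.total_words - last_week.total_words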

Edge cases I'm not handling (and am not sure how to handle)

What if a user deletes a large portion of a chapter and rewrites it or deletes an entire chapter or story? I don't want to penalize them for throwing out what they've previously written.

EDIT:

I've just modified the Levenshtein algorithm to count all words added, removed, or substituted, to give writers credit towards their writing goals for all those activities. You can see the code here:

# Word-level Levenshtein distance: counts the words added, removed, or
# substituted to turn one string into another.
def words_changed_since(second)
  first = self.split
  second = second.split

  # Build a (second.length + 1) x (first.length + 1) distance matrix.
  # Row 0 and column 0 hold the distances from the empty word list.
  matrix = [(0..first.length).to_a]
  (1..second.length).each do |j|
    matrix << [j] + [0] * first.length
  end

  (1..second.length).each do |i|
    (1..first.length).each do |j|
      if first[j - 1] == second[i - 1]
        # Words match: no edit needed, carry over the diagonal value.
        matrix[i][j] = matrix[i - 1][j - 1]
      else
        # Cheapest of a removal, an addition, or a substitution, plus one.
        matrix[i][j] = [
          matrix[i - 1][j],
          matrix[i][j - 1],
          matrix[i - 1][j - 1],
        ].min + 1
      end
    end
  end

  # The bottom-right cell is the total number of words changed.
  matrix.last.last
end
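For example, with a couple of illustrative strings (one word substituted, one word added):

old_content = "the quick brown fox"
new_content = "the quick red fox jumps"
new_content.words_changed_since(old_content) # => 2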

That's monkey-patched onto the String class in an initializer so that I can call new_chapter_content.words_changed_since old_chapter_content, and it will never give me a negative number. I'm open to feedback on the algorithm, but I'm pretty happy with it for now. My biggest questions are now:

  1. Should I just store this in my postgres db, or should I use another store like redis?
  2. Would it be a Very Bad Idea to not expire daily words and even track more frequently than daily, like every hour that the user writes? This would allow me to give writers a very granular history of their writing and also help them keep track of when they're most productive.

Solution

One very good but somewhat complex solution would be to use external software to compare the text before and after each update. Git would be an obvious choice, and you could then even have version history, the way GitHub Pages and wikis work now. There are also plenty of other programs whose only purpose is comparing texts and finding differences; just search for "text comparison tool" on Google.

Edit (git integration tools):

I found these gems that can be used to call git commands from Ruby:
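As a rough sketch of the idea (using the ruby-git gem as one plausible candidate, since the original gem links are not preserved here, and with hypothetical paths):

require 'git'

# Each user gets a repository holding their chapters as plain-text files.
repo = Git.open('/data/writing_repos/user_42')

# Write the new chapter content, then commit it as a new revision.
File.write('/data/writing_repos/user_42/chapter_1.txt', new_chapter_content)
repo.add('chapter_1.txt')
repo.commit('Autosave chapter 1')

# Word history can then be derived from the diff of any two commits.
puts repo.diff('HEAD~1', 'HEAD').patch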

Edit 2 (text comparison tools):

Here are some resources I found that could be useful for comparing texts:

Ruby Gems

Online APIs

Edit 3 (my answers to the last questions): Good solution with the Levenshtein algorithm! I will try to answer your last two questions, but there is no single right answer, so this is just my opinion:

  1. Should I just store this in my postgres db, or should I use another store like redis?

This is not really a key/value situation, and even if you changed the implementation, I can't see any reason to use Redis. Maybe if you later experience performance problems, but for now Redis would be a premature and probably unnecessary optimization.

  2. Would it be a Very Bad Idea to not expire daily words and even track more frequently than daily, like every hour that the user writes? This would allow me to give writers a very granular history of their writing and also help them keep track of when they're most productive.

No, it's not a bad idea. Postgres, and most SQL databases in general, are optimized to query a LOT of rows. It's faster to query one table with many rows than several tables (e.g. with joins) with few rows.

However, this also depends on how you are going to use the data. Will you just query the last day or so, or will you need the whole history of a user's changes fairly often, for example to compute statistics? If that's the case, you should probably consider optimizing by keeping a table of data summarized over longer periods. I do this myself in some simple accounting software I've written, for showing stats on income and expenses (by showing summaries of each week instead of each single transaction separately).
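As an illustration of that kind of summary table (the model, table, and column names here are hypothetical), the granular rows can be rolled up per week with a grouped query and cached:

# WordCount rows are written hourly; WeeklyWordSummary caches one row
# per user per week.
weekly_totals = WordCount
  .where(user_id: user.id)
  .group("date_trunc('week', recorded_at)")
  .sum(:words_changed)

weekly_totals.each do |week_start, words|
  summary = WeeklyWordSummary.find_or_initialize_by(user_id: user.id, week_start: week_start)
  summary.update(words_changed: words)
end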

OTHER TIPS

Our Solution

We do similar things on a large scale. If you're worried about scalability, then keeping this code inside a Rails app backed by a basic Postgres database is not your best choice.

If you're going to be adding a bunch of metrics like this, counting words and word diffs per user, you should consider setting up a stream-processing or batch-processing platform. These solutions are not trivial, but they're worth it if you're going to need scale.

Our solution uses Twitter Storm (http://storm-project.net) with the data counters in Mongo. In fact, their canonical example is a word-count application. Redis, which you asked about, actually isn't a bad choice. I disagree with @jokklan, because Redis can implement counter storage with next-to-no effort.
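To illustrate (a sketch using the redis-rb gem, with made-up key names), per-user daily and hourly word counters are just atomic increments:

require 'redis'

redis = Redis.new

# Credit the words the user just wrote to per-day and per-hour counters.
now = Time.now
redis.incrby("words:#{user_id}:#{now.strftime('%Y-%m-%d')}", words)
redis.incrby("words:#{user_id}:#{now.strftime('%Y-%m-%d-%H')}", words)

# Reading a day's total back:
redis.get("words:#{user_id}:#{Date.today}").to_i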

We do select the data out of a SQL database, so Postgres isn't a bad choice to start with, but it will probably be the first thing you rip out when you really start to scale this thing.

We have also forked storm-deploy to help bring up Storm servers more reliably. https://github.com/korrelate/storm-deploy

Other Options

Obviously, though, there are a bunch of different platforms to choose from.

  1. You can use Hadoop MapReduce (http://hadoop.apache.org/docs/stable/mapred_tutorial.html)

  2. Pig, which we use for other stuff through Mortar Data (http://www.mortardata.com)

  3. Amazon EMR, which would allow you to do basic MapReduce or Pig jobs, but this is more of a platform choice, not a framework and implementation choice

  4. Run some background jobs to compile this information using Sidekiq (https://github.com/mperham/sidekiq), Resque (not really recommended given Sidekiq's advancements), or Iron Worker, which runs as a service (http://www.iron.io/worker); see the Sidekiq sketch after this list

    Here's a good article on some of the choices I've mentioned and probably some others (http://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/).
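For option 4, a background job that computes the word diff and records it could look roughly like this with Sidekiq (Chapter and WordCount are hypothetical models; words_changed_since is the method from the question):

# app/workers/word_count_worker.rb
class WordCountWorker
  include Sidekiq::Worker

  def perform(chapter_id, old_content)
    chapter = Chapter.find(chapter_id)
    words = chapter.content.words_changed_since(old_content)
    # Record the credit toward the writer's goal.
    WordCount.create!(user_id: chapter.user_id,
                      words_changed: words,
                      recorded_at: Time.now)
  end
end

# Enqueued from the controller or model after a save:
WordCountWorker.perform_async(chapter.id, old_content)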

Recommendation

I honestly can't give you a good recommendation without more information about the sort of scale you're talking about; with it, I might be able to help narrow down your choices. How many users? Are you serious about providing all that granularity (that's fine if you are; it just helps determine scale)? Are there other things you'll want to do besides counting and diffing?

This is a method similar to what you have proposed, but it would be based on saves, and it would make for a smaller table. You could have a model associated with the text, say DailyText, with just user_id, day, expiry_date, and number_of_words. Then you could have triggers on the table(s) that store your text that essentially do the following:

On each insert or update of the text, run something like: update daily_text set number_of_words = number_of_words + (length(:new) - length(:old)) where day = current_date and user_id = :user_id

This would give you some flexibility: you could clamp length(:new) - length(:old) so it doesn't go below zero, or even count removed words separately in a removed_words column.

Or you could have a method in whatever program you're using that stores the length before and after a save and then updates this simple table. It would essentially work the same way as a database trigger.
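A minimal sketch of that application-side approach, assuming an ActiveRecord model with a content column and the hypothetical DailyText model described above:

class Chapter < ActiveRecord::Base
  before_save :remember_old_word_count
  after_save :update_daily_words

  private

  def remember_old_word_count
    # content_was is the value currently stored in the database.
    @old_words = content_was.to_s.split.length
  end

  def update_daily_words
    delta = content.to_s.split.length - @old_words
    daily = DailyText.find_or_create_by(user_id: user_id, day: Date.today)
    # Clamp delta at zero here if deletions shouldn't count against the writer.
    daily.increment!(:number_of_words, delta)
  end
end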

The expiry date would then just give you the ability to clear the database of old data.

Or, if you wanted a really small table, you could make day the day of the year (1..365) and have a task that runs at midnight to clear the next day's data.

Hope that makes sense
