Question

I have a Ruby-on-Rails model:

class Candidate < ActiveRecord::Base
  validates_presence_of :application_essay
  validate :validate_length_of_application_essay

  protected

  def validate_length_of_application_essay
    return if application_essay.blank? # don't add a second error message if they didn't fill it out
    errors.add(:application_essay, :too_long) unless ...
  end
end

Without dropping into C, what is the fastest way to check that the application_essay contains no more than 500 words? You can assume that most essays will be at least 200 words, are unlikely to be more than 5000 words, and are in English (or the pseudo-English sometimes called "business-ese"). You can also classify anything you want as a "word" as long as your classification would be immediately obvious to a typical user. (NB: this is not the place to debate what a "typical user" is :) )

Solution

I would just use something like:

string.split(" ").length <= 500

What performance issue are you seeing? A string of 500 or so words shouldn't be much of a problem.
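As a minimal sketch, the check could be wrapped in a standalone method. The name `within_word_limit?` and the default of 500 are assumptions for illustration, not something from the original code:

```ruby
# Hypothetical helper: true when the text has at most `limit`
# whitespace-separated words.
def within_word_limit?(text, limit = 500)
  # split(" ") splits on runs of whitespace and ignores leading whitespace
  text.split(" ").length <= limit
end

essay = "word " * 500
puts within_word_limit?(essay)            # => true  (500 words)
puts within_word_limit?(essay + "extra")  # => false (501 words)
```

The validation method in the question could then call something like this in place of its elided condition.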

OTHER TIPS

In Rails 3, using :tokenizer with a lambda works too.

validates_length_of :essay, :minimum => 100, :too_short => "Your essay must be at least 100 words.", :tokenizer => lambda {|str| str.scan(/\w+/) }

It may not be the fastest, but it is certainly the cleanest way.
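One thing to watch: `scan(/\w+/)` and `split(" ")` can disagree on what a "word" is, because `\w` does not match apostrophes or hyphens. A small plain-Ruby comparison (the sample sentence is made up for illustration):

```ruby
sentence = "Don't over-think it"

by_whitespace = sentence.split(" ")   # tokens separated by whitespace
by_word_chars = sentence.scan(/\w+/)  # runs of letters, digits and underscores

puts by_whitespace.length  # => 3
puts by_word_chars.length  # => 5
```

Which of the two matches a typical user's idea of a word is a judgment call; the whitespace version is probably closer to what a word processor reports.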

You're not going to get any faster than a linear search, sorry (unless this is for some sort of text editor, where you can keep track of the count incrementally).
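A linear scan can at least stop as soon as the limit is exceeded, instead of tokenizing the whole essay. The following is only a sketch; `at_most_words?` is a hypothetical name, and whether it actually beats `split` in practice would need benchmarking:

```ruby
# Hypothetical helper: stops scanning once the word limit is exceeded.
def at_most_words?(text, limit)
  count = 0
  # With a block, String#scan yields each match instead of building an array
  text.scan(/\S+/) do
    count += 1
    return false if count > limit  # early exit: no need to read the rest
  end
  true
end

puts at_most_words?("one two three", 3)       # => true
puts at_most_words?("one two three four", 3)  # => false
```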

You could estimate the typical length of a word and guess the number of words by dividing the character count by it.

Some hints here: http://blogamundo.net/lab/wordlengths/

You could try something like 5.1 and see how accurate you are by running a few tests.

Well, probably dividing by 6.1 is better, since you also have whitespace.

Keep in mind that you would be assuming your text is not just a huge amount of whitespace or something similar. But if you are really just interested in making sure it has no more than x words, you could try a low number, maybe 5: if the text has fewer than x times 5 characters, you can be pretty sure it does not have more than x words.

So you are maybe better off doing a linear search, as stated in the other answers. A linear search isn't that bad at all. It just depends on what you want to do.
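To make the idea concrete, here is a rough sketch. The 6.1 divisor is the guess from above. The "x times 5" threshold is only a heuristic, so the pre-check below swaps in a stricter bound that cannot be wrong for whitespace-separated words: x + 1 words need at least 2x + 1 characters. All method names are hypothetical:

```ruby
AVG_CHARS_PER_WORD = 6.1 # assumption: the rough figure above, space included

# Rough guess at the word count from the character count alone.
def estimated_word_count(text, avg = AVG_CHARS_PER_WORD)
  (text.length / avg).round
end

# Cheap pre-check: limit + 1 whitespace-separated words need at least
# 2 * limit + 1 characters (one per word plus the separators between them),
# so a text of at most 2 * limit characters cannot exceed the limit.
def surely_within_limit?(text, limit)
  text.length <= 2 * limit
end

def word_limit_ok?(text, limit = 500)
  return true if surely_within_limit?(text, limit)
  text.split(" ").length <= limit  # fall back to the exact linear count
end
```

Since most essays here are expected to run past 200 words, the shortcut would rarely fire, which supports the point that the plain linear count is usually good enough.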

There's a plugin for that, though I haven't used it myself :)

http://code.google.com/p/validates-word-count/

That plugin collapses each run of adjacent "word characters" into a single character, then removes all non-word characters and counts what remains. Not sure if it's the fastest, though.
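Going only by that description, the approach might look roughly like this. It is a sketch of the idea, not the plugin's actual code, and `word_count_by_collapsing` is a made-up name:

```ruby
# Sketch of the described approach; NOT taken from the plugin itself.
def word_count_by_collapsing(text)
  collapsed = text.gsub(/\w+/, "x")  # each run of word characters becomes one "x"
  collapsed.gsub(/\W/, "").length    # drop everything else, count what is left
end

puts word_count_by_collapsing("The quick, brown fox.")  # => 4
```

This makes two passes with `gsub` and builds intermediate strings, so it is unlikely to be faster than a single `split` or `scan`.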

Here is a nice article that you might like

http://dotnetperls.com/word-count

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow