Question

My function parses texts and removes short words, such as "a", "the", "in", "on", "at", etc.

The list of these words might be modified in the future. Switching between different lists (e.g., for different languages) might also be an option.

So, where should I store such a list?

  • About 50-200 words
  • Many reads every minute
  • Almost no writes (modifications) - for example, once in a few months

I have these options in mind:

  1. A list inside the code (fastest, but it doesn't sound like good practice)
  2. A separate file, "stop_words.txt" (how fast is reading from a file? Should I re-read the same data from the same file every time I call the function, i.e. every few seconds?)
  3. A database table (would that really be efficient when the list of words is supposed to be almost static?)

I am using Ruby on Rails (if that makes any difference).


Solution

If it's only about 50-200 words, I'd store it in memory in a data structure that supports fast lookup, such as a hash map (I don't know what such a structure is called in Ruby).

You could use option 2 or 3 (persist the data in a file or database table, depending on what's easier for you), then read the data into memory at the start of your application. Store the time at which the data was read, and when a request comes in and the in-memory copy hasn't been refreshed for X minutes, re-read it from the persistent storage.

That's basically a cache. It might be possible that Ruby on Rails already provides such a mechanism, but I know too little about it to answer that.
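As a rough illustration of the time-based refresh described above, here is a minimal sketch in plain Ruby. The class name `StopWordCache`, the `ttl:` parameter and the loader block are all made up for this example; they are not part of Rails or of the asker's code.

```ruby
require "set"

# Hypothetical helper: keeps the word list in memory and reloads it
# once `ttl` seconds have passed since the last load.
class StopWordCache
  # ttl:    maximum age of the in-memory copy, in seconds
  # loader: block that returns the words (from a file, a table, ...)
  def initialize(ttl:, &loader)
    @ttl = ttl
    @loader = loader
    @mutex = Mutex.new
    @words = nil
    @loaded_at = nil
  end

  # True if `word` is in the (possibly just refreshed) list.
  def include?(word)
    words.include?(word.downcase)
  end

  private

  def words
    # The mutex keeps concurrent requests from reloading at the same time.
    @mutex.synchronize do
      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      if @words.nil? || now - @loaded_at >= @ttl
        @words = Set.new(@loader.call)
        @loaded_at = now
      end
      @words
    end
  end
end

# Example (assumes a file with one word per line):
#   cache = StopWordCache.new(ttl: 300) do
#     File.readlines("stop_words.txt", chomp: true)
#   end
#   cache.include?("the")
```

In a Rails application, something like `Rails.cache.fetch` with an `expires_in` option may cover the same need without a hand-written class; whether that is a better fit depends on the configured cache store.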

OTHER TIPS

Since look-up of the stop-words needs to be fast, I'd store the stop-words in a hash table. That way, checking whether a word is a stop-word takes O(1) time on average.
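In Ruby, a hash-based structure for this kind of membership test is `Set` (or a plain `Hash`). The sketch below uses a tiny sample list; the constant `STOP_WORDS` and the method `remove_stop_words` are names invented for the example, and splitting on whitespace is a simplifying assumption.

```ruby
require "set"

# Hypothetical sample list; a real list would hold the 50-200 words.
STOP_WORDS = Set.new(%w[a the in on at]).freeze

# Returns the words of `text` that are not stop words.
# Words are compared case-insensitively; punctuation is not handled.
def remove_stop_words(text, stop_words = STOP_WORDS)
  text.split.reject { |word| stop_words.include?(word.downcase) }
end
```

Passing the set as an argument keeps it easy to swap in a different list, for example one per language.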

Now, since the list of stop-words may change, it makes sense to persist the list in a text file, and read that file upon program start (or every few minutes / upon file modification if your program runs continuously).
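One possible way to reload "upon file modification" is to compare the file's modification time on each access. This is only a sketch: the class name `StopWordFile` is hypothetical, and it assumes a text file with one word per line.

```ruby
require "set"

# Hypothetical loader: re-reads the file only when its
# modification time differs from the one seen at the last load.
class StopWordFile
  def initialize(path)
    @path = path
    @mtime = nil
    @words = Set.new
  end

  # Returns the current set of stop words, reloading if the file changed.
  def words
    mtime = File.mtime(@path)
    if mtime != @mtime
      lines = File.readlines(@path, chomp: true)
      # Normalise entries and skip blank lines.
      @words = Set.new(lines.map { |w| w.strip.downcase }.reject(&:empty?))
      @mtime = mtime
    end
    @words
  end
end
```

Each call costs one `File.mtime` check, which is usually much cheaper than re-reading the file, but under heavy load it could be combined with a time-based check so the file system is not queried on every request.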

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow