Question

I have a news site with an archive of more than 1 million articles. I created a word definitions database with about 3000 entries, each a word-definition pair.

What I want to do is add a definition next to every occurrence of these words in the articles. I can't make a static change because I may add a new keyword any day, so it has to be done in real time or cached.

The problem is that str_replace or preg_replace would be very slow when searching a text for 3000 keywords and replacing them.

Are there any fast alternatives?

Solution

str_replace won't work for you (unless you want "perl" in "superlative" to be a keyword); you need something that takes word boundaries into account (e.g. preg_replace with \b). Of course, you cannot preg_replace all 3000 keywords at once, but a single document can hardly contain them all, so I'd suggest pre-indexing all documents, for example by maintaining an index table doc_id->word_id. When serving a specific document, query the index and replace only the keywords that the document actually contains (presumably no more than 100).
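
The index idea can be sketched roughly as follows. This is only an illustration under assumptions: a plain PHP array stands in for the doc_id->word_id table, the function name index_document and the sample data are made up, and matching is case-insensitive, which the question does not specify.

```php
<?php
// Hypothetical in-memory stand-in for a doc_id -> word_id index table.

/**
 * Return the IDs of the keywords that occur as whole words in $text.
 * $keywords maps word_id => keyword string.
 */
function index_document(string $text, array $keywords): array
{
    $found = [];
    foreach ($keywords as $wordId => $keyword) {
        // stripos is a cheap pre-filter; the regex then confirms word boundaries.
        if (stripos($text, $keyword) !== false
            && preg_match('/\b' . preg_quote($keyword, '/') . '\b/i', $text)) {
            $found[] = $wordId;
        }
    }
    return $found;
}

// Made-up sample data for illustration only.
$keywords = [1 => 'perl', 2 => 'php', 3 => 'regex'];
$docs = [
    10 => 'A superlative article about PHP.',
    11 => 'Perl and regex go well together.',
];

// Build the index once; afterwards only new documents or new keywords need indexing.
$index = [];
foreach ($docs as $docId => $text) {
    $index[$docId] = index_document($text, $keywords);
}
```

With an index like this, adding a keyword means scanning the archive for that one word only, and serving a document means looking up its short list of word IDs instead of testing all 3000 keywords.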

On the other hand, if documents are short, maintaining the index table might not be worth the trouble. You can simply do the pre-indexing on the fly, e.g. with strpos:

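 $kw = array();
 foreach($all_keywords as $k)
     // strpos returns 0 for a match at the very start, so compare against false
     if(strpos($text, $k) !== false) $kw[] = preg_quote($k, '/');

 // $kw contains only words that actually occur in the text
 // (and perhaps some more, but that doesn't matter)

 // skip the replacement when nothing matched: an empty alternation
 // would match at every word boundary
 if($kw)
     $text = preg_replace_callback('/\b(' . implode('|', $kw) . ')\b/', 'insert_keyword', $text);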
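
The answer does not show what insert_keyword does. One possible shape, as a sketch only: the $definitions array, its contents, and the "word (definition)" output format are assumptions, and a closure is used here instead of a named function so that it can read the definitions without a global.

```php
<?php
// Hypothetical definitions table: keyword => definition.
$definitions = [
    'perl'  => 'a scripting language',
    'regex' => 'a pattern for matching text',
];

// Callback for preg_replace_callback: $m[1] is the keyword captured by the group.
$insert_keyword = function (array $m) use ($definitions): string {
    $word = $m[1];
    if (!isset($definitions[$word])) {
        return $word; // unknown keyword: leave the text unchanged
    }
    // Escape the definition in case the result is printed into HTML.
    return $word . ' (' . htmlspecialchars($definitions[$word]) . ')';
};

// Quote each keyword so regex metacharacters in it are treated literally.
$alternatives = array_map(
    function ($k) { return preg_quote($k, '/'); },
    array_keys($definitions)
);
$pattern = '/\b(' . implode('|', $alternatives) . ')\b/';

$text = 'perl is superlative at regex work';
$result = preg_replace_callback($pattern, $insert_keyword, $text);
```

Because of the \b anchors, the "perl" inside "superlative" is left alone while the standalone "perl" and "regex" get their definitions appended.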

OTHER TIPS

str_replace is pretty zippy and is, to my knowledge, the fastest you will find for PHP. You should certainly keep a cache; that will bypass performance issues.

This is just a suggestion to speed up the process, reduce errors, etc.

  1. Create a function that will batch the news archives.
  2. Create a function to replace the text. str_replace is my bet.
  3. Create a function to spawn PHP processes (refer to this thread).
  4. Add caching functions.
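
A rough sketch of the caching step, under stated assumptions: a PHP array stands in for a real cache backend, the function names (keyword_version, annotate, cached_annotate) are invented for illustration, and the replacement uses preg_replace_callback with word boundaries rather than str_replace. The cache key includes a hash of the keyword list, so adding a keyword makes old entries unreachable without an explicit purge.

```php
<?php
// Hypothetical sketch: an array stands in for a real cache backend.

function keyword_version(array $definitions): string
{
    // Changes whenever a keyword or definition is added, removed, or edited.
    ksort($definitions);
    return md5(serialize($definitions));
}

function annotate(string $text, array $definitions): string
{
    if ($definitions === []) {
        return $text;
    }
    // For brevity this puts every keyword in one pattern; with thousands of
    // keywords, pre-filter to the ones present in $text first.
    $alternatives = array_map(
        function ($k) { return preg_quote($k, '/'); },
        array_keys($definitions)
    );
    return preg_replace_callback(
        '/\b(' . implode('|', $alternatives) . ')\b/',
        function (array $m) use ($definitions) {
            return $m[1] . ' (' . $definitions[$m[1]] . ')';
        },
        $text
    );
}

function cached_annotate(int $docId, string $text, array $definitions, array &$cache): string
{
    $key = $docId . ':' . keyword_version($definitions);
    if (!isset($cache[$key])) {
        $cache[$key] = annotate($text, $definitions);
    }
    return $cache[$key];
}

$cache = [];
$defs = ['perl' => 'a scripting language'];
$first = cached_annotate(1, 'perl and php', $defs, $cache);

$defs['php'] = 'a web language'; // a new keyword changes the version
$second = cached_annotate(1, 'perl and php', $defs, $cache);
```

The trade-off is that every keyword change invalidates the whole cache; with one new keyword a day and articles annotated lazily on first view, that may be acceptable.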
Licensed under: CC-BY-SA with attribution