Question

I need advice regarding text analysis. The program is written in PHP.

My code needs to receive a URL and look up each word on the site in the DB to find a match.

The tricky part is that the words aren't always written in the DB as they appear in the text.

Example:

Let's say my DB has these values: Word = letters

And the site has: Wordy thing

I'm supposed to output: Letters thing

My code applies several regexes to each word and, after each one, tries to match the result against the DB.

For each word that isn't found I make 8 queries to the DB. Most words don't have a match, so for a whole website with hundreds of words my CPU usage spikes.

I thought about storing every word that isn't found in the DB globally, as it appears (disk space costs less than CPU), or maybe building an array or dictionary to hold all of that.

I'm really confused by this project. It's supposed to serve a lot of users, but with the current code the server will die after 10-20 user requests.

Any thoughts?

Edit: The searched words aren't English words, and the code runs on a Windows Server 2008 machine.

Solution 4

Thank you all for your answers. Unfortunately none of the answers helped me, maybe I wasn't clear enough.

I ended up solving the issue by creating a hash table with all of the words in the DB (about 6000 words), and checking against the hash instead of the DB.
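A minimal sketch of that idea, not the actual code from the project. It assumes the DB rows can be loaded once into a PHP array, and the names `buildLookup`, `candidateForms`, `translateWord` and the suffix list are hypothetical stand-ins for the regex steps mentioned in the question:

```php
<?php
// Lower-case helper; falls back to strtolower() when mbstring is unavailable.
function lower(string $s): string
{
    return function_exists('mb_strtolower') ? mb_strtolower($s, 'UTF-8') : strtolower($s);
}

// Build the hash table once: key = lower-cased DB word, value = its replacement.
// $rows stands in for the result of a single "SELECT word, replacement ..." query.
function buildLookup(array $rows): array
{
    $lookup = [];
    foreach ($rows as $row) {
        $lookup[lower($row['word'])] = $row['replacement'];
    }
    return $lookup;
}

// Generate the forms to try: the word itself, then the word with each
// matching suffix removed (an assumed stand-in for the original regexes).
function candidateForms(string $word, array $suffixes): array
{
    $forms = [$word];
    foreach ($suffixes as $suffix) {
        $len = strlen($suffix);
        if ($len > 0 && strlen($word) > $len && substr($word, -$len) === $suffix) {
            $forms[] = substr($word, 0, -$len);
        }
    }
    return $forms;
}

// Look up every candidate form in memory; no DB query per word.
function translateWord(string $word, array $lookup, array $suffixes): string
{
    foreach (candidateForms(lower($word), $suffixes) as $form) {
        if (isset($lookup[$form])) {
            return $lookup[$form];
        }
    }
    return $word; // no match: keep the original word
}

$lookup = buildLookup([['word' => 'Word', 'replacement' => 'letters']]);
$suffixes = ['y', 's']; // hypothetical suffix list

$out = array_map(
    fn ($w) => translateWord($w, $lookup, $suffixes),
    explode(' ', 'Wordy thing')
);
echo implode(' ', $out), "\n";
```

Each `isset()` check is a constant-time array lookup, so trying several forms per word stays cheap compared with one DB round trip per form. Restoring the original capitalization is left out of this sketch.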

The execution time started out at 4 seconds and is now 0.5 seconds! :-)

Thanks again

OTHER TIPS

Implement a trie and compute the Levenshtein distance? See this blog post for a detailed walkthrough of an implementation: http://stevehanov.ca/blog/index.php?id=114
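As a rough illustration of the distance part only, here is a sketch that uses PHP's built-in `levenshtein()` with a plain linear scan over the word list. It is not the trie-based search from the linked post, which avoids comparing against every word. The function name `closestWord` and the threshold of 2 are assumptions. Note that `levenshtein()` works on bytes, so for multibyte (non-English) words the distances count byte edits rather than character edits:

```php
<?php
// Return the dictionary word closest to $word, or null if none is
// within $maxDistance edits.
function closestWord(string $word, array $dictionary, int $maxDistance = 2): ?string
{
    $best = null;
    $bestDistance = $maxDistance + 1;
    foreach ($dictionary as $candidate) {
        // Cheap pre-filter: the length difference is a lower bound on the distance.
        if (abs(strlen($candidate) - strlen($word)) >= $bestDistance) {
            continue;
        }
        $distance = levenshtein($word, $candidate);
        if ($distance < $bestDistance) {
            $best = $candidate;
            $bestDistance = $distance;
        }
    }
    return $best;
}

$dictionary = ['word', 'letter', 'thing']; // hypothetical word list
var_dump(closestWord('wordy', $dictionary));    // string(4) "word"
var_dump(closestWord('zzzzzzzz', $dictionary)); // NULL
```

With a few thousand words a linear scan like this may already be fast enough; the trie becomes worthwhile as the list grows.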

Seems to me like a job for Sphinx and stemming.

Possibly a stupid question, but have you considered using a LIKE clause in your SQL query? Something like this:

$sql = "SELECT * FROM `your_table` WHERE `your_field` LIKE 'your_search'";

I've usually found that whenever I have to do too much string manipulation on the return values from a query, I can get it done more easily on the SQL side.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow