You have a fairly small reference table (your lexicon) and an enormous corpus of text (table 1).
If I were you, I would start your program by slurping the entire lexicon from the table into a PHP array in memory. Even if every one of your words were 20 characters long, this would only take a dozen or so megabytes of RAM.
Then do your step 4 by looking up the words in memory rather than with a SQL query. Your inner loop (for each word) will be much faster, and just as accurate.
Be careful about one thing, though. You'll need to normalize the words in your lexicon by converting them to lower case if you are to replicate the case-insensitive lookup behavior of MySQL.
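The normalization matters because MySQL's default (`_ci`) collations compare strings case-insensitively, while PHP array keys are strictly case-sensitive. A minimal illustration (the lexicon entry here is made up):

```php
<?php
// PHP array keys are case-sensitive, unlike MySQL's default
// case-insensitive string comparison. The word and score are
// invented for illustration.
$lexicon = ['cat' => 4.5];

var_dump(isset($lexicon['Cat']));             // false: exact key only
var_dump(isset($lexicon[strtolower('Cat')])); // true:  normalized key
```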
Edit after seeing your code
Some pro tips:
- Indent your code properly so you can see the structure of your loops at a glance.
- Remember that passing data to functions takes time.
- PHP arrays are associative. You can do `$value = $array[$key]`. This is fast; you don't have to search the array linearly. You're doing that once per word!
- Prepared statements are good.
- Repeating an SQL statement when you could read the next row from its result set is bad.
- Streaming result sets is good.
- The `mysql_` set of function calls is deprecated and despised by its developers, and everybody else, for good reasons.
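To see why the associative lookup matters: a linear scan is O(n) per word, while a key lookup is O(1). A quick sketch, with a made-up lexicon fragment:

```php
<?php
// Hypothetical lexicon fragment -- words and scores invented for illustration.
$lookup = ['dog' => 4.9, 'idea' => 1.8, 'table' => 4.4];

// Slow: linear scan over all entries, repeated once per word of the corpus.
function linear_find(array $lookup, string $word): ?float {
    foreach ($lookup as $key => $score) {
        if ($key === $word) {
            return $score;
        }
    }
    return null;
}

// Fast: direct hash lookup on the key.
$word  = 'idea';
$score = $lookup[$word] ?? null;

assert($score === linear_find($lookup, $word)); // same answer, far less work
```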
There's way too much going on in your loops.
What you need is this:
First of all, switch from the `mysql_` interface to `mysqli_`. Just do it. `mysql_` is too slow, old, and crufty.
$db = new mysqli("host", "user", "password", "database");
if ($db->connect_error) { die($db->connect_error); }
Second, change the way you are loading your lexicon, to optimize the whole associative-array dealio.
$lookup = array();
// Slurp the lexicon into an array, streaming it row by row
$sql = "SELECT word, score FROM concreteness";
$db->real_query($sql) or die($db->error);
$lkup = $db->use_result();
while ($row = $lkup->fetch_row()) {
    $lookup[strtolower($row[0])] = $row[1];
}
$lkup->close();
This gives you an associative array called `$lookup`. If you have a `$word`, you can find its weight value this way. This is fast; what you have in your example code is very slow. Notice that the keys are converted to lower case both when they are created and when words are looked up. Don't wrap this lookup in a function if you can avoid it, for performance reasons.
$lword = strtolower($word);
if (array_key_exists($lword, $lookup)) {
    $weight += $lookup[$lword]; /* accumulate weight */
    $count++;                   /* increment count */
}
else {
    /* the word was not found in your lexicon; handle as needed */
}
Finally, you need to optimize how you query the rows of your text corpus and how you update them. I believe you should do that using prepared statements.
Here's how that will go.
Near the beginning of your program, place this code.
$previouskey = -1;
if (/* you aren't starting at the beginning */) {
    $previouskey = /* the last successfully processed row */;
}
$get_stmt = $db->prepare('SELECT `key`, `tagged`
FROM speechesLCMcoded
WHERE `key` > ?
ORDER BY `key` LIMIT 1' );
$post_stmt = $db->prepare ('UPDATE speechesLCMcoded
SET weight=?,
count=?
WHERE `key`=?' );
These give you two ready-to-use statements for your processing. Notice that `$get_stmt` retrieves the first `key` you haven't yet processed. This works even if you have some missing keys, which is always good. It will be decently efficient because you have an index on your `key` column.
So here's what your loop ends up looking like:
$weight = 0;
$count = 0;
$key = 0;
$tagged = '';
/* bind parameters and results to the get statement */
$get_stmt->bind_result($key, $tagged);
$get_stmt->bind_param('i', $previouskey);
/* bind parameters to the post statement */
$post_stmt->bind_param('iii', $weight, $count, $key);
$done = false;
while (!$done) {
    $get_stmt->execute();
    if ($get_stmt->fetch()) {
        /* do everything word-by-word here on the $tagged string */
        /* do the post statement to store the results */
        $post_stmt->execute();
        /* update the previous key prior to the next iteration */
        $previouskey = $key;
        $get_stmt->reset();
        $post_stmt->reset();
    } /* end if fetch */
    else {
        /* no result returned! we are done! */
        $done = true;
    }
} /* end while not done */
This should get you down to subsecond processing per row.
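The word-by-word placeholder inside that loop might be fleshed out along these lines. This is a minimal sketch that assumes `$tagged` is a whitespace-separated string and that `$lookup` was built as shown earlier; the sample lexicon and sentence are invented, and you'd adapt the split to whatever tagging format your corpus actually uses:

```php
<?php
// Hypothetical lexicon -- in the real program this comes from the
// concreteness table, keyed by lowercased word.
$lookup = ['speech' => 3.25, 'dog' => 4.75];

$tagged = 'The dog heard a speech';
$weight = 0;
$count  = 0;

// Assumes words are separated by whitespace; adjust the split
// to match your actual tagging format.
foreach (preg_split('/\s+/', $tagged, -1, PREG_SPLIT_NO_EMPTY) as $word) {
    $lword = strtolower($word);
    if (array_key_exists($lword, $lookup)) {
        $weight += $lookup[$lword]; /* accumulate weight */
        $count++;                   /* increment count */
    }
}
// For this sample string, $weight ends up 8.0 and $count ends up 2.
```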