Question

I have two tables. Table 1 has around 400K rows, where each row includes a paragraph of text that can run to 50 sentences. Table 2 is a lexicon of 80K words, each with a score that I need in order to code every word of every paragraph.

The whole point of my PHP script is to explode each paragraph of text into words, look up each word's score in the lexicon, and end up with a total score for all of the words in each row.

My strategy so far was to have a script that did the following:

  1. Connect to the database, Table 1
  2. While Loop, one row after another
  3. For the current row, explode the paragraph.
  4. For each word, look up into Table 2 if the word exists and return the score.
  5. Ending up with a total score for the current row.
  6. Updating Table 1 with the total score for the current paragraph.
  7. Going back to point number 2.

My code works but is not efficient. The problem is that the script is so slow that letting it run for an hour just calculates the first 500 rows. That's a problem because I have 400K rows. I will need this script for other projects.

What would you advise me to do to make this process less resource-intensive?

<?php

// Include functions
include "functions.php";
ini_set('max_execution_time', 9000000);
echo 'Time Limit = ' . ini_get('max_execution_time');
$db = 'senate';

// Function to search the lexicon array for a word
function searchForId($id, $array) {
    foreach ($array as $key2 => $val) {
        if ($val['word'] === $id) {
            return $key2;
        }
    }
    return null;
}

// Tags to remove
$remove = array('{J}','{/J}','{N}','{/N}','{V}','{/V}','{RB}','{/RB}');
$x = 1;

// Connecting to the database ($conn presumably comes from functions.php)
if (!$conn) {
    die('Not connected : ' . mysql_error());
}

// Choose the current db
mysql_select_db($db);

// Slurp the lexicon into an array
$sql = "SELECT word, score FROM concreteness";
$resultconcreteness = mysql_query($sql) or die(mysql_error());
$array = array();
while ($row = mysql_fetch_assoc($resultconcreteness)) {
    $array[] = $row;
}

// Main loop, one row per iteration
while ($x <= 500000) {
    $data = mysql_query("SELECT `key`, `tagged` FROM speechesLCMcoded WHERE `key`='$x'") or die(mysql_error());

    // Put the row's data into the $info array
    $info = mysql_fetch_array($data);
    $tagged = $info['tagged'];
    $weight = 0;
    $count = 0;

    // Print out the contents of the entry
    print "<b>Key:</b> " . $info['key'] . " <br>";

    // Explode the paragraph into words
    $speech = explode(" ", $tagged);

    // Loop over every word
    foreach ($speech as $word) {
        // Check if the string contains one of our tags
        if (preg_match('/({V}|{J}|{N}|{RB})/', $word, $matches)) {
            // Remove our tags
            $word = str_replace($remove, "", $word);

            $id = searchForId($word, $array);
            $weight = $weight + $array[$id]['score'];
            $count = $count + 1;
        }
    }

    $sql = "UPDATE speechesLCMcoded SET weight='$weight', count='$count' WHERE `key`='$x';";
    $retval = mysql_query($sql, $conn);
    if (!$retval) {
        die('Could not update data: ' . mysql_error());
    }
    echo "Updated data successfully\n";
    ob_flush();
    flush();

    // Increase the loop counter by one
    $x = $x + 1;
}
?>

Here is the table definition:

CREATE TABLE `speechesLCMcoded` (
 `key` int(11) NOT NULL AUTO_INCREMENT,
 `speaker_state` varchar(100) NOT NULL,
 `speaker_first` varchar(100) NOT NULL,
 `congress` varchar(100) NOT NULL,
 `title` varchar(100) NOT NULL,
 `origin_url` varchar(100) NOT NULL,
 `number` varchar(100) NOT NULL,
 `order` varchar(100) NOT NULL,
 `volume` varchar(100) NOT NULL,
 `chamber` varchar(100) NOT NULL,
 `session` varchar(100) NOT NULL,
 `id` varchar(100) NOT NULL,
 `raw` mediumtext NOT NULL,
 `capitolwords_url` varchar(100) NOT NULL,
 `speaker_party` varchar(100) NOT NULL,
 `date` varchar(100) NOT NULL,
 `bills` varchar(100) NOT NULL,
 `bioguide_id` varchar(100) NOT NULL,
 `pages` varchar(100) NOT NULL,
 `speaker_last` varchar(100) NOT NULL,
 `speaker_raw` varchar(100) NOT NULL,
 `tagged` mediumtext NOT NULL,
 `adjectives` varchar(10) NOT NULL,
 `verbs` varchar(10) NOT NULL,
 `nouns` varchar(10) NOT NULL,
 `weight` varchar(50) NOT NULL,
 `count` varchar(50) NOT NULL,
 PRIMARY KEY (`key`)
) ENGINE=InnoDB AUTO_INCREMENT=408344 DEFAULT CHARSET=latin1

Solution

You have a fairly small reference table (your lexicon) and an enormous corpus of text (table 1).

If I were you I would start your program by slurping the entire lexicon from the table into a php array in memory. Even if all your words are 20 characters in length this will only take a dozen or so megabytes of RAM.

Then do your step 4 by looking up each word in memory rather than with a SQL query. Your inner loop (for each word) will be much faster, and just as accurate.

Be careful about one thing, though. You'll need to normalize the words in your lexicon by converting them to lower case if you are to replicate the case-insensitive lookup behavior of MySQL.
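To illustrate (a minimal sketch; the word and score are made up): MySQL's default collations compare strings case-insensitively, while PHP array keys and `===` are case-sensitive, so you must lower-case both the lexicon keys and the words you look up.

```php
<?php
// MySQL under a *_ci collation treats 'Senate' = 'senate' as a match;
// PHP array lookups do not, so normalize case on both sides.
$lexicon = array('senate' => 4.1);   // keys stored lower-case (sample entry)
$word = 'Senate';                    // word as it appears in the text
$key = strtolower($word);
$score = isset($lexicon[$key]) ? $lexicon[$key] : null;
// $score is 4.1; without strtolower() the lookup would miss
```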

Edit after seeing your code

Some pro tips:

  • Indent your code properly so you can see the structure of your loops at a glance.
  • Remember that passing data to functions takes time.
  • PHP arrays are associative. You can do $value = $array[$key], and it is fast. You don't have to search an array linearly, yet you're doing exactly that once per word!
  • Prepared statements are good.
  • Repeating an SQL statement when you could read the next row from its result set is bad.
  • Streaming result sets is good.
  • The mysql_ set of function calls are deprecated and despised by their developers, and everybody else, for good reasons.

There's way too much going on in your loops.

What you need is this:

First of all, switch from the mysql_ interface to mysqli_. Just do it. mysql_ is too slow, old, and crufty.

$db = new mysqli("host", "user", "password", "database");

Second, change the way you are loading your lexicon, to optimize the whole associative-array dealio.

$lookup = array();
//Slurps the lexicon into an array, streaming it row by row
$sql = "SELECT word, score FROM concreteness";
$db->real_query($sql) || die($db->error);
$lkup = $db->use_result();
while ($row = $lkup->fetch_row()) {
      $lookup[strtolower($row[0])] = $row[1];
}
$lkup->close();

This gives you an associative array called $lookup. If you have a $word, you can find its weight this way, and the lookup is fast, unlike the linear search in your example code. Notice that the keys are converted to lower case both when the array is built and when words are looked up. Don't wrap this lookup in a function if you can avoid it, for performance reasons.

if (array_key_exists( strtolower($word), $lookup )) {
    $weight += $lookup[strtolower($word)]; /* accumulate weight */
    $count ++;                             /* increment count   */
}
else {
  /* the word was not found in your lexicon. handle as needed */
}

Finally, you need to optimize your querying of the rows of your text corpus, and its updating. I believe you should do that using prepared statements.

Here's how that will go.

Near the beginning of your program, place this code.

$previouskey = -1;
if (/* you aren't starting at the beginning */) {
   $previouskey = /* the last successfully processed row */
}

$get_stmt = $db->prepare('SELECT `key`, `tagged` 
                           FROM speechesLCMcoded 
                          WHERE `key` > ?
                          ORDER BY `key` LIMIT 1' );

$post_stmt = $db->prepare ('UPDATE speechesLCMcoded 
                               SET weight=?, 
                                   count=? 
                             WHERE `key`=?' );

These give you two ready-to-use statements for your processing.

Notice that the $get_stmt retrieves the first key you haven't yet processed. This will work even if you have some missing keys. Always good. This will be decently efficient because you have an index on your key column.

So here's what your loop ends up looking like:

 $weight = 0;
 $count = 0;
 $key = 0;
 $tagged = '';

 /* bind parameters and results to the get statement */
 $get_stmt->bind_result($key, $tagged);
 $get_stmt->bind_param('i', $previouskey);

 /* bind parameters to the post statement */
 $post_stmt->bind_param('dii', $weight, $count, $key); /* 'd' because scores may be fractional */

 $done = false;
 while ( !$done ) {
    $get_stmt->execute();
    if ($get_stmt->fetch()) {

        /* do everything word - by - word  here on the $tagged string */

        /* do the post statement to store the results */
        $post_stmt->execute();

        /* update the previous key prior to next iteration */
        $previouskey = $key; 
        $get_stmt->reset();
        $post_stmt->reset();
    } /* end if fetch */
    else {
       /* no result returned! we are done! */
       $done = true;
    }
 } /* end while not done */

This should get you down to subsecond processing per row.
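The word-by-word placeholder in the loop above can be filled in with the tag handling from the original script, pointed at the in-memory $lookup array. A sketch: $tagged and the lexicon entries here are made-up sample data, while $remove and the tag pattern come from the question's code.

```php
<?php
// Process one $tagged paragraph against the in-memory lexicon.
$remove = array('{J}','{/J}','{N}','{/N}','{V}','{/V}','{RB}','{/RB}');
$lookup = array('senate' => 4.1, 'law' => 3.2);   // sample lexicon entries

$tagged = '{N}Senate{/N} passed the {N}law{/N} quickly';
$weight = 0;
$count  = 0;
foreach (explode(' ', $tagged) as $word) {
    // Only words carrying one of the part-of-speech tags are scored.
    if (preg_match('/({V}|{J}|{N}|{RB})/', $word)) {
        $word = strtolower(str_replace($remove, '', $word));
        if (array_key_exists($word, $lookup)) {
            $weight += $lookup[$word];   // accumulate this word's score
            $count++;
        }
    }
}
// Here $count is 2 and $weight is approximately 7.3.
```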

Other tips

First and obvious optimization is like this:

include "functions.php";
set_time_limit(0); // NOTE: no time limit
if (!$conn)
    die('Not connected : ' . mysql_error());
$remove = array('{J}','{/J}','{N}','{/N}','{V}','{/V}','{RB}','{/RB}'); // tags to remove       
$db = 'senate';
mysql_select_db($db);

$resultconcreteness = mysql_query('SELECT `word`, `score` FROM `concreteness`') or die(mysql_error());
$array = array(); // NOTE: init score cache
while($row = mysql_fetch_assoc($resultconcreteness))
    $array[strtolower($row['word'])] = $row['score']; // NOTE: php array as hashmap
mysql_free_result($resultconcreteness);

$data = mysql_query('SELECT `key`, `tagged` FROM `speechesLCMcoded`') or die(mysql_error()); // NOTE: single query instead of multiple
while ($row = mysql_fetch_assoc($data)) {
    $key = $row['key'];
    $tagged = $row['tagged'];
    $weight = $count = 0;
    $speech = explode(' ', $tagged);
    foreach ($speech as $word) {
        if (preg_match('/({V}|{J}|{N}|{RB})/', $word, $matches)) {
            $weight += $array[strtolower(str_replace($remove, '', $word))]; // NOTE: quick access to word's score
            $count++;
        }
    }
    mysql_query('UPDATE `speechesLCMcoded` SET `weight`='.$weight.', `count`='.$count.' WHERE `key`='.$key, $conn) or die(mysql_error());
}
mysql_free_result($data);

Check the comments with NOTE:

But for 400K rows it will still take some time, if only because you have to update each row: that's 400K updates.
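One cheap way to soften the cost of those 400K updates is to batch them into transactions, so InnoDB performs one durable flush per batch instead of one per row. A sketch against the same mysql_ interface the code above uses (the batch size of 1000 is an arbitrary choice):

```php
<?php
// Wrap the per-row UPDATEs in transactions of 1000 rows each.
$batchSize = 1000;   // arbitrary; tune for your hardware
$i = 0;
mysql_query('START TRANSACTION', $conn) or die(mysql_error());
while ($row = mysql_fetch_assoc($data)) {
    // ... compute $weight and $count for this row as before ...
    mysql_query('UPDATE `speechesLCMcoded` SET `weight`=' . $weight .
                ', `count`=' . $count . ' WHERE `key`=' . $row['key'], $conn)
        or die(mysql_error());
    if (++$i % $batchSize === 0) {   // close this batch, open the next
        mysql_query('COMMIT', $conn) or die(mysql_error());
        mysql_query('START TRANSACTION', $conn) or die(mysql_error());
    }
}
mysql_query('COMMIT', $conn) or die(mysql_error());   // flush the final partial batch
```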

Possible future optimizations:

  1. Make this script accept arguments such as a start offset and a length (passed to MySQL's LIMIT clause), so you can run several instances processing different blocks of the table at the same time.
  2. Instead of updates, generate a file with the data and use LOAD DATA INFILE to replace your table; it could be faster than 400K individual updates.
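Point 2 could look roughly like this. A sketch, not a drop-in: scores_staging and /tmp/scores.tsv are made-up names, $results is an assumed key => weight/count map collected during processing, and LOAD DATA LOCAL INFILE must be enabled on your server.

```php
<?php
// 1) Dump the computed scores to a tab-separated file instead of issuing UPDATEs.
$fh = fopen('/tmp/scores.tsv', 'w');
foreach ($results as $key => $r) {   // $results: key => array('weight' => ..., 'count' => ...)
    fwrite($fh, $key . "\t" . $r['weight'] . "\t" . $r['count'] . "\n");
}
fclose($fh);

// 2) Bulk-load into a staging table, then update the main table with a single JOIN.
mysql_query("CREATE TEMPORARY TABLE scores_staging
                 (`key` INT PRIMARY KEY, weight DOUBLE, cnt INT)", $conn) or die(mysql_error());
mysql_query("LOAD DATA LOCAL INFILE '/tmp/scores.tsv'
                 INTO TABLE scores_staging (`key`, weight, cnt)", $conn) or die(mysql_error());
mysql_query("UPDATE speechesLCMcoded s
                 JOIN scores_staging t ON s.`key` = t.`key`
                 SET s.weight = t.weight, s.count = t.cnt", $conn) or die(mysql_error());
```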
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow