Question

I have two tables. Table 1 has around 400K rows, where each row includes a paragraph of text that can run to 50 sentences. Table 2 is a lexicon of 80K words, each with a score that I need in order to code every word of every paragraph.

The whole point of my PHP script is to explode each paragraph of text into words, look up each word's score in the lexicon, and end up with a total score for all of the words in each row.

My strategy so far was to have a script that did the following:

  1. Connect to the database, Table 1
  2. While Loop, one row after another
  3. For the current row, explode the paragraph.
  4. For each word, look up into Table 2 if the word exists and return the score.
  5. Ending up with a total score for the current row.
  6. Updating Table 1 with the total score for the current paragraph.
  7. Going back to point number 2.

My code works but is not efficient. The problem is that the script is so slow that letting it run for an hour just calculates the first 500 rows. That's a problem because I have 400K rows. I will need this script for other projects.

What would you advise me to do to make this process less resource-intensive?

<?php

// Include functions
include "functions.php";
ini_set('max_execution_time', 9000000);
echo 'Time Limit = ' . ini_get('max_execution_time');
$db = 'senate';

// Function to search the lexicon array for a word
function searchForId($id, $array) {
    foreach ($array as $key2 => $val) {
        if ($val['word'] === $id) {
            return $key2;
        }
    }
    return null;
}

// Tags to remove
$remove = array('{J}','{/J}','{N}','{/N}','{V}','{/V}','{RB}','{/RB}');
$x = 1;

// Connecting to the database ($conn presumably comes from functions.php)
if (!$conn) {
    die('Not connected : ' . mysql_error());
}

// Choose the current db
mysql_select_db($db);

// Slurp the lexicon into an array
$sql = "SELECT word, score FROM concreteness";
$resultconcreteness = mysql_query($sql) or die(mysql_error());
$array = array();
while ($row = mysql_fetch_assoc($resultconcreteness)) {
    $array[] = $row;
}

// Main loop, one row per iteration
while ($x <= 500000) {
    $data = mysql_query("SELECT `key`, `tagged` FROM speechesLCMcoded WHERE `key`='$x'") or die(mysql_error());

    // Put the row's data into the $info array
    $info = mysql_fetch_array($data);
    $tagged = $info['tagged'];
    $weight = 0;
    $count = 0;

    // Print out the contents of the entry
    print "<b>Key:</b> " . $info['key'] . " <br>";

    // Explode the paragraph into words
    $speech = explode(" ", $tagged);

    // Loop over every word
    foreach ($speech as $word) {
        // Check if the string contains one of our tags
        if (preg_match('/({V}|{J}|{N}|{RB})/', $word, $matches)) {
            // Remove our tags
            $word = str_replace($remove, "", $word);

            $id = searchForId($word, $array);
            $weight = $weight + $array[$id]['score'];
            $count = $count + 1;
        }
    }

    $sql = "UPDATE speechesLCMcoded SET weight='$weight', count='$count' WHERE `key`='$x';";
    $retval = mysql_query($sql, $conn);
    if (!$retval) {
        die('Could not update data: ' . mysql_error());
    }
    echo "Updated data successfully\n";
    ob_flush();
    flush();

    // Increase the loop counter by one
    $x = $x + 1;
}
?>

Here is the table definition:

CREATE TABLE `speechesLCMcoded` (
 `key` int(11) NOT NULL AUTO_INCREMENT,
 `speaker_state` varchar(100) NOT NULL,
 `speaker_first` varchar(100) NOT NULL,
 `congress` varchar(100) NOT NULL,
 `title` varchar(100) NOT NULL,
 `origin_url` varchar(100) NOT NULL,
 `number` varchar(100) NOT NULL,
 `order` varchar(100) NOT NULL,
 `volume` varchar(100) NOT NULL,
 `chamber` varchar(100) NOT NULL,
 `session` varchar(100) NOT NULL,
 `id` varchar(100) NOT NULL,
 `raw` mediumtext NOT NULL,
 `capitolwords_url` varchar(100) NOT NULL,
 `speaker_party` varchar(100) NOT NULL,
 `date` varchar(100) NOT NULL,
 `bills` varchar(100) NOT NULL,
 `bioguide_id` varchar(100) NOT NULL,
 `pages` varchar(100) NOT NULL,
 `speaker_last` varchar(100) NOT NULL,
 `speaker_raw` varchar(100) NOT NULL,
 `tagged` mediumtext NOT NULL,
 `adjectives` varchar(10) NOT NULL,
 `verbs` varchar(10) NOT NULL,
 `nouns` varchar(10) NOT NULL,
 `weight` varchar(50) NOT NULL,
 `count` varchar(50) NOT NULL,
 PRIMARY KEY (`key`)
) ENGINE=InnoDB AUTO_INCREMENT=408344 DEFAULT CHARSET=latin1

Solution

You have a fairly small reference table (your lexicon) and an enormous corpus of text (table 1).

If I were you I would start your program by slurping the entire lexicon from the table into a php array in memory. Even if all your words are 20 characters in length this will only take a dozen or so megabytes of RAM.

Then do your step 4 by looking up each word in memory rather than with a SQL query. Your inner loop (for each word) will be much faster, and just as accurate.

Be careful about one thing, though. You'll need to normalize the words in your lexicon by converting them to lower case if you are to replicate the case-insensitive lookup behavior of MySQL.
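To illustrate (a minimal sketch; the word and score are made up): MySQL's default collations compare strings case-insensitively, while PHP array keys and `===` are case-sensitive, so you must lower-case both the lexicon keys and the words you look up.

```php
<?php
// MySQL under a *_ci collation treats 'Senate' = 'senate' as a match;
// PHP array lookups do not, so normalize case on both sides.
$lexicon = array('senate' => 4.1);   // keys stored lower-case (sample entry)
$word = 'Senate';                    // word as it appears in the text
$key = strtolower($word);
$score = isset($lexicon[$key]) ? $lexicon[$key] : null;
// $score is 4.1; without strtolower() the lookup would miss
```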

Edit after seeing your code

Some pro tips:

  • Indent your code properly so you can see the structure of your loops at a glance.
  • Remember that passing data to functions takes time.
  • PHP arrays are associative. You can do $value = $array[$key], and it is fast. You don't have to search an array linearly, yet you're doing exactly that once per word!
  • Prepared statements are good.
  • Repeating an SQL statement when you could read the next row from its result set is bad.
  • Streaming result sets is good.
  • The mysql_ set of function calls are deprecated and despised by their developers, and everybody else, for good reasons.

There's way too much going on in your loops.

What you need is this:

First of all, switch from the mysql_ interface to mysqli_. Just do it. mysql_ is too slow, old, and crufty.

$db = new mysqli("host", "user", "password", "database");

Second, change the way you are loading your lexicon, to optimize the whole associative-array dealio.

$lookup = array();
//Slurps the lexicon into an array, streaming it row by row
$sql = "SELECT word, score FROM concreteness";
$db->real_query($sql) || die($db->error);
$lkup = $db->use_result();
while ($row = $lkup->fetch_row()) {
      $lookup[strtolower($row[0])] = $row[1];
}
$lkup->close();

This gives you an associative array called $lookup. If you have a $word, you can find its weight this way, and the lookup is fast, unlike the linear search in your example code. Notice that the keys are converted to lower case both when the array is built and when words are looked up. Don't wrap this lookup in a function if you can avoid it, for performance reasons.

if (array_key_exists( strtolower($word), $lookup )) {
    $weight += $lookup[strtolower($word)]; /* accumulate weight */
    $count ++;                             /* increment count   */
}
else {
  /* the word was not found in your lexicon. handle as needed */
}

Finally, you need to optimize your querying of the rows of your text corpus, and its updating. I believe you should do that using prepared statements.

Here's how that will go.

Near the beginning of your program, place this code.

$previouskey = -1;
if (/* you aren't starting at the beginning */) {
   $previouskey = /* the last successfully processed row */
}

$get_stmt = $db->prepare('SELECT `key`, `tagged` 
                           FROM speechesLCMcoded 
                          WHERE `key` > ?
                          ORDER BY `key` LIMIT 1' );

$post_stmt = $db->prepare ('UPDATE speechesLCMcoded 
                               SET weight=?, 
                                   count=? 
                             WHERE `key`=?' );

These give you two ready-to-use statements for your processing.

Notice that the $get_stmt retrieves the first key you haven't yet processed. This will work even if you have some missing keys. Always good. This will be decently efficient because you have an index on your key column.

So here's what your loop ends up looking like:

 $weight = 0;
 $count = 0;
 $key = 0;
 $tagged = '';

 /* bind parameters and results to the get statement */
 $get_stmt->bind_result($key, $tagged);
 $get_stmt->bind_param('i', $previouskey);

 /* bind parameters to the post statement */
 $post_stmt->bind_param('dii', $weight, $count, $key); /* 'd' because scores may be fractional */

 $done = false;
 while ( !$done ) {
    $get_stmt->execute();
    if ($get_stmt->fetch()) {

        /* do everything word - by - word  here on the $tagged string */

        /* do the post statement to store the results */
        $post_stmt->execute();

        /* update the previous key prior to next iteration */
        $previouskey = $key; 
        $get_stmt->reset();
        $post_stmt->reset();
    } /* end if fetch */
    else {
       /* no result returned! we are done! */
       $done = true;
    }
 } /* end while not done */

This should get you down to subsecond processing per row.
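The word-by-word placeholder in the loop above can be filled in with the tag handling from the original script, pointed at the in-memory $lookup array. A sketch: $tagged and the lexicon entries here are made-up sample data, while $remove and the tag pattern come from the question's code.

```php
<?php
// Process one $tagged paragraph against the in-memory lexicon.
$remove = array('{J}','{/J}','{N}','{/N}','{V}','{/V}','{RB}','{/RB}');
$lookup = array('senate' => 4.1, 'law' => 3.2);   // sample lexicon entries

$tagged = '{N}Senate{/N} passed the {N}law{/N} quickly';
$weight = 0;
$count  = 0;
foreach (explode(' ', $tagged) as $word) {
    // Only words carrying one of the part-of-speech tags are scored.
    if (preg_match('/({V}|{J}|{N}|{RB})/', $word)) {
        $word = strtolower(str_replace($remove, '', $word));
        if (array_key_exists($word, $lookup)) {
            $weight += $lookup[$word];   // accumulate this word's score
            $count++;
        }
    }
}
// Here $count is 2 and $weight is approximately 7.3.
```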

Other tips

First and obvious optimization is like this:

include "functions.php";
set_time_limit(0); // NOTE: no time limit
if (!$conn)
    die('Not connected : ' . mysql_error());
$remove = array('{J}','{/J}','{N}','{/N}','{V}','{/V}','{RB}','{/RB}'); // tags to remove       
$db = 'senate';
mysql_select_db($db);

$resultconcreteness = mysql_query('SELECT `word`, `score` FROM `concreteness`') or die(mysql_error());
$array = array(); // NOTE: init score cache
while($row = mysql_fetch_assoc($resultconcreteness))
    $array[strtolower($row['word'])] = $row['score']; // NOTE: php array as hashmap
mysql_free_result($resultconcreteness);

$data = mysql_query('SELECT `key`, `tagged` FROM `speechesLCMcoded`') or die(mysql_error()); // NOTE: single query instead of multiple
while ($row = mysql_fetch_assoc($data)) {
    $key = $row['key'];
    $tagged = $row['tagged'];
    $weight = $count = 0;
    $speech = explode(' ', $tagged);
    foreach ($speech as $word) {
        if (preg_match('/({V}|{J}|{N}|{RB})/', $word, $matches)) {
            $weight += $array[strtolower(str_replace($remove, '', $word))]; // NOTE: quick access to word's score
            $count++;
        }
    }
    mysql_query('UPDATE `speechesLCMcoded` SET `weight`='.$weight.', `count`='.$count.' WHERE `key`='.$key, $conn) or die(mysql_error());
}
mysql_free_result($data);

Check the comments with NOTE:

But for 400K rows it will still take some time, if only because you have to update each row: that's 400K updates.
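One cheap way to soften the cost of those 400K updates is to batch them into transactions, so InnoDB performs one durable flush per batch instead of one per row. A sketch against the same mysql_ interface the code above uses (the batch size of 1000 is an arbitrary choice):

```php
<?php
// Wrap the per-row UPDATEs in transactions of 1000 rows each.
$batchSize = 1000;   // arbitrary; tune for your hardware
$i = 0;
mysql_query('START TRANSACTION', $conn) or die(mysql_error());
while ($row = mysql_fetch_assoc($data)) {
    // ... compute $weight and $count for this row as before ...
    mysql_query('UPDATE `speechesLCMcoded` SET `weight`=' . $weight .
                ', `count`=' . $count . ' WHERE `key`=' . $row['key'], $conn)
        or die(mysql_error());
    if (++$i % $batchSize === 0) {   // close this batch, open the next
        mysql_query('COMMIT', $conn) or die(mysql_error());
        mysql_query('START TRANSACTION', $conn) or die(mysql_error());
    }
}
mysql_query('COMMIT', $conn) or die(mysql_error());   // flush the final partial batch
```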

Possible future optimizations:

  1. Make this script accept arguments such as a start offset and a length (passed to MySQL's LIMIT clause), so you can run several instances processing different blocks of the table at the same time.
  2. Instead of updates, generate a file with the data and use LOAD DATA INFILE to replace your table; it could be faster than 400K individual updates.
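Point 2 could look roughly like this. A sketch, not a drop-in: scores_staging and /tmp/scores.tsv are made-up names, $results is an assumed key => weight/count map collected during processing, and LOAD DATA LOCAL INFILE must be enabled on your server.

```php
<?php
// 1) Dump the computed scores to a tab-separated file instead of issuing UPDATEs.
$fh = fopen('/tmp/scores.tsv', 'w');
foreach ($results as $key => $r) {   // $results: key => array('weight' => ..., 'count' => ...)
    fwrite($fh, $key . "\t" . $r['weight'] . "\t" . $r['count'] . "\n");
}
fclose($fh);

// 2) Bulk-load into a staging table, then update the main table with a single JOIN.
mysql_query("CREATE TEMPORARY TABLE scores_staging
                 (`key` INT PRIMARY KEY, weight DOUBLE, cnt INT)", $conn) or die(mysql_error());
mysql_query("LOAD DATA LOCAL INFILE '/tmp/scores.tsv'
                 INTO TABLE scores_staging (`key`, weight, cnt)", $conn) or die(mysql_error());
mysql_query("UPDATE speechesLCMcoded s
                 JOIN scores_staging t ON s.`key` = t.`key`
                 SET s.weight = t.weight, s.count = t.cnt", $conn) or die(mysql_error());
```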
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow