Question

I am trying to read a 12 MB+ file containing a large HTML table that looks like this:

<table>
    <tr>
        <td>a</td>
        <td>b</td>
        <td>c</td>
        <td>d</td>
        <td>e</td>
    </tr>
    <tr>
        <td>a</td>
        <td>b</td>
        <td>c</td>
        <td>d</td>
        <td>e</td>
    </tr>
    <tr>..... up to 20,000+ rows....</tr>
</table>

Now this is how I'm scraping it:

<?php

require_once 'phpQuery-onefile.php';

$d = phpQuery::newDocumentFile('http://localhost/test.html');

$last_index = 20000;

for ($i = 1; $i <= $last_index; $i++)
{
    $set['c1']  = $d['tr:eq('.$i.') td:eq(0)']->text();
    $set['c2']  = $d['tr:eq('.$i.') td:eq(1)']->text();
    $set['c3']  = $d['tr:eq('.$i.') td:eq(2)']->text();
    $set['c4']  = $d['tr:eq('.$i.') td:eq(3)']->text();
    $set['c5']  = $d['tr:eq('.$i.') td:eq(4)']->text();

    // code to insert $set into the db here...
}

?>

My benchmark says it takes around 5.25 hours to scrape and insert 1,000 rows into the db. At that rate, it will take well over 4 days (about 105 hours for 20,000 rows) just to finish the whole 20,000+ rows.

My local machine is running on:

  • XAMPP
  • Win 7
  • proc, i3 2100 3.1GHz
  • ram, G.Skill RipJaws X 4GB dual
  • HDD, old SATA

Is there any way I can speed up the process? Maybe I'm scraping it the wrong way? Note that the file is served locally, which is why I used http://localhost/test.html

Slightly faster solution:

for ($i = 1; $i <= $last_index; $i++)
{
    $r = $d['tr:eq('.$i.')'];

    $set['c1']  = $r['td:eq(0)']->text();
    $set['c2']  = $r['td:eq(1)']->text();
    $set['c3']  = $r['td:eq(2)']->text();
    $set['c4']  = $r['td:eq(3)']->text();
    $set['c5']  = $r['td:eq(4)']->text();

    // code to insert $set into the db here...
}

?>

Solution

I have never worked with phpQuery, but that looks like a very sub-optimal way to parse a huge document: it's possible that phpQuery has to walk through the whole thing every time you make it load a row using tr:eq('.$i.'). If so, the total work grows with the square of the row count rather than linearly.

The much more straightforward (and probably also much faster) way would be to simply walk through each tr element of the document, and deal with each element's children in a foreach loop. You wouldn't even need phpQuery for that.
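A minimal sketch of that idea, using PHP's built-in DOM extension instead of phpQuery. The helper name extractRows() is hypothetical, and the inline sample HTML only stands in for the real file; I have not benchmarked this against the 12 MB document, so treat it as a starting point rather than a measured fix.

```php
<?php
// Sketch only: parse the document once, then visit each <tr> in order.
// extractRows() is a hypothetical helper name, not a phpQuery or DOM API.
function extractRows(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of printing them;
    // real-world HTML is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        // Only direct <td> children of this row; skips whitespace text nodes.
        foreach ($tr->childNodes as $child) {
            if ($child instanceof DOMElement && $child->nodeName === 'td') {
                $cells[] = trim($child->textContent);
            }
        }
        // Ignore rows that contain no <td> cells (e.g. header-only rows).
        if ($cells !== []) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Small stand-in for the real table.
$html = '<table>'
      . '<tr><td>a</td><td>b</td><td>c</td><td>d</td><td>e</td></tr>'
      . '<tr><td>f</td><td>g</td><td>h</td><td>i</td><td>j</td></tr>'
      . '</table>';

foreach (extractRows($html) as $cells) {
    $set = [
        'c1' => $cells[0],
        'c2' => $cells[1],
        'c3' => $cells[2],
        'c4' => $cells[3],
        'c5' => $cells[4],
    ];
    // code to insert $set into the db here...
}
```

For the actual file, DOMDocument::loadHTMLFile() can read it from disk directly, which also avoids the round trip through http://localhost. Each row is visited exactly once, so the cost should scale with the number of rows rather than with repeated selector lookups.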

See How to Parse XML File in PHP for a variety of solutions.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow