Question

I want to insert some elements into my database, but I want $pavadinimas and $kaina to end up in the same row, not in different ones. It would also be great if I could collect the elements from all pages of the website, but when I insert more than 2 links, refreshing my page gives an error saying the page could not load. Here is my code. Thanks for the help!

<?php // example of how to modify HTML contents


include_once('simple_html_dom.php');

// Create DOM from URL or file

$html = file_get_html('https://www.varle.lt/mobilieji-telefonai/');

foreach($html->find('span[class=inner]') as $pavadinimas) {
    $pavadinimas = str_replace("<span class=", " ", $pavadinimas);
    $pavadinimas = str_replace("inner>", " ", $pavadinimas);
    $pavadinimas = str_replace("<span>", " ", $pavadinimas);
    $pavadinimas = str_replace("</span></span>", " ", $pavadinimas);
    $pavadinimas = str_replace('"inner">   ', " ", $pavadinimas);
}

foreach($html->find('span[class=price]') as $kaina) {
    $kaina = str_replace("Lt", " ", $kaina);
    $kaina = str_replace("<span class=", " ", $kaina);
    $kaina = str_replace("price", " ", $kaina);
    $kaina = str_replace("</span>", " ", $kaina);
    $kaina = str_replace(",<sup>99</sup>", " ", $kaina);
    $kaina = str_replace(",<sup>99</sup>", " ", $kaina);
    $kaina = str_replace("               ", " ", $kaina);
    $kaina = str_replace('" ">', " ", $kaina);
    $kaina = str_replace("              ", " ", $kaina);
    $query = "insert into telefonai (pavadinimas,kaina) VALUES (?,?)";
    $this->db->query($query, array($pavadinimas,$kaina));
}
?>

Solution

Proceed step by step...

Start by getting all the wanted information from a single page (the first one, for example). The idea is to:

  • Get all phone blocks: $phones = $html->find('a[data-id]');
  • In a loop, get the wanted information (name, price) from each block
  • Insert this information into the database (I can't help with the database part since I haven't used one for a while, but it is not hard to do on your own)
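For the database step in the list above, here is a minimal sketch. It assumes PDO is available and uses an in-memory SQLite database as a stand-in for the real one. The `telefonai` table and its `pavadinimas` and `kaina` columns come from the question, while `insertPhones` is a hypothetical helper. Binding both values in a single INSERT is what keeps the name and the price in the same row.

```php
<?php
// Hypothetical helper: stores one phone per row so that the name
// and the price always stay together.
function insertPhones(PDO $pdo, array $phones): int
{
    $stmt = $pdo->prepare('INSERT INTO telefonai (pavadinimas, kaina) VALUES (?, ?)');
    $count = 0;
    foreach ($phones as $phone) {
        // Both values are bound in the same statement, hence the same row
        $stmt->execute([$phone['pavadinimas'], $phone['kaina']]);
        $count++;
    }
    return $count;
}

// Assumption: an in-memory SQLite database replaces the real one here.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE telefonai (pavadinimas TEXT, kaina INTEGER)');

// Sample rows taken from the output shown in this answer
$inserted = insertPhones($pdo, [
    ['pavadinimas' => 'Samsung Phone I9300 Galaxy SIII Juodas', 'kaina' => 1099],
    ['pavadinimas' => 'LG T375 Mobile Phone Black', 'kaina' => 218],
]);
```

With the database wrapper used in the question, the same idea would be one `$this->db->query($query, array($pavadinimas, $kaina))` call per phone, placed inside the loop that extracts both values.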

Now that you have the code working for one page, let's try to make it work for all pages knowing that:

  • All pages have the same structure, so we can extract data with the same method/code above
  • The link of the next page to scrape is included in the Next button, so we'll stop when this link cannot be found

So here's the code summarizing everything described above:

$url = "https://www.varle.lt/mobilieji-telefonai/";

// Start from the main page
$nextLink = $url;

// Loop over each next link as long as it exists
while ($nextLink) {
    echo "<hr>nextLink: $nextLink<br>";
    //Create a DOM object
    $html = new simple_html_dom();
    // Load HTML from a url
    $html->load_file($nextLink);

    /////////////////////////////////////////////////////////////
    /// Get phone blocks and extract info (also insert to db) ///
    /////////////////////////////////////////////////////////////
    $phones = $html->find('a[data-id]');

    foreach($phones as $phone) {
        // Get the link
        $linkas = $phone->href;

        // Get the name
        $pavadinimas = $phone->find('span[class=inner]', 0)->plaintext;

        // Get the price and extract the useful part using a regex
        $kaina = $phone->find('span[class=price]', 0)->plaintext;
        // This captures the integer part of decimal numbers: in "123,45" it captures "123". Use @([\d,]+),?@ to capture the decimal part too
        preg_match('@(\d+),?@', $kaina, $matches);
        // Fall back to NULL when no number is found, to avoid an undefined index notice
        $kaina = isset($matches[1]) ? $matches[1] : NULL;

        echo $pavadinimas, " #----# ", $kaina, " #----# ", $linkas, "<br>";

        // INSERT INTO DB HERE
        // CODE
        // ...
    }
    /////////////////////////////////////////////////////////////
    /////////////////////////////////////////////////////////////

    // Extract the next link, if not found return NULL
    $nextLink = ( ($temp = $html->find('div.pagination a[class="next"]', 0)) ? "https://www.varle.lt".$temp->href : NULL );

    // Clear DOM object
    $html->clear();
    unset($html);
}

Output

nextLink: https://www.varle.lt/mobilieji-telefonai/
Samsung Phone I9300 Galaxy SIII Juodas #----# 1099 #----# https://www.varle.lt/mobilieji-telefonai/samsung-phone-i9300-galaxy-siii-juodas.html
Samsung Galaxy S2 Plus I9105 Pilkai mėlynas #----# 739 #----# https://www.varle.lt/mobilieji-telefonai/samsung-galaxy-s2-plus-i9105-pilkai-melynas.html
Samsung Phone S7562 Galaxy S Duos baltas #----# 555 #----# https://www.varle.lt/mobilieji-telefonai/samsung-phone-s7562-galaxy-s-duos-baltas--457135.html
...

nextLink: https://www.varle.lt/mobilieji-telefonai/?p=2
LG T375 Mobile Phone Black #----# 218 #----# https://www.varle.lt/mobilieji-telefonai/lg-t375-mobile-phone-black.html
Samsung S6802 Galaxy Ace Duos black #----# 579 #----# https://www.varle.lt/mobilieji-telefonai/samsung-s6802-galaxy-ace-duos-black.html
Mobilus telefonas Samsung Galaxy Ace Onyx Black | S5830 #----# 559 #----# https://www.varle.lt/mobilieji-telefonai/mobilus-telefonas-samsung-galaxy-ace-onyx-black.html
...

...
...
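The prices in the output above are integers because the regex in the loop drops the decimals. If the decimal part matters, a small helper along these lines could be used instead. This is only a sketch: `parsePrice` is a hypothetical name, and the input format ("1099,99 Lt", with a comma as decimal separator and no thousands separator) is an assumption based on the strings stripped in the question.

```php
<?php
// Hypothetical helper: turns text such as "1099,99 Lt" into a float,
// or returns null when the text contains no number.
function parsePrice(string $text): ?float
{
    // Digits, optionally followed by a comma or dot and more digits
    if (!preg_match('/(\d+)(?:[.,](\d+))?/', $text, $m)) {
        return null;
    }
    // Group 2 is absent when the price has no decimal part
    $decimals = $m[2] ?? '0';
    return (float) ($m[1] . '.' . $decimals);
}

$withDecimals = parsePrice('1099,99 Lt'); // 1099.99
$integerOnly  = parsePrice('739 Lt');     // 739.0
$noNumber     = parsePrice('n/a');        // null
```

Returning null instead of a partial match makes it easy to skip a block whose price could not be read before inserting it into the database.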


Note that the code may take a while to parse all the pages, so PHP may stop with the error "Fatal error: Maximum execution time of 30 seconds exceeded ...". In that case, simply extend the maximum execution time like this:

ini_set('max_execution_time', 300); //300 seconds = 5 minutes
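As an alternative to one large limit, the timer can be restarted for each page with `set_time_limit`, combined with a cap on the number of pages so that an unexpected pagination link cannot keep the loop running forever. The sketch below makes several assumptions: `crawl` and its `$maxPages` parameter are hypothetical, and `$processPage` stands for a callback that would wrap the scraping code above and return the next link or null.

```php
<?php
// Hypothetical helper: follows "next" links until none is left
// or until $maxPages pages have been processed.
function crawl(string $startUrl, callable $processPage, int $maxPages = 50): int
{
    $visited = 0;
    $next = $startUrl;
    while ($next !== null && $visited < $maxPages) {
        set_time_limit(30); // restart the execution timer for this page
        $next = $processPage($next); // returns the next URL, or null on the last page
        $visited++;
    }
    return $visited;
}

// Stand-in for the real site: each "page" maps to its next link.
$fakeSite = ['/p1' => '/p2', '/p2' => '/p3', '/p3' => null];
$processPage = function (string $url) use ($fakeSite) {
    return $fakeSite[$url];
};

$visitedAll    = crawl('/p1', $processPage);    // visits all three pages
$visitedCapped = crawl('/p1', $processPage, 2); // stops after two pages
```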
Licensed under: CC-BY-SA with attribution