Question

This question takes a bit of time to introduce, so bear with me. It will be fun to solve if you can get there. This scrape would be replicated over thousands of pages on this website using a loop.

I'm trying to scrape the website http://www.digikey.com/product-detail/en/207314-1/A25077-ND/, looking to capture the data in the table with Digi-Key Part Number, Quantity Available, etc., including the right-hand side with Price Break, Unit Price, Extended Price.

Using the R function readHTMLTable() doesn't work and only returns NULL values. The reason for this (I believe) is that the website has hidden its content using the tag "aspNetHidden" in the HTML code.

For this reason I also had difficulty using htmlTreeParse() and xmlTreeParse(), with the whole section of interest not appearing in the results.

Using the R function scrape() from the scrapeR package

require(scrapeR)

URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")

does return the full html code including the lines of interest:

<th align="right">Digi-Key Part Number</th>
<td id="reportpartnumber">
<meta itemprop="productID" content="sku:A25077-ND">A25077-ND</td>

<th>Price Break</th>
<th>Unit Price</th>
<th>Extended Price
</th>
</tr>
<tr>
<td align="center">1</td>
<td align="right">2.75000</td>
<td align="right">2.75</td>

However, I haven't been able to select the nodes out of this block of code, with the following error returned:

no applicable method for 'xpathApply' applied to an object of class "list"

I've received that error using different functions such as:

xpathSApply(URL,'//*[@id="pricing"]/tbody/tr[2]')

getNodeSet(URL,"//html[@class='rd-product-details-page']")

I'm not the most familiar with XPath, but I have been identifying the XPath expressions using Inspect Element on the webpage and Copy XPath.

Any help you can give on this would be much appreciated!


The solution

You've not read the help for scrape, have you? It returns a list; you need to get parts of that list (if parse=TRUE) and so on.
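To make the cause of the error concrete, here is a minimal sketch (it assumes the page is still reachable, and the second class shown is not asserted since it depends on the XML package version):

    library(scrapeR)
    # scrape() returns a list of parsed documents, one per URL supplied,
    # which is why passing its result straight to an XPath function fails:
    page <- scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")
    class(page)       # "list" -- no xpathApply method for this class
    class(page[[1]])  # the parsed HTML document that XPath functions expect

Indexing with [[1]] pulls out the parsed document for the first (and here only) URL, which is the object xpathSApply() and getNodeSet() know how to work with.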

Also, I think that web page is doing some heavy browser detection. If I try to wget the page from the command line, I get an error page; the scrape function gets something usable (but seemingly different from what you got), and Chrome gets the full junk with all the encoded stuff. Yuck. Here's what works for me:

> URL<-scrape("http://www.digikey.com/product-detail/en/207314-1/A25077-ND/")
> tables = xpathSApply(URL[[1]],'//table')
> tables[[2]]
<table class="product-details" border="1" cellspacing="1" cellpadding="2">
  <tr class="product-details-top"/>
  <tr class="product-details-bottom">
    <td class="pricing-description" colspan="3" align="right">All prices are in US dollars.</td>
  </tr>
  <tr>
    <th align="right">Digi-Key Part Number</th>
    <td id="reportpartnumber"><meta itemprop="productID" content="sku:A25077-ND"/>A25077-ND</td>
    <td class="catalog-pricing" rowspan="6" align="center" valign="top">
      <table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
        <tr>
          <th>Price Break</th>
          <th>Unit Price</th>
          <th>Extended Price&#13;
</th>
        </tr>
        <tr>
          <td align="center">1</td>
          <td align="right">2.75000</td>
          <td align="right">2.75</td>

Adjust this to your use case. Here I'm getting all the tables and showing the second one, which has the info you want; some of it is in the pricing table, which you can get directly with:

pricing = xpathSApply(URL[[1]],'//table[@id="pricing"]')[[1]]

> pricing
<table id="pricing" frame="void" rules="all" border="1" cellspacing="0" cellpadding="1">
  <tr>
    <th>Price Break</th>
    <th>Unit Price</th>
    <th>Extended Price&#13;
</th>
  </tr>
  <tr>
    <td align="center">1</td>
    <td align="right">2.75000</td>
    <td align="right">2.75</td>
  </tr>

and so on.
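From here, the pricing node can be turned into a data frame: readHTMLTable() works when handed an already-parsed table node, even though calling it on the raw URL returned NULL. A sketch, where part_urls is a hypothetical character vector of product-page URLs for the loop mentioned in the question:

    library(XML)
    library(scrapeR)

    # Convert the extracted <table id="pricing"> node to a data frame.
    price_df <- readHTMLTable(pricing, header = TRUE,
                              stringsAsFactors = FALSE)

    # To replicate across many pages, wrap the scrape/extract steps:
    # results <- lapply(part_urls, function(u) {
    #   doc  <- scrape(u)[[1]]
    #   node <- xpathSApply(doc, '//table[@id="pricing"]')[[1]]
    #   readHTMLTable(node, stringsAsFactors = FALSE)
    # })

Note the numeric columns will come back as character strings, so convert them with as.numeric() before doing arithmetic on the prices.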

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow