Question

I would like to try and summarise the data that is available on the Alsop website (http://www.auction.co.uk/residential/onlineCatalogue.asp)

Ideally I would like to end up with a data.frame that has the following fields from the website.

Lot number, Type, Location/Full address, Guide Price, Number of bedrooms, URL for any photos.

I tried using Google Chrome's Inspect Element and htmlParse (which I normally use on the links), but I get the same URL for each lot number, i.e. http://www.auction.co.uk/residential/LotDetails.asp?A=877&MP=24&ID=877000001&S=L&O=A

So I am a bit stumped, as my usual methods of scraping websites to look for links no longer work.

I have a preference towards R but understand if Python is more useful and am open to suggestions as to how this could potentially be achieved.


Solution

You can get the data using Selenium, driven from R with the RSelenium package.

library(RSelenium)

# Start a local Selenium server and wait for it to come up.
# startServer() belongs to the RSelenium version this answer was written for;
# newer releases may need the server started another way.
RSelenium::startServer()
Sys.sleep(5)

appUrl <- "http://www.auction.co.uk/residential/onlineCatalogue.asp"
remDr <- remoteDriver()
remDr$open()
remDr$navigate(appUrl)

# Find the catalogue link, highlight it as a visual check, then click it
webElem <- remDr$findElement("css selector", '[href="onlineCatalogue.asp"]')
webElem$highlightElement()
webElem$clickElement()

# Collect the links to the remaining catalogue pages
webElems <- remDr$findElements("css selector", "#Table7 a[href]")
appUrl <- c(appUrl, sapply(webElems, function(x) x$getElementAttribute("href")[[1]]))

# Visit each page and keep the HTML of the lot table
out <- lapply(appUrl, function(x) {
  remDr$navigate(x)
  webElem <- remDr$findElement("id", "Table6")
  webElem$getElementAttribute("outerHTML")[[1]]
})

remDr$close()
remDr$closeServer()

Now we can process the HTML:

# Process the HTML table from each page
library(XML)      # htmlParse, getNodeSet, xpathSApply, xmlValue, ...
library(selectr)  # querySelectorAll

asDF <- lapply(out, function(appData) {
  xData <- htmlParse(appData)

  # Lot number and image URL: keep every other a.tooltip link
  lotAndLoc <- querySelectorAll(xData, "a.tooltip")
  alsopLot <- lapply(lotAndLoc[c(TRUE, FALSE)], function(node) {
    lot <- getNodeSet(node, ".//span[@class = 'lotnum']")
    lot <- xmlValue(lot[[1]])
    img <- getNodeSet(node, ".//img")
    img <- xmlGetAttr(img[[1]], "src")
    data.frame(lot = lot, img = img)
  })
  alsopLot <- do.call(rbind.data.frame, alsopLot)

  # Type and guide price: columns 2 and 4, dropping the header row
  alsopType <- xpathSApply(xData, "//tr/td[2]", xmlValue)[-1]
  alsopPrice <- xpathSApply(xData, "//tr/td[4]", xmlValue)[-1]
  # Strip the stray mis-encoded characters that come through with the price
  alsopPrice <- gsub("ÂÂ", "", alsopPrice)

  # Address: join the non-empty text pieces of column 3 with "~"
  alsopAddr <- xpathSApply(xData, "//tr/td[3]/*//span[@class='text']", function(node) {
    pieces <- getChildrenStrings(node)
    Addr <- pieces[names(pieces) %in% c("text", "span")]
    Addr <- gsub("\\n\\s*", "", Addr)
    Addr <- Addr[Addr != ""]
    paste(Addr, collapse = "~")
  })

  alsopDf <- data.frame(type = alsopType, price = alsopPrice, address = alsopAddr)
  cbind.data.frame(alsopLot, alsopDf)
})
asDF <- do.call(rbind.data.frame, asDF)
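
The every-other-element index applied to lotAndLoc in the code above relies on R recycling a short logical vector along the thing being indexed. A toy illustration follows; the vector `nodes` is made up, and the idea that each lot produces two a.tooltip links is only inferred from how the answer indexes them:

```r
# Made-up stand-ins for the matched nodes: two entries per lot (assumed)
nodes <- c("lot1-link-a", "lot1-link-b", "lot2-link-a", "lot2-link-b")

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE, ...
# so only positions 1, 3, 5, ... are kept
every_other <- nodes[c(TRUE, FALSE)]
print(every_other)
```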

You will need to tidy up the address (the pieces are joined with "~" and start with the lot number), but the rest of the data is as you want:

> head(asDF)
  lot                                                                   img
1   1 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp1.jpg
2   2 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp2.jpg
3   3 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp3.jpg
4   4 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp4.jpg
5   5 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp5.jpg
6   6 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp6.jpg
                            type               price
1        VACANT - Leasehold Flat           £225,000+
2        VACANT - Leasehold Flat           £160,000+
3     VACANT - Freehold Building           £250,000+
4        VACANT - Leasehold Flat           £180,000+
5                 Freehold House           £180,000+
6 INVESTMENT - Freehold Building £110,000 - £120,000
                                                                  address
1 1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR
2                                   2~London W3~17 York Road~Acton~W3 6TS
3                 3~London SE27~23 Thurlestone Road~West Norwood~SE27 0PE
4             4~London N16~Flat G~74 Darenth Road~Stoke Newington~N16 6ED
5                              5~Ilford~11 Cavenham Gardens~Essex~IG1 1XX
6                                  6~Ilford~52 Balfour Road~Essex~IG1 4JG
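
One way the address tidy-up could go, as a rough sketch only: split each string on "~", drop the leading lot number, and treat the last piece as the postcode. The helper name `split_address` and the column names are hypothetical, and the layout (lot, area, street lines, postcode) is assumed from the six rows shown above; other rows may not follow it.

```r
# Sample addresses copied from the head(asDF) output
address <- c(
  "1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR",
  "2~London W3~17 York Road~Acton~W3 6TS",
  "5~Ilford~11 Cavenham Gardens~Essex~IG1 1XX"
)

# Hypothetical helper: split one "~"-delimited address into named parts.
# Assumes piece 1 is the lot number, piece 2 the area, and the last piece the postcode.
split_address <- function(x) {
  parts <- strsplit(as.character(x), "~", fixed = TRUE)[[1]]
  n <- length(parts)
  # Everything between the area and the postcode is treated as street lines
  street <- if (n > 3) paste(parts[3:(n - 1)], collapse = ", ") else NA_character_
  data.frame(
    area     = parts[2],
    street   = street,
    postcode = parts[n],
    stringsAsFactors = FALSE
  )
}

tidy <- do.call(rbind, lapply(address, split_address))
print(tidy)
```

With the real data, `lapply(asDF$address, split_address)` would be the equivalent call; the `as.character()` inside the helper is there because the address column is a factor.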

The dataframe asDF has the required number of lots:

> str(asDF)
'data.frame':   347 obs. of  5 variables:
 $ lot    : Factor w/ 347 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ img    : Factor w/ 347 levels "http://www.auction.co.uk/residential/data/bigimages/may2014/arbp1.jpg",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ type   : Factor w/ 102 levels "Freehold Building",..: 30 30 23 30 2 5 23 1 1 19 ...
 $ price  : Factor w/ 151 levels "£1.25M - £1.5M",..: 31 19 33 21 21 9 54 68 68 68 ...
 $ address: Factor w/ 347 levels "1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR",..: 1 14 27 38 49 60 71 82 94 2 ...
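
Because every column comes back as a factor, the guide price cannot be summarised numerically as it stands. A minimal sketch of one possible conversion is below: it takes the lower bound of each guide price and handles the three formats visible in the output ("£225,000+", "£110,000 - £120,000", "£1.25M - £1.5M"). The helper name `guide_low` is hypothetical, and any other price format in the full data would yield NA with a warning.

```r
# Sample prices in the three formats seen above ("\u00a3" is the pound sign)
price <- factor(c("\u00a3225,000+",
                  "\u00a3110,000 - \u00a3120,000",
                  "\u00a31.25M - \u00a31.5M"))

# Hypothetical helper: lower bound of the guide price, in pounds
guide_low <- function(x) {
  x <- as.character(x)  # factor -> character
  # Keep the first number in the string, with an optional trailing "M"
  first <- sub("^[^0-9]*([0-9.,]+M?).*$", "\\1", x)
  is_million <- grepl("M$", first)
  value <- as.numeric(gsub("[,M]", "", first))
  ifelse(is_million, value * 1e6, value)
}

low <- guide_low(price)
print(low)
```

Applied to the scraped data this would be `asDF$price_low <- guide_low(asDF$price)`, where `price_low` is again just an illustrative column name.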
Licensed under: CC-BY-SA with attribution