Question

I would like to summarise the data that is available on the Allsop website (http://www.auction.co.uk/residential/onlineCatalogue.asp).

Ideally I would like to end up with a data.frame that has the following fields from the website:

Lot number, Type, Location/Full address, Guide Price, Number of bedrooms, URL for any photos.

I tried using Google Chrome's Inspect Element and htmlParse (which I normally use to pull out the links), but I get the same URL for each lot number, i.e. http://www.auction.co.uk/residential/LotDetails.asp?A=877&MP=24&ID=877000001&S=L&O=A

So I am a bit stumped, as my usual methods of scraping websites for links no longer work.

I have a preference towards R but understand if Python is more useful and am open to suggestions as to how this could potentially be achieved.


Solution

You can get the data using Selenium, driven from R with the RSelenium package:

library(RSelenium)
# note: newer RSelenium releases retire startServer() in favour of rsDriver()
RSelenium::startServer()
Sys.sleep(5)  # give the Selenium server time to start
appUrl <- "http://www.auction.co.uk/residential/onlineCatalogue.asp"
remDr <- remoteDriver()
remDr$open()
remDr$navigate(appUrl)
webElem <- remDr$findElement("css selector", '[href="onlineCatalogue.asp"]')
# check Element
webElem$highlightElement()
# click link
webElem$clickElement()
# get the pages to click thru
webElems <- remDr$findElements("css selector", "#Table7 a[href]")
appUrl <- c(appUrl, sapply(webElems, function(x){x$getElementAttribute("href")[[1]]}))
out <- lapply(appUrl, function(x){
  remDr$navigate(x)
  # get table data
  webElem <- remDr$findElement("id", "Table6")
  # return the table html
  webElem$getElementAttribute("outerHTML")[[1]]
})
remDr$close()
remDr$closeServer()

Now we can process the HTML:

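# Process html Table
library(XML)      # htmlParse, getNodeSet, xpathSApply, xmlValue, ...
library(selectr)  # querySelectorAll
asDF <- lapply(out, function(x){
  appData <- x
  xData <- htmlParse(appData)
  lotAndLoc <- querySelectorAll(xData, "a.tooltip")
  # the tooltip links come in pairs; keep the first of each pair
  alsopLot <- lapply(lotAndLoc[c(TRUE, FALSE)], function(x){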
    lot <- getNodeSet(x, ".//span[@class = 'lotnum']")
    lot <- xmlValue(lot[[1]])
    img <- getNodeSet(x, ".//img")
    img <- xmlGetAttr(img[[1]], "src")
    data.frame(lot = lot, img = img)
  })
  alsopLot <- do.call(rbind.data.frame, alsopLot)
  alsopType <- xpathSApply(xData, "//tr/td[2]", xmlValue)[-1]
  alsopPrice <- xpathSApply(xData, "//tr/td[4]", xmlValue)[-1]
  # strip the stray "Â" that an encoding mismatch leaves in front of the pound sign
  alsopPrice <- gsub("Â", "", alsopPrice)
  alsopAddr <- xpathSApply(xData, "//tr/td[3]/*//span[@class='text']", function(x){
    Addr <- getChildrenStrings(x)[names(getChildrenStrings(x)) %in% c("text", "span")]
    Addr <- gsub("\\n\\s*", "", Addr)
    Addr <- Addr[Addr != ""]
    paste(Addr, collapse = "~")
  })

  alsopDf <- data.frame(type = alsopType, price = alsopPrice, address = alsopAddr)
  alsopDf <- cbind.data.frame(alsopLot, alsopDf)
  alsopDf
}
)
asDF <- do.call(rbind.data.frame, asDF)

You will need to tidy up the address (its parts are joined with "~" and it starts with the lot number), but the rest of the data is as you want:

> head(asDF)
  lot                                                                   img
1   1 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp1.jpg
2   2 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp2.jpg
3   3 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp3.jpg
4   4 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp4.jpg
5   5 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp5.jpg
6   6 http://www.auction.co.uk/residential/data/bigimages/may2014/arbp6.jpg
                            type               price
1        VACANT - Leasehold Flat           £225,000+
2        VACANT - Leasehold Flat           £160,000+
3     VACANT - Freehold Building           £250,000+
4        VACANT - Leasehold Flat           £180,000+
5                 Freehold House           £180,000+
6 INVESTMENT - Freehold Building £110,000 - £120,000
                                                                  address
1 1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR
2                                   2~London W3~17 York Road~Acton~W3 6TS
3                 3~London SE27~23 Thurlestone Road~West Norwood~SE27 0PE
4             4~London N16~Flat G~74 Darenth Road~Stoke Newington~N16 6ED
5                              5~Ilford~11 Cavenham Gardens~Essex~IG1 1XX
6                                  6~Ilford~52 Balfour Road~Essex~IG1 4JG
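
One way to tidy the address is to split it on the "~" separator. The sketch below is not part of the original answer: `splitAddress` is a hypothetical helper, and it assumes (based only on the six rows shown above) that the first field is always the lot number, the second the area heading, and the last the postcode.

```r
# Hypothetical helper: split one "~"-delimited address string into parts.
# Assumes field 1 = lot number, field 2 = area heading, last field = postcode.
splitAddress <- function(addr) {
  parts <- strsplit(as.character(addr), "~", fixed = TRUE)[[1]]
  n <- length(parts)
  data.frame(
    lot      = parts[1],
    area     = parts[2],
    postcode = parts[n],
    # everything between the area heading and the postcode
    street   = paste(parts[-c(1, 2, n)], collapse = ", "),
    stringsAsFactors = FALSE
  )
}

# Two addresses copied from the output above
example <- c(
  "1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR",
  "5~Ilford~11 Cavenham Gardens~Essex~IG1 1XX"
)
tidyAddr <- do.call(rbind, lapply(example, splitAddress))
print(tidyAddr)
```

Applied to the real data this would be `do.call(rbind, lapply(asDF$address, splitAddress))`. Rows that do not follow the assumed layout would need checking by hand.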

The dataframe asDF has the required number of lots:

> str(asDF)
'data.frame':   347 obs. of  5 variables:
 $ lot    : Factor w/ 347 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ img    : Factor w/ 347 levels "http://www.auction.co.uk/residential/data/bigimages/may2014/arbp1.jpg",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ type   : Factor w/ 102 levels "Freehold Building",..: 30 30 23 30 2 5 23 1 1 19 ...
 $ price  : Factor w/ 151 levels "£1.25M - £1.5M",..: 31 19 33 21 21 9 54 68 68 68 ...
 $ address: Factor w/ 347 levels "1~London E2~Flat 14 Bridge Wharf~230 Old Ford Road~Bethnal Green~E2 9PR",..: 1 14 27 38 49 60 71 82 94 2 ...
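
The price column comes back as text (a factor here), so it cannot be summarised numerically as it stands. As a rough sketch, assuming the only formats are the ones visible above (a single amount, a range, and an "M" suffix for millions), the lower guide price could be extracted like this. `guidePriceLow` is a hypothetical helper, not part of the original answer.

```r
# Hypothetical helper: lower bound of a guide price string, in pounds.
# Assumes formats like "£225,000+", "£110,000 - £120,000", "£1.25M - £1.5M".
guidePriceLow <- function(price) {
  price <- as.character(price)
  # keep only the first amount, e.g. "110,000" or "1.25M";
  # any leading non-digit characters (currency symbols etc.) are dropped
  first <- sub("^[^0-9]*([0-9][0-9,.]*M?).*$", "\\1", price)
  first[!grepl("[0-9]", price)] <- NA   # no digits at all: no price
  millions <- grepl("M$", first)
  value <- as.numeric(gsub("[,M]", "", first))
  ifelse(millions, value * 1e6, value)
}

prices <- c("£225,000+", "£110,000 - £120,000", "£1.25M - £1.5M")
print(guidePriceLow(prices))
```

On the scraped data this would be `asDF$priceLow <- guidePriceLow(asDF$price)`. Passing `stringsAsFactors = FALSE` to the `data.frame` calls would also keep the columns as character rather than factor.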
Licensed under: CC-BY-SA with attribution