Scraping experimentally measured physicochemical properties and synonyms from Chemspider in R

StackOverflow https://stackoverflow.com/questions/21713278

  •  10-10-2022

Question

Although the ChemSpider SOAP web API (accessible in R via the SSOAP package) lets you retrieve the chemical structure of a given compound, it does not let you retrieve experimentally measured physicochemical properties such as boiling points, or the listed synonyms.

For example, http://www.chemspider.com/Chemical-Structure.733.html gives a list of Synonyms and Experimental data under Properties (you may have to register first to see this info), which I would like to retrieve in R.

I got some way by doing

library(httr)
library(XML)
csid="733" # ChemSpider ID of glycerin
url=paste("http://www.chemspider.com/Chemical-Structure.",csid,".html",sep="")
webp=GET(url)
doc=htmlParse(content(webp,"text"),encoding="UTF-8") # parse the response body as text

but I would then like to retrieve and parse the chemical-property sections under

<div class="tab-content" id="epiTab"> and 
<div class="tab-content" id="acdLabsTab">

and also fetch all the synonyms given after each section

<p class="syn" xmlns:cs="http://www.chemspider.com" xmlns:msxsl="urn:schemas-microsoft-com:xslt">

What would be the most elegant way of doing this, e.g. using xpathSApply (as opposed to a simple strsplit / gsub job)?

cheers, Tom

Solution

Web scraping is always fraught. For one thing, you have no guarantee that the provider will not change their formatting at some point in the future. For another, the current formats are anything but standardized. Avoiding exactly this was the whole point of SOAP and XML web services.

Having said all that, this should get you started:

library(XML)
# load and parse the document
csid <- "733"  # ChemSpider ID of glycerin
url  <- paste0("http://www.chemspider.com/Chemical-Structure.", csid, ".html")
doc  <- htmlTreeParse(url, useInternalNodes = TRUE)

The data in the EPI tab are actually in a preformatted text block (<pre>...</pre>), so the best we can do with XPath is grab that text. From there you still need some kind of regex solution to parse out the parameters. The example below extracts the melting point (MP), boiling point (BP), and vapour pressure (VP).

# parse epiTab
epiTab <- xmlValue(getNodeSet(doc, '//div[@id="epiTab"]/pre')[[1]])
epiTab <- unlist(strsplit(epiTab, "\n"))
params <- c(MP = "Melting Pt (deg C):",
            BP = "Boiling Pt (deg C):",
            VP = "VP(mm Hg,25 deg C):")
prop <- sapply(params, function(x) {
  z <- epiTab[grep(x, epiTab, fixed = TRUE)]        # line containing the label
  r <- regexpr(":  \\d+\\.*\\d+E*\\+*\\-*\\d*", z)  # locate ": <number>"
  as.numeric(substr(z, r + 3, r + attr(r, "match.length") - 1))
})
prop
#         MP         BP         VP 
# 1.9440e+01 2.3065e+02 7.9800e-05 
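As an aside (not part of the accepted answer's code), the numeric extraction can be written a little more transparently with `regmatches()`; a sketch over a sample line mimicking the EPI Suite text block:

```r
# Alternative extraction of the value after the colon, using regmatches().
# The sample line below imitates one line of the EPI Suite <pre> block.
line <- "Boiling Pt (deg C):  230.65  (Adapted Stein & Brown method)"
m    <- regmatches(line,
                   regexpr("(?<=:)\\s*-?\\d+\\.?\\d*(?:[Ee][+-]?\\d+)?",
                           line, perl = TRUE))
val  <- as.numeric(trimws(m))
val
# [1] 230.65
```

The lookbehind `(?<=:)` anchors the match to the first colon, so the label text itself never has to be sliced off by hand.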

The data in the acdLabs tab are actually in an HTML table, so we can navigate to the appropriate node and use readHTMLTable(...) to read it into a data frame. The data frame still needs some tweaking, though.

# parse acdLabsTab
acdLabsTab   <- getNodeSet(doc,'//div[@id="acdLabsTab"]/div/div')[[1]]
acdLabs      <- readHTMLTable(acdLabsTab)
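What that tweaking looks like depends on ChemSpider's current markup, but it typically amounts to naming the two columns and splitting the leading number off the value string. A hedged sketch, using a toy two-column frame with made-up values in place of the real `acdLabs` object:

```r
# Toy stand-in for the data frame returned by readHTMLTable(acdLabsTab);
# the property names and values here are illustrative, not real ChemSpider data.
acdLabs <- data.frame(V1 = c("Boiling Point:", "Flash Point:"),
                      V2 = c("287.8 °C at 760 mmHg", "160.4 °C"),
                      stringsAsFactors = FALSE)
names(acdLabs) <- c("property", "value")
# strip the trailing colon from the property names
acdLabs$property <- sub(":$", "", acdLabs$property)
# pull the leading number out of each value string
acdLabs$numeric <- as.numeric(sub("^([-0-9.]+).*", "\\1", acdLabs$value))
# acdLabs$numeric is now c(287.8, 160.4)
```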

Finally, the synonyms tab is a real nightmare. There is a baseline set of synonyms, and also a "more..." link which exposes an additional (more obscure) set. The code below grabs just the baseline set.

# synonyms tab
synNodes <- getNodeSet(doc,'//div[@id="synonymsTab"]/div/div/div/p[@class="syn"]')
synonyms <- sapply(synNodes,function(x)xmlValue(getNodeSet(x,"./strong")[[1]]))
synonyms
#  [1] "1,2,3-Propanetriol" "Bulbold"            "Cristal"            "Glicerol"           "Glyceol"            "Glycerin"           "Glycerin"          
#  [8] "glycerine"          "glycerol"           "Glycérol"       

OTHER TIPS

Instead of parsing the ChemSpider web page, it is much better and easier to use the REST API: http://parts.chemspider.com/JSON.ashx

So, to get the list of synonyms plus the predicted and experimental properties for the compound with ID 733, request: http://parts.chemspider.com/JSON.ashx?op=GetRecordsAsCompounds&csids[0]=733&serfilter=Compound[PredictedProperties|ExperimentalProperties|Synonyms]
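In R the same request can be assembled and fired with base R plus httr/jsonlite. A sketch, assuming the endpoint above is still live (only the URL assembly below is run here; the fetch needs network access, and the JSON field names are guesses to be checked against the actual response):

```r
# Assemble the request URL for ChemSpider's JSON endpoint,
# URL-encoding the filter expression. Base R only.
base   <- "http://parts.chemspider.com/JSON.ashx"
filter <- "Compound[PredictedProperties|ExperimentalProperties|Synonyms]"
url    <- paste0(base,
                 "?op=GetRecordsAsCompounds",
                 "&csids[0]=733",
                 "&serfilter=", URLencode(filter, reserved = TRUE))
url

# Fetching and parsing (requires network access; inspect the returned
# JSON for the real structure before relying on any field names):
# library(httr); library(jsonlite)
# dat <- fromJSON(content(GET(url), "text"))
```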

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow