Question

To access the NIST Chemistry WebBook database from R, I need to pass a query as part of a URL-encoded web address. Most of the time the conversion works fine with URLencode(), but in some cases it does not. One case where it fails is

query="Poligodial + 3-methoxy-4,5-methylenedioxyamphetamine (R,S) adduct, # 1"

which I tried to fetch using

library(XML)
library(RCurl)
url=URLencode(paste0('http://webbook.nist.gov/cgi/cbook.cgi?Name=',query,'&Units=SI'))
doc=htmlParse(getURL(url),encoding="UTF-8")

However, if you try this URL in your web browser, http://webbook.nist.gov/cgi/cbook.cgi?Name=Poligodial%20+%203-methoxy-4,5-methylenedioxyamphetamine%20(R,S)%20adduct,%20%23%201&Units=SI, it reports that the name was not found. Apparently, if you run the query from http://webbook.nist.gov/chemistry/name-ser.html, the site expects the URL-encoded string

"http://webbook.nist.gov/cgi/cbook.cgi?Name=Poligodial+%2B+3-methoxy-4%2C5-methylenedioxyamphetamine+%28R%2CS%29+adduct%2C+%23+1&Units=SI"

Does anybody have any idea what kind of gsub rules I should use to arrive at the same kind of URL encoding in this case? Or is there some other easy fix?

I tried with

url=gsub(" ","+",gsub(",","%2C",gsub("+","%2B",URLencode(paste('http://webbook.nist.gov/cgi/cbook.cgi?Name=',query,'&Units=SI', sep="")),fixed=T),fixed=T),fixed=T)

but that still wasn't quite right, and I have no idea what rules the owner of the web site could have used...


Solution 3

@Richie Cotton's solution also solves for #, whereas URLencode() doesn't.

Here's a really simple example

# Useless...
URLencode("hi$there")
[1] "hi$there"

# Escaping with a backslash does not help: the backslash gets encoded, the "$" still does not
URLencode("hi\\$there")
[1] "hi%5C$there"

# This works without escaping!
library(RCurl)  # curlEscape() is exported by RCurl
curlEscape("hi$there")
[1] "hi%24there"

Other tips

URLencode follows the RFC 1738 specification (see section 2.2, page 3), which states that:

only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.

That is, it doesn't encode plus signs, commas, or parentheses. So the URL it generates is correct in theory but not in practice.
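One base R way around this is to encode only the query value, not the whole URL, and to ask URLencode to escape reserved characters as well. The sketch below assumes a recent R version, in which reserved = TRUE percent-encodes everything except letters, digits and "._~-" (older versions behaved differently). The variable name encoded_name is only illustrative.

```r
# Assumption: a recent R version, where reserved = TRUE also escapes
# "+", ",", "(", ")" and "#".
query <- "Poligodial + 3-methoxy-4,5-methylenedioxyamphetamine (R,S) adduct, # 1"

# Encode only the value, so the "?", "=" and "&" of the URL itself stay intact.
encoded_name <- URLencode(query, reserved = TRUE)

url <- paste0("http://webbook.nist.gov/cgi/cbook.cgi?Name=", encoded_name, "&Units=SI")
print(url)
```

Encoding the full URL with reserved = TRUE would also escape the "?", "=" and "&" separators, which is why only the value is passed to URLencode here.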

The GET function in the httr package that Scott mentioned calls curlEscape from RCurl, which encodes these punctuation characters.

(GET calls handle_url which calls modify_url which calls build_url which calls curlEscape.)

The URL it generates is

paste0('http://webbook.nist.gov/cgi/cbook.cgi?Name=', curlEscape(query), '&Units=SI')
## [1] "http://webbook.nist.gov/cgi/cbook.cgi?Name=Poligodial%20%2B%203%2Dmethoxy%2D4%2C5%2Dmethylenedioxyamphetamine%20%28R%2CS%29%20adduct%2C%20%23%201&Units=SI"

This seems to work OK.
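If the goal is to reproduce the exact string that the NIST search form sends (spaces written as "+", as in HTML form encoding), a small base R helper can be sketched as follows. form_encode is a hypothetical name, not part of any package, and the sketch again assumes a recent R version in which URLencode(reserved = TRUE) escapes "+", "," and parentheses.

```r
# Hypothetical helper: form-style encoding of a single query value.
form_encode <- function(x) {
  # Percent-encode everything except letters, digits and "._~-" ...
  enc <- URLencode(x, reserved = TRUE)
  # ... then write spaces as "+" instead of "%20".
  # A literal "+" in the input has already become "%2B" at this point.
  gsub("%20", "+", enc, fixed = TRUE)
}

query <- "Poligodial + 3-methoxy-4,5-methylenedioxyamphetamine (R,S) adduct, # 1"
url <- paste0("http://webbook.nist.gov/cgi/cbook.cgi?Name=", form_encode(query), "&Units=SI")
print(url)
```

The order matters: replacing spaces with "+" before percent-encoding would turn those "+" signs into "%2B" as well.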

httr has nice features and you may want to start using it. The minimal change to your code to get things working is simply to swap URLencode for curlEscape.

Does this do what you want?

library(httr)
url <- 'http://webbook.nist.gov/cgi/cbook.cgi'
args <- list(Name = "Poligodial + 3-methoxy-4,5-methylenedioxyamphetamine (R,S) adduct, # 1",
         Units = 'SI')
res <- GET(url, query=args)
content(res)$children$html

Gives

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <meta http-equiv="Window-target" content="_top"/>

...etc.
Licensed under: CC-BY-SA with attribution