Question

I am trying to use R to get data from an open data source in the Netherlands (the URL is in the code below).

When you open this URL in a browser (at least Chrome), it is presented as XML. So I thought I could retrieve it with the RCurl package, parse it, and then use XPath to extract the specific nodes I am after.

However, when I try to parse it, I run into problems. It does not seem to be straight XML; it appears to contain JSON.

How can I easily extract the information from the data source? I am not looking for the full solution, just a pointer in the right direction.

If I try:

url <- "http://www.kiesbeter.nl/open-data/api/care/careproviders/?apikey=18a2b2b0-d232-4f48-8d10-5fc10ff04b17"
html <- getURL(url)
doc <- htmlParse(html,asText = TRUE)

It then seems that doc still contains JSON. I cannot use getNodeSet(doc, "//careproviders") on it. However, if I use fromJSON first, I get the data in an awkward list format.

So the question is: how can I treat this data so that I can easily get the information out of it (e.g. all care providers)? And how do I recognize what format the data is in?


Solution

Use

html <- getURL(url, httpheader = c(Accept = "text/xml"))

with the Accept header set, so that curl requests XML from the service.

A little clarification: the service provides both XML and JSON, with JSON as the default. Your browser sends text/xml (among others) in the Accept header of its request, so the service returns XML. curl does not send an Accept header by default, so the service returns JSON, its default format.
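To make the difference concrete, here is a minimal sketch of both requests; it assumes the XML uses the same element names as the JSON keys shown further down (careproviders/careprovider), which is not verified against the live service:

library(RCurl)
library(XML)

url <- "http://www.kiesbeter.nl/open-data/api/care/careproviders/?apikey=18a2b2b0-d232-4f48-8d10-5fc10ff04b17"

# Without an Accept header the service falls back to its default, JSON
json_txt <- getURL(url)

# With Accept: text/xml the service returns XML, which xmlParse can handle
xml_txt <- getURL(url, httpheader = c(Accept = "text/xml"))
doc <- xmlParse(xml_txt, asText = TRUE)

# XPath then works as intended (element name assumed from the JSON structure)
providers <- getNodeSet(doc, "//careprovider")
length(providers)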

OTHER TIPS

The doc is in JSON format.

library(rjson)
library(RCurl)
ll <- fromJSON(getURL(url))  # parse the (default) JSON response into an R list

The JSON format is friendlier and faster to parse into a list than the XML one. For example, the first element of the list looks like this (a sketch of extracting fields from it follows below):

ll$careproviders$careprovider[[1]]
$id
[1] "1"

$friendly_name
[1] "ziekenhuizen"

$name
[1] "Ziekenhuizen"

$CareProviderCategoryId
[1] "8"