Question

Is there any package for R that allows querying Wikipedia (most probably using Mediawiki API) to get list of available articles relevant to such query, as well as import selected articles for text mining?

Was it helpful?

Solution

Use the RCurl package for retreiving info, and the XML or RJSONIO packages for parsing the response.

If you are behind a proxy, set your options.

opts <- list(
  proxy = "136.233.91.120", 
  proxyusername = "mydomain\\myusername", 
  proxypassword = 'whatever', 
  proxyport = 8080
)

Use the getForm function to access the API.

search_example <- getForm(
  "http://en.wikipedia.org/w/api.php", 
  action  = "opensearch", 
  search  = "Te", 
  format  = "json",
  .opts   = opts
)

Parse the results.

fromJSON(rawToChar(search_example))

OTHER TIPS

There is WikipediR, 'A MediaWiki API wrapper in R'

library(devtools)
install_github("Ironholds/WikipediR")
library(WikipediR)

It includes these functions:

ls("package:WikipediR")
 [1] "wiki_catpages"      "wiki_con"           "wiki_diff"          "wiki_page"         
 [5] "wiki_pagecats"      "wiki_recentchanges" "wiki_revision"      "wiki_timestamp"    
 [9] "wiki_usercontribs"  "wiki_userinfo"  

Here it is in use, getting the contribution details and user details for a bunch of users:

library(RCurl)
library(XML)

# scrape page to get usernames of users with highest numbers of edits
top_editors_page <- "http://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits"
top_editors_table <- readHTMLTable(top_editors_page)
very_top_editors <- as.character(top_editors_table[[3]][1:5,]$User)

# setup connection to wikimedia project 
con <- wiki_con("en", project = c("wikipedia"))

# connect to API and get last 50 edits per user
user_data <- lapply(very_top_editors,  function(i) wiki_usercontribs(con, i) )
# and get information about the users (registration date, gender, editcount, etc)
user_info <- lapply(very_top_editors,  function(i) wiki_userinfo(con, i) )
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top