Question

I would like to scrape the Wikipedia entry for the Vancouver Olympic Games. Unfortunately, it's not in a nice table format.

I am trying to create a data frame with 2 columns: Nation and number of athletes.

So far I have:

library(XML)
library(RCurl)

path<-"https://fr.wikipedia.org/wiki/Jeux_olympiques_d%27hiver_de_2010"
webpage <- getURL(path)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table/tr", xmlValue)
country<-tablehead[31]

where country is

> country
[1] "\n Afrique du Sud (2)\n Albanie (1)\n Algérie (1)\n Allemagne (153)\n Andorre (6)\n Argentine (7)\n Arménie (4)\n Australie (41)\n Autriche (82)\n Azerbaïdjan (2)\n Belgique (8)\n Bermudes (1)\n Biélorussie (50)\n Bosnie-Herzégovine (5)\n Brésil (5)\n Bulgarie (18)\n Canada (206)\n Chili (3)\n Chine (90)\n Chypre (2)\n Colombie (1)\n\n\n\n Corée du Nord (2)\n Corée du Sud (46)\n Croatie (18)\n Danemark (18)\n Espagne (18)\n Estonie (32)\n États-Unis (216)\n Éthiopie (1)\n Finlande (95)\n France (108)\n Géorgie (12)\n Ghana (1)\n Grande-Bretagne (52)\n Grèce (7)\n Hong Kong (1)\n Hongrie (16)\n Îles Caïmans (1)\n Inde (3)\n Iran (4)\n Irlande (6)\n Islande (4)\n\n\n\n Israël (3)\n Italie (109)\n Jamaïque (1)\n Japon (94)\n Kazakhstan (38)\n Kirghizistan (2)\n Lettonie (54)\n Liban (3)\n Liechtenstein (6)\n Lituanie (6)\n Macédoine (3)\n Moldavie (8)\n Maroc (1)\n Mexique (1)\n Monaco (3)\n Monténégro (1)\n Mongolie (2)\n Népal (1)\n Norvège (99)\n Nouvelle-Zélande (16)\n\n\n\n Ouzbékistan (3)\n Pakistan (1)\n Pays-Bas (34)\n Pérou (3)\n Pologne (50)\n Portugal (1)\n République tchèque (93)\n Roumanie (29)\n Russie (179)\n Saint-Marin (1)\n Sénégal (1)\n Serbie (10)\n Slovaquie (73)\n Slovénie (49)\n Suède (108)\n Suisse (146)\n Tadjikistan (1)\n Taipei chinois (1)\n Turquie (5)\n Ukraine (47)\n\n"

I have tried

str_detect(country,"\n")
country<-str_split(country,"\n")

But the data are very dirty, and it's not working well. Any suggestions?

Solution

One option is to use regular expressions. I have not done this in R before, but the stringr package seems to be the recommended tool; see the question "Extract a regular expression match in R version 2.10" and the stringr manual ( http://cran.r-project.org/web/packages/stringr/stringr.pdf ).
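
As a rough illustration of the regular-expression idea, the sketch below pulls every "name (number)" pair out of the raw string in one call with stringr's str_match_all, without splitting on newlines first. The sample string is a shortened, hand-copied excerpt of the scraped text, and the column names nation and athletes are my own choice, not something from the question.

```r
library(stringr)

# Shortened excerpt of the scraped text (assumption: same layout as the real string)
country <- "\n Afrique du Sud (2)\n Albanie (1)\n Algérie (1)\n\n\n\n Corée du Nord (2)\n Bosnie-Herzégovine (5)\n"

# Group 1: everything up to the "(" on one line; group 2: the digits inside the parentheses
m <- str_match_all(country, "([^(\n]+)\\(([0-9]+)\\)")[[1]]

df <- data.frame(
  nation   = str_trim(m[, 2]),    # drop the padding around the name
  athletes = as.integer(m[, 3]),
  stringsAsFactors = FALSE
)
print(df)
```

Because the pattern only matches lines that contain a parenthesised number, the empty lines in the input never reach the data frame.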

EDIT: Code that appears to work for me

library(XML)
library(RCurl)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
library(stringr)

path<-"https://fr.wikipedia.org/wiki/Jeux_olympiques_d%27hiver_de_2010"
webpage <- getURL(path)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE, encoding = "UTF-8")
# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table/tr", xmlValue)
country<-tablehead[31]

country<-strsplit(country,"\n")

# extract country name: everything before the first "("
bar <- function(x) str_trim(str_extract(x, "[^(]*"), side = "both")
res1 <- sapply(country[[1]], bar, USE.NAMES = FALSE)
# extract number of athletes: the text between the parentheses
foo <- function(x) str_trim(str_match(x, "\\((.*?)\\)")[[2]], side = "both")
res2 <- sapply(country[[1]], foo, USE.NAMES = FALSE)
# build df; blank lines have no "(n)" part, so their count is NA
res2 <- as.numeric(res2)
df <- data.frame(res1, res2)
df <- df[!is.na(df$res2), ]
# inspect df
nrow(df)
summary(df)
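
If installing stringr is not an option, the same split-then-extract steps can probably be done with base R alone. This is only a sketch under that assumption: it uses regexec and regmatches instead of the stringr helpers, runs on a shortened excerpt of the scraped string, and the column names are mine.

```r
# Shortened excerpt of the scraped text (assumption: same layout as the real string)
country <- "\n Afrique du Sud (2)\n Albanie (1)\n Algérie (1)\n\n\n\n Corée du Nord (2)\n"

# Split into lines, trim the padding, and drop the empty lines
lines <- trimws(strsplit(country, "\n", fixed = TRUE)[[1]])
lines <- lines[nzchar(lines)]

# Each match is c(full match, name, count); keep only lines that matched
parts <- regmatches(lines, regexec("^([^(]*)\\(([0-9]+)\\)$", lines))
parts <- do.call(rbind, parts[lengths(parts) == 3])

df <- data.frame(
  nation   = trimws(parts[, 2]),
  athletes = as.integer(parts[, 3]),
  stringsAsFactors = FALSE
)
print(df)
```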

OTHER TIPS

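Try:

library(plyr)
library(stringr)
country <- str_split(country,"\n")[[1]]
# note: [A-Za-z]+ matches ASCII letters only, so a name stops at the first space or accented letter
df <- ldply(country, function(z) data.frame(a = str_extract(z, "[A-Za-z]+"), b = str_extract(z, "[0-9]+")))
head(na.omit(df))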

                                  a                        b
2                           Afrique                        2
3                           Albanie                        1
4                               Alg                        1
5                         Allemagne                      153
6                           Andorre                        6
7                         Argentine                        7
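
The truncated names in this output ("Afrique", "Alg") come from the [A-Za-z]+ pattern, which stops at the first space or accented letter. One possible fix, sketched here on a few hand-copied sample lines, is a Unicode-aware character class. It assumes a stringr version built on ICU regular expressions, where \p{L} matches any letter; the column names a and b follow the output above.

```r
library(stringr)

# A few sample lines as they look after splitting on "\n" (assumption)
lines <- c("", " Afrique du Sud (2)", " Algérie (1)",
           " Bosnie-Herzégovine (5)", " Îles Caïmans (1)")

df <- data.frame(
  # letters of any script, plus spaces, apostrophes and hyphens
  a = str_trim(str_extract(lines, "[\\p{L}\\s'\\-]+")),
  b = as.integer(str_extract(lines, "[0-9]+")),
  stringsAsFactors = FALSE
)
df <- na.omit(df)   # drops the row produced by the empty line
print(df)
```
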
Licensed under: CC-BY-SA with attribution