Question

I would like to scrape the Wikipedia entry for the Vancouver Olympic Games. Unfortunately, it's not in a nice table format.

I am trying to create a data frame with 2 columns: Nation and number of athletes.

At this point I have

library(XML)
library(RCurl)

path<-"https://fr.wikipedia.org/wiki/Jeux_olympiques_d%27hiver_de_2010"
webpage <- getURL(path)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE)

# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table/tr", xmlValue)
country<-tablehead[31]

where country is

> country
[1] "\n Afrique du Sud (2)\n Albanie (1)\n Algérie (1)\n Allemagne (153)\n Andorre (6)\n Argentine (7)\n Arménie (4)\n Australie (41)\n Autriche (82)\n Azerbaïdjan (2)\n Belgique (8)\n Bermudes (1)\n Biélorussie (50)\n Bosnie-Herzégovine (5)\n Brésil (5)\n Bulgarie (18)\n Canada (206)\n Chili (3)\n Chine (90)\n Chypre (2)\n Colombie (1)\n\n\n\n Corée du Nord (2)\n Corée du Sud (46)\n Croatie (18)\n Danemark (18)\n Espagne (18)\n Estonie (32)\n États-Unis (216)\n Éthiopie (1)\n Finlande (95)\n France (108)\n Géorgie (12)\n Ghana (1)\n Grande-Bretagne (52)\n Grèce (7)\n Hong Kong (1)\n Hongrie (16)\n Îles Caïmans (1)\n Inde (3)\n Iran (4)\n Irlande (6)\n Islande (4)\n\n\n\n Israël (3)\n Italie (109)\n Jamaïque (1)\n Japon (94)\n Kazakhstan (38)\n Kirghizistan (2)\n Lettonie (54)\n Liban (3)\n Liechtenstein (6)\n Lituanie (6)\n Macédoine (3)\n Moldavie (8)\n Maroc (1)\n Mexique (1)\n Monaco (3)\n Monténégro (1)\n Mongolie (2)\n Népal (1)\n Norvège (99)\n Nouvelle-Zélande (16)\n\n\n\n Ouzbékistan (3)\n Pakistan (1)\n Pays-Bas (34)\n Pérou (3)\n Pologne (50)\n Portugal (1)\n République tchèque (93)\n Roumanie (29)\n Russie (179)\n Saint-Marin (1)\n Sénégal (1)\n Serbie (10)\n Slovaquie (73)\n Slovénie (49)\n Suède (108)\n Suisse (146)\n Tadjikistan (1)\n Taipei chinois (1)\n Turquie (5)\n Ukraine (47)\n\n"

I have tried

str_detect(country,"\n")
country<-str_split(country,"\n")

But the data are very dirty, and it's not working well. Any suggestions?

Solution

One possibility is to use regular expressions. I have never done that in R, but the stringr package seems to be recommended; see the question "Extract a regular expression match in R version 2.10" and the stringr manual ( http://cran.r-project.org/web/packages/stringr/stringr.pdf ).

EDIT: Code that appears to work for me

library(XML)
library(RCurl)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
library(stringr)

path<-"https://fr.wikipedia.org/wiki/Jeux_olympiques_d%27hiver_de_2010"
webpage <- getURL(path)
webpage <- readLines(tc <- textConnection(webpage)); close(tc)

pagetree <- htmlTreeParse(webpage, error=function(...){}, useInternalNodes = TRUE, encoding = "UTF-8")
# Extract table header and contents
tablehead <- xpathSApply(pagetree, "//*/table/tr", xmlValue)
country<-tablehead[31]

country<-strsplit(country,"\n")

# extract country
bar <- function(x) str_trim(str_extract(x, "[^(]*"), side = "both")
res1 <- sapply(country[[1]], bar)    
# extract nb of athletes
foo <- function(x) str_trim(str_match(x, "\\((.*?)\\)")[[2]], side = "both")
res2 <- sapply(country[[1]], foo)
# build df
res2 <- as.numeric(res2)
df <- data.frame(res1, res2)
df <- df[res1 != "",]
# inspect df
nrow(df)
summary(df)
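
Since stringr functions are vectorized, the same extraction can also be written without sapply. The following is only a sketch of that variant, not the code above: the sample string `raw` is a hypothetical stand-in for the scraped text, the column names `nation` and `athletes` are made up for illustration, and a single pattern with two capture groups replaces the two separate patterns.

```r
library(stringr)

# Hypothetical sample mimicking the scraped text; "\u00e9" is an escaped "é",
# written that way so the script does not depend on the file encoding
raw <- "\n Afrique du Sud (2)\n Alg\u00e9rie (1)\n\n\n\n Cor\u00e9e du Nord (2)\n Canada (206)\n"

# One string per line, trimmed, with the blank separator lines dropped
lines <- str_trim(strsplit(raw, "\n")[[1]])
lines <- lines[lines != ""]

# Group 1: name before the parentheses; group 2: digits inside them
m <- str_match(lines, "^(.*?)\\s*\\((\\d+)\\)$")

df <- data.frame(nation = m[, 2],
                 athletes = as.numeric(m[, 3]),
                 stringsAsFactors = FALSE)
print(df)
```

str_match returns a matrix with the full match in column 1 and one column per capture group, so both fields come out of a single pass over the lines.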

Other tips

Try

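library(plyr)
library(stringr)
# Split into one string per line; [[1]] unwraps the list returned by str_split
country <- str_split(country, "\n")[[1]]
# Note: "[A-Za-z]+" stops at spaces and accented letters ("Afrique du Sud" -> "Afrique")
df <- ldply(country, function(z) data.frame(a = str_extract(z, "[A-Za-z]+"), b = str_extract(z, "[0-9]+")))
head(na.omit(df))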

                                  a                        b
2                           Afrique                        2
3                           Albanie                        1
4                               Alg                        1
5                         Allemagne                      153
6                           Andorre                        6
7                         Argentine                        7
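
As the output shows, an ASCII-only letter class cuts names off at the first space or accented letter ("Afrique", "Alg"). One possible way around this, sketched here under the assumption that nation names contain only letters, spaces and hyphens, is the Unicode letter class \p{L}, which stringr's regex engine supports. The sample vector `x` is hypothetical.

```r
library(stringr)

# Hypothetical sample lines; "\u00e9" is an escaped "é" and "\u00c9" an escaped "É"
x <- c(" Afrique du Sud (2)", " Alg\u00e9rie (1)", " \u00c9tats-Unis (216)", "")

# \p{L} matches any Unicode letter; whitespace and hyphens may follow inside a name
nation <- str_trim(str_extract(x, "\\p{L}[\\p{L}\\s\\-]*"))
athletes <- as.numeric(str_extract(x, "[0-9]+"))

# Empty lines yield NA in both columns and are dropped here
df <- na.omit(data.frame(a = nation, b = athletes, stringsAsFactors = FALSE))
print(df)
```

Converting the count with as.numeric at this stage also gives a numeric column b rather than a character one.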
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow