Question

I'm currently trying to scrape text from an HTML tree that I've parsed as follows:

require(RCurl)
require(XML)

query.IMDB <- getURL('http://www.imdb.com/title/tt0096697/epdate') #Simpsons episodes, rated and ordered by broadcast date
names(query.IMDB)

query.IMDB

query.IMDB <- htmlParse(query.IMDB)
df.IMDB <- getNodeSet(query.IMDB, "//*/div[@class='rating rating-list']")

My first attempt was just to use grep on the resulting vector, but this fails.

data[grep("Users rated this", "", df.IMDB)]
#Error in data... object of type closure is not subsettable

My next attempt was to use grep on the individual elements of the df.IMDB vector:

vect <- numeric(length(df.IMDB))

for (i in 1:length(df.IMDB)){

      vect[i] <- data[grep("Users rated this", "", df.IMDB)]

  }

but this also throws the closure not subsettable error.

Finally, trying the above loop without data[] around the grep call throws:

Error in df.IMDB[i] <- grep("Users rated this", "", df.IMDB[i]) : replacement has length zero

I'm actually hoping to eventually replace everything except a number of the form [0-9].[0-9] following the given text string with blank space, but I'm doing a simpler version first to get the thing working.

Can anyone advise what function I should be using to edit the text in each element of my df.IMDB vector?

Solution

There is no need to use grep here (avoid regular expressions on HTML files). Use the handy readHTMLTable function from the XML package:

library(XML)
head(readHTMLTable('http://www.imdb.com/title/tt0096697/epdate')[[1]][,c(2:4)])
                            Episode UserRating UserVotes
1 Simpsons Roasting on an Open Fire        8.2     2,694
2                   Bart the Genius        7.8     1,167
3                   Homer's Odyssey        7.5     1,005
4     There's No Disgrace Like Home        7.9     1,017
5                  Bart the General        8.0       992
6                      Moaning Lisa        7.4       988

This gives you the table of ratings. You may also want to convert UserVotes to numeric.
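
As a rough sketch of that conversion: the data frame below is a hypothetical stand-in for a few rows of the table shown in the output, so the snippet runs without fetching the page; the name `ratings` is an assumption, not something from the original answer. It assumes the columns arrive as character strings or factors, which is why `as.character()` is applied before `as.numeric()`.

```r
# Hypothetical stand-in for a few rows of the table returned by readHTMLTable
ratings <- data.frame(
  Episode    = c("Simpsons Roasting on an Open Fire", "Bart the Genius", "Bart the General"),
  UserRating = c("8.2", "7.8", "8.0"),
  UserVotes  = c("2,694", "1,167", "992"),
  stringsAsFactors = FALSE
)

# as.character() guards against factor columns;
# gsub() removes the thousands separator before the numeric conversion
ratings$UserVotes  <- as.numeric(gsub(",", "", as.character(ratings$UserVotes), fixed = TRUE))
ratings$UserRating <- as.numeric(as.character(ratings$UserRating))

str(ratings)
```

Without stripping the commas first, as.numeric() would return NA (with a warning) for values such as "2,694".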

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow