Question

I have a local HTML file that looks like this:

[image not preserved]

Would someone be so kind as to show me how to approach scraping a few lines from this kind of layout?

This is one of many unsuccessful attempts:

library(XML)
example.html <- scan(file=file.choose(),what="character")
parse.html <- htmlTreeParse(example.html, useInternalNodes = TRUE)
xpath.val <- xpathApply(parse.html, '//div', xmlValue)
g.val <- gsub('\\s', '', xpath.val)

If anyone is interested in seeing the HTML file itself, it is here.

EDIT: Of course I don't expect anyone to solve this whole issue. I would be happy with any thought as to where to look.

Solution

Okay, this doesn't get you quite all the way there, but maybe it helps:

library(XML)
library(stringr)
# Inline XBRL namespace passed to the XPath query
namespaces=c(xmlns="http://www.xbrl.org/2008/inlineXBRL")
parse.html <- htmlTreeParse("~/Downloads/html1.html", useInternalNodes=TRUE)
# Select every table row with class "iris_table_row"
tt <- xpathApply(parse.html, '//tr[@class="iris_table_row"]', namespaces=namespaces)
# Return the trimmed text of the non-empty <td> cells of one row
foo <- function(x){
  vals <- sapply(xmlChildren(x), xmlValue)
  str_trim(vals[names(vals) %in% "td" & sapply(vals, nchar)>0], "both")
}
rows <- lapply(tt, foo)
# Inspect a few of the extracted rows
rows[170:175]

[[1]]
 td 
"%" 

[[2]]
                td                 td 
"Class of shares:"          "holding" 

[[3]]
        td         td 
"Ordinary"   "100.00" 

[[4]]
            td             td 
      "Page 5" "continued..." 

[[5]]
                                                      td 
"Whitton Park Estates Limited (Registered number: 00231549)" 

[[6]]
                                         td 
"Notes to the Abbreviated Accounts - continued" 
License: CC-BY-SA with attribution
Not affiliated with StackOverflow