I have a local HTML file.

Would someone be so kind as to show me how to approach scraping a few lines from this kind of layout?

This is one of many unsuccessful attempts:

library(XML)
example.html <- scan(file=file.choose(),what="character")
parse.html <- htmlTreeParse(example.html, useInternalNodes = TRUE)
xpath.val <- xpathApply(parse.html, '//div', xmlValue)
g.val <- gsub('\\s', '', xpath.val)

If anyone is interested in seeing the HTML file itself, it is here.

EDIT: Of course I don't expect anyone to solve this whole issue. I would be happy with any thought as to where to look.


Solution

Okay, this doesn't get you quite all the way there, but maybe it helps:

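library(XML)      # htmlTreeParse, xpathApply, xmlChildren, xmlValue
library(stringr)  # str_trim
namespaces=c(xmlns="http://www.xbrl.org/2008/inlineXBRL")
parse.html <- htmlTreeParse("~/Downloads/html1.html", useInternalNodes=TRUE)
# Select every table row carrying the class "iris_table_row"
tt <- xpathApply(parse.html, '//tr[@class="iris_table_row"]', namespaces=namespaces)
# For one <tr>, return the text of its non-empty <td> cells,
# trimmed of surrounding whitespace
foo <- function(x){
  vals <- sapply(xmlChildren(x), xmlValue)
  str_trim(vals[names(vals) %in% "td" & sapply(vals, nchar)>0], "both")
}
rows <- lapply(tt, foo)
rows[170:175]  # inspect a few of the extracted rows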

[[1]]
 td 
"%" 

[[2]]
                td                 td 
"Class of shares:"          "holding" 

[[3]]
        td         td 
"Ordinary"   "100.00" 

[[4]]
            td             td 
      "Page 5" "continued..." 

[[5]]
                                                      td 
"Whitton Park Estates Limited (Registered number: 00231549)" 

[[6]]
                                         td 
"Notes to the Abbreviated Accounts - continued" 
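
As a possible next step, here is a minimal sketch of turning that list into something tabular. It assumes the rows of interest are the ones with exactly two non-empty cells (a label and a value), which may not hold for every table in the file. The `rows` list below is just the six elements printed above, retyped so the snippet runs on its own; `rows_to_df` is a made-up helper name, not something from XML or stringr.

```r
# Example data retyped from the printed output above (rows[170:175]).
# In practice `rows` would be the list produced by lapply(tt, foo).
rows <- list(
  c(td = "%"),
  c(td = "Class of shares:", td = "holding"),
  c(td = "Ordinary", td = "100.00"),
  c(td = "Page 5", td = "continued..."),
  c(td = "Whitton Park Estates Limited (Registered number: 00231549)"),
  c(td = "Notes to the Abbreviated Accounts - continued")
)

# Assumption: a "label / value" row has exactly two non-empty cells.
pairs <- rows[sapply(rows, length) == 2]

# Hypothetical helper: bind the two-cell rows into a data frame.
rows_to_df <- function(p) {
  data.frame(
    label = vapply(p, function(r) unname(r[1]), character(1)),
    value = vapply(p, function(r) unname(r[2]), character(1)),
    stringsAsFactors = FALSE
  )
}

df <- rows_to_df(pairs)

# Values that do not parse as numbers become NA; suppressWarnings
# hides the coercion warning that as.numeric emits for them.
df$numeric_value <- suppressWarnings(as.numeric(df$value))
print(df)
```

Rows such as "Page 5" / "continued..." still slip through this filter, so some further cleanup (for example, dropping rows where `numeric_value` is NA) would probably be needed depending on which lines you are after.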
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow