Scrape all values from local html file in R [closed]

https://stackoverflow.com/questions/21271086

30-09-2022

Question

Closed. This question needs debugging details. It is not currently accepting answers.

Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.

Closed 7 years ago.


I have a local HTML file that looks like this:

[image]

Would someone be so kind as to show me how to approach this, i.e. how to scrape a few lines from this kind of layout?

This is one of my many unsuccessful attempts:

library(XML)
example.html <- scan(file=file.choose(),what="character")
parse.html <- htmlTreeParse(example.html, useInternalNodes = TRUE)
xpath.val <- xpathApply(parse.html, '//div', xmlValue)
g.val <- gsub('\\s', '', xpath.val)

If anyone is interested in seeing the HTML file itself, it is here.

EDIT: Of course I don't expect anyone to solve this whole issue. I would be happy with any thought as to where to look.

Solution

Okay, this doesn't get you quite all the way there, but maybe it helps:

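library(XML)
library(stringr)

# Namespace URI passed along to the XPath query
namespaces <- c(xmlns = "http://www.xbrl.org/2008/inlineXBRL")
parse.html <- htmlTreeParse("~/Downloads/html1.html", useInternalNodes = TRUE)

# Select every table row that carries the "iris_table_row" class
tt <- xpathApply(parse.html, '//tr[@class="iris_table_row"]', namespaces = namespaces)

# Return the trimmed text of the non-empty <td> cells of one row
foo <- function(x) {
  vals <- sapply(xmlChildren(x), xmlValue)
  str_trim(vals[names(vals) %in% "td" & sapply(vals, nchar) > 0], "both")
}

rows <- lapply(tt, foo)
rows[170:175]  # print a handful of rows as a sample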

[[1]]
 td 
"%" 

[[2]]
                td                 td 
"Class of shares:"          "holding" 

[[3]]
        td         td 
"Ordinary"   "100.00" 

[[4]]
            td             td 
      "Page 5" "continued..." 

[[5]]
                                                      td 
"Whitton Park Estates Limited (Registered number: 00231549)" 

[[6]]
                                         td 
"Notes to the Abbreviated Accounts - continued"

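To go one step further, the two-cell rows could be collected into a label/value table. This is only a rough sketch in base R, assuming `rows` is a list of character vectors like the one printed above. The sample list here is hand-copied from that output rather than read from the file, and the names `two_cell` and `pairs` are made up for illustration.

```r
# Sample data hand-copied from the output above (a stand-in for the real `rows`).
rows <- list(
  c(td = "%"),
  c(td = "Class of shares:", td = "holding"),
  c(td = "Ordinary", td = "100.00"),
  c(td = "Page 5", td = "continued...")
)

# Keep only the rows that have exactly two cells.
two_cell <- Filter(function(r) length(r) == 2, rows)

# Stack them into a two-column data frame: first cell as label, second as value.
pairs <- data.frame(
  label = vapply(two_cell, function(r) r[[1]], character(1)),
  value = vapply(two_cell, function(r) r[[2]], character(1)),
  stringsAsFactors = FALSE
)
print(pairs)
```

Rows such as the page footer ("Page 5", "continued...") also have two cells, so they would still need to be filtered out by hand or by pattern before the table is useful.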
Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow