Question

I have some data in text form, taken from a webpage. It's quite lengthy but follows the form:

<p><span class="monthyear">Jan 2001</span>
<br><b>Foo text (2)</b></p>
<p><span class="monthyear">Nov 2006</span>
<br><b>Bar text (29)</b>
<br><b>More bar text (4)</b>
<br><b>Yet more bar text (102)</b></p>
<p><span class="monthyear">Apr 2004</span>
<br><b>Further foo text (1)</b>
<br><b>Combination foo and bar text (41)</b></p>

I want to extract the relevant parts of this into a data frame, like so:

  monthyear          info  n
1  Jan 2001      Foo text  2
2  Nov 2006      Bar text 29
3  Nov 2006 More bar text  4

...but I'm not sure how to do it. If I have the HTML in a character vector called text, I can extract the monthyear data using a function from the stringr package:

monthyear <- str_extract_all(
text[1],perl("(?<=\\\"monthyear\\\">).*?20[0-9]{2}")
)

and I could extract the info and n data in the same sort of way, but given that there are multiple info and n entries for each monthyear entry, I'm not sure how to combine them. Am I going about this all wrong?

Solution

Unfortunately, we can't always control the quality of our data sources, so we have to resort to some tedious manual processing. (Some people say that the majority of a data analyst's time is spent in cleaning data, and not in analysis.)

As already noted in the comments, regular expressions aren't the best tool for working with HTML, because HTML, in general, isn't a regular language (it's closer to a context-free one). But if the HTML source is reasonably uniform (as it is in the example data you've provided), you may still be able to use them effectively.
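For completeness, here is a sketch of the parser-based route the comments allude to, assuming the "XML" package is available; the node and attribute names are taken from your sample fragment, and the inline HTML string here is just a stand-in for your real page:

```r
# A sketch of the parser route with the "XML" package: parse the
# fragment, find each <p> containing a span of class "monthyear", and
# pull out the label and the bracketed counts with xmlValue().
library(XML)

html <- '<p><span class="monthyear">Nov 2006</span>
<br><b>Bar text (29)</b>
<br><b>More bar text (4)</b></p>'

doc   <- htmlParse(html, asText = TRUE)
paras <- getNodeSet(doc, "//p[span[@class='monthyear']]")

out <- do.call(rbind, lapply(paras, function(p) {
  my  <- xpathSApply(p, ".//span[@class='monthyear']", xmlValue)
  raw <- xpathSApply(p, ".//b", xmlValue)          # e.g. "Bar text (29)"
  data.frame(monthyear = my,
             info = sub(" *\\(\\d+\\)$", "", raw),
             n    = as.numeric(sub(".*\\((\\d+)\\)$", "\\1", raw)),
             stringsAsFactors = FALSE)
}))
out
```

This sidesteps the tab-insertion gymnastics below entirely, but it does assume the page parses cleanly; if it doesn't, the line-by-line approach that follows is the fallback.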

Here's a step-by-step example. I've added HTML header tags to your example text and stored it here: http://ideone.com/O1PC05

  1. Read in your data using readLines

    x1 <- readLines("http://ideone.com/plain/O1PC05")
    
  2. Isolate the "body" of the web page

    bodycontent <- grep("<body>|</body>", x1)
    x2 <- x1[(bodycontent[1]+1):(bodycontent[2]-1)]
    
  3. grepl returns TRUE or FALSE depending on whether "monthyear" was found in a given line. Use cumsum to create "groups", and split to convert the character vector into a list.

    x3 <- split(x2, cumsum(grepl("monthyear", x2)))
    
  4. You can do the following in multiple steps if you prefer. The basic idea is to lapply over your list, replace all your HTML tags with tabs, and replace your brackets with tabs. After that you can use read.delim, but expect to get a lot of columns that are FULL of NA values since we're inserting a lot more tabs than we need.

    This is the step most likely to fail, for several reasons: (1) it assumes the source data really is well structured; (2) the text itself might contain brackets; (3) there might be other content in the body (script tags, table tags, and so on) that would be read in and processed along with the data.

    x4 <- read.delim(header = FALSE,
                     stringsAsFactors = FALSE,
                     strip.white = TRUE, 
                     sep = "\t", 
                     text = 
                       unlist(lapply(x3, 
                                     function(x) {
                                       temp <- gsub("<(.|\n)*?>", "\t", x)
                                       paste(gsub("[()]", "\t", temp), 
                                             collapse="\t")
                                       })))
    
  5. I mentioned that in step 4, we will end up with a lot of junk columns. Let's get rid of those.

    x5 <- x4[apply(x4, 2, function(x) !all(is.na(x)))]
    
  6. And now let's name the columns in a more meaningful way. We know that the first column will be the "monthyear" variable by design, and the others should alternate between "info" and "n", so we can use some basic rep calls wrapped in paste to build the variable names. While we're at it, we'll use as.yearmon from the "zoo" package to convert the "monthyear" variable to actual dates, letting us sort and do other nifty things that actual dates allow.

    myseq <- ncol(x5[-1])/2 # We expect pairs of columns, right?
    names(x5) <- c("monthyear", 
                   paste(rep(c("info", "n"), myseq), 
                         rep(1:myseq, each = 2), sep = "."))
    library(zoo)
    x5$monthyear <- as.Date(as.yearmon(x5$monthyear, "%b %Y"))
    x5
    #    monthyear           info.1 n.1                       info.2 n.2            info.3 n.3
    # 1 2001-01-01         Foo text   2                               NA                    NA
    # 2 2006-11-01         Bar text  29                More bar text   4 Yet more bar text 102
    # 3 2004-04-01 Further foo text   1 Combination foo and bar text  41                    NA
    
  7. If you really want your data in long form, use reshape:

    x6 <- reshape(x5, 
                  direction = "long", 
                  idvar = "monthyear", 
                  varying = 2:ncol(x5))
    
  8. Do some optional cleanup, like ordering the output by date, resetting your row names, and dropping incomplete cases:

    x6 <- x6[order(x6$monthyear), ]
    rownames(x6) <- NULL
    x6[complete.cases(x6), ]
    #    monthyear time                         info   n
    # 1 2001-01-01    1                     Foo text   2
    # 4 2004-04-01    1             Further foo text   1
    # 5 2004-04-01    2 Combination foo and bar text  41
    # 7 2006-11-01    1                     Bar text  29
    # 8 2006-11-01    2                More bar text   4
    # 9 2006-11-01    3            Yet more bar text 102
    

Anyway, try it out, and modify as needed. My guess is that at some point, you'll have to open up the files in a plain text editor and do some preliminary cleanup there before you can proceed.
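If the grepl/cumsum/split combination in step 3 seems opaque, here is the idea on a toy vector (the data is made up for illustration):

```r
# grepl() flags the lines that start a new record; cumsum() turns those
# flags into running group IDs; split() then partitions the vector.
x <- c("monthyear A", "item 1", "monthyear B", "item 2", "item 3")

flags  <- grepl("monthyear", x)  # TRUE FALSE TRUE FALSE FALSE
groups <- cumsum(flags)          # 1 1 2 2 2
split(x, groups)
# $`1`
# [1] "monthyear A" "item 1"
#
# $`2`
# [1] "monthyear B" "item 2" "item 3"
```

Each group starts at a "monthyear" line and runs until the next one, which is exactly the record structure the HTML implies.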

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow