Unfortunately, we can't always control the quality of our data sources, so we sometimes have to resort to tedious manual processing. (Some people say that the majority of a data analyst's time is spent cleaning data rather than actually analyzing it.)
As already noted in the comments, regular expressions aren't the best tool for working with HTML, because HTML, in general, isn't a regular language (I believe it's context-free). But if your HTML sources are reasonably regular (as they are in the example data you've provided), you might still be able to use them effectively.
Here's a step-by-step example. I've added HTML header tags to your example text and stored it here: http://ideone.com/O1PC05
Read in your data using `readLines`:

    x1 <- readLines("http://ideone.com/plain/O1PC05")
Isolate the "body" of the web page:

    bodycontent <- grep("<body>|</body>", x1)
    x2 <- x1[(bodycontent[1] + 1):(bodycontent[2] - 1)]
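To see what that's doing, here's a minimal sketch on a made-up page (the vector `x` below stands in for the lines `readLines` would return):

```r
# Toy "page": each element is one line of HTML
x <- c("<html>", "<head></head>", "<body>",
       "content line 1", "content line 2",
       "</body>", "</html>")

# grep returns the indices of the lines matching <body> or </body>
bodycontent <- grep("<body>|</body>", x)
bodycontent
# [1] 3 6

# Keep only what sits strictly between those two markers
x[(bodycontent[1] + 1):(bodycontent[2] - 1)]
# [1] "content line 1" "content line 2"
```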
`grepl` returns `TRUE` or `FALSE` depending on whether "monthyear" was found in a given line. Use `cumsum` to create "groups", and `split` to convert the character vector to a list:

    x3 <- split(x2, cumsum(grepl("monthyear", x2)))
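If the `cumsum(grepl(...))` idiom is unfamiliar, here's a tiny sketch on made-up lines: each line containing "monthyear" starts a new record, and the running sum turns those marker hits into group IDs.

```r
# Toy vector: "monthyear" lines mark the start of each record
x <- c("Jan 2001 monthyear", "detail A", "detail B",
       "Nov 2006 monthyear", "detail C")

# grepl gives TRUE at each marker; cumsum turns that into 1, 1, 1, 2, 2
groups <- cumsum(grepl("monthyear", x))
groups
# [1] 1 1 1 2 2

# split then collects each record's lines into one list element
split(x, groups)
# $`1`
# [1] "Jan 2001 monthyear" "detail A"           "detail B"
#
# $`2`
# [1] "Nov 2006 monthyear" "detail C"
```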
You can do the following in multiple steps if you prefer. The basic idea is to `lapply` over your list, replacing all the HTML tags with tabs and replacing the brackets with tabs as well. After that you can use `read.delim`, but expect to get a lot of columns that are full of `NA` values, since we're inserting many more tabs than we need.

This is the step most likely to fail, for several reasons: (1) it assumes that the source data really is well structured; (2) the text itself might contain brackets; and (3) there might be other content in the body, including script tags, table tags, and so on, that will be read in and processed along with everything else.

    x4 <- read.delim(header = FALSE, stringsAsFactors = FALSE,
                     strip.white = TRUE, sep = "\t",
                     text = unlist(lapply(x3, function(x) {
                       temp <- gsub("<(.|\n)*?>", "\t", x)
                       paste(gsub("[()]", "\t", temp), collapse = "\t")
                     })))
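To make failure reason (2) concrete, here's a sketch of the two `gsub` calls on a single made-up line: the substitutions work as intended until the text itself contains brackets, at which point one field gets split across columns.

```r
# The intended behaviour: tags and brackets both become tab separators
line  <- "<td>Foo text</td><td>(2)</td>"
step1 <- gsub("<(.|\n)*?>", "\t", line)   # tags -> tabs
step2 <- gsub("[()]", "\t", step1)        # brackets -> tabs
step2
# [1] "\tFoo text\t\t\t2\t\t"

# Failure reason (2): a bracket inside the text also becomes a separator,
# so "Foo (fancy) text" ends up spread over three fields
bad <- gsub("[()]", "\t", gsub("<(.|\n)*?>", "\t", "<td>Foo (fancy) text</td>"))
strsplit(bad, "\t")[[1]]
# [1] ""      "Foo "  "fancy" " text"
```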
I mentioned that in step 4 we would end up with a lot of junk columns. Let's get rid of those:

    x5 <- x4[apply(x4, 2, function(x) !all(is.na(x)))]
And now, let's name the columns in a more meaningful way. We know that the first column will be the "monthyear" variable by design, and the others should alternate between "info" and "n", so we can use some basic `rep`s wrapped in `paste` to get our variable names. While we're at it, we'll use `as.yearmon` from the "zoo" package to convert our "monthyear" variable to actual dates, allowing us to sort and do other nifty things that real dates let us do.

    myseq <- ncol(x5[-1])/2  # We expect pairs of columns, right?
    names(x5) <- c("monthyear",
                   paste(rep(c("info", "n"), myseq),
                         rep(1:myseq, each = 2), sep = "."))
    library(zoo)
    x5$monthyear <- as.Date(as.yearmon(x5$monthyear, "%b %Y"))
    x5
    #    monthyear           info.1 n.1                       info.2 n.2            info.3 n.3
    # 1 2001-01-01         Foo text   2                           NA  NA                NA  NA
    # 2 2006-11-01         Bar text  29                More bar text   4 Yet more bar text 102
    # 3 2004-04-01 Further foo text   1 Combination foo and bar text  41                NA  NA
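If you'd rather not add the "zoo" dependency, a base-R sketch of the same conversion is to paste a dummy day onto the string and parse with `as.Date` (the `%b` month abbreviations are locale-dependent, so this assumes English month names):

```r
# Base-R equivalent of as.Date(as.yearmon(x, "%b %Y")):
# prepend a dummy day-of-month, then parse the full date
monthyear <- c("Jan 2001", "Nov 2006", "Apr 2004")
as.Date(paste("01", monthyear), format = "%d %b %Y")
# [1] "2001-01-01" "2006-11-01" "2004-04-01"
```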
If you really wanted your data in long form, use `reshape`:

    x6 <- reshape(x5, direction = "long", idvar = "monthyear",
                  varying = 2:ncol(x5))
Do some optional cleanup, like ordering the output by date, resetting your row names, and dropping incomplete cases:
    x6 <- x6[order(x6$monthyear), ]
    rownames(x6) <- NULL
    x6[complete.cases(x6), ]
    #    monthyear time                         info   n
    # 1 2001-01-01    1                     Foo text   2
    # 4 2004-04-01    1             Further foo text   1
    # 5 2004-04-01    2 Combination foo and bar text  41
    # 7 2006-11-01    1                     Bar text  29
    # 8 2006-11-01    2                More bar text   4
    # 9 2006-11-01    3            Yet more bar text 102
Anyway, try it out, and modify as needed. My guess is that at some point, you'll have to open up the files in a plain text editor and do some preliminary cleanup there before you can proceed.