Question

I have some data in text form, taken from a webpage. It's quite lengthy but follows the form:

<p><span class="monthyear">Jan 2001</span>
<br><b>Foo text (2)</b></p>
<p><span class="monthyear">Nov 2006</span>
<br><b>Bar text (29)</b>
<br><b>More bar text (4)</b>
<br><b>Yet more bar text (102)</b></p>
<p><span class="monthyear">Apr 2004</span>
<br><b>Further foo text (1)</b>
<br><b>Combination foo and bar text (41)</b></p>

I want to extract the relevant parts of this into a data frame, like so:

  monthyear          info  n
1  Jan 2001      Foo text  2
2  Nov 2006      Bar text 29
3  Nov 2006 More bar text  4

...but I'm not sure how to do it. If I have the HTML in a character vector called text, I can extract the monthyear data using a function from the stringr package:

monthyear <- str_extract_all(
text[1],perl("(?<=\\\"monthyear\\\">).*?20[0-9]{2}")
)

and I could extract the info and n data in the same sort of way, but given that there are multiple info and n entries for each monthyear entry, I'm not sure how to combine them. Am I going about this all wrong?

Solution

Unfortunately, we can't always control the quality of our data sources, so we have to resort to some tedious manual processing. (Some people say that the majority of a data analyst's time is spent in cleaning data, and not in analysis.)

As already noted in the comments, regular expressions aren't the best tool for working with HTML, because HTML, in general, isn't a regular language (it's closer to a context-free one). But if the HTML source is reasonably uniform (as it is in the example data you've provided), you may still be able to use them effectively.
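For completeness, here is a sketch of the parser-based route the comments allude to, assuming the "XML" package is available; the node and attribute names are taken from your sample fragment, and the inline HTML string here is just a stand-in for your real page:

```r
# A sketch of the parser route with the "XML" package: parse the
# fragment, find each <p> containing a span of class "monthyear", and
# pull out the label and the bracketed counts with xmlValue().
library(XML)

html <- '<p><span class="monthyear">Nov 2006</span>
<br><b>Bar text (29)</b>
<br><b>More bar text (4)</b></p>'

doc   <- htmlParse(html, asText = TRUE)
paras <- getNodeSet(doc, "//p[span[@class='monthyear']]")

out <- do.call(rbind, lapply(paras, function(p) {
  my  <- xpathSApply(p, ".//span[@class='monthyear']", xmlValue)
  raw <- xpathSApply(p, ".//b", xmlValue)          # e.g. "Bar text (29)"
  data.frame(monthyear = my,
             info = sub(" *\\(\\d+\\)$", "", raw),
             n    = as.numeric(sub(".*\\((\\d+)\\)$", "\\1", raw)),
             stringsAsFactors = FALSE)
}))
out
```

This sidesteps the tab-insertion gymnastics below entirely, but it does assume the page parses cleanly; if it doesn't, the line-by-line approach that follows is the fallback.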

Here's a step-by-step example. I've added HTML header tags to your example text and stored it here: http://ideone.com/O1PC05

  1. Read in your data using readLines

    x1 <- readLines("http://ideone.com/plain/O1PC05")
    
  2. Isolate the "body" of the web page

    bodycontent <- grep("<body>|</body>", x1)
    x2 <- x1[(bodycontent[1]+1):(bodycontent[2]-1)]
    
  3. grepl returns TRUE or FALSE depending on whether "monthyear" was found in a given line. Use cumsum to create "groups", and split to convert the character vector into a list.

    x3 <- split(x2, cumsum(grepl("monthyear", x2)))
    
  4. You can do the following in multiple steps if you prefer. The basic idea is to lapply over your list, replace all your HTML tags with tabs, and replace your brackets with tabs. After that you can use read.delim, but expect to get a lot of columns that are FULL of NA values since we're inserting a lot more tabs than we need.

    This is the step most likely to fail, for several reasons: (1) it assumes the source data really is well structured; (2) the text itself might contain brackets; (3) there might be other content in the body (script tags, table tags, and so on) that would be read in and processed along with the data.

    x4 <- read.delim(header = FALSE,
                     stringsAsFactors = FALSE,
                     strip.white = TRUE, 
                     sep = "\t", 
                     text = 
                       unlist(lapply(x3, 
                                     function(x) {
                                       temp <- gsub("<(.|\n)*?>", "\t", x)
                                       paste(gsub("[()]", "\t", temp), 
                                             collapse="\t")
                                       })))
    
  5. I mentioned that in step 4, we will end up with a lot of junk columns. Let's get rid of those.

    x5 <- x4[apply(x4, 2, function(x) !all(is.na(x)))]
    
  6. And now let's name the columns in a more meaningful way. We know that the first column will be the "monthyear" variable by design, and the others should alternate between "info" and "n", so we can use some basic rep calls wrapped in paste to build the variable names. While we're at it, we'll use as.yearmon from the "zoo" package to convert the "monthyear" variable to actual dates, letting us sort and do other nifty things that actual dates allow.

    myseq <- ncol(x5[-1])/2 # We expect pairs of columns, right?
    names(x5) <- c("monthyear", 
                   paste(rep(c("info", "n"), myseq), 
                         rep(1:myseq, each = 2), sep = "."))
    library(zoo)
    x5$monthyear <- as.Date(as.yearmon(x5$monthyear, "%b %Y"))
    x5
    #    monthyear           info.1 n.1                       info.2 n.2            info.3 n.3
    # 1 2001-01-01         Foo text   2                               NA                    NA
    # 2 2006-11-01         Bar text  29                More bar text   4 Yet more bar text 102
    # 3 2004-04-01 Further foo text   1 Combination foo and bar text  41                    NA
    
  7. If you really want your data in long form, use reshape:

    x6 <- reshape(x5, 
                  direction = "long", 
                  idvar = "monthyear", 
                  varying = 2:ncol(x5))
    
  8. Do some optional cleanup, like ordering the output by date, resetting your row names, and dropping incomplete cases:

    x6 <- x6[order(x6$monthyear), ]
    rownames(x6) <- NULL
    x6[complete.cases(x6), ]
    #    monthyear time                         info   n
    # 1 2001-01-01    1                     Foo text   2
    # 4 2004-04-01    1             Further foo text   1
    # 5 2004-04-01    2 Combination foo and bar text  41
    # 7 2006-11-01    1                     Bar text  29
    # 8 2006-11-01    2                More bar text   4
    # 9 2006-11-01    3            Yet more bar text 102
    

Anyway, try it out, and modify as needed. My guess is that at some point, you'll have to open up the files in a plain text editor and do some preliminary cleanup there before you can proceed.
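If the grepl/cumsum/split combination in step 3 seems opaque, here is the idea on a toy vector (the data is made up for illustration):

```r
# grepl() flags the lines that start a new record; cumsum() turns those
# flags into running group IDs; split() then partitions the vector.
x <- c("monthyear A", "item 1", "monthyear B", "item 2", "item 3")

flags  <- grepl("monthyear", x)  # TRUE FALSE TRUE FALSE FALSE
groups <- cumsum(flags)          # 1 1 2 2 2
split(x, groups)
# $`1`
# [1] "monthyear A" "item 1"
#
# $`2`
# [1] "monthyear B" "item 2" "item 3"
```

Each group starts at a "monthyear" line and runs until the next one, which is exactly the record structure the HTML implies.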

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow