Question

I have a CSV file that is unusually laid out: the data is not a contiguous block at the top. The file can be characterized as follows:

Comment Strings
Empty row
Comment String

[Desired Data with 10 columns and an undetermined number of rows]

Empty Row

Comment String 

[Desired Data with 10 columns and an undetermined number of rows]

Empty Row

Comment String 

[Desired Data with 10 columns and an undetermined number of rows]

.... and so on and so forth.

As stated, each block of data has an undetermined number of rows.

What would be the best way to pull this data into R? read.table/read.csv alone can only do so much. My current attempt:

 read.table("C:\\Users\\Riemmman\\Desktop\\Historical Data\\datafile.csv",header=F,sep=",",skip=15,blank.lines.skip=T)

Solution

You might be able to use a combination of readLines and grep/grepl to help you figure out which lines to read.

Here's an example. The first part is just to make up some sample data.

Create some sample data.

x <- tempfile(pattern="myFile", fileext=".csv")

cat("junk comment strings",
    "",
    "another junk comment string",
    "This,Is,My,Data",
    "1,2,3,4",
    "5,6,7,8",
    "",
    "back to comments",
    "This,Is,My,Data",
    "12,13,14,15",
    "15,16,17,18",
    "19,20,21,22", file = x, sep = "\n")

Step 1: Use readLines() to get the data into R

In this step, we'll also drop the lines that we are not interested in. The logic is that we are only interested in lines where there is information in the form of (for a four-column dataset):

something comma something comma something comma something


## Read the data into R
## Replace x with the actual path to your file
A <- readLines(con = x)

## Find and extract the lines where there are "data".
## My example dataset only has 4 columns.
## Modify for your actual dataset.
A <- A[grepl(paste(rep(".*", 4), collapse=","), A)]
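As a sanity check, the pattern built above expands to ".*,.*,.*,.*", i.e. it matches any line containing at least three commas (a minimal sketch with hypothetical input lines):

```r
## The pattern matches any line with at least three commas
pat <- paste(rep(".*", 4), collapse = ",")  # ".*,.*,.*,.*"
grepl(pat, "1,2,3,4")           # TRUE:  a data line
grepl(pat, "back to comments")  # FALSE: no commas
grepl(pat, "1,2,3")             # FALSE: only two commas
```

One caveat: a comment line that happens to contain three or more commas would also survive this filter.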

Step 2: Identify the data ranges

## Identify the header rows. -1 for use with read.csv
HeaderRows <- grep("^This,Is", A)-1

## Identify the number of rows per data group
N <- c(diff(HeaderRows)-1, length(A)-1)
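To make the bookkeeping concrete, here is what those two vectors come out to for the filtered sample data (A is written out by hand for this sketch):

```r
## The filtered sample data: 7 lines, with headers at positions 1 and 4
A <- c("This,Is,My,Data", "1,2,3,4", "5,6,7,8",
       "This,Is,My,Data", "12,13,14,15", "15,16,17,18", "19,20,21,22")
HeaderRows <- grep("^This,Is", A) - 1        # c(0, 3)
N <- c(diff(HeaderRows) - 1, length(A) - 1)  # c(2, 6)
```

The final value of N (6) overshoots the three rows actually left in the last block, but that is harmless: read.csv simply stops at the end of the input.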

Step 3: Read the data in

Use the data range information to specify how many lines to skip before reading, and how many lines to read.

myData <- lapply(seq_along(HeaderRows), 
       function(x) read.csv(text = A, header = TRUE, 
                            nrows = N[x], skip = HeaderRows[x]))
myData
# [[1]]
#   This Is My Data
# 1    1  2  3    4
# 2    5  6  7    8
# 
# [[2]]
#   This Is My Data
# 1   12 13 14   15
# 2   15 16 17   18
# 3   19 20 21   22

If you want all of these in one data.frame instead of a list, use:

final <- do.call(rbind, myData)
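This works because every block shares the same four column names; rbind would fail if the blocks differed. A minimal sketch, with hand-built data.frames standing in for the list produced above:

```r
## Two small data.frames standing in for the list from lapply
myData <- list(
  data.frame(This = c(1, 5), Is = c(2, 6), My = c(3, 7), Data = c(4, 8)),
  data.frame(This = 12, Is = 13, My = 14, Data = 15)
)
final <- do.call(rbind, myData)  # stacks them into one 3-row data.frame
```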

OTHER TIPS

I just recently faced a problem like this. My solution was to use awk to separate out the different types of rows, load them into different tables in a DBMS, and use SQL to create a flat file for loading back into R.

Or maybe you can awk out only your desired data and load that, if you don't care about the comment strings.

Using the data generated by @Ananda Mahto,

file = x # change to the actual file name
alldata = readLines(file) # read the whole file into a character vector
# count the fields in each line (separated by commas)
nfields = count.fields(file=textConnection(alldata), sep=",", blank.lines.skip=FALSE) 
# assume the data has the 'mode' of the field counts (or change to the actual number of columns)
dataFields = as.numeric(names(table(nfields))[which.max(table(nfields))]) 

alldata = alldata[nfields == dataFields] # keep the data lines only
header = alldata[1] # the header
alldata = c(header, alldata[alldata!=header]) # remove the repeated headers
datos = read.csv(text=alldata) # read the data

  This Is My Data
1    1  2  3    4
2    5  6  7    8
3   12 13 14   15
4   15 16 17   18
5   19 20 21   22
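The "most common field count" heuristic used above can be checked in isolation (hypothetical counts standing in for what count.fields would return on a real file):

```r
## Hypothetical field counts: comment lines yield few fields,
## data lines yield 4; the mode picks out the data lines
nfields <- c(1, 0, 1, 4, 4, 4, 0, 1, 4, 4, 4, 4)
tab <- table(nfields)
dataFields <- as.numeric(names(tab)[which.max(tab)])  # 4
```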
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow