Question

I have a CSV file that is unusually laid out: the data is not a contiguous block at the top. The file can be characterized as follows:

Comment Strings
Empty row
Comment String

[Desired Data with 10 columns and an undetermined number of rows]

Empty Row

Comment String 

[Desired Data with 10 columns and an undetermined number of rows]

Empty Row

Comment String 

[Desired Data with 10 columns and an undetermined number of rows]

.... and so on and so forth.

As stated, each block of data has an undetermined number of rows.

What would be the best way to pull this data into R? read.table/read.csv alone can only do so much. My current attempt:

 read.table("C:\\Users\\Riemmman\\Desktop\\Historical Data\\datafile.csv",header=F,sep=",",skip=15,blank.lines.skip=T)

Solution

You might be able to use a combination of readLines and grep/grepl to help you figure out which lines to read.

Here's an example. The first part is just to make up some sample data.

Create some sample data.

x <- tempfile(pattern="myFile", fileext=".csv")

cat("junk comment strings",
    "",
    "another junk comment string",
    "This,Is,My,Data",
    "1,2,3,4",
    "5,6,7,8",
    "",
    "back to comments",
    "This,Is,My,Data",
    "12,13,14,15",
    "15,16,17,18",
    "19,20,21,22", file = x, sep = "\n")

Step 1: Use readLines() to get the data into R

In this step, we'll also drop the lines that we are not interested in. The logic is that we are only interested in lines where there is information in the form of (for a four-column dataset):

something comma something comma something comma something


## Read the data into R
## Replace x with the actual path to your file
A <- readLines(con = x)

## Find and extract the lines where there are "data".
## My example dataset only has 4 columns.
## Modify for your actual dataset.
A <- A[grepl(paste(rep(".*", 4), collapse=","), A)]
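As a sanity check, the pattern built above expands to ".*,.*,.*,.*", i.e. it matches any line containing at least three commas (a minimal sketch with hypothetical input lines):

```r
## The pattern matches any line with at least three commas
pat <- paste(rep(".*", 4), collapse = ",")  # ".*,.*,.*,.*"
grepl(pat, "1,2,3,4")           # TRUE:  a data line
grepl(pat, "back to comments")  # FALSE: no commas
grepl(pat, "1,2,3")             # FALSE: only two commas
```

One caveat: a comment line that happens to contain three or more commas would also survive this filter.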

Step 2: Identify the data ranges

## Identify the header rows. -1 for use with read.csv
HeaderRows <- grep("^This,Is", A)-1

## Identify the number of rows per data group
N <- c(diff(HeaderRows)-1, length(A)-1)
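To make the bookkeeping concrete, here is what those two vectors come out to for the filtered sample data (A is written out by hand for this sketch):

```r
## The filtered sample data: 7 lines, with headers at positions 1 and 4
A <- c("This,Is,My,Data", "1,2,3,4", "5,6,7,8",
       "This,Is,My,Data", "12,13,14,15", "15,16,17,18", "19,20,21,22")
HeaderRows <- grep("^This,Is", A) - 1        # c(0, 3)
N <- c(diff(HeaderRows) - 1, length(A) - 1)  # c(2, 6)
```

The final value of N (6) overshoots the three rows actually left in the last block, but that is harmless: read.csv simply stops at the end of the input.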

Step 3: Read the data in

Use the data range information to specify how many lines to skip before reading, and how many lines to read.

myData <- lapply(seq_along(HeaderRows), 
       function(x) read.csv(text = A, header = TRUE, 
                            nrows = N[x], skip = HeaderRows[x]))
myData
# [[1]]
#   This Is My Data
# 1    1  2  3    4
# 2    5  6  7    8
# 
# [[2]]
#   This Is My Data
# 1   12 13 14   15
# 2   15 16 17   18
# 3   19 20 21   22

If you want all of these in one data.frame instead of a list, use:

final <- do.call(rbind, myData)
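This works because every block shares the same four column names; rbind would fail if the blocks differed. A minimal sketch, with hand-built data.frames standing in for the list produced above:

```r
## Two small data.frames standing in for the list from lapply
myData <- list(
  data.frame(This = c(1, 5), Is = c(2, 6), My = c(3, 7), Data = c(4, 8)),
  data.frame(This = 12, Is = 13, My = 14, Data = 15)
)
final <- do.call(rbind, myData)  # stacks them into one 3-row data.frame
```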

OTHER TIPS

I just recently faced a problem like this. My solution was to use awk to separate out the different types of rows, load them into different tables in a DBMS, and use SQL to create a flat file for loading back into R.

Or maybe you can awk out only your desired data and load that, if you don't care about the comment strings.

Using the data generated by @Ananda Mahto,

file = x # change to the actual file name
alldata = readLines(file) # read the whole file into a character vector
# count the fields in each line (separated by commas)
nfields = count.fields(file=textConnection(alldata), sep=",", blank.lines.skip=FALSE) 
# assume the data has the 'mode' of the field counts (or change to the actual number of columns)
dataFields = as.numeric(names(table(nfields))[which.max(table(nfields))]) 

alldata = alldata[nfields == dataFields] # keep the data lines only
header = alldata[1] # the header
alldata = c(header, alldata[alldata!=header]) # remove the repeated headers
datos = read.csv(text=alldata) # read the data

  This Is My Data
1    1  2  3    4
2    5  6  7    8
3   12 13 14   15
4   15 16 17   18
5   19 20 21   22
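The "most common field count" heuristic used above can be checked in isolation (hypothetical counts standing in for what count.fields would return on a real file):

```r
## Hypothetical field counts: comment lines yield few fields,
## data lines yield 4; the mode picks out the data lines
nfields <- c(1, 0, 1, 4, 4, 4, 0, 1, 4, 4, 4, 4)
tab <- table(nfields)
dataFields <- as.numeric(names(tab)[which.max(tab)])  # 4
```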
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow