Question

I have used R for various things over the past year but due to the number of packages and functions available, I am still sadly a beginner. I believe R would allow me to do what I want to do with minimal code, but I am struggling.

What I want to do:

I have roughly a hundred different excel files containing data on students. Each excel file represents a different school but contains the same variables. I need to:

  1. Import the data into R from Excel
  2. Add a variable to each file containing the filename
  3. Merge all of the data (add observations/rows - do not need to match on variables)

I will need to do this for multiple sets of data, so I am trying to make this as simple and easy to replicate as possible.

What the Data Look Like:

Row 1    Title
Row 2    StudentID    Var1    Var2            Var3      Var4        Var5
Row 3    11234           1         9/8/2011      343      159-167      32
Row 4    11235           2         9/16/2011    112      152-160      12
Row 5    11236           1         9/8/2011      325      164-171      44

Row 1 is meaningless and Row 2 contains the variable names. The files have different numbers of rows.

What I have so far:

At first I simply tried to import data from excel. Using the XLSX package, this works nicely:

dat <- read.xlsx2("FILENAME.xlsx", sheetIndex=1, 
                  sheetName=NULL, startRow=2, 
                  endRow=NULL, as.data.frame=TRUE, 
                  header=TRUE)

Next, I focused on figuring out how to merge the files (also thought this is where I should add the filename variable to the datafiles). This is where I got stuck.

setwd("FILE_PATH_TO_EXCEL_DIRECTORY")
filenames <- list.files(pattern=".xls")
do.call("rbind", lapply(filenames, read.xlsx2, sheetIndex=1, colIndex=6, header=TRUE, startrow=2, FILENAMEVAR=filenames));

I set my directory, make a list of all the excel file names in the folder, and then try to merge them in one statement using the a variable for the filenames.

When I do this I get the following error:

Error in data.frame(res, ...) : arguments imply differing number of rows: 616, 1, 5

I know there is a problem with my application of lapply - the startrow is not being recognized as an option and the FILENAMEVAR is trying to merge the list of 5 sample filenames as opposed to adding a column containing the filename.

What next?

If anyone can refer me to a useful resource or function, critique what I have so far, or point me in a new direction, it would be GREATLY appreciated!

Was it helpful?

Solution

I'll post my comment (with bdemerast picking up on the typo). The solution was untested as xlsx will not run happily on my machine

You need to pass a single FILENAMEVAR to read.xlsx2.

lapply(filenames, function(x) read.xlsx2(file=x, sheetIndex=1, colIndex=6, header=TRUE, startRow=2, FILENAMEVAR=x))
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top