Question

I have been searching the internet for an answer, but couldn't find the one I am looking for:

Assignment: I need to loop through several files (specified by the user), extract a column from each of the CSV files, and "glue" them together to finally calculate the mean across the specified files.

Problem

for (i in 1:whatever) {
monitors <- read.csv(list[i], header=T)

So, here I read in the file

cols <- monitors[[pollutant]]

Here I have my 'unclean' vector (including NAs) with the values of the columns

result[i] <- c(cols)
}
return(result)
}

And here comes my problem: I initialized result as numeric above, and whenever I try to store the data from cols with either result[i] or result[[i]], I get the following errors, respectively:

for result[i]
number of items to replace is not a multiple of replacement length

for result[[i]]
more elements supplied than there are to replace

Now I realize this has to do with my cols being longer than the slot in result that I am assigning to. The question now is: how can I set this up so that the cols get appended to my result vector?
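For reference, the two messages can be reproduced in isolation. This is a minimal sketch with made-up values standing in for one file's column; it only illustrates that a single slot cannot hold a whole vector, while c() appends instead.

```r
result <- numeric(0)
cols <- c(1.5, NA, 3.0)  # made-up values standing in for one file's column

# result[1] is one slot but cols has three values: R signals a warning
w <- tryCatch({ result[1] <- cols; NULL },
              warning = function(w) conditionMessage(w))

# result[[1]] must receive exactly one element: R signals an error
e <- tryCatch({ result[[1]] <- cols; NULL },
              error = function(e) conditionMessage(e))

# Appending with c() grows the vector instead of overwriting one slot
result <- numeric(0)
result <- c(result, cols)
length(result)
```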


Solution

If you are only extracting and 'gluing' together a single column of values from each file, I would suggest using the concatenation function c() to build a vector, instead of creating a list-type object. Something along the lines of this should work:

fnames <- c("fname1", "fname2", "fname3")  # placeholder file names
excol <- "extractedColumnName"             # placeholder column name

extractedData <- c()  # start empty; R determines the type from what is appended

for (fname in fnames) {
  cur <- read.csv(fname, header = TRUE)
  # append this file's column to everything collected so far
  extractedData <- c(extractedData, cur[[excol]])
}

Depending on how NAs are stored in your data files, an na.strings = "<the string used to indicate NA>" argument may be necessary in the call to read.csv. If there are character values in the columns you want, you may need to run as.numeric() on the vector after everything is read in. There are more efficient, more coding-intensive ways to load the data, but as a simple solution for data files that are not too large, this method should work fine.

PS: To deal with the NAs (assuming you do not want to treat them in any special way), one of these two approaches should work:
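As an illustration of the na.strings point, here is a small sketch. The file contents, the column name sulfate, and the marker word "missing" are all made up for the example; your files may use a different marker.

```r
# Hypothetical file in which missing readings are written as the word "missing"
path <- tempfile(fileext = ".csv")
writeLines(c("Date,sulfate",
             "2004-01-01,3.2",
             "2004-01-02,missing",
             "2004-01-03,4.8"), path)

# Without na.strings the column is not numeric (character, or factor in R < 4.0)
raw <- read.csv(path, header = TRUE)
is.numeric(raw$sulfate)

# Telling read.csv which string marks NA lets it parse the column as numeric
clean <- read.csv(path, header = TRUE, na.strings = "missing")
is.numeric(clean$sulfate)

# Fallback: convert after reading; as.character() guards against factor columns
converted <- suppressWarnings(as.numeric(as.character(raw$sulfate)))
```

Going through as.character() first matters if the column was read as a factor, because as.numeric() on a factor returns the internal level codes rather than the values.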

1):

extractedDataNoNA = extractedData[ ! is.na(extractedData) ]
meanResult = mean(extractedDataNoNA)

The ! is.na(extractedData) creates a logical vector to select elements in the extractedData vector.

2):

meanResult = mean(extractedData, na.rm=TRUE)
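A quick check, using made-up values, that the two approaches agree and that skipping both leaves the mean as NA:

```r
extractedData <- c(2, NA, 4, 6, NA)  # made-up values for illustration

# Approach 1: drop the NAs by logical indexing, then average
m1 <- mean(extractedData[!is.na(extractedData)])

# Approach 2: let mean() skip the NAs
m2 <- mean(extractedData, na.rm = TRUE)

# Without either approach, a single NA makes the whole mean NA
m0 <- mean(extractedData)
```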

Other tips

I am not sure I understand you correctly, but you could use, for example, this code:

df <- lapply(files, read.csv, header=TRUE)
result <- unlist(lapply(df, function(x) x[["column_name"]]))

This way, the first line reads your data and stores it in a list, and the second line extracts the specified column from each data.frame (the unlist is there because lapply returns a list and I assume that you want a numeric vector). Since you wrote that you want to calculate the mean of the result vector, I assume that the pollutant columns have the same type in each data.frame. However, if you need to use a for loop, or you have too much data to store in a list, you can create an empty vector with result <- numeric(0) before the loop and then use result <- c(result, cols) inside the loop.

Since I know where this question comes from, I cannot really give you the answer, but I can point you in the right direction.
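To try both variants end to end, here is a self-contained sketch. The two temporary CSV files, their values, and the column name sulfate are invented for the example.

```r
# Create two small hypothetical CSV files in a temporary directory
dir <- tempfile("monitors")
dir.create(dir)
write.csv(data.frame(sulfate = c(1.5, NA, 2.5)),
          file.path(dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(3.5, 4.5)),
          file.path(dir, "002.csv"), row.names = FALSE)

files <- file.path(dir, c("001.csv", "002.csv"))
pollutant <- "sulfate"

# Variant 1: read everything into a list, then pull out the column
frames <- lapply(files, read.csv, header = TRUE)
result <- unlist(lapply(frames, function(x) x[[pollutant]]))

# Variant 2: a for loop that appends to an initially empty numeric vector
result_loop <- numeric(0)
for (f in files) {
  cols <- read.csv(f, header = TRUE)[[pollutant]]
  result_loop <- c(result_loop, cols)
}

mean(result, na.rm = TRUE)
```

The loop variant keeps only one data.frame in memory at a time, which is the reason to prefer it when the files are large.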

First of all, you should definitely have a look at these packages:

library(plyr)
library(dplyr)
library(data.table)
library(lubridate)
  • You have to construct a character vector of .csv files to be read into R:

You can do that by combining functions:

intersect() 
paste() 
sprintf()
list.files() 
  • Read the content of .csv files and put it in a data.frame

You can do that by combining functions:

ldply()
fread()

You do not need a for loop to accomplish the task.

From there on you should be able to subset columns and calculate the mean. Hope it helps.
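The outline above can be sketched without extra packages. Note the substitution: this version uses base R's lapply() with do.call(rbind, ...) in place of ldply() and fread(), and file.path() in place of paste(). The directory, the zero-padded file names, and the columns ID and sulfate are assumptions made up for the example.

```r
# Create three hypothetical monitor files named 001.csv, 002.csv, 003.csv
dir <- tempfile("specdata")
dir.create(dir)
for (id in 1:3) {
  write.csv(data.frame(ID = id, sulfate = c(id + 0.5, NA)),
            file.path(dir, sprintf("%03d.csv", id)), row.names = FALSE)
}

# Build the character vector of files to read
wanted  <- sprintf("%03d.csv", c(1, 3, 7))        # names the user asked for
present <- list.files(dir, pattern = "\\.csv$")   # names that actually exist
to_read <- intersect(wanted, present)             # 007.csv is silently dropped
paths   <- file.path(dir, to_read)

# Read every file and stack the rows into one data.frame, without a for loop
combined <- do.call(rbind, lapply(paths, read.csv))

# Subset the column and calculate the mean
mean(combined$sulfate, na.rm = TRUE)
```

With plyr and data.table installed, ldply(paths, fread) should play the role of the do.call(rbind, lapply(...)) line.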

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow