Building a mean across several csv files

https://stackoverflow.com/questions/23630609

21-07-2023

Question

I have an assignment on Coursera and I am stuck - I do not necessarily need or want a complete answer (as this would be cheating) but a hint in the right direction would be highly appreciated.

I have over 300 CSV files in a folder (named 001.csv, 002.csv and so on). Each contains a data frame with a header. I am writing a function that will take three arguments: the location of the files, the name of the column (inside the data frames) whose mean you want to calculate, and the files you want to use in the calculation (id).

I have tried to keep it as simple as possible:

pm <- function(directory, pollutant, id = 1:332) {

setwd("C:/Users/cw/Documents")
setwd(directory)

files <<- list.files()

First of all, set the working directory and get a list of all files.

x <- id[1] 
x

Get the starting point of the user-specified id.

Problem

for (i in x:length(id)) {
df <- rep(NA, length(id))
df[i] <- lapply(files[i], read.csv, header=T)
result <- do.call(rbind, df)
return(df)
}
}

So this is where I am hitting a wall: I would need to take the user-specified input from above (e.g. 10:25) and put the content from files "010.csv" through "025.csv" into a dataframe to actually come up with the mean of one specific column.

So my idea was to run a for-loop along the length of id (e.g. 16 for 10:25) starting with the starting point of the specified id. Within this loop I would then need to take the appropriate values of files as the input for read.csv and put the content of the .csv files in a dataframe.

I can get single .csv files and put them into a dataframe, but not several.

Does anybody have a hint how I could proceed?

Solution

Your example uses id = 10:25, i.e. 16 files: 010.csv, 011.csv, 012.csv, and so on. Assuming that your naming convention follows the order of the files in the directory, you could try:

csvFiles <- list.files(pattern="\\.csv$")[10:25] # [10:25] is only an example; in production use your function parameter here
df_list <- lapply(X=csvFiles, read.csv, header=TRUE) # read each file into its own data frame
names(df_list) <- csvFiles # OPTIONAL: prefixes the row names of the combined data frame with the csv file name
df <- do.call("rbind", df_list) # stack all data frames into one
mean(df[ ,"columnName"])

You should be able to adapt these code snippets and incorporate them into your routine.
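One way the pieces might fit together is sketched below. This is only an illustration under the same assumption as above (the position of a file in the sorted directory listing matches its id); the function name column_mean, the column name value, and the temporary demo files are made up for the example and are not part of the original answer.

```r
# Hypothetical wrapper around the steps above; all names are illustrative.
column_mean <- function(directory, column, id) {
  # List the CSV files with their full paths, then pick the requested positions
  csv_files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)[id]
  # Read each selected file into its own data frame
  df_list <- lapply(csv_files, read.csv, header = TRUE)
  # Stack the data frames and average the requested column
  df <- do.call(rbind, df_list)
  mean(df[[column]])
}

# Self-contained demonstration with made-up data in a temporary folder
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
write.csv(data.frame(value = c(1, 2)), file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(value = c(3, 4)), file.path(demo_dir, "002.csv"), row.names = FALSE)
write.csv(data.frame(value = c(5, 6)), file.path(demo_dir, "003.csv"), row.names = FALSE)

demo_mean <- column_mean(demo_dir, "value", id = 2:3)
print(demo_mean)  # 4.5, the mean of 3, 4, 5, 6
```

Passing the directory to list.files() with full.names = TRUE means the function does not need setwd() at all. If the real data contains missing values, mean() would need na.rm = TRUE; whether that applies here is not stated in the question.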

Other tips

You can aggregate your csv files into one big table like this:

bigtable<-NULL    # start empty; rbind(NULL, x) simply returns x
for(i in 100:250)
{
    infile<-paste("C:/Users/cw/Documents/",i,".csv",sep="")
    newtable<-read.csv(infile)
    newtable<-cbind(newtable,rep(i,dim(newtable)[1]))    # if you want to be able to identify tables after they are aggregated
    bigtable<-rbind(bigtable,newtable)
}

(You will have to replace 100:250 with the user-specified input.)

Then, calculating what you want shouldn't be very hard.

That won't work for files 001 to 099: because of the leading zeros, those file names have to be built differently from the others, but that is fixable with a small change.
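A minimal sketch of one such fix, assuming the file names are always zero-padded to three digits as in the question: base R's sprintf() with the "%03d" format pads the number with leading zeros. The variable names here are illustrative only.

```r
# Example ids covering one, two and three digits
ids <- c(1L, 10L, 100L)

# "%03d" pads each integer with leading zeros to a width of three
file_names <- sprintf("%03d.csv", ids)
print(file_names)  # "001.csv" "010.csv" "100.csv"

# file.path() joins the directory and the file name with a separator
infiles <- file.path("C:/Users/cw/Documents", file_names)
print(infiles)
```

Because sprintf() is vectorised, the same call works for a single i inside the loop and for a whole vector of ids at once.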

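Why do you have lapply inside a for loop? Just do lapply(files[files %in% sprintf("%03d.csv", id)], read.csv, header=TRUE). (The names have to be zero-padded to match "010.csv"; a plain paste0(id, ".csv") would produce "10.csv" and match nothing for ids below 100.)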

They should also teach you to never use <<-.
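To illustrate the point about <<-, here is a small sketch in which the file list stays local to a function and is handed back as its return value, so nothing is written into the global environment. The function name list_csv and the demo files are hypothetical.

```r
# Hypothetical helper: returns the CSV paths instead of assigning them globally
list_csv <- function(directory) {
  # 'files' exists only inside this function call
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  files
}

# Demonstration in a temporary folder with made-up, empty files
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
invisible(file.create(file.path(demo_dir, c("001.csv", "002.csv", "notes.txt"))))

# The caller decides where the result is stored
csv_paths <- list_csv(demo_dir)
print(basename(csv_paths))  # "001.csv" "002.csv"
```

The caller gets the same information that files <<- list.files() would have provided, but the function has no side effects on variables outside it.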

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow