Use filename to average data by month

https://stackoverflow.com/questions/10614033

09-06-2021

Question

I have an R question about reading in and processing many files. Each file contains ozone measurements made at a particular time and station. The data are in table format, and I can read them in using:

files <- list.files()
data  <- lapply(files, read.table, skip=19)

This gives me a list of data frames, one per file, which I would now like to process. For example, the files are named:

> head(files)
 [1] "fiji_19980105.dat" "fiji_19980112.dat" "fiji_19980119.dat"
 [4] "fiji_19980130.dat" "fiji_19980206.dat" "fiji_19980213.dat"

Here "fiji" is the name of the station and the date is in YYYYMMDD format. I would like to average the data to get monthly averages for this station (I only need to work on one station at a time, so really I just want to average the contents of data to produce 12 sets of average data).

I imagine I can do this using some ?apply function, but I'm really lost on how to do this. Any suggestions on a solution are really appreciated!

As an example, here is the result of adding the dates to the data frame:

> head(dat)
    V1     V2    V3   V4 V5   V6    V7   V8   V9  V10       Date
1 9000 1007.7 0.006 29.6 74 0.59 0.006 9000 9000 9000 1998-01-05
2 9000 1005.2 0.028 29.3 75 0.62 0.006 9000 9000 9000 1998-01-05
3 9000 1001.6 0.060 28.5 78 0.63 0.006 9000 9000 9000 1998-01-05

 > str(dat)
'data.frame':   153994 obs. of  11 variables:
 $ V1  : int  9000 9000 9000 9000 9000 9000 9000 9000 9000 9000 ...
 $ V2  : num  1008 1005 1002 997 993 ...
 $ V3  : num  0.006 0.028 0.06 0.104 0.14 0.169 0.198 0.238 0.271 0.301 ...
 $ V4  : num  29.6 29.3 28.5 27.9 27.6 27.2 27 26.6 26.2 26 ...
 $ V5  : int  74 75 78 79 80 81 82 84 85 85 ...
 $ V6  : num  0.59 0.62 0.63 0.68 0.69 0.7 0.72 0.74 0.75 0.76 ...
 $ V7  : num  0.006 0.006 0.006 0.007 0.007 0.007 0.007 0.008 0.008 0.008 ...
 $ V8  : num  9000 9000 9000 9000 9000 9000 9000 9000 9000 9000 ...
 $ V9  : num  9000 9000 9000 9000 9000 9000 9000 9000 9000 9000 ...
 $ V10 : num  9000 9000 9000 9000 9000 9000 9000 9000 9000 9000 ...
 $ Date: Date, format: "1998-01-05" "1998-01-05" ...

Solution

From your vector of file names, get the dates:

datetimes = as.Date(files, "fiji_%Y%m%d")

See ?strptime for details about the format templates. Essentially, you can include other characters in the format as literal filler, and any trailing characters in the input that the format does not cover are ignored.
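As a small illustration of that parsing step, here is a sketch using three of the file names from the question. The second half is an assumption on my part, not something the question asks for: it strips the station prefix with a regular expression so the same code could handle other stations, provided every name ends in an underscore, eight digits, and ".dat".

```r
# File names copied from the question
files <- c("fiji_19980105.dat", "fiji_19980112.dat", "fiji_19980206.dat")

# "fiji_" is matched literally; the trailing ".dat" is ignored
datetimes <- as.Date(files, format = "fiji_%Y%m%d")

# Station-independent variant (assumes names look like <station>_<YYYYMMDD>.dat)
stamp  <- sub(".*_([0-9]{8})\\.dat$", "\\1", files)
dates2 <- as.Date(stamp, format = "%Y%m%d")
```

Both approaches should give the same Date vector for these names; any name that does not fit the pattern will come back as NA (or fail to parse), so it is worth checking with anyNA() before going further.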

The rest depends on what is in each data.frame, so please tell us more about the data they contain.

It would be best to create one large data.frame with these date stamps added to each row, and then go from there.

To get that, something like this would work (imagine the list is called 'dat' rather than 'data'):

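dat = lapply(files, read.table, skip=19)

# stamp every row of each data frame with the date taken from its file name
for (i in seq_along(files)) {
    dat[[i]]$Date = rep(datetimes[i], nrow(dat[[i]]))
}

# stack the per-file data frames into one
dat = do.call("rbind", dat)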

Then you can use format(dat$Date, "%m") to get the month of each row, and use tapply over that grouping with a summary function (e.g. mean). There are also less classical plyr versions of this, which will no doubt come up soon. :)
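A minimal sketch of that grouping step, using a small made-up data frame in place of the real 'dat' (the values and the choice of columns V2 and V4 are only for illustration). It shows tapply on a single column and, as one possible alternative, base R's aggregate to average several columns in one call.

```r
# Toy stand-in for the combined data frame 'dat' (values are made up)
dat <- data.frame(
  V2   = c(1007.7, 1005.2, 1001.6, 998.0),
  V4   = c(29.6, 29.3, 28.5, 27.9),
  Date = as.Date(c("1998-01-05", "1998-01-12", "1998-02-06", "1998-02-13"))
)

# Month label ("01" to "12") for every row
dat$Month <- format(dat$Date, "%m")

# tapply: monthly mean of a single column
v4_by_month <- tapply(dat$V4, dat$Month, mean)

# aggregate: monthly mean of several measurement columns at once
monthly <- aggregate(dat[c("V2", "V4")], by = list(Month = dat$Month), FUN = mean)
```

Note that "%m" pools the same calendar month across years, which matches the "12 sets of averages" in the question. If each year's months should stay separate, "%Y-%m" would be the format to group on instead.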

Calling read.table through lapply is probably not a good idea, so I would change that as well: an explicit loop lets you put in basic checks on each read and on the merging of the data.frames.
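One way such a checked loop might look is sketched below. The helper names read_one and read_all are hypothetical (they are not from the question), and the particular checks chosen here (file exists, read succeeds, column counts agree) are just examples of what "basic checks" could mean.

```r
# Hypothetical helper: read one file, returning NULL (with a warning) on failure
read_one <- function(path, skip = 19) {
  if (!file.exists(path)) {
    warning("missing file: ", path)
    return(NULL)
  }
  tryCatch(
    read.table(path, skip = skip),
    error = function(e) {
      warning("could not read ", path, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Hypothetical helper: read every file, stamp it with its date, then merge
read_all <- function(files, dates, skip = 19) {
  pieces <- vector("list", length(files))
  for (i in seq_along(files)) {
    d <- read_one(files[i], skip = skip)
    if (is.null(d) || nrow(d) == 0) next   # skip files that failed or are empty
    d$Date <- dates[i]
    pieces[[i]] <- d
  }
  pieces <- Filter(Negate(is.null), pieces)
  # rbind needs matching columns, so fail early with a clear message
  ncols <- vapply(pieces, ncol, integer(1))
  if (length(unique(ncols)) > 1) stop("files have differing column counts")
  do.call(rbind, pieces)
}
```

With the objects from above this would be called as dat <- read_all(files, datetimes). Files that cannot be read are skipped with a warning rather than stopping the whole run, which may or may not be what you want for your data.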

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow