Question

This is my first post, so hopefully I explain what I need to do properly. I am still quite new to R and I may have read posts that answer this, but I just can't for the life of me understand what they mean. So apologies in advance if this has already been answered.

I have a very large data set of GPS locations from radiocollars and there are inconsistent numbers of locations for each day. I want to go through the dataset and select a single data point for each day based on the accuracy level of the GPS signal.

So it essentially looks like this.

Accuracy    Month    Day    Easting    Northing    Etc
   5          6       1     #######    ########     #
   3.2        6       1     #######    ########     #
   3.8        6       1     #######    ########     #
   1.6        6       2     #######    ########     #
   4          6       3     #######    ########     #
   3.2        6       3     #######    ########     #

And I want to pull out the most accurate point for each day (the lowest accuracy measure) while keeping the rest of the associated data.

Currently I have been using the tapply function

datasub1 <- subset(data, MONTH == 6)
tapply(datasub1$accuracy, datasub1$day, min)

Using this method I can successfully retrieve the minimum values, one for each day. However, I cannot bring along the associated coordinates, timing, and all the other important information, and as the data set is nearly 300,000 rows, I really can't do it by hand.

So essentially, I need to get the same results as the tapply, but instead of the individual values, I need the entire row in which each of those values is found.

Thanks in advance to anyone that could lend a hand. If you need any more information, let me know, I'll try my best to get it to you.


Solution

You can use ddply: it cuts a data.frame into pieces (here, one per Month/Day combination) and applies a function to each piece.

# Sample data
n <- 100
d <- data.frame(
  Accuracy = round(runif(n, 0, 5), 1),
  Month    = sample(1:2, n, replace=TRUE),
  Day      = sample(1:5, n, replace=TRUE),
  Easting  = rnorm(n),
  Northing = rnorm(n),
  Etc      = rnorm(n)
)

# Extract the row with the lowest Accuracy for each Month/Day
# (in case of ties, only the first such row is kept)
library(plyr)
ddply( 
  d, 
  c("Month", "Day"), 
  function (u) u[ which.min(u$Accuracy), ] 
)
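Applied to your own data, a minimal sketch would look something like the following, assuming the columns are really called accuracy and day as in your tapply call, and that datasub1 is already restricted to a single month:

library(plyr)
# One row per day: the row whose accuracy is the group's minimum
best <- ddply(
  datasub1,
  "day",
  function(u) u[ which.min(u$accuracy), ]
)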

Other tips

Here is a base R solution using the split-apply paradigm that the plyr functions were originally built around:

lapply(
  split(dat, list(dat$Month, dat$Day)),
  function(d) d[ which.min(d$Accuracy), ]
)
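This returns a list of one-row data frames (plus zero-row pieces for Month/Day combinations that never occur). To get a single data frame back, you can bind the pieces together, for example:

# Split by Month/Day, pick the most accurate row in each piece,
# then stack the pieces back into one data frame
res <- lapply(
  split(dat, list(dat$Month, dat$Day)),
  function(d) d[ which.min(d$Accuracy), ]
)
do.call(rbind, res)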

So you don't really want to aggregate at all. All you need to do is find the minimum for each day and then select the matching rows:

mins <- ave(datasub1$accuracy, datasub1$day, FUN = min)
datasub1[ datasub1$accuracy == mins, ]

If you need to group by day within month or year or whatever, just pass the extra grouping variables as additional arguments to ave. Here's an alternate syntax:

mins <- with( datasub1, ave(accuracy, day, month, FUN = min) )
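Putting the two steps together, a minimal sketch on the question's data (column names accuracy, day and month assumed from the snippets above) would be:

# Group minimum of accuracy per day within each month
mins <- with(datasub1, ave(accuracy, day, month, FUN = min))

# Keep every row whose accuracy equals its group's minimum
# (unlike which.min, this keeps all rows in the event of ties)
best <- datasub1[ datasub1$accuracy == mins, ]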
Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow