Question

I have the following huge data frame:

V1  V2  V3  V4
A   E   R   12
A   R   T   18
A   T   Y   44
A   Y   U   11
B   E   R   22
B   R   T   53
B   T   Y   11
B   Y   U   153 

What I'm trying to do is get the outlier value of V4 for each pair of (V1, V2).

This is easily handled with two for loops over the unique values of V1 and V2: subset on each iteration, take the vector of V4 for that subset, and get the outlier using any function from the outliers package. The problem then is speed.

I have never used lapply; maybe someone can guide me on a way to do this efficiently using lapply instead of the for loops.


Solution

Here's a data.table solution:

For close to 4.4 million rows (4,394,000), with 676 groups and 6,500 records per group, it takes just over 2 seconds (including data generation).

library(outliers)
library(data.table)

# Fake data generation and coercion to data.table
d <- as.data.table(expand.grid(x=LETTERS, y=LETTERS, z=LETTERS))
d <- do.call(rbind, replicate(250, d, FALSE))

# Add random "value" column and a column to keep track of row numbers
d[, c('value', 'row'):=list(rnorm(nrow(d)), seq_len(nrow(d)))]

# > d
#          x y z      value     row
#       1: A A A -1.1712284       1
#       2: B A A  0.1818000       2
#       3: C A A -1.3959594       3
#       4: D A A -0.4778956       4
#       5: E A A -2.0426768       5
#      ---                         
# 4393996: V Z Z  0.4024398 4393996
# 4393997: W Z Z  0.9891237 4393997
# 4393998: X Z Z  1.2066572 4393998
# 4393999: Y Z Z  2.3023321 4393999
# 4394000: Z Z Z -0.8343059 4394000

# For each group (combination of x and y), perform the outlier test
outliers <- d[, chisq.out.test(value), list(x, y)]

# Add the row numbers for min and max numbers of each group
outliers <- merge(outliers, 
                  d[, list(min.ind=row[which.min(value)], 
                           max.ind=row[which.max(value)]), list(x, y)], 
                  by=c('x', 'y'))

# Create a new outlier column. If p.value >= 0.05, set outlier = NA;
# otherwise (p.value < 0.05), if the "alternative" column contains "lowest",
# set outlier = min.ind, else max.ind.
outliers[, outlier:=ifelse(p.value < 0.05, 
                  ifelse(grepl('lowest', alternative), min.ind, max.ind), 
                  NA)]

Output looks like the following:

# > outliers
#      x y statistic                                  alternative      p.value                       method
#   1: A A  13.69290 highest value 3.70310786094858 is an outlier 2.152665e-04 chi-squared test for outlier
#   2: A B  11.99842 lowest value -3.47397308041372 is an outlier 5.324581e-04 chi-squared test for outlier
#   3: A C  12.41749 highest value 3.49833131757565 is an outlier 4.253310e-04 chi-squared test for outlier
#   4: A D  16.18416 lowest value -4.00696031141966 is an outlier 5.747273e-05 chi-squared test for outlier
#   5: A E  12.32196 lowest value -3.56650649267448 is an outlier 4.476613e-04 chi-squared test for outlier
#  ---                                                                                                     
# 672: Z V  11.66230 lowest value -3.43256736243089 is an outlier 6.377944e-04 chi-squared test for outlier
# 673: Z W  14.11816 highest value 3.75476979294983 is an outlier 1.716780e-04 chi-squared test for outlier
# 674: Z X  15.63605 highest value 3.93390421620766 is an outlier 7.677674e-05 chi-squared test for outlier
# 675: Z Y  17.05664 lowest value -4.12928000349912 is an outlier 3.628127e-05 chi-squared test for outlier
# 676: Z Z  14.44709 lowest value -3.82794835873449 is an outlier 1.441520e-04 chi-squared test for outlier
#      data.name min.ind max.ind outlier
#   1:     value 3609165 1191113 1191113
#   2:     value  105483 3476019  105483
#   3:     value 4153397 1375713 1375713
#   4:     value 3406443 2539135 3406443
#   5:     value   25117 2004445   25117
#  ---                                  
# 672:     value 1871740 2551796 1871740
# 673:     value 1003782 2158390 2158390
# 674:     value 1555424 1492556 1492556
# 675:     value 2071914 1344538 2071914
# 676:     value 2281500  426556 2281500
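The outlier column holds row numbers, so the flagged rows can be looked up in the original table. Below is a minimal sketch of that lookup on a small stand-in table. The names small, res and extreme.row are hypothetical, and the sketch folds the row lookup into the grouped call instead of using the merge above. It assumes that outlier() from the outliers package returns the same extreme value that chisq.out.test() tests (the value farthest from the group mean).

```r
library(outliers)
library(data.table)

set.seed(42)
# Hypothetical small stand-in for d: 4 groups (x, y) of 50 values each
small <- as.data.table(expand.grid(x = c("A", "B"), y = c("A", "B"), i = 1:50))
small[, c("value", "row") := list(rnorm(.N), seq_len(.N))]

# Per group: p-value of the test and the row number of the tested extreme value
res <- small[, {
  test <- chisq.out.test(value)
  list(p.value = unname(test$p.value),
       extreme.row = row[which(value == outlier(value))[1]])
}, by = list(x, y)]

# Keep only groups where the test is significant, then look up those rows
flagged <- small[row %in% res[p.value < 0.05, extreme.row]]
```

With random normal data, flagged may well be empty; on real data it would contain one row per group whose extreme value passes the 0.05 threshold.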

A bit fiddly, perhaps, but hey, it got us there in the end.
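Since the question asks about lapply: the same per-group pattern can also be written in base R with split() and lapply(). This is a minimal sketch, not benchmarked against the data.table version above. It uses the question's column names (V1, V2, V4) with made-up data, because the sample in the question has only one row per (V1, V2) pair, which is too few to test. The names df, groups, res and out are illustrative.

```r
library(outliers)

set.seed(1)
# Made-up data using the question's column names: 4 groups of 10 values each
df <- data.frame(
  V1 = rep(c("A", "B"), each = 20),
  V2 = rep(c("E", "R"), times = 20),
  V4 = rnorm(40)
)

# One numeric vector of V4 per (V1, V2) pair; drop = TRUE skips empty pairs
groups <- split(df$V4, list(df$V1, df$V2), drop = TRUE)

# Test each group; keep the tested extreme value and its p-value
res <- lapply(groups, function(v) {
  test <- chisq.out.test(v)
  data.frame(candidate = outlier(v), p.value = unname(test$p.value))
})

# Stack the per-group results into one data frame (one row per group)
out <- do.call(rbind, res)
```

Filtering out on p.value < 0.05 then gives the groups whose extreme value is flagged. For millions of rows the data.table grouping is likely to be faster, since split() copies the data into a list first.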

Licensed under: CC-BY-SA with attribution