Question

I have a set of strings in an R variable; when I check its class, it says it is a factor. For example:

mySet<-c("abc","abc","def","abc","def","efg","abc")

I want to get the string that occurs the most times in this set (i.e. "abc" in this case).

I understand one approach is to use hist(), but I am running into data type issues, and since I'm new to R I wasn't able to crack this one by myself.


Solution

Depending on the size of your data and how often you need to do this, you might want to spend some time writing a more efficient function. table() is built on tabulate(), which is much faster when called directly, and that leads to a function like the following:

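MaxTable <- function(InVec, mult = FALSE) {
  # tabulate() counts integer codes, so convert non-factors first
  if (!is.factor(InVec)) InVec <- factor(InVec)
  A <- tabulate(InVec)  # counts per level, in level order
  if (isTRUE(mult)) {
    levels(InVec)[A == max(A)]    # every level tied for the maximum
  } else {
    levels(InVec)[which.max(A)]   # first level with the maximum count
  }
}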
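To make the relationship between table() and tabulate() concrete, here is a minimal sketch using the sample vector from the question. The variable names are only illustrative, and nbins is passed explicitly as an assumption of this sketch, so that unused trailing levels would still be counted as zero.

```r
# Sample data taken from the question, stored as a factor
x <- factor(c("abc", "abc", "def", "abc", "def", "efg", "abc"))

counts_table <- table(x)                         # named counts, one per level
counts_tab   <- tabulate(x, nbins = nlevels(x))  # bare integer counts, same order

# Both hold the same counts; tabulate() simply skips the names and class
same_counts <- all(as.vector(counts_table) == counts_tab)

# Map the winning position back to its level label
most_common <- levels(x)[which.max(counts_tab)]
print(most_common)
```

Because tabulate() returns counts in level order, indexing levels(x) with the position of the maximum recovers the label.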

This function can also report ties, i.e. cases where several values share the maximum count. Compare the following:

mySet <- c("A", "A", "A", "B", "B", "B", "C", "C")
## Your question indicates that you have factors,
##   but your sample code is a character vector
mySetF <- factor(mySet) ## Just as an example

## @BrodieG's answer
fun1 <- function(InVec) {
  names(which.max(table(InVec)))
}

## @sgibb's answer
fun2 <- function(InVec) {
  m <- which.max(table(as.character(InVec)))
  as.character(InVec)[m]
}

fun1(mySet)
# [1] "A"
fun2(mySet)
# [1] "A"
MaxTable(mySet)
# [1] "A"
MaxTable(mySet, mult = TRUE)
# [1] "A" "B"

library(microbenchmark)    
microbenchmark(fun1(mySet), fun2(mySet), MaxTable(mySet), MaxTable(mySetF))
# Unit: microseconds
#              expr     min       lq   median       uq      max neval
#       fun1(mySet) 291.457 297.1845 302.2080 313.1235 3008.108   100
#       fun2(mySet) 296.388 302.0775 311.3170 321.5260 1367.137   100
#   MaxTable(mySet) 172.463 180.8755 184.8355 189.9700 1947.700   100
#  MaxTable(mySetF)  34.510  38.1545  44.6045  46.6695   95.341   100

For small vectors, this function is more efficient, and the difference is even more pronounced with factor input. How about bigger vectors?

set.seed(1)
medSet <- sample(c(LETTERS, letters), 1e5, TRUE)
medSetF <- factor(medSet)

fun1(medSet)
# [1] "E"
fun2(medSet) ### Wrong Answer!!!
# [1] "D"
MaxTable(medSet)
# [1] "E"

microbenchmark(fun1(medSet), MaxTable(medSet), MaxTable(medSetF))
# Unit: microseconds
#               expr       min        lq     median        uq       max neval
#       fun1(medSet) 14222.846 14350.957 14484.4490 14600.490 34810.174   100
#   MaxTable(medSet)  7787.761  7860.248  7917.3455  8019.068  9762.884   100
#  MaxTable(medSetF)   501.733   529.257   570.0735   587.936  1469.994   100

I've dropped @sgibb's function (fun2()) from the benchmarks since it returns the wrong answer; it runs in about the same time as fun1().
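A small illustration of why fun2() can go wrong: which.max() on a table returns a position within the (sorted) table, not a position in the original vector, so using it to index the vector picks an unrelated element. The data below is hypothetical and chosen so the mismatch is visible.

```r
x <- c("z", "a", "z", "z")   # hypothetical data; "z" is the most frequent value

tab <- table(x)              # levels are sorted: a = 1, z = 3
m   <- which.max(tab)        # position 2 within the table, named "z"

wrong <- x[m]                # element 2 of the original vector: "a"
right <- names(m)            # label of the winning table entry: "z"
```

With the earlier mySet example, the winning level happens to sit at position 1 and the first element is also "A", which is why fun2() appeared to work there.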

One last benchmark:

set.seed(3)
bigSet <- sample(c(LETTERS, letters), 1e7, TRUE)
bigSetF <- factor(bigSet)
microbenchmark(fun1(bigSet), MaxTable(bigSet), MaxTable(bigSetF), times = 10)
# Unit: milliseconds
#               expr        min         lq     median         uq        max neval
#       fun1(bigSet) 1519.37503 1612.10290 1648.36473 1789.02965 1932.41073    10
#   MaxTable(bigSet)  782.01856  791.86408  834.35764  894.60535 1019.28747    10
#  MaxTable(bigSetF)   48.56459   48.76492   49.25444   49.93911   50.20404    10
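The gap between the character and factor timings suggests that most of the cost lies in the factor() conversion. If the same data is queried more than once, one option is to convert once and reuse the factor. The following is only a sketch under that assumption, with hypothetical data and an arbitrary seed:

```r
set.seed(42)                                    # arbitrary seed for this sketch
x  <- sample(c(LETTERS, letters), 1e5, TRUE)    # hypothetical data
xF <- factor(x)                                 # pay the conversion cost once

counts <- tabulate(xF, nbins = nlevels(xF))     # fast count on the integer codes

top_one  <- levels(xF)[which.max(counts)]       # first most frequent value
top_ties <- levels(xF)[counts == max(counts)]   # all values tied for the maximum
```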

Other suggestions

A variation on @sgibb's answer:

names(which.max(table(mySet)))
# [1] "abc"

repeated <- function(x) as(names(which.max(table(x))), mode(x))
repeated(a)

where a is a vector of either words or numbers.
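As a usage sketch for repeated(), the two vectors below are hypothetical examples, one of words and one of numbers. The call is written as methods::as() here only so the snippet does not depend on the methods package being attached.

```r
repeated <- function(x) methods::as(names(which.max(table(x))), mode(x))

words   <- c("abc", "abc", "def", "abc", "def", "efg", "abc")  # hypothetical input
numbers <- c(3, 7, 7, 1, 7)                                    # hypothetical input

top_word   <- repeated(words)    # stays a character string
top_number <- repeated(numbers)  # converted back to numeric
```

Note that mode() reports "numeric" for a factor, so for factor input it is probably safer to call as.character() on it first.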

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow